Journal articles on the topic 'Video question answering'

Listed below are the top 50 journal articles on the topic 'Video question answering.'

1

Lei, Chenyi, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. "Multi-Question Learning for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11328–35. http://dx.doi.org/10.1609/aaai.v34i07.6794.

Abstract:
Visual Question Answering (VQA) raises a great challenge for the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that there are usually multiple (either sequentially generated or not) questions for the target video in a VQA task, and the questions themselves have abundant semantic relations. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL learns from multiple questions jointly, together with their corresponding answers, for a target video sequence. The learned representations of video-question pairs are then more general and transfer better to new questions. We further propose an effective VQA framework and design a training procedure for MQL, where a specifically designed attention network models the relation between the input video and the corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show the favorable performance of the proposed MQL-VQA framework compared to state-of-the-art methods.
2

Ruwa, Nelson, Qirong Mao, Liangjun Wang, and Jianping Gou. "Affective question answering on video." Neurocomputing 363 (October 2019): 125–39. http://dx.doi.org/10.1016/j.neucom.2019.06.046.

3

Wang, Yueqian, Yuxuan Wang, Kai Chen, and Dongyan Zhao. "STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19215–23. http://dx.doi.org/10.1609/aaai.v38i17.29890.

Abstract:
Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR
4

Zong, Linlin, Jiahui Wan, Xianchao Zhang, Xinyue Liu, Wenxin Liang, and Bo Xu. "Video-Context Aligned Transformer for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19795–803. http://dx.doi.org/10.1609/aaai.v38i17.29954.

Abstract:
Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that the video contains richer instances and events beyond the scope of the stated question. Extremely imbalanced alignment of information from both sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are initially encoded into a shared semantic space. We apply contrastive learning to the global video token and the context token to enhance the semantic alignment. Then, the pooled context feature is utilized to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate the effectiveness of V-CAT on the MSVD-QA and MSRVTT-QA datasets, achieving state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
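Editorial note: as an illustration only (not code from the paper), the contrastive alignment between a pooled global video token and a context token described in this abstract can be sketched as a symmetric InfoNCE objective; every name, dimension, and the temperature value below is an assumption.

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_tokens, context_tokens, temperature=0.07):
    # video_tokens, context_tokens: (batch, dim) pooled representations of paired samples.
    v = F.normalize(video_tokens, dim=-1)
    c = F.normalize(context_tokens, dim=-1)
    logits = v @ c.t() / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage sketch with random stand-ins for pooled video and context features.
video, context = torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive_alignment_loss(video, context)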
5

Huang, Deng, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. "Location-Aware Graph Convolutional Networks for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11021–28. http://dx.doi.org/10.1609/aaai.v34i07.6737.

Abstract:
We addressed the challenging task of video question answering, which requires machines to answer questions about videos in natural language form. Previous state-of-the-art methods attempt to apply a spatio-temporal attention mechanism to video frame features without explicitly modeling the locations of, and relations among, the object interactions occurring in videos. However, the relations between object interactions and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of the video as a location-aware graph by incorporating the location information of an object into the graph construction. Here, each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action contents for better video question answering. Lastly, we leverage an attention mechanism to combine the output of graph convolution and encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA and MSVD-QA datasets.
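Editorial note: purely as a sketch of the general idea described in this abstract (not the authors' implementation), object nodes can be formed by concatenating appearance and bounding-box location features before a graph-convolution step; the dimensions, the soft adjacency, and all names are assumptions.

import torch
import torch.nn as nn

class LocationAwareGraphConv(nn.Module):
    def __init__(self, appearance_dim=2048, location_dim=4, hidden_dim=512):
        super().__init__()
        self.node_proj = nn.Linear(appearance_dim + location_dim, hidden_dim)
        self.message = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, appearance, boxes):
        # appearance: (num_objects, appearance_dim); boxes: (num_objects, 4) normalized coordinates.
        nodes = self.node_proj(torch.cat([appearance, boxes], dim=-1))
        adjacency = torch.softmax(nodes @ nodes.t(), dim=-1)  # soft, fully connected object graph
        return torch.relu(self.message(adjacency @ nodes))    # one round of message passing

layer = LocationAwareGraphConv()
out = layer(torch.randn(10, 2048), torch.rand(10, 4))  # -> (10, 512) location-aware node features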
6

Gao, Lianli, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, and Heng Tao Shen. "Structured Two-Stream Attention Network for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6391–98. http://dx.doi.org/10.1609/aaai.v33i01.33016391.

Abstract:
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 for the Action, Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
7

Kumar, Krishnamoorthi Magesh, and P. Valarmathie. "Domain and Intelligence Based Multimedia Question Answering System." International Journal of Evaluation and Research in Education (IJERE) 5, no. 3 (September 1, 2016): 227. http://dx.doi.org/10.11591/ijere.v5i3.4544.

Abstract:
Multimedia question answering systems have become very popular over the past few years. They allow users to share their thoughts by answering a given question or to obtain information from a set of answered questions. However, existing QA systems support only textual answers, which are not very instructive for many users. The user's discussion can be enhanced by adding suitable multimedia data, since multimedia answers offer intuitive information through more suitable images, voice and video. The system includes question and answer classification, query generation, and multimedia data selection and presentation. It takes all kinds of media, such as text, images, audio and video, which are combined with a textual answer, and it automatically collects information from the user to improve the answer. The method ranks candidate answers to select the best one. By processing a huge set of QA pairs and adding them to a database, the multimedia question answering approach finds multimedia answers by matching users' questions with those in the database. The effectiveness of the multimedia system is determined by the ranking of text, image, audio and video in the user's answer. The answer given by the user is processed by a semantic match algorithm, and the best answers are selected by a Naive Bayesian ranking system.
8

Xue, Hongyang, Zhou Zhao, and Deng Cai. "Unifying the Video and Question Attentions for Open-Ended Video Question Answering." IEEE Transactions on Image Processing 26, no. 12 (December 2017): 5656–66. http://dx.doi.org/10.1109/tip.2017.2746267.

9

Jang, Yunseok, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, and Gunhee Kim. "Video Question Answering with Spatio-Temporal Reasoning." International Journal of Computer Vision 127, no. 10 (June 18, 2019): 1385–412. http://dx.doi.org/10.1007/s11263-019-01189-x.

10

Zhuang, Yueting, Dejing Xu, Xin Yan, Wenzhuo Cheng, Zhou Zhao, Shiliang Pu, and Jun Xiao. "Multichannel Attention Refinement for Video Question Answering." ACM Transactions on Multimedia Computing, Communications, and Applications 16, no. 1s (April 28, 2020): 1–23. http://dx.doi.org/10.1145/3366710.

11

Garcia, Noa, Mayu Otani, Chenhui Chu, and Yuta Nakashima. "KnowIT VQA: Answering Knowledge-Based Questions about Videos." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 10826–34. http://dx.doi.org/10.1609/aaai.v34i07.6713.

Abstract:
We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require experience obtained from watching the series to be answered. Second, we propose a video understanding model that combines the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
12

Mao, Jianguo, Wenbin Jiang, Hong Liu, Xiangdong Wang, and Yajuan Lyu. "Inferential Knowledge-Enhanced Integrated Reasoning for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (June 26, 2023): 13380–88. http://dx.doi.org/10.1609/aaai.v37i11.26570.

Abstract:
Recently, video question answering has attracted growing attention. It involves answering a question based on a fine-grained understanding of video multi-modal information. Most existing methods have successfully explored the deep understanding of visual modality. We argue that a deep understanding of linguistic modality is also essential for answer reasoning, especially for videos that contain character dialogues. To this end, we propose an Inferential Knowledge-Enhanced Integrated Reasoning method. Our method consists of two main components: 1) an Inferential Knowledge Reasoner to generate inferential knowledge for linguistic modality inputs that reveals deeper semantics, including the implicit causes, effects, mental states, etc. 2) an Integrated Reasoning Mechanism to enhance video content understanding and answer reasoning by leveraging the generated inferential knowledge. Experimental results show that our method achieves significant improvement on two mainstream datasets. The ablation study further demonstrates the effectiveness of each component of our approach.
13

Jiang, Jianwen, Ziqiang Chen, Haojie Lin, Xibin Zhao, and Yue Gao. "Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11101–8. http://dx.doi.org/10.1609/aaai.v34i07.6766.

Abstract:
Understanding questions and finding clues for answers are the keys to video question answering. Compared with image question answering, video question answering (Video QA) requires finding the clues accurately on both the spatial and temporal dimensions simultaneously, and is thus more challenging. However, the relationship between spatio-temporal information and the question has still not been well utilized in most existing methods for Video QA. To tackle this problem, we propose a Question-Guided Spatio-Temporal Contextual Attention Network (QueST) method. In QueST, we divide the semantic features generated from the question into two separate parts, the spatial part and the temporal part, which respectively guide the process of constructing the contextual attention on the spatial and temporal dimensions. Under the guidance of the corresponding contextual attention, visual features can be better exploited on both spatial and temporal dimensions. To evaluate the effectiveness of the proposed method, experiments are conducted on the TGIF-QA, MSRVTT-QA and MSVD-QA datasets. Experimental results and comparisons with state-of-the-art methods show that our method achieves superior performance.
14

Yang, Saelyne, Sunghyun Park, Yunseok Jang, and Moontae Lee. "YTCommentQA: Video Question Answerability in Instructional Videos." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19359–67. http://dx.doi.org/10.1609/aaai.v38i17.29906.

Abstract:
Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.
15

Yu, Zhou, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. "ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9127–34. http://dx.doi.org/10.1609/aaai.v33i01.33019127.

Abstract:
Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets are small in scale and often automatically generated, among other limitations, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos.
16

Cherian, Anoop, Chiori Hori, Tim K. Marks, and Jonathan Le Roux. "(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (June 28, 2022): 444–53. http://dx.doi.org/10.1609/aaai.v36i1.19922.

Abstract:
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varied granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art.
17

Chu, Wenqing, Hongyang Xue, Zhou Zhao, Deng Cai, and Chengwei Yao. "The forgettable-watcher model for video question answering." Neurocomputing 314 (November 2018): 386–93. http://dx.doi.org/10.1016/j.neucom.2018.06.069.

18

Zhu, Linchao, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. "Uncovering the Temporal Context for Video Question Answering." International Journal of Computer Vision 124, no. 3 (July 13, 2017): 409–21. http://dx.doi.org/10.1007/s11263-017-1033-7.

19

Lee, Yue-Shi, Yu-Chieh Wu, and Jie-Chi Yang. "BVideoQA: Online English/Chinese bilingual video question answering." Journal of the American Society for Information Science and Technology 60, no. 3 (March 2009): 509–25. http://dx.doi.org/10.1002/asi.21002.

20

Xiao, Junbin, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. "Video as Conditional Graph Hierarchy for Multi-Granular Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 2804–12. http://dx.doi.org/10.1609/aaai.v36i3.20184.

Abstract:
Video question answering requires the models to understand and reason about both the complex video and language data to correctly derive the answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse the information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of the video and question contents, providing little insight into the problem of question answering and lacking interpretability as well. In this work, we argue that while video is presented as a frame sequence, the visual elements (e.g., objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite its simplicity, our extensive experiments demonstrate the superiority of such a conditional hierarchical graph architecture, with clear performance improvements over prior methods and also better generalization across different types of questions. Further analyses also demonstrate the model's reliability as it shows meaningful visual-textual evidence for the predicted answers.
21

Lee, Kyungjae, Nan Duan, Lei Ji, Jason Li, and Seung-won Hwang. "Segment-Then-Rank: Non-Factoid Question Answering on Instructional Videos." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8147–54. http://dx.doi.org/10.1609/aaai.v34i05.6327.

Abstract:
We study the problem of non-factoid QA on instructional videos. Existing work focuses either on the visual or the textual modality of video content to find matching answers to the question. However, neither is flexible enough for our problem setting of non-factoid answers with varying lengths. Motivated by this, we propose a two-stage model: (a) multimodal segmentation of video into span candidates and (b) length-adaptive ranking of the candidates to the question. First, for segmentation, we propose Segmenter for generating span candidates of diverse length, considering both textual and visual modality. Second, for ranking, we propose Ranker to score the candidates, dynamically combining the two models with complementary strengths for short and long spans, respectively. Experimental results demonstrate that our model achieves state-of-the-art performance.
22

Gao, Feng, Yuanyuan Ge, and Yongge Liu. "Remember and forget: video and text fusion for video question answering." Multimedia Tools and Applications 77, no. 22 (March 27, 2018): 29269–82. http://dx.doi.org/10.1007/s11042-018-5868-x.

23

Li, Zhangbin, Dan Guo, Jinxing Zhou, Jing Zhang, and Meng Wang. "Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (March 24, 2024): 3306–14. http://dx.doi.org/10.1609/aaai.v38i4.28116.

Abstract:
This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as "positivity". Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand "which objects are exactly relevant to the question" and "which are making sounds". Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.
24

Shao, Zhuang, Jiahui Wan, and Linlin Zong. "A Video Question Answering Model Based on Knowledge Distillation." Information 14, no. 6 (June 12, 2023): 328. http://dx.doi.org/10.3390/info14060328.

Abstract:
Video question answering (QA) is a cross-modal task that requires understanding the video content to answer questions. Current techniques address this challenge by employing stacked modules, such as attention mechanisms and graph convolutional networks. These methods reason about the semantics of video features and their interaction with text-based questions, yielding excellent results. However, these approaches often learn and fuse features representing different aspects of the video separately, neglecting the intra-interaction and overlooking the latent complex correlations between the extracted features. Additionally, the stacking of modules introduces a large number of parameters, making model training more challenging. To address these issues, we propose a novel multimodal knowledge distillation method that leverages the strengths of knowledge distillation for model compression and feature enhancement. Specifically, the fused features in the larger teacher model are distilled into knowledge, which guides the learning of appearance and motion features in the smaller student model. By incorporating cross-modal information in the early stages, the appearance and motion features can discover their related and complementary potential relationships, thus improving the overall model performance. Despite its simplicity, our extensive experiments on the widely used video QA datasets, MSVD-QA and MSRVTT-QA, demonstrate clear performance improvements over prior methods. These results validate the effectiveness of the proposed knowledge distillation approach.
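Editorial note: a minimal sketch of the kind of feature-level distillation the abstract describes, in which fused teacher features guide the student's appearance and motion features alongside the usual answer loss; the loss weighting, shapes, and all names are assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def distillation_objective(student_appearance, student_motion, teacher_fused,
                           answer_logits, answer_labels, alpha=0.5):
    # Feature-matching (MSE) terms distill the teacher's fused representation into the
    # student's appearance and motion streams; cross-entropy supervises the answer head.
    distill = F.mse_loss(student_appearance, teacher_fused) + F.mse_loss(student_motion, teacher_fused)
    answer = F.cross_entropy(answer_logits, answer_labels)
    return answer + alpha * distill

batch, dim, num_answers = 4, 256, 1000
loss = distillation_objective(torch.randn(batch, dim), torch.randn(batch, dim),
                              torch.randn(batch, dim), torch.randn(batch, num_answers),
                              torch.randint(0, num_answers, (batch,)))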
25

Li, Xiangpeng, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, and Chuang Gan. "Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8658–65. http://dx.doi.org/10.1609/aaai.v33i01.33018658.

Abstract:
Most of the recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite the success, these models are often time-consuming and have difficulties in modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in the machine translation task, we propose a Positional Self-Attention to calculate the response at each position by attending to all positions within the same sequence, and then add representations of absolute positions. Therefore, PSAC can exploit the global dependencies of the question and the temporal information in the video, and make the processes of question and video encoding execute in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize a co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains comparable results on the Count task. Our model requires less computation time and achieves better performance compared with the RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
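Editorial note: as a generic sketch of the core idea only (not the PSAC implementation), self-attention over a frame or word sequence with learned absolute position embeddings can be written as follows; the dimensions, head count, and names are assumptions.

import torch
import torch.nn as nn

class PositionalSelfAttention(nn.Module):
    def __init__(self, dim=512, max_len=64, num_heads=8):
        super().__init__()
        self.positions = nn.Embedding(max_len, dim)
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, sequence):
        # sequence: (batch, seq_len, dim) frame or word features, processed in parallel.
        index = torch.arange(sequence.size(1), device=sequence.device)
        sequence = sequence + self.positions(index)  # add absolute position information
        attended, _ = self.attention(sequence, sequence, sequence)
        return attended

module = PositionalSelfAttention()
output = module(torch.randn(2, 32, 512))  # -> (2, 32, 512)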
26

Jiang, Pin, and Yahong Han. "Reasoning with Heterogeneous Graph Alignment for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11109–16. http://dx.doi.org/10.1609/aaai.v34i07.6767.

Abstract:
The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process video and question separately, then feed the representations of the different modalities into subsequent late fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate correlations of both inter- and intra-modality in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study on the effectiveness of the network architecture. Experiments show the network to be superior in quality.
27

Gu, Mao, Zhou Zhao, Weike Jin, Richang Hong, and Fei Wu. "Graph-Based Multi-Interaction Network for Video Question Answering." IEEE Transactions on Image Processing 30 (2021): 2758–70. http://dx.doi.org/10.1109/tip.2021.3051756.

28

Wu, Yu-Chieh, and Jie-Chi Yang. "A Robust Passage Retrieval Algorithm for Video Question Answering." IEEE Transactions on Circuits and Systems for Video Technology 18, no. 10 (October 2008): 1411–21. http://dx.doi.org/10.1109/tcsvt.2008.2002831.

29

Wang, Weining, Yan Huang, and Liang Wang. "Long video question answering: A Matching-guided Attention Model." Pattern Recognition 102 (June 2020): 107248. http://dx.doi.org/10.1016/j.patcog.2020.107248.

30

Ye, Yunan, Shifeng Zhang, Yimeng Li, Xufeng Qian, Siliang Tang, Shiliang Pu, and Jun Xiao. "Video question answering via grounded cross-attention network learning." Information Processing & Management 57, no. 4 (July 2020): 102265. http://dx.doi.org/10.1016/j.ipm.2020.102265.

31

Zhang, Wenqiao, Siliang Tang, Yanpeng Cao, Shiliang Pu, Fei Wu, and Yueting Zhuang. "Frame Augmented Alternating Attention Network for Video Question Answering." IEEE Transactions on Multimedia 22, no. 4 (April 2020): 1032–41. http://dx.doi.org/10.1109/tmm.2019.2935678.

32

Zha, Zheng-Jun, Jiawei Liu, Tianhao Yang, and Yongdong Zhang. "Spatiotemporal-Textual Co-Attention Network for Video Question Answering." ACM Transactions on Multimedia Computing, Communications, and Applications 15, no. 2s (August 12, 2019): 1–18. http://dx.doi.org/10.1145/3320061.

33

Jiang, Yimin, Tingfei Yan, Mingze Yao, Huibing Wang, and Wenzhe Liu. "Cascade transformers with dynamic attention for video question answering." Computer Vision and Image Understanding 242 (May 2024): 103983. http://dx.doi.org/10.1016/j.cviu.2024.103983.

34

Jiao, Guie. "Realization of Video Question Answering System Based on Flash under RIA." Applied Mechanics and Materials 411-414 (September 2013): 970–73. http://dx.doi.org/10.4028/www.scientific.net/amm.411-414.970.

Abstract:
A video question answering system is designed and developed on the Flash platform with a B/S (Browser/Server) structure, intended to meet the teaching and studying needs of specific groups. This paper describes the core code of the real-time Q&A video module for video creation and playback, as well as the relevant implementation technology and the other functional modules of the system.
35

Liu, Mingyang, Ruomei Wang, Fan Zhou, and Ge Lin. "Temporally Multi-Modal Semantic Reasoning with Spatial Language Constraints for Video Question Answering." Symmetry 14, no. 6 (May 31, 2022): 1133. http://dx.doi.org/10.3390/sym14061133.

Abstract:
Video question answering (QA) aims to understand the video scene and underlying plot by answering video questions. An algorithm that can competently cope with this task needs to be able to: (1) collect multi-modal information scattered in the video frame sequence while extracting, interpreting, and utilizing the potential semantic clues provided by each piece of modal information in the video, (2) integrate the multi-modal context of the above semantic clues and understand the cause and effect of the story as it evolves, and (3) identify and integrate those temporally adjacent or non-adjacent effective semantic clues implied in the above context information to provide reasonable and sufficient visual semantic information for the final question reasoning. In response to the above requirements, a novel temporally multi-modal semantic reasoning with spatial language constraints video QA solution is reported in this paper, which includes a significant feature extraction module used to extract multi-modal features according to a significant sampling strategy, a spatial language constraints module used to recognize and reason spatial dimensions in video frames under the guidance of questions, and a temporal language interaction module used to locate the temporal dimension semantic clues of the appearance features and motion features sequence. Specifically, for a question, the result processed by the spatial language constraints module is to obtain visual clues related to the question from a single image and filter out unwanted spatial information. Further, the temporal language interaction module symmetrically integrates visual clues of the appearance information and motion information scattered throughout the temporal dimensions, obtains the temporally adjacent or non-adjacent effective semantic clue, and filters out irrelevant or detrimental context information. The proposed video QA solution is validated on several video QA benchmarks. Comprehensive ablation experiments have confirmed that modeling the significant video information can improve QA ability. The spatial language constraints module and temporal language interaction module can better collect and summarize visual semantic clues.
36

Jin, Yao, Guocheng Niu, Xinyan Xiao, Jian Zhang, Xi Peng, and Jun Yu. "Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 7 (June 26, 2023): 8141–49. http://dx.doi.org/10.1609/aaai.v37i7.25983.

Abstract:
Open-ended video question answering (open-ended VideoQA) aims to understand video content and question semantics to generate correct answers. Most of the best-performing models define the problem as a discriminative task of multi-label classification. In real-world scenarios, however, it is difficult to define a candidate set that includes all possible answers. In this paper, we propose a Knowledge-constrained Generative VideoQA Algorithm (KcGA) with an encoder-decoder pipeline, which enables out-of-domain answer generation through an adaptive external knowledge module and a multi-stream information control mechanism. We use ClipBERT to extract the video-question features, extract frame-wise object-level external knowledge from a commonsense knowledge base and compute context-aware episode memory units via an attention-based GRU to form the external knowledge features, and exploit the multi-stream information control mechanism to fuse video-question and external knowledge features such that semantic complementation and alignment are well achieved. We evaluate our model on two open-ended benchmark datasets to demonstrate that we can effectively and robustly generate high-quality answers without restrictions of training data.
37

Patel, Hardik Bhikhabhai, and Sailesh Suryanarayan Iyer. "Comparative Study of Multimedia Question Answering System Models." ECS Transactions 107, no. 1 (April 24, 2022): 2033–42. http://dx.doi.org/10.1149/10701.2033ecst.

Abstract:
Question answering systems are an automated way of finding the right answers to questions posed by humans in their own language. With the enormous amount of data returned by search engines such as Google and Yahoo, users are overloaded when trying to track down the right information. The QA approach addresses these issues: rather than returning a list of documents and links as current search engines do, a QA system provides information from a set of answers, together with suitable media data. Question answering techniques aim to exploit in-depth linguistic and media content analysis as well as domain knowledge to return exact answers to natural language questions. A QA framework has a list of parameters such as question size, m-dimension, technique count, answer cardinality, and answer range. This paper surveys the development of multimedia question answering research and its design parameters, and details its future perspectives. Multiple-line question and video question answering methods are discussed at length, and their performance is measured on the QA framework parameters.
38

Gao, Lianli, Yu Lei, Pengpeng Zeng, Jingkuan Song, Meng Wang, and Heng Tao Shen. "Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering." IEEE Transactions on Image Processing 31 (2022): 202–15. http://dx.doi.org/10.1109/tip.2021.3120867.

39

Yang, Zekun, Noa Garcia, Chenhui Chu, Mayu Otani, Yuta Nakashima, and Haruo Takemura. "A comparative study of language transformers for video question answering." Neurocomputing 445 (July 2021): 121–33. http://dx.doi.org/10.1016/j.neucom.2021.02.092.

40

Zhao, Zhou, Zhu Zhang, Shuwen Xiao, Zhenxin Xiao, Xiaohui Yan, Jun Yu, Deng Cai, and Fei Wu. "Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks." IEEE Transactions on Image Processing 28, no. 12 (December 2019): 5939–52. http://dx.doi.org/10.1109/tip.2019.2922062.

41

Yin, Chengxiang, Jian Tang, Zhiyuan Xu, and Yanzhi Wang. "Memory Augmented Deep Recurrent Neural Network for Video Question Answering." IEEE Transactions on Neural Networks and Learning Systems 31, no. 9 (September 2020): 3159–67. http://dx.doi.org/10.1109/tnnls.2019.2938015.

42

Wang, Zheng, Fangtao Li, Kaoru Ota, Mianxiong Dong, and Bin Wu. "ReGR: Relation-aware graph reasoning framework for video question answering." Information Processing & Management 60, no. 4 (July 2023): 103375. http://dx.doi.org/10.1016/j.ipm.2023.103375.

43

Al Mehmadi, Shima M., Yakoub Bazi, Mohamad M. Al Rahhal, and Mansour Zuair. "Learning to enhance areal video captioning with visual question answering." International Journal of Remote Sensing 45, no. 18 (August 30, 2024): 6395–407. http://dx.doi.org/10.1080/01431161.2024.2388875.

44

Zhuang, Xuqiang, Fang’ai Liu, Jian Hou, Jianhua Hao, and Xiaohong Cai. "Modality attention fusion model with hybrid multi-head self-attention for video understanding." PLOS ONE 17, no. 10 (October 6, 2022): e0275156. http://dx.doi.org/10.1371/journal.pone.0275156.

Abstract:
Video question answering (Video-QA) is a subject undergoing intense study in Artificial Intelligence, and it is one of the tasks that can evaluate such AI abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on the task of answering multiple-choice questions regarding a video-subtitle-QA representation by fusing attention and self-attention between each modality. We use BERT to extract text features, and use Faster R-CNN to extract visual features, to provide a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework for the attention fusion matrix from the different modalities (video, subtitles, QA), and use Hybrid Multi-head Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network and to demonstrate the effectiveness and advantages of our method over existing methods through question-type and required-modality experimental results.
45

Peng, Min, Chongyang Wang, Yu Shi, and Xiang-Dong Zhou. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 2038–46. http://dx.doi.org/10.1609/aaai.v37i2.25296.

Abstract:
This paper presents a new method for end-to-end Video Question Answering (VideoQA), aside from the current popularity of using large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer, a few convolutional and transformer layers. We use the anisotropic pyramid to fulfill video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, novel strategies are proposed to decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performances with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model that achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid. Code available at: https://github.com/Trunpm/PMT-AAAI23.
46

Kim, Seonhoon, Seohyeong Jeong, Eunbyul Kim, Inho Kang, and Nojun Kwak. "Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (May 18, 2021): 13171–79. http://dx.doi.org/10.1609/aaai.v35i14.17556.

Abstract:
Video Question Answering (VideoQA) requires fine-grained understanding of both video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering with a self-supervised pre-training stage and supervised contrastive learning in the main stage as auxiliary learning. In the self-supervised pre-training stage, we transform the original problem format of predicting the correct answer into one that predicts the relevant question, to provide the model with broader contextual inputs without any further dataset or annotation. For contrastive learning in the main stage, we add masking noise to the input corresponding to the ground-truth answer, and consider the original input of the ground-truth answer as a positive sample, while treating the rest as negative samples. By mapping the positive sample closer to the masked input, we show that the model performance is improved. We further employ locally aligned attention to focus more effectively on the video frames that are particularly relevant to the given corresponding subtitle sentences. We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA. Experimental results show that our model achieves state-of-the-art performance on all datasets. We also validate our approaches through further analyses.
47

Park, Gyu-Min, A.-Yeong Kim, and Seong-Bae Park. "Confident Multiple Choice Learning-based Ensemble Model for Video Question-Answering." Journal of KIISE 49, no. 4 (April 30, 2022): 284–90. http://dx.doi.org/10.5626/jok.2022.49.4.284.

48

Liu, Yun, Xiaoming Zhang, Feiran Huang, Bo Zhang, and Zhoujun Li. "Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering." IEEE Transactions on Image Processing 31 (2022): 1684–96. http://dx.doi.org/10.1109/tip.2022.3142526.

49

Zhao, Zhou, Zhu Zhang, Xinghua Jiang, and Deng Cai. "Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks." IEEE Transactions on Image Processing 28, no. 8 (August 2019): 3860–72. http://dx.doi.org/10.1109/tip.2019.2902106.

50

Yu, Ting, Jun Yu, Zhou Yu, and Dacheng Tao. "Compositional Attention Networks With Two-Stream Fusion for Video Question Answering." IEEE Transactions on Image Processing 29 (2020): 1204–18. http://dx.doi.org/10.1109/tip.2019.2940677.
