Selected scientific literature on the topic "Video question answering"
Create an accurate reference in APA, MLA, Chicago, Harvard, and other styles
Consult the list of current articles, books, theses, conference proceedings, and other scholarly sources relevant to the topic "Video question answering".
Journal articles on the topic "Video question answering"
Lei, Chenyi, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. "Multi-Question Learning for Visual Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11328–35. http://dx.doi.org/10.1609/aaai.v34i07.6794.
Ruwa, Nelson, Qirong Mao, Liangjun Wang, and Jianping Gou. "Affective question answering on video". Neurocomputing 363 (October 2019): 125–39. http://dx.doi.org/10.1016/j.neucom.2019.06.046.
Wang, Yueqian, Yuxuan Wang, Kai Chen, and Dongyan Zhao. "STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19215–23. http://dx.doi.org/10.1609/aaai.v38i17.29890.
Zong, Linlin, Jiahui Wan, Xianchao Zhang, Xinyue Liu, Wenxin Liang, and Bo Xu. "Video-Context Aligned Transformer for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19795–803. http://dx.doi.org/10.1609/aaai.v38i17.29954.
Huang, Deng, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. "Location-Aware Graph Convolutional Networks for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11021–28. http://dx.doi.org/10.1609/aaai.v34i07.6737.
Gao, Lianli, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, and Heng Tao Shen. "Structured Two-Stream Attention Network for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6391–98. http://dx.doi.org/10.1609/aaai.v33i01.33016391.
Kumar, Krishnamoorthi Magesh, and P. Valarmathie. "Domain and Intelligence Based Multimedia Question Answering System". International Journal of Evaluation and Research in Education (IJERE) 5, no. 3 (September 1, 2016): 227. http://dx.doi.org/10.11591/ijere.v5i3.4544.
Xue, Hongyang, Zhou Zhao, and Deng Cai. "Unifying the Video and Question Attentions for Open-Ended Video Question Answering". IEEE Transactions on Image Processing 26, no. 12 (December 2017): 5656–66. http://dx.doi.org/10.1109/tip.2017.2746267.
Jang, Yunseok, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, and Gunhee Kim. "Video Question Answering with Spatio-Temporal Reasoning". International Journal of Computer Vision 127, no. 10 (June 18, 2019): 1385–412. http://dx.doi.org/10.1007/s11263-019-01189-x.
Zhuang, Yueting, Dejing Xu, Xin Yan, Wenzhuo Cheng, Zhou Zhao, Shiliang Pu, and Jun Xiao. "Multichannel Attention Refinement for Video Question Answering". ACM Transactions on Multimedia Computing, Communications, and Applications 16, no. 1s (April 28, 2020): 1–23. http://dx.doi.org/10.1145/3366710.
Theses / dissertations on the topic "Video question answering"
Engin, Deniz. "Video question answering with limited supervision". Electronic Thesis or Diss., Université de Rennes (2023-....), 2024. http://www.theses.fr/2024URENS016.
Video content has grown significantly in volume and diversity in the digital era, and this expansion has highlighted the need for advanced video understanding technologies. Driven by this need, this thesis explores semantic video understanding, leveraging multiple perceptual modes similar to human cognitive processes, and efficient learning with limited supervision similar to human learning capabilities. The thesis focuses specifically on video question answering as one of the main video understanding tasks. Our first contribution addresses long-range video question answering, which requires understanding extended video content. While recent approaches rely on human-generated external sources, we process raw data to generate video summaries. Our next contribution explores zero-shot and few-shot video question answering, aiming to enable efficient learning from limited data. We leverage the knowledge of existing large-scale models by eliminating the challenges of adapting pre-trained models to limited data. We demonstrate that these contributions significantly enhance the capabilities of multimodal video question-answering systems, particularly where human-annotated labeled data is limited or unavailable.
Chowdhury, Muhammad Iqbal Hasan. "Question-answering on image/video content". Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/205096/1/Muhammad%20Iqbal%20Hasan_Chowdhury_Thesis.pdf.
Zeng, Kuo-Hao (曾國豪). "Video titling and Question-Answering". Thesis, National Tsing Hua University (國立清華大學), Department of Electrical Engineering (電機工程學系所), 2017. http://ndltd.ncl.edu.tw/handle/a3a6sw.
Video titling and question answering are two important tasks toward high-level visual data understanding. To address these two tasks, we propose a large-scale dataset and demonstrate several models on it in this work. A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it is much less addressed than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. A large number of sentences would be required to learn the sentence structure of titles, so we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos. For the video question-answering task, we propose to learn a deep model that answers free-form natural language questions about the contents of a video. A program automatically harvests a large number of videos and descriptions freely available online, and a large number of candidate QA pairs are then generated automatically from these descriptions rather than annotated manually. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN, VQA, SA, and SS. To handle imperfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training.
To demonstrate our idea, we collected a large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then utilize an automatic QA generator to produce a large number of QA pairs for training and collect manually generated QA pairs from Amazon Mechanical Turk. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the video question answering results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
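The self-paced learning procedure described in the abstract above (iteratively identifying imperfect, automatically generated QA pairs and mitigating their effect during training) follows a generic pattern. The sketch below is not the thesis's implementation; it illustrates hard self-paced selection on a toy regression analogue, where a few deliberately corrupted samples stand in for noisy QA pairs and samples whose loss exceeds a growing threshold are excluded from each refit:

```python
import numpy as np

# Toy data: y = 2x, with every 10th sample corrupted (analogous to
# imperfect automatically generated QA pairs in the training set).
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x
y[::10] += 5.0

def fit(x, y, w):
    """Weighted least-squares slope for a line through the origin."""
    return np.sum(w * x * y) / np.sum(w * x * x)

# Self-paced loop: start with a small loss threshold lam so only
# "easy", well-fit samples participate, then grow lam each round.
lam, growth = 0.5, 2.0
w = np.ones_like(x)
slope = fit(x, y, w)
for _ in range(5):
    losses = (y - slope * x) ** 2
    w = (losses < lam).astype(float)  # hard self-paced weights: keep easy samples
    if w.sum() == 0:                  # guard against selecting nothing
        w = np.ones_like(x)
    slope = fit(x, y, w)
    lam *= growth

# The corrupted samples keep a large residual and stay excluded,
# so the final fit recovers the clean slope of 2.0.
```

The design choice mirrored here is that the model itself decides which examples are trustworthy: examples it already fits well are kept, while high-loss (likely mislabeled) ones are down-weighted, and the threshold is relaxed as training progresses.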
Xu, Huijuan. "Vision and language understanding with localized evidence". Thesis, 2018. https://hdl.handle.net/2144/34790.
Books on the topic "Video question answering"
McCallum, Richard. Evangelical Christian Responses to Islam. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781350418240.
Walker, Stephen. Digital Mediation. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781526525772.
Book chapters on the topic "Video question answering"
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Video Question Answering". In Visual Question Answering, 119–33. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_8.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Video Representation Learning". In Visual Question Answering, 111–17. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_7.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Advanced Models for Video Question Answering". In Visual Question Answering, 135–43. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_9.
Gao, Lei, Guangda Li, Yan-Tao Zheng, Richang Hong, and Tat-Seng Chua. "Video Reference: A Video Question Answering Engine". In Lecture Notes in Computer Science, 799–801. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-11301-7_92.
Xiao, Junbin, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. "Video Graph Transformer for Video Question Answering". In Lecture Notes in Computer Science, 39–58. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_3.
Piergiovanni, AJ, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, and Anelia Angelova. "Video Question Answering with Iterative Video-Text Co-tokenization". In Lecture Notes in Computer Science, 76–94. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_5.
Chen, Xuanwei, Rui Liu, Xiaomeng Song, and Yahong Han. "Locating Visual Explanations for Video Question Answering". In MultiMedia Modeling, 290–302. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-67832-6_24.
Gupta, Pranay, and Manish Gupta. "NewsKVQA: Knowledge-Aware News Video Question Answering". In Advances in Knowledge Discovery and Data Mining, 3–15. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-05981-0_1.
Ge, Yuanyuan, Youjiang Xu, and Yahong Han. "Video Question Answering Using a Forget Memory Network". In Communications in Computer and Information Science, 404–15. Singapore: Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-7299-4_33.
Gao, Kun, Xianglei Zhu, and Yahong Han. "Initialized Frame Attention Networks for Video Question Answering". In Communications in Computer and Information Science, 349–59. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-10-8530-7_34.
Conference papers on the topic "Video question answering"
Zhao, Wentian, Seokhwan Kim, Ning Xu, and Hailin Jin. "Video Question Answering on Screencast Tutorials". In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/148.
Jenni, Kommineni, M. Srinivas, Roshni Sannapu, and Murukessan Perumal. "CSA-BERT: Video Question Answering". In 2023 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2023. http://dx.doi.org/10.1109/ssp53291.2023.10207954.
Li, Hao, Peng Jin, Zesen Cheng, Songyang Zhang, Kai Chen, Zhennan Wang, Chang Liu, and Jie Chen. "TG-VQA: Ternary Game of Video Question Answering". In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/116.
Zhao, Zhou, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. "Video Question Answering via Hierarchical Spatio-Temporal Attention Networks". In Twenty-Sixth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2017. http://dx.doi.org/10.24963/ijcai.2017/492.
Chao, Guan-Lin, Abhinav Rastogi, Semih Yavuz, Dilek Hakkani-Tur, Jindong Chen, and Ian Lane. "Learning Question-Guided Video Representation for Multi-Turn Video Question Answering". In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/w19-5926.
Bhalerao, Mandar, Shlok Gujar, Aditya Bhave, and Anant V. Nimkar. "Visual Question Answering Using Video Clips". In 2019 IEEE Bombay Section Signature Conference (IBSSC). IEEE, 2019. http://dx.doi.org/10.1109/ibssc47189.2019.8973090.
Yang, Zekun, Noa Garcia, Chenhui Chu, Mayu Otani, Yuta Nakashima, and Haruo Takemura. "BERT Representations for Video Question Answering". In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020. http://dx.doi.org/10.1109/wacv45572.2020.9093596.
Li, Yicong, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. "Invariant Grounding for Video Question Answering". In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.00294.
Fang, Jiannan, Lingling Sun, and Yaqi Wang. "Video question answering by frame attention". In Eleventh International Conference on Digital Image Processing, edited by Xudong Jiang and Jenq-Neng Hwang. SPIE, 2019. http://dx.doi.org/10.1117/12.2539615.
Lei, Jie, Licheng Yu, Mohit Bansal, and Tamara Berg. "TVQA: Localized, Compositional Video Question Answering". In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018. http://dx.doi.org/10.18653/v1/d18-1167.