Selected scientific literature on the topic "Video question answering"

Below is a list of current articles, books, theses, conference papers and other scholarly sources on the topic "Video question answering".

Journal articles on the topic "Video question answering"

1. Lei, Chenyi, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. "Multi-Question Learning for Visual Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11328–35. http://dx.doi.org/10.1609/aaai.v34i07.6794.

Abstract:
Visual Question Answering (VQA) poses a great challenge for the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that there are usually multiple (either sequentially generated or not) questions for the target video in a VQA task, and the questions themselves have abundant semantic relations. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL learns jointly from multiple questions and their corresponding answers for a target video sequence. The learned representations of video-question pairs are then more general and transfer better to new questions. We further propose an effective VQA framework and design a training procedure for MQL, where a specifically designed attention network models the relation between the input video and the corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show the favorable performance of the proposed MQL-VQA framework compared to state-of-the-art methods.
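
To make the multi-question idea concrete, here is a minimal, hypothetical sketch (not the authors' released model) of co-training a shared question-guided attention module on several questions about the same video in one step; all shapes, layer sizes and names are assumptions:

```python
# Illustrative sketch only: one video encoder shared across all questions
# posed about the same clip, so the video-question attention is co-trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQuestionVQA(nn.Module):
    def __init__(self, dim=256, vocab=1000, n_answers=500):
        super().__init__()
        self.q_embed = nn.Embedding(vocab, dim)
        self.q_rnn = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, frames, questions):
        # frames: (1, T, dim) features of one video; questions: (Q, L) token ids
        _, q = self.q_rnn(self.q_embed(questions))       # final states, (1, Q, dim)
        q = q.transpose(0, 1)                            # (Q, 1, dim)
        v = frames.expand(questions.size(0), -1, -1)     # same video for every question
        ctx, _ = self.attn(q, v, v)                      # question-guided video context
        return self.classifier(ctx.squeeze(1))           # (Q, n_answers)

model = MultiQuestionVQA()
frames = torch.randn(1, 20, 256)               # one video, 20 frame features
questions = torch.randint(0, 1000, (3, 12))    # three questions about that video
answers = torch.randint(0, 500, (3,))
loss = F.cross_entropy(model(frames, questions), answers)  # all pairs co-trained at once
loss.backward()
```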

2. Ruwa, Nelson, Qirong Mao, Liangjun Wang, and Jianping Gou. "Affective question answering on video". Neurocomputing 363 (October 2019): 125–39. http://dx.doi.org/10.1016/j.neucom.2019.06.046.

3. Wang, Yueqian, Yuxuan Wang, Kai Chen, and Dongyan Zhao. "STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19215–23. http://dx.doi.org/10.1609/aaai.v38i17.29890.

Abstract:
Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR
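
As a rough illustration of the neural-module-network idea described above (not the released STAIR code), the toy sketch below hard-codes a two-step "program" for a counting question, where each module returns a typed, auditable intermediate result rather than an attention map; module names and the program are invented:

```python
# Hypothetical sketch: a program generator would emit this module sequence for
# "How many times does the person open the door?".
import torch
import torch.nn as nn

class TemporalLocalize(nn.Module):
    """Scores each frame for relevance to a text query embedding."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames, query):          # frames: (T, dim), query: (dim,)
        scores = frames @ self.proj(query)     # (T,) relevance scores
        return scores.softmax(dim=0)           # interpretable per-frame weights

class CountEvents(nn.Module):
    """Turns frame relevance weights into a soft event count."""
    def forward(self, frame_weights, threshold=0.2):
        return (frame_weights > threshold).float().sum()

dim = 128
frames = torch.randn(30, dim)                  # pre-extracted frame features
query = torch.randn(dim)                       # embedding of "open the door"
localize, count = TemporalLocalize(dim), CountEvents()

weights = localize(frames, query)              # auditable intermediate result
answer = count(weights)                        # final (soft) count
print(float(answer))
```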

4. Zong, Linlin, Jiahui Wan, Xianchao Zhang, Xinyue Liu, Wenxin Liang, and Bo Xu. "Video-Context Aligned Transformer for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19795–803. http://dx.doi.org/10.1609/aaai.v38i17.29954.

Abstract:
Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that the video contains richer instances and events beyond the scope of the stated question. Extremely imbalanced alignment of information from both sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are first encoded into a shared semantic space. We apply contrastive learning to the global video token and the context token to enhance semantic alignment. Then, the pooled context feature is utilized to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate the effectiveness of V-CAT on the MSVD-QA and MSRVTT-QA datasets, achieving state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
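
The contrastive step between the global video token and the context token can be sketched as a standard symmetric InfoNCE loss; the loss form below is an illustrative assumption, not the paper's exact objective:

```python
# Sketch: pull each video's global token toward its own context token and push
# it away from the other samples in the batch.
import torch
import torch.nn.functional as F

def video_context_contrastive(video_tok, context_tok, temperature=0.07):
    # video_tok, context_tok: (B, dim) pooled tokens in the shared semantic space
    v = F.normalize(video_tok, dim=-1)
    c = F.normalize(context_tok, dim=-1)
    logits = v @ c.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0))          # diagonal pairs are the matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = video_context_contrastive(torch.randn(8, 256, requires_grad=True),
                                 torch.randn(8, 256, requires_grad=True))
loss.backward()
```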

5. Huang, Deng, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, and Chuang Gan. "Location-Aware Graph Convolutional Networks for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11021–28. http://dx.doi.org/10.1609/aaai.v34i07.6737.

Abstract:
We address the challenging task of video question answering, which requires machines to answer questions about videos in natural language form. Previous state-of-the-art methods apply spatio-temporal attention mechanisms to video frame features without explicitly modeling the locations of, and relations among, the object interactions that occur in videos. However, the relations between object interactions and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of the video as a location-aware graph by incorporating the location information of each object into the graph construction. Here, each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action content for better video question answering. Lastly, we leverage an attention mechanism to combine the output of the graph convolution and the encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA and MSVD-QA datasets.
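
A minimal sketch of the location-aware graph idea, assuming concatenated appearance and normalized box features per object node and a similarity-based adjacency (a simplification, not the published architecture):

```python
# Each node mixes appearance and location; one graph-convolution step then
# propagates information between interacting objects.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareGCN(nn.Module):
    def __init__(self, app_dim=512, loc_dim=4, hid=256):
        super().__init__()
        self.node_proj = nn.Linear(app_dim + loc_dim, hid)
        self.gcn = nn.Linear(hid, hid)

    def forward(self, appearance, boxes):
        # appearance: (N, app_dim) object features; boxes: (N, 4) normalized x1,y1,x2,y2
        nodes = self.node_proj(torch.cat([appearance, boxes], dim=-1))          # (N, hid)
        adj = F.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)      # soft edges
        return F.relu(self.gcn(adj @ nodes))       # neighbourhood-aggregated node features

gcn = LocationAwareGCN()
out = gcn(torch.randn(12, 512), torch.rand(12, 4))   # 12 detected objects in a clip
print(out.shape)                                      # torch.Size([12, 256])
```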

6. Gao, Lianli, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, and Heng Tao Shen. "Structured Two-Stream Attention Network for Video Question Answering". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6391–98. http://dx.doi.org/10.1609/aaai.v33i01.33016391.

Abstract:
To date, visual question answering (VQA) (i.e., image QA and video QA) remains a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to reason jointly across both the spatial and long-range temporal structures of a video as well as the text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representation and infers the answer. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 on the Action, Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans. and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
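
A hedged sketch of the two-stream attention pattern the abstract describes, with one stream attending from the question over video segments and one from the video over question words before fusion; the pooling and fusion choices here are assumptions:

```python
# Illustrative two-stream attention: the two context vectors are concatenated
# and classified into an answer vocabulary.
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    def __init__(self, dim=256, n_answers=1000):
        super().__init__()
        self.q_over_v = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.v_over_q = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, n_answers)

    def forward(self, segments, words):
        # segments: (B, S, dim) video segment features; words: (B, L, dim) question features
        q_ctx, _ = self.q_over_v(words.mean(1, keepdim=True), segments, segments)
        v_ctx, _ = self.v_over_q(segments.mean(1, keepdim=True), words, words)
        return self.fuse(torch.cat([q_ctx, v_ctx], dim=-1).squeeze(1))

model = TwoStreamAttention()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 12, 256))
print(logits.shape)   # torch.Size([2, 1000])
```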

7. Kumar, Krishnamoorthi Magesh, and P. Valarmathie. "Domain and Intelligence Based Multimedia Question Answering System". International Journal of Evaluation and Research in Education (IJERE) 5, no. 3 (September 1, 2016): 227. http://dx.doi.org/10.11591/ijere.v5i3.4544.

Abstract:
Multimedia question answering systems have become very popular over the past few years. They allow users to share their thoughts by answering a given question or to obtain information from a set of answered questions. However, existing QA systems support only textual answers, which are not very instructive for many users. Discussions can be enriched by adding suitable multimedia data: multimedia answers offer intuitive information with appropriate images, audio and video. The system described here comprises question and answer classification, query generation, and multimedia data selection and presentation. It takes all kinds of media, such as text, images, audio and video, and combines them with a textual answer, and it automatically collects information from users to improve the answer. The method ranks candidate answers in order to select the best one. By processing a large set of QA pairs and adding them to a database, the multimedia question answering approach finds multimedia answers by matching user questions against those stored in the database. The effectiveness of the multimedia system is determined by the ranking of text, image, audio and video in users' answers: answers supplied by users are processed with a semantic-match algorithm, and the best answers are selected by a Naive Bayesian ranking system.
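
The retrieval step (matching an incoming question against a database of answered QA pairs and returning the best-ranked answer with its media) can be illustrated with a toy, purely lexical matcher; a real system would use semantic matching and a learned ranker, as the abstract states:

```python
# Toy retrieval sketch: rank stored QA pairs by word overlap with the query and
# return the best match together with its attached media file.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

qa_database = [
    {"question": "how do volcanoes erupt", "answer": "Magma rises ...", "media": "volcano.mp4"},
    {"question": "why is the sky blue", "answer": "Rayleigh scattering ...", "media": "sky.png"},
]

def answer(query):
    ranked = sorted(qa_database, key=lambda qa: jaccard(query, qa["question"]), reverse=True)
    return ranked[0]            # best-matching stored QA pair

print(answer("why does the sky look blue"))
```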

8. Xue, Hongyang, Zhou Zhao, and Deng Cai. "Unifying the Video and Question Attentions for Open-Ended Video Question Answering". IEEE Transactions on Image Processing 26, no. 12 (December 2017): 5656–66. http://dx.doi.org/10.1109/tip.2017.2746267.

9. Jang, Yunseok, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, and Gunhee Kim. "Video Question Answering with Spatio-Temporal Reasoning". International Journal of Computer Vision 127, no. 10 (June 18, 2019): 1385–412. http://dx.doi.org/10.1007/s11263-019-01189-x.

10. Zhuang, Yueting, Dejing Xu, Xin Yan, Wenzhuo Cheng, Zhou Zhao, Shiliang Pu, and Jun Xiao. "Multichannel Attention Refinement for Video Question Answering". ACM Transactions on Multimedia Computing, Communications, and Applications 16, no. 1s (April 28, 2020): 1–23. http://dx.doi.org/10.1145/3366710.

Theses on the topic "Video question answering"

1. Engin, Deniz. "Video question answering with limited supervision". Electronic Thesis or Diss., Université de Rennes (2023-....), 2024. http://www.theses.fr/2024URENS016.

Abstract:
Video content has significantly increased in volume and diversity in the digital era, and this expansion has highlighted the necessity for advanced video understanding technologies. Driven by this necessity, this thesis explores semantic video understanding, leveraging multiple perceptual modes similar to human cognitive processes and efficient learning with limited supervision similar to human learning capabilities. The thesis focuses specifically on video question answering as one of the main video understanding tasks. Our first contribution addresses long-range video question answering, which requires an understanding of extended video content. While recent approaches rely on human-generated external sources, we process raw data to generate video summaries. Our next contribution explores zero-shot and few-shot video question answering, aiming to enhance efficient learning from limited data. We leverage the knowledge of existing large-scale models while eliminating the challenges of adapting pre-trained models to limited data. We demonstrate that these contributions significantly enhance the capabilities of multimodal video question-answering systems where human-annotated labeled data is limited or unavailable.

2. Chowdhury, Muhammad Iqbal Hasan. "Question-answering on image/video content". Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/205096/1/Muhammad%20Iqbal%20Hasan_Chowdhury_Thesis.pdf.

Abstract:
This thesis explores a computer's ability to understand multimodal data, where the correspondence between image/video content and natural language text is utilised to answer open-ended natural language questions through question-answering tasks. Static image data consisting of both indoor and outdoor scenes, where complex textual questions are arbitrarily posed to a machine to generate correct answers, was examined. Dynamic videos consisting of both single-camera and multi-camera settings were also considered for the exploration of more challenging and unconstrained question-answering tasks. In exploring these challenges, new deep learning processes were developed to improve a computer's ability to understand and consider multimodal data.

3. Zeng, Kuo-Hao (曾國豪). "Video titling and Question-Answering". Master's thesis, Department of Electrical Engineering, National Tsing Hua University, 2017. http://ndltd.ncl.edu.tw/handle/a3a6sw.

Abstract:
Video titling and question answering are two important tasks toward high-level visual data understanding. To address these two tasks, we propose a large-scale dataset and demonstrate several models on it. A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it is much less addressed than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences might be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos. For the video question-answering task, we propose to learn a deep model that answers a free-form natural language question about the contents of a video. We build a program that automatically harvests a large number of videos and descriptions freely available online. A large number of candidate QA pairs are then automatically generated from the descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN, VQA, SA, and SS. In order to handle imperfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training. To demonstrate our idea, we collected a large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then utilize an automatic QA generator to produce a large number of QA pairs for training and collect manually generated QA pairs from Amazon Mechanical Turk. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the results of video question answering show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
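
The self-paced handling of noisy, automatically generated QA pairs can be sketched as loss-based example weighting with a growing pace parameter; this is an assumed simplification of the procedure, not the thesis code:

```python
# Sketch: examples whose current loss is below a threshold get weight 1, the
# rest are ignored; the threshold grows each round so harder or noisier QA
# pairs are admitted gradually.
import torch

def self_paced_weights(per_example_loss, threshold):
    # hard binary weights; soft variants interpolate instead
    return (per_example_loss < threshold).float()

losses = torch.tensor([0.2, 1.5, 0.4, 3.0])          # per-QA-pair losses from the current model
for epoch, lam in enumerate([0.5, 1.0, 2.0, 4.0]):    # growing pace parameter
    w = self_paced_weights(losses, lam)
    weighted_loss = (w * losses).sum() / w.sum().clamp(min=1.0)
    print(epoch, w.tolist(), float(weighted_loss))
```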

4. Xu, Huijuan. "Vision and language understanding with localized evidence". Thesis, 2018. https://hdl.handle.net/2144/34790.

Abstract:
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning compared to a class label. This derives the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
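
The anchor-based temporal detection mentioned in this abstract can be illustrated by decoding (center, length) offsets predicted relative to fixed temporal anchors; the parameterization below is a common convention and an assumption here, not necessarily the one used in the thesis:

```python
# Sketch: the network would predict an offset per anchor plus an objectness
# score; here we only decode offsets into start/end times.
import torch

def decode_proposals(anchors, offsets):
    # anchors, offsets: (N, 2) as (center, length); offsets are (dc, dl)
    centers = anchors[:, 0] + offsets[:, 0] * anchors[:, 1]   # shift scaled by anchor length
    lengths = anchors[:, 1] * torch.exp(offsets[:, 1])        # length rescaled
    starts, ends = centers - lengths / 2, centers + lengths / 2
    return torch.stack([starts, ends], dim=1)                 # (N, 2) segments in seconds

anchors = torch.tensor([[5.0, 4.0], [15.0, 8.0]])      # two anchors on a 30 s video
offsets = torch.tensor([[0.1, 0.2], [-0.05, -0.3]])    # regressed by the model
print(decode_proposals(anchors, offsets))
```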

Books on the topic "Video question answering"

1. McCallum, Richard. Evangelical Christian Responses to Islam. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781350418240.

Abstract:
Do Christians and Muslims worship the same God? Who was Muhammad? How does the Israeli-Palestinian conflict affect Christian-Muslim relations? This is a book about Evangelical Christians and how they are answering challenging questions about Islam. Drawing on over 300 texts published by Evangelicals in the first two decades of the twenty-first century, this book explores what the Evangelical micro-public sphere has to say about key issues in Christian-Muslim relations today. From the books they write, the blogs they post and the videos they make, it is clear that Evangelical Christians profoundly disagree with one another when discussing Islam. Answers to the questions range from seeing Muslims as the enemy posing an existential threat to Christians, through to welcoming them as good neighbours or even as close cousins.

2. Walker, Stephen. Digital Mediation. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781526525772.

Abstract:
Digital mediation is here to stay, but how do mediators, advisers and clients achieve the same results from digital mediations as they do from face to face mediations? Do new skills and mindsets need to be learnt? Can you build rapport online? Can you read emotions? How do you market online? How do you decide whether it’s the right choice for your dispute? How does digital mediation fit into the world of the Digital Justice System and mandatory mediation? Answering these questions and many more, this is the only book to focus on mediation as opposed to other means of Online Dispute Resolution such as arbitration. This title:
- Includes checklists and templates written by a mediator who has conducted over 280 digital mediations
- Covers topics including smart systems, ‘smart settle’, the use of artificial intelligence, ChatGPT and mixed media mediations
- Teaches mediators, advisers and clients the different skills and mindsets essential to success in the world of digital mediation
- Shows how to market mediation online with practical guidance on websites, videos, blogs and podcasts
This book is essential reading for all mediators wishing to adapt to the new norm of digital mediation. This title is included in Bloomsbury Professional's Mediation online service.

Book chapters on the topic "Video question answering"

1. Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Video Question Answering". In Visual Question Answering, 119–33. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_8.

2. Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Video Representation Learning". In Visual Question Answering, 111–17. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_7.

3. Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Advanced Models for Video Question Answering". In Visual Question Answering, 135–43. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_9.

4. Gao, Lei, Guangda Li, Yan-Tao Zheng, Richang Hong, and Tat-Seng Chua. "Video Reference: A Video Question Answering Engine". In Lecture Notes in Computer Science, 799–801. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-11301-7_92.

5. Xiao, Junbin, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. "Video Graph Transformer for Video Question Answering". In Lecture Notes in Computer Science, 39–58. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_3.

6. Piergiovanni, AJ, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, and Anelia Angelova. "Video Question Answering with Iterative Video-Text Co-tokenization". In Lecture Notes in Computer Science, 76–94. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_5.

7. Chen, Xuanwei, Rui Liu, Xiaomeng Song, and Yahong Han. "Locating Visual Explanations for Video Question Answering". In MultiMedia Modeling, 290–302. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-67832-6_24.

8. Gupta, Pranay, and Manish Gupta. "NewsKVQA: Knowledge-Aware News Video Question Answering". In Advances in Knowledge Discovery and Data Mining, 3–15. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-05981-0_1.

9. Ge, Yuanyuan, Youjiang Xu, and Yahong Han. "Video Question Answering Using a Forget Memory Network". In Communications in Computer and Information Science, 404–15. Singapore: Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-7299-4_33.

10. Gao, Kun, Xianglei Zhu, and Yahong Han. "Initialized Frame Attention Networks for Video Question Answering". In Communications in Computer and Information Science, 349–59. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-10-8530-7_34.

Conference papers on the topic "Video question answering"

1. Zhao, Wentian, Seokhwan Kim, Ning Xu, and Hailin Jin. "Video Question Answering on Screencast Tutorials". In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/148.

Abstract:
This paper presents a new video question answering task on screencast tutorials. We introduce a dataset including question, answer and context triples from tutorial videos for a software product. Unlike other video question answering works, all the answers in our dataset are grounded to the domain knowledge base. A one-shot recognition algorithm is designed to extract visual cues, which helps enhance the performance of video question answering. We also propose several baseline neural network architectures based on various aspects of video contexts from the dataset. The experimental results demonstrate that our proposed models significantly improve question answering performance by incorporating multi-modal contexts and domain knowledge.

2. Jenni, Kommineni, M. Srinivas, Roshni Sannapu, and Murukessan Perumal. "CSA-BERT: Video Question Answering". In 2023 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2023. http://dx.doi.org/10.1109/ssp53291.2023.10207954.

3. Li, Hao, Peng Jin, Zesen Cheng, Songyang Zhang, Kai Chen, Zhennan Wang, Chang Liu, and Jie Chen. "TG-VQA: Ternary Game of Video Question Answering". In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/116.

Abstract:
Video question answering aims at answering a question about the video content by reasoning about the alignment semantics between them. However, because they rely heavily on human instructions, i.e., annotations or priors, current contrastive-learning-based VideoQA methods still struggle to perform fine-grained visual-linguistic alignment. In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies, e.g., video, question, and answer as ternary players, to achieve fine-grained alignment for the VideoQA task. Specifically, we carefully design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate fine-grained visual-linguistic alignment labels without label-intensive effort. Our TG-VQA outperforms the existing state-of-the-art by a large margin (more than 5%) on long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges well on limited data (10^4 videos), surpassing most models pre-trained on large-scale data (10^7 videos).

4. Zhao, Zhou, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. "Video Question Answering via Hierarchical Spatio-Temporal Attention Networks". In Twenty-Sixth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2017. http://dx.doi.org/10.24963/ijcai.2017/492.

Abstract:
Open-ended video question answering is a challenging problem in visual information retrieval, which requires automatically generating a natural language answer from the referenced video content according to the question. However, existing visual question answering works focus only on static images, and they may be ineffective when applied to video question answering due to the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video content according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset, and extensive experiments show the effectiveness of our method.
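
A rough sketch, under assumed shapes and module choices, of a hierarchical question-guided attention encoder (frames within clips, then clips) feeding a small GRU decoder for open-ended answers; it is illustrative only, not the paper's network:

```python
# Hierarchical attention sketch: frame-level attention inside each clip, then
# clip-level attention, then a GRU decoder conditioned on the pooled video vector.
import torch
import torch.nn as nn

class HierarchicalSTAttention(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.clip_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, video, question, answer_in):
        # video: (B, C, F, dim) clips x frames; question: (B, 1, dim); answer_in: (B, L, dim)
        B, C, F_, D = video.shape
        q = question.repeat_interleave(C, dim=0)                       # (B*C, 1, dim)
        frames = video.reshape(B * C, F_, D)
        clip_vecs, _ = self.frame_attn(q, frames, frames)              # attend within each clip
        clip_vecs = clip_vecs.reshape(B, C, D)
        video_vec, _ = self.clip_attn(question, clip_vecs, clip_vecs)  # attend across clips
        h0 = video_vec.transpose(0, 1).contiguous()                    # (1, B, dim) initial state
        dec, _ = self.decoder(answer_in, h0)
        return self.out(dec)                                           # (B, L, vocab) word logits

m = HierarchicalSTAttention()
logits = m(torch.randn(2, 4, 8, 256), torch.randn(2, 1, 256), torch.randn(2, 5, 256))
print(logits.shape)   # torch.Size([2, 5, 1000])
```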

5. Chao, Guan-Lin, Abhinav Rastogi, Semih Yavuz, Dilek Hakkani-Tur, Jindong Chen, and Ian Lane. "Learning Question-Guided Video Representation for Multi-Turn Video Question Answering". In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/w19-5926.

6. Bhalerao, Mandar, Shlok Gujar, Aditya Bhave, and Anant V. Nimkar. "Visual Question Answering Using Video Clips". In 2019 IEEE Bombay Section Signature Conference (IBSSC). IEEE, 2019. http://dx.doi.org/10.1109/ibssc47189.2019.8973090.

7. Yang, Zekun, Noa Garcia, Chenhui Chu, Mayu Otani, Yuta Nakashima, and Haruo Takemura. "BERT Representations for Video Question Answering". In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2020. http://dx.doi.org/10.1109/wacv45572.2020.9093596.

8. Li, Yicong, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. "Invariant Grounding for Video Question Answering". In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.00294.

9. Fang, Jiannan, Lingling Sun, and Yaqi Wang. "Video question answering by frame attention". In Eleventh International Conference on Digital Image Processing, edited by Xudong Jiang and Jenq-Neng Hwang. SPIE, 2019. http://dx.doi.org/10.1117/12.2539615.

10. Lei, Jie, Licheng Yu, Mohit Bansal, and Tamara Berg. "TVQA: Localized, Compositional Video Question Answering". In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018. http://dx.doi.org/10.18653/v1/d18-1167.