Academic literature on the topic 'Multimodal embedding and retrieval'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Multimodal embedding and retrieval.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Multimodal embedding and retrieval"

1

Kim, Donghyun, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. "MULE: Multimodal Universal Language Embedding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11254–61. http://dx.doi.org/10.1609/aaai.v34i07.6785.

Abstract:
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling …
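
The core idea in this abstract, a single language embedding shared across languages and aligned with visual features, can be pictured with a short sketch. The sketch below is not the authors' code: the per-language embedding tables, the shared projection, the mean pooling, and the symmetric InfoNCE loss are all illustrative assumptions.

```python
# Minimal sketch, NOT the authors' implementation: per-language word embeddings
# feed a single shared projection, so sentences from every language land in one
# "universal" space that is then aligned with image features contrastively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLanguageEmbedder(nn.Module):
    def __init__(self, vocab_sizes, word_dim=300, joint_dim=512):
        super().__init__()
        # One embedding table per language, one shared projection for all of them.
        self.word_emb = nn.ModuleDict(
            {lang: nn.Embedding(size, word_dim) for lang, size in vocab_sizes.items()})
        self.shared_proj = nn.Sequential(
            nn.Linear(word_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim))

    def forward(self, lang, token_ids):
        words = self.word_emb[lang](token_ids)      # (batch, seq_len, word_dim)
        sentence = words.mean(dim=1)                # crude mean pooling
        return F.normalize(self.shared_proj(sentence), dim=-1)

def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Matching (text, image) pairs share an index within the batch.
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: two languages; random vectors stand in for CNN image features.
model = SharedLanguageEmbedder({"en": 1000, "de": 1200})
text_emb = model("en", torch.randint(0, 1000, (8, 12)))
image_emb = F.normalize(torch.randn(8, 512), dim=-1)
symmetric_contrastive_loss(text_emb, image_emb).backward()
```
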
2

Kim, Jongseok, Youngjae Yu, Hoeseong Kim, and Gunhee Kim. "Dual Compositional Learning in Interactive Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (2021): 1771–79. http://dx.doi.org/10.1609/aaai.v35i2.16271.

Abstract:
We present an approach named Dual Composition Network (DCNet) for interactive image retrieval that searches for the best target image for a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query to be as close to the embedding of the target image as possible. We refer this approach as Composition Network. In this work, we propose to close the loop with Correction Network that models the difference between the reference and target image in the embedding space and matches …
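
As a rough illustration of the composition idea described above (fusing a reference-image feature with a text-query feature so the result lands near the target image's embedding), here is a minimal sketch. It is not the paper's DCNet: the fusion MLP, the extra head standing in for the "correction" direction, and all dimensions are assumptions.

```python
# Illustrative sketch only, not the paper's DCNet: a fusion MLP composes a
# reference-image feature with a text-query feature so the result can be matched
# against target-image embeddings; a second head loosely echoes the idea of
# modelling the reference-to-target difference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.difference = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_img_emb, text_emb):
        joint = torch.cat([ref_img_emb, text_emb], dim=-1)
        composed = F.normalize(self.compose(joint), dim=-1)  # should land near the target embedding
        delta = self.difference(joint)                       # a full system would train this against (target - reference)
        return composed, delta

def rank_gallery(composed, gallery_emb):
    # Rank candidate target images by cosine similarity to the composed query.
    sims = composed @ F.normalize(gallery_emb, dim=-1).t()
    return sims.argsort(dim=-1, descending=True)

model = CompositionSketch()
composed, delta = model(torch.randn(4, 512), torch.randn(4, 512))
ranking = rank_gallery(composed, torch.randn(100, 512))   # (4, 100) indices, best match first
```
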
3

Wang, Di, Xinbo Gao, Xiumei Wang, Lihuo He, and Bo Yuan. "Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval." IEEE Transactions on Image Processing 25, no. 10 (2016): 4540–54. http://dx.doi.org/10.1109/tip.2016.2592800.

4

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves …
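
The retrieval setup described in this abstract, both modalities mapped to one space with caption-from-image and image-from-caption retrieval by nearest neighbour, can be sketched as follows. The encoders are replaced by placeholder tensors, and the recall@k routine is a common evaluation convention assumed here, not taken from the paper.

```python
# Sketch of the retrieval protocol: paired image/caption embeddings in one space,
# nearest-neighbour search by cosine similarity, scored with recall@k. The random
# tensors stand in for trained encoders; nothing here is the paper's model.
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k=5):
    # Ground truth: query i is paired with gallery item i.
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices
    hits = (topk == torch.arange(sims.size(0)).unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

image_emb = torch.randn(1000, 512)
caption_emb = image_emb + 0.5 * torch.randn(1000, 512)   # noisy stand-in for paired captions

print("image -> caption R@5:", recall_at_k(image_emb, caption_emb))
print("caption -> image R@5:", recall_at_k(caption_emb, image_emb))
```
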
5

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Abstract:
In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in a form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation between the query image is close to that of query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural …
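
A quick way to picture analogy-based retrieval in a shared word-image space is the classic vector-offset query, sketched below. This illustrates only the task setup; the paper's quadruple-network training is not reproduced, and the assumption that word and image embeddings already share one space is exactly what such a model must learn.

```python
# Illustration of the task only: assuming word and image embeddings already share
# one space, the vector-offset heuristic turns "cat":"dog"::cat_image:? into a
# nearest-neighbour query over candidate image embeddings.
import torch
import torch.nn.functional as F

def analogy_retrieve(word_a, word_b, image_a, gallery_emb):
    query = F.normalize(image_a + (word_b - word_a), dim=-1)
    sims = query @ F.normalize(gallery_emb, dim=-1).t()
    return sims.argsort(descending=True)        # gallery indices, best match first

dim = 300
cat_vec, dog_vec = torch.randn(dim), torch.randn(dim)   # word embeddings
cat_on_bench = torch.randn(dim)                          # query image embedding
gallery = torch.randn(500, dim)                          # candidate image embeddings
print(analogy_retrieve(cat_vec, dog_vec, cat_on_bench, gallery)[:5])
```
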
6

Qi, Jidong. "Neurophysiological and psychophysical references for trends in supervised VQA multimodal deep learning: An interdisciplinary meta-analysis." Applied and Computational Engineering 30, no. 1 (2024): 189–201. http://dx.doi.org/10.54254/2755-2721/30/20230096.

Abstract:
Leading trends in multimodal deep learning for visual-question answering include Multimodal joint-embedding model, multimodal attention-based model, and multimodal external knowledge-based model. Several mechanisms and strategies are used in these models, including representation fusion methods, co-attention mechanisms, and knowledge base retrieval mechanisms. While a variety of works have comprehensively reviewed these strategies, a key gap in research is that there is no interdisciplinary analysis that connects these mechanisms with discoveries on human. As discussions of Neuro-AI continues …
7

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract:
Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modality data. It is challenging for not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as common semantic space, and using generative adversarial network (GAN) to capture the underlying multimodal data structures, as …
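
The common-semantic-space idea mentioned in this abstract (class-label word embeddings as the shared space for zero-shot cross-modal retrieval) can be sketched in a few lines. The sketch omits the GAN component entirely; the projection heads, the cosine alignment loss, and all feature dimensions are illustrative assumptions.

```python
# Sketch of the common-space idea only (the adversarial part of the abstract is
# omitted): project image and text features into the word-embedding space of
# class labels, so unseen classes stay retrievable via their label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    def __init__(self, in_dim, class_dim=300):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, class_dim))

    def forward(self, features):
        return F.normalize(self.proj(features), dim=-1)

def class_alignment_loss(emb, class_emb):
    # Pull each projected sample toward its class-label word embedding.
    return 1.0 - F.cosine_similarity(emb, class_emb, dim=-1).mean()

img_proj, txt_proj = ModalityProjector(2048), ModalityProjector(768)
img_feat, txt_feat = torch.randn(16, 2048), torch.randn(16, 768)
label_emb = F.normalize(torch.randn(16, 300), dim=-1)   # word embeddings of each sample's class label

loss = class_alignment_loss(img_proj(img_feat), label_emb) + \
       class_alignment_loss(txt_proj(txt_feat), label_emb)
loss.backward()

# Zero-shot, cross-modal: rank text items for each image query in the shared space.
ranking = (img_proj(img_feat) @ txt_proj(txt_feat).t()).argsort(dim=-1, descending=True)
```
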
8

Mithun, Niluthpol C., Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. "Joint embeddings with multimodal cues for video-text retrieval." International Journal of Multimedia Information Retrieval 8, no. 1 (2019): 3–18. http://dx.doi.org/10.1007/s13735-018-00166-3.

9

Yang, Bang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, and Yuexian Zou. "Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (2024): 6458–66. http://dx.doi.org/10.1609/aaai.v38i6.28466.

Abstract:
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting …
10

Xu, Tong, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu, and Enhong Chen. "Socializing the Videos: A Multimodal Approach for Social Relation Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1 (2021): 1–23. http://dx.doi.org/10.1145/3416493.

Abstract:
As a crucial task for video analysis, social relation recognition for characters not only provides semantically rich description of video content but also supports intelligent applications, e.g., video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. At the same time, the development of social media platforms has now promoted the emergence of crowdsourced comments, which may enhance the recognition task with semantic and descriptive cues. To that …