Journal articles on the topic 'Multimodal embedding and retrieval'


Consult the top 50 journal articles for your research on the topic 'Multimodal embedding and retrieval.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Kim, Donghyun, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. "MULE: Multimodal Universal Language Embedding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11254–61. http://dx.doi.org/10.1609/aaai.v34i07.6785.

Abstract:
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.
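
For readers who want a concrete picture of the shared-language-branch idea summarized above, the following PyTorch sketch pairs one text encoder (used for every language) with an image-feature projection and a bidirectional triplet ranking loss. It is an illustration only, not the authors' MULE implementation; the vocabulary size, encoder choice, and loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLanguageEncoder(nn.Module):
    """One shared text branch used for captions in every language (illustrative)."""
    def __init__(self, vocab_size=20000, embed_dim=300, joint_dim=512):
        super().__init__()
        # A single embedding table covers the merged multilingual vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, joint_dim, batch_first=True)

    def forward(self, token_ids):
        hidden, _ = self.gru(self.embed(token_ids))
        # Mean-pool over time and L2-normalize into the joint space.
        return F.normalize(hidden.mean(dim=1), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed CNN features into the same joint space."""
    def __init__(self, feat_dim=2048, joint_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, joint_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def triplet_ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge loss with in-batch negatives."""
    scores = img @ txt.t()                 # cosine similarities (batch x batch)
    pos = scores.diag().unsqueeze(1)       # similarity of each matching pair
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.mean() + cost_img.mean()

# Captions in any language go through the same text branch.
text_enc, image_enc = SharedLanguageEncoder(), ImageEncoder()
tokens = torch.randint(1, 20000, (8, 12))   # a toy batch of multilingual captions
feats = torch.randn(8, 2048)                # matching image features
loss = triplet_ranking_loss(image_enc(feats), text_enc(tokens))
loss.backward()
```
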
2

Kim, Jongseok, Youngjae Yu, Hoeseong Kim, and Gunhee Kim. "Dual Compositional Learning in Interactive Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1771–79. http://dx.doi.org/10.1609/aaai.v35i2.16271.

Abstract:
We present an approach named Dual Composition Network (DCNet) for interactive image retrieval that searches for the best target image for a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query that is as close to the embedding of the target image as possible. We refer to this approach as the Composition Network. In this work, we propose to close the loop with a Correction Network that models the difference between the reference and target image in the embedding space and matches it with the embedding of the text query. That is, we consider two cyclic directional mappings for triplets of (reference image, text query, target image) by using both the Composition Network and the Correction Network. We also propose a joint training loss that can further improve the robustness of multimodal representation learning. We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that DCNet achieves new state-of-the-art performance on all three datasets, and that the addition of the Correction Network consistently improves multiple existing methods that are solely based on the Composition Network. Moreover, an ensemble of our model won first place in the Fashion-IQ 2020 challenge held at a CVPR 2020 workshop.
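
The composition/correction pairing described in this abstract can be sketched as two small networks trained with in-batch hinge losses. This is an illustrative reconstruction, not the authors' DCNet code; the layer sizes and the loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionNet(nn.Module):
    """Fuses (reference image, text query) to approximate the target embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_img, text):
        return F.normalize(self.fc(torch.cat([ref_img, text], dim=-1)), dim=-1)

class CorrectionNet(nn.Module):
    """Maps the (target - reference) difference so it matches the text embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_img, tgt_img):
        return F.normalize(self.fc(tgt_img - ref_img), dim=-1)

def batch_hinge(query, gallery, margin=0.2):
    """Rank the matching gallery item above in-batch negatives."""
    scores = query @ gallery.t()
    pos = scores.diag().unsqueeze(1)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return (margin + scores - pos).clamp(min=0).masked_fill(mask, 0).mean()

comp, corr = CompositionNet(), CorrectionNet()
ref, tgt, txt = (F.normalize(torch.randn(16, 512), dim=-1) for _ in range(3))

# Two cyclic directions over the (reference image, text query, target image) triplet.
loss = batch_hinge(comp(ref, txt), tgt) + batch_hinge(corr(ref, tgt), txt)
loss.backward()
```
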
3

Wang, Di, Xinbo Gao, Xiumei Wang, Lihuo He, and Bo Yuan. "Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval." IEEE Transactions on Image Processing 25, no. 10 (October 2016): 4540–54. http://dx.doi.org/10.1109/tip.2016.2592800.

4

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (July 2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.
5

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (November 20, 2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Abstract:
In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in the form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving the desired images given these tuples can be seen as the task of finding images whose relation to the query image is close to the relation between the query words. One way to achieve this is to build a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called the multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.
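
Once words and images live in one analogical embedding space, the retrieval step itself reduces to vector arithmetic plus nearest-neighbour search. The sketch below illustrates only that step with random stand-in embeddings; it is not the authors' multimodal siamese network.

```python
import numpy as np

def analogy_retrieve(word_a, word_b, query_image_vec, word_vecs, image_vecs, k=5):
    """Answer a:b :: query_image:? by nearest-neighbour search in the joint space.

    word_vecs : dict mapping words to vectors in the joint space.
    image_vecs: (N, d) matrix of L2-normalized image embeddings (the gallery).
    """
    target = query_image_vec + (word_vecs[word_b] - word_vecs[word_a])
    target /= np.linalg.norm(target) + 1e-12
    scores = image_vecs @ target            # cosine similarity to every gallery image
    return np.argsort(-scores)[:k]          # indices of the k best matches

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
d = 128
word_vecs = {w: rng.standard_normal(d) for w in ("cat", "dog")}
image_vecs = rng.standard_normal((1000, d))
image_vecs /= np.linalg.norm(image_vecs, axis=1, keepdims=True)
query = image_vecs[42]                      # e.g., a cat sitting on a bench
print(analogy_retrieve("cat", "dog", query, word_vecs, image_vecs))
```
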
6

Qi, Jidong. "Neurophysiological and psychophysical references for trends in supervised VQA multimodal deep learning: An interdisciplinary meta-analysis." Applied and Computational Engineering 30, no. 1 (January 22, 2024): 189–201. http://dx.doi.org/10.54254/2755-2721/30/20230096.

Abstract:
Leading trends in multimodal deep learning for visual question answering (VQA) include multimodal joint-embedding models, multimodal attention-based models, and multimodal external knowledge-based models. Several mechanisms and strategies are used in these models, including representation fusion methods, co-attention mechanisms, and knowledge base retrieval mechanisms. While a variety of works have comprehensively reviewed these strategies, a key gap in research is that there is no interdisciplinary analysis that connects these mechanisms with discoveries on humans. As discussions of Neuro-AI continue to thrive, it is important to consider synergies between human-level investigations and ANNs, specifically for using AI to reproduce higher-order cognitive functions such as multisensory integration. Thus, the present meta-analysis aims at reviewing and connecting neurophysiological and psychophysical references to trends in VQA multimodal deep learning, focusing on 1) providing supporting explanations for why several strategies in VQA MMDL lead to performance that is closer to human level, and 2) using VQA MMDL as an example to demonstrate how an interdisciplinary perspective may foster the development of human-level AI. The result of the meta-analysis builds connections between several sub-fields: joint embedding mechanisms and SC neurons, multimodal attention mechanisms and the retro-cue effect, and external knowledge bases and engram mechanisms.
7

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract:
Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modalities. It is challenging due to not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class embeddings) as a common semantic space, and using a generative adversarial network (GAN) to capture the underlying multimodal data structures as well as to strengthen the relations between input data and the semantic space in order to generalize across seen and unseen classes. In this paper, we propose a novel method termed Learning Cross-Aligned Latent Embeddings (LCALE) as an alternative to these GAN-based methods for ZS-CMR. Unlike methods using class embeddings as the semantic space, our method seeks a shared low-dimensional latent space of input multimodal features and class embeddings via modality-specific variational autoencoders. Notably, we align the distributions learned from multimodal input features and from class embeddings to construct latent embeddings that contain the essential cross-modal correlation associated with unseen classes. Effective cross-reconstruction and cross-alignment criteria are further developed to preserve class-discriminative information in the latent space, which benefits retrieval efficiency and enables knowledge transfer to unseen classes. We evaluate our model using four benchmark datasets on image-text retrieval tasks and one large-scale dataset on an image-sketch retrieval task. The experimental results show that our method establishes new state-of-the-art performance for both tasks on all datasets.
8

Mithun, Niluthpol C., Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. "Joint embeddings with multimodal cues for video-text retrieval." International Journal of Multimedia Information Retrieval 8, no. 1 (January 12, 2019): 3–18. http://dx.doi.org/10.1007/s13735-018-00166-3.

9

Yang, Bang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, and Yuexian Zou. "Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6458–66. http://dx.doi.org/10.1609/aaai.v38i6.28466.

Abstract:
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM.
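
The "train only the token embeddings" idea described above can be sketched with the Hugging Face CLIP implementation by freezing every parameter except the text token embedding table. This is not the authors' CLL-CLIP code; the checkpoint name, the name-based parameter filter, and the toy contrastive step are assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything, then unfreeze only the text token embedding table.
for name, param in model.named_parameters():
    param.requires_grad = "token_embedding" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # should list only the text token embedding weight

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One illustrative contrastive step on a toy batch of multilingual captions.
inputs = tokenizer(["ein Hund im Park", "un gato en la mesa"],
                   padding=True, return_tensors="pt")
text_emb = model.get_text_features(**inputs)
image_emb = torch.randn_like(text_emb)  # stand-in for model.get_image_features(...)
logits = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
labels = torch.arange(logits.size(0))
loss = F.cross_entropy(logits, labels)
loss.backward()       # gradients flow only into the token embeddings
optimizer.step()
```
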
10

Xu, Tong, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu, and Enhong Chen. "Socializing the Videos: A Multimodal Approach for Social Relation Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1 (April 16, 2021): 1–23. http://dx.doi.org/10.1145/3416493.

Abstract:
As a crucial task for video analysis, social relation recognition for characters not only provides semantically rich description of video content but also supports intelligent applications, e.g., video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. At the same time, the development of social media platforms has now promoted the emergence of crowdsourced comments, which may enhance the recognition task with semantic and descriptive cues. To that end, in this article, we propose a novel multimodal-based solution to deal with the character relation recognition task. Specifically, we capture the target character pairs via a search module and then design a multistream architecture for jointly embedding the visual and textual information, in which feature fusion and attention mechanism are adapted for better integrating the multimodal inputs. Finally, supervised learning is applied to classify character relations. Experiments on real-world data sets validate that our solution outperforms several competitive baselines.
11

Xu, Xing, Jialin Tian, Kaiyi Lin, Huimin Lu, Jie Shao, and Heng Tao Shen. "Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1s (March 31, 2021): 1–17. http://dx.doi.org/10.1145/3424341.

Abstract:
Conventional cross-modal retrieval models mainly assume the same scope of classes for both the training set and the testing set. This assumption limits their extensibility to zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint from the seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few recently proposed approaches are inspired by zero-shot learning: they estimate the distribution underlying multimodal data with generative models and transfer knowledge from seen classes to unseen classes by leveraging class embeddings. However, directly borrowing the idea from zero-shot learning (ZSL) is not fully suited to the retrieval task, since the core of the retrieval task is learning the common space. To address the above issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strengths of the AutoEncoder (AE) and the Generative Adversarial Network (GAN) to jointly incorporate common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Besides, instead of utilizing class embeddings as the common space, the AAEGAN approach maps all multimodal data into a learned latent space with distribution alignment via three coupled AEs. We empirically show a remarkable improvement on the ZS-CMR task and establish state-of-the-art or competitive performance on four image-text retrieval datasets.
12

Peng, Min, Chongyang Wang, Yu Shi, and Xiang-Dong Zhou. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 2038–46. http://dx.doi.org/10.1609/aaai.v37i2.25296.

Abstract:
This paper presents a new method for end-to-end Video Question Answering (VideoQA), aside from the current popularity of using large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a few convolutional and transformer layers. We use the anisotropic pyramid to fulfill video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, novel strategies are proposed to decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid. Code is available at: https://github.com/Trunpm/PMT-AAAI23.
13

Khan, Arijit. "Knowledge Graphs Querying." ACM SIGMOD Record 52, no. 2 (August 10, 2023): 18–29. http://dx.doi.org/10.1145/3615952.3615956.

Abstract:
Knowledge graphs (KGs) such as DBpedia, Freebase, YAGO, Wikidata, and NELL were constructed to store large-scale, real-world facts as (subject, predicate, object) triples - that can also be modeled as a graph, where a node (a subject or an object) represents an entity with attributes, and a directed edge (a predicate) is a relationship between two entities. Querying KGs is critical in web search, question answering (QA), semantic search, personal assistants, fact checking, and recommendation. While significant progress has been made on KG construction and curation, thanks to deep learning recently we have seen a surge of research on KG querying and QA. The objectives of our survey are two-fold. First, research on KG querying has been conducted by several communities, such as databases, data mining, semantic web, machine learning, information retrieval, and natural language processing (NLP), with different focus and terminologies; and also in diverse topics ranging from graph databases, query languages, join algorithms, graph patterns matching, to more sophisticated KG embedding and natural language questions (NLQs). We aim at uniting different interdisciplinary topics and concepts that have been developed for KG querying. Second, many recent advances on KG and query embedding, multimodal KG, and KG-QA come from deep learning, IR, NLP, and computer vision domains. We identify important challenges of KG querying that received less attention by graph databases, and by the DB community in general, e.g., incomplete KG, semantic matching, multimodal data, and NLQs. We conclude by discussing interesting opportunities for the data management community, for instance, KG as a unified data model and vector-based query processing.
14

Chen, Weijia, Zhijun Lu, Lijue You, Lingling Zhou, Jie Xu, and Ken Chen. "Artificial Intelligence–Based Multimodal Risk Assessment Model for Surgical Site Infection (AMRAMS): Development and Validation Study." JMIR Medical Informatics 8, no. 6 (June 15, 2020): e18186. http://dx.doi.org/10.2196/18186.

Abstract:
Background: Surgical site infection (SSI) is one of the most common types of health care–associated infections. It increases mortality, prolongs hospital length of stay, and raises health care costs. Many institutions developed risk assessment models for SSI to help surgeons preoperatively identify high-risk patients and guide clinical intervention. However, most of these models had low accuracies.

Objective: We aimed to provide a solution in the form of an Artificial intelligence–based Multimodal Risk Assessment Model for Surgical site infection (AMRAMS) for inpatients undergoing operations, using routinely collected clinical data. We internally and externally validated the discriminations of the models, which combined various machine learning and natural language processing techniques, and compared them with the National Nosocomial Infections Surveillance (NNIS) risk index.

Methods: We retrieved inpatient records between January 1, 2014, and June 30, 2019, from the electronic medical record (EMR) system of Rui Jin Hospital, Luwan Branch, Shanghai, China. We used data from before July 1, 2018, as the development set for internal validation and the remaining data as the test set for external validation. We included patient demographics, preoperative lab results, and free-text preoperative notes as our features. We used word-embedding techniques to encode text information, and we trained the LASSO (least absolute shrinkage and selection operator) model, random forest model, gradient boosting decision tree (GBDT) model, convolutional neural network (CNN) model, and self-attention network model using the combined data. Surgeons manually scored the NNIS risk index values.

Results: For internal bootstrapping validation, CNN yielded the highest mean area under the receiver operating characteristic curve (AUROC) of 0.889 (95% CI 0.886-0.892), and the paired-sample t test revealed statistically significant advantages as compared with other models (P<.001). The self-attention network yielded the second-highest mean AUROC of 0.882 (95% CI 0.878-0.886), but the AUROC was only numerically higher than the AUROC of the third-best model, GBDT with text embeddings (mean AUROC 0.881, 95% CI 0.878-0.884, P=.47). The AUROCs of LASSO, random forest, and GBDT models using text embeddings were statistically higher than the AUROCs of models not using text embeddings (P<.001). For external validation, the self-attention network yielded the highest AUROC of 0.879. CNN was the second-best model (AUROC 0.878), and GBDT with text embeddings was the third-best model (AUROC 0.872). The NNIS risk index scored by surgeons had an AUROC of 0.651.

Conclusions: Our AMRAMS based on EMR data and deep learning methods (CNN and self-attention network) had significant advantages in terms of accuracy compared with other conventional machine learning methods and the NNIS risk index. Moreover, the semantic embeddings of preoperative notes improved the model performance further. Our models could replace the NNIS risk index to provide personalized guidance for the preoperative intervention of SSIs. Through this case, we offered an easy-to-implement solution for building multimodal RAMs for other similar scenarios.
15

Romberg, Stefan, Rainer Lienhart, and Eva Hörster. "Multimodal Image Retrieval." International Journal of Multimedia Information Retrieval 1, no. 1 (March 7, 2012): 31–44. http://dx.doi.org/10.1007/s13735-012-0006-4.

16

Zou, Zhuo. "Performance analysis of using multimodal embedding and word embedding transferred to sentiment classification." Applied and Computational Engineering 5, no. 1 (June 14, 2023): 417–22. http://dx.doi.org/10.54254/2755-2721/5/20230610.

Abstract:
Multimodal machine learning is one of artificial intelligence's most important research topics. Contrastive Language-Image Pretraining (CLIP) is one application of multimodal machine learning and is widely applied in computer vision. However, there is a research gap in applying CLIP to natural language processing. Therefore, based on IMDB, this paper applies the multimodal features of CLIP and three other pre-trained word vectors, GloVe, Word2vec, and BERT, to compare their effects on the sentiment classification task, in order to test the performance of CLIP's multimodal features when transferred to natural language processing. The results show that the multimodal features of CLIP do not produce a significant advantage on sentiment classification, and the other embeddings achieve better results. The highest accuracy is produced by BERT, while the word embedding of CLIP yields the lowest of the four accuracies; GloVe and Word2vec are relatively close to each other. The reason may be that the pre-trained CLIP model learns SOTA image representations from pictures and their descriptions, which is unsuitable for sentiment classification tasks. The specific reason remains untested.
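
A minimal version of the comparison described above encodes review texts with CLIP's text encoder and fits a simple classifier on the resulting embeddings. This sketch is not the paper's exact setup; the checkpoint, truncation length, and classifier choice are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

texts = ["A wonderful, heartfelt film.", "Dull plot and wooden acting."]
labels = [1, 0]  # 1 = positive, 0 = negative

with torch.no_grad():
    # CLIP's text encoder has a short context window, so long reviews are truncated.
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=77,
                       return_tensors="pt")
    features = model.get_text_features(**inputs).numpy()

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```
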
17

Dash, Sandeep Kumar, Saurav Saha, Partha Pakray, and Alexander Gelbukh. "Generating image captions through multimodal embedding." Journal of Intelligent & Fuzzy Systems 36, no. 5 (May 14, 2019): 4787–96. http://dx.doi.org/10.3233/jifs-179027.

18

Qi, Fan, Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu. "Discriminative multimodal embedding for event classification." Neurocomputing 395 (June 2020): 160–69. http://dx.doi.org/10.1016/j.neucom.2017.11.078.

19

Lee, Jin Young. "Deep multimodal embedding for video captioning." Multimedia Tools and Applications 78, no. 22 (July 24, 2019): 31793–805. http://dx.doi.org/10.1007/s11042-019-08011-3.

20

Kitanovski, Ivan, Gjorgji Strezoski, Ivica Dimitrovski, Gjorgji Madjarov, and Suzana Loskovska. "Multimodal medical image retrieval system." Multimedia Tools and Applications 76, no. 2 (January 25, 2016): 2955–78. http://dx.doi.org/10.1007/s11042-016-3261-1.

21

Xu, Hong. "Multimodal bird information retrieval system." Applied and Computational Engineering 53, no. 1 (March 28, 2024): 96–102. http://dx.doi.org/10.54254/2755-2721/53/20241282.

Abstract:
A multimodal bird information retrieval system can help popularize bird knowledge and support bird conservation. In this paper, we use a self-built bird dataset, the ViT-B/32 model from CLIP as the training model, Python as the development language, and PyQt5 for the interface development. The system mainly realizes the uploading and display of bird pictures, the multimodal retrieval of bird information, and the introduction of related bird information. The results of a trial run show that the system can accomplish multimodal retrieval of bird information, retrieving the species of a bird and other related information from pictures uploaded by the user, or retrieving the most similar bird information from the text description provided by the user.
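
The retrieval step such a system performs with CLIP ViT-B/32 can be sketched as follows: embed the query image and the candidate text entries, then rank by cosine similarity. The checkpoint name and the toy gallery below are assumptions, not the system's own code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

bird_descriptions = [
    "a small blue bird with a short beak",
    "a large raptor with brown wings",
    "a white water bird with long legs",
]
query_image = Image.new("RGB", (224, 224))  # stand-in for a user-uploaded photo

with torch.no_grad():
    inputs = processor(text=bird_descriptions, images=query_image,
                       return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.t()).squeeze(0)    # similarity to each description
print(bird_descriptions[int(scores.argmax())])    # best-matching bird entry
```
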
22

Yang, Xi, Xinbo Gao, and Qi Tian. "Polar Embedding for Aurora Image Retrieval." IEEE Transactions on Image Processing 24, no. 11 (November 2015): 3332–44. http://dx.doi.org/10.1109/tip.2015.2442913.

23

Tam, G. K. L., and R. W. H. Lau. "Embedding Retrieval of Articulated Geometry Models." IEEE Transactions on Pattern Analysis and Machine Intelligence 34, no. 11 (November 2012): 2134–46. http://dx.doi.org/10.1109/tpami.2012.17.

24

Zhou, Wengang, Houqiang Li, Jian Sun, and Qi Tian. "Collaborative Index Embedding for Image Retrieval." IEEE Transactions on Pattern Analysis and Machine Intelligence 40, no. 5 (May 1, 2018): 1154–66. http://dx.doi.org/10.1109/tpami.2017.2676779.

25

Wang, Can, Jun Zhao, Xiaofei He, Chun Chen, and Jiajun Bu. "Image retrieval using nonlinear manifold embedding." Neurocomputing 72, no. 16-18 (October 2009): 3922–29. http://dx.doi.org/10.1016/j.neucom.2009.04.011.

26

Kulvinder Singh, Et al. "Enhancing Multimodal Information Retrieval Through Integrating Data Mining and Deep Learning Techniques." International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 9 (October 30, 2023): 560–69. http://dx.doi.org/10.17762/ijritcc.v11i9.8844.

Abstract:
Multimodal information retrieval, the task of retrieving relevant information from heterogeneous data sources such as text, images, and videos, has gained significant attention in recent years due to the proliferation of multimedia content on the internet. This paper proposes an approach to enhance multimodal information retrieval by integrating data mining and deep learning techniques. Traditional information retrieval systems often struggle to effectively handle multimodal data due to the inherent complexity and diversity of such data sources. In this study, we leverage data mining techniques to preprocess and structure multimodal data efficiently. Data mining methods enable us to extract valuable patterns, relationships, and features from different modalities, providing a solid foundation for subsequent retrieval tasks. To further enhance the performance of multimodal information retrieval, deep learning techniques are employed. Deep neural networks have demonstrated their effectiveness in various multimedia tasks, including image recognition, natural language processing, and video analysis. By integrating deep learning models into our retrieval framework, we aim to capture complex intermodal dependencies and semantically rich representations, enabling more accurate and context-aware retrieval.
27

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (March 24, 2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Abstract:
Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or learning joint embeddings from multimodal data, remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. Moreover, few existing methods have fully exploited and fused the information in single-cell multimodal data.

Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust for single-cell multimodal integration.

Data availability: https://github.com/shapsider/scintegration.
28

Wang, Shiping, and Wenzhong Guo. "Sparse Multigraph Embedding for Multimodal Feature Representation." IEEE Transactions on Multimedia 19, no. 7 (July 2017): 1454–66. http://dx.doi.org/10.1109/tmm.2017.2663324.

29

Hama, Kenta, Takashi Matsubara, Kuniaki Uehara, and Jianfei Cai. "Exploring Uncertainty Measures for Image-caption Embedding-and-retrieval Task." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 2 (June 2021): 1–19. http://dx.doi.org/10.1145/3425663.

Abstract:
With the significant development of black-box machine learning algorithms, particularly deep neural networks, the practical demand for reliability assessment is rapidly increasing. On the basis of the concept that “Bayesian deep learning knows what it does not know,” the uncertainty of deep neural network outputs has been investigated as a reliability measure for classification and regression tasks. By considering an embedding task as a regression task, several existing studies have quantified the uncertainty of embedded features and improved the retrieval performance of cutting-edge models by model averaging. However, in image-caption embedding-and-retrieval tasks, well-known samples are not always easy to retrieve. This study shows that the existing method has poor performance in reliability assessment and investigates another aspect of image-caption embedding-and-retrieval tasks. We propose posterior uncertainty by considering the retrieval task as a classification task, which can accurately assess the reliability of retrieval results. The consistent performance of the two uncertainty measures is observed with different datasets (MS-COCO and Flickr30k), different deep-learning architectures (dropout and batch normalization), and different similarity functions. To the best of our knowledge, this is the first study to perform a reliability assessment on image-caption embedding-and-retrieval tasks.
30

Cao, Yu, Shawn Steffey, Jianbiao He, Degui Xiao, Cui Tao, Ping Chen, and Henning Müller. "Medical Image Retrieval: A Multimodal Approach." Cancer Informatics 13s3 (January 2014): CIN.S14053. http://dx.doi.org/10.4137/cin.s14053.

Abstract:
Medical imaging is becoming a vital component of the war on cancer. Tremendous amounts of medical image data are captured and recorded in a digital format during cancer care and cancer research. Facing such an unprecedented volume of image data with heterogeneous image modalities, it is necessary to develop effective and efficient content-based medical image retrieval systems for cancer clinical practice and research. While substantial progress has been made in different areas of content-based image retrieval (CBIR) research, direct applications of existing CBIR techniques to medical images have produced unsatisfactory results because of the unique characteristics of medical images. In this paper, we develop a new multimodal medical image retrieval approach based on recent advances in statistical graphical models and deep learning. Specifically, we first investigate a new extended probabilistic Latent Semantic Analysis model to integrate the visual and textual information from medical images to bridge the semantic gap. We then develop a new deep Boltzmann machine-based multimodal learning model to learn the joint density model from multimodal information in order to derive the missing modality. Experimental results with a large volume of real-world medical images have shown that our new approach is a promising solution for next-generation medical image indexing and retrieval systems.
31

Rafailidis, D., S. Manolopoulou, and P. Daras. "A unified framework for multimodal retrieval." Pattern Recognition 46, no. 12 (December 2013): 3358–70. http://dx.doi.org/10.1016/j.patcog.2013.05.023.

32

Qiu, Dong, Haihuan Jiang, and Shuqiao Chen. "Fuzzy Information Retrieval Based on Continuous Bag-of-Words Model." Symmetry 12, no. 2 (February 3, 2020): 225. http://dx.doi.org/10.3390/sym12020225.

Abstract:
In this paper, we study the feasibility of performing fuzzy information retrieval with word embeddings. We propose a fuzzy information retrieval approach to capture the relationships between words and the query language, which combines techniques from deep learning and fuzzy set theory. We leverage large-scale data and the continuous bag-of-words model to find the relevant features of words and obtain word embeddings. To enhance retrieval effectiveness, we measure the relatedness among words by word embedding, which has the property of symmetry. Experimental results show that the recall ratio, precision ratio, and harmonic average of the two ratios of the proposed method outperform those of the traditional methods.
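
The idea of using CBOW embeddings as a soft, symmetric relatedness measure between query terms and document terms can be illustrated with gensim. The toy corpus, parameters, and scoring rule below are assumptions, not the paper's implementation.

```python
from gensim.models import Word2Vec

corpus = [
    ["fuzzy", "retrieval", "matches", "related", "terms"],
    ["word", "embedding", "captures", "semantic", "similarity"],
    ["query", "terms", "and", "document", "terms", "are", "compared"],
]
# sg=0 selects the continuous bag-of-words (CBOW) training objective.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)

def soft_match_score(query_terms, doc_terms):
    """Average best-match cosine similarity of each query term against the document."""
    scores = []
    for q in query_terms:
        if q not in model.wv:
            scores.append(0.0)
            continue
        best = max((model.wv.similarity(q, d) for d in doc_terms if d in model.wv),
                   default=0.0)
        scores.append(float(best))
    return sum(scores) / len(scores)

print(soft_match_score(["semantic", "similarity"], corpus[2]))
```
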
33

Nguyen, Huy Manh, Tomo Miyazaki, Yoshihiro Sugaya, and Shinichiro Omachi. "Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence." Applied Sciences 11, no. 7 (April 3, 2021): 3214. http://dx.doi.org/10.3390/app11073214.

Abstract:
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate various videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We propose to produce a final similarity between instances by fusing similarities measured in each embedding space using a weighted sum strategy. We determine the weights according to a sentence. Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive to state-of-the-art methods. These experimental results demonstrated the effectiveness of the proposed multiple embedding approach compared to existing methods.
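
The sentence-dependent weighted fusion of similarities from multiple embedding spaces, as described above, can be sketched in a few lines of PyTorch. The projection heads and softmax weighting below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceSimilarity(nn.Module):
    def __init__(self, dim=512, num_spaces=3):
        super().__init__()
        # One projection per embedding space, for each modality.
        self.video_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_spaces))
        self.text_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_spaces))
        # Weights over the spaces are predicted from the sentence feature.
        self.weighting = nn.Linear(dim, num_spaces)

    def forward(self, video_feat, text_feat):
        sims = []
        for vp, tp in zip(self.video_proj, self.text_proj):
            v = F.normalize(vp(video_feat), dim=-1)
            t = F.normalize(tp(text_feat), dim=-1)
            sims.append((v * t).sum(dim=-1))        # cosine similarity in each space
        sims = torch.stack(sims, dim=-1)            # (batch, num_spaces)
        weights = torch.softmax(self.weighting(text_feat), dim=-1)
        return (weights * sims).sum(dim=-1)         # final fused similarity

model = MultiSpaceSimilarity()
video, sentence = torch.randn(4, 512), torch.randn(4, 512)
print(model(video, sentence).shape)  # torch.Size([4])
```
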
34

Huang, Chuan Bo, and Li Xiang. "Image Retrieval Based on Semi-Supervised Orthogonal Discriminant Embedding." Applied Mechanics and Materials 347-350 (August 2013): 3532–36. http://dx.doi.org/10.4028/www.scientific.net/amm.347-350.3532.

Abstract:
A retrieval algorithm based on dimensionality reduction is proposed to effectively extract features and improve the performance of image retrieval. Firstly, the most important properties of the subspaces with respect to image retrieval are captured by intelligently utilizing the similarity and dissimilarity information of the semantic and geometric structure in the image database. Secondly, we propose a Semi-supervised Orthogonal Discriminant Embedding Label Propagation (SODELP) method for image retrieval. The experimental results show that our method has discriminative power with respect to colour, texture, and shape features and achieves good retrieval performance.
35

Kumari, Sneha, Rajiv Pandey, Amit Singh, and Himanshu Pathak. "SPARQL: Semantic Information Retrieval by Embedding Prepositions." International Journal of Network Security & Its Applications 6, no. 1 (January 31, 2014): 49–57. http://dx.doi.org/10.5121/ijnsa.2014.6105.

36

Yu, Mengyang, Li Liu, and Ling Shao. "Binary Set Embedding for Cross-Modal Retrieval." IEEE Transactions on Neural Networks and Learning Systems 28, no. 12 (December 2017): 2899–910. http://dx.doi.org/10.1109/tnnls.2016.2609463.

37

Almasri, Feras, and Olivier Debeir. "Schematics Retrieval Using Whole-Graph Embedding Similarity." Electronics 13, no. 7 (March 22, 2024): 1176. http://dx.doi.org/10.3390/electronics13071176.

Abstract:
This paper addresses the pressing environmental concern of plastic waste, particularly in the biopharmaceutical production sector, where single-use assemblies (SUAs) significantly contribute to this issue. To address and mitigate this problem, we propose a unique approach centered around the standardization and optimization of SUA drawings through digitization and structured representation. Leveraging the non-Euclidean properties of SUA drawings, we employ a graph-based representation, utilizing graph convolutional networks (GCNs) to capture complex structural relationships. Introducing a novel weakly supervised method for the similarity-based retrieval of SUA graph networks, we optimize graph embeddings in a low-dimensional Euclidean space. Our method demonstrates effectiveness in retrieving similar graphs that share the same functionality, offering a promising solution to reduce plastic waste in pharmaceutical assembly processes.
38

Mollenhauer, Hilton H. "Stain contamination and embedding in electron microscopy." Proceedings, annual meeting, Electron Microscopy Society of America 44 (August 1986): 50–53. http://dx.doi.org/10.1017/s0424820100141986.

Abstract:
Many factors (e.g., resolution of the microscope, type of tissue, and preparation of the sample) affect electron microscopical images and alter the amount of information that can be retrieved from a specimen. Of interest in this report are those factors associated with the evaluation of epoxy embedded tissues. In this context, informational retrieval is dependent, in part, on the ability to “see” sample detail (e.g., contrast) and, in part, on the quality of sample preservation. Two aspects of this problem will be discussed: 1) epoxy resins and their effect on image contrast, information retrieval, and sample preservation; and 2) the interaction between some stains commonly used for enhancing contrast and information retrieval.
39

Qiao, Ya-nan, Qinghe Du, and Di-fang Wan. "A study on query terms proximity embedding for information retrieval." International Journal of Distributed Sensor Networks 13, no. 2 (February 2017): 155014771769489. http://dx.doi.org/10.1177/1550147717694891.

Abstract:
Information retrieval is applied widely to models and algorithms in wireless networks for cyber-physical systems. Query term proximity has proved to be very useful information for improving the performance of information retrieval systems. Query term proximity cannot retrieve documents independently, and it must be incorporated into original information retrieval models. This article proposes the concept of query term proximity embedding, which is a new method to incorporate query term proximity into original information retrieval models. Moreover, the term-field-convolutions frequency framework, which is an implementation of query term proximity embedding, is proposed in this article, and experimental results show that this framework can improve performance effectively compared with traditional proximity retrieval models.
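
A very small illustration of folding query-term proximity into a base retrieval score is given below. It is not the term-field-convolutions frequency framework itself; the proximity feature and the combination rule are assumptions.

```python
from itertools import combinations

def min_pair_distance(doc_tokens, query_terms):
    """Smallest token distance between occurrences of any two distinct query terms."""
    positions = {q: [i for i, t in enumerate(doc_tokens) if t == q] for q in query_terms}
    best = float("inf")
    for a, b in combinations(query_terms, 2):
        for i in positions[a]:
            for j in positions[b]:
                best = min(best, abs(i - j))
    return best

def proximity_enhanced_score(base_score, doc_tokens, query_terms, alpha=1.0):
    """Add a proximity bonus to an existing retrieval score (e.g., BM25)."""
    d = min_pair_distance(doc_tokens, query_terms)
    proximity = 0.0 if d == float("inf") else 1.0 / (1.0 + d)
    return base_score + alpha * proximity

doc = "wireless sensor networks support information retrieval models".split()
print(proximity_enhanced_score(2.3, doc, ["information", "retrieval"]))
```
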
40

P. Bhopale, Bhopale, and Ashish Tiwari. "LEVERAGING NEURAL NETWORK PHRASE EMBEDDING MODEL FOR QUERY REFORMULATION IN AD-HOC BIOMEDICAL INFORMATION RETRIEVAL." Malaysian Journal of Computer Science 34, no. 2 (April 30, 2021): 151–70. http://dx.doi.org/10.22452/mjcs.vol34no2.2.

Abstract:
This study presents a Spark-enhanced neural network phrase embedding model to leverage query representation for relevant biomedical literature retrieval. Information retrieval for clinical decision support demands high precision. In recent years, word embeddings have evolved as a solution to such requirements. They represent vocabulary words in low-dimensional vectors in the context of their similar words; however, they are inadequate for dealing with semantic phrases or multi-word units. Learning vector embeddings for phrases while maintaining word meanings is a challenging task. This study proposes a scalable phrase embedding technique to embed multi-word units into vector representations using a state-of-the-art word embedding technique, keeping both words and phrases in the same vector space. It enhances the effectiveness and efficiency of query language models by expanding unseen query terms and phrases with semantically associated query terms. Embedding vectors are evaluated via a query expansion technique for an ad-hoc retrieval task over two benchmark corpora, viz. the TREC-CDS 2014 collection with 733,138 PubMed articles and the OHSUMED corpus with 348,566 articles collected from a Medline database. The results show that the proposed technique significantly outperforms other state-of-the-art retrieval techniques.
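
The phrase-aware embedding and query expansion described above can be illustrated with gensim's phrase detection and Word2Vec. The toy corpus and expansion rule are assumptions; this is not the authors' Spark pipeline.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [
    ["chest", "pain", "radiating", "to", "left", "arm"],
    ["acute", "myocardial", "infarction", "treated", "with", "aspirin"],
    ["patient", "with", "chest", "pain", "and", "shortness", "of", "breath"],
]
# Detect frequent multi-word units (e.g., "chest_pain") and rewrite the corpus.
phrases = Phrases(sentences, min_count=1, threshold=0.1)
phrased = [phrases[s] for s in sentences]

# Train embeddings over words and detected phrases in the same vector space.
model = Word2Vec(sentences=phrased, vector_size=50, window=3, min_count=1)

def expand_query(terms, topn=3):
    """Add nearest-neighbour words/phrases for each query term found in the vocabulary."""
    expansion = list(terms)
    for t in terms:
        if t in model.wv:
            expansion += [w for w, _ in model.wv.most_similar(t, topn=topn)]
    return expansion

print(expand_query(["chest_pain"]))
```
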
41

Dong, Bin, Songlei Jian, and Kai Lu. "Learning Multimodal Representations by Symmetrically Transferring Local Structures." Symmetry 12, no. 9 (September 13, 2020): 1504. http://dx.doi.org/10.3390/sym12091504.

Abstract:
Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across the modalities, such as the intra-modal local structures. In other words, they only focus on the object-level alignment and ignore structure-level alignment. To tackle the problem, we propose a novel symmetric multimodal representation learning framework by transferring local structures across different modalities, namely MTLS. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in intra-modal representations. The bidirectional retrieval loss based on multi-layer neural networks is utilized to align the two modalities. MTLS is instantiated with image and text data and shows its superior performance on image-text retrieval and image clustering. MTLS outperforms the state-of-the-art multimodal learning methods by up to 32% in terms of R@1 on text-image retrieval and 16.4% in terms of AMI on clustering.
42

Wang, Zhen, Liu Liu, Yiqun Duan, and Dacheng Tao. "Continual Learning through Retrieval and Imagination." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8594–602. http://dx.doi.org/10.1609/aaai.v36i8.20837.

Abstract:
Continual learning is an intellectual ability of artificial agents to learn new streaming labels from sequential data. The main impediment to continual learning is catastrophic forgetting, a severe performance degradation on previously learned tasks. Although simply replaying all previous data or continuously adding the model parameters could alleviate the issue, it is impractical in real-world applications due to the limited available resources. Inspired by the mechanism of the human brain to deepen its past impression, we propose a novel framework, Deep Retrieval and Imagination (DRI), which consists of two components: 1) an embedding network that constructs a unified embedding space without adding model parameters on the arrival of new tasks; and 2) a generative model to produce additional (imaginary) data based on the limited memory. By retrieving the past experiences and corresponding imaginary data, DRI distills knowledge and rebalances the embedding space to further mitigate forgetting. Theoretical analysis demonstrates that DRI can reduce the loss approximation error and improve the robustness through retrieval and imagination, bringing better generalizability to the network. Extensive experiments show that DRI performs significantly better than the existing state-of-the-art continual learning methods and effectively alleviates catastrophic forgetting.
43

Zhang, Guihao, and Jiangzhong Cao. "Feature Fusion Based on Transformer for Cross-modal Retrieval." Journal of Physics: Conference Series 2558, no. 1 (August 1, 2023): 012012. http://dx.doi.org/10.1088/1742-6596/2558/1/012012.

Abstract:
With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has gradually become a hot area of research. As one of the important branches of multimodal retrieval, image-text retrieval aims to design a model to learn and align two modal data, image and text, in order to build a bridge of semantic association between the two heterogeneous data sources, so as to achieve unified alignment and retrieval. The current mainstream image-text cross-modal retrieval approaches have made good progress by designing deep learning-based models to find potential associations between different modal data. In this paper, we design a transformer-based feature fusion network to fuse the information of the two modalities in the feature extraction process, which can enrich the semantic connection between the modalities. Meanwhile, we conduct experiments on the benchmark dataset Flickr30k and obtain competitive results, where recall at 10 achieves 96.2% accuracy in image-to-text retrieval.
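
The kind of transformer-based fusion of image-region and text-token features described above can be sketched as a joint self-attention encoder over both modalities. The layer sizes, modality-type embeddings, and pooling below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # Modality-type embeddings tell the encoder which tokens are image regions.
        self.type_embed = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, image_regions, text_tokens):
        img = image_regions + self.type_embed(torch.zeros(
            image_regions.shape[:2], dtype=torch.long, device=image_regions.device))
        txt = text_tokens + self.type_embed(torch.ones(
            text_tokens.shape[:2], dtype=torch.long, device=text_tokens.device))
        fused = self.encoder(torch.cat([img, txt], dim=1))  # joint self-attention
        return fused.mean(dim=1)                            # pooled multimodal feature

model = FusionEncoder()
regions = torch.randn(2, 36, 256)    # e.g., 36 detected regions per image
tokens = torch.randn(2, 20, 256)     # e.g., 20 word features per caption
print(model(regions, tokens).shape)  # torch.Size([2, 256])
```
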
44

Moon, Jucheol, Nhat Anh Le, Nelson Hebert Minaya, and Sang-Il Choi. "Multimodal Few-Shot Learning for Gait Recognition." Applied Sciences 10, no. 21 (October 29, 2020): 7619. http://dx.doi.org/10.3390/app10217619.

Abstract:
A person’s gait is a behavioral trait that is uniquely associated with each individual and can be used to recognize the person. As information about the human gait can be captured by wearable devices, a few studies have proposed methods to process gait information for identification purposes. Despite recent advances in gait recognition, the open set gait recognition problem presents challenges to current approaches. To address the open set gait recognition problem, a system should be able to deal with unseen subjects who have not been included in the training dataset. In this paper, we propose a system that learns a mapping from a multimodal time series collected using an insole to a latent (embedding vector) space to address the open set gait recognition problem. The distance between two embedding vectors in the latent space corresponds to the similarity between two multimodal time series. Using the characteristics of the human gait pattern, multimodal time series are sliced into unit steps. The system maps unit steps to embedding vectors using an ensemble consisting of a convolutional neural network and a recurrent neural network. To recognize each individual, the system learns a decision function using a one-class support vector machine from a few embedding vectors of the person in the latent space; the system then determines whether an unknown unit step belongs to a known individual. Our experiments demonstrate that the proposed framework recognizes individuals with high accuracy regardless of whether they have been registered or not. If we could have an environment in which all people wear the insole, the framework could be widely used for user verification.
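
The verification step described above (a one-class SVM fitted on a few embedding vectors of an enrolled person) can be illustrated with scikit-learn, using random stand-in embeddings in place of the CNN+RNN ensemble.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
embedding_dim = 64

# A few embedding vectors of the enrolled person's unit steps (stand-ins here).
enrolled = rng.normal(loc=0.0, scale=0.3, size=(10, embedding_dim))

# Unknown unit steps: two near the enrolled cluster, one far away (an impostor).
unknown = np.vstack([
    rng.normal(0.0, 0.3, size=(2, embedding_dim)),
    rng.normal(3.0, 0.3, size=(1, embedding_dim)),
])

detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(enrolled)
print(detector.predict(unknown))  # +1 = accepted as the enrolled person, -1 = rejected
```
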
45

Zhuang, Yueting, Jun Song, Fei Wu, Xi Li, Zhongfei Zhang, and Yong Rui. "Multimodal Deep Embedding via Hierarchical Grounded Compositional Semantics." IEEE Transactions on Circuits and Systems for Video Technology 28, no. 1 (January 2018): 76–89. http://dx.doi.org/10.1109/tcsvt.2016.2606648.

46

Huang, Feiran, Xiaoming Zhang, Jie Xu, Chaozhuo Li, and Zhoujun Li. "Network embedding by fusing multimodal contents and links." Knowledge-Based Systems 171 (May 2019): 44–55. http://dx.doi.org/10.1016/j.knosys.2019.02.003.

47

Kompus, Kristiina, Tom Eichele, Kenneth Hugdahl, and Lars Nyberg. "Multimodal Imaging of Incidental Retrieval: The Low Route to Memory." Journal of Cognitive Neuroscience 23, no. 4 (April 2011): 947–60. http://dx.doi.org/10.1162/jocn.2010.21494.

Abstract:
Memories of past episodes frequently come to mind incidentally, without directed search. It has remained unclear how incidental retrieval processes are initiated in the brain. Here we used fMRI and ERP recordings to find brain activity that specifically correlates with incidental retrieval, as compared to intentional retrieval. Intentional retrieval was associated with increased activation in dorsolateral prefrontal cortex. By contrast, incidental retrieval was associated with a reduced fMRI signal in posterior brain regions, including extrastriate and parahippocampal cortex, and a modulation of a posterior ERP component 170 msec after the onset of visual retrieval cues. Successful retrieval under both intentional and incidental conditions was associated with increased activation in the hippocampus, precuneus, and ventrolateral prefrontal cortex, as well as increased amplitude of the P600 ERP component. These results demonstrate how early bottom–up signals from posterior cortex can lead to reactivation of episodic memories in the absence of strategic retrieval attempts.
48

UbaidullahBokhari, Mohammad, and Faraz Hasan. "Multimodal Information Retrieval: Challenges and Future Trends." International Journal of Computer Applications 74, no. 14 (July 26, 2013): 9–12. http://dx.doi.org/10.5120/12951-9967.

49

Yamaguchi, Masataka. "2. Multimodal Retrieval between Vision and Language." Journal of The Institute of Image Information and Television Engineers 72, no. 9 (2018): 655–58. http://dx.doi.org/10.3169/itej.72.655.

50

Calumby, Rodrigo Tripodi. "Diversity-oriented Multimodal and Interactive Information Retrieval." ACM SIGIR Forum 50, no. 1 (June 27, 2016): 86. http://dx.doi.org/10.1145/2964797.2964811.
