Journal articles on the topic 'Multimodal Embeddings'

Author: Grafiati

Published: 26 October 2024

Consult the top 50 journal articles for your research on the topic 'Multimodal Embeddings.'

1

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, and Alexander Panchenko. "On Isotropy of Multimodal Embeddings." Information 14, no. 7 (July 10, 2023): 392. http://dx.doi.org/10.3390/info14070392.

Abstract:

Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for improving performance in text processing tasks. This paper is the first comprehensive investigation of the distribution of multimodal embeddings using the example of OpenAI’s CLIP pretrained model. We aimed to deepen the understanding of the embedding space of multimodal embeddings, which has previously been unexplored in this respect, and study the impact on various end tasks. Our initial efforts were focused on measuring the alignment of image and text embedding distributions, with an emphasis on their isotropic properties. In addition, we evaluated several gradient-free approaches to enhance these properties, establishing their efficiency in improving the isotropy/alignment of the embeddings and, in certain cases, the zero-shot classification accuracy. Significantly, our analysis revealed that both CLIP and BERT models yielded embeddings situated within a cone immediately after initialization and preceding training. However, they were mostly isotropic in the local sense. We further extended our investigation to the structure of multilingual CLIP text embeddings, confirming that the observed characteristics were language-independent. By computing the few-shot classification accuracy and point-cloud metrics, we provide evidence of a strong correlation among multilingual embeddings. 
Embeddings transformation using the methods described in this article makes it easier to visualize embeddings. At the same time, multiple experiments that we conducted showed that, in regard to the transformed embeddings, the downstream tasks performance does not drop substantially (and sometimes is even improved). This means that one could obtain an easily visualizable embedding space, without substantially losing the quality of downstream tasks.
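
A minimal sketch of the idea of anisotropy in the abstract above. It uses the mean pairwise cosine similarity, a commonly used proxy for how tightly vectors crowd into a cone, and mean-centering as one simple gradient-free transformation. The abstract does not say which metrics or transformations the authors used, so both choices here, and the toy vectors, are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine over all distinct pairs; values near 1 suggest a narrow cone."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count


def center(vectors):
    """Subtract the mean vector (assumed transformation, needs no gradients)."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]


# Hypothetical toy embeddings packed into a narrow cone around (1, 1).
toy = [[1.0, 0.9], [0.9, 1.0], [1.1, 1.0], [1.0, 1.1]]
before = mean_pairwise_cosine(toy)
after = mean_pairwise_cosine(center(toy))
print(f"mean pairwise cosine before: {before:.3f}, after centering: {after:.3f}")
```

On this toy set the score falls from close to 1 to below zero after centering, which is the kind of spread a more isotropic space would show.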

2

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. "LGMRec: Local and Global Graph Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Abstract:

The multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through a joint modeling of user historical behaviors (e.g., purchases, clicks) and item various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests.

3

Shang, Bin, Yinliang Zhao, Jun Liu, and Di Wang. "LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8957–65. http://dx.doi.org/10.1609/aaai.v38i8.28744.

Abstract:

Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when adding them to the traditional KG embeddings due to the contribution of each associated image to entity is different in diverse link scenarios. Moreover, treating each triple independently when learning entity embeddings leads to local structural and the whole graph information missing. To address these challenges, we propose a novel link aware fusion and aggregation based multimodal knowledge graph completion model named LAFA, which is composed of link aware fusion module and link aware aggregation module. The link aware fusion module alleviates noise of irrelevant visual information by calculating the importance between an entity and its associated images in different link scenarios, and fuses the visual and structural embeddings according to the importance through our proposed modality embedding fusion mechanism. The link aware aggregation module assigns neighbor structural information to a given central entity by calculating the importance between the entity and its neighbors, and aggregating the fused embeddings through linear combination according to the importance. Extensive experiments on standard datasets validate that LAFA can obtain state-of-the-art performance.
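
The aggregation step described above (weights from an importance score, then a linear combination) can be sketched as follows. The abstract does not say how LAFA computes importance, so the dot-product score and the softmax normalisation are assumptions, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(entity, neighbors):
    """Weight each neighbor by an assumed dot-product importance, then combine linearly."""
    scores = [sum(e * n for e, n in zip(entity, nb)) for nb in neighbors]
    weights = softmax(scores)
    dim = len(entity)
    combined = [
        sum(w * nb[k] for w, nb in zip(weights, neighbors)) for k in range(dim)
    ]
    return weights, combined


# Hypothetical central entity and two neighbor embeddings.
entity = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
weights, combined = aggregate(entity, neighbors)
print(weights, combined)
```

The neighbor that is more similar to the central entity gets the larger weight, so it contributes more to the combined embedding.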

4

Sun, Zhongkai, Prathusha Sarma, William Sethares, and Yingyu Liang. "Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8992–99. http://dx.doi.org/10.1609/aaai.v34i05.6431.

Abstract:

Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks in part because the text features are derived from advanced language models or word embeddings trained on massive data sources while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that the multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA) and the proposed embeddings are then tested on several benchmark datasets and against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
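
The "outer product of text and audio" mentioned above can be illustrated with a small sketch. The feature values and dimensions are hypothetical, and the later stages (feature extraction from the product and deep canonical correlation analysis) are not shown.

```python
def outer_product(text_vec, audio_vec):
    """Return the matrix whose (i, j) entry is text_vec[i] * audio_vec[j]."""
    return [[t * a for a in audio_vec] for t in text_vec]


# Hypothetical utterance-level features for one utterance.
text_features = [1.0, 2.0, 3.0]
audio_features = [0.5, -1.0]

text_based_audio = outer_product(text_features, audio_features)
for row in text_based_audio:
    print(row)
```

Each entry pairs one text dimension with one audio dimension, so the matrix exposes every pairwise interaction between the two views.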

5

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (July 2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.

6

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (March 24, 2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Abstract:

Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or, learning joint embeddings from multimodal data remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. And few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust to single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration.

7

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (May 18, 2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Abstract:

Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that event described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to what have been done in previous approaches which project signals from different modalities onto a homogeneous space. In this paper, we propose a Multimodal Event Representation Learning framework (MERL) to learn event representations based on both text and image modalities simultaneously. Event textual triples are projected as Gaussian density embeddings by a dual-path Gaussian triple encoder, while event images are projected as point embeddings by a visual event component-aware image encoder. Moreover, a novel score function motivated by statistical hypothesis testing is introduced to coordinate two embedding spaces. Experiments are conducted on various multimodal event-related tasks and results show that MERL outperforms a number of unimodal and multimodal baselines, demonstrating the effectiveness of the proposed framework.

8

Sah, Shagan, Sabarish Gopalakishnan, and Raymond Ptucha. "Aligned attention for common multimodal embeddings." Journal of Electronic Imaging 29, no. 02 (March 25, 2020): 1. http://dx.doi.org/10.1117/1.jei.29.2.023013.

9

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, and Yu Huang. "A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (March 24, 2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Abstract:

The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting for mining more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by encouraging the similar discrete feature distributions closer while pushing the dissimilar further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective to generate photo-realistic synthetic data, with interactive interpretation of latent embedding, and performs favorably against some baselines on most real-world and simulated datasets.

10

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract:

Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modality data. It is challenging for not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as common semantic space, and using generative adversarial network (GAN) to capture the underlying multimodal data structures, as well as strengthen relations between input data and semantic space to generalize across seen and unseen classes. In this paper, we propose a novel method termed Learning Cross-Aligned Latent Embeddings (LCALE) as an alternative to these GAN based methods for ZS-CMR. Unlike using the class-embeddings as the semantic space, our method seeks for a shared low-dimensional latent space of input multimodal features and class-embeddings by modality-specific variational autoencoders. Notably, we align the distributions learned from multimodal input features and from class-embeddings to construct latent embeddings that contain the essential cross-modal correlation associated with unseen classes. Effective cross-reconstruction and cross-alignment criterions are further developed to preserve class-discriminative information in latent space, which benefits the efficiency for retrieval and enable the knowledge transfer to unseen classes. We evaluate our model using four benchmark datasets on image-text retrieval tasks and one large-scale dataset on image-sketch retrieval tasks. The experimental results show that our method establishes the new state-of-the-art performance for both tasks on all datasets.

11

Zhu, Chaoyu, Zhihao Yang, Xiaoqiong Xia, Nan Li, Fan Zhong, and Lei Liu. "Multimodal reasoning based on knowledge graph embedding for specific diseases." Bioinformatics 38, no. 8 (February 12, 2022): 2235–45. http://dx.doi.org/10.1093/bioinformatics/btac085.

Abstract:

Motivation: Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases. Results: This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, a SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Diseases11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs using entity prediction task as the evaluation protocol. We verify the model’s reliability in discovering new knowledge by manually proofreading predicted drug–gene, gene–disease and disease–drug pairs. Using embedding results as initialization parameters for the biomolecular interaction classification, we demonstrate the universality of embedding models. Availability and implementation: The constructed SDKG-11 and the implementation by TensorFlow are available from https://github.com/ZhuChaoY/SDKG-11. Supplementary information: Supplementary data are available at Bioinformatics online.

12

Tripathi, Aakash, Asim Waqas, Yasin Yilmaz, and Ghulam Rasool. "Abstract 4905: Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches." Cancer Research 84, no. 6_Supplement (March 22, 2024): 4905. http://dx.doi.org/10.1158/1538-7445.am2024-4905.

Abstract:

Integrating multimodal lung data including clinical notes, medical images, and molecular data is critical for predictive modeling tasks like survival prediction, yet effectively aligning these disparate data types remains challenging. We present a novel method to integrate heterogeneous lung modalities by first thoroughly analyzing various domain-specific models and selecting the optimal model for embedding feature extraction per data type based on performance on representative pretrained tasks. For clinical notes, the GatorTron models showed the lowest regression loss on an initial evaluation set, with the large GatorTron-medium model achieving 12.9 loss. After selecting the top performers, we extracted robust embeddings on the full lung dataset built using the Multimodal Integration of Oncology Data System (MINDS) framework. MINDS provides an end-to-end platform for aggregating and normalizing multimodal patient data. We aligned the multimodal embeddings to a central pre-trained language model using contrastive representation learning based on a cosine similarity loss function. To adapt the language model to the new modalities, we employed a parameter-efficient tuning method called adapter tuning, which introduces small trainable adapter layers that leave the base model weights frozen. This avoids catastrophic forgetting of the pretrained weights. We evaluated our multimodal model on prognostic prediction tasks including survival regression and subtype classification using both public and internal lung cancer datasets spanning multiple histologic subtypes and stages. Our aligned multimodal model demonstrated improved performance over models utilizing only single modalities, highlighting the benefits of integrating complementary information across diverse lung data types. This work illustrates the potential of flexible multimodal modeling for critical lung cancer prediction problems using heterogeneous real-world patient data.
Our model provides a strong foundation for incorporating emerging data types, modalities, and predictive tasks in the future. Citation Format: Aakash Tripathi, Asim Waqas, Yasin Yilmaz, Ghulam Rasool. Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4905.
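
The abstract above mentions alignment "based on a cosine similarity loss function" without giving its exact form. A minimal sketch of one common variant, assuming only matched pairs and a loss of 1 minus cosine similarity, could look like the following. Negative pairs, adapters, and the actual encoders are omitted, and all names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(modality_embs, text_embs):
    """Mean of (1 - cosine) over matched pairs; 0 means perfectly aligned directions."""
    losses = [
        1.0 - cosine_similarity(m, t) for m, t in zip(modality_embs, text_embs)
    ]
    return sum(losses) / len(losses)


# Hypothetical embeddings: one row per patient, matched by position.
image_embs = [[1.0, 0.0], [0.0, 2.0]]
text_embs_aligned = [[2.0, 0.0], [0.0, 1.0]]
text_embs_orthogonal = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = cosine_alignment_loss(image_embs, text_embs_aligned)
loss_orthogonal = cosine_alignment_loss(image_embs, text_embs_orthogonal)
print(loss_aligned, loss_orthogonal)
```

The loss ignores vector length and depends only on direction, which is why the aligned pairs score zero even though their norms differ.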

13

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (November 20, 2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Abstract:

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in a form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation between the query image is close to that of query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training of our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.
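
To make the analogy query above concrete, here is a sketch of retrieval in a hand-made toy space where analogical regularities hold as vector offsets. The paper learns such a space with its siamese network; this example only shows the lookup step, and every vector and name below is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy_query(word_a, word_b, image_vec):
    """Build the query for a : b :: image : ? as image - a + b."""
    return [i - a + b for a, b, i in zip(word_a, word_b, image_vec)]


def nearest(query, candidates):
    """Return the candidate name whose vector has the highest cosine to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Toy axes: (cat-ness, dog-ness, bench-ness).
cat, dog = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
query_image = [1.0, 0.0, 1.0]  # stands in for "a cat sitting on a bench"
gallery = {
    "dog_on_bench": [0.0, 1.0, 1.0],
    "cat_on_bench": [1.0, 0.0, 1.0],
    "dog_on_grass": [0.0, 1.0, 0.0],
}

result = nearest(analogy_query(cat, dog, query_image), gallery)
print(result)
```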

14

Mai, Sijie, Haifeng Hu, and Songlong Xing. "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Abstract:

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

15

Kim, Donghyun, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. "MULE: Multimodal Universal Language Embedding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11254–61. http://dx.doi.org/10.1609/aaai.v34i07.6785.

Abstract:

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.

16

Wehrmann, Jônatas, Anderson Mattjie, and Rodrigo C. Barros. "Order embeddings and character-level convolutions for multimodal alignment." Pattern Recognition Letters 102 (January 2018): 15–22. http://dx.doi.org/10.1016/j.patrec.2017.11.020.

17

Mithun, Niluthpol C., Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. "Joint embeddings with multimodal cues for video-text retrieval." International Journal of Multimedia Information Retrieval 8, no. 1 (January 12, 2019): 3–18. http://dx.doi.org/10.1007/s13735-018-00166-3.

18

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (February 28, 2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Abstract:

Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a classification head. The SwinT + BiLSTM has performed better when compared to the ViT + BiLSTM across all the dimensions. The performance of the models has improved significantly when the contextual embeddings from DistilBert replace the custom embeddings. We have achieved the highest recall of 0.631 by combining outputs of four models using the soft voting technique.
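
The soft voting mentioned at the end of the abstract above can be sketched as averaging each model's predicted probability and thresholding the mean. The four probability lists and the 0.5 threshold are hypothetical; the abstract does not report the models' outputs or any weighting.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-sample probabilities across models, then threshold the mean."""
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    averaged = [
        sum(probs[i] for probs in prob_lists) / n_models for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical P(offensive) from four models on three memes.
model_probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.4],
    [0.7, 0.1, 0.5],
    [0.6, 0.3, 0.3],
]

averaged, labels = soft_vote(model_probs)
print(averaged, labels)
```

Unlike hard voting, which counts each model's thresholded label, soft voting lets a confident model outweigh several uncertain ones. On the third meme the votes are split, and the averaged probability settles the decision.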

19

Chen, Weijia, Zhijun Lu, Lijue You, Lingling Zhou, Jie Xu, and Ken Chen. "Artificial Intelligence–Based Multimodal Risk Assessment Model for Surgical Site Infection (AMRAMS): Development and Validation Study." JMIR Medical Informatics 8, no. 6 (June 15, 2020): e18186. http://dx.doi.org/10.2196/18186.

Abstract:

Background: Surgical site infection (SSI) is one of the most common types of health care–associated infections. It increases mortality, prolongs hospital length of stay, and raises health care costs. Many institutions developed risk assessment models for SSI to help surgeons preoperatively identify high-risk patients and guide clinical intervention. However, most of these models had low accuracies.
Objective: We aimed to provide a solution in the form of an Artificial intelligence–based Multimodal Risk Assessment Model for Surgical site infection (AMRAMS) for inpatients undergoing operations, using routinely collected clinical data. We internally and externally validated the discriminations of the models, which combined various machine learning and natural language processing techniques, and compared them with the National Nosocomial Infections Surveillance (NNIS) risk index.
Methods: We retrieved inpatient records between January 1, 2014, and June 30, 2019, from the electronic medical record (EMR) system of Rui Jin Hospital, Luwan Branch, Shanghai, China. We used data from before July 1, 2018, as the development set for internal validation and the remaining data as the test set for external validation. We included patient demographics, preoperative lab results, and free-text preoperative notes as our features. We used word-embedding techniques to encode text information, and we trained the LASSO (least absolute shrinkage and selection operator) model, random forest model, gradient boosting decision tree (GBDT) model, convolutional neural network (CNN) model, and self-attention network model using the combined data. Surgeons manually scored the NNIS risk index values.
Results: For internal bootstrapping validation, CNN yielded the highest mean area under the receiver operating characteristic curve (AUROC) of 0.889 (95% CI 0.886-0.892), and the paired-sample t test revealed statistically significant advantages as compared with other models (P<.001). The self-attention network yielded the second-highest mean AUROC of 0.882 (95% CI 0.878-0.886), but the AUROC was only numerically higher than the AUROC of the third-best model, GBDT with text embeddings (mean AUROC 0.881, 95% CI 0.878-0.884, P=.47). The AUROCs of LASSO, random forest, and GBDT models using text embeddings were statistically higher than the AUROCs of models not using text embeddings (P<.001). For external validation, the self-attention network yielded the highest AUROC of 0.879. CNN was the second-best model (AUROC 0.878), and GBDT with text embeddings was the third-best model (AUROC 0.872). The NNIS risk index scored by surgeons had an AUROC of 0.651.
Conclusions: Our AMRAMS based on EMR data and deep learning methods (CNN and self-attention network) had significant advantages in terms of accuracy compared with other conventional machine learning methods and the NNIS risk index. Moreover, the semantic embeddings of preoperative notes improved the model performance further. Our models could replace the NNIS risk index to provide personalized guidance for the preoperative intervention of SSIs. Through this case, we offered an easy-to-implement solution for building multimodal RAMs for other similar scenarios.
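
Note on the metric: the AUROC values reported in this abstract measure how often a model ranks a positive case above a negative one. The following is a minimal illustrative sketch of that definition in plain Python, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = infection, 0 = no infection.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for illustration; a rank-based computation would be preferable on real record volumes.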

20

Smelik, N.D. "Multimodal topic model for texts and images utilizing their embeddings." Machine Learning and Data Analysis 2, no. 4 (2016): 421–41. http://dx.doi.org/10.21469/22233792.2.4.05.

21

Abdou, Ahmed, Ekta Sood, Philipp Müller, and Andreas Bulling. "Gaze-enhanced Crossmodal Embeddings for Emotion Recognition." Proceedings of the ACM on Human-Computer Interaction 6, ETRA (May 13, 2022): 1–18. http://dx.doi.org/10.1145/3530879.

Abstract:

Emotional expressions are inherently multimodal -- integrating facial behavior, speech, and gaze -- but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, an explicit representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide detailed insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.

22

Chen, Qihua, Xuejin Chen, Chenxuan Wang, Yixiong Liu, Zhiwei Xiong, and Feng Wu. "Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (March 24, 2024): 1174–82. http://dx.doi.org/10.1609/aaai.v38i2.27879.

Abstract:

The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading. In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments expanding the whole fly brain, which is three orders of magnitude larger than existing datasets for neuron segment connection. To learn sophisticated biological imaging features from the connectivity annotations, we propose a novel connectivity-aware contrastive learning method to generate dense volumetric EM image embedding. The learned embeddings can be easily incorporated with any point or voxel-based morphological representations for automatic neuron tracing. Extensive comparisons of different combination schemes of image and morphological representation in identifying split errors across the whole fly brain demonstrate the superiority of the proposed approach, especially for the locations that contain severe imaging artifacts, such as section missing and misalignment. The dataset and code are available at https://github.com/Levishery/Flywire-Neuron-Tracing.

23

Hu, Wenbo, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2256–64. http://dx.doi.org/10.1609/aaai.v38i3.27999.

Abstract:

Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved 17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.

24

Shen, Aili, Bahar Salehi, Jianzhong Qi, and Timothy Baldwin. "A General Approach to Multimodal Document Quality Assessment." Journal of Artificial Intelligence Research 68 (July 22, 2020): 607–32. http://dx.doi.org/10.1613/jair.1.11647.

Abstract:

The perceived quality of a document is affected by various factors, including grammaticality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text (such as images, font choices, and visual layout), we propose a joint model that combines the text content with a visual rendering of the document for document quality assessment. Our joint model achieves state-of-the-art results over five datasets in two domains (Wikipedia and academic papers), which demonstrates the complementarity of textual and visual features, and the general applicability of our model. To examine what kinds of features our model has learned, we further train our model in a multi-task learning setting, where document quality assessment is the primary task and feature learning is an auxiliary task. Experimental results show that visual embeddings are better at learning structural features while textual embeddings are better at learning readability scores, which further verifies the complementarity of visual and textual features.

25

Tseng, Shao-Yen, Shrikanth Narayanan, and Panayiotis Georgiou. "Multimodal Embeddings From Language Models for Emotion Recognition in the Wild." IEEE Signal Processing Letters 28 (2021): 608–12. http://dx.doi.org/10.1109/lsp.2021.3065598.

26

Jing, Xuebin, Liang He, Zhida Song, and Shaolei Wang. "Audio–Visual Fusion Based on Interactive Attention for Person Verification." Sensors 23, no. 24 (December 15, 2023): 9845. http://dx.doi.org/10.3390/s23249845.

Abstract:

With the rapid development of multimedia technology, personnel verification systems have become increasingly important in the security field and identity verification. However, unimodal verification systems have performance bottlenecks in complex scenarios, thus triggering the need for multimodal feature fusion methods. The main problem with audio–visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the accuracy and robustness of the system for individual identity. In this paper, we focus on how to improve multimodal person verification systems and how to combine audio and visual features. In this study, we use pretrained models to extract the embeddings from each modality and then perform fusion model experiments based on these embeddings. The baseline approach in this paper involves taking the fusion feature and passing it through a fully connected (FC) layer. Building upon this baseline, we propose three fusion models based on attentional mechanisms: attention, gated, and inter–attention. These fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On the VoxCeleb1 dataset, the best system performance achieved in this study was an equal error rate (EER) of 0.23% and a detection cost function (minDCF) of 0.011. On the evaluation set of NIST SRE19, the EER was 2.60% and the minDCF was 0.283. On the evaluation set of the CNC-AV set, the EER was 11.30% and the minDCF was 0.443. These experimental results strongly demonstrate that the proposed fusion method can significantly improve the performance of multimodal character verification systems.
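
Note on the metric: the equal error rate (EER) quoted in this abstract is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of that definition follows, assuming higher scores mean "same person"; the function name `equal_error_rate` and the toy trial scores are hypothetical and are not taken from the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold. The function
    returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap = float("inf")
    best_eer = 1.0
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
    return best_eer


# Toy verification trials (hypothetical similarity scores).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))
```

With only a handful of trials the FAR and FRR curves are coarse step functions, so this gives an approximation; evaluation toolkits typically interpolate between thresholds.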

27

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Abstract:

In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.

28

Skantze, Gabriel, and Bram Willemsen. "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings." Journal of Artificial Intelligence Research 74 (July 9, 2022): 1201–23. http://dx.doi.org/10.1613/jair.1.13689.

Abstract:

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model’s performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model’s original zero-shot performance.
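
Illustration: the adjustment this abstract describes (a difference vector applied to a language embedding, gated by a scaling factor) can be sketched as below. This is only a reading of the abstract, not the CoLLIE implementation: in the paper both the difference vector and the scale are predicted by a learned model, whereas here they are passed in as plain numbers. The function names and the re-normalization step are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def adjust_embedding(text_emb, delta, scale):
    """Shift a text embedding by scale * delta, then re-normalize.

    scale = 0 leaves the original direction untouched, which mirrors the
    idea that the adjustment is applied only when needed.
    Re-normalizing is an assumption made here so that cosine similarity
    against image embeddings stays well defined.
    """
    if len(text_emb) != len(delta):
        raise ValueError("embedding and difference vector must have equal length")
    shifted = [e + scale * d for e, d in zip(text_emb, delta)]
    return l2_normalize(shifted)


# Toy 2-dimensional example (hypothetical values).
original = [1.0, 0.0]
difference = [0.0, 1.0]
print(adjust_embedding(original, difference, 0.0))  # no adjustment applied
print(adjust_embedding(original, difference, 1.0))  # full adjustment applied
```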

29

Wang, Jenq-Haur, Mehdi Norouzi, and Shu Ming Tsai. "Augmenting Multimodal Content Representation with Transformers for Misinformation Detection." Big Data and Cognitive Computing 8, no. 10 (October 11, 2024): 134. http://dx.doi.org/10.3390/bdcc8100134.

Abstract:

Information sharing on social media has become a common practice for people around the world. Since it is difficult to check user-generated content on social media, huge amounts of rumors and misinformation are being spread with authentic information. On the one hand, most of the social platforms identify rumors through manual fact-checking, which is very inefficient. On the other hand, with an emerging form of misinformation that contains inconsistent image–text pairs, it would be beneficial if we could compare the meaning of multimodal content within the same post for detecting image–text inconsistency. In this paper, we propose a novel approach to misinformation detection by multimodal feature fusion with transformers and credibility assessment with self-attention-based Bi-RNN networks. Firstly, captions are derived from images using an image captioning module to obtain their semantic descriptions. These are compared with surrounding text by fine-tuning transformers for consistency check in semantics. Then, to further aggregate sentiment features into text representation, we fine-tune a separate transformer for text sentiment classification, where the output is concatenated to augment text embeddings. Finally, Multi-Cell Bi-GRUs with self-attention are used to train the credibility assessment model for misinformation detection. From the experimental results on tweets, the best performance with an accuracy of 0.904 and an F1-score of 0.921 can be obtained when applying feature fusion of augmented embeddings with sentiment classification results. This shows the potential of the innovative way of applying transformers in our proposed approach to misinformation detection. Further investigation is needed to validate the performance on various types of multimodal discrepancies.

30

Kang, Yu, Tianqiao Liu, Hang Li, Yang Hao, and Wenbiao Ding. "Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10875–83. http://dx.doi.org/10.1609/aaai.v36i10.21334.

Abstract:

Multimodal pre-training for audio-and-text has recently been proved to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-training audio-text models work well only when provided with large amount of parallel audio-and-text data, which brings challenges on many languages that are rich in unimodal corpora but scarce of parallel cross-modal corpus. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which is able to reconstruct input text (audio) representations from a noisy version of itself. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated from previous iteration into the new less-noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves comparable performance on multiple downstream speech understanding tasks compared with the model pre-trained on fully parallel data, demonstrating the great potential of the proposed method.

31

Yang, Bang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, and Yuexian Zou. "Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6458–66. http://dx.doi.org/10.1609/aaai.v38i6.28466.

Abstract:

While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM.
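
Note on the metric: text-to-image Recall@1, as reported in this abstract, counts how often the correct image is the top-ranked result for a caption. A minimal sketch follows, assuming a square similarity matrix in which caption i is paired with image i; the function name `recall_at_k` and the toy matrix are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired item appears among the top-k results.

    similarity[i][j] is the score between text query i and image j, and the
    correct image for query i is assumed to be image i.
    """
    if not similarity:
        raise ValueError("similarity matrix must be non-empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring images for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix (hypothetical): the third caption is misranked.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.3],
]
print(recall_at_k(toy_similarity, k=1))
print(recall_at_k(toy_similarity, k=2))
```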

32

Wang, Fengjun, Sarai Mizrachi, Moran Beladev, Guy Nadav, Gil Amsalem, Karen Lastmann Assaraf, and Hadas Harush Boker. "MuMIC – Multimodal Embedding for Multi-Label Image Classification with Tempered Sigmoid." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 15603–11. http://dx.doi.org/10.1609/aaai.v37i13.26850.

Abstract:

Multi-label image classification is a foundational topic in various domains. Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification. For instance, Contrastive Language-Image Pretraining (CLIP) demonstrates impressive image-text representation learning abilities and is robust to natural distribution shifts. This success inspires us to leverage multimodal learning for multi-label classification tasks, and benefit from contrastively learnt pretrained models. We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware tempered sigmoid based Binary Cross Entropy loss function, thus enables the optimization on multi-label objectives and transfer learning on CLIP. MuMIC is capable of providing high classification performance, handling real-world noisy data, supporting zero-shot predictions, and producing domain-specific image embeddings. In this study, a total of 120 image classes are defined, and more than 140K positive annotations are collected on approximately 60K Booking.com images. The final MuMIC model is deployed on Booking.com Content Intelligence Platform, and it outperforms other state-of-the-art models with 85.6% GAP@10 and 83.8% GAP on all 120 classes, as well as a 90.1% macro mAP score across 32 majority classes. We summarize the modelling choices which are extensively tested through ablation studies. To the best of our knowledge, we are the first to adapt contrastively learnt multimodal pretraining for real-world multi-label image classification problems, and the innovation can be transferred to other domains.
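
Illustration: the abstract names a tempered-sigmoid binary cross-entropy loss but does not give its formula. The sketch below shows one common reading, in which each logit is divided by a temperature before the sigmoid; this is an assumption, and the hardness-aware weighting mentioned in the abstract is not reproduced. The function names and default values are hypothetical.

```python
import math


def tempered_sigmoid(logit, temperature=1.0):
    """Sigmoid of logit / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def multilabel_bce(logits, targets, temperature=1.0, eps=1e-12):
    """Mean binary cross-entropy over independent labels (one logit per label)."""
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for x, y in zip(logits, targets):
        # Clamp the probability so that log() never receives 0.
        p = min(max(tempered_sigmoid(x, temperature), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Toy example (hypothetical): three labels, the first two are present.
print(multilabel_bce([2.0, 0.5, -1.0], [1, 1, 0], temperature=1.0))
print(multilabel_bce([2.0, 0.5, -1.0], [1, 1, 0], temperature=2.0))
```

Under this reading, a temperature above 1 flattens the predicted probabilities toward 0.5 and a temperature below 1 sharpens them.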

33

Nikzad-Khasmakhi, N., M. A. Balafar, M. Reza Feizi-Derakhshi, and Cina Motamed. "BERTERS: Multimodal representation learning for expert recommendation system with transformers and graph embeddings." Chaos, Solitons & Fractals 151 (October 2021): 111260. http://dx.doi.org/10.1016/j.chaos.2021.111260.

34

Liu, Hao, Ting Li, Renjun Hu, Yanjie Fu, Jingjing Gu, and Hui Xiong. "Joint Representation Learning for Multi-Modal Transportation Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1036–43. http://dx.doi.org/10.1609/aaai.v33i01.33011036.

Abstract:

Multi-modal transportation recommendation has a goal of recommending a travel plan which considers various transportation modes, such as walking, cycling, automobile, and public transit, and how to connect among these modes. The successful development of multi-modal transportation recommendation systems can help to satisfy the diversified needs of travelers and improve the efficiency of transport networks. However, existing transport recommender systems mainly focus on unimodal transport planning. To this end, in this paper, we propose a joint representation learning framework for multi-modal transportation recommendation based on a carefully-constructed multi-modal transportation graph. Specifically, we first extract a multi-modal transportation graph from large-scale map query data to describe the concurrency of users, Origin-Destination (OD) pairs, and transport modes. Then, we provide effective solutions for the optimization problem and develop an anchor embedding for transport modes to initialize the embeddings of transport modes. Moreover, we infer user relevance and OD pair relevance, and incorporate them to regularize the representation learning. Finally, we exploit the learned representations for online multimodal transportation recommendations. Indeed, our method has been deployed into one of the largest navigation Apps to serve hundreds of millions of users, and extensive experimental results with real-world map query data demonstrate the enhanced performance of the proposed method for multimodal transportation recommendations.

35

Chen, Guang, Fangxiang Feng, Guangwei Zhang, Xiaoxu Li, and Ruifan Li. "A Visually Enhanced Neural Encoder for Synset Induction." Electronics 12, no. 16 (August 20, 2023): 3521. http://dx.doi.org/10.3390/electronics12163521.

Abstract:

The synset induction task is to automatically cluster semantically identical instances, which are often represented by texts and images. Previous works mainly consider textual parts, while ignoring the visual counterparts. However, how to effectively employ the visual information to enhance the semantic representation for the synset induction is challenging. In this paper, we propose a Visually Enhanced NeUral Encoder (i.e., VENUE) to learn a multimodal representation for the synset induction task. The key insight lies in how to construct multimodal representations through intra-modal and inter-modal interactions among images and text. Specifically, we first design the visual interaction module through the attention mechanism to capture the correlation among images. To obtain the multi-granularity textual representations, we fuse the pre-trained tags and word embeddings. Second, we design a masking module to filter out weakly relevant visual information. Third, we present a gating module to adaptively regulate the modalities’ contributions to semantics. A triplet loss is adopted to train the VENUE encoder for learning discriminative multimodal representations. Then, we perform clustering algorithms on the obtained representations to induce synsets. To verify our approach, we collect a multimodal dataset, i.e., MMAI-Synset, and conduct extensive experiments. The experimental results demonstrate that our method outperforms strong baselines on three groups of evaluation metrics.

36

Xu, Xing, Jialin Tian, Kaiyi Lin, Huimin Lu, Jie Shao, and Heng Tao Shen. "Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1s (March 31, 2021): 1–17. http://dx.doi.org/10.1145/3424341.

Abstract:

Conventional cross-modal retrieval models mainly assume the same scope of the classes for both the training set and the testing set. This assumption limits their extensibility on zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint with seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few of recently proposed approaches are inspired by zero-shot learning to estimate the distribution underlying multimodal data by generative models and make the knowledge transfer from seen classes to unseen classes by leveraging class embeddings. However, directly borrowing the idea from zero-shot learning (ZSL) is not fully adaptive to the retrieval task, since the core of the retrieval task is learning the common space. To address the above issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strength of AutoEncoder (AE) and Generative Adversarial Network (GAN), to jointly incorporate common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Besides, instead of utilizing class embeddings as common space, the AAEGAN approach maps all multimodal data into a learned latent space with the distribution alignment via three coupled AEs. We empirically show the remarkable improvement for ZS-CMR task and establish the state-of-the-art or competitive performance on four image-text retrieval datasets.

37

Anitha Mummireddygari and N Ananda Reddy. "Optimizing Speaker Recognition in Complex Environments : An Enhanced Framework with Artificial Neural Networks for Multi-Speaker Settings." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10, no. 3 (May 28, 2024): 387–98. http://dx.doi.org/10.32628/cseit24103116.

Abstract:

This study focuses on the development of an advanced speaker recognition system utilizing Convolutional Neural Networks (CNN) in conjunction with Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and K Nearest Neighbor (KNN) for classification. The proposed system aims to improve accuracy by refining the fine-tuning layer within the CNN architecture. By leveraging the unique characteristics of human voice as a biometric identifier, the system extracts voice data features using MFCC, then employs CNN with triplet loss to generate 128-dimensional embeddings. These embeddings are subsequently classified using the KNN method. The system's performance was evaluated using 50 speakers from the TIMIT dataset and 60 speakers from live recordings made with a smartphone, demonstrating high accuracy. This study highlights the potential of combining CNN and MFCC for robust speaker recognition and suggests that future research could further enhance recognition accuracy by integrating multimodal biometric systems, which combine different types of biometric data for more comprehensive identification.
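
Illustration: the pipeline in this abstract trains embeddings with a triplet loss and then classifies them with K Nearest Neighbor. The sketch below shows both pieces on toy 2-dimensional vectors rather than 128-dimensional speaker embeddings. It is not the authors' code: the function names, the margin value, and the use of plain (unsquared) Euclidean distance are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls the positive closer to the anchor than the negative.

    The loss is zero once the negative is at least `margin` farther away
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_predict(query, embeddings, labels, k=3):
    """Majority vote over the k stored embeddings nearest to the query."""
    ranked = sorted(zip(embeddings, labels), key=lambda pair: euclidean(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy enrolled embeddings for two hypothetical speakers.
enrolled = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 6.0]]
speakers = ["speaker_a", "speaker_a", "speaker_b", "speaker_b"]
print(knn_predict([0.1, 0.0], enrolled, speakers, k=3))
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
```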

38

Joshi, Rakhi Madhukararao, and D.S. Rao. "Multi-camera Vehicle Tracking and Recognition with Multimodal Contrastive Domain Sharing GAN and Topological Embeddings." Journal of Electrical Systems 20, no. 2s (April 4, 2024): 675–86. http://dx.doi.org/10.52783/jes.1532.

Abstract:

Tracking vehicles across a city using a network of multiple cameras are pivotal for enhancing urban and traffic management systems. However, this task is riddled with challenges such as wide geographical coverage, frequent view obstructions, and the diverse appearances of vehicles from various angles. To address these complexities, the proposed solution, dubbed Overlapped Vehicle Detection and Tracking using Multimodal Contrastive Domain Sharing Generative Adversarial Network optimized with Efficient Multi-camera system (MCDS-GAN), leverages cutting-edge techniques from computer vision, image processing, machine learning, and sensor fusion. This advanced system detects and tracks vehicles even in scenarios where multiple camera views overlap, making it applicable across domains like traffic management, surveillance, and autonomous vehicles. The methodology involves utilizing datasets like Common Objects in Context and ImageNet for training. Detection and tracking are performed using the Multimodal Contrastive Domain Sharing Generative Adversarial Network, followed by vehicle re-identification facilitated by the Topological Information Embedded Convolution Neural Network (TIE-CNN). Moreover, optimization techniques are employed to ensure synchronization and efficiency within the system. Implemented in Python, the effectiveness of MCDS-GAN is rigorously evaluated using metrics such as Accuracy, Precision, Recall, Latency, Response Time, and Scalability. Simulation results showcase its superiority, achieving significantly higher accuracy rates compared to existing methods such as OC-MCT-OFOV, MT-MCT-VM-CLM, and TI-VRI.

39

Kim, MinJun, SeungWoo Song, YouHan Lee, Haneol Jang, and KyungTae Lim. "BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 18381–89. http://dx.doi.org/10.1609/aaai.v38i16.29798.

Abstract:

The current research direction in generative models, such as the recently developed GPT4, aims to find relevant knowledge information for multimodal and multilingual inputs to provide answers. Under these research circumstances, the demand for multilingual evaluation of visual question answering (VQA) tasks, a representative task of multimodal systems, has increased. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset in this study that can be extended to multilingualism. The proposed data include 17K images, 17K question-answer pairs for both Korean and English and 280K instances of knowledge information related to question-answer content. We also present a framework that can effectively inject knowledge information into a VQA system by pretraining the knowledge information of BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrated the actual effect of the knowledge information contained in the constructed training data on VQA.

40

Alam, Mohammad Arif Ul. "College Student Retention Risk Analysis from Educational Database Using Multi-Task Multi-Modal Neural Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (June 28, 2022): 12689–97. http://dx.doi.org/10.1609/aaai.v36i11.21545.

Abstract:

We develop a Multimodal Spatiotemporal Neural Fusion network for MTL (MSNF-MTCL) to predict 5 important students' retention risks: future dropout, next semester dropout, type of dropout, duration of dropout and cause of dropout. First, we develop a general purpose multi-modal neural fusion network model MSNF for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each advising note, a Long Short-Term Memory (LSTM) network to model temporal advising note embeddings, an LSTM network to model students' temporal performance variables, and students' static demographics. The final fused representation from MSNF is used in a Multi-Task Cascade Learning (MTCL) model to build MSNF-MTCL for predicting 5 student retention risks. We evaluate MSNF-MTCL on a large educational database of 36,445 college students spanning 18 years, where it shows promising performance compared with the nearest state-of-the-art models. Additionally, we test the fairness of the model given the existence of biases.

41

Zhang, Ruochi, Tianming Zhou, and Jian Ma. "Multiscale and integrative single-cell Hi-C analysis with Higashi." Nature Biotechnology 40, no. 2 (October 11, 2021): 254–61. http://dx.doi.org/10.1038/s41587-021-01034-y.

Abstract:

Single-cell Hi-C (scHi-C) can identify cell-to-cell variability of three-dimensional (3D) chromatin organization, but the sparseness of measured interactions poses an analysis challenge. Here we report Higashi, an algorithm based on hypergraph representation learning that can incorporate the latent correlations among single cells to enhance overall imputation of contact maps. Higashi outperforms existing methods for embedding and imputation of scHi-C data and is able to identify multiscale 3D genome features in single cells, such as compartmentalization and TAD-like domain boundaries, allowing refined delineation of their cell-to-cell variability. Moreover, Higashi can incorporate epigenomic signals jointly profiled in the same cell into the hypergraph representation learning framework, as compared to separate analysis of two modalities, leading to improved embeddings for single-nucleus methyl-3C data. In an scHi-C dataset from human prefrontal cortex, Higashi identifies connections between 3D genome features and cell-type-specific gene regulation. Higashi can also potentially be extended to analyze single-cell multiway chromatin interactions and other multimodal single-cell omics data.

42

Liang, Meiyu, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, and Zhe Xue. "Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13744–53. http://dx.doi.org/10.1609/aaai.v38i12.29280.

Abstract:

Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities, such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by a multihead attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.

43

Zhang, Litian, Xiaoming Zhang, Ziyi Zhou, Feiran Huang, and Chaozhuo Li. "Reinforced Adaptive Knowledge Learning for Multimodal Fake News Detection." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (March 24, 2024): 16777–85. http://dx.doi.org/10.1609/aaai.v38i15.29618.

Abstract:

Nowadays, detecting multimodal fake news has emerged as a foremost concern since the widespread dissemination of fake news may incur adverse societal impact. Conventional methods generally focus on capturing the linguistic and visual semantics within the multimodal content, which fall short in effectively distinguishing the heightened level of meticulous fabrications. Recently, external knowledge has been introduced to provide valuable background facts as complementary information to facilitate news detection. Nevertheless, existing knowledge-enhanced endeavors directly incorporate all knowledge contexts through static entity embeddings, resulting in potentially noisy and content-irrelevant knowledge. Moreover, the integration of knowledge entities makes it intractable to model the sophisticated correlations between multimodal semantics and knowledge entities. In light of these limitations, we propose a novel Adaptive Knowledge-Aware Fake News Detection model, dubbed AKA-Fake. For each news item, AKA-Fake learns a compact knowledge subgraph under a reinforcement learning paradigm, which consists of a subset of entities and contextual neighbors in the knowledge graph, restoring the most informative knowledge facts. A novel heterogeneous graph learning module is further proposed to capture the reliable cross-modality correlations via topology refinement and modality-attentive pooling. Our proposal is extensively evaluated over three popular datasets, and experimental results demonstrate the superiority of AKA-Fake.

44

Faizabadi, Ahmed Rimaz, Hasan Firdaus Mohd Zaki, Zulkifli Zainal Abidin, Muhammad Afif Husman, and Nik Nur Wahidah Nik Hashim. "Learning a Multimodal 3D Face Embedding for Robust RGBD Face Recognition." Journal of Integrated and Advanced Engineering (JIAE) 3, no. 1 (March 9, 2023): 37–46. http://dx.doi.org/10.51662/jiae.v3i1.84.

Abstract:

Machine vision will play a significant role in the next generation of IR 4.0 systems. Recognition and analysis of faces are essential in many vision-based applications. Deep Learning provides the thrust for the advancement in visual recognition. An important tool for visual recognition tasks is Convolutional Neural Networks (CNNs). However, the 2D methods for machine vision suffer from Pose, Illumination, and Expression (PIE) challenges and occlusions. 3D Face Recognition (3DFR) is very promising for dealing with PIE and a certain degree of occlusions and is suitable for unconstrained environments. However, the 3D data is highly irregular, affecting the performance of deep networks. Most of the 3D Face recognition models are implemented from a research aspect and rarely find a complete 3DFR application. This work attempts to implement a complete end-to-end robust 3DFR pipeline. For this purpose, we implemented a CuteFace3D. This face recognition model is trained on the most challenging dataset, where the state-of-the-art model had below 95% accuracy. An accuracy of 98.89% is achieved on the intellifusion test dataset. Further, for open world and unseen domain adaptation, embeddings learning is achieved using KNN. Then a complete FR pipeline for RGBD face recognition is implemented using a RealSense D435 depth camera. With the KNN classifier and k-fold validation, we achieved 99.997% for the open set RGBD pipeline on registered users. The proposed method with early fusion four-channel input is found to be more robust and has achieved higher accuracy in the benchmark dataset.

45

Benzinho, José, João Ferreira, Joel Batista, Leandro Pereira, Marisa Maximiano, Vítor Távora, Ricardo Gomes, and Orlando Remédios. "LLM Based Chatbot for Farm-to-Fork Blockchain Traceability Platform." Applied Sciences 14, no. 19 (October 2, 2024): 8856. http://dx.doi.org/10.3390/app14198856.

Abstract:

Blockchain technology has been used with great effect in farm-to-fork traceability projects. However, this technology has a steep learning curve when it comes to its user interface. To minimize this difficulty, we created a solution based on a Large Language Model (LLM) conversational agent. Our implementation, starting with an existing knowledge base that is prepared and processed with an embedding model to be stored in a vector database, follows a Retrieval-Augmented Generation (RAG) approach. Other non-textual media like images and videos are aggregated with the embeddings to enrich the user experience. User queries are combined with a proximity search in the vector database and fed into an LLM that considers the conversation history with the user in its replies. Given the asynchronous nature of these models, we implemented a similarly asynchronous scheme using Server-Sent Events that deliver the models’ replies to a UI that supports multimodal media types such as images and videos by providing the visualization of these resources. The end solution allows users to interact with advanced technologies using a natural language interface; this in turn empowers food traceability projects to overcome their natural difficulty in engaging early adopters.

46

Liu, Xinyi, Bo Peng, Meiliu Wu, Mingshu Wang, Heng Cai, and Qunying Huang. "Occupation Prediction with Multimodal Learning from Tweet Messages and Google Street View Images." AGILE: GIScience Series 5 (May 30, 2024): 1–6. http://dx.doi.org/10.5194/agile-giss-5-36-2024.

Abstract:

Despite the development of various heuristic and machine learning models, social media user occupation prediction remains challenging due to limited high-quality ground truth data and difficulties in effectively integrating multiple data sources in different modalities, which can be complementary and contribute to informing the profession or job role of an individual. In response, this study introduces a novel semi-supervised multimodal learning method for Twitter user occupation prediction with a limited number of training samples. Specifically, an unsupervised learning model is first designed to extract textual and visual embeddings from individual tweet messages (textual) and Google Street View images (visual), with the latter capturing the geographical and environmental context surrounding individuals’ residential and workplace areas. Next, these high-dimensional multimodal features are fed into a multilayer transfer learning model for individual occupation classification. The proposed occupation prediction method achieves high evaluation scores for identifying Office workers, Students, and Others or Jobless people, with the F1 score for identifying Office workers surpassing the best previously reported scores for occupation classification using social media data.

47

Sun, Jianguo, Hanqi Yin, Ye Tian, Junpeng Wu, Linshan Shen, and Lei Chen. "Two-Level Multimodal Fusion for Sentiment Analysis in Public Security." Security and Communication Networks 2021 (June 3, 2021): 1–10. http://dx.doi.org/10.1155/2021/6662337.

Abstract:

Large amounts of data are widely stored in cyberspace. Not only can they bring much convenience to people’s lives and work, but they can also assist the work in the information security field, such as microexpression recognition and sentiment analysis in criminal investigation. Thus, it is of great significance to recognize and analyze the sentiment information, which is usually described by different modalities. Because data from different modalities are correlated, multimodal data can provide more comprehensive and robust information than unimodal data in data analysis tasks. The complementary information from different modalities can be obtained by multimodal fusion methods. These approaches can process multimodal data through fusion algorithms and ensure the accuracy of the information used for subsequent classification or prediction tasks. In this study, a two-level multimodal fusion (TlMF) method with both data-level and decision-level fusion is proposed to achieve the sentiment analysis task. In the data-level fusion stage, a tensor fusion network is utilized to obtain the text-audio and text-video embeddings by fusing the text with audio and video features, respectively. During the decision-level fusion stage, the soft fusion method is adopted to fuse the classification or prediction results of the upstream classifiers, so that the final classification or prediction results can be as accurate as possible. The proposed method is tested on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets, and the empirical results and ablation studies confirm the effectiveness of TlMF in capturing useful information from all the test modalities.

48

Yuan, Hui, Yuanyuan Tang, Wei Xu, and Raymond Yiu Keung Lau. "Exploring the influence of multimodal social media data on stock performance: an empirical perspective and analysis." Internet Research 31, no. 3 (January 12, 2021): 871–91. http://dx.doi.org/10.1108/intr-11-2019-0461.

Abstract:

Purpose: Despite the extensive academic interest in social media sentiment for financial fields, multimodal data in the stock market has been neglected. The purpose of this paper is to explore the influence of multimodal social media data on stock performance, and investigate the underlying mechanism of two forms of social media data, i.e. text and pictures. Design/methodology/approach: This research employs panel vector autoregressive models to quantify the effect of the sentiment derived from two modalities in social media, i.e. text information and picture information. Through the models, the authors examine the short-term and long-term associations between social media sentiment and stock performance, measured by three metrics. Specifically, the authors design an enhanced sentiment analysis method, integrating random walk and word embeddings through Global Vectors for Word Representation (GloVe), to construct a domain-specific lexicon and apply it to textual sentiment analysis. Secondly, the authors exploit a deep learning framework based on convolutional neural networks to analyze the sentiment in picture data. Findings: The empirical results derived from vector autoregressive models reveal that both measures of the sentiment extracted from textual information and pictorial information in social media are significant leading indicators of stock performance. Moreover, pictorial information and textual information have similar relationships with stock performance. Originality/value: To the best of the authors’ knowledge, this is the first study that incorporates multimodal social media data for sentiment analysis, which is valuable in understanding pictures of social media data. The study offers significant implications for researchers and practitioners. This research informs researchers on the attention of multimodal social media data. The study’s findings provide some managerial recommendations, e.g. watching not only words but also pictures in social media.

49

Mingote, Victoria, Ignacio Viñals, Pablo Gimeno, Antonio Miguel, Alfonso Ortega, and Eduardo Lleida. "Multimodal Diarization Systems by Training Enrollment Models as Identity Representations." Applied Sciences 12, no. 3 (January 21, 2022): 1141. http://dx.doi.org/10.3390/app12031141.

Abstract:

This paper describes a post-evaluation analysis of the system developed by ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. This challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment where a person is detected. In this work, we implemented two different subsystems to address this task using the audio and the video from audiovisual files separately. To develop our subsystems, we used the state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNN). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem to train an enrollment model for each identity, which we have previously shown to improve the results compared to the average of the enrollment data. Using this approach, we trained a learnable vector to represent each enrollment character. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF) which is inspired by the DCF widely used metric to measure performance in verification tasks. In this paper, we also focused on exploring and analyzing the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the configuration parameters of the loss in the amount and type of errors produced by the system.

50

Krawczuk, Patrycja, Zachary Fox, Dakota Murdock, Jennifer Doherty, Antoinette Stroupe, Stephen M. Schwartz, Lynne Penberthy, et al. "Abstract 2318: Multimodal machine learning for the automatic classification of recurrent cancers." Cancer Research 84, no. 6_Supplement (March 22, 2024): 2318. http://dx.doi.org/10.1158/1538-7445.am2024-2318.

Abstract:

The National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) registries maintain and organize cancer incidence information allowing researchers to derive valuable insights into cancer epidemiology. While significant attention has been devoted to identifying cancers either from clinical text or through tabular data collected by SEER registries, there has been less emphasis on integrating these distinct modes of data. In our multimodal deep learning approach, we use longitudinal tabular data from the Consolidated Tumor Case (CTC) database that encompass a patient’s past diagnoses. This tabular information can augment clinical text to aid in the classification of pathology reports indicative of recurrent cancers. Four NCI SEER registries (Louisiana, New Jersey, Seattle and Utah) have manually labeled 61,150 pathology reports with one of six categories, which we refine into a four-class classification problem. Each pathology report is identified as either positive for recurrence, negative for recurrence/not disease free, new tumor, or an “other” (no malignancy/uncertain) class. Natural Language Processing techniques can extract meaningful information from clinical pathology reports, aiding in the identification of subtle indicators of recurrence by using relevant context. We use a hierarchical self-attention model (HiSAN) to construct document embeddings and classify the pathology report. To further enhance the predictive accuracy of our modeling approach we fuse the textual information from a pathology report with categorical data about patient’s cancer history. For each report, we create a patient context vector that encapsulates tumor-level information from patient’s previous cancer(s). The selected CTC records are associated with cancers diagnosed more than 120 days before the date of biospecimen collection stated in the pathology report.
The patient context vector is crafted based on diverse categorical features; including cancer staging, patient age, treatment and sites of metastasis at the time of diagnosis. Features are represented using a combination of one-hot encoding and binning. Additionally, we employ patient and feature-level normalization to maintain proportional significance of features for individuals with multiple past diagnoses. We present preliminary results corresponding to different approaches for classifying cancer recurrence; first, we observe that using only the pathology reports as input yields an accuracy of 68%. Secondly, when using only CTC features with an XGBoost model, we achieve an accuracy of 49%. Finally we show that leveraging multiple data modalities, i.e. HiSAN generated pathology report embeddings and CTC data, significantly improves the model’s predictive accuracy to 76%. This research demonstrates a promising path forward in enhancing classification of clinical text by incorporating longitudinal patient history data. Citation Format: Patrycja Krawczuk, Zachary Fox, Dakota Murdock, Jennifer Doherty, Antoinette Stroupe, Stephen M. Schwartz, Lynne Penberthy, Elizabeth Hsu, Serban Negoita, Valentina Petkov, Heidi Hanson. Multimodal machine learning for the automatic classification of recurrent cancers [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2318.
