Journal articles on the topic "Multi-Modal representations"

Follow this link to see other types of publications on the topic: Multi-Modal representations.

Cite a source in APA, MLA, Chicago, Harvard, and many other citation styles


Consult the top 50 journal articles for research on the topic "Multi-Modal representations".

Next to every source in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic citation of the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scientific publication in .pdf format and read the abstract of the work online, if it is available in the metadata.

Browse journal articles from many scientific areas and compile a correct bibliography.

1

Wu, Lianlong, Seewon Choi, Daniel Raggi, Aaron Stockdill, Grecia Garcia Garcia, Fiorenzo Colarusso, Peter C. H. Cheng, and Mateja Jamnik. "Generation of Visual Representations for Multi-Modal Mathematical Knowledge". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23850–52. http://dx.doi.org/10.1609/aaai.v38i21.30586.

Abstract:
In this paper we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian Notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Trees with just one click, utilising a back-end engine based on RST. A table of cognitive costs, based on the cognitive Representational Interpretive Structure Theory (RIST), that a representation places on a particular profile of user is produced at the same time. MaRE is general and domain independent, applicable to other representations encoded in RST. It may enhance mathematical education and research, facilitating multi-modal knowledge representation and discovery.
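As a purely illustrative aside (not the RST-based engine MaRE actually uses), the following minimal Python sketch shows the kind of re-representation the abstract describes: a probability statement given in Bayesian notation, P(A), P(B|A), P(B|¬A), expressed as the four joint probabilities of a 2×2 contingency table. All names and values are hypothetical.

```python
# Illustrative only: re-express P(A), P(B|A), P(B|not A) as a 2x2 contingency table.
# This mirrors the kind of transformation MaRE automates, but not its RST-based engine.

def contingency_table(p_a: float, p_b_given_a: float, p_b_given_not_a: float):
    """Return the joint probabilities P(A,B), P(A,~B), P(~A,B), P(~A,~B)."""
    p_not_a = 1.0 - p_a
    table = {
        ("A", "B"): p_a * p_b_given_a,
        ("A", "not B"): p_a * (1.0 - p_b_given_a),
        ("not A", "B"): p_not_a * p_b_given_not_a,
        ("not A", "not B"): p_not_a * (1.0 - p_b_given_not_a),
    }
    assert abs(sum(table.values()) - 1.0) < 1e-9  # the four cells partition the space
    return table

if __name__ == "__main__":
    # Example: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.1
    for cell, prob in contingency_table(0.3, 0.8, 0.1).items():
        print(cell, round(prob, 3))
```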
2

Zhang, Yi, Mingyuan Chen, Jundong Shen, and Chongjun Wang. "Tailor Versatile Multi-Modal Learning for Multi-Label Emotion Recognition". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9100–9108. http://dx.doi.org/10.1609/aaai.v36i8.20895.

Abstract:
Multi-modal Multi-label Emotion Recognition (MMER) aims to identify various human emotions from heterogeneous visual, audio and text modalities. Previous methods mainly focus on projecting multiple modalities into a common latent space and learning an identical representation for all labels, which neglects the diversity of each modality and fails to capture richer semantic information for each label from different perspectives. Besides, associated relationships of modalities and labels have not been fully exploited. In this paper, we propose versaTile multi-modAl learning for multI-labeL emOtion Recognition (TAILOR), aiming to refine multi-modal representations and enhance discriminative capacity of each label. Specifically, we design an adversarial multi-modal refinement module to sufficiently explore the commonality among different modalities and strengthen the diversity of each modality. To further exploit label-modal dependence, we devise a BERT-like cross-modal encoder to gradually fuse private and common modality representations in a granularity descent way, as well as a label-guided decoder to adaptively generate a tailored representation for each label with the guidance of label semantics. In addition, we conduct experiments on the benchmark MMER dataset CMU-MOSEI in both aligned and unaligned settings, which demonstrate the superiority of TAILOR over the state-of-the-arts.
3

Zhang, Dong, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. "Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (May 18, 2021): 14347–55. http://dx.doi.org/10.1609/aaai.v35i16.17687.

Abstract:
Multi-modal named entity recognition (MNER) aims to discover named entities in free text and classify them into pre-defined types with images. However, dominant MNER models do not fully exploit fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To deal with this issue, we propose a unified multi-modal graph fusion (UMGF) approach for MNER. Specifically, we first represent the input sentence and image using a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). Then, we stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, we achieve an attention-based multi-modal representation for each word and perform entity labeling with a CRF decoder. Experimentation on the two benchmark datasets demonstrates the superiority of our MNER model.
4

Liu, Hao, Jindong Han, Yanjie Fu, Jingbo Zhou, Xinjiang Lu, and Hui Xiong. "Multi-modal transportation recommendation with unified route representation learning". Proceedings of the VLDB Endowment 14, no. 3 (November 2020): 342–50. http://dx.doi.org/10.14778/3430915.3430924.

Abstract:
Multi-modal transportation recommendation aims to provide the most appropriate travel route with various transportation modes according to certain criteria. After analyzing large-scale navigation data, we find that route representations exhibit two patterns: spatio-temporal autocorrelations within transportation networks and the semantic coherence of route sequences. However, there are few studies that consider both patterns when developing multi-modal transportation systems. To this end, in this paper, we study multi-modal transportation recommendation with unified route representation learning by exploiting both spatio-temporal dependencies in transportation networks and the semantic coherence of historical routes. Specifically, we propose to unify both dynamic graph representation learning and hierarchical multi-task learning for multi-modal transportation recommendations. Along this line, we first transform the multi-modal transportation network into time-dependent multi-view transportation graphs and propose a spatiotemporal graph neural network module to capture the spatial and temporal autocorrelation. Then, we introduce a coherent-aware attentive route representation learning module to project arbitrary-length routes into fixed-length representation vectors, with explicit modeling of route coherence from historical routes. Moreover, we develop a hierarchical multi-task learning module to differentiate route representations for different transport modes, and this is guided by the final recommendation feedback as well as multiple auxiliary tasks equipped in different network layers. Extensive experimental results on two large-scale real-world datasets demonstrate the performance of the proposed system outperforms eight baselines.
5

Wang, Huansha, Qinrang Liu, Ruiyang Huang, and Jianpeng Zhang. "Multi-Modal Entity Alignment Method Based on Feature Enhancement". Applied Sciences 13, no. 11 (June 1, 2023): 6747. http://dx.doi.org/10.3390/app13116747.

Abstract:
Multi-modal entity alignment refers to identifying equivalent entities between two different multi-modal knowledge graphs that consist of multi-modal information such as structural triples and descriptive images. Most previous multi-modal entity alignment methods have mainly used corresponding encoders of each modality to encode entity information and then perform feature fusion to obtain the multi-modal joint representation. However, this approach does not fully utilize the multi-modal information of aligned entities. To address this issue, we propose MEAFE, a multi-modal entity alignment method based on feature enhancement. The MEAFE adopts the multi-modal pre-trained model, OCR model, and GATv2 network to enhance the model’s ability to extract useful features in entity structure triplet information and image description, respectively, thereby generating more effective multi-modal representations. Secondly, it further adds modal distribution information of the entity to enhance the model’s understanding and modeling ability of the multi-modal information. Experiments on bilingual and cross-graph multi-modal datasets demonstrate that the proposed method outperforms models that use traditional feature extraction methods.
6

Wu, Tianxing, Chaoyu Gao, Lin Li, and Yuxiang Wang. "Leveraging Multi-Modal Information for Cross-Lingual Entity Matching across Knowledge Graphs". Applied Sciences 12, no. 19 (October 8, 2022): 10107. http://dx.doi.org/10.3390/app121910107.

Abstract:
In recent years, the scale of knowledge graphs and the number of entities have grown rapidly. Entity matching across different knowledge graphs has become an urgent problem to be solved for knowledge fusion. With the importance of entity matching being increasingly evident, the use of representation learning technologies to find matched entities has attracted extensive attention due to the computability of vector representations. However, existing studies on representation learning technologies cannot make full use of knowledge graph relevant multi-modal information. In this paper, we propose a new cross-lingual entity matching method (called CLEM) with knowledge graph representation learning on rich multi-modal information. The core is the multi-view intact space learning method to integrate embeddings of multi-modal information for matching entities. Experimental results on cross-lingual datasets show the superiority and competitiveness of our proposed method.
7

Han, Ning, Jingjing Chen, Hao Zhang, Huanwen Wang, and Hao Chen. "Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval". ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (May 31, 2022): 1–23. http://dx.doi.org/10.1145/3483381.

Abstract:
Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space to measure the distance between text and video representations. In common practice, video representation is constructed by feeding clips into 3D convolutional neural networks for a coarse-grained global visual feature extraction. In addition, several studies have attempted to align the local objects of video with the text. However, these representations share a drawback of neglecting rich fine-grained relation features capturing spatial-temporal object interactions that benefits mapping textual entities in the real-world retrieval system. To tackle this problem, we propose an adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation and coarse-grained global features in bridging text-video modalities. Additionally, with the newly proposed visual representation, we also integrate an adversarial learning strategy into AME-Net, to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms the state-of-the-art method.
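As background for the joint-embedding formulation this abstract refers to, here is a minimal, hedged PyTorch sketch of a bidirectional max-margin ranking loss over a shared text-video space. The embedding size and margin are illustrative assumptions and this is a common baseline, not the AME-Net architecture itself.

```python
# Minimal sketch of a joint text-video embedding trained with a bidirectional
# max-margin ranking loss (a standard baseline; not AME-Net itself).
import torch
import torch.nn.functional as F

def ranking_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, margin: float = 0.2):
    """video_emb, text_emb: (batch, dim) embeddings; row i of each is a matched pair."""
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.T                              # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)              # similarity of matched pairs
    # hinge over every mismatched pair, in both retrieval directions
    cost_t2v = (margin + sim - pos).clamp(min=0)
    cost_v2t = (margin + sim - pos.T).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t2v.masked_fill(mask, 0).mean() + cost_v2t.masked_fill(mask, 0).mean()

# Example with random "clip" and "caption" features in a 256-d shared space.
video = torch.randn(8, 256)
text = torch.randn(8, 256)
print(ranking_loss(video, text).item())
```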
8

Ying, Qichao, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge. "Bootstrapping Multi-View Representations for Fake News Detection". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (June 26, 2023): 5384–92. http://dx.doi.org/10.1609/aaai.v37i4.25670.

Abstract:
Previous researches on multimedia fake news detection include a series of complex feature extraction and fusion networks to gather useful information from the news. However, how cross-modal consistency relates to the fidelity of news and how features from different modalities affect the decision-making are still open questions. This paper presents a novel scheme of Bootstrapping Multi-view Representations (BMR) for fake news detection. Given a multi-modal news, we extract representations respectively from the views of the text, the image pattern and the image semantics. Improved Multi-gate Mixture-of-Expert networks (iMMoE) are proposed for feature refinement and fusion. Representations from each view are separately used to coarsely predict the fidelity of the whole news, and the multimodal representations are able to predict the cross-modal consistency. With the prediction scores, we reweigh each view of the representations and bootstrap them for fake news detection. Extensive experiments conducted on typical fake news detection datasets prove that BMR outperforms state-of-the-art schemes.
9

Huang, Yufeng, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, et al. "Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2417–25. http://dx.doi.org/10.1609/aaai.v38i3.28017.

Abstract:
Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations. Moreover, a Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on VG-Attribution and VG-Relation datasets, with 12.5% and 4.1% ahead of the multi-modal SOTA model respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances the structured representations while maintaining the ability of general representations. Our code is available at https://github.com/zjukg/Structure-CLIP.
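To make the "astronaut rides a horse" example concrete, the sketch below shows one simple way a scene-graph triple could be turned into a word-swapped hard negative for contrastive training. It illustrates the idea described in the abstract, not Structure-CLIP's actual negative-sampling code.

```python
# Illustrative construction of a structured hard negative from a scene-graph triple.
# Not Structure-CLIP's implementation; just the subject/object-swap idea it describes.

def swap_negative(triple):
    """(subject, relation, object) -> (caption, structure-violating negative caption)."""
    subj, rel, obj = triple
    positive = f"{subj} {rel} {obj}"
    negative = f"{obj} {rel} {subj}"   # same bag of words, different structure
    return positive, negative

pos, neg = swap_negative(("an astronaut", "rides", "a horse"))
print(pos)  # "an astronaut rides a horse"
print(neg)  # "a horse rides an astronaut"
# A contrastive objective would push the image embedding toward `pos` and away
# from `neg`, forcing the text encoder to become sensitive to word order.
```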
10

van Tulder, Gijs, and Marleen de Bruijne. "Learning Cross-Modality Representations From Multi-Modal Images". IEEE Transactions on Medical Imaging 38, no. 2 (February 2019): 638–48. http://dx.doi.org/10.1109/tmi.2018.2868977.

11

Kiela, Douwe, and Stephen Clark. "Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception". Journal of Artificial Intelligence Research 60 (December 26, 2017): 1003–30. http://dx.doi.org/10.1613/jair.5665.

Abstract:
Multi-modal semantics, which aims to ground semantic representations in perception, has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics. After having shown the quality of such auditorily grounded representations, we show how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments, and provide further analysis. We find that features transferred from deep neural networks outperform bag of audio words approaches. To our knowledge, this is the first work to construct multi-modal models from a combination of textual information and auditory information extracted from deep neural networks, and the first work to evaluate the performance of tri-modal (textual, visual and auditory) semantic models.
12

Cui, Xiaohui, Xiaolong Qu, Dongmei Li, Yu Yang, Yuxun Li, and Xiaoping Zhang. "MKGCN: Multi-Modal Knowledge Graph Convolutional Network for Music Recommender Systems". Electronics 12, no. 12 (June 15, 2023): 2688. http://dx.doi.org/10.3390/electronics12122688.

Abstract:
With the emergence of online music platforms, music recommender systems are becoming increasingly crucial in music information retrieval. Knowledge graphs (KGs) are a rich source of semantic information for entities and relations, allowing for improved modeling and analysis of entity relations to enhance recommendations. Existing research has primarily focused on the modeling and analysis of structural triples, while largely ignoring the representation and information processing capabilities of multi-modal data such as music videos and lyrics, which has hindered the improvement and user experience of music recommender systems. To address these issues, we propose a Multi-modal Knowledge Graph Convolutional Network (MKGCN) to enhance music recommendation by leveraging the multi-modal knowledge of music items and their high-order structural and semantic information. Specifically, there are three aggregators in MKGCN: the multi-modal aggregator aggregates the text, image, audio, and sentiment features of each music item in a multi-modal knowledge graph (MMKG); the user aggregator and item aggregator use graph convolutional networks to aggregate multi-hop neighboring nodes on MMKGs to model high-order representations of user preferences and music items, respectively. Finally, we utilize the aggregated embedding representations for recommendation. In training MKGCN, we adopt the ratio negative sampling strategy to generate high-quality negative samples. We construct four different-sized music MMKGs using the public dataset Last-FM and conduct extensive experiments on them. The experimental results demonstrate that MKGCN achieves significant improvements and outperforms several state-of-the-art baselines.
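The three aggregators are only described at a high level; as a rough, hypothetical sketch of what a single GCN-style hop of neighbour aggregation over such a graph looks like (not MKGCN's code, and with made-up dimensions), consider:

```python
# One GCN-style aggregation hop over a multi-modal knowledge graph, sketched with
# NumPy. Node features could already blend text/image/audio embeddings; MKGCN's
# actual aggregators are more elaborate than this illustration.
import numpy as np

def aggregate_hop(features: np.ndarray, neighbours: dict, weight: np.ndarray):
    """features: (num_nodes, dim); neighbours: node id -> list of neighbour ids."""
    out = np.zeros((features.shape[0], weight.shape[1]))
    for node, nbrs in neighbours.items():
        pooled = features[nbrs].mean(axis=0) if nbrs else np.zeros(features.shape[1])
        combined = features[node] + pooled             # simple "sum" aggregator
        out[node] = np.maximum(combined @ weight, 0)   # linear map + ReLU
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                        # 4 entities, 8-d fused features
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(aggregate_hop(feats, graph, rng.normal(size=(8, 4))).shape)  # (4, 4)
```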
13

Dong, Bin, Songlei Jian, and Kai Lu. "Learning Multimodal Representations by Symmetrically Transferring Local Structures". Symmetry 12, no. 9 (September 13, 2020): 1504. http://dx.doi.org/10.3390/sym12091504.

Abstract:
Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across the modalities, such as the intra-modal local structures. In other words, they only focus on the object-level alignment and ignore structure-level alignment. To tackle the problem, we propose a novel symmetric multimodal representation learning framework by transferring local structures across different modalities, namely MTLS. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in intra-modal representations. The bidirectional retrieval loss based on multi-layer neural networks is utilized to align two modalities. MTLS is instantiated with image and text data and shows its superior performance on image-text retrieval and image clustering. MTLS outperforms the state-of-the-art multimodal learning methods by up to 32% in terms of R@1 on text-image retrieval and 16.4% in terms of AMI on clustering.
14

Li, Yehao, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, and Tao Mei. "Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training". ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (May 31, 2022): 1–16. http://dx.doi.org/10.1145/3473140.

Abstract:
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure, consisting of three modules: object and sentence encoders that separately learns the representations of each modality and sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy including, from simple to comprehensive, individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.
15

Gu, Zhihao, Jiangning Zhang, Liang Liu, Xu Chen, Jinlong Peng, Zhenye Gan, Guannan Jiang, Annan Shu, Yabiao Wang, and Lizhuang Ma. "Rethinking Reverse Distillation for Multi-Modal Anomaly Detection". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8445–53. http://dx.doi.org/10.1609/aaai.v38i8.28687.

Abstract:
In recent years, there has been significant progress in employing color images for anomaly detection in industrial scenarios, but it is insufficient for identifying anomalies that are invisible in RGB images alone. As a supplement, introducing extra modalities such as depth and surface normal maps can be helpful to detect these anomalies. To this end, we present a novel Multi-Modal Reverse Distillation (MMRD) paradigm that consists of a frozen multi-modal teacher encoder to generate distillation targets and a learnable student decoder targeting to restore multi-modal representations from the teacher. Specifically, the teacher extracts complementary visual features from different modalities via a siamese architecture and then parameter-freely fuses these information from multiple levels as the targets of distillation. For the student, it learns modality-related priors from the teacher representations of normal training data and performs interaction between them to form multi-modal representations for target reconstruction. Extensive experiments show that our MMRD outperforms recent state-of-the-art methods on both anomaly detection and localization on MVTec-3D AD and Eyecandies benchmarks. Codes will be available upon acceptance.
16

Wang, Zi, Chenglong Li, Aihua Zheng, Ran He, and Jin Tang. "Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 2633–41. http://dx.doi.org/10.1609/aaai.v36i3.20165.

Abstract:
Multi-modal person Re-ID introduces more complementary information to assist the traditional Re-ID task. Existing multi-modal methods ignore the importance of modality-specific information in the feature fusion stage. To this end, we propose a novel method to boost modality-specific representations for multi-modal person Re-ID: Interact, Embed, and EnlargE (IEEE). First, we propose a cross-modal interacting module to exchange useful information between different modalities in the feature extraction phase. Second, we propose a relation-based embedding module to enhance the richness of feature descriptors by embedding the global feature into the fine-grained local information. Finally, we propose multi-modal margin loss to force the network to learn modality-specific information for each modality by enlarging the intra-class discrepancy. Superior performance on multi-modal Re-ID dataset RGBNT201 and three constructed Re-ID datasets validate the effectiveness of the proposed method compared with the state-of-the-art approaches.
17

Wróblewska, Anna, Jacek Dąbrowski, Michał Pastuszak, Andrzej Michałowski, Michał Daniluk, Barbara Rychalska, Mikołaj Wieczorek, and Sylwia Sysko-Romańczuk. "Designing Multi-Modal Embedding Fusion-Based Recommender". Electronics 11, no. 9 (April 27, 2022): 1391. http://dx.doi.org/10.3390/electronics11091391.

Abstract:
Recommendation systems have lately been popularised globally. However, often they need to be adapted to particular data and the use case. We have developed a machine learning-based recommendation system, which can be easily applied to almost any items and/or actions domain. Contrary to existing recommendation systems, our system supports multiple types of interaction data with various modalities of metadata through a multi-modal fusion of different data representations. We deployed the system into numerous e-commerce stores, e.g., food and beverages, shoes, fashion items, and telecom operators. We present our system and its main algorithms for data representations and multi-modal fusion. We show benchmark results on open datasets that outperform the state-of-the-art prior work. We also demonstrate use cases for different e-commerce sites.
18

He, Qibin. "Prompting Multi-Modal Image Segmentation with Semantic Grouping". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2094–102. http://dx.doi.org/10.1609/aaai.v38i3.27981.

Abstract:
Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves SOTA performance on various downstream multi-modal image segmentation tasks by training only < 1% model parameters.
19

Liang, Meiyu, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, and Zhe Xue. "Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13744–53. http://dx.doi.org/10.1609/aaai.v38i12.29280.

Abstract:
Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by multihead attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.
20

Bodapati, Jyostna Devi, Veeranjaneyulu Naralasetti, Shaik Nagur Shareef, Saqib Hakak, Muhammad Bilal, Praveen Kumar Reddy Maddikunta, and Ohyun Jo. "Blended Multi-Modal Deep ConvNet Features for Diabetic Retinopathy Severity Prediction". Electronics 9, no. 6 (May 30, 2020): 914. http://dx.doi.org/10.3390/electronics9060914.

Abstract:
Diabetic Retinopathy (DR) is one of the major causes of visual impairment and blindness across the world. It is usually found in patients who suffer from diabetes for a long period. The major focus of this work is to derive optimal representation of retinal images that further helps to improve the performance of DR recognition models. To extract optimal representation, features extracted from multiple pre-trained ConvNet models are blended using proposed multi-modal fusion module. These final representations are used to train a Deep Neural Network (DNN) used for DR identification and severity level prediction. As each ConvNet extracts different features, fusing them using 1D pooling and cross pooling leads to better representation than using features extracted from a single ConvNet. Experimental studies on benchmark Kaggle APTOS 2019 contest dataset reveals that the model trained on proposed blended feature representations is superior to the existing methods. In addition, we notice that cross average pooling based fusion of features from Xception and VGG16 is the most appropriate for DR recognition. With the proposed model, we achieve an accuracy of 97.41%, and a kappa statistic of 94.82 for DR identification and an accuracy of 81.7% and a kappa statistic of 71.1% for severity level prediction. Another interesting observation is that DNN with dropout at input layer converges more quickly when trained using blended features, compared to the same model trained using uni-modal deep features.
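The abstract does not spell out the pooling operations, so the following NumPy sketch shows one plausible reading of blending two ConvNet feature vectors by cross average pooling, i.e. element-wise averaging after matching dimensions. The dimensions, the truncation step, and the operation itself are assumptions made for illustration, not the paper's published module.

```python
# Hypothetical blending of features from two pre-trained ConvNets (e.g. Xception and
# VGG16) by cross average pooling; dimensions and the exact operation are assumptions.
import numpy as np

def cross_average_pool(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Average two per-image feature vectors element-wise after length matching."""
    d = min(feat_a.shape[-1], feat_b.shape[-1])
    a, b = feat_a[..., :d], feat_b[..., :d]    # naive length matching for the sketch
    return np.stack([a, b], axis=0).mean(axis=0)

xception_feat = np.random.rand(2048)   # pooled Xception features for one image
vgg16_feat = np.random.rand(4096)      # pooled VGG16 features for the same image
blended = cross_average_pool(xception_feat, vgg16_feat)
print(blended.shape)                   # (2048,) -> would be fed to the DNN classifier
```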
21

Liu, Hao, Ting Li, Renjun Hu, Yanjie Fu, Jingjing Gu, and Hui Xiong. "Joint Representation Learning for Multi-Modal Transportation Recommendation". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1036–43. http://dx.doi.org/10.1609/aaai.v33i01.33011036.

Abstract:
Multi-modal transportation recommendation has a goal of recommending a travel plan which considers various transportation modes, such as walking, cycling, automobile, and public transit, and how to connect among these modes. The successful development of multi-modal transportation recommendation systems can help to satisfy the diversified needs of travelers and improve the efficiency of transport networks. However, existing transport recommender systems mainly focus on unimodal transport planning. To this end, in this paper, we propose a joint representation learning framework for multi-modal transportation recommendation based on a carefully-constructed multi-modal transportation graph. Specifically, we first extract a multi-modal transportation graph from large-scale map query data to describe the concurrency of users, Origin-Destination (OD) pairs, and transport modes. Then, we provide effective solutions for the optimization problem and develop an anchor embedding for transport modes to initialize the embeddings of transport modes. Moreover, we infer user relevance and OD pair relevance, and incorporate them to regularize the representation learning. Finally, we exploit the learned representations for online multimodal transportation recommendations. Indeed, our method has been deployed into one of the largest navigation Apps to serve hundreds of millions of users, and extensive experimental results with real-world map query data demonstrate the enhanced performance of the proposed method for multimodal transportation recommendations.
22

Yang, Fan, Wei Li, Menglong Yang, Binbin Liang, and Jianwei Zhang. "Multi-Modal Disordered Representation Learning Network for Description-Based Person Search". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (March 24, 2024): 16316–24. http://dx.doi.org/10.1609/aaai.v38i15.29567.

Abstract:
Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representation from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demand of the task. Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms the state-of-the-art methods on CUHK-PEDES and ICFG-PEDES datasets and achieves superior performance.
23

Jüttner, Martin, and Ingo Rentschler. "Imagery in multi-modal object learning". Behavioral and Brain Sciences 25, no. 2 (April 2002): 197–98. http://dx.doi.org/10.1017/s0140525x0238004x.

Abstract:
Spatial objects may not only be perceived visually but also by touch. We report recent experiments investigating to what extent prior object knowledge acquired in either the haptic or visual sensory modality transfers to a subsequent visual learning task. Results indicate that even mental object representations learnt in one sensory modality may attain a multi-modal quality. These findings seem incompatible with picture-based reasoning schemas but leave open the possibility of modality-specific reasoning mechanisms.
24

Tao, Rui, Meng Zhu, Haiyan Cao, and Honge Ren. "Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective". Sensors 24, no. 10 (May 14, 2024): 3130. http://dx.doi.org/10.3390/s24103130.

Abstract:
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection of the similarity relationships between samples in the original dataset within the shared space of features. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information, representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task perspective of cross-modal momentum encoders outperforms similar models on standardized image classification tasks and image–text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image–text paired dataset show that our proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained cross-modal image–text conservation area image datasets.
25

Pugeault, Nicolas, Florentin Wörgötter, and Norbert Krüger. "Disambiguating Multi–Modal Scene Representations Using Perceptual Grouping Constraints". PLoS ONE 5, no. 6 (June 9, 2010): e10663. http://dx.doi.org/10.1371/journal.pone.0010663.

26

Lara, Bruno, Juan Manuel Rendon-Mancha, and Marcos A. Capistran. "Prediction of Undesired Situations based on Multi-Modal Representations". IEEE Latin America Transactions 5, no. 2 (May 2007): 103–8. http://dx.doi.org/10.1109/tla.2007.4381351.

27

Geng, Shijie, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, and Anoop Cherian. "Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1415–23. http://dx.doi.org/10.1609/aaai.v35i2.16231.

Abstract:
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on their multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics.
28

Yan, Facheng, Mingshu Zhang, and Bin Wei. "Multimodal integration for fake news detection on social media platforms". MATEC Web of Conferences 395 (2024): 01013. http://dx.doi.org/10.1051/matecconf/202439501013.

Abstract:
The widespread dissemination of fake news on social media platforms can cause serious social impact, making the detection of fake news on social media platforms an urgent problem to be solved. Up to now, scholars have proposed various methods ranging from traditional manual feature extraction to deep learning algorithms for detecting fake news. However, these methods still have some limitations and two difficult problems: (1) How to learn informative news feature representations without losing information as much as possible? (2) How to effectively fuse multi-modal information to obtain high-order complementary information about news and enhance fake news detection? To overcome these two difficulties, this article proposes a multi-modal fusion fake news detection model. Firstly, the model uses BERT and VGG-19 to obtain the text and image feature representations of news content, respectively, and then further fuses multi-modal information through a multi-modal attention mechanism module to obtain high-order complementary information between different modalities, thereby obtaining informative news feature representations for fake news detection. Experimental results on two real-world public datasets demonstrate the effectiveness of our model compared to mainstream detection methods.
29

Escobar-Grisales, Daniel, Cristian David Ríos-Urrego, and Juan Rafael Orozco-Arroyave. "Deep Learning and Artificial Intelligence Applied to Model Speech and Language in Parkinson’s Disease". Diagnostics 13, no. 13 (June 25, 2023): 2163. http://dx.doi.org/10.3390/diagnostics13132163.

Abstract:
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder in the world, and it is characterized by the production of different motor and non-motor symptoms which negatively affect speech and language production. For decades, the research community has been working on methodologies to automatically model these biomarkers to detect and monitor the disease; however, although speech impairments have been widely explored, language remains underexplored despite being a valuable source of information, especially to assess cognitive impairments associated with non-motor symptoms. This study proposes the automatic assessment of PD patients using different methodologies to model speech and language biomarkers. One-dimensional and two-dimensional convolutional neural networks (CNNs), along with pre-trained models such as Wav2Vec 2.0, BERT, and BETO, were considered to classify PD patients vs. Healthy Control (HC) subjects. The first approach consisted of modeling speech and language independently. Then, the best representations from each modality were combined following early, joint, and late fusion strategies. The results show that the speech modality yielded an accuracy of up to 88%, thus outperforming all language representations, including the multi-modal approach. These results suggest that speech representations better discriminate PD patients and HC subjects than language representations. When analyzing the fusion strategies, we observed that changes in the time span of the multi-modal representation could produce a significant loss of information in the speech modality, which was likely linked to a decrease in accuracy in the multi-modal experiments. Further experiments are necessary to validate this claim with other fusion methods using different time spans.
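Since the abstract contrasts early, joint, and late fusion of the two modalities, a compact illustration of the two extremes may help; the feature sizes, the synthetic data, and the classifier below are placeholders, not the study's actual configuration.

```python
# Early vs. late fusion of speech and language representations, sketched with
# scikit-learn on synthetic data. Dimensions and classifier are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
speech = rng.normal(size=(100, 64))      # stand-in for Wav2Vec 2.0-style embeddings
language = rng.normal(size=(100, 32))    # stand-in for BERT/BETO-style embeddings
labels = rng.integers(0, 2, size=100)    # PD vs. HC

# Early fusion: concatenate modality features, then train a single classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([speech, language]), labels)

# Late fusion: train one classifier per modality, then average their probabilities.
clf_s = LogisticRegression(max_iter=1000).fit(speech, labels)
clf_l = LogisticRegression(max_iter=1000).fit(language, labels)
late_scores = (clf_s.predict_proba(speech) + clf_l.predict_proba(language)) / 2

print(early.score(np.hstack([speech, language]), labels), late_scores.shape)
```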
30

Zhai, Hanming, Xiaojun Lv, Zhiwen Hou, Xin Tong, and Fanliang Bu. "MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion". Mathematical Biosciences and Engineering 20, no. 8 (2023): 14096–116. http://dx.doi.org/10.3934/mbe.2023630.

Abstract:
With the rise of multi-modal methods, multi-modal knowledge graphs have become a better choice for storing human knowledge. However, knowledge graphs often suffer from the problem of incompleteness due to the infinite and constantly updating nature of knowledge, and thus the task of knowledge graph completion has been proposed. Existing multi-modal knowledge graph completion methods mostly rely on either embedding-based representations or graph neural networks, and there is still room for improvement in terms of interpretability and the ability to handle multi-hop tasks. Therefore, we propose a new method for multi-modal knowledge graph completion. Our method aims to learn multi-level graph structural features to fully explore hidden relationships within the knowledge graph and to improve reasoning accuracy. Specifically, we first use a Transformer architecture to separately learn about data representations for both the image and text modalities. Then, with the help of multimodal gating units, we filter out irrelevant information and perform feature fusion to obtain a unified encoding of knowledge representations. Furthermore, we extract multi-level path features using a width-adjustable sliding window and learn about structural feature information in the knowledge graph using graph convolutional operations. Finally, we use a scoring function to evaluate the probability of the truthfulness of encoded triplets and to complete the prediction task. To demonstrate the effectiveness of the model, we conduct experiments on two publicly available datasets, FB15K-237-IMG and WN18-IMG, and achieve improvements of 1.8 and 0.7%, respectively, in the Hits@1 metric.
31

Hua, Yan, Yingyun Yang, and Jianhe Du. "Deep Multi-Modal Metric Learning with Multi-Scale Correlation for Image-Text Retrieval". Electronics 9, no. 3 (March 10, 2020): 466. http://dx.doi.org/10.3390/electronics9030466.

Abstract:
Multi-modal retrieval is a challenge due to heterogeneous gap and a complex semantic relationship between different modal data. Typical research map different modalities into a common subspace with a one-to-one correspondence or similarity/dissimilarity relationship of inter-modal data, in which the distances of heterogeneous data can be compared directly; thus, inter-modal retrieval can be achieved by the nearest neighboring search. However, most of them ignore intra-modal relations and complicated semantics between multi-modal data. In this paper, we propose a deep multi-modal metric learning method with multi-scale semantic correlation to deal with the retrieval tasks between image and text modalities. A deep model with two branches is designed to nonlinearly map raw heterogeneous data into comparable representations. In contrast to binary similarity, we formulate semantic relationship with multi-scale similarity to learn fine-grained multi-modal distances. Inter-modal and intra-modal correlations constructed on multi-scale semantic similarity are incorporated to train the deep model in an end-to-end way. Experiments validate the effectiveness of our proposed method on multi-modal retrieval tasks, and our method outperforms state-of-the-art methods on NUS-WIDE, MIR Flickr, and Wikipedia datasets.
32

Yang, Yiying, Fukun Yin, Wen Liu, Jiayuan Fan, Xin Chen, Gang Yu, and Tao Chen. "PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (March 24, 2024): 6594–602. http://dx.doi.org/10.1609/aaai.v38i7.28481.

Abstract:
Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points to obtain more global semantic information and propose a prior-rich multi-modal implicit neural representation network, Pm-INR, for the outdoor unbounded large-scale scene. The core of our method is multi-modal prior extraction and cross-modal prior fusion modules. The former encodes codebooks from different modality inputs and extracts valuable priors, while the latter fuses priors to maintain view consistency and preserve unique features among multi-modal priors. Finally, feature-rich cross-modal priors are injected into the sampling regions to allow each region to perceive global information without filling the sampling space. Extensive experiments have demonstrated the effectiveness and robustness of our method for outdoor unbounded large-scale scene novel view synthesis, which outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS.
33

Hu, Lianyu, Liqing Gao, Zekang Liu, Chi-Man Pun, and Wei Feng. "COMMA: Co-articulated Multi-Modal Learning". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2238–46. http://dx.doi.org/10.1609/aaai.v38i3.27997.

Abstract:
Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to the variation of input text prompts and need a selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs to avoid the requirements of laboring hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, it's observed that most previous methods usually achieve better performance on seen classes but cause performance degeneration on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Especially, our method considers prompts from both branches to generate the prompts to enhance the representation alignment of both branches. Besides, to alleviate forgetting about the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost upon all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA.
34

Hill, Felix, Roi Reichart, and Anna Korhonen. "Multi-Modal Models for Concrete and Abstract Concept Meaning". Transactions of the Association for Computational Linguistics 2 (December 2014): 285–96. http://dx.doi.org/10.1162/tacl_a_00183.

Abstract:
Multi-modal models that learn semantic representations from both linguistic and perceptual input outperform language-only models on a range of evaluations, and better reflect human concept acquisition. Most perceptual input to such models corresponds to concrete noun concepts and the superiority of the multi-modal approach has only been established when evaluating on such concepts. We therefore investigate which concepts can be effectively learned by multi-modal models. We show that concreteness determines both which linguistic features are most informative and the impact of perceptual input in such models. We then introduce ridge regression as a means of propagating perceptual information from concrete nouns to more abstract concepts that is more robust than previous approaches. Finally, we present weighted gram matrix combination, a means of combining representations from distinct modalities that outperforms alternatives when both modalities are sufficiently rich.
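As a rough sketch of the ridge-regression propagation step described above, one can fit a mapping from linguistic vectors to perceptual vectors on concrete nouns and apply it to abstract words. The data below is synthetic and the dimensions are assumptions, not the paper's setup.

```python
# Propagating perceptual information from concrete nouns to abstract concepts with
# ridge regression, in the spirit of Hill et al.; all data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
ling_concrete = rng.normal(size=(500, 300))   # text vectors for concrete nouns
perc_concrete = rng.normal(size=(500, 128))   # matching perceptual (image) vectors
ling_abstract = rng.normal(size=(50, 300))    # text vectors for abstract words

# Learn a text -> perceptual mapping on concrete nouns, then predict ("propagate")
# perceptual vectors for abstract concepts that have no perceptual input of their own.
mapper = Ridge(alpha=1.0).fit(ling_concrete, perc_concrete)
perc_abstract = mapper.predict(ling_abstract)

# A simple multi-modal representation: concatenate text and predicted perceptual parts.
multimodal = np.hstack([ling_abstract, perc_abstract])
print(multimodal.shape)   # (50, 428)
```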
35

Liu, Xuanwu, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Yazhou Ren, and Maozu Guo. "Ranking-Based Deep Cross-Modal Hashing". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4400–4407. http://dx.doi.org/10.1609/aaai.v33i01.33014400.

Abstract:
Cross-modal hashing has been receiving increasing interests for its low storage cost and fast query speed in multi-modal data retrievals. However, most existing hashing methods are based on hand-crafted or raw level features of objects, which may not be optimally compatible with the coding process. Besides, these hashing methods are mainly designed to handle simple pairwise similarity. The complex multilevel ranking semantic structure of instances associated with multiple labels has not been well explored yet. In this paper, we propose a ranking-based deep cross-modal hashing approach (RDCMH). RDCMH firstly uses the feature and label information of data to derive a semi-supervised semantic ranking list. Next, to expand the semantic representation power of hand-crafted features, RDCMH integrates the semantic ranking information into deep cross-modal hashing and jointly optimizes the compatible parameters of deep feature representations and of hashing functions. Experiments on real multi-modal datasets show that RDCMH outperforms other competitive baselines and achieves the state-of-the-art performance in cross-modal retrieval applications.
36

Sezerer, Erhan, and Selma Tekir. "Incorporating Concreteness in Multi-Modal Language Models with Curriculum Learning". Applied Sciences 11, no. 17 (September 6, 2021): 8241. http://dx.doi.org/10.3390/app11178241.

Abstract:
Over the last few years, there has been an increase in studies that consider experiential (visual) information by building multi-modal language models and representations. Several studies have shown that language acquisition in humans starts with learning concrete concepts through images and then continues with learning abstract ideas through text. In this work, the curriculum learning method is used to teach the model concrete/abstract concepts through images and their corresponding captions to accomplish multi-modal language modeling/representation. We use the BERT and ResNet-152 models on each modality and combine them using attentive pooling to perform pre-training on the newly constructed dataset, which is collected from Wikimedia Commons based on concrete/abstract words. To show the performance of the proposed model, downstream tasks and ablation studies are performed. The contribution of this work is two-fold: a new dataset is constructed from Wikimedia Commons based on concrete/abstract words, and a new multi-modal pre-training approach based on curriculum learning is proposed. The results show that the proposed multi-modal pre-training approach contributes to the success of the model.
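The attentive-pooling fusion of the two modality embeddings can be pictured as a learned weighting over the projected BERT and ResNet-152 vectors. The module below is a minimal sketch under the assumption that both vectors have already been projected to a common dimension; names and sizes are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # scores each modality vector

    def forward(self, text_vec: torch.Tensor, image_vec: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([text_vec, image_vec], dim=1)    # (batch, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)    # (batch, 2, 1)
        return (weights * stacked).sum(dim=1)                  # fused (batch, dim)

fused = AttentivePooling()(torch.randn(4, 768), torch.randn(4, 768))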
37

Gou, Yingdong, Kexin Wang, Siwen Wei, and Changxin Shi. "GMDA: GCN-Based Multi-Modal Domain Adaptation for Real-Time Disaster Detection". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 31, no. 06 (December 2023): 957–73. http://dx.doi.org/10.1142/s0218488523500435.

Abstract:
Nowadays, with the rapid expansion of social media as a means of quick communication, real-time disaster information is widely disseminated through these platforms. Determining which real-time and multi-modal disaster information can effectively support humanitarian aid has become a major challenge. In this paper, we propose a novel end-to-end model, named GCN-based Multi-modal Domain Adaptation (GMDA), which consists of three essential modules: the GCN-based feature extraction module, the attention-based fusion module and the MMD domain adaptation module. The GCN-based feature extraction module integrates text and image representations through GCNs, while the attention-based fusion module then merges these multi-modal representations using an attention mechanism. Finally, the MMD domain adaptation module is utilized to alleviate the dependence of GMDA on source domain events by computing the maximum mean discrepancy across domains. Our proposed model has been extensively evaluated and has shown superior performance compared to state-of-the-art multi-modal domain adaptation models in terms of F1 score and variance stability.
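The MMD domain-adaptation term mentioned above compares feature distributions of the source and target domains. A common RBF-kernel formulation is sketched below for illustration; the bandwidth handling is simplified and the code is not the paper's implementation.

import torch

def rbf_mmd(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # source, target: (n, dim) and (m, dim) feature batches from the two domains
    def kernel(x, y):
        d2 = torch.cdist(x, y).pow(2)             # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    # biased estimate of squared maximum mean discrepancy
    return kernel(source, source).mean() + kernel(target, target).mean() \
        - 2 * kernel(source, target).mean()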
38

Jang, Jiho, Chaerin Kong, DongHyeon Jeon, Seonhoon Kim, and Nojun Kwak. "Unifying Vision-Language Representation Space with Single-Tower Transformer". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (26 June 2023): 980–88. http://dx.doi.org/10.1609/aaai.v37i1.25178.

Abstract:
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and caption can be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from the previous works that have modality-specific representation spaces such as zero-shot localization, text-guided visual reasoning and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.
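Treating an image and its caption as two views of the same content is typically operationalized with a symmetric contrastive objective. The following sketch shows a standard InfoNCE loss over embeddings that, in a one-tower setting, would come from the same shared encoder; it is an illustration, not the OneR training code.

import torch
import torch.nn.functional as F

def symmetric_info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))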
39

Kabir, Anowarul, and Amarda Shehu. "GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction". Biomolecules 12, no. 11 (18 November 2022): 1709. http://dx.doi.org/10.3390/biom12111709.

Abstract:
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, such as subcellular localization, evolutionary relationships, family membership, and more. They have yet to be shown useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO-term representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown to be superior to recent representative GO prediction methods. The second major contribution of this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state of the art.
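The final multi-label step can be pictured as scoring a sequence embedding against learned GO-term embeddings and training with binary cross-entropy. The sketch below uses random tensors and assumed shapes purely for illustration; in the method described above the GO-term embeddings would come from the graph transformer rather than being free parameters.

import torch
import torch.nn as nn

num_go_terms, dim = 5000, 512
go_term_emb = nn.Parameter(torch.randn(num_go_terms, dim))   # stand-in for graph-transformer output
criterion = nn.BCEWithLogitsLoss()

seq_emb = torch.randn(8, dim)                                # stand-in for sequence-transformer output
labels = torch.randint(0, 2, (8, num_go_terms)).float()      # multi-hot GO annotations
logits = seq_emb @ go_term_emb.t()                           # (batch, num_go_terms)
loss = criterion(logits, labels)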
40

Alam, Mohammad Arif Ul. "College Student Retention Risk Analysis from Educational Database Using Multi-Task Multi-Modal Neural Fusion". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (28 June 2022): 12689–97. http://dx.doi.org/10.1609/aaai.v36i11.21545.

Abstract:
We develop a Multimodal Spatiotemporal Neural Fusion network for MTL (MSNF-MTCL) to predict five important student retention risks: future dropout, next-semester dropout, type of dropout, duration of dropout and cause of dropout. First, we develop a general-purpose multi-modal neural fusion network model, MSNF, for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each advising note, a Long Short-Term Memory (LSTM) network to model temporal advising-note embeddings, another LSTM network to model students' temporal performance variables, and students' static demographics. The final fused representation from MSNF is utilized in a Multi-Task Cascade Learning (MTCL) model to build MSNF-MTCL for predicting the five student retention risks. We evaluate MSNF-MTCL on a large educational database of 36,445 college students spanning an 18-year period, on which it provides promising performance compared with the nearest state-of-the-art models. Additionally, we test the fairness of the model given the existence of biases.
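The fusion pattern described above (sequential note embeddings and performance variables through recurrent encoders, concatenated with static demographics and fed to task-specific heads) can be sketched as follows. Layer sizes, the number of heads and the binary outputs are assumptions for illustration, not the MSNF-MTCL architecture.

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, note_dim=768, perf_dim=16, static_dim=10, hidden=128, num_tasks=5):
        super().__init__()
        self.note_lstm = nn.LSTM(note_dim, hidden, batch_first=True)   # over note embeddings
        self.perf_lstm = nn.LSTM(perf_dim, hidden, batch_first=True)   # over performance variables
        self.heads = nn.ModuleList([nn.Linear(2 * hidden + static_dim, 2)
                                    for _ in range(num_tasks)])        # one head per risk (binary here)

    def forward(self, notes, perf, static):
        _, (h_notes, _) = self.note_lstm(notes)
        _, (h_perf, _) = self.perf_lstm(perf)
        fused = torch.cat([h_notes[-1], h_perf[-1], static], dim=1)
        return [head(fused) for head in self.heads]

outs = MultiModalFusion()(torch.randn(4, 12, 768), torch.randn(4, 12, 16), torch.randn(4, 10))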
41

Qian, Shengsheng, Dizhan Xue, Huaiwen Zhang, Quan Fang, and Changsheng Xu. "Dual Adversarial Graph Neural Networks for Multi-label Cross-modal Retrieval". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 3 (18 May 2021): 2440–48. http://dx.doi.org/10.1609/aaai.v35i3.16345.

Abstract:
Cross-modal retrieval has become an active field of study with the expanding scale of multimodal data. To date, most existing methods transform multimodal data into a common representation space where semantic similarities between items can be directly measured across different modalities. However, these methods typically suffer from the following limitations: 1) They usually attempt to bridge the modality gap by designing losses in the common representation space, which may not be sufficient to eliminate the potential heterogeneity of different modalities in the common space. 2) They typically treat labels as independent individuals and ignore label relationships, which are important for constructing semantic links between multimodal data. In this work, we propose novel Dual Adversarial Graph Neural Networks (DAGNN), composed of dual generative adversarial networks and multi-hop graph neural networks, which learn modality-invariant and discriminative common representations for cross-modal retrieval. First, we construct the dual generative adversarial networks to project multimodal data into a common representation space. Second, we leverage the multi-hop graph neural networks, in which a layer aggregation mechanism is proposed to exploit multi-hop propagation information, to capture the label correlation dependency and learn inter-dependent classifiers. Comprehensive experiments conducted on two cross-modal retrieval benchmark datasets, NUS-WIDE and MIRFlickr, indicate the superiority of DAGNN.
42

Lu, Lyujian, Saad Elbeleidy, Lauren Zoe Baker, and Hua Wang. "Learning Multi-Modal Biomarker Representations via Globally Aligned Longitudinal Enrichments". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (3 April 2020): 817–24. http://dx.doi.org/10.1609/aaai.v34i01.5426.

Abstract:
Alzheimer's Disease (AD) is a chronic neurodegenerative disease that severely impacts patients' thinking, memory and behavior. To aid automatic AD diagnosis, many longitudinal learning models have been proposed to predict clinical outcomes and/or disease status; these models, though, often fail to consider missing temporal phenotypic records of the patients that can convey valuable information about AD progression. Another challenge in AD studies is how to integrate heterogeneous genotypic and phenotypic biomarkers to improve diagnosis prediction. To cope with these challenges, in this paper we propose a longitudinal multi-modal method to learn enriched genotypic and phenotypic biomarker representations as fixed-length vectors that simultaneously capture the baseline neuroimaging measurements of the entire dataset and the progressive variations of each participant's varying numbers of follow-up measurements over time from different biomarker sources. The learned global and local projections are aligned by a soft constraint, and the structured-sparsity norm is used to uncover the multi-modal structure of the heterogeneous biomarker measurements. While the proposed objective is clearly motivated by the need to characterize the progression of AD, it is nonsmooth and difficult to optimize efficiently in general. Thus, we derive an efficient iterative algorithm whose convergence is rigorously guaranteed mathematically. We have conducted extensive experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) data using one genotypic and two phenotypic biomarkers. Empirical results demonstrate that the learned enriched biomarker representations are more effective in predicting the outcomes of various cognitive assessments. Moreover, our model has successfully identified disease-relevant biomarkers supported by existing medical findings, which additionally supports the correctness of our method from a clinical perspective.
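Structured-sparsity regularizers of the kind mentioned above are commonly instantiated as the l2,1 norm of a projection matrix, which sums the Euclidean norms of its rows and so encourages entire rows (feature groups) to shrink jointly. The snippet below computes that norm as a hedged illustration; whether this exact norm matches the paper's formulation is an assumption.

import torch

def l21_norm(W: torch.Tensor) -> torch.Tensor:
    # W: (num_features, num_outputs) projection matrix
    return W.norm(dim=1).sum()   # sum of row-wise l2 norms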
43

Zhang, Heng, Vishal M. Patel, and Rama Chellappa. "Low-Rank and Joint Sparse Representations for Multi-Modal Recognition". IEEE Transactions on Image Processing 26, no. 10 (October 2017): 4741–52. http://dx.doi.org/10.1109/tip.2017.2721838.

44

Blown, Eric, and Tom G. K. Bryce. "Conceptual Coherence Revealed in Multi‐Modal Representations of Astronomy Knowledge". International Journal of Science Education 32, no. 1 (30 June 2009): 31–67. http://dx.doi.org/10.1080/09500690902974207.

45

Torasso, Pietro. "Multiple representations and multi-modal reasoning in medical diagnostic systems". Artificial Intelligence in Medicine 23, no. 1 (August 2001): 49–69. http://dx.doi.org/10.1016/s0933-3657(01)00075-6.

46

Wachinger, Christian, and Nassir Navab. "Entropy and Laplacian images: Structural representations for multi-modal registration". Medical Image Analysis 16, no. 1 (January 2012): 1–17. http://dx.doi.org/10.1016/j.media.2011.03.001.

47

Fang, Feiyi, Tao Zhou, Zhenbo Song, and Jianfeng Lu. "MMCAN: Multi-Modal Cross-Attention Network for Free-Space Detection with Uncalibrated Hyperspectral Sensors". Remote Sensing 15, no. 4 (20 February 2023): 1142. http://dx.doi.org/10.3390/rs15041142.

Abstract:
Free-space detection plays a pivotal role in autonomous vehicle applications, and its state-of-the-art algorithms are typically based on semantic segmentation of road areas. Recently, hyperspectral images have proven to be useful supplementary information in multi-modal segmentation, providing additional texture details beyond the RGB representations and thus performing well in road segmentation tasks. Existing multi-modal segmentation methods assume that all the inputs are well aligned, so that the problem reduces to fusing feature maps from different modalities. However, there exist cases where the sensors cannot be well calibrated. In this paper, we propose a novel network named multi-modal cross-attention network (MMCAN) for multi-modal free-space detection with uncalibrated hyperspectral sensors. We first introduce a cross-modality transformer using hyperspectral data to enhance RGB features, then aggregate these representations alternately over multiple stages. This transformer promotes the spread and fusion of information between modalities that cannot be aligned at the pixel level. Furthermore, we propose a triplet gate fusion strategy, which can increase the proportion of RGB in the multiple spectral fusion processes while maintaining the specificity of each modality. The experimental results on a multi-spectral dataset demonstrate that our MMCAN model achieves state-of-the-art performance. The method can be used directly on images taken in the field without complex preprocessing. Our future goal is to adapt the algorithm to multi-object segmentation and generalize it to other multi-modal combinations.
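The cross-modality enhancement described above can be sketched with a standard cross-attention layer in which RGB tokens attend to hyperspectral tokens, so no pixel-level alignment is required. Token counts and dimensions below are assumptions for illustration, not the MMCAN configuration.

import torch
import torch.nn as nn

dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

rgb_tokens = torch.randn(2, 1024, dim)   # (batch, num_rgb_tokens, dim)
hsi_tokens = torch.randn(2, 400, dim)    # (batch, num_hsi_tokens, dim), unaligned with RGB
# RGB queries attend to hyperspectral keys/values to produce enhanced RGB features
enhanced_rgb, _ = cross_attn(query=rgb_tokens, key=hsi_tokens, value=hsi_tokens)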
48

Gao, Jingsheng, Jiacheng Ruan, Suncheng Xiang, Zefang Yu, Ke Ji, Mingye Xie, Ting Liu, and Yuzhuo Fu. "LAMM: Label Alignment for Multi-Modal Prompt Learning". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (24 March 2024): 1815–23. http://dx.doi.org/10.1609/aaai.v38i3.27950.

Abstract:
With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing the alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, exhibiting an average accuracy improvement of 2.31% over the state-of-the-art methods at 16 shots. Moreover, our methodology outperforms other prompt tuning methods in continual learning. Importantly, our method is synergistic with existing prompt tuning methods and can boost performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.
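The label-alignment idea of trainable category embeddings can be pictured as a cosine classifier whose class vectors are adjusted end-to-end against frozen image features. The sketch below is an assumption-laden illustration (class count, dimension and temperature are made up), not the LAMM code.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 100, 512
class_emb = nn.Parameter(torch.randn(num_classes, dim))   # trainable category embeddings

def classify(image_feats: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    img = F.normalize(image_feats, dim=-1)                 # (batch, dim), e.g. from a frozen CLIP encoder
    cls = F.normalize(class_emb, dim=-1)                   # (num_classes, dim)
    return img @ cls.t() / temperature                     # cosine-similarity logits over classes

logits = classify(torch.randn(8, dim))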
49

Bao, Peijun, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, and Alex C. Kot. "Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (26 June 2023): 215–22. http://dx.doi.org/10.1609/aaai.v37i1.25093.

Abstract:
This paper explores, for the first time, audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is labor-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from audio and visual components by using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background would hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we then propose an expectation-maximization algorithm that optimizes pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to the state-of-the-art supervised methods.
50

Liu, Shuang, Mei Li, Zhong Zhang, Baihua Xiao, and Tariq S. Durrani. "Multi-Evidence and Multi-Modal Fusion Network for Ground-Based Cloud Recognition". Remote Sensing 12, no. 3 (2 February 2020): 464. http://dx.doi.org/10.3390/rs12030464.

Abstract:
In recent times, deep neural networks have drawn much attention in ground-based cloud recognition. Yet such approaches center on learning global features from visual information, which leads to incomplete representations of ground-based clouds. In this paper, we propose a novel method named multi-evidence and multi-modal fusion network (MMFN) for ground-based cloud recognition, which can learn extended cloud information by fusing heterogeneous features in a unified framework. Specifically, MMFN exploits multiple pieces of evidence, i.e., global and local visual features, from ground-based cloud images using the main network and the attentive network. In the attentive network, local visual features are extracted from attentive maps, which are obtained by refining salient patterns from convolutional activation maps. Meanwhile, the multi-modal network in MMFN learns multi-modal features for ground-based clouds. To fully fuse the multi-modal and multi-evidence visual features, we design two fusion layers in MMFN to incorporate the multi-modal features with the global and local visual features, respectively. Furthermore, we release the first multi-modal ground-based cloud dataset, named MGCD, which contains not only the ground-based cloud images but also the multi-modal information corresponding to each cloud image. MMFN is evaluated on MGCD and achieves a classification accuracy of 88.63%, which is competitive with the state-of-the-art methods and validates its effectiveness for ground-based cloud recognition.
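A fusion layer that incorporates non-image multi-modal measurements with a visual feature vector before classification can be sketched as a simple concatenation followed by a projection. Dimensions, class count and names below are illustrative assumptions, not the MMFN design.

import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, visual_dim=2048, modal_dim=32, out_dim=512, num_classes=7):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(visual_dim + modal_dim, out_dim), nn.ReLU())
        self.classifier = nn.Linear(out_dim, num_classes)

    def forward(self, visual_feat, modal_feat):
        # concatenate visual evidence with the non-image multi-modal vector
        fused = self.fuse(torch.cat([visual_feat, modal_feat], dim=1))
        return self.classifier(fused)

logits = FusionLayer()(torch.randn(4, 2048), torch.randn(4, 32))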
