Journal articles on the topic "Multimodal Transformers"

To see the other types of publications on this topic, follow the link: Multimodal Transformers.

Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles.

Consult the top 50 journal articles for your research on the topic "Multimodal Transformers".

Next to every source in the list of references there is an "Add to bibliography" button. Click on it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication in PDF format and read its abstract online whenever this information is included in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Jaiswal, Sushma, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal. "Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search." International Journal of Intelligent Systems and Applications 16, no. 2 (April 8, 2024): 53–61. http://dx.doi.org/10.5815/ijisa.2024.02.05.

Abstract:
Deep learning has improved image captioning. Transformer, a neural network architecture built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. Convolutional neural networks (CNNs) extracted image features and RNNs or LSTM networks generated captions in traditional image captioning. This method often has information bottlenecks and trouble capturing long-range dependencies. Transformer architecture revolutionized natural language processing with its attention strategy and parallel processing. Researchers used Transformers' language success to solve image captioning problems. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer architecture's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and CNN-Transformer hybrid models are discussed. We also discuss pre-training, fine-tuning, and reinforcement learning to improve caption quality. Transformer-based image captioning difficulties, trends, and future approaches are also examined. Multimodal fusion, visual-text alignment, and caption interpretability are challenges. We expect research to address these issues and apply Transformer-based image captioning to medical imaging and remote sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interactions.
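As a rough illustration of the CNN-Transformer hybrid captioning pipeline this review surveys, the sketch below wires a small convolutional backbone to a PyTorch Transformer decoder that attends to the visual tokens while generating caption words. It is a minimal sketch, not the authors' method; the tiny backbone, the vocabulary size, and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

class HybridCaptioner(nn.Module):
    """Toy CNN encoder + Transformer decoder captioner (illustrative only)."""

    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        # Tiny CNN stand-in for a pretrained backbone such as a ResNet.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(512, d_model)   # learned positional embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        feats = self.cnn(images)                    # (B, d_model, h, w)
        memory = feats.flatten(2).transpose(1, 2)   # (B, h*w, d_model) visual tokens
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(pos)
        T = captions.size(1)                        # causal mask: each word only sees its past
        causal = torch.triu(torch.full((T, T), float("-inf"), device=captions.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                    # (B, T, vocab) next-word logits

logits = HybridCaptioner()(torch.randn(2, 3, 64, 64), torch.randint(0, 10000, (2, 12)))
```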
2

Bayat, Nasrin, Jong-Hwan Kim, Renoa Choudhury, Ibrahim F. Kadhim, Zubaidah Al-Mashhadani, Mark Aldritz Dela Virgen, Reuben Latorre, Ricardo De La Paz, and Joon-Hyuk Park. "Vision Transformer Customized for Environment Detection and Collision Prediction to Assist the Visually Impaired." Journal of Imaging 9, no. 8 (August 15, 2023): 161. http://dx.doi.org/10.3390/jimaging9080161.

Abstract:
This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector of all identified objects from the vision transformer and to detect objects that are likely to intersect with the user’s walking path. Audio and vibrotactile feedback modules are integrated to convey collision warning through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos consisting of 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while the experimental validation concerning their usability and efficiency stands as an upcoming endeavor. The demonstrated performance of the vision transformer and the presented algorithms in conjunction with the multimodal feedback modules show promising prospects of its feasibility and applicability for the navigation assistance of individuals with vision impairment.
3

Shao, Zilei. "A literature review on multimodal deep learning models for detecting mental disorders in conversational data: Pre-transformer and transformer-based approaches." Applied and Computational Engineering 18, no. 1 (October 23, 2023): 215–24. http://dx.doi.org/10.54254/2755-2721/18/20230993.

Abstract:
This paper provides a comprehensive review of multimodal deep learning models that utilize conversational data to detect mental health disorders. In addition to discussing models based on the Transformer, such as BERT (Bidirectional Encoder Representations from Transformers), this paper addresses models that existed prior to the Transformer, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The paper covers the application of these models in the construction of multimodal deep learning systems to detect mental disorders. In addition, the difficulties encountered by multimodal deep learning systems are brought up. Furthermore, the paper proposes research directions for enhancing the performance and robustness of these models in mental health applications. By shedding light on the potential of multimodal deep learning in mental health care, this paper aims to foster further research and development in this critical domain.
4

Hendricks, Lisa Anne, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers." Transactions of the Association for Computational Linguistics 9 (2021): 570–85. http://dx.doi.org/10.1162/tacl_a_00385.

Abstract:
Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
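The architectural contrast the authors study, a shared multimodal attention block versus separate modality-specific attention blocks, can be illustrated in a few lines of PyTorch. This is a hedged sketch with arbitrary token counts and dimensions, not the paper's models:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
text_tokens  = torch.randn(2, 16, d_model)   # (batch, text length, dim)
image_tokens = torch.randn(2, 36, d_model)   # (batch, image regions, dim)

# Merged multimodal attention: one block over the concatenated token sequence,
# so image and text tokens can attend to each other directly.
merged_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
fused = merged_block(torch.cat([text_tokens, image_tokens], dim=1))  # (2, 52, 256)

# Modality-specific alternative: each modality only attends within itself.
text_block, image_block = (nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                           for _ in range(2))
text_out, image_out = text_block(text_tokens), image_block(image_tokens)
```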
5

Chen, Yu, Ming Yin, Yu Li, and Qian Cai. "CSU-Net: A CNN-Transformer Parallel Network for Multimodal Brain Tumour Segmentation." Electronics 11, no. 14 (July 16, 2022): 2226. http://dx.doi.org/10.3390/electronics11142226.

Abstract:
Medical image segmentation techniques are vital to medical image processing and analysis. Considering the significant clinical applications of brain tumour image segmentation, it represents a focal point of medical image segmentation research. Most of the work in recent times has been centred on Convolutional Neural Networks (CNN) and Transformers. However, CNN has some deficiencies in modelling long-distance information transfer and contextual processing information, while Transformer is relatively weak in acquiring local information. To overcome the above defects, we propose a novel segmentation network with an “encoder–decoder” architecture, namely CSU-Net. The encoder consists of two parallel feature extraction branches based on CNN and Transformer, respectively, in which the features of the same size are fused. The decoder has a dual Swin Transformer decoder block with two learnable parameters for feature upsampling. The features from multiple resolutions in the encoder and decoder are merged via skip connections. On the BraTS 2020, our model achieves 0.8927, 0.8857, and 0.8188 for the Whole Tumour (WT), Tumour Core (TC), and Enhancing Tumour (ET), respectively, in terms of Dice scores.
6

Sun, Qixuan, Nianhua Fang, Zhuo Liu, Liang Zhao, Youpeng Wen, and Hongxiang Lin. "HybridCTrm: Bridging CNN and Transformer for Multimodal Brain Image Segmentation." Journal of Healthcare Engineering 2021 (October 1, 2021): 1–10. http://dx.doi.org/10.1155/2021/7467261.

Abstract:
Multimodal medical image segmentation is always a critical problem in medical image segmentation. Traditional deep learning methods utilize fully CNNs for encoding given images, thus leading to deficiency of long-range dependencies and bad generalization performance. Recently, a sequence of Transformer-based methodologies emerges in the field of image processing, which brings great generalization and performance in various tasks. On the other hand, traditional CNNs have their own advantages, such as rapid convergence and local representations. Therefore, we analyze a hybrid multimodal segmentation method based on Transformers and CNNs and propose a novel architecture, HybridCTrm network. We conduct experiments using HybridCTrm on two benchmark datasets and compare with HyperDenseNet, a network based on fully CNNs. Results show that our HybridCTrm outperforms HyperDenseNet on most of the evaluation metrics. Furthermore, we analyze the influence of the depth of Transformer on the performance. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.
7

Yu Tian, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, and Faouzi Bader. "Multimodal transformers for wireless communications: A case study in beam prediction." ITU Journal on Future and Evolving Technologies 4, no. 3 (September 5, 2023): 461–71. http://dx.doi.org/10.52953/jwra8095.

Abstract:
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained from large sequences of multimodality wireless data, on fine-tuning for multiple downstream radio network tasks.
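A minimal sketch of the kind of sensing-assisted beam classifier the abstract describes is shown below: per-modality extractors produce one token each (image and GPS here), a Transformer encoder fuses them, and a linear head scores a beam codebook. The 64-beam codebook, layer sizes, and module names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BeamPredictor(nn.Module):
    """Toy image + GPS fusion model that scores a codebook of beams."""

    def __init__(self, d_model=128, num_beams=64):
        super().__init__()
        self.image_cnn = nn.Sequential(                 # per-modality feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.gps_mlp = nn.Sequential(nn.Linear(2, d_model), nn.ReLU())
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_beams)

    def forward(self, image, gps):
        # One token per modality, mixed by the Transformer encoder.
        tokens = torch.stack([self.image_cnn(image), self.gps_mlp(gps)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)         # average over modality tokens
        return self.head(fused)                         # beam logits

logits = BeamPredictor()(torch.randn(4, 3, 64, 64), torch.randn(4, 2))
```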
8

Xu, Yifan, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, and Changsheng Xu. "Transformers in computational visual media: A survey." Computational Visual Media 8, no. 1 (October 27, 2021): 33–62. http://dx.doi.org/10.1007/s41095-021-0247-3.

Abstract:
Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
9

Zhong, Enmin, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, and Narciso García. "Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer." Sensors 23, no. 16 (August 10, 2023): 7066. http://dx.doi.org/10.3390/s23167066.

Abstract:
Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. The primary challenge lies in achieving real-time recognition while managing temporal dependencies that can impact performance. Existing methods employ 3D convolutional or Transformer-based architectures with hand skeleton estimation, but both have limitations. To address these challenges, a hybrid approach that combines 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed. The method involves using a 3D-CNN to compute high-level semantic skeleton embeddings, capturing local spatial and temporal characteristics of hand gestures. A Transformer network with a self-attention mechanism is then employed to efficiently capture long-range temporal dependencies in the skeleton sequence. Evaluation of the Briareo and Multimodal Hand Gesture datasets resulted in accuracy scores of 95.49% and 97.25%, respectively. Notably, this approach achieves real-time performance using a standard CPU, distinguishing it from methods that require specialized GPUs. The hybrid approach’s real-time efficiency and high accuracy demonstrate its superiority over existing state-of-the-art methods. In summary, the hybrid 3D-CNN and Transformer approach effectively addresses real-time recognition challenges and efficient handling of temporal dependencies, outperforming existing methods in both accuracy and speed.
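The hybrid idea described above, convolutional embeddings of the skeleton sequence followed by a Transformer for long-range temporal modelling, can be sketched as follows. The simple convolutional stem, joint count, and class count are assumptions standing in for the paper's 3D-CNN and datasets:

```python
import torch
import torch.nn as nn

class SkeletonGestureNet(nn.Module):
    """Toy skeleton-sequence classifier: conv stem + Transformer over time."""

    def __init__(self, num_joints=21, d_model=128, num_classes=25):
        super().__init__()
        # (B, T, J, 3) -> flatten joints -> temporal convolution over frames.
        self.stem = nn.Sequential(
            nn.Conv1d(num_joints * 3, d_model, kernel_size=3, padding=1), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, skeleton):                 # skeleton: (B, T, J, 3)
        x = skeleton.flatten(2).transpose(1, 2)  # (B, J*3, T)
        x = self.stem(x).transpose(1, 2)         # (B, T, d_model) per-frame embeddings
        x = self.temporal(x).mean(dim=1)         # self-attention over time, then pool
        return self.classifier(x)

scores = SkeletonGestureNet()(torch.randn(2, 32, 21, 3))
```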
10

Nia, Zahra Movahedi, Ali Ahmadi, Bruce Mellado, Jianhong Wu, James Orbinski, Ali Asgary, and Jude D. Kong. "Twitter-based gender recognition using transformers." Mathematical Biosciences and Engineering 20, no. 9 (2023): 15957–77. http://dx.doi.org/10.3934/mbe.2023711.

Abstract:
Social media contains useful information about people and society that could help advance research in many different areas of health (e.g. by applying opinion mining, emotion/sentiment analysis and statistical analysis) such as mental health, health surveillance, socio-economic inequality and gender vulnerability. User demographics provide rich information that could help study the subject further. However, user demographics such as gender are considered private and are not freely available. In this study, we propose a model based on transformers to predict the user's gender from their images and tweets. The image-based classification model is trained in two different methods: using the profile image of the user and using various image contents posted by the user on Twitter. For the first method a Twitter gender recognition dataset, publicly available on Kaggle and for the second method the PAN-18 dataset is used. Several transformer models, i.e. vision transformers (ViT), LeViT and Swin Transformer are fine-tuned for both of the image datasets and then compared. Next, different transformer models, namely, bidirectional encoders representations from transformers (BERT), RoBERTa and ELECTRA are fine-tuned to recognize the user's gender by their tweets. This is highly beneficial, because not all users provide an image that indicates their gender. The gender of such users could be detected from their tweets. The significance of the image and text classification models were evaluated using the Mann-Whitney U test. Finally, the combination model improved the accuracy of image and text classification models by 11.73 and 5.26% for the Kaggle dataset and by 8.55 and 9.8% for the PAN-18 dataset, respectively. This shows that the image and text classification models are capable of complementing each other by providing additional information to one another. Our overall multimodal method has an accuracy of 88.11% for the Kaggle and 89.24% for the PAN-18 dataset and outperforms state-of-the-art models. Our work benefits research that critically require user demographic information such as gender to further analyze and study social media content for health-related issues.
11

Liang, Yi, Turdi Tohti, and Askar Hamdulla. "False Information Detection via Multimodal Feature Fusion and Multi-Classifier Hybrid Prediction." Algorithms 15, no. 4 (March 29, 2022): 119. http://dx.doi.org/10.3390/a15040119.

Abstract:
In the existing false information detection methods, the quality of the extracted single-modality features is low, the information between different modalities cannot be fully fused, and the original information will be lost when the information of different modalities is fused. This paper proposes a false information detection method via multimodal feature fusion and multi-classifier hybrid prediction. In this method, first, bidirectional encoder representations from transformers are used to extract the text features, and Swin Transformer is used to extract the picture features, and then, the trained deep autoencoder is used as an early fusion method of multimodal features to fuse text features and visual features, and the low-dimensional features are taken as the joint features of the multimodalities. The original features of each modality are concatenated into the joint features to reduce the loss of original information. Finally, the text features, image features and joint features are processed by three classifiers to obtain three probability distributions, and the three probability distributions are added proportionally to obtain the final prediction result. Compared with the attention-based multimodal factorized bilinear pooling, the model achieves 4.3% and 1.2% improvement in accuracy on the Weibo dataset and the Twitter dataset. The experimental results show that the proposed model can effectively integrate multimodal information and improve the accuracy of false information detection.
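The multi-classifier hybrid prediction step described above reduces to a weighted sum of three probability distributions. The sketch below illustrates it with random features and placeholder mixing weights; the paper's actual weights and feature extractors are not reproduced:

```python
import torch
import torch.nn as nn

text_feat, image_feat, joint_feat = (torch.randn(8, 256) for _ in range(3))
text_clf, image_clf, joint_clf = (nn.Linear(256, 2) for _ in range(3))  # real / fake (convention assumed)

p_text  = torch.softmax(text_clf(text_feat),   dim=-1)
p_image = torch.softmax(image_clf(image_feat), dim=-1)
p_joint = torch.softmax(joint_clf(joint_feat), dim=-1)

weights = (0.3, 0.3, 0.4)                       # assumed mixing proportions
p_final = weights[0] * p_text + weights[1] * p_image + weights[2] * p_joint
prediction = p_final.argmax(dim=-1)             # final hybrid decision per sample
```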
12

Desai, Poorav, Tanmoy Chakraborty, and Md Shad Akhtar. "Nice Perfume. How Long Did You Marinate in It? Multimodal Sarcasm Explanation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10563–71. http://dx.doi.org/10.1609/aaai.v36i10.21300.

Abstract:
Sarcasm is a pervading linguistic phenomenon and highly challenging to explain due to its subjectivity, lack of context and deeply-felt opinion. In the multimodal setup, sarcasm is conveyed through the incongruity between the text and visual entities. Although recent approaches deal with sarcasm as a classification problem, it is unclear why an online post is identified as sarcastic. Without proper explanation, end users may not be able to perceive the underlying sense of irony. In this paper, we propose a novel problem -- Multimodal Sarcasm Explanation (MuSE) -- given a multimodal sarcastic post containing an image and a caption, we aim to generate a natural language explanation to reveal the intended sarcasm. To this end, we develop MORE, a new dataset with explanation of 3510 sarcastic multimodal posts. Each explanation is a natural language (English) sentence describing the hidden irony. We benchmark MORE by employing a multimodal Transformer-based architecture. It incorporates a cross-modal attention in the Transformer's encoder which attends to the distinguishing features between the two modalities. Subsequently, a BART-based auto-regressive decoder is used as the generator. Empirical results demonstrate convincing results over various baselines (adopted for MuSE) across five evaluation metrics. We also conduct human evaluation on predictions and obtain Fleiss' Kappa score of 0.4 as a fair agreement among 25 evaluators.
13

Shan, Qishang, Xiangsen Wei, and Ziyun Cai. "Modality-Invariant and -Specific Representations with Crossmodal Transformer for Multimodal Sentiment Analysis." Journal of Physics: Conference Series 2224, no. 1 (April 1, 2022): 012024. http://dx.doi.org/10.1088/1742-6596/2224/1/012024.

Abstract:
Human emotion judgments usually receive information from multiple modalities such as language, audio, as well as facial expressions and gestures. Because different modalities are represented differently, multimodal data exhibit redundancy and complementarity, so a reasonable multimodal fusion approach is essential to improve the accuracy of sentiment analysis. Inspired by the Crossmodal Transformer for multimodal data fusion in the MulT (Multimodal Transformer) model, this paper adds the Crossmodal transformer for modal enhancement of different modal data in the fusion part of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis) model, and proposes three MISA-CT models. Tested on two publicly available multimodal sentiment analysis datasets MOSI and MOSEI, the experimental results of the models outperformed the original MISA model.
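The crossmodal attention borrowed from MulT amounts to letting one modality supply the queries while another supplies the keys and values. A minimal PyTorch sketch, with illustrative shapes rather than the MISA-CT configuration, looks like this:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
text  = torch.randn(2, 20, d_model)   # target modality (provides queries)
audio = torch.randn(2, 50, d_model)   # source modality (provides keys/values)

cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

# Text tokens are "enhanced" with audio information via cross-modal attention.
enhanced_text, _ = cross_attn(query=text, key=audio, value=audio)
text = norm(text + enhanced_text)     # residual connection, as in a Transformer block
```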
14

Gupta, Arpit, Himanshu Goyal, and Ishita Kohli. "Synthesis of Vision and Language: Multifaceted Image Captioning Application." International Journal of Scientific Research in Engineering and Management 07, no. 12 (December 23, 2023): 1–10. http://dx.doi.org/10.55041/ijsrem27770.

Abstract:
The rapid advancement in image captioning has been a pivotal area of research, aiming to mimic human-like understanding of visual content. This paper presents an innovative approach that integrates attention mechanisms and object features into an image captioning model. Leveraging the Flickr8k dataset, this research explores the fusion of these components to enhance image comprehension and caption generation. Furthermore, the study showcases the implementation of this model in a user-friendly application using FASTAPI and ReactJS, offering text-to-speech translation in multiple languages. The findings underscore the efficacy of this approach in advancing image captioning technology. This tutorial outlines the construction of an image caption generator, employing Convolutional Neural Network (CNN) for image feature extraction and Long Short-Term Memory Network (LSTM) for Natural Language Processing (NLP). Keywords—Convolutional Neural Networks, Long Short Term Memory, Attention Mechanism, Transformer Architecture, Vision Transformers, Transfer Learning, Multimodal fusion, Deep Learning Models, Pre-Trained Models, Image Processing Techniques
15

Liu, Bo, Lejian He, Yafei Liu, Tianyao Yu, Yuejia Xiang, Li Zhu, and Weijian Ruan. "Transformer-Based Multimodal Infusion Dialogue Systems." Electronics 11, no. 20 (October 20, 2022): 3409. http://dx.doi.org/10.3390/electronics11203409.

Abstract:
The recent advancements in multimodal dialogue systems have been gaining importance in several domains such as retail, travel, fashion, among others. Several existing works have improved the understanding and generation of multimodal dialogues. However, there still exists considerable space to improve the quality of output textual responses due to insufficient information infusion between the visual and textual semantics. Moreover, the existing dialogue systems often generate defective knowledge-aware responses for tasks such as providing product attributes and celebrity endorsements. To address the aforementioned issues, we present a Transformer-based Multimodal Infusion Dialogue (TMID) system that extracts the visual and textual information from dialogues via a transformer-based multimodal context encoder and employs a cross-attention mechanism to achieve information infusion between images and texts for each utterance. Furthermore, TMID uses adaptive decoders to generate appropriate multimodal responses based on the user intentions it has determined using a state classifier and enriches the output responses by incorporating domain knowledge into the decoders. The results of extensive experiments on a multimodal dialogue dataset demonstrate that TMID has achieved a state-of-the-art performance by improving the BLEU-4 score by 13.03, NIST by 2.77, and image selection Recall@1 by 1.84%.
16

Wang, LeiChen, Simon Giebenhain, Carsten Anklam, and Bastian Goldluecke. "Radar Ghost Target Detection via Multimodal Transformers." IEEE Robotics and Automation Letters 6, no. 4 (October 2021): 7758–65. http://dx.doi.org/10.1109/lra.2021.3100176.

17

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Abstract:
In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.
18

Zhao, Bin, Maoguo Gong, and Xuelong Li. "Hierarchical multimodal transformer to summarize videos." Neurocomputing 468 (January 2022): 360–69. http://dx.doi.org/10.1016/j.neucom.2021.10.039.

19

Ding, Lan. "Online teaching emotion analysis based on GRU and nonlinear transformer algorithm." PeerJ Computer Science 9 (November 21, 2023): e1696. http://dx.doi.org/10.7717/peerj-cs.1696.

Abstract:
Nonlinear models of neural networks demonstrate the ability to autonomously extract significant attributes from a given target, thus facilitating automatic analysis of classroom emotions. This article introduces an online auxiliary tool for analyzing emotional states in virtual classrooms using the nonlinear vision algorithm Transformer. This research uses multimodal fusion, with students’ auditory input, facial expression and text data as the foundational elements of sentiment analysis. In addition, a modal feature extractor has been developed to extract multimodal emotions using convolutional and gated recurrent unit (GRU) architectures. In addition, inspired by the Transformer algorithm, a cross-modal Transformer algorithm is proposed to enhance the processing of multimodal information. The experiments demonstrate that the training performance of the proposed model surpasses that of similar methods, with its recall, precision, accuracy, and F1 values reaching 0.8587, 0.8365, 0.8890, and 0.8754, respectively, showing superior accuracy in capturing students’ emotional states and thus having important implications for assessing students’ engagement in educational courses.
20

Wang, Zhaokai, Renda Bao, Qi Wu, and Si Liu. "Confidence-aware Non-repetitive Multimodal Transformers for TextCaps." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 4 (May 18, 2021): 2835–43. http://dx.doi.org/10.1609/aaai.v35i4.16389.

Abstract:
When describing an image, reading text in the visual scene is crucial to understand the key information. Recent work explores the TextCaps task, i.e. image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover them in generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of reading, reasoning and generation modules, in which the Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated words in captions. Our model outperforms state-of-the-art models on the TextCaps dataset, improving from 81.0 to 93.0 in CIDEr. Our source code is publicly available.
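The repetition mask mentioned above can be pictured as suppressing, at each decoding step, the logits of tokens the caption has already produced. The toy loop below illustrates the idea with random logits; it is not the CNMT implementation:

```python
import torch

vocab_size, steps = 1000, 12
generated = []

for _ in range(steps):
    logits = torch.randn(vocab_size)            # stand-in for the decoder output
    if generated:                               # repetition mask over already-emitted tokens
        logits[torch.tensor(generated)] = float("-inf")
    next_token = int(torch.argmax(logits))
    generated.append(next_token)

assert len(set(generated)) == len(generated)    # no token is repeated
```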
21

Xiang, Yunfan, Xiangyu Tian, Yue Xu, Xiaokun Guan, and Zhengchao Chen. "EGMT-CD: Edge-Guided Multimodal Transformers Change Detection from Satellite and Aerial Images." Remote Sensing 16, no. 1 (December 25, 2023): 86. http://dx.doi.org/10.3390/rs16010086.

Abstract:
Change detection from heterogeneous satellite and aerial images plays a progressively important role in many fields, including disaster assessment, urban construction, and land use monitoring. Currently, researchers have mainly devoted their attention to change detection using homologous image pairs and achieved many remarkable results. It is sometimes necessary to use heterogeneous images for change detection in practical scenarios due to missing images, emergency situations, and cloud and fog occlusion. However, heterogeneous change detection still faces great challenges, especially using satellite and aerial images. The main challenges in satellite and aerial image change detection are related to the resolution gap and blurred edge. Previous studies used interpolation or shallow feature alignment before traditional homologous change detection methods, which ignored the high-level feature interaction and edge information. Therefore, we propose a new heterogeneous change detection model based on multimodal transformers combined with edge guidance. In order to alleviate the resolution gap between satellite and aerial images, we design an improved spatially aligned transformer (SP-T) with a sub-pixel module to align the satellite features to the same size of the aerial ones supervised by a token loss. Moreover, we introduce an edge detection branch to guide change features using the object edge with an auxiliary edge-change loss. Finally, we conduct considerable experiments to verify the effectiveness and superiority of our proposed model (EGMT-CD) on a new satellite–aerial heterogeneous change dataset, named SACD. The experiments show that our method (EGMT-CD) outperforms many previously superior change detection methods and fully demonstrates its potential in heterogeneous change detection from satellite–aerial images.
22

Li, Ning, Jie Chen, Nanxin Fu, Wenzhuo Xiao, Tianrun Ye, Chunming Gao, and Ping Zhang. "Leveraging Dual Variational Autoencoders and Generative Adversarial Networks for Enhanced Multimodal Interaction in Zero-Shot Learning." Electronics 13, no. 3 (January 29, 2024): 539. http://dx.doi.org/10.3390/electronics13030539.

Abstract:
In the evolving field of taxonomic classification, and especially in Zero-shot Learning (ZSL), the challenge of accurately classifying entities unseen in training datasets remains a significant hurdle. Although the existing literature is rich in developments, it often falls short in two critical areas: semantic consistency (ensuring classifications align with true meanings) and the effective handling of dataset diversity biases. These gaps have created a need for a more robust approach that can navigate both with greater efficacy. This paper introduces an innovative integration of transformer models with variational autoencoders (VAEs) and generative adversarial networks (GANs), with the aim of addressing these gaps within the ZSL framework. The choice of VAE-GAN is driven by their complementary strengths: VAEs are proficient in providing a richer representation of data patterns, and GANs are able to generate data that is diverse yet representative, thus mitigating biases from dataset diversity. Transformers are employed to further enhance semantic consistency, which is key because many existing models underperform. In experiments conducted on benchmark ZSL datasets such as CUB, SUN, and Animals with Attributes 2 (AWA2), our approach demonstrates significant improvements, not only in enhancing semantic and structural coherence, but also in effectively addressing dataset biases. This leads to a notable enhancement of the model’s ability to generalize visual categorization tasks beyond the training data, thus filling a critical gap in the current ZSL research landscape.
23

Abdine, Hadi, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. "Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 10 (March 24, 2024): 10757–65. http://dx.doi.org/10.1609/aaai.v38i10.28948.

Abstract:
In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs), in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.
24

Li, Zuhe, Qingbing Guo, Chengyao Feng, Lujuan Deng, Qiuwen Zhang, Jianwei Zhang, Fengqin Wang, and Qian Sun. "Multimodal Sentiment Analysis Based on Interactive Transformer and Soft Mapping." Wireless Communications and Mobile Computing 2022 (February 3, 2022): 1–12. http://dx.doi.org/10.1155/2022/6243347.

Abstract:
Multimodal sentiment analysis aims to harvest people’s opinions or attitudes from multimedia data through fusion techniques. However, existing fusion methods cannot take advantage of the correlation between multimodal data but introduce interference factors. In this paper, we propose an Interactive Transformer and Soft Mapping based method for multimodal sentiment analysis. In the Interactive Transformer layer, an Interactive Multihead Guided-Attention structure composed of a pair of Multihead Attention modules is first utilized to find the mapping relationship between multimodalities. Then, the obtained results are fed into a Feedforward Neural Network. The Soft Mapping layer consisting of stacking Soft Attention module is finally used to map the results to a higher dimension to realize the fusion of multimodal information. The proposed model can fully consider the relationship between multiple modal pieces of information and provides a new solution to the problem of data interaction in multimodal sentiment analysis. Our model was evaluated on benchmark datasets CMU-MOSEI and MELD, and the accuracy is improved by 5.57% compared with the baseline standard.
25

Zhang, Yinshuo, Lei Chen, and Yuan Yuan. "Multimodal Fine-Grained Transformer Model for Pest Recognition." Electronics 12, no. 12 (June 10, 2023): 2620. http://dx.doi.org/10.3390/electronics12122620.

Abstract:
Deep learning has shown great potential in smart agriculture, especially in the field of pest recognition. However, existing methods require large datasets and do not exploit the semantic associations between multimodal data. To address these problems, this paper proposes a multimodal fine-grained transformer (MMFGT) model, a novel pest recognition method that improves three aspects of transformer architecture to meet the needs of few-shot pest recognition. On the one hand, the MMFGT uses self-supervised learning to extend the transformer structure to extract target features using contrastive learning to reduce the reliance on data volume. On the other hand, fine-grained recognition is integrated into the MMFGT to focus attention on finely differentiated areas of pest images to improve recognition accuracy. In addition, the MMFGT further improves the performance in pest recognition by using the joint multimodal information from the pest’s image and natural language description. Extensive experimental results demonstrate that the MMFGT obtains more competitive results compared to other excellent models, such as ResNet, ViT, SwinT, DINO, and EsViT, in pest recognition tasks, with recognition accuracy up to 98.12% and achieving 5.92% higher accuracy compared to the state-of-the-art DINO method for the baseline.
26

Zhang, Tianze. "Investigation on task effect analysis and optimization strategy of multimodal large model based on Transformers architecture for various languages." Applied and Computational Engineering 47, no. 1 (March 15, 2024): 213–24. http://dx.doi.org/10.54254/2755-2721/47/20241374.

Abstract:
As artificial intelligence technology advances swiftly, the Transformers architecture has emerged as a pivotal model for handling multimodal data. This investigation delves into the impact of multimodal large-scale models utilizing the Transformers architecture for addressing various linguistic tasks, along with proposing optimization approaches tailored to this context. Through a series of experiments, this study scrutinized the performance of these models on multilingual datasets, engaging in a comprehensive analysis of the key determinants influencing their effectiveness. First, several models based on the Transformers architecture, including ERNIE, GPT, ViT and VisualBERT, are pre-trained on the same corpus, and a series of tests are carried out on these models in English, Chinese, Spanish and other languages. By comparing the performance of different models, it is found that these models show significant performance differences when dealing with tasks in different languages. Further, through analysis and experimental verification, this paper proposes a series of optimization strategies for different languages, including annotation methods for language-specific datasets, incremental fine-tuning, increasing the size of datasets, and using multi-task learning. Experiments show that these methods achieve remarkable results, and future research directions are put forward.
27

Wang, Zhecan, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, and Shih-Fu Chang. "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 5 (June 28, 2022): 5914–22. http://dx.doi.org/10.1609/aaai.v36i5.20536.

Abstract:
Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made a great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graph in commonsense reasoning. In order to exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in visual scene graph. Moreover, we introduce a method to train and generate domain relevant visual scene graph using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show significant performance boost compared with the state-of-the-art methods, and prove the efficacy of each proposed component.
28

Wei, Jiaqi, Bin Jiang, and Yanxia Zhang. "Identification of Blue Horizontal Branch Stars with Multimodal Fusion." Publications of the Astronomical Society of the Pacific 135, no. 1050 (August 1, 2023): 084501. http://dx.doi.org/10.1088/1538-3873/acea43.

Abstract:
Blue Horizontal Branch stars (BHBs) are ideal tracers to probe the global structure of the Milky Way (MW), and the increased size of the BHB star sample could be helpful to accurately calculate the MW’s enclosed mass and kinematics. Large survey telescopes have produced an increasing number of astronomical images and spectra. However, traditional methods of identifying BHBs are limited in dealing with the large scale of astronomical data. A fast and efficient way of identifying BHBs can provide a more significant sample for further analysis and research. Therefore, in order to fully use the various data observed and further improve the identification accuracy of BHBs, we have innovatively proposed and implemented a Bi-level attention mechanism-based Transformer multimodal fusion model, called Bi-level Attention in the Transformer with Multimodality (BATMM). The model consists of a spectrum encoder, an image encoder, and a Transformer multimodal fusion module. The Transformer enables the effective fusion of data from two modalities, namely image and spectrum, by using the proposed Bi-level attention mechanism, including cross-attention and self-attention. As a result, the information from the different modalities complements each other, thus improving the accuracy of the identification of BHBs. The experimental results show that the F1 score of the proposed BATMM is 94.78%, which is 21.77% and 2.76% higher than the image and spectral unimodality, respectively. It is therefore demonstrated that higher identification accuracy of BHBs can be achieved by means of using data from multiple modalities and employing an efficient data fusion strategy.
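A hedged sketch of a bi-level fusion step in the spirit of BATMM is given below: cross-attention lets spectrum tokens attend to image tokens and vice versa, and self-attention then refines the concatenated sequence before a classifier. All dimensions and the binary BHB / non-BHB head are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
spectrum = torch.randn(2, 64, d_model)   # encoded spectrum tokens
image    = torch.randn(2, 49, d_model)   # encoded image patches

# Level 1: cross-attention in both directions between the two modalities.
cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
spec_enh, _ = cross(spectrum, image, image)       # spectrum attends to image
img_enh, _  = cross(image, spectrum, spectrum)    # image attends to spectrum

# Level 2: self-attention over the concatenated, cross-enhanced sequence.
self_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
fused = self_layer(torch.cat([spec_enh, img_enh], dim=1)).mean(dim=1)

classifier = nn.Linear(d_model, 2)                # BHB vs non-BHB (assumed head)
logits = classifier(fused)
```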
29

Sams, Andrew Steven, and Amalia Zahra. "Multimodal music emotion recognition in Indonesian songs based on CNN-LSTM, XLNet transformers." Bulletin of Electrical Engineering and Informatics 12, no. 1 (February 1, 2023): 355–64. http://dx.doi.org/10.11591/eei.v12i1.4231.

Abstract:
Music carries emotional information and allows the listener to feel the emotions contained in the music. This study proposes a multimodal music emotion recognition (MER) system using Indonesian song and lyrics data. In the proposed multimodal system, the audio data will use the mel spectrogram feature, and the lyrics feature will be extracted by going through the tokenizing process from XLNet. Convolutional long short term memory network (CNN-LSTM) performs the audio classification task, while XLNet transformers performs the lyrics classification task. The outputs of the two classification tasks are probability weight and actual prediction with the value of positive, neutral, and negative emotions, which are then combined using the stacking ensemble method. The combined output will be trained into an artificial neural network (ANN) model to get the best probability weight output. The multimodal system achieves the best performance with an accuracy of 80.56%. The results showed that the multimodal method of recognizing musical emotions gave better performance than the single modal method. In addition, hyperparameter tuning can affect the performance of multimodal systems.
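The stacking-ensemble fusion described above can be sketched as feeding the two base models' class probabilities into a small neural network. In the snippet below the CNN-LSTM and XLNet outputs are faked with random probabilities, and the meta-learner's layer sizes are assumptions:

```python
import torch
import torch.nn as nn

num_classes = 3                                   # positive / neutral / negative
p_audio  = torch.softmax(torch.randn(16, num_classes), dim=-1)   # stand-in CNN-LSTM output
p_lyrics = torch.softmax(torch.randn(16, num_classes), dim=-1)   # stand-in XLNet output

meta_learner = nn.Sequential(                     # stacking "ANN" that learns the combination
    nn.Linear(2 * num_classes, 16), nn.ReLU(),
    nn.Linear(16, num_classes))

logits = meta_learner(torch.cat([p_audio, p_lyrics], dim=-1))
prediction = logits.argmax(dim=-1)                # fused emotion prediction per song
```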
30

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (February 28, 2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Abstract:
Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a classification head. The SwinT + BiLSTM has performed better when compared to the ViT + BiLSTM across all the dimensions. The performance of the models has improved significantly when the contextual embeddings from DistilBert replace the custom embeddings. We have achieved the highest recall of 0.631 by combining outputs of four models using the soft voting technique.
31

Nadal, Clement, and Francois Pigache. "Multimodal electromechanical model of piezoelectric transformers by Hamilton's principle." IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 56, no. 11 (November 2009): 2530–43. http://dx.doi.org/10.1109/tuffc.2009.1340.

32

Chen, Yunfan, Jinxing Ye, and Xiangkui Wan. "TF-YOLO: A Transformer–Fusion-Based YOLO Detector for Multimodal Pedestrian Detection in Autonomous Driving Scenes." World Electric Vehicle Journal 14, no. 12 (December 18, 2023): 352. http://dx.doi.org/10.3390/wevj14120352.

Abstract:
Recent research demonstrates that the fusion of multimodal images can improve the performance of pedestrian detectors under low-illumination environments. However, existing multimodal pedestrian detectors cannot adapt to the variability of environmental illumination. When the lighting conditions of the application environment do not match the illumination conditions of the experimental data, the detection performance is likely to degrade significantly. To resolve this problem, we propose a novel transformer–fusion-based YOLO detector to detect pedestrians under various illumination environments, such as nighttime, smog, and heavy rain. Specifically, we develop a novel transformer–fusion module embedded in a two-stream backbone network to robustly integrate the latent interactions between multimodal images (visible and infrared images). This enables the multimodal pedestrian detector to adapt to changing illumination conditions. Experimental results on two well-known datasets demonstrate that the proposed approach exhibits superior performance. The proposed TF-YOLO drastically improves the average precision of the state-of-the-art approach by 3.3% and reduces the miss rate of the state-of-the-art approach by about 6% on the challenging multi-scenario multi-modality dataset.
33

Pezzelle, Sandro, Ece Takmaz, and Raquel Fernández. "Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation." Transactions of the Association for Computational Linguistics 9 (2021): 1563–79. http://dx.doi.org/10.1162/tacl_a_00443.

Abstract:
This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models to align language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.
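One common way to distil static word vectors out of a contextualised encoder, as this kind of intrinsic evaluation requires, is to embed a word in several sentences and average its contextual representations. The sketch below uses a text-only BERT from the Hugging Face transformers library as a stand-in for the multimodal encoders the paper probes; the sentences and the target word are arbitrary:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["the dog chased the ball", "a dog slept on the couch"]
word_id = tokenizer.convert_tokens_to_ids("dog")

vectors = []
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]        # (tokens, 768)
        # Collect the contextual vectors at every position of the target word.
        positions = (inputs["input_ids"][0] == word_id).nonzero(as_tuple=True)[0]
        vectors.append(hidden[positions].mean(dim=0))

static_dog = torch.stack(vectors).mean(dim=0)                # "static" embedding of "dog"
```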
34

Zhang, Yingjie. "The current status and prospects of transformer in multimodality." Applied and Computational Engineering 11, no. 1 (September 25, 2023): 224–30. http://dx.doi.org/10.54254/2755-2721/11/20230240.

Abstract:
At present, the attention mechanism represented by transformer has greatly promoted the development of natural language processing (NLP) and image processing (CV). However, in the multimodal field, the application of attention mechanism still mainly focuses on extracting the features of different types of data, and then fusing these features (such as text and image). With the increasing scale of the model and the instability of the Internet data, feature fusion has been difficult to solve the growing variety of multimodal problems for us, and the multimodal field has always lacked a model that can uniformly handle all types of data. In this paper, we first take the CV and NLP fields as examples to review various derived models of transformer. Then, based on the mechanism of word embedding and image embedding, we discuss how embedding with different granularity is handled uniformly under the attention mechanism in multimodal scenes. Further, we reveal that this mechanism will not only be limited to CV and NLP, but the real unified model will be able to handle tasks across data types through pre-training and fine tuning. Finally, on the specific implementation of the unified model, this paper lists several cases, and analyzes the valuable research directions in related fields.
35

Hasan, Md Kamrul, Sangwu Lee, Wasifur Rahman, Amir Zadeh, Rada Mihalcea, Louis-Philippe Morency, and Ehsan Hoque. "Humor Knowledge Enriched Transformer for Understanding Multimodal Humor." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (May 18, 2021): 12972–80. http://dx.doi.org/10.1609/aaai.v35i14.17534.

Abstract:
Recognizing humor from a video utterance requires understanding the verbal and non-verbal components as well as incorporating the appropriate context and external knowledge. In this paper, we propose Humor Knowledge enriched Transformer (HKT) that can capture the gist of a multimodal humorous expression by integrating the preceding context and external knowledge. We incorporate humor centric external knowledge into the model by capturing the ambiguity and sentiment present in the language. We encode all the language, acoustic, vision, and humor centric features separately using Transformer based encoders, followed by a cross attention layer to exchange information among them. Our model achieves 77.36% and 79.41% accuracy in humorous punchline detection on UR-FUNNY and MUStaRD datasets -- achieving a new state-of-the-art on both datasets with the margin of 4.93% and 2.94% respectively. Furthermore, we demonstrate that our model can capture interpretable, humor-inducing patterns from all modalities.
36

Zhang, Xiaojuan, Yongxiu Zhou, Peihao Peng, and Guoyan Wang. "A Novel Multimodal Species Distribution Model Fusing Remote Sensing Images and Environmental Features." Sustainability 14, no. 21 (October 28, 2022): 14034. http://dx.doi.org/10.3390/su142114034.

Abstract:
Species distribution models (SDMs) are critical in conservation decision-making and ecological or biogeographical inference. Accurately predicting species distribution can facilitate resource monitoring and management for sustainable regional development. Currently, species distribution models usually use a single source of information as input for the model. To determine a solution to the lack of accuracy of the species distribution model with a single information source, we propose a multimodal species distribution model that can input multiple information sources simultaneously. We used ResNet50 and Transformer network structures as the backbone for multimodal data modeling. The model’s accuracy was tested using the GEOLIFE2020 dataset, and our model’s accuracy is state-of-the-art (SOTA). We found that the prediction accuracy of the multimodal species distribution model with multiple data sources of remote sensing images, environmental variables, and latitude and longitude information as inputs (29.56%) was higher than that of the model with only remote sensing images or environmental variables as inputs (25.72% and 21.68%, respectively). We also found that using a Transformer network structure to fuse data from multiple sources can significantly improve the accuracy of multimodal models. We present a novel multimodal model that fuses multiple sources of information as input for species distribution prediction to advance the research progress of multimodal models in the field of ecology.
37

Zhang, Guihao, and Jiangzhong Cao. "Feature Fusion Based on Transformer for Cross-modal Retrieval." Journal of Physics: Conference Series 2558, no. 1 (August 1, 2023): 012012. http://dx.doi.org/10.1088/1742-6596/2558/1/012012.

Full text
Abstract:
With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has gradually become a hot area of research. As an important branch of multimodal retrieval, image-text retrieval aims to learn and align two modalities, image and text, building a bridge of semantic association between the two heterogeneous data types so that they can be aligned and retrieved in a unified way. Mainstream image-text cross-modal retrieval approaches have made good progress by designing deep learning models that find potential associations between the different modalities. In this paper, we design a transformer-based feature fusion network that fuses the information of the two modalities during feature extraction, enriching the semantic connection between them. We conduct experiments on the benchmark dataset Flickr30k and obtain competitive results, with recall@10 reaching 96.2% for image-to-text retrieval.
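A minimal sketch of this style of fusion-based image-text matching (an assumed architecture, not the paper's network): a Transformer layer fuses region and word features into one joint sequence, and a small head scores each image-caption pair so retrieval can rank candidates.

```python
# Sketch only: joint image-text fusion followed by a pair-matching score (assumed sizes).
import torch
import torch.nn as nn

d = 256
fusion = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
match_head = nn.Linear(d, 1)

def match_score(image_regions, word_feats):
    joint = fusion(torch.cat([image_regions, word_feats], dim=1))  # cross-modal self-attention
    return match_head(joint.mean(dim=1)).squeeze(-1)               # one score per pair

regions = torch.randn(1, 36, d)                 # 36 detected regions for one image
captions = torch.randn(5, 12, d)                # 5 candidate captions, 12 word features each
scores = match_score(regions.expand(5, -1, -1), captions)
print(scores.argsort(descending=True))          # caption indices ranked by match score
```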
38

Park, Junhee, and Nammee Moon. "Design and Implementation of Attention Depression Detection Model Based on Multimodal Analysis." Sustainability 14, no. 6 (March 18, 2022): 3569. http://dx.doi.org/10.3390/su14063569.

Full text
Abstract:
Depression is becoming a social problem as the number of sufferers steadily increases. This paper therefore proposes a multimodal, attention-based depression detection model that simultaneously uses voice and text data obtained from users. The proposed system consists of a Bidirectional Encoder Representations from Transformers-Convolutional Neural Network (BERT-CNN) for natural language analysis, a CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM) network for voice signal processing, and multimodal analysis and fusion models for depression detection. The experiments are conducted on the DAIC-WOZ dataset, a clinical-interview corpus designed to support the assessment of psychological distress conditions such as anxiety and post-traumatic stress. In preprocessing, the voice data were cut to 4 seconds in length and the number of mel filters was set to 128. For text data, we used the subjects' interview transcripts and derived embedding vectors using a transformers tokenizer. The proposed BERT-CNN and CNN-BiLSTM models were applied to each data type and combined to classify depression. Experiments compared accuracy and loss when using multimodal data versus a single modality and confirmed that the previously low accuracy was improved.
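The preprocessing and late fusion described above might look roughly like the following sketch. It is illustrative only: the sampling rate, the stand-in branch networks, and the layer sizes are assumptions rather than the paper's settings. Four-second clips become 128-band mel spectrograms for the acoustic branch, the text branch consumes a pooled embedding, and the two branch outputs are concatenated for classification.

```python
# Sketch only: mel-spectrogram audio branch + text branch with late fusion (assumed sizes).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
audio = torch.randn(1, 16000 * 4)              # one 4-second clip at 16 kHz (assumed rate)
spec = mel(audio)                              # (1, 128, frames)

audio_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(64))   # stand-in for CNN-BiLSTM
text_branch = nn.Sequential(nn.LazyLinear(64))                  # stand-in for BERT-CNN
classifier = nn.Linear(128, 2)                                  # depressed / not depressed

text_feat = torch.randn(1, 768)                # placeholder for a pooled BERT embedding
fused = torch.cat([audio_branch(spec), text_branch(text_feat)], dim=-1)
print(classifier(fused).shape)                 # torch.Size([1, 2])
```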
39

Qi, Qingfu, Liyuan Lin, Rui Zhang and Chengrong Xue. "MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis." IEEE Access 10 (2022): 28750–59. http://dx.doi.org/10.1109/access.2022.3157712.

Full text
40

Li, Lei, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen and Ningyu Zhang. "On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 16254–55. http://dx.doi.org/10.1609/aaai.v37i13.26987.

Full text
Abstract:
Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we conduct an in-depth empirical analysis indicating that inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, visual shuffle experiments illustrate that current approaches may not take full advantage of visual information. Based on these observations, we propose a strong baseline with implicit fine-grained multimodal alignment based on the Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
41

Zhang, Junyan. "Research on transformer and attention in applied algorithms." Applied and Computational Engineering 13, no. 1 (October 23, 2023): 221–28. http://dx.doi.org/10.54254/2755-2721/13/20230737.

Full text
Abstract:
The transformer is an encoder-decoder deep learning architecture that relies entirely on the self-attention mechanism. It has achieved remarkable success in natural language processing and computer vision and is becoming a predominant research direction. This study first analyzes the transformer and the attention mechanism, summarizes their advantages, and, after examining the framework of the attention network and how it computes weights over the data it receives, explores how they help recommendation algorithms dynamically focus on the parts of the input most helpful for the current recommendation task. To further improve performance on objects in natural scenes and the precision of object recognition, a transformer detection approach based on deformable convolution is presented, and the role of the transformer within the generative pre-trained transformer (GPT) is analyzed. These algorithms illustrate the efficacy and robustness of the transformer, indicating that a transformer incorporating the attention mechanism can satisfy the requirements of most deep learning tasks. However, the unpredictability of demands, the exponential growth of information, and other issues will continue to make it challenging to handle global interaction mechanisms and to build a unified framework for multimodal data.
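For reference, the weight computation at the heart of the attention mechanism discussed above is scaled dot-product attention; the short sketch below writes it out directly with plain tensor operations.

```python
# Sketch only: scaled dot-product self-attention written out explicitly.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, tokens, d_model); the projections give queries, keys and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # similarity of every token pair
    weights = scores.softmax(dim=-1)                           # attention weights, rows sum to 1
    return weights @ v                                         # weighted mix of value vectors

d = 64
x = torch.randn(2, 10, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([2, 10, 64])
```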
42

Gao, Jialin, Jianyu Chen, Jiaqi Wei, Bin Jiang and A.-Li Luo. "Deep Multimodal Networks for M-type Star Classification with Paired Spectrum and Photometric Image." Publications of the Astronomical Society of the Pacific 135, no. 1046 (April 1, 2023): 044503. http://dx.doi.org/10.1088/1538-3873/acc7ca.

Full text
Abstract:
Traditional stellar classification methods include spectral and photometric classification separately. Although satisfactory results can be achieved, the accuracy could be improved. In this paper, we pioneer a novel approach to deeply fuse the spectra and photometric images of the sources in an advanced multimodal network to enhance the model’s discriminatory ability. We use Transformer as the fusion module and apply a spectrum–image contrastive loss function to enhance the consistency of the spectrum and photometric image of the same source in two different feature spaces. We perform M-type stellar subtype classification on two data sets with high and low signal-to-noise ratio (S/N) spectra and corresponding photometric images, and the F1-score achieves 95.65% and 90.84%, respectively. In our experiments, we prove that our model effectively utilizes the information from photometric images and is more accurate than advanced spectrum and photometric image classifiers. Our contributions can be summarized as follows: (1) We propose an innovative idea for stellar classification that allows the model to simultaneously consider information from spectra and photometric images. (2) We discover the challenge of fusing low-S/N spectra and photometric images in the Transformer and provide a solution. (3) The effectiveness of Transformer for spectral classification is discussed for the first time and will inspire more Transformer-based spectral classification models.
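A spectrum-image contrastive objective of the kind described above is commonly implemented as a symmetric InfoNCE-style loss; the sketch below is a generic version of that idea (an illustrative assumption, not the authors' code).

```python
# Sketch only: symmetric contrastive loss pulling paired spectrum/image embeddings together.
import torch
import torch.nn.functional as F

def contrastive_loss(spec_emb, img_emb, temperature=0.07):
    spec_emb = F.normalize(spec_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = spec_emb @ img_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(spec_emb.size(0))           # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```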
43

Zong, Daoming, and Shiliang Sun. "McOmet: Multimodal Fusion Transformer for Physical Audiovisual Commonsense Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 5 (June 26, 2023): 6621–29. http://dx.doi.org/10.1609/aaai.v37i5.25813.

Full text
Abstract:
Physical commonsense reasoning is essential for building reliable and interpretable AI systems; it involves a general understanding of the physical properties and affordances of everyday objects, how these objects can be manipulated, and how they interact with others. It is fundamentally a multimodal task, as physical properties are manifested through multiple modalities, including vision and acoustics. In this work, we present a unified framework, named Multimodal Commonsense Transformer (MCOMET), for physical audiovisual commonsense reasoning. MCOMET has two intriguing properties: (i) it fully mines higher-order temporal relationships across modalities (e.g., pairs, triplets, and quadruplets); and (ii) it restricts the cross-modal flow through a feature collection and propagation mechanism along with tight fusion bottlenecks, forcing the model to attend to the most relevant parts of each modality and suppressing the dissemination of noisy information. We evaluate our model on a very recent public benchmark, PACS. Results show that MCOMET significantly outperforms a variety of strong baselines, revealing powerful multimodal commonsense reasoning capabilities.
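The tight fusion-bottleneck idea can be sketched as follows (an illustrative assumption, not MCOMET itself): a handful of shared bottleneck tokens are the only path through which the video and audio streams exchange information, which limits how far noisy features can spread.

```python
# Sketch only: shared bottleneck tokens mediating all cross-modal information flow.
import torch
import torch.nn as nn

d, n_bottleneck = 128, 4
video_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
audio_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d))

video, audio = torch.randn(2, 32, d), torch.randn(2, 48, d)
b = bottleneck.expand(2, -1, -1)

# Each stream is encoded together with the shared bottleneck tokens only;
# the streams never attend to each other directly.
v_out = video_layer(torch.cat([video, b], dim=1))
b = v_out[:, -n_bottleneck:]                         # bottleneck updated by the video stream
a_out = audio_layer(torch.cat([audio, b], dim=1))
b = a_out[:, -n_bottleneck:]                         # ...then by the audio stream
print(b.shape)  # torch.Size([2, 4, 128])
```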
44

JayaLakshmi, Gundabathina, Abburi Madhuri, Deepak Vasudevan, Balamuralikrishna Thati, Uddagiri Sirisha and Surapaneni Phani Praveen. "Effective Disaster Management Through Transformer-Based Multimodal Tweet Classification." Revue d'Intelligence Artificielle 37, no. 5 (October 31, 2023): 1263–72. http://dx.doi.org/10.18280/ria.370519.

Full text
45

Liu, Biyuan, Huaixin Chen, Kun Li and Michael Ying Yang. "Transformer-based multimodal change detection with multitask consistency constraints." Information Fusion 108 (August 2024): 102358. http://dx.doi.org/10.1016/j.inffus.2024.102358.

Full text
46

Abiyev, Rahib H., Mohamad Ziad Altabel, Manal Darwish and Abdulkader Helwan. "A Multimodal Transformer Model for Recognition of Images from Complex Laparoscopic Surgical Videos." Diagnostics 14, no. 7 (March 23, 2024): 681. http://dx.doi.org/10.3390/diagnostics14070681.

Full text
Abstract:
The potential role and advantages of artificial intelligence-based models in the field of surgery remain uncertain. This research marks an initial stride toward creating a multimodal model, inspired by the Video-Audio-Text Transformer, that aims to reduce negative occurrences and enhance patient safety. The model employs state-of-the-art text and image embedding models (ViT and BERT) and assesses their efficacy in extracting hidden and distinctive features from surgical video frames. These features are then used as inputs to convolution-free Transformer architectures that extract comprehensive multidimensional representations. A joint space is then used to combine the text and image features extracted by the two Transformer encoders, ensuring that the relationships between the different modalities are preserved during the combination process. The entire model was trained and tested on laparoscopic cholecystectomy (LC) videos encompassing various levels of complexity. Experimentally, the model reached a mean accuracy of 91.0%, a precision of 81%, and a recall of 83% when tested on 30 of the 80 videos in the Cholec80 dataset.
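A bare-bones sketch of the joint-space combination described above (hypothetical dimensions and class count, not the paper's model): ViT-style frame embeddings and BERT-style text embeddings are each projected into a shared space and concatenated there before classification.

```python
# Sketch only: project image and text embeddings into one joint space, then classify.
import torch
import torch.nn as nn

vit_dim, bert_dim, joint_dim = 768, 768, 512
to_joint_img = nn.Linear(vit_dim, joint_dim)
to_joint_txt = nn.Linear(bert_dim, joint_dim)
head = nn.Linear(2 * joint_dim, 7)                 # e.g. 7 surgical phases (assumed)

frame_feat = torch.randn(4, vit_dim)               # ViT [CLS]-style embedding per frame
text_feat = torch.randn(4, bert_dim)               # BERT [CLS]-style embedding per description
joint = torch.cat([to_joint_img(frame_feat), to_joint_txt(text_feat)], dim=-1)
print(head(joint).shape)  # torch.Size([4, 7])
```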
47

Chaudhari, Aayushi, Chintan Bhatt, Achyut Krishna and Carlos M. Travieso-González. "Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning." Electronics 12, no. 2 (January 5, 2023): 288. http://dx.doi.org/10.3390/electronics12020288.

Full text
Abstract:
Emotion recognition is a very challenging research field due to its complexity: individual differences in cognitive-emotional cues appear in a wide variety of ways, including language, facial expressions, and speech. Using video as the input provides a wealth of data for analyzing human emotions. In this research, we combine the text, audio (speech), and visual modalities using features derived from separately pretrained self-supervised learning models. The fusion of features and representations is the biggest challenge in multimodal emotion classification research. Because of the large dimensionality of self-supervised learning features, we present a transformer- and attention-based fusion method for incorporating multimodal self-supervised learning features, which achieved an accuracy of 86.40% for multimodal emotion classification.
48

Xu, Zhen, David R. So and Andrew M. Dai. "MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10532–40. http://dx.doi.org/10.1609/aaai.v35i12.17260.

Full text
Abstract:
One important challenge of applying deep learning to electronic health records (EHR) is the complexity of their multimodal structure. EHR usually contains a mixture of structured (codes) and unstructured (free-text) data with sparse and irregular longitudinal features -- all of which doctors utilize when making decisions. In the deep learning regime, determining how different modality representations should be fused together is a difficult problem, which is often addressed by handcrafted modeling and intuition. In this work, we extend state-of-the-art neural architecture search (NAS) methods and propose MUltimodal Fusion Architecture SeArch (MUFASA) to simultaneously search across multimodal fusion strategies and modality-specific architectures for the first time. We demonstrate empirically that our MUFASA method outperforms established unimodal NAS on public EHR data with comparable computation costs. In addition, MUFASA produces architectures that outperform Transformer and Evolved Transformer. Compared with these baselines on CCS diagnosis code prediction, our discovered models improve top-5 recall from 0.88 to 0.91 and demonstrate the ability to generalize to other EHR tasks. Studying our top architecture in depth, we provide empirical evidence that MUFASA's improvements are derived from its ability to both customize modeling for each modality and find effective fusion strategies.
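To make the idea of searching over fusion strategies concrete, the toy sketch below randomly samples candidate multimodal architectures from a tiny, made-up search space and keeps the best one under a user-supplied validation score. This is only a stand-in for illustration: MUFASA itself uses evolutionary search over much richer, learned blocks.

```python
# Toy sketch only: random search over a tiny space of multimodal fusion choices.
import random

search_space = {
    "code_encoder": ["mlp", "transformer"],
    "text_encoder": ["bow", "transformer"],
    "fusion": ["early_concat", "late_concat", "cross_attention"],
}

def sample():
    return {k: random.choice(v) for k, v in search_space.items()}

def validation_score(arch):
    # Placeholder: in practice, build the model described by `arch`, train it on EHR
    # data, and return a validation metric such as top-5 recall.
    return random.random()

best = max((sample() for _ in range(20)), key=validation_score)
print("best architecture found:", best)
```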
49

Ilmi, Yuslimu, Pratiwi Retnaningdyah and Ahmad Munir. "Exploring Digital Multimodal Text in EFL Classroom: Transformed Practice in Multiliteracies Pedagogy." Linguistic, English Education and Art (LEEA) Journal 4, no. 1 (December 28, 2020): 99–108. http://dx.doi.org/10.31539/leea.v4i1.1416.

Full text
Abstract:
This study investigates EFL students' composition of digital multimodal texts. Forty-four tenth-grade students were recruited from a private senior high school in Sidoarjo. They were divided into seven groups and given the authority to choose their own topic for the digital multimodal project: an advertisement video. Drawing on the theory of multimodal analysis, this study examines both the students' processes and their products. A qualitative case study was chosen as the research design, with document analysis as the data collection technique. The findings show that all student groups used multiple modes in creating their advertisement videos. Additionally, the study reveals that the quality of the students' projects depends on two important factors: the ability of the group members and their collaboration while carrying out the project. Keywords: Multimodal Text, Multiliteracies Pedagogy, Transformed Practice, EFL Classroom, Senior High School.
50

Ammour, Nassim, Yakoub Bazi and Naif Alajlan. "Multimodal Approach for Enhancing Biometric Authentication." Journal of Imaging 9, no. 9 (August 22, 2023): 168. http://dx.doi.org/10.3390/jimaging9090168.

Full text
Abstract:
Unimodal biometric systems rely on a single source or unique individual biological trait for measurement and examination. Fingerprint-based biometric systems are the most common, but they are vulnerable to presentation attacks or spoofing when a fake fingerprint is presented to the sensor. To address this issue, we propose an enhanced biometric system based on a multimodal approach using two types of biological traits. We propose to combine fingerprint and Electrocardiogram (ECG) signals to mitigate spoofing attacks. Specifically, we design a multimodal deep learning architecture that accepts fingerprints and ECG as inputs and fuses the feature vectors using stacking and channel-wise approaches. The feature extraction backbone of the architecture is based on data-efficient transformers. The experimental results demonstrate the promising capabilities of the proposed approach in enhancing the robustness of the system to presentation attacks.
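The stacking and channel-wise fusion mentioned above can be illustrated with a short sketch (dimensions and class labels are assumptions, not the paper's settings): the fingerprint and ECG feature vectors are either concatenated into one long vector or stacked as two channels before a classification head.

```python
# Sketch only: stacking vs. channel-wise fusion of two biometric feature vectors.
import torch
import torch.nn as nn

d = 192
fingerprint = torch.randn(2, d)                 # feature vector from the fingerprint branch
ecg = torch.randn(2, d)                         # feature vector from the ECG branch

stacked = torch.cat([fingerprint, ecg], dim=-1)        # stacking: (B, 2*d)
channel_wise = torch.stack([fingerprint, ecg], dim=1)  # channel-wise: (B, 2, d)

head_stacked = nn.Linear(2 * d, 2)              # genuine vs. impostor (assumed classes)
head_channel = nn.Sequential(nn.Conv1d(2, 1, kernel_size=1), nn.Flatten(), nn.Linear(d, 2))
print(head_stacked(stacked).shape, head_channel(channel_wise).shape)
```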
