Journal articles on the topic 'Transformers Multimodaux'

Author: Grafiati

Published: 13 April 2024

Consult the top 50 journal articles for your research on the topic 'Transformers Multimodaux.'

1

Jaiswal, Sushma, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal. "Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search." International Journal of Intelligent Systems and Applications 16, no. 2 (April 8, 2024): 53–61. http://dx.doi.org/10.5815/ijisa.2024.02.05.

Abstract:

Deep learning has improved image captioning. Transformer, a neural network architecture built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. Convolutional neural networks (CNNs) extracted image features and RNNs or LSTM networks generated captions in traditional image captioning. This method often has information bottlenecks and trouble capturing long-range dependencies. Transformer architecture revolutionized natural language processing with its attention strategy and parallel processing. Researchers used Transformers' language success to solve image captioning problems. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer architecture's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and CNN-Transformer hybrid models are discussed. We also discuss pre-training, fine-tuning, and reinforcement learning to improve caption quality. Transformer-based image captioning difficulties, trends, and future approaches are also examined. Multimodal fusion, visual-text alignment, and caption interpretability are challenges. We expect research to address these issues and apply Transformer-based image captioning to medical imaging and remote sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interactions.

2

Bayat, Nasrin, Jong-Hwan Kim, Renoa Choudhury, Ibrahim F. Kadhim, Zubaidah Al-Mashhadani, Mark Aldritz Dela Virgen, Reuben Latorre, Ricardo De La Paz, and Joon-Hyuk Park. "Vision Transformer Customized for Environment Detection and Collision Prediction to Assist the Visually Impaired." Journal of Imaging 9, no. 8 (August 15, 2023): 161. http://dx.doi.org/10.3390/jimaging9080161.

Abstract:

This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector of all identified objects from the vision transformer and to detect objects that are likely to intersect with the user’s walking path. Audio and vibrotactile feedback modules are integrated to convey collision warning through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos consisting of 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while the experimental validation concerning their usability and efficiency stands as an upcoming endeavor. The demonstrated performance of the vision transformer and the presented algorithms in conjunction with the multimodal feedback modules show promising prospects of its feasibility and applicability for the navigation assistance of individuals with vision impairment.

3

Shao, Zilei. "A literature review on multimodal deep learning models for detecting mental disorders in conversational data: Pre-transformer and transformer-based approaches." Applied and Computational Engineering 18, no. 1 (October 23, 2023): 215–24. http://dx.doi.org/10.54254/2755-2721/18/20230993.

Abstract:

This paper provides a comprehensive review of multimodal deep learning models that utilize conversational data to detect mental health disorders. In addition to discussing models based on the Transformer, such as BERT (Bidirectional Encoder Representations from Transformers), this paper addresses models that existed prior to the Transformer, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The paper covers the application of these models in the construction of multimodal deep learning systems to detect mental disorders. In addition, the difficulties encountered by multimodal deep learning systems are brought up. Furthermore, the paper proposes research directions for enhancing the performance and robustness of these models in mental health applications. By shedding light on the potential of multimodal deep learning in mental health care, this paper aims to foster further research and development in this critical domain.

4

Hendricks, Lisa Anne, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers." Transactions of the Association for Computational Linguistics 9 (2021): 570–85. http://dx.doi.org/10.1162/tacl_a_00385.

Abstract:

Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

5

Chen, Yu, Ming Yin, Yu Li, and Qian Cai. "CSU-Net: A CNN-Transformer Parallel Network for Multimodal Brain Tumour Segmentation." Electronics 11, no. 14 (July 16, 2022): 2226. http://dx.doi.org/10.3390/electronics11142226.

Abstract:

Medical image segmentation techniques are vital to medical image processing and analysis. Considering the significant clinical applications of brain tumour image segmentation, it represents a focal point of medical image segmentation research. Most of the work in recent times has been centred on Convolutional Neural Networks (CNN) and Transformers. However, CNN has some deficiencies in modelling long-distance information transfer and contextual processing information, while Transformer is relatively weak in acquiring local information. To overcome the above defects, we propose a novel segmentation network with an “encoder–decoder” architecture, namely CSU-Net. The encoder consists of two parallel feature extraction branches based on CNN and Transformer, respectively, in which the features of the same size are fused. The decoder has a dual Swin Transformer decoder block with two learnable parameters for feature upsampling. The features from multiple resolutions in the encoder and decoder are merged via skip connections. On the BraTS 2020, our model achieves 0.8927, 0.8857, and 0.8188 for the Whole Tumour (WT), Tumour Core (TC), and Enhancing Tumour (ET), respectively, in terms of Dice scores.
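
For readers unfamiliar with the metric reported in this abstract: the Dice score of a predicted mask A against a reference mask B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that formula on flattened binary masks. The function name `dice_score`, the toy masks, and the convention for two empty masks are assumptions made for illustration; none of it is taken from the CSU-Net paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (flattened to lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels labelled 1 in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks (hypothetical values, not BraTS data).
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(dice_score(pred, target))
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they do not overlap at all.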

6

Sun, Qixuan, Nianhua Fang, Zhuo Liu, Liang Zhao, Youpeng Wen, and Hongxiang Lin. "HybridCTrm: Bridging CNN and Transformer for Multimodal Brain Image Segmentation." Journal of Healthcare Engineering 2021 (October 1, 2021): 1–10. http://dx.doi.org/10.1155/2021/7467261.

Abstract:

Multimodal medical image segmentation is always a critical problem in medical image segmentation. Traditional deep learning methods utilize fully CNNs for encoding given images, thus leading to deficiency of long-range dependencies and bad generalization performance. Recently, a sequence of Transformer-based methodologies emerges in the field of image processing, which brings great generalization and performance in various tasks. On the other hand, traditional CNNs have their own advantages, such as rapid convergence and local representations. Therefore, we analyze a hybrid multimodal segmentation method based on Transformers and CNNs and propose a novel architecture, HybridCTrm network. We conduct experiments using HybridCTrm on two benchmark datasets and compare with HyperDenseNet, a network based on fully CNNs. Results show that our HybridCTrm outperforms HyperDenseNet on most of the evaluation metrics. Furthermore, we analyze the influence of the depth of Transformer on the performance. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.

7

Tian, Yu, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, and Faouzi Bader. "Multimodal transformers for wireless communications: A case study in beam prediction." ITU Journal on Future and Evolving Technologies 4, no. 3 (September 5, 2023): 461–71. http://dx.doi.org/10.52953/jwra8095.

Abstract:

Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained from large sequences of multimodality wireless data, on fine-tuning for multiple downstream radio network tasks.
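
The abstract names two general training techniques, focal loss and exponential moving average (EMA), without giving details. As a rough sketch of their standard textbook forms: focal loss scales cross-entropy by (1 - p_t)^γ so that well-classified samples contribute less, and EMA keeps a smoothed copy of the weights. The function names, `gamma`, `decay`, and the sample values are assumptions for illustration and do not reflect the settings used in the paper.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the predicted probability of its true class."""
    # (1 - p_true) ** gamma shrinks the loss of confidently correct predictions.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: blend the running average with the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# An easy sample (p = 0.9) is penalised far less than a hard one (p = 0.1).
print(focal_loss(0.9), focal_loss(0.1))
# Hypothetical two-parameter model after one training step.
print(ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9))
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy, which is one way to sanity-check an implementation.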

8

Xu, Yifan, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, and Changsheng Xu. "Transformers in computational visual media: A survey." Computational Visual Media 8, no. 1 (October 27, 2021): 33–62. http://dx.doi.org/10.1007/s41095-021-0247-3.

Abstract:

Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.

9

Zhong, Enmin, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, and Narciso García. "Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer." Sensors 23, no. 16 (August 10, 2023): 7066. http://dx.doi.org/10.3390/s23167066.

Abstract:

Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. The primary challenge lies in achieving real-time recognition while managing temporal dependencies that can impact performance. Existing methods employ 3D convolutional or Transformer-based architectures with hand skeleton estimation, but both have limitations. To address these challenges, a hybrid approach that combines 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed. The method involves using a 3D-CNN to compute high-level semantic skeleton embeddings, capturing local spatial and temporal characteristics of hand gestures. A Transformer network with a self-attention mechanism is then employed to efficiently capture long-range temporal dependencies in the skeleton sequence. Evaluation on the Briareo and Multimodal Hand Gesture datasets resulted in accuracy scores of 95.49% and 97.25%, respectively. Notably, this approach achieves real-time performance using a standard CPU, distinguishing it from methods that require specialized GPUs. The hybrid approach’s real-time efficiency and high accuracy demonstrate its superiority over existing state-of-the-art methods. In summary, the hybrid 3D-CNN and Transformer approach effectively addresses real-time recognition challenges and efficient handling of temporal dependencies, outperforming existing methods in both accuracy and speed.

10

Nia, Zahra Movahedi, Ali Ahmadi, Bruce Mellado, Jianhong Wu, James Orbinski, Ali Asgary, and Jude D. Kong. "Twitter-based gender recognition using transformers." Mathematical Biosciences and Engineering 20, no. 9 (2023): 15957–77. http://dx.doi.org/10.3934/mbe.2023711.

Abstract:

Social media contains useful information about people and society that could help advance research in many different areas of health (e.g. by applying opinion mining, emotion/sentiment analysis and statistical analysis) such as mental health, health surveillance, socio-economic inequality and gender vulnerability. User demographics provide rich information that could help study the subject further. However, user demographics such as gender are considered private and are not freely available. In this study, we propose a model based on transformers to predict the user's gender from their images and tweets. The image-based classification model is trained in two different ways: using the profile image of the user and using various image contents posted by the user on Twitter. The first method uses a Twitter gender recognition dataset publicly available on Kaggle, and the second uses the PAN-18 dataset. Several transformer models, i.e. vision transformers (ViT), LeViT and Swin Transformer, are fine-tuned for both of the image datasets and then compared. Next, different transformer models, namely Bidirectional Encoder Representations from Transformers (BERT), RoBERTa and ELECTRA, are fine-tuned to recognize the user's gender from their tweets. This is highly beneficial, because not all users provide an image that indicates their gender. The gender of such users could be detected from their tweets. The significance of the image and text classification models was evaluated using the Mann-Whitney U test. Finally, the combination model improved the accuracy of image and text classification models by 11.73 and 5.26% for the Kaggle dataset and by 8.55 and 9.8% for the PAN-18 dataset, respectively. This shows that the image and text classification models are capable of complementing each other by providing additional information to one another. Our overall multimodal method has an accuracy of 88.11% for the Kaggle dataset and 89.24% for the PAN-18 dataset and outperforms state-of-the-art models. Our work benefits research that critically requires user demographic information such as gender to further analyze and study social media content for health-related issues.

11

Liang, Yi, Turdi Tohti, and Askar Hamdulla. "False Information Detection via Multimodal Feature Fusion and Multi-Classifier Hybrid Prediction." Algorithms 15, no. 4 (March 29, 2022): 119. http://dx.doi.org/10.3390/a15040119.

Abstract:

In the existing false information detection methods, the quality of the extracted single-modality features is low, the information between different modalities cannot be fully fused, and the original information will be lost when the information of different modalities is fused. This paper proposes a false information detection method based on multimodal feature fusion and multi-classifier hybrid prediction. In this method, first, Bidirectional Encoder Representations from Transformers (BERT) is used to extract the text features, and Swin Transformer is used to extract the picture features; then, a trained deep autoencoder is used as an early fusion method to fuse text features and visual features, and the low-dimensional features are taken as the joint multimodal features. The original features of each modality are concatenated into the joint features to reduce the loss of original information. Finally, the text features, image features and joint features are processed by three classifiers to obtain three probability distributions, and the three probability distributions are added proportionally to obtain the final prediction result. Compared with the attention-based multimodal factorized bilinear pooling, the model achieves 4.3% and 1.2% improvement in accuracy on the Weibo dataset and the Twitter dataset. The experimental results show that the proposed model can effectively integrate multimodal information and improve the accuracy of false information detection.
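
The final step described in this abstract, adding three classifier probability distributions "proportionally", can be read as a weighted sum of class-probability vectors. The sketch below shows one plausible form of that late-fusion step. The function name `fuse_probabilities`, the weights, and the sample probabilities are hypothetical; the abstract does not state the proportions the authors used.

```python
def fuse_probabilities(distributions, weights):
    """Weighted sum of class-probability vectors, renormalised to sum to 1."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    n_classes = len(distributions[0])
    fused = [0.0] * n_classes
    for dist, weight in zip(distributions, weights):
        for i, prob in enumerate(dist):
            fused[i] += weight * prob
    total = sum(fused)
    if total == 0:
        raise ValueError("weights and probabilities must not all be zero")
    return [value / total for value in fused]


# Hypothetical outputs of the text, image, and joint-feature classifiers
# over the classes [real, fake]; the weights are illustrative only.
text_probs = [0.7, 0.3]
image_probs = [0.4, 0.6]
joint_probs = [0.8, 0.2]
fused = fuse_probabilities([text_probs, image_probs, joint_probs], [0.3, 0.3, 0.4])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Renormalising at the end keeps the result a valid distribution even when the chosen weights do not sum to 1.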

12

Desai, Poorav, Tanmoy Chakraborty, and Md Shad Akhtar. "Nice Perfume. How Long Did You Marinate in It? Multimodal Sarcasm Explanation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10563–71. http://dx.doi.org/10.1609/aaai.v36i10.21300.

Abstract:

Sarcasm is a pervading linguistic phenomenon and highly challenging to explain due to its subjectivity, lack of context and deeply-felt opinion. In the multimodal setup, sarcasm is conveyed through the incongruity between the text and visual entities. Although recent approaches deal with sarcasm as a classification problem, it is unclear why an online post is identified as sarcastic. Without proper explanation, end users may not be able to perceive the underlying sense of irony. In this paper, we propose a novel problem -- Multimodal Sarcasm Explanation (MuSE) -- given a multimodal sarcastic post containing an image and a caption, we aim to generate a natural language explanation to reveal the intended sarcasm. To this end, we develop MORE, a new dataset with explanation of 3510 sarcastic multimodal posts. Each explanation is a natural language (English) sentence describing the hidden irony. We benchmark MORE by employing a multimodal Transformer-based architecture. It incorporates a cross-modal attention in the Transformer's encoder which attends to the distinguishing features between the two modalities. Subsequently, a BART-based auto-regressive decoder is used as the generator. Empirical results demonstrate convincing results over various baselines (adopted for MuSE) across five evaluation metrics. We also conduct human evaluation on predictions and obtain Fleiss' Kappa score of 0.4 as a fair agreement among 25 evaluators.

13

Shan, Qishang, Xiangsen Wei, and Ziyun Cai. "Modality-Invariant and -Specific Representations with Crossmodal Transformer for Multimodal Sentiment Analysis." Journal of Physics: Conference Series 2224, no. 1 (April 1, 2022): 012024. http://dx.doi.org/10.1088/1742-6596/2224/1/012024.

Abstract:

Human emotion judgments usually receive information from multiple modalities such as language, audio, as well as facial expressions and gestures. Because different modalities are represented differently, multimodal data exhibit redundancy and complementarity, so a reasonable multimodal fusion approach is essential to improve the accuracy of sentiment analysis. Inspired by the Crossmodal Transformer for multimodal data fusion in the MulT (Multimodal Transformer) model, this paper adds the Crossmodal transformer for modal enhancement of different modal data in the fusion part of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis) model, and proposes three MISA-CT models. Tested on two publicly available multimodal sentiment analysis datasets MOSI and MOSEI, the experimental results of the models outperformed the original MISA model.

14

Gupta, Arpit, Himanshu Goyal, and Ishita Kohli. "Synthesis of Vision and Language: Multifaceted Image Captioning Application." International Journal of Scientific Research in Engineering and Management 07, no. 12 (December 23, 2023): 1–10. http://dx.doi.org/10.55041/ijsrem27770.

Abstract:

The rapid advancement in image captioning has been a pivotal area of research, aiming to mimic human-like understanding of visual content. This paper presents an innovative approach that integrates attention mechanisms and object features into an image captioning model. Leveraging the Flickr8k dataset, this research explores the fusion of these components to enhance image comprehension and caption generation. Furthermore, the study showcases the implementation of this model in a user-friendly application using FASTAPI and ReactJS, offering text-to-speech translation in multiple languages. The findings underscore the efficacy of this approach in advancing image captioning technology. This tutorial outlines the construction of an image caption generator, employing Convolutional Neural Network (CNN) for image feature extraction and Long Short-Term Memory Network (LSTM) for Natural Language Processing (NLP). Keywords—Convolutional Neural Networks, Long Short Term Memory, Attention Mechanism, Transformer Architecture, Vision Transformers, Transfer Learning, Multimodal fusion, Deep Learning Models, Pre-Trained Models, Image Processing Techniques

15

Liu, Bo, Lejian He, Yafei Liu, Tianyao Yu, Yuejia Xiang, Li Zhu, and Weijian Ruan. "Transformer-Based Multimodal Infusion Dialogue Systems." Electronics 11, no. 20 (October 20, 2022): 3409. http://dx.doi.org/10.3390/electronics11203409.

Abstract:

The recent advancements in multimodal dialogue systems have been gaining importance in several domains such as retail, travel, fashion, among others. Several existing works have improved the understanding and generation of multimodal dialogues. However, there still exists considerable space to improve the quality of output textual responses due to insufficient information infusion between the visual and textual semantics. Moreover, the existing dialogue systems often generate defective knowledge-aware responses for tasks such as providing product attributes and celebrity endorsements. To address the aforementioned issues, we present a Transformer-based Multimodal Infusion Dialogue (TMID) system that extracts the visual and textual information from dialogues via a transformer-based multimodal context encoder and employs a cross-attention mechanism to achieve information infusion between images and texts for each utterance. Furthermore, TMID uses adaptive decoders to generate appropriate multimodal responses based on the user intentions it has determined using a state classifier and enriches the output responses by incorporating domain knowledge into the decoders. The results of extensive experiments on a multimodal dialogue dataset demonstrate that TMID has achieved a state-of-the-art performance by improving the BLEU-4 score by 13.03, NIST by 2.77, and image selection Recall@1 by 1.84%.

16

Wang, LeiChen, Simon Giebenhain, Carsten Anklam, and Bastian Goldluecke. "Radar Ghost Target Detection via Multimodal Transformers." IEEE Robotics and Automation Letters 6, no. 4 (October 2021): 7758–65. http://dx.doi.org/10.1109/lra.2021.3100176.

17

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Abstract:

In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.

18

Zhao, Bin, Maoguo Gong, and Xuelong Li. "Hierarchical multimodal transformer to summarize videos." Neurocomputing 468 (January 2022): 360–69. http://dx.doi.org/10.1016/j.neucom.2021.10.039.

19

Ding, Lan. "Online teaching emotion analysis based on GRU and nonlinear transformer algorithm." PeerJ Computer Science 9 (November 21, 2023): e1696. http://dx.doi.org/10.7717/peerj-cs.1696.

Abstract:

Nonlinear models of neural networks demonstrate the ability to autonomously extract significant attributes from a given target, thus facilitating automatic analysis of classroom emotions. This article introduces an online auxiliary tool for analyzing emotional states in virtual classrooms using the nonlinear vision algorithm Transformer. This research uses multimodal fusion, students’ auditory input, facial expression and text data as the foundational elements of sentiment analysis. In addition, a modal feature extractor has been developed to extract multimodal emotions using convolutional and gated recurrent unit (GRU) architectures. In addition, inspired by the Transformer algorithm, a cross-modal Transformer algorithm is proposed to enhance the processing of multimodal information. The experiments demonstrate that the training performance of the proposed model surpasses that of similar methods, with its recall, precision, accuracy, and F1 values achieving 0.8587, 0.8365, 0.8890, and 0.8754, respectively, which is superior accuracy in capturing students’ emotional states, thus having important implications in assessing students’ engagement in educational courses.

20

Wang, Zhaokai, Renda Bao, Qi Wu, and Si Liu. "Confidence-aware Non-repetitive Multimodal Transformers for TextCaps." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 4 (May 18, 2021): 2835–43. http://dx.doi.org/10.1609/aaai.v35i4.16389.

Abstract:

When describing an image, reading text in the visual scene is crucial to understand the key information. Recent work explores the TextCaps task, i.e. image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover it in generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of reading, reasoning and generation modules, in which the Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated words in captions. Our model outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 81.0 to 93.0. Our source code is publicly available.
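
The repetition mask mentioned in this abstract can be pictured as follows: at each decoding step, tokens that were already generated have their scores set to negative infinity so they cannot be chosen again. The sketch below shows one simple version of that idea for greedy decoding. The function names, the toy scores, and the `exempt_ids` parameter (for tokens that may legitimately repeat) are assumptions for illustration, not the CNMT implementation.

```python
import math


def apply_repetition_mask(logits, generated_ids, exempt_ids=()):
    """Return a copy of logits with already-generated token ids masked out."""
    masked = list(logits)
    for token_id in set(generated_ids):
        # exempt_ids is a hypothetical allowance for tokens that may repeat.
        if token_id not in exempt_ids:
            masked[token_id] = -math.inf
    return masked


def greedy_decode_step(logits, generated_ids, exempt_ids=()):
    """Pick the highest-scoring token id that the repetition mask allows."""
    masked = apply_repetition_mask(logits, generated_ids, exempt_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Toy vocabulary of four token ids; token 1 has already been generated.
step_logits = [0.1, 2.0, 1.5, 0.3]
print(greedy_decode_step(step_logits, generated_ids=[1]))
print(greedy_decode_step(step_logits, generated_ids=[1], exempt_ids={1}))
```

In the first call the mask forces the decoder away from token 1 to the next-best token; in the second, the exemption lets token 1 be selected again.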

21

Xiang, Yunfan, Xiangyu Tian, Yue Xu, Xiaokun Guan, and Zhengchao Chen. "EGMT-CD: Edge-Guided Multimodal Transformers Change Detection from Satellite and Aerial Images." Remote Sensing 16, no. 1 (December 25, 2023): 86. http://dx.doi.org/10.3390/rs16010086.

Abstract:

Change detection from heterogeneous satellite and aerial images plays a progressively important role in many fields, including disaster assessment, urban construction, and land use monitoring. Currently, researchers have mainly devoted their attention to change detection using homologous image pairs and achieved many remarkable results. It is sometimes necessary to use heterogeneous images for change detection in practical scenarios due to missing images, emergency situations, and cloud and fog occlusion. However, heterogeneous change detection still faces great challenges, especially using satellite and aerial images. The main challenges in satellite and aerial image change detection are related to the resolution gap and blurred edge. Previous studies used interpolation or shallow feature alignment before traditional homologous change detection methods, which ignored the high-level feature interaction and edge information. Therefore, we propose a new heterogeneous change detection model based on multimodal transformers combined with edge guidance. In order to alleviate the resolution gap between satellite and aerial images, we design an improved spatially aligned transformer (SP-T) with a sub-pixel module to align the satellite features to the same size of the aerial ones supervised by a token loss. Moreover, we introduce an edge detection branch to guide change features using the object edge with an auxiliary edge-change loss. Finally, we conduct considerable experiments to verify the effectiveness and superiority of our proposed model (EGMT-CD) on a new satellite–aerial heterogeneous change dataset, named SACD. The experiments show that our method (EGMT-CD) outperforms many previously superior change detection methods and fully demonstrates its potential in heterogeneous change detection from satellite–aerial images.

22

Li, Ning, Jie Chen, Nanxin Fu, Wenzhuo Xiao, Tianrun Ye, Chunming Gao, and Ping Zhang. "Leveraging Dual Variational Autoencoders and Generative Adversarial Networks for Enhanced Multimodal Interaction in Zero-Shot Learning." Electronics 13, no. 3 (January 29, 2024): 539. http://dx.doi.org/10.3390/electronics13030539.

Abstract:

In the evolving field of taxonomic classification, and especially in Zero-shot Learning (ZSL), the challenge of accurately classifying entities unseen in training datasets remains a significant hurdle. Although the existing literature is rich in developments, it often falls short in two critical areas: semantic consistency (ensuring classifications align with true meanings) and the effective handling of dataset diversity biases. These gaps have created a need for a more robust approach that can navigate both with greater efficacy. This paper introduces an innovative integration of transformer models with variational autoencoders (VAEs) and generative adversarial networks (GANs), with the aim of addressing these gaps within the ZSL framework. The choice of VAE-GAN is driven by their complementary strengths: VAEs are proficient in providing a richer representation of data patterns, and GANs are able to generate data that is diverse yet representative, thus mitigating biases from dataset diversity. Transformers are employed to further enhance semantic consistency, an area in which many existing models underperform. In experiments conducted on benchmark ZSL datasets such as CUB, SUN, and Animals with Attributes 2 (AWA2), our approach demonstrates significant improvements, not only in enhancing semantic and structural coherence, but also in effectively addressing dataset biases. This leads to a notable enhancement of the model’s ability to generalize visual categorization tasks beyond the training data, thus filling a critical gap in the current ZSL research landscape.

23

Abdine, Hadi, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. "Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 10 (March 24, 2024): 10757–65. http://dx.doi.org/10.1609/aaai.v38i10.28948.

Abstract:

In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.

24

Li, Zuhe, Qingbing Guo, Chengyao Feng, Lujuan Deng, Qiuwen Zhang, Jianwei Zhang, Fengqin Wang, and Qian Sun. "Multimodal Sentiment Analysis Based on Interactive Transformer and Soft Mapping." Wireless Communications and Mobile Computing 2022 (February 3, 2022): 1–12. http://dx.doi.org/10.1155/2022/6243347.

Abstract:

Multimodal sentiment analysis aims to harvest people’s opinions or attitudes from multimedia data through fusion techniques. However, existing fusion methods cannot take advantage of the correlation between multimodal data but introduce interference factors. In this paper, we propose an Interactive Transformer and Soft Mapping based method for multimodal sentiment analysis. In the Interactive Transformer layer, an Interactive Multihead Guided-Attention structure composed of a pair of Multihead Attention modules is first utilized to find the mapping relationship between multimodalities. Then, the obtained results are fed into a Feedforward Neural Network. The Soft Mapping layer consisting of stacking Soft Attention module is finally used to map the results to a higher dimension to realize the fusion of multimodal information. The proposed model can fully consider the relationship between multiple modal pieces of information and provides a new solution to the problem of data interaction in multimodal sentiment analysis. Our model was evaluated on benchmark datasets CMU-MOSEI and MELD, and the accuracy is improved by 5.57% compared with the baseline standard.

25

Zhang, Yinshuo, Lei Chen, and Yuan Yuan. "Multimodal Fine-Grained Transformer Model for Pest Recognition." Electronics 12, no. 12 (June 10, 2023): 2620. http://dx.doi.org/10.3390/electronics12122620.

Abstract:

Deep learning has shown great potential in smart agriculture, especially in the field of pest recognition. However, existing methods require large datasets and do not exploit the semantic associations between multimodal data. To address these problems, this paper proposes a multimodal fine-grained transformer (MMFGT) model, a novel pest recognition method that improves three aspects of transformer architecture to meet the needs of few-shot pest recognition. On the one hand, the MMFGT uses self-supervised learning to extend the transformer structure to extract target features using contrastive learning to reduce the reliance on data volume. On the other hand, fine-grained recognition is integrated into the MMFGT to focus attention on finely differentiated areas of pest images to improve recognition accuracy. In addition, the MMFGT further improves the performance in pest recognition by using the joint multimodal information from the pest’s image and natural language description. Extensive experimental results demonstrate that the MMFGT obtains more competitive results compared to other excellent models, such as ResNet, ViT, SwinT, DINO, and EsViT, in pest recognition tasks, with recognition accuracy up to 98.12% and achieving 5.92% higher accuracy compared to the state-of-the-art DINO method for the baseline.

26

Zhang, Tianze. "Investigation on task effect analysis and optimization strategy of multimodal large model based on Transformers architecture for various languages." Applied and Computational Engineering 47, no. 1 (March 15, 2024): 213–24. http://dx.doi.org/10.54254/2755-2721/47/20241374.

Abstract:

As artificial intelligence technology advances swiftly, the Transformers architecture has emerged as a pivotal model for handling multimodal data. This investigation delves into the impact of multimodal large-scale models utilizing the Transformers architecture for addressing various linguistic tasks, along with proposing optimization approaches tailored to this context. Through a series of experiments, this study scrutinized the performance of these models on multilingual datasets, engaging in a comprehensive analysis of the key determinants influencing their effectiveness. First, several models based on the Transformers architecture, including ERNIE, GPT, ViT, and VisualBERT, are pre-trained on the same corpus, and a series of tests is carried out on these models in English, Chinese, Spanish, and other languages. Comparing the models shows significant performance differences when dealing with tasks in different languages. Further, through analysis and experimental verification, this paper proposes a series of optimization strategies for different languages, including annotation methods for language-specific datasets, incremental fine-tuning, increasing the size of datasets, and multi-task learning. Experiments show that these methods achieve remarkable results, and the paper puts forward future research directions.

27

Wang, Zhecan, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, and Shih-Fu Chang. "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 5 (June 28, 2022): 5914–22. http://dx.doi.org/10.1609/aaai.v36i5.20536.

Abstract:

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.

28

Wei, Jiaqi, Bin Jiang, and Yanxia Zhang. "Identification of Blue Horizontal Branch Stars with Multimodal Fusion." Publications of the Astronomical Society of the Pacific 135, no. 1050 (August 1, 2023): 084501. http://dx.doi.org/10.1088/1538-3873/acea43.

Abstract:

Blue Horizontal Branch stars (BHBs) are ideal tracers to probe the global structure of the Milky Way (MW), and the increased size of the BHB star sample could be helpful to accurately calculate the MW’s enclosed mass and kinematics. Large survey telescopes have produced an increasing number of astronomical images and spectra. However, traditional methods of identifying BHBs are limited in dealing with the large scale of astronomical data. A fast and efficient way of identifying BHBs can provide a more significant sample for further analysis and research. Therefore, in order to fully use the various data observed and further improve the identification accuracy of BHBs, we have innovatively proposed and implemented a Bi-level attention mechanism-based Transformer multimodal fusion model, called Bi-level Attention in the Transformer with Multimodality (BATMM). The model consists of a spectrum encoder, an image encoder, and a Transformer multimodal fusion module. The Transformer enables the effective fusion of data from two modalities, namely image and spectrum, by using the proposed Bi-level attention mechanism, including cross-attention and self-attention. As a result, the information from the different modalities complements each other, thus improving the accuracy of the identification of BHBs. The experimental results show that the F1 score of the proposed BATMM is 94.78%, which is 21.77% and 2.76% higher than the image and spectral unimodality, respectively. It is therefore demonstrated that higher identification accuracy of BHBs can be achieved by means of using data from multiple modalities and employing an efficient data fusion strategy.

29

Sams, Andrew Steven, and Amalia Zahra. "Multimodal music emotion recognition in Indonesian songs based on CNN-LSTM, XLNet transformers." Bulletin of Electrical Engineering and Informatics 12, no. 1 (February 1, 2023): 355–64. http://dx.doi.org/10.11591/eei.v12i1.4231.

Abstract:

Music carries emotional information and allows the listener to feel the emotions contained in the music. This study proposes a multimodal music emotion recognition (MER) system using Indonesian song and lyrics data. In the proposed multimodal system, the audio data will use the mel spectrogram feature, and the lyrics feature will be extracted by going through the tokenizing process from XLNet. Convolutional long short term memory network (CNN-LSTM) performs the audio classification task, while XLNet transformers performs the lyrics classification task. The outputs of the two classification tasks are probability weight and actual prediction with the value of positive, neutral, and negative emotions, which are then combined using the stacking ensemble method. The combined output will be trained into an artificial neural network (ANN) model to get the best probability weight output. The multimodal system achieves the best performance with an accuracy of 80.56%. The results showed that the multimodal method of recognizing musical emotions gave better performance than the single modal method. In addition, hyperparameter tuning can affect the performance of multimodal systems.

30

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (February 28, 2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Abstract:

Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a classification head. The SwinT + BiLSTM has performed better when compared to the ViT + BiLSTM across all the dimensions. The performance of the models has improved significantly when the contextual embeddings from DistilBert replace the custom embeddings. We have achieved the highest recall of 0.631 by combining outputs of four models using the soft voting technique.
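
The soft voting step mentioned at the end of this abstract can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' code: each model is assumed to output a probability per class, the `soft_vote` name and the optional `weights` parameter are hypothetical, and the probabilities below are made up for the example.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-class probabilities across models and return (label, averages).

    prob_lists: one list of class probabilities per model.
    weights: optional per-model weights (hypothetical; defaults to equal weights).
    """
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg

# Four illustrative models; class 0 = not offensive, class 1 = offensive.
model_probs = [[0.60, 0.40], [0.45, 0.55], [0.30, 0.70], [0.55, 0.45]]
label, avg = soft_vote(model_probs)
print(label, avg)
```

In this made-up case the four models split two against two on their individual argmax labels, so a hard majority vote would tie, while averaging probabilities lets the more confident models decide.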

31

Nadal, Clement, and Francois Pigache. "Multimodal electromechanical model of piezoelectric transformers by Hamilton's principle." IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 56, no. 11 (November 2009): 2530–43. http://dx.doi.org/10.1109/tuffc.2009.1340.

32

Chen, Yunfan, Jinxing Ye, and Xiangkui Wan. "TF-YOLO: A Transformer–Fusion-Based YOLO Detector for Multimodal Pedestrian Detection in Autonomous Driving Scenes." World Electric Vehicle Journal 14, no. 12 (December 18, 2023): 352. http://dx.doi.org/10.3390/wevj14120352.

Abstract:

Recent research demonstrates that the fusion of multimodal images can improve the performance of pedestrian detectors under low-illumination environments. However, existing multimodal pedestrian detectors cannot adapt to the variability of environmental illumination. When the lighting conditions of the application environment do not match the illumination conditions of the experimental data, the detection performance is likely to degrade significantly. To resolve this problem, we propose a novel transformer–fusion-based YOLO detector to detect pedestrians under various illumination environments, such as nighttime, smog, and heavy rain. Specifically, we develop a novel transformer–fusion module embedded in a two-stream backbone network to robustly integrate the latent interactions between multimodal images (visible and infrared images). This enables the multimodal pedestrian detector to adapt to changing illumination conditions. Experimental results on two well-known datasets demonstrate that the proposed approach exhibits superior performance. The proposed TF-YOLO drastically improves the average precision of the state-of-the-art approach by 3.3% and reduces the miss rate of the state-of-the-art approach by about 6% on the challenging multi-scenario multi-modality dataset.

33

Pezzelle, Sandro, Ece Takmaz, and Raquel Fernández. "Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation." Transactions of the Association for Computational Linguistics 9 (2021): 1563–79. http://dx.doi.org/10.1162/tacl_a_00443.

Abstract:

This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models to align language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.

34

Zhang, Yingjie. "The current status and prospects of transformer in multimodality." Applied and Computational Engineering 11, no. 1 (September 25, 2023): 224–30. http://dx.doi.org/10.54254/2755-2721/11/20230240.

Abstract:

At present, the attention mechanism represented by the transformer has greatly promoted the development of natural language processing (NLP) and computer vision (CV). However, in the multimodal field, the application of the attention mechanism still mainly focuses on extracting the features of different types of data (such as text and image) and then fusing these features. With the increasing scale of models and the instability of Internet data, feature fusion has struggled to solve the growing variety of multimodal problems, and the multimodal field has always lacked a model that can uniformly handle all types of data. In this paper, we first take the CV and NLP fields as examples to review various derived models of the transformer. Then, based on the mechanism of word embedding and image embedding, we discuss how embeddings with different granularity are handled uniformly under the attention mechanism in multimodal scenes. Further, we show that this mechanism is not limited to CV and NLP: a truly unified model will be able to handle tasks across data types through pre-training and fine-tuning. Finally, regarding the specific implementation of the unified model, this paper lists several cases and analyzes valuable research directions in related fields.

35

Hasan, Md Kamrul, Sangwu Lee, Wasifur Rahman, Amir Zadeh, Rada Mihalcea, Louis-Philippe Morency, and Ehsan Hoque. "Humor Knowledge Enriched Transformer for Understanding Multimodal Humor." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (May 18, 2021): 12972–80. http://dx.doi.org/10.1609/aaai.v35i14.17534.

Abstract:

Recognizing humor from a video utterance requires understanding the verbal and non-verbal components as well as incorporating the appropriate context and external knowledge. In this paper, we propose Humor Knowledge enriched Transformer (HKT) that can capture the gist of a multimodal humorous expression by integrating the preceding context and external knowledge. We incorporate humor centric external knowledge into the model by capturing the ambiguity and sentiment present in the language. We encode all the language, acoustic, vision, and humor centric features separately using Transformer based encoders, followed by a cross attention layer to exchange information among them. Our model achieves 77.36% and 79.41% accuracy in humorous punchline detection on UR-FUNNY and MUStaRD datasets -- achieving a new state-of-the-art on both datasets with the margin of 4.93% and 2.94% respectively. Furthermore, we demonstrate that our model can capture interpretable, humor-inducing patterns from all modalities.

36

Zhang, Xiaojuan, Yongxiu Zhou, Peihao Peng, and Guoyan Wang. "A Novel Multimodal Species Distribution Model Fusing Remote Sensing Images and Environmental Features." Sustainability 14, no. 21 (October 28, 2022): 14034. http://dx.doi.org/10.3390/su142114034.

Abstract:

Species distribution models (SDMs) are critical in conservation decision-making and ecological or biogeographical inference. Accurately predicting species distribution can facilitate resource monitoring and management for sustainable regional development. Currently, species distribution models usually use a single source of information as input for the model. To determine a solution to the lack of accuracy of the species distribution model with a single information source, we propose a multimodal species distribution model that can input multiple information sources simultaneously. We used ResNet50 and Transformer network structures as the backbone for multimodal data modeling. The model’s accuracy was tested using the GEOLIFE2020 dataset, and our model’s accuracy is state-of-the-art (SOTA). We found that the prediction accuracy of the multimodal species distribution model with multiple data sources of remote sensing images, environmental variables, and latitude and longitude information as inputs (29.56%) was higher than that of the model with only remote sensing images or environmental variables as inputs (25.72% and 21.68%, respectively). We also found that using a Transformer network structure to fuse data from multiple sources can significantly improve the accuracy of multimodal models. We present a novel multimodal model that fuses multiple sources of information as input for species distribution prediction to advance the research progress of multimodal models in the field of ecology.

37

Zhang, Guihao, and Jiangzhong Cao. "Feature Fusion Based on Transformer for Cross-modal Retrieval." Journal of Physics: Conference Series 2558, no. 1 (August 1, 2023): 012012. http://dx.doi.org/10.1088/1742-6596/2558/1/012012.

Abstract:

With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has gradually become a hot area of research. As one of the important branches of multimodal retrieval, image-text retrieval aims to design a model to learn and align two modal data, image and text, in order to build a bridge of semantic association between the two heterogeneous data, so as to achieve unified alignment and retrieval. The current mainstream image-text cross-modal retrieval approaches have made good progress by designing a deep learning-based model to find potential associations between different modal data. In this paper, we design a transformer-based feature fusion network to fuse the information of two modalities in the feature extraction process, which can enrich the semantic connection between the modalities. Meanwhile, we conduct experiments on the benchmark dataset Flickr30k and get competitive results, where recall at 10 reaches 96.2% in image-to-text retrieval.
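
For readers unfamiliar with the "recall at 10" metric reported in this abstract, the sketch below shows one common convention for image-to-text retrieval: a query counts as a hit if at least one of its ground-truth captions appears among the top K ranked results, and the score is the fraction of queries that hit. Whether the paper uses exactly this convention is an assumption; the function names and the toy caption ids are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant id appears in the top-k ranked ids, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0

def recall_at_k(queries, k):
    """Fraction of queries with at least one relevant item in the top k.

    queries: list of (ranked_ids, relevant_ids) pairs, one pair per query image.
    """
    return sum(hit_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)

# Toy example: three image queries with ranked caption ids and ground truth sets.
queries = [
    (["c3", "c1", "c9"], {"c1", "c2"}),
    (["c7", "c8", "c4"], {"c4"}),
    (["c5", "c6", "c0"], {"c2"}),
]
print(recall_at_k(queries, 2))
print(recall_at_k(queries, 3))
```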

38

Park, Junhee, and Nammee Moon. "Design and Implementation of Attention Depression Detection Model Based on Multimodal Analysis." Sustainability 14, no. 6 (March 18, 2022): 3569. http://dx.doi.org/10.3390/su14063569.

Abstract:

Depression is becoming a social problem as the number of sufferers steadily increases. In this regard, this paper proposes a multimodal analysis-based attention depression detection model that simultaneously uses voice and text data obtained from users. The proposed models consist of Bidirectional Encoder Representations from Transformers-Convolutional Neural Network (BERT-CNN) for natural language analysis, CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM) for voice signal processing, and multimodal analysis and fusion models for depression detection. The experiments in this paper are conducted using the DAIC-WOZ dataset, a collection of clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety and post-traumatic stress. In preprocessing, the voice data were cut to a length of 4 seconds and the number of mel filters was set to 128. For text data, we used the subject text data of the interview and derived the embedding vector using a transformers tokenizer. Based on each data set, the BERT-CNN and CNN-BiLSTM proposed in this paper were applied and combined to classify depression. Through experiments, the accuracy and loss were compared for the cases of using multimodal data and using single-modality data, and it was confirmed that the existing low accuracy was improved.

39

Qi, Qingfu, Liyuan Lin, Rui Zhang, and Chengrong Xue. "MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis." IEEE Access 10 (2022): 28750–59. http://dx.doi.org/10.1109/access.2022.3157712.

40

Li, Lei, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen, and Ningyu Zhang. "On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 16254–55. http://dx.doi.org/10.1609/aaai.v37i13.26987.

Abstract:

Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we take an in-depth empirical analysis that indicates the inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, the visual shuffle experiments illustrate that the current approaches may not take full advantage of visual information. Based on the above observation, we further propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Codes are available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.

41

Zhang, Junyan. "Research on transformer and attention in applied algorithms." Applied and Computational Engineering 13, no. 1 (October 23, 2023): 221–28. http://dx.doi.org/10.54254/2755-2721/13/20230737.

Abstract:

The transformer is an encoder-decoder-based structure and model for deep learning that completely utilizes the self-attention mechanism. It has gained remarkable success in natural language processing and computer vision and is becoming the predominant research direction. This study first analyzes the transformer and attention mechanism, summarizes their advantages, and explores how they help the recommendation algorithm dynamically focus on specific parts of the input that are helpful to perform the current recommendation task. After analyzing the framework of the attention mechanism network and its weight computation for the data received, the study presents a transformer detection approach based on deformable convolution to further enhance the practicality of objects in natural situations and the precision of object recognition. It also analyzes how the transformer works in the generative pre-trained transformer. These algorithms illustrate the efficacy and robustness of the transformer, indicating that the transformer that incorporates the attention mechanism may satisfy the requirements of the majority of deep learning tasks. However, the unpredictability of demands, the exponential growth of information, and other issues will continue to make it challenging to deal with global interaction mechanisms and a unified framework for multimodal data.

42

Gao, Jialin, Jianyu Chen, Jiaqi Wei, Bin Jiang, and A.-Li Luo. "Deep Multimodal Networks for M-type Star Classification with Paired Spectrum and Photometric Image." Publications of the Astronomical Society of the Pacific 135, no. 1046 (April 1, 2023): 044503. http://dx.doi.org/10.1088/1538-3873/acc7ca.

Abstract:

Traditional stellar classification methods include spectral and photometric classification separately. Although satisfactory results can be achieved, the accuracy could be improved. In this paper, we pioneer a novel approach to deeply fuse the spectra and photometric images of the sources in an advanced multimodal network to enhance the model’s discriminatory ability. We use Transformer as the fusion module and apply a spectrum–image contrastive loss function to enhance the consistency of the spectrum and photometric image of the same source in two different feature spaces. We perform M-type stellar subtype classification on two data sets with high and low signal-to-noise ratio (S/N) spectra and corresponding photometric images, and the F1-score achieves 95.65% and 90.84%, respectively. In our experiments, we prove that our model effectively utilizes the information from photometric images and is more accurate than advanced spectrum and photometric image classifiers. Our contributions can be summarized as follows: (1) We propose an innovative idea for stellar classification that allows the model to simultaneously consider information from spectra and photometric images. (2) We discover the challenge of fusing low-S/N spectra and photometric images in the Transformer and provide a solution. (3) The effectiveness of Transformer for spectral classification is discussed for the first time and will inspire more Transformer-based spectral classification models.

43

Zong, Daoming, and Shiliang Sun. "McOmet: Multimodal Fusion Transformer for Physical Audiovisual Commonsense Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 5 (June 26, 2023): 6621–29. http://dx.doi.org/10.1609/aaai.v37i5.25813.

Abstract:

Physical commonsense reasoning is essential for building reliable and interpretable AI systems, which involves a general understanding of the physical properties and affordances of everyday objects, how these objects can be manipulated, and how they interact with others. It is fundamentally a multi-modal task, as physical properties are manifested through multiple modalities, including vision and acoustics. In this work, we present a unified framework, named Multimodal Commonsense Transformer (MCOMET), for physical audiovisual commonsense reasoning. MCOMET has two intriguing properties: i) it fully mines higher-ordered temporal relationships across modalities (e.g., pairs, triplets, and quadruplets); and ii) it restricts the cross-modal flow through the feature collection and propagation mechanism along with tight fusion bottlenecks, forcing the model to attend the most relevant parts in each modality and suppressing the dissemination of noisy information. We evaluate our model on a very recent public benchmark, PACS. Results show that MCOMET significantly outperforms a variety of strong baselines, revealing powerful multi-modal commonsense reasoning capabilities.

44

JayaLakshmi, Gundabathina, Abburi Madhuri, Deepak Vasudevan, Balamuralikrishna Thati, Uddagiri Sirisha, and Surapaneni Phani Praveen. "Effective Disaster Management Through Transformer-Based Multimodal Tweet Classification." Revue d'Intelligence Artificielle 37, no. 5 (October 31, 2023): 1263–72. http://dx.doi.org/10.18280/ria.370519.

45

Liu, Biyuan, Huaixin Chen, Kun Li, and Michael Ying Yang. "Transformer-based multimodal change detection with multitask consistency constraints." Information Fusion 108 (August 2024): 102358. http://dx.doi.org/10.1016/j.inffus.2024.102358.

46

Abiyev, Rahib H., Mohamad Ziad Altabel, Manal Darwish, and Abdulkader Helwan. "A Multimodal Transformer Model for Recognition of Images from Complex Laparoscopic Surgical Videos." Diagnostics 14, no. 7 (March 23, 2024): 681. http://dx.doi.org/10.3390/diagnostics14070681.

Abstract:

The determination of the potential role and advantages of artificial intelligence-based models in the field of surgery remains uncertain. This research marks an initial stride towards creating a multimodal model, inspired by the Video-Audio-Text Transformer, that aims to reduce negative occurrences and enhance patient safety. The model employs text and image embedding state-of-the-art models (ViT and BERT) to assess their efficacy in extracting the hidden and distinct features from the surgery video frames. These features are then used as inputs for convolution-free Transformer architectures to extract comprehensive multidimensional representations. A joint space is then used to combine the text and image features extracted from both Transformer encoders. This joint space ensures that the relationships between the different modalities are preserved during the combination process. The entire model was trained and tested on laparoscopic cholecystectomy (LC) videos encompassing various levels of complexity. Experimentally, a mean accuracy of 91.0%, a precision of 81%, and a recall of 83% were reached by the model when tested on 30 videos out of 80 from the Cholec80 dataset.

47

Chaudhari, Aayushi, Chintan Bhatt, Achyut Krishna, and Carlos M. Travieso-González. "Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning." Electronics 12, no. 2 (January 5, 2023): 288. http://dx.doi.org/10.3390/electronics12020288.

Abstract:

Emotion recognition is a very challenging research field due to its complexity, as individual differences in cognitive–emotional cues are expressed in a wide variety of ways, including language, expressions, and speech. If we use video as the input, we can acquire a plethora of data for analyzing human emotions. In this research, we use features derived from separately pretrained self-supervised learning models to combine text, audio (speech), and visual data modalities. The fusion of features and representation is the biggest challenge in multimodal emotion classification research. Because of the large dimensionality of self-supervised learning characteristics, we present a unique transformer and attention-based fusion method for incorporating multimodal self-supervised learning features that achieved an accuracy of 86.40% for multimodal emotion classification.

48

Xu, Zhen, David R. So, and Andrew M. Dai. "MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10532–40. http://dx.doi.org/10.1609/aaai.v35i12.17260.

Abstract:

One important challenge of applying deep learning to electronic health records (EHR) is the complexity of their multimodal structure. EHR usually contains a mixture of structured (codes) and unstructured (free-text) data with sparse and irregular longitudinal features, all of which doctors utilize when making decisions. In the deep learning regime, determining how different modality representations should be fused together is a difficult problem, which is often addressed by handcrafted modeling and intuition. In this work, we extend state-of-the-art neural architecture search (NAS) methods and propose MUltimodal Fusion Architecture SeArch (MUFASA) to simultaneously search across multimodal fusion strategies and modality-specific architectures for the first time. We demonstrate empirically that our MUFASA method outperforms established unimodal NAS on public EHR data with comparable computation costs. In addition, MUFASA produces architectures that outperform Transformer and Evolved Transformer. Compared with these baselines on CCS diagnosis code prediction, our discovered models improve top-5 recall from 0.88 to 0.91 and demonstrate the ability to generalize to other EHR tasks. Studying our top architecture in depth, we provide empirical evidence that MUFASA's improvements are derived from its ability to both customize modeling for each modality and find effective fusion strategies.
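
Illustrative note: the abstract above reports results as top-5 recall. The sketch below shows one common way such a metric is computed, assuming a single true code per example (an assumption; the abstract does not state the exact evaluation protocol). The function name `top_k_recall` and the toy data layout are hypothetical.

```python
def top_k_recall(scores, true_labels, k=5):
    """Fraction of examples whose true label is among the k highest-scored classes.

    scores: list of per-class score lists, one list per example.
    true_labels: list of true class indices (assumed: one label per example).
    """
    hits = 0
    for row, label in zip(scores, true_labels):
        # Indices of the k classes with the highest scores for this example.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)
```

Under this definition, a top-5 recall of 0.91 would mean that for 91% of examples the true code appears among the model's five highest-scored predictions.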

49

Ilmi, Yuslimu, Pratiwi Retnaningdyah, and Ahmad Munir. "Exploring Digital Multimodal Text in EFL Classroom: Transformed Practice in Multiliteracies Pedagogy." Linguistic, English Education and Art (LEEA) Journal 4, no. 1 (December 28, 2020): 99–108. http://dx.doi.org/10.31539/leea.v4i1.1416.

Abstract:

This study investigates EFL students’ composition of digital multimodal text. Forty-four students were recruited from the tenth grade of a private senior high school in Sidoarjo. They were divided into seven groups and allowed to choose their own topic for the digital multimodal project, an advertisement video. Based on the theory of multimodal analysis, this study examines both the students’ processes and their products. A qualitative case study was chosen as the research design, and document analysis as the data collection technique. The findings show that all the student groups used multiple modes in creating their advertisement videos. Additionally, the study revealed that the quality of the students’ projects depends on two things: the ability of the group members and their collaboration on the project. Keywords: Multimodal Text, Multiliteracies Pedagogy, Transformed Practice, EFL Classroom, Senior High School.

50

Ammour, Nassim, Yakoub Bazi, and Naif Alajlan. "Multimodal Approach for Enhancing Biometric Authentication." Journal of Imaging 9, no. 9 (August 22, 2023): 168. http://dx.doi.org/10.3390/jimaging9090168.

Abstract:

Unimodal biometric systems rely on a single source or unique individual biological trait for measurement and examination. Fingerprint-based biometric systems are the most common, but they are vulnerable to presentation attacks or spoofing when a fake fingerprint is presented to the sensor. To address this issue, we propose an enhanced biometric system based on a multimodal approach using two types of biological traits. We propose to combine fingerprint and Electrocardiogram (ECG) signals to mitigate spoofing attacks. Specifically, we design a multimodal deep learning architecture that accepts fingerprints and ECG as inputs and fuses the feature vectors using stacking and channel-wise approaches. The feature extraction backbone of the architecture is based on data-efficient transformers. The experimental results demonstrate the promising capabilities of the proposed approach in enhancing the robustness of the system to presentation attacks.
