Journal articles on the topic 'Multimodal Transformers'

1

Jaiswal, Sushma, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal. "Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search." International Journal of Intelligent Systems and Applications 16, no. 2 (April 8, 2024): 53–61. http://dx.doi.org/10.5815/ijisa.2024.02.05.

Abstract:
Deep learning has improved image captioning. Transformer, a neural network architecture built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. Convolutional neural networks (CNNs) extracted image features and RNNs or LSTM networks generated captions in traditional image captioning. This method often has information bottlenecks and trouble capturing long-range dependencies. Transformer architecture revolutionized natural language processing with its attention strategy and parallel processing. Researchers used Transformers' language success to solve image captioning problems. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer architecture's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and CNN-Transformer hybrid models are discussed. We also discuss pre-training, fine-tuning, and reinforcement learning to improve caption quality. Transformer-based image captioning difficulties, trends, and future approaches are also examined. Multimodal fusion, visual-text alignment, and caption interpretability are challenges. We expect research to address these issues and apply Transformer-based image captioning to medical imaging and distant sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interactions.
2

Bayat, Nasrin, Jong-Hwan Kim, Renoa Choudhury, Ibrahim F. Kadhim, Zubaidah Al-Mashhadani, Mark Aldritz Dela Virgen, Reuben Latorre, Ricardo De La Paz, and Joon-Hyuk Park. "Vision Transformer Customized for Environment Detection and Collision Prediction to Assist the Visually Impaired." Journal of Imaging 9, no. 8 (August 15, 2023): 161. http://dx.doi.org/10.3390/jimaging9080161.

Abstract:
This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector of all identified objects from the vision transformer and to detect objects that are likely to intersect with the user’s walking path. Audio and vibrotactile feedback modules are integrated to convey collision warning through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos consisting of 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while the experimental validation concerning their usability and efficiency stands as an upcoming endeavor. The demonstrated performance of the vision transformer and the presented algorithms in conjunction with the multimodal feedback modules show promising prospects of its feasibility and applicability for the navigation assistance of individuals with vision impairment.
3

Hendricks, Lisa Anne, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers." Transactions of the Association for Computational Linguistics 9 (2021): 570–85. http://dx.doi.org/10.1162/tacl_a_00385.

Abstract:
Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
4

Shao, Zilei. "A literature review on multimodal deep learning models for detecting mental disorders in conversational data: Pre-transformer and transformer-based approaches." Applied and Computational Engineering 18, no. 1 (October 23, 2023): 215–24. http://dx.doi.org/10.54254/2755-2721/18/20230993.

Abstract:
This paper provides a comprehensive review of multimodal deep learning models that utilize conversational data to detect mental health disorders. In addition to discussing models based on the Transformer, such as BERT (Bidirectional Encoder Representations from Transformers), this paper addresses models that existed prior to the Transformer, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The paper covers the application of these models in the construction of multimodal deep learning systems to detect mental disorders. In addition, the difficulties encountered by multimodal deep learning systems are brought up. Furthermore, the paper proposes research directions for enhancing the performance and robustness of these models in mental health applications. By shedding light on the potential of multimodal deep learning in mental health care, this paper aims to foster further research and development in this critical domain.
5

Wang, LeiChen, Simon Giebenhain, Carsten Anklam, and Bastian Goldluecke. "Radar Ghost Target Detection via Multimodal Transformers." IEEE Robotics and Automation Letters 6, no. 4 (October 2021): 7758–65. http://dx.doi.org/10.1109/lra.2021.3100176.

6

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Abstract:
In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate experiments.
7

Sun, Qixuan, Nianhua Fang, Zhuo Liu, Liang Zhao, Youpeng Wen, and Hongxiang Lin. "HybridCTrm: Bridging CNN and Transformer for Multimodal Brain Image Segmentation." Journal of Healthcare Engineering 2021 (October 1, 2021): 1–10. http://dx.doi.org/10.1155/2021/7467261.

Abstract:
Multimodal medical image segmentation is always a critical problem in medical image segmentation. Traditional deep learning methods utilize fully CNNs for encoding given images, thus leading to deficiency of long-range dependencies and bad generalization performance. Recently, a sequence of Transformer-based methodologies emerges in the field of image processing, which brings great generalization and performance in various tasks. On the other hand, traditional CNNs have their own advantages, such as rapid convergence and local representations. Therefore, we analyze a hybrid multimodal segmentation method based on Transformers and CNNs and propose a novel architecture, HybridCTrm network. We conduct experiments using HybridCTrm on two benchmark datasets and compare with HyperDenseNet, a network based on fully CNNs. Results show that our HybridCTrm outperforms HyperDenseNet on most of the evaluation metrics. Furthermore, we analyze the influence of the depth of Transformer on the performance. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.
8

Tian, Yu, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, and Faouzi Bader. "Multimodal transformers for wireless communications: A case study in beam prediction." ITU Journal on Future and Evolving Technologies 4, no. 3 (September 5, 2023): 461–71. http://dx.doi.org/10.52953/jwra8095.

Abstract:
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained from large sequences of multimodality wireless data, on fine-tuning for multiple downstream radio network tasks.
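The abstract above names focal loss and an exponential moving average (EMA) as its remedies for imbalanced data but gives no settings. As a minimal sketch of the two standard formulas, and not the authors' implementation, the Python below assumes hypothetical values for `gamma`, `alpha`, and `decay`, and assumes the EMA is applied to a flat list of model parameters.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities (should sum to 1)
    target : index of the true class
    gamma, alpha : illustrative defaults, not values taken from the paper
    """
    # Clamp to avoid log(0) on a confidently wrong prediction.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def ema_update(average, current, decay=0.99):
    """One EMA step over a flat list of parameters (decay is an assumed value)."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(average, current)]


# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.2).
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.2, 0.8], target=0)
print(easy, hard)
```

With `gamma=0` the expression reduces to ordinary cross-entropy, which is a quick way to check an implementation.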
9

Chen, Yu, Ming Yin, Yu Li, and Qian Cai. "CSU-Net: A CNN-Transformer Parallel Network for Multimodal Brain Tumour Segmentation." Electronics 11, no. 14 (July 16, 2022): 2226. http://dx.doi.org/10.3390/electronics11142226.

Abstract:
Medical image segmentation techniques are vital to medical image processing and analysis. Considering the significant clinical applications of brain tumour image segmentation, it represents a focal point of medical image segmentation research. Most of the work in recent times has been centred on Convolutional Neural Networks (CNN) and Transformers. However, CNN has some deficiencies in modelling long-distance information transfer and contextual processing information, while Transformer is relatively weak in acquiring local information. To overcome the above defects, we propose a novel segmentation network with an “encoder–decoder” architecture, namely CSU-Net. The encoder consists of two parallel feature extraction branches based on CNN and Transformer, respectively, in which the features of the same size are fused. The decoder has a dual Swin Transformer decoder block with two learnable parameters for feature upsampling. The features from multiple resolutions in the encoder and decoder are merged via skip connections. On the BraTS 2020, our model achieves 0.8927, 0.8857, and 0.8188 for the Whole Tumour (WT), Tumour Core (TC), and Enhancing Tumour (ET), respectively, in terms of Dice scores.
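For readers unfamiliar with the metric reported above, the Dice score of a predicted mask P against a reference mask T is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for binary masks flattened to 0/1 sequences. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the BraTS evaluation pipeline itself is not reproduced here.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Two predicted voxels, one reference voxel, one voxel in common.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```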
10

Wang, Zhaokai, Renda Bao, Qi Wu, and Si Liu. "Confidence-aware Non-repetitive Multimodal Transformers for TextCaps." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 4 (May 18, 2021): 2835–43. http://dx.doi.org/10.1609/aaai.v35i4.16389.

Abstract:
When describing an image, reading text in the visual scene is crucial to understand the key information. Recent work explores the TextCaps task, i.e. image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover them in generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose a Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of a reading, a reasoning and a generation modules, in which Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated word in captions. Our model outperforms state-of-the-art models on TextCaps dataset, improving from 81.0 to 93.0 in CIDEr. Our source code is publicly available.
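The abstract does not spell out how the repetition mask works, so the following is only one plausible reading: at each decoding step, the scores of tokens that already appear in the caption are set to negative infinity so they cannot be chosen again. The function names and the `exempt_ids` parameter (for words that may legitimately repeat) are hypothetical and are not taken from the paper.

```python
import math


def apply_repetition_mask(logits, generated_ids, exempt_ids=frozenset()):
    """Return a copy of logits with already-generated tokens masked out."""
    masked = list(logits)
    for token_id in set(generated_ids):
        if token_id not in exempt_ids:
            masked[token_id] = -math.inf
    return masked


def greedy_next_token(logits, generated_ids, exempt_ids=frozenset()):
    """Pick the highest-scoring token that survives the mask."""
    masked = apply_repetition_mask(logits, generated_ids, exempt_ids)
    return max(range(len(masked)), key=masked.__getitem__)


logits = [2.0, 1.0, 0.5]
# Token 0 scores highest but was already emitted, so token 1 is chosen.
print(greedy_next_token(logits, generated_ids=[0]))
```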
11

Xu, Yifan, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, and Changsheng Xu. "Transformers in computational visual media: A survey." Computational Visual Media 8, no. 1 (October 27, 2021): 33–62. http://dx.doi.org/10.1007/s41095-021-0247-3.

Abstract:
Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
12

Abdine, Hadi, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. "Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 10 (March 24, 2024): 10757–65. http://dx.doi.org/10.1609/aaai.v38i10.28948.

Abstract:
In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs), in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.
13

Sams, Andrew Steven, and Amalia Zahra. "Multimodal music emotion recognition in Indonesian songs based on CNN-LSTM, XLNet transformers." Bulletin of Electrical Engineering and Informatics 12, no. 1 (February 1, 2023): 355–64. http://dx.doi.org/10.11591/eei.v12i1.4231.

Abstract:
Music carries emotional information and allows the listener to feel the emotions contained in the music. This study proposes a multimodal music emotion recognition (MER) system using Indonesian song and lyrics data. In the proposed multimodal system, the audio data will use the mel spectrogram feature, and the lyrics feature will be extracted by going through the tokenizing process from XLNet. Convolutional long short term memory network (CNN-LSTM) performs the audio classification task, while XLNet transformers performs the lyrics classification task. The outputs of the two classification tasks are probability weight and actual prediction with the value of positive, neutral, and negative emotions, which are then combined using the stacking ensemble method. The combined output will be trained into an artificial neural network (ANN) model to get the best probability weight output. The multimodal system achieves the best performance with an accuracy of 80.56%. The results showed that the multimodal method of recognizing musical emotions gave better performance than the single modal method. In addition, hyperparameter tuning can affect the performance of multimodal systems.
14

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (February 28, 2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Abstract:
Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a classification head. The SwinT + BiLSTM has performed better when compared to the ViT + BiLSTM across all the dimensions. The performance of the models has improved significantly when the contextual embeddings from DistilBert replace the custom embeddings. We have achieved the highest recall of 0.631 by combining outputs of four models using the soft voting technique.
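Soft voting, mentioned at the end of the abstract above, averages the class probabilities produced by several models and predicts the class with the highest mean. The sketch below shows the idea with four made-up probability vectors for a binary offensive / not-offensive task. The numbers and the function name are illustrative assumptions, not outputs of the paper's models.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models and return (label, averages)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averages = [
        sum(probs[c] for probs in prob_lists) / n_models
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=averages.__getitem__)
    return label, averages


# Hypothetical outputs of four models: [P(not offensive), P(offensive)].
model_outputs = [
    [0.60, 0.40],
    [0.45, 0.55],
    [0.30, 0.70],
    [0.55, 0.45],
]
label, averages = soft_vote(model_outputs)
print(label, averages)
```

In this made-up case a majority (hard) vote would tie at two models each, while soft voting lets the one confident model tip the decision.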
15

Nadal, Clement, and Francois Pigache. "Multimodal electromechanical model of piezoelectric transformers by Hamilton's principle." IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 56, no. 11 (November 2009): 2530–43. http://dx.doi.org/10.1109/tuffc.2009.1340.

16

Pezzelle, Sandro, Ece Takmaz, and Raquel Fernández. "Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation." Transactions of the Association for Computational Linguistics 9 (2021): 1563–79. http://dx.doi.org/10.1162/tacl_a_00443.

Abstract:
This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models to align language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.
17

Liang, Yi, Turdi Tohti, and Askar Hamdulla. "False Information Detection via Multimodal Feature Fusion and Multi-Classifier Hybrid Prediction." Algorithms 15, no. 4 (March 29, 2022): 119. http://dx.doi.org/10.3390/a15040119.

Abstract:
In the existing false information detection methods, the quality of the extracted single-modality features is low, the information between different modalities cannot be fully fused, and the original information will be lost when the information of different modalities is fused. This paper proposes a false information detection via multimodal feature fusion and multi-classifier hybrid prediction. In this method, first, bidirectional encoder representations for transformers are used to extract the text features, and Swin-transformer is used to extract the picture features, and then, the trained deep autoencoder is used as an early fusion method of multimodal features to fuse text features and visual features, and the low-dimensional features are taken as the joint features of the multimodalities. The original features of each modality are concatenated into the joint features to reduce the loss of original information. Finally, the text features, image features and joint features are processed by three classifiers to obtain three probability distributions, and the three probability distributions are added proportionally to obtain the final prediction result. Compared with the attention-based multimodal factorized bilinear pooling, the model achieves 4.3% and 1.2% improvement in accuracy on Weibo dataset and Twitter dataset. The experimental results show that the proposed model can effectively integrate multimodal information and improve the accuracy of false information detection.
18

Zhang, Tianze. "Investigation on task effect analysis and optimization strategy of multimodal large model based on Transformers architecture for various languages." Applied and Computational Engineering 47, no. 1 (March 15, 2024): 213–24. http://dx.doi.org/10.54254/2755-2721/47/20241374.

Abstract:
As artificial intelligence technology advances swiftly, the Transformers architecture has emerged as a pivotal model for handling multimodal data. This investigation delves into the impact of multimodal large-scale models utilizing the Transformers architecture for addressing various linguistic tasks, along with proposing optimization approaches tailored to this context. Through a series of experiments, this study scrutinized the performance of these models on multilingual datasets, engaging in a comprehensive analysis of the key determinants influencing their effectiveness. Firstly, several models of transformers architecture are pre trained on the same corpus, including ERNIE, GPT, ViT, VisualBERT, and a series of tests are carried out on these models in English, Chinese, Spanish and other languages. By comparing the performance of different models, it is found that these models show significant performance differences when dealing with tasks in different languages. Further, through analysis and experimental verification, this paper proposes a series of optimization strategies for different languages, including: annotation method for language specific datasets, incremental fine-tuning method for tuning, increasing the size of datasets, using multi task learning, etc. Experiments show that these methods have achieved remarkable results, and put forward the future research direction.
19

Nia, Zahra Movahedi, Ali Ahmadi, Bruce Mellado, Jianhong Wu, James Orbinski, Ali Asgary, and Jude D. Kong. "Twitter-based gender recognition using transformers." Mathematical Biosciences and Engineering 20, no. 9 (2023): 15957–77. http://dx.doi.org/10.3934/mbe.2023711.

Abstract:
Social media contains useful information about people and society that could help advance research in many different areas of health (e.g. by applying opinion mining, emotion/sentiment analysis and statistical analysis) such as mental health, health surveillance, socio-economic inequality and gender vulnerability. User demographics provide rich information that could help study the subject further. However, user demographics such as gender are considered private and are not freely available. In this study, we propose a model based on transformers to predict the user's gender from their images and tweets. The image-based classification model is trained in two different methods: using the profile image of the user and using various image contents posted by the user on Twitter. For the first method a Twitter gender recognition dataset, publicly available on Kaggle and for the second method the PAN-18 dataset is used. Several transformer models, i.e. vision transformers (ViT), LeViT and Swin Transformer are fine-tuned for both of the image datasets and then compared. Next, different transformer models, namely, bidirectional encoders representations from transformers (BERT), RoBERTa and ELECTRA are fine-tuned to recognize the user's gender by their tweets. This is highly beneficial, because not all users provide an image that indicates their gender. The gender of such users could be detected from their tweets. The significance of the image and text classification models were evaluated using the Mann-Whitney U test. Finally, the combination model improved the accuracy of image and text classification models by 11.73 and 5.26% for the Kaggle dataset and by 8.55 and 9.8% for the PAN-18 dataset, respectively. This shows that the image and text classification models are capable of complementing each other by providing additional information to one another. Our overall multimodal method has an accuracy of 88.11% for the Kaggle and 89.24% for the PAN-18 dataset and outperforms state-of-the-art models. Our work benefits research that critically require user demographic information such as gender to further analyze and study social media content for health-related issues.
20

Park, Junhee, and Nammee Moon. "Design and Implementation of Attention Depression Detection Model Based on Multimodal Analysis." Sustainability 14, no. 6 (March 18, 2022): 3569. http://dx.doi.org/10.3390/su14063569.

Abstract:
Depression is becoming a social problem as the number of sufferers steadily increases. In this regard, this paper proposes a multimodal analysis-based attention depression detection model that simultaneously uses voice and text data obtained from users. The proposed models consist of Bidirectional Encoders from Transformers-Convolutional Neural Network (BERT-CNN) for natural language analysis, CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM) for voice signal processing, and multimodal analysis and fusion models for depression detection. The experiments in this paper are conducted using the DAIC-WOZ dataset, a clinical interview designed to support psychological distress states such as anxiety and post-traumatic stress. The voice data were set to 4 seconds in length and the number of mel filters was set to 128 in the preprocessing process. For text data, we used the subject text data of the interview and derived the embedding vector using a transformers tokenizer. Based on each data set, the BERT-CNN and CNN-BiLSTM proposed in this paper were applied and combined to classify depression. Through experiments, the accuracy and loss degree were compared for the cases of using multimodal data and using single data, and it was confirmed that the existing low accuracy was improved.
21

Xiang, Yunfan, Xiangyu Tian, Yue Xu, Xiaokun Guan, and Zhengchao Chen. "EGMT-CD: Edge-Guided Multimodal Transformers Change Detection from Satellite and Aerial Images." Remote Sensing 16, no. 1 (December 25, 2023): 86. http://dx.doi.org/10.3390/rs16010086.

Abstract:
Change detection from heterogeneous satellite and aerial images plays a progressively important role in many fields, including disaster assessment, urban construction, and land use monitoring. Currently, researchers have mainly devoted their attention to change detection using homologous image pairs and achieved many remarkable results. It is sometimes necessary to use heterogeneous images for change detection in practical scenarios due to missing images, emergency situations, and cloud and fog occlusion. However, heterogeneous change detection still faces great challenges, especially using satellite and aerial images. The main challenges in satellite and aerial image change detection are related to the resolution gap and blurred edge. Previous studies used interpolation or shallow feature alignment before traditional homologous change detection methods, which ignored the high-level feature interaction and edge information. Therefore, we propose a new heterogeneous change detection model based on multimodal transformers combined with edge guidance. In order to alleviate the resolution gap between satellite and aerial images, we design an improved spatially aligned transformer (SP-T) with a sub-pixel module to align the satellite features to the same size of the aerial ones supervised by a token loss. Moreover, we introduce an edge detection branch to guide change features using the object edge with an auxiliary edge-change loss. Finally, we conduct considerable experiments to verify the effectiveness and superiority of our proposed model (EGMT-CD) on a new satellite–aerial heterogeneous change dataset, named SACD. The experiments show that our method (EGMT-CD) outperforms many previously superior change detection methods and fully demonstrates its potential in heterogeneous change detection from satellite–aerial images.
22

Ammour, Nassim, Yakoub Bazi, and Naif Alajlan. "Multimodal Approach for Enhancing Biometric Authentication." Journal of Imaging 9, no. 9 (August 22, 2023): 168. http://dx.doi.org/10.3390/jimaging9090168.

Abstract:
Unimodal biometric systems rely on a single source or unique individual biological trait for measurement and examination. Fingerprint-based biometric systems are the most common, but they are vulnerable to presentation attacks or spoofing when a fake fingerprint is presented to the sensor. To address this issue, we propose an enhanced biometric system based on a multimodal approach using two types of biological traits. We propose to combine fingerprint and Electrocardiogram (ECG) signals to mitigate spoofing attacks. Specifically, we design a multimodal deep learning architecture that accepts fingerprints and ECG as inputs and fuses the feature vectors using stacking and channel-wise approaches. The feature extraction backbone of the architecture is based on data-efficient transformers. The experimental results demonstrate the promising capabilities of the proposed approach in enhancing the robustness of the system to presentation attacks.
23

Segura-Bedmar, Isabel, and Santiago Alonso-Bartolome. "Multimodal Fake News Detection." Information 13, no. 6 (June 2, 2022): 284. http://dx.doi.org/10.3390/info13060284.

Abstract:
Over the last few years, there has been an unprecedented proliferation of fake news. As a consequence, we are more susceptible to the pernicious impact that misinformation and disinformation spreading can have on different segments of our society. Thus, the development of tools for the automatic detection of fake news plays an important role in the prevention of its negative effects. Most attempts to detect and classify false content focus only on using textual information. Multimodal approaches are less frequent and they typically classify news either as true or fake. In this work, we perform a fine-grained classification of fake news on the Fakeddit dataset, using both unimodal and multimodal approaches. Our experiments show that the multimodal approach based on a Convolutional Neural Network (CNN) architecture combining text and image data achieves the best results, with an accuracy of 87%. Some fake news categories, such as Manipulated content, Satire, or False connection, strongly benefit from the use of images. Using images also improves the results of the other categories but with less impact. Regarding the unimodal approaches using only text, Bidirectional Encoder Representations from Transformers (BERT) is the best model, with an accuracy of 78%. Exploiting both text and image data significantly improves the performance of fake news detection.
24

Mingyu, Ji, Zhou Jiawei, and Wei Ning. "AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model." PLOS ONE 17, no. 9 (September 9, 2022): e0273936. http://dx.doi.org/10.1371/journal.pone.0273936.

Abstract:
Multimodal sentiment analysis is an essential task in natural language processing which refers to the fact that machines can analyze and recognize emotions through logical reasoning and mathematical operations after learning multimodal emotional features. For the problem of how to consider the effective fusion of multimodal data and the relevance of multimodal data in multimodal sentiment analysis, we propose an attention-based mechanism feature relevance fusion multimodal sentiment analysis model (AFR-BERT). In the data pre-processing stage, text features are extracted using the pre-trained language model BERT (Bi-directional Encoder Representation from Transformers), and the BiLSTM (Bi-directional Long Short-Term Memory) is used to obtain the internal information of the audio. In the data fusion phase, the multimodal data fusion network effectively fuses multimodal features through the interaction of text and audio information. During the data analysis phase, the multimodal data association network analyzes the data by exploring the correlation of fused information between text and audio. In the data output phase, the model outputs the results of multimodal sentiment analysis. We conducted extensive comparative experiments on the publicly available sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that AFR-BERT improves on the classical multimodal sentiment analysis model in terms of relevant performance metrics. In addition, ablation experiments and example analysis show that the multimodal data analysis network in AFR-BERT can effectively capture and analyze the sentiment features in text and audio.
25

Argade, Dakshata, Vaishali Khairnar, Deepali Vora, Shruti Patil, Ketan Kotecha, and Sultan Alfarhood. "Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism." Heliyon 10, no. 4 (February 2024): e26162. http://dx.doi.org/10.1016/j.heliyon.2024.e26162.

26

Gupta, Arpit, Himanshu Goyal, and Ishita Kohli. "Synthesis of Vision and Language: Multifaceted Image Captioning Application." International Journal of Scientific Research in Engineering and Management 07, no. 12 (December 23, 2023): 1–10. http://dx.doi.org/10.55041/ijsrem27770.

Abstract:
The rapid advancement in image captioning has been a pivotal area of research, aiming to mimic human-like understanding of visual content. This paper presents an innovative approach that integrates attention mechanisms and object features into an image captioning model. Leveraging the Flickr8k dataset, this research explores the fusion of these components to enhance image comprehension and caption generation. Furthermore, the study showcases the implementation of this model in a user-friendly application using FastAPI and ReactJS, offering text-to-speech translation in multiple languages. The findings underscore the efficacy of this approach in advancing image captioning technology. This tutorial outlines the construction of an image caption generator, employing a Convolutional Neural Network (CNN) for image feature extraction and a Long Short-Term Memory Network (LSTM) for Natural Language Processing (NLP). Keywords: Convolutional Neural Networks, Long Short Term Memory, Attention Mechanism, Transformer Architecture, Vision Transformers, Transfer Learning, Multimodal fusion, Deep Learning Models, Pre-Trained Models, Image Processing Techniques
27

Zhong, Enmin, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, and Narciso García. "Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer." Sensors 23, no. 16 (August 10, 2023): 7066. http://dx.doi.org/10.3390/s23167066.

Abstract:
Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. The primary challenge lies in achieving real-time recognition while managing temporal dependencies that can impact performance. Existing methods employ 3D convolutional or Transformer-based architectures with hand skeleton estimation, but both have limitations. To address these challenges, a hybrid approach that combines 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed. The method involves using a 3D-CNN to compute high-level semantic skeleton embeddings, capturing local spatial and temporal characteristics of hand gestures. A Transformer network with a self-attention mechanism is then employed to efficiently capture long-range temporal dependencies in the skeleton sequence. Evaluation on the Briareo and Multimodal Hand Gesture datasets resulted in accuracy scores of 95.49% and 97.25%, respectively. Notably, this approach achieves real-time performance using a standard CPU, distinguishing it from methods that require specialized GPUs. The hybrid approach’s real-time efficiency and high accuracy demonstrate its superiority over existing state-of-the-art methods. In summary, the hybrid 3D-CNN and Transformer approach effectively addresses real-time recognition challenges and efficient handling of temporal dependencies, outperforming existing methods in both accuracy and speed.
28

Nikzad-Khasmakhi, N., M. A. Balafar, M. Reza Feizi-Derakhshi, and Cina Motamed. "BERTERS: Multimodal representation learning for expert recommendation system with transformers and graph embeddings." Chaos, Solitons & Fractals 151 (October 2021): 111260. http://dx.doi.org/10.1016/j.chaos.2021.111260.

29

Hazmoune, Samira, and Fateh Bougamouza. "Using transformers for multimodal emotion recognition: Taxonomies and state of the art review." Engineering Applications of Artificial Intelligence 133 (July 2024): 108339. http://dx.doi.org/10.1016/j.engappai.2024.108339.

30

Perifanos, Konstantinos, and Dionysis Goutsos. "Multimodal Hate Speech Detection in Greek Social Media." Multimodal Technologies and Interaction 5, no. 7 (June 29, 2021): 34. http://dx.doi.org/10.3390/mti5070034.

Abstract:
Hateful and abusive speech presents a major challenge for all online social media platforms. Recent advances in Natural Language Processing and Natural Language Understanding allow for more accurate detection of hate speech in textual streams. This study presents a new multimodal approach to hate speech detection by combining Computer Vision and Natural Language processing models for abusive context detection. Our study focuses on Twitter messages and, more specifically, on hateful, xenophobic, and racist speech in Greek aimed at refugees and migrants. In our approach, we combine transfer learning and fine-tuning of Bidirectional Encoder Representations from Transformers (BERT) and Residual Neural Networks (Resnet). Our contribution includes the development of a new dataset for hate speech classification, consisting of tweet IDs, along with the code to obtain their visual appearance, as they would have been rendered in a web browser. We have also released a pre-trained Language Model trained on Greek tweets, which has been used in our experiments. We report a consistently high level of accuracy (accuracy score = 0.970, f1-score = 0.947 in our best model) in racist and xenophobic speech detection.
31

Li, Ning, Jie Chen, Nanxin Fu, Wenzhuo Xiao, Tianrun Ye, Chunming Gao, and Ping Zhang. "Leveraging Dual Variational Autoencoders and Generative Adversarial Networks for Enhanced Multimodal Interaction in Zero-Shot Learning." Electronics 13, no. 3 (January 29, 2024): 539. http://dx.doi.org/10.3390/electronics13030539.

Abstract:
In the evolving field of taxonomic classification, and especially in Zero-shot Learning (ZSL), the challenge of accurately classifying entities unseen in training datasets remains a significant hurdle. Although the existing literature is rich in developments, it often falls short in two critical areas: semantic consistency (ensuring classifications align with true meanings) and the effective handling of dataset diversity biases. These gaps have created a need for a more robust approach that can navigate both with greater efficacy. This paper introduces an innovative integration of transformer models with variational autoencoders (VAEs) and generative adversarial networks (GANs), with the aim of addressing them within the ZSL framework. The choice of VAE-GAN is driven by their complementary strengths: VAEs are proficient in providing a richer representation of data patterns, and GANs are able to generate data that is diverse yet representative, thus mitigating biases from dataset diversity. Transformers are employed to further enhance semantic consistency, which is key because many existing models underperform. In experiments conducted on benchmark ZSL datasets such as CUB, SUN, and Animals with Attributes 2 (AWA2), our approach demonstrates significant improvements, not only in enhancing semantic and structural coherence, but also in effectively addressing dataset biases. This leads to a notable enhancement of the model’s ability to generalize visual categorization tasks beyond the training data, thus filling a critical gap in the current ZSL research landscape.
32

Meng, Yiwen, William Speier, Michael K. Ong, and Corey W. Arnold. "Bidirectional Representation Learning From Transformers Using Multimodal Electronic Health Record Data to Predict Depression." IEEE Journal of Biomedical and Health Informatics 25, no. 8 (August 2021): 3121–29. http://dx.doi.org/10.1109/jbhi.2021.3063721.

33

Zhang, Mengna, Qisong Huang, and Hua Liu. "A Multimodal Data Analysis Approach to Social Media during Natural Disasters." Sustainability 14, no. 9 (May 5, 2022): 5536. http://dx.doi.org/10.3390/su14095536.

Abstract:
During natural disasters, social media can provide real-time or rapid disaster perception information to help government managers carry out disaster response efforts efficiently. Therefore, it is of great significance to mine social media information accurately. In contrast to previous studies, this study proposes a multimodal data classification model for mining social media information. Using the model, the study employs Latent Dirichlet Allocation (LDA) to identify subject information from multimodal data; the multimodal data is then analyzed by bidirectional encoder representation from transformers (BERT) and visual geometry group 16 (VGG-16). Text and image data are classified separately, resulting in real mining of topic information during disasters. This study uses Weibo data during the 2021 Henan heavy storm as the research object. Comparing the data with previous experiment results, this study proposes a model that can classify natural disaster topics more accurately. The accuracy of this study is 0.93. Compared with a topic-based event classification model KGE-MMSLDA, the accuracy of this study is improved by 12%. This study results in a real-time understanding of different themed natural disasters to help make informed decisions.
34

Macfadyen, Craig, Ajay Duraiswamy, and David Harris-Birtill. "Classification of hyper-scale multimodal imaging datasets." PLOS Digital Health 2, no. 12 (December 13, 2023): e0000191. http://dx.doi.org/10.1371/journal.pdig.0000191.

Abstract:
Algorithms that classify hyper-scale multi-modal datasets, comprising millions of images, into constituent modality types can help researchers quickly retrieve and classify diagnostic imaging data, accelerating clinical outcomes. This research aims to demonstrate that a deep neural network that is trained on a hyper-scale dataset (4.5 million images) composed of heterogeneous multi-modal data can be used to obtain significant modality classification accuracy (96%). By combining 102 medical imaging datasets, a dataset of 4.5 million images was created. A ResNet-50, ResNet-18, and VGG16 were trained to classify these images by the imaging modality used to capture them (Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and X-ray) across many body locations. The classification accuracy of the models was then tested on unseen data. The best performing model achieved classification accuracy of 96% on unseen data, which is on par with, or exceeds, the accuracy of more complex implementations using EfficientNets or Vision Transformers (ViTs). The model achieved a balanced accuracy of 86%. This research shows it is possible to train Deep Learning (DL) Convolutional Neural Networks (CNNs) with hyper-scale multimodal datasets, composed of millions of images. Such models can find use in real-world applications with volumes of image data in the hyper-scale range, such as medical imaging repositories, or national healthcare institutions. Further research can expand this classification capability to include 3D-scans.
35

Svyatov, Kirill V., Daniil P. Kanin, and Sergey V. Sukhov. "THE CONTROL SYSTEM FOR UNMANNED VEHICLES BASED ON MULTIMODAL DATA AND IDENTIFIED FEATURE HIERARCHY." Автоматизация процессов управления 1, no. 67 (2022): 52–59. http://dx.doi.org/10.35752/1991-2927-2022-1-67-52-59.

Abstract:
Currently, autonomous driving systems are becoming more and more widespread. A promising area of research is the design of control systems for self-driving cars using multiple sensors for autonomous driving. Data fusion makes it possible to build a more complete and accurate model of the surrounding scene by complementing the data of one modality with the data of another modality. The article describes an approach to driving an unmanned vehicle in the Carla simulation environment based on a neural network model, which receives multimodal data from a camera and lidar. The approach can significantly improve the quality of recognition of surrounding scenes by identifying a hierarchy of features on the inner layers of the neural network by integrating multimodal information using the model of transformers with attention. The output of the neural network is a sequence of points that define further movement by converting them into control actions for the steering wheel, gas and brake.
36

Watson, Eleanor, Thiago Viana, and Shujun Zhang. "Augmented Behavioral Annotation Tools, with Application to Multimodal Datasets and Models: A Systematic Review." AI 4, no. 1 (January 28, 2023): 128–71. http://dx.doi.org/10.3390/ai4010007.

Abstract:
Annotation tools are an essential component in the creation of datasets for machine learning purposes. Annotation tools have evolved greatly since the turn of the century, and now commonly include collaborative features to divide labor efficiently, as well as automation employed to amplify human efforts. Recent developments in machine learning models, such as Transformers, allow for training upon very large and sophisticated multimodal datasets and enable generalization across domains of knowledge. These models also herald an increasing emphasis on prompt engineering to provide qualitative fine-tuning upon the model itself, adding a novel emerging layer of direct machine learning annotation. These capabilities enable machine intelligence to recognize, predict, and emulate human behavior with much greater accuracy and nuance, shortfalls in which have contributed to algorithmic injustice in previous techniques. However, the scale and complexity of training data required for multimodal models present engineering challenges. Best practices for conducting annotation for large multimodal models in the most safe and ethical, yet efficient, manner have not been established. This paper presents a systematic literature review of crowd and machine learning augmented behavioral annotation methods to distill practices that may have value in multimodal implementations, cross-correlated across disciplines. Research questions were defined to provide an overview of the evolution of augmented behavioral annotation tools in the past, in relation to the present state of the art. (Contains five figures and four tables).
37

Zhang, Ke, Shunmin Wang, and Yuyuan Yu. "A TBGAV-Based Image-Text Multimodal Sentiment Analysis Method for Tourism Reviews." International Journal of Information Technology and Web Engineering 18, no. 1 (December 7, 2023): 1–17. http://dx.doi.org/10.4018/ijitwe.334595.

Abstract:
To overcome limitations in existing methods for sentiment analysis of tourism reviews, the authors propose an image-text multimodal sentiment analysis method (TBGAV). It consists of three modules: image sentiment extraction, text sentiment extraction, and image-text fusion. The image sentiment extraction module employs a pre-trained VGG19 model to capture sentiment features. The text sentiment extraction module utilizes the tiny bidirectional encoder representations from transformers (TinyBERT) model, incorporating the bidirectional recurrent neural network and attention (BiGRU-Attention) module for deeper sentiment semantics. The image-text fusion module employs the dual linear fusion approach to correlate image-text links and the maximum decision-making approach for high-precision sentiment prediction. TBGAV achieves superior performance on the Yelp dataset with accuracy, recall rates, and F1 scores of 77.51%, 78.01%, and 78.34%, respectively, outperforming existing methods. Accordingly, TBGAV is expected to help improve travel-related recommender systems and marketing strategies.
38

Luna-Jiménez, Cristina, Ricardo Kleinlein, David Griol, Zoraida Callejas, Juan M. Montero, and Fernando Fernández-Martínez. "A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset." Applied Sciences 12, no. 1 (December 30, 2021): 327. http://dx.doi.org/10.3390/app12010327.

Abstract:
Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognizer system that consisted of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that the training was more robust when it did not start from scratch and the previous knowledge of the network was similar to the task to adapt. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis reported that the visual systems could improve with a detector of high-emotional load frames, which opened a new line of research to discover new ways to learn from videos. Finally, combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. Results demonstrated that these modalities carried relevant information to detect users’ emotional state and that their combination improved the final system performance.
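Late fusion, as mentioned in the abstract above, combines the outputs of separately trained unimodal models rather than their internal features. The following is a minimal sketch only: it assumes each recognizer emits one probability per emotion class and that the fused score is a weighted average. The abstract does not state the fusion rule, so the function name `late_fusion`, the weight `w_speech`, and the example scores are hypothetical.

```python
def late_fusion(p_speech, p_face, w_speech=0.5):
    """Fuse two per-class probability lists by weighted averaging.

    Hypothetical illustration; not the fusion rule used in the paper.
    Returns (fused_probabilities, index_of_predicted_class).
    """
    if len(p_speech) != len(p_face):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= w_speech <= 1.0:
        raise ValueError("w_speech must lie in [0, 1]")
    # Weighted average of the two modalities, class by class.
    fused = [w_speech * s + (1.0 - w_speech) * f
             for s, f in zip(p_speech, p_face)]
    # The predicted class is the one with the highest fused score.
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Made-up scores for three classes from each recognizer.
fused, label = late_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
print(fused, label)
```

With equal weights, a class that one modality scores highly can still lose to a class on which both modalities moderately agree, which is the usual motivation for fusing at the decision level.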
39

Singh, Aman, Ankit Gautam, Deepanshu, Gautam Kumar, Lokesh Kumar Meena, and Shashank Saroop. "Automated Minutes of Meeting Using a Multimodal Approach." International Journal for Research in Applied Science and Engineering Technology 11, no. 12 (December 31, 2023): 2059–63. http://dx.doi.org/10.22214/ijraset.2023.57787.

Abstract:
Automated generation of meeting minutes using a multimodal approach has emerged as a promising way to lighten the time-consuming and error-prone manual process of capturing and summarizing meeting discussions. This research paper presents a novel approach for automating the minutes of a meeting through a multimodal approach. Natural language processing is used to identify key topics, important discussions, and other significant details raised during the meeting. Machine learning models are trained on datasets to classify and extract the relevant information accurately. Further, the research explores the use of advanced machine learning models, such as Whisper and transformers, to capture the context and finer details of the meeting. These models enhance the accuracy and speed of the generated minutes. The automated minutes are assessed by comparing the outputs against minutes produced manually by human notetakers. Metrics such as accuracy and F1 score are used to assess system performance, ensuring the accuracy and quality of the generated minutes. This demonstrates that the automated approach offers significant time savings, reduces human error, and increases overall efficiency in capturing and summarizing meeting discussions. The system shows promising potential for adoption in various organizations and industries. In conclusion, this research paper presents a comprehensive study of the automated generation of meeting minutes using a multimodal approach. The proposed approach uses NLP techniques and advanced machine learning models to accurately extract and summarize meeting content. The results highlight the potential of this automated system to streamline meeting processes and enhance overall productivity in organizations.
40

Wang, Zhecan, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, and Shih-Fu Chang. "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 5 (June 28, 2022): 5914–22. http://dx.doi.org/10.1609/aaai.v36i5.20536.

Abstract:
Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. In order to exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted from visual scene graphs. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods, and prove the efficacy of each proposed component.
41

Li, Weisheng, Yin Zhang, Guofen Wang, Yuping Huang, and Ruyue Li. "DFENet: A dual-branch feature enhanced network integrating transformers and convolutional feature learning for multimodal medical image fusion." Biomedical Signal Processing and Control 80 (February 2023): 104402. http://dx.doi.org/10.1016/j.bspc.2022.104402.

42

Liu, Mingfei, Bin Zhou, Jie Li, Xinyu Li, and Jinsong Bao. "A Knowledge Graph-Based Approach for Assembly Sequence Recommendations for Wind Turbines." Machines 11, no. 10 (September 27, 2023): 930. http://dx.doi.org/10.3390/machines11100930.

Abstract:
There are various forms of assembly data sources for wind turbines, which contributes to the lack of a unified and standardized expression. Moreover, the reusability of historical assembly data is low, which leads to the poor reasoning ability of a new product assembly sequence. In this paper, we propose a knowledge graph-based approach for assembly sequence recommendations for wind turbines. First, for the multimodal data (text in process manual, image of tooling, and three-dimensional (3D) model) of assembly, a multi-process assembly information representation model is established to express assembly elements in a unified way. In addition, knowledge extraction methods for different modal data are designed to construct a multimodal knowledge graph for wind turbine assembly. Further, the retrieval of similar assembly process items based on the bidirectional encoder representation from transformers joint graph-matching network (BERT-GMN) is proposed to predict the assembly sequence subgraphs. Also, a Semantic Web Rule Language (SWRL)-based assembly process items inference method is proposed to automatically generate subassembly sequences by combining component assembly relationships. Then, a multi-objective sequence optimization algorithm for the final assembly is designed to output the optimal assembly sequences. Finally, taking the VEU-15 wind turbine as the object, the effectiveness of the assembly process information modeling and part multi-source information representation is verified. Sequence recommendation results are better quality compared to traditional assembly sequence planning algorithms. It provides a feasible solution for wind turbine assembly to be optimized from multiple objectives simultaneously.
43

Kalra, Sakshi, Chitneedi Hemanth Sai Kumar, Yashvardhan Sharma, and Gajendra Singh Chauhan. "FakeExpose: Uncovering the falsity of news by targeting the multimodality via transfer learning." Journal of Information and Optimization Sciences 44, no. 3 (2023): 301–14. http://dx.doi.org/10.47974/jios-1342.

Abstract:
Social media for news utilization has its own pros and cons. There are several reasons why people look for and read news through internet media. On the one hand, it is easier to access, and on the other, social media’s dynamic content and misinformation pose serious problems for both government and public institutions. Several studies have been conducted in the past to classify online reviews and their textual content. The current paper suggests a multimodal strategy for the fake news detection (FND) task that covers both text and image. The suggested model (FakeExpose) is created to automatically learn a variety of discriminative features, instead of relying on manually created features. Several pre-trained word and image embedding models, such as DistilRoBERTa and Vision Transformers (ViTs), are used and fine-tuned for the best feature extraction and the various word dependencies. Data augmentation is used to address the issue of pre-trained textual feature extractors not processing more than 512 tokens at a time. The accuracy of the presented model on PolitiFact and GossipCop is 91.35 percent and 98.59 percent, respectively, based on current standards. According to our knowledge, this is the first attempt to use the FakeNewsNet repository to reach the maximum multimodal accuracy. The results show that combining text and image data improves accuracy when compared to utilizing only text or images (Unimodal). Moreover, the outcomes imply that adding more data has improved the model’s accuracy rather than degraded it.
44

Coleman, Matthew, Joanna F. Dipnall, Myong Jung, and Lan Du. "PreRadE: Pretraining Tasks on Radiology Images and Reports Evaluation Framework." Mathematics 10, no. 24 (December 8, 2022): 4661. http://dx.doi.org/10.3390/math10244661.

Abstract:
Recently, self-supervised pretraining of transformers has gained considerable attention in analyzing electronic medical records. However, systematic evaluation of different pretraining tasks in radiology applications using both images and radiology reports is still lacking. We propose PreRadE, a simple proof of concept framework that enables novel evaluation of pretraining tasks in a controlled environment. We investigated three most-commonly used pretraining tasks (MLM—Masked Language Modelling, MFR—Masked Feature Regression, and ITM—Image to Text Matching) and their combinations against downstream radiology classification on MIMIC-CXR, a medical chest X-ray imaging and radiology text report dataset. Our experiments in the multimodal setting show that (1) pretraining with MLM yields the greatest benefit to classification performance, largely due to the task-relevant information learned from the radiology reports. (2) Pretraining with only a single task can introduce variation in classification performance across different fine-tuning episodes, suggesting that composite task objectives incorporating both image and text modalities are better suited to generating reliably performant models.
45

Sriram, K., S. P. Mangaiyarkarasi, S. Sakthivel, and L. Jebaraj. "An Extensive Study Using the Beetle Swarm Method to Optimize Single and Multiple Objectives of Various Optimal Power Flow Problems." International Transactions on Electrical Energy Systems 2023 (March 30, 2023): 1–33. http://dx.doi.org/10.1155/2023/5779700.

Abstract:
Operating an electric energy generation system economically is an essential task in power system operation. This article applies the beetle swarm optimization algorithm (BSOA) to the optimal power flow (OPF) problem. BSOA is an optimization technique capable of handling multimodal, nonlinear, and nondifferentiable objective functions. The proposed OPF is modeled with numerous objective functions and constrained formulations and is examined in thirty-one different cases on three test systems (IEEE 30-, 57-, and 118-bus), using single objectives and weighted-sum multiobjectives. Six new multiobjective cases are also studied. The control variables, such as real power generation, transformer tap-setting ratios, bus voltage magnitudes, and shunt capacitor values, are also optimized. The potency and robustness of the proposed method were investigated and compared with recent findings reported in the literature. This extensive study revealed the preeminence of the presented technique when applied to OPF problems with intricate and nonsmooth objective functions.
46

Boehm, Kevin M., Antonio Marra, Jorge S. Reis-Filho, Sarat Chandarlapaty, Fresia Pareja, and Sohrab P. Shah. "Abstract 890: Multimodal modeling of digitized histopathology slides improves risk stratification in hormone receptor-positive breast cancer patients." Cancer Research 84, no. 6_Supplement (March 22, 2024): 890. http://dx.doi.org/10.1158/1538-7445.am2024-890.

Abstract:
In early-stage hormone receptor-positive breast cancer, genomic risk scores identify patients who stand to benefit from up-front chemotherapy but introduce financial and logistical hurdles to care. We assembled a cohort of 5,244 patients with 11,671 corresponding whole-slide images of breast tumors stained with hematoxylin and eosin. We developed a multimodal machine learning model to infer risk of distal metastatic recurrence from routine clinical data. Specifically, the model interprets text from the pathologist’s report using a large language model and uses self-supervised vision transformers to interpret the corresponding whole-slide image. Tensor fusion joins the modalities to infer Genomic Health’s Oncotype DX recurrence score. Inferred recurrence score from the multimodal model correlated with measured score with a concordance correlation coefficient of 0.64 (95% C.I. 0.59 - 0.69) in the withheld test set, compared to 0.55 (95% C.I. 0.49 - 0.61) and 0.56 (95% C.I. 0.52 - 0.60) for the linguistic and visual unimodal models, respectively. The multimodal model attains an area under the precision-recall curve (AUPRC) of 0.69 (AUROC=0.88) for identifying high-risk disease in the full-information setting (when images and pathology reports with quantitative hormone receptor status and grade are available) in a withheld test set, compared to AUPRC of 0.61 and 0.66 for the linguistic and visual models, respectively. By comparison, in the same full-information setting, the clinical nomogram introduced by Orucevic et al. in 2019 achieves an AUPRC of 0.48. We suggest the operating point at which precision is 94.4% and recall is 33.3%. Digitized whole-slide images of routine breast biopsies and their associated synoptic pathology reports contain much of the information necessary to stratify patients by risk of distal metastatic recurrence, when modeled appropriately.
Our model could enable hospitals to rapidly triage the need for genomic risk testing, possibly precluding one third of orders without loss of accuracy. This helps allocate scarce resources for genomic tests and valuable weeks prior to beginning therapy while maintaining the standard of precision oncology. Citation Format: Kevin M. Boehm, Antonio Marra, Jorge S. Reis-Filho, Sarat Chandarlapaty, Fresia Pareja, Sohrab P. Shah. Multimodal modeling of digitized histopathology slides improves risk stratification in hormone receptor-positive breast cancer patients [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 890.
47

Alam, Mohammad Arif Ul. "College Student Retention Risk Analysis from Educational Database Using Multi-Task Multi-Modal Neural Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (June 28, 2022): 12689–97. http://dx.doi.org/10.1609/aaai.v36i11.21545.

Abstract:
We develop a Multimodal Spatiotemporal Neural Fusion network for MTL (MSNF-MTCL) to predict five important student retention risks: future dropout, next semester dropout, type of dropout, duration of dropout and cause of dropout. First, we develop a general purpose multi-modal neural fusion network model, MSNF, for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each advising note, a Long Short-Term Memory (LSTM) network to model temporal advising note embeddings, an LSTM network to model students' temporal performance variables, and students' static demographics. The final fused representation from MSNF is used in a Multi-Task Cascade Learning (MTCL) model to build MSNF-MTCL for predicting the five student retention risks. We evaluate MSNF-MTCL on a large educational database of 36,445 college students spanning an 18-year period, where it shows promising performance compared with the nearest state-of-the-art models. Additionally, we test the fairness of the model given the existence of biases.
48

Wu, Di, Lihua Cao, Pengji Zhou, Ning Li, Yi Li, and Dejun Wang. "Infrared Small-Target Detection Based on Radiation Characteristics with a Multimodal Feature Fusion Network." Remote Sensing 14, no. 15 (July 25, 2022): 3570. http://dx.doi.org/10.3390/rs14153570.

Abstract:
Infrared small-target detection has widespread influences on anti-missile warning, precise weapon guidance, infrared stealth and anti-stealth, military reconnaissance, and other national defense fields. However, small targets are easily submerged in background clutter noise and have fewer pixels and shape features. Furthermore, random target positions and irregular motion can lead to target detection being carried out in the whole space–time domain. This could result in a large amount of calculation, and the accuracy and real-time performance are difficult to guarantee. Therefore, infrared small-target detection is still a challenging and far-reaching research hotspot. To solve the above problem, a novel multimodal feature fusion network (MFFN) is proposed, based on morphological characteristics, infrared radiation, and motion characteristics, which could compensate for the deficiency in the description of single modal characteristics of small targets and improve the recognition precision. The innovations introduced in the paper cover the following three aspects. Firstly, in the morphological domain, we propose a network with the skip-connected feature pyramid network (SCFPN) and dilated convolutional block attention module integrated with Resblock (DAMR) introduced to the backbone, which is designed to improve the feature extraction ability for infrared small targets. Secondly, in the radiation characteristic domain, we propose a prediction model of atmospheric transmittance based on deep neural networks (DNNs), which predicts the atmospheric transmittance effectively without being limited by the complex environment to improve the measurement accuracy of radiation characteristics. Thirdly, a dilated convolutional-network-based bidirectional encoder representations from transformers (DC-BERT) structure combined with an attention mechanism is proposed for the feature extraction of radiation and motion characteristics.
Finally, experiments on our self-established optoelectronic equipment detected dataset (OEDD) show that our method is superior to eight state-of-the-art algorithms in terms of the accuracy and robustness of infrared small-target detection. The comparative experimental results of four kinds of target sequences indicate that the average recognition rate Pavg is 92.64%, the mean average precision (mAP) is 92.01%, and the F1 score is 90.52%.
49

de Hond, Anne, Marieke van Buchem, Claudio Fanconi, Mohana Roy, Douglas Blayney, Ilse Kant, Ewout Steyerberg, and Tina Hernandez-Boussard. "Predicting Depression Risk in Patients With Cancer Using Multimodal Data: Algorithm Development Study." JMIR Medical Informatics 12 (January 18, 2024): e51925. http://dx.doi.org/10.2196/51925.

Abstract:
Background: Patients with cancer starting systemic treatment programs, such as chemotherapy, often develop depression. A prediction model may assist physicians and health care workers in the early identification of these vulnerable patients. Objective: This study aimed to develop a prediction model for depression risk within the first month of cancer treatment. Methods: We included 16,159 patients diagnosed with cancer starting chemo- or radiotherapy treatment between 2008 and 2021. Machine learning models (eg, least absolute shrinkage and selection operator [LASSO] logistic regression) and natural language processing models (Bidirectional Encoder Representations from Transformers [BERT]) were used to develop multimodal prediction models using both electronic health record data and unstructured text (patient emails and clinician notes). Model performance was assessed in an independent test set (n=5387, 33%) using area under the receiver operating characteristic curve (AUROC), calibration curves, and decision curve analysis to assess initial clinical impact use. Results: Among 16,159 patients, 437 (2.7%) received a depression diagnosis within the first month of treatment. The LASSO logistic regression models based on the structured data (AUROC 0.74, 95% CI 0.71-0.78) and structured data with email classification scores (AUROC 0.74, 95% CI 0.71-0.78) had the best discriminative performance. The BERT models based on clinician notes and structured data with email classification scores had AUROCs around 0.71. The logistic regression model based on email classification scores alone performed poorly (AUROC 0.54, 95% CI 0.52-0.56), and the model based solely on clinician notes had the worst performance (AUROC 0.50, 95% CI 0.49-0.52). Calibration was good for the logistic regression models, whereas the BERT models produced overly extreme risk estimates even after recalibration.
There was a small range of decision thresholds for which the best-performing model showed promising clinical effectiveness use. The risks were underestimated for female and Black patients. Conclusions: The results demonstrated the potential and limitations of machine learning and multimodal models for predicting depression risk in patients with cancer. Future research is needed to further validate these models, refine the outcome label and predictors related to mental health, and address biases across subgroups.
50

Nooralahzadeh, Farhad, and Rico Sennrich. "Improving the Cross-Lingual Generalisation in Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (June 26, 2023): 13419–27. http://dx.doi.org/10.1609/aaai.v37i11.26574.

Abstract:
While several benefits were realized for multilingual vision-language pretrained models, recent benchmarks across various tasks and languages showed poor cross-lingual generalisation when multilingually pre-trained vision-language models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual-question data and evaluated on 7 typologically diverse languages. We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective to augment the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without model modification, (3) we augment training examples using synthetic code-mixing to promote alignment of embeddings between source and target languages. Our experiments on xGQA using the pretrained multilingual multimodal transformers UC2 and M3P demonstrate the consistent effectiveness of the proposed fine-tuning strategy for 7 languages, outperforming existing transfer methods with sparse models.