Academic literature on the topic 'Transformers Multimodaux'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Transformers Multimodaux.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Transformers Multimodaux"

1

Jaiswal, Sushma, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal. "Optimized Image Captioning: Hybrid Transformers Vision Transformers and Convolutional Neural Networks: Enhanced with Beam Search." International Journal of Intelligent Systems and Applications 16, no. 2 (April 8, 2024): 53–61. http://dx.doi.org/10.5815/ijisa.2024.02.05.

Abstract:
Deep learning has improved image captioning. Transformer, a neural network architecture built for natural language processing, excels at image captioning and other computer vision applications. This paper reviews Transformer-based image captioning methods in detail. Convolutional neural networks (CNNs) extracted image features and RNNs or LSTM networks generated captions in traditional image captioning. This method often has information bottlenecks and trouble capturing long-range dependencies. Transformer architecture revolutionized natural language processing with its attention strategy and parallel processing. Researchers used Transformers' language success to solve image captioning problems. Transformer-based image captioning systems outperform previous methods in accuracy and efficiency by integrating visual and textual information into a single model. This paper discusses how the Transformer architecture's self-attention mechanisms and positional encodings are adapted for image captioning. Vision Transformers (ViTs) and CNN-Transformer hybrid models are discussed. We also discuss pre-training, fine-tuning, and reinforcement learning to improve caption quality. Transformer-based image captioning difficulties, trends, and future approaches are also examined. Multimodal fusion, visual-text alignment, and caption interpretability are challenges. We expect research to address these issues and apply Transformer-based image captioning to medical imaging and distant sensing. This paper covers how Transformer-based approaches have changed image captioning and their potential to revolutionize multimodal interpretation and generation, advancing artificial intelligence and human-computer interactions.
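As a concrete illustration of the encoder-decoder pattern this survey covers, the following minimal PyTorch-style sketch pairs a CNN feature extractor with a Transformer decoder that cross-attends to the visual tokens while generating caption tokens. The backbone choice, dimensions, and vocabulary size are illustrative assumptions rather than any specific system from the paper, and positional encodings are omitted for brevity.

```python
# Minimal sketch, assuming a ResNet-50 backbone and made-up sizes: a CNN encodes the image
# into a grid of visual tokens and a Transformer decoder cross-attends to them while
# predicting caption tokens. Not the implementation of any surveyed system.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        cnn = models.resnet50(weights=None)                        # visual feature extractor
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # keep the spatial grid
        self.proj = nn.Linear(2048, d_model)                       # CNN channels -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)             # caption token embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_tokens):
        feats = self.backbone(images)                              # (B, 2048, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))       # (B, H'*W', d_model) visual tokens
        tgt = self.embed(caption_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)             # causal self-attention + cross-attention
        return self.lm_head(out)                                   # next-token logits

logits = CaptionTransformer()(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Beam search, pre-training, and reinforcement-learning refinements discussed in the abstract would operate on top of a decoder of this kind.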
2

Bayat, Nasrin, Jong-Hwan Kim, Renoa Choudhury, Ibrahim F. Kadhim, Zubaidah Al-Mashhadani, Mark Aldritz Dela Virgen, Reuben Latorre, Ricardo De La Paz, and Joon-Hyuk Park. "Vision Transformer Customized for Environment Detection and Collision Prediction to Assist the Visually Impaired." Journal of Imaging 9, no. 8 (August 15, 2023): 161. http://dx.doi.org/10.3390/jimaging9080161.

Abstract:
This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector of all identified objects from the vision transformer and to detect objects that are likely to intersect with the user’s walking path. Audio and vibrotactile feedback modules are integrated to convey collision warning through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos consisting of 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while the experimental validation concerning their usability and efficiency stands as an upcoming endeavor. The demonstrated performance of the vision transformer and the presented algorithms in conjunction with the multimodal feedback modules show promising prospects of its feasibility and applicability for the navigation assistance of individuals with vision impairment.
3

Shao, Zilei. "A literature review on multimodal deep learning models for detecting mental disorders in conversational data: Pre-transformer and transformer-based approaches." Applied and Computational Engineering 18, no. 1 (October 23, 2023): 215–24. http://dx.doi.org/10.54254/2755-2721/18/20230993.

Abstract:
This paper provides a comprehensive review of multimodal deep learning models that utilize conversational data to detect mental health disorders. In addition to discussing models based on the Transformer, such as BERT (Bidirectional Encoder Representations from Transformers), this paper addresses models that existed prior to the Transformer, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The paper covers the application of these models in the construction of multimodal deep learning systems to detect mental disorders. In addition, the difficulties encountered by multimodal deep learning systems are brought up. Furthermore, the paper proposes research directions for enhancing the performance and robustness of these models in mental health applications. By shedding light on the potential of multimodal deep learning in mental health care, this paper aims to foster further research and development in this critical domain.
4

Hendricks, Lisa Anne, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers." Transactions of the Association for Computational Linguistics 9 (2021): 570–85. http://dx.doi.org/10.1162/tacl_a_00385.

Abstract:
Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
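The architectural comparison described here, merged multimodal attention versus modality-specific attention, can be sketched in a few lines of PyTorch; the shapes and layer counts below are arbitrary assumptions chosen only to make the contrast explicit.

```python
# Sketch of the contrast studied in the paper: "merged" multimodal attention, where image
# and text tokens attend to one another in a single encoder, versus modality-specific
# attention, where each stream only attends to itself. Sizes here are arbitrary assumptions.
import torch
import torch.nn as nn

d_model, nhead = 256, 4
make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

merged_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)  # joint, cross-modal attention
text_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)    # text-only attention
image_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)   # image-only attention

text_tokens = torch.randn(8, 16, d_model)    # (batch, text length, d_model)
image_tokens = torch.randn(8, 49, d_model)   # (batch, image patches, d_model)

# Merged multimodal attention: every token can attend across modalities.
joint = merged_encoder(torch.cat([text_tokens, image_tokens], dim=1))

# Modality-specific attention: each stream is encoded separately and only concatenated afterwards.
separate = torch.cat([text_encoder(text_tokens), image_encoder(image_tokens)], dim=1)
print(joint.shape, separate.shape)  # both torch.Size([8, 65, 256])
```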
5

Chen, Yu, Ming Yin, Yu Li, and Qian Cai. "CSU-Net: A CNN-Transformer Parallel Network for Multimodal Brain Tumour Segmentation." Electronics 11, no. 14 (July 16, 2022): 2226. http://dx.doi.org/10.3390/electronics11142226.

Abstract:
Medical image segmentation techniques are vital to medical image processing and analysis. Considering the significant clinical applications of brain tumour image segmentation, it represents a focal point of medical image segmentation research. Most of the work in recent times has been centred on Convolutional Neural Networks (CNN) and Transformers. However, CNN has some deficiencies in modelling long-distance information transfer and contextual processing information, while Transformer is relatively weak in acquiring local information. To overcome the above defects, we propose a novel segmentation network with an “encoder–decoder” architecture, namely CSU-Net. The encoder consists of two parallel feature extraction branches based on CNN and Transformer, respectively, in which the features of the same size are fused. The decoder has a dual Swin Transformer decoder block with two learnable parameters for feature upsampling. The features from multiple resolutions in the encoder and decoder are merged via skip connections. On the BraTS 2020, our model achieves 0.8927, 0.8857, and 0.8188 for the Whole Tumour (WT), Tumour Core (TC), and Enhancing Tumour (ET), respectively, in terms of Dice scores.
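A minimal sketch of the parallel-branch idea described in this abstract is given below, assuming a 2D toy input rather than the 3D multimodal MRI volumes used in the paper; the module sizes and the fusion rule are illustrative assumptions, not the released CSU-Net code.

```python
# Minimal sketch of a parallel CNN/Transformer encoder block with same-size feature fusion.
# A 2D toy input is assumed; the paper segments 3D multimodal MRI volumes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelEncoderBlock(nn.Module):
    def __init__(self, in_ch=4, ch=64, nhead=4):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.to_tokens = nn.Conv2d(in_ch, ch, kernel_size=8, stride=8)        # patch embedding
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(ch, nhead, batch_first=True), num_layers=2)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)                                  # 1x1 fusion conv

    def forward(self, x):                                        # x: (B, modalities, H, W)
        local_feats = self.cnn(x)                                # CNN branch: local detail
        tokens = self.to_tokens(x)                               # (B, ch, H/8, W/8)
        b, c, h, w = tokens.shape
        glob = self.transformer(tokens.flatten(2).transpose(1, 2))  # long-range context
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        glob = F.interpolate(glob, size=local_feats.shape[-2:])     # match feature size
        return self.fuse(torch.cat([local_feats, glob], dim=1))     # fuse both branches

out = ParallelEncoderBlock()(torch.randn(1, 4, 128, 128))        # 4 MRI modalities as channels
print(out.shape)  # torch.Size([1, 64, 128, 128])
```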
6

Sun, Qixuan, Nianhua Fang, Zhuo Liu, Liang Zhao, Youpeng Wen, and Hongxiang Lin. "HybridCTrm: Bridging CNN and Transformer for Multimodal Brain Image Segmentation." Journal of Healthcare Engineering 2021 (October 1, 2021): 1–10. http://dx.doi.org/10.1155/2021/7467261.

Abstract:
Multimodal medical image segmentation is always a critical problem in medical image segmentation. Traditional deep learning methods utilize fully CNNs for encoding given images, thus leading to deficiency of long-range dependencies and bad generalization performance. Recently, a sequence of Transformer-based methodologies emerges in the field of image processing, which brings great generalization and performance in various tasks. On the other hand, traditional CNNs have their own advantages, such as rapid convergence and local representations. Therefore, we analyze a hybrid multimodal segmentation method based on Transformers and CNNs and propose a novel architecture, HybridCTrm network. We conduct experiments using HybridCTrm on two benchmark datasets and compare with HyperDenseNet, a network based on fully CNNs. Results show that our HybridCTrm outperforms HyperDenseNet on most of the evaluation metrics. Furthermore, we analyze the influence of the depth of Transformer on the performance. Besides, we visualize the results and carefully explore how our hybrid methods improve on segmentations.
7

Tian, Yu, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, and Faouzi Bader. "Multimodal transformers for wireless communications: A case study in beam prediction." ITU Journal on Future and Evolving Technologies 4, no. 3 (September 5, 2023): 461–71. http://dx.doi.org/10.52953/jwra8095.

Abstract:
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained from large sequences of multimodality wireless data, on fine-tuning for multiple downstream radio network tasks.
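The pipeline described here, per-modality feature extraction, a Transformer encoder that mixes modality and time tokens, and focal loss for imbalanced beam labels, can be sketched roughly as follows; the encoders, dimensions, and number of beams are assumptions for illustration only, not the authors' framework.

```python
# Rough sketch under stated assumptions: per-modality encoders turn images and GPS readings
# into feature tokens, a Transformer encoder mixes tokens across modalities and time, and a
# classifier predicts the beam index; focal loss down-weights easy examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeamPredictor(nn.Module):
    def __init__(self, num_beams=64, d_model=128):
        super().__init__()
        self.image_enc = nn.Sequential(nn.Conv2d(3, d_model, 7, stride=4),
                                       nn.AdaptiveAvgPool2d(1))     # one token per frame
        self.gps_enc = nn.Linear(2, d_model)                        # one token per GPS reading
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, num_beams)

    def forward(self, images, gps):              # images: (B, T, 3, H, W), gps: (B, T, 2)
        b, t = images.shape[:2]
        img_tok = self.image_enc(images.flatten(0, 1)).flatten(1).view(b, t, -1)
        tokens = torch.cat([img_tok, self.gps_enc(gps)], dim=1)     # modality/time token sequence
        return self.head(self.mixer(tokens).mean(dim=1))            # beam-index logits

def focal_loss(logits, target, gamma=2.0):
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)                                             # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

logits = BeamPredictor()(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 5, 2))
print(focal_loss(logits, torch.randint(0, 64, (2,))))
```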
8

Xu, Yifan, Huapeng Wei, Minxuan Lin, Yingying Deng, Kekai Sheng, Mengdan Zhang, Fan Tang, Weiming Dong, Feiyue Huang, and Changsheng Xu. "Transformers in computational visual media: A survey." Computational Visual Media 8, no. 1 (October 27, 2021): 33–62. http://dx.doi.org/10.1007/s41095-021-0247-3.

Abstract:
Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
9

Zhong, Enmin, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, and Narciso García. "Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer." Sensors 23, no. 16 (August 10, 2023): 7066. http://dx.doi.org/10.3390/s23167066.

Abstract:
Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. The primary challenge lies in achieving real-time recognition while managing temporal dependencies that can impact performance. Existing methods employ 3D convolutional or Transformer-based architectures with hand skeleton estimation, but both have limitations. To address these challenges, a hybrid approach that combines 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed. The method involves using a 3D-CNN to compute high-level semantic skeleton embeddings, capturing local spatial and temporal characteristics of hand gestures. A Transformer network with a self-attention mechanism is then employed to efficiently capture long-range temporal dependencies in the skeleton sequence. Evaluation of the Briareo and Multimodal Hand Gesture datasets resulted in accuracy scores of 95.49% and 97.25%, respectively. Notably, this approach achieves real-time performance using a standard CPU, distinguishing it from methods that require specialized GPUs. The hybrid approach’s real-time efficiency and high accuracy demonstrate its superiority over existing state-of-the-art methods. In summary, the hybrid 3D-CNN and Transformer approach effectively addresses real-time recognition challenges and efficient handling of temporal dependencies, outperforming existing methods in both accuracy and speed.
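A rough sketch of the hybrid idea, a small 3D convolution for local spatio-temporal skeleton embeddings followed by a Transformer encoder for long-range temporal dependencies, is shown below; joint counts, dimensions, and class numbers are illustrative assumptions, not the 3D-Jointsformer implementation.

```python
# Sketch under assumptions: a 3D convolution computes local spatio-temporal embeddings of the
# hand-skeleton sequence, and a Transformer encoder then models long-range temporal
# dependencies before classifying the gesture.
import torch
import torch.nn as nn

class SkeletonGestureNet(nn.Module):
    def __init__(self, n_joints=21, d_model=96, n_classes=12):
        super().__init__()
        # joint coordinates laid out as (B, channels=3, frames, joints, 1) for Conv3d
        self.embed = nn.Conv3d(3, d_model, kernel_size=(3, n_joints, 1), padding=(1, 0, 0))
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.cls = nn.Linear(d_model, n_classes)

    def forward(self, skeleton):                              # (B, 3, T, J, 1)
        x = self.embed(skeleton).squeeze(-1).squeeze(-1)      # (B, d_model, T) frame embeddings
        x = self.temporal(x.transpose(1, 2))                  # self-attention over the T frames
        return self.cls(x.mean(dim=1))                        # gesture logits

logits = SkeletonGestureNet()(torch.randn(2, 3, 30, 21, 1))   # 30 frames of 21 hand joints
print(logits.shape)  # torch.Size([2, 12])
```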
10

Nia, Zahra Movahedi, Ali Ahmadi, Bruce Mellado, Jianhong Wu, James Orbinski, Ali Asgary, and Jude D. Kong. "Twitter-based gender recognition using transformers." Mathematical Biosciences and Engineering 20, no. 9 (2023): 15957–77. http://dx.doi.org/10.3934/mbe.2023711.

Abstract:
Social media contains useful information about people and society that could help advance research in many different areas of health (e.g. by applying opinion mining, emotion/sentiment analysis and statistical analysis) such as mental health, health surveillance, socio-economic inequality and gender vulnerability. User demographics provide rich information that could help study the subject further. However, user demographics such as gender are considered private and are not freely available. In this study, we propose a model based on transformers to predict the user's gender from their images and tweets. The image-based classification model is trained in two different methods: using the profile image of the user and using various image contents posted by the user on Twitter. For the first method a Twitter gender recognition dataset, publicly available on Kaggle and for the second method the PAN-18 dataset is used. Several transformer models, i.e. vision transformers (ViT), LeViT and Swin Transformer are fine-tuned for both of the image datasets and then compared. Next, different transformer models, namely, bidirectional encoders representations from transformers (BERT), RoBERTa and ELECTRA are fine-tuned to recognize the user's gender by their tweets. This is highly beneficial, because not all users provide an image that indicates their gender. The gender of such users could be detected from their tweets. The significance of the image and text classification models were evaluated using the Mann-Whitney U test. Finally, the combination model improved the accuracy of image and text classification models by 11.73 and 5.26% for the Kaggle dataset and by 8.55 and 9.8% for the PAN-18 dataset, respectively. This shows that the image and text classification models are capable of complementing each other by providing additional information to one another. Our overall multimodal method has an accuracy of 88.11% for the Kaggle and 89.24% for the PAN-18 dataset and outperforms state-of-the-art models. Our work benefits research that critically require user demographic information such as gender to further analyze and study social media content for health-related issues.
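The paper combines separately fine-tuned image and text classifiers; one common way to do such late fusion is to average their class probabilities, as in the hedged sketch below. The weighting rule is an assumption for illustration, since the exact combination scheme is not reproduced here.

```python
# Hedged illustration of combining separately fine-tuned image and text classifiers by
# averaging their class probabilities (late fusion). The paper's exact combination rule is
# not reproduced here, so the weighting below is only an assumption.
import torch

def fuse_predictions(image_logits: torch.Tensor, text_logits: torch.Tensor,
                     image_weight: float = 0.5) -> torch.Tensor:
    """Weighted average of per-model class probabilities."""
    p_img = image_logits.softmax(dim=-1)
    p_txt = text_logits.softmax(dim=-1)
    return image_weight * p_img + (1.0 - image_weight) * p_txt

# e.g. ViT-style logits for the profile image and BERT-style logits for the tweets
fused = fuse_predictions(torch.randn(4, 2), torch.randn(4, 2))
predicted_class = fused.argmax(dim=-1)    # one gender label per user
print(predicted_class)
```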

Dissertations / Theses on the topic "Transformers Multimodaux"

1

Vazquez Rodriguez, Juan Fernando. "Transformateurs multimodaux pour la reconnaissance des émotions." Electronic Thesis or Diss., Université Grenoble Alpes, 2023. http://www.theses.fr/2023GRALM057.

Abstract:
Mental health and emotional well-being have significant influence on physical health, and are especially important for healthy aging. Continued progress on sensors and microelectronics has provided a number of new technologies that can be deployed in homes and used to monitor health and well-being. These can be combined with recent advances in machine learning to provide services that enhance the physical and emotional well-being of individuals to promote healthy aging. In this context, an automatic emotion recognition system can provide a tool to help assure the emotional well-being of frail people. Therefore, it is desirable to develop a technology that can draw information about human emotions from multiple sensor modalities and can be trained without the need for large labeled training datasets. This thesis addresses the problem of emotion recognition using the different types of signals that a smart environment may provide, such as visual, audio, and physiological signals. To do this, we develop different models based on the Transformer architecture, which has useful characteristics such as its capacity to model long-range dependencies and to discern the relevant parts of the input. We first propose a model to recognize emotions from individual physiological signals. We propose a self-supervised pre-training technique that uses unlabeled physiological signals, showing that this pre-training technique helps the model to perform better. This approach is then extended to take advantage of the complementarity of information that may exist in different physiological signals. For this, we develop a model that combines different physiological signals and also uses self-supervised pre-training to improve its performance. We propose a method for pre-training that does not require a dataset with the complete set of target signals, but can rather be trained on individual datasets from each target signal. To further take advantage of the different modalities that a smart environment may provide, we also propose a model that uses as inputs multimodal signals such as video, audio, and physiological signals. Since these signals are of a different nature, they cover different ways in which emotions are expressed, thus they should provide complementary information concerning emotions, and therefore it is appealing to use them together. However, in real-world scenarios, there might be cases where a modality is missing. Our model is flexible enough to continue working when a modality is missing, albeit with a reduction in its performance. To address this problem, we propose a training strategy that reduces the drop in performance when a modality is missing. The methods developed in this thesis are evaluated using several datasets, obtaining results that demonstrate the effectiveness of our approach to pre-train Transformers to recognize emotions from physiological signals. The results also show the efficacy of our Transformer-based solution to aggregate multimodal information, and to accommodate missing modalities. These results demonstrate the feasibility of the proposed approaches to recognizing emotions from multiple environmental sensors. This opens new avenues for deeper exploration of using Transformer-based approaches to process information from environmental sensors and allows the development of emotion recognition technologies robust to missing modalities. The results of this work can contribute to better care for the mental health of frail people.
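One widely used training strategy for tolerating missing modalities, which may differ from the exact method developed in the thesis, is to randomly drop whole modalities during training and substitute a learned placeholder; the sketch below illustrates that idea with assumed dimensions and an assumed number of emotion classes.

```python
# Illustrative sketch only: "modality dropout" randomly removes whole modalities during
# training and replaces them with a learned placeholder token, so the model learns to cope
# with missing inputs. The thesis's exact training strategy may differ.
import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    def __init__(self, d_model=128, n_modalities=3, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.missing = nn.Parameter(torch.zeros(n_modalities, d_model))   # placeholder tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, 7)                   # e.g. 7 emotion classes (assumed)

    def forward(self, modality_tokens):                     # list of (B, d_model) summaries
        tokens = torch.stack(modality_tokens, dim=1)        # (B, n_modalities, d_model)
        if self.training:                                   # simulate missing modalities
            drop = torch.rand(tokens.shape[:2], device=tokens.device) < self.p_drop
            tokens = torch.where(drop.unsqueeze(-1), self.missing.expand_as(tokens), tokens)
        return self.head(self.encoder(tokens).mean(dim=1))

model = RobustFusion()
video, audio, physio = (torch.randn(4, 128) for _ in range(3))
print(model([video, audio, physio]).shape)   # torch.Size([4, 7])
```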
2

Greco, Claudio. "Transfer Learning and Attention Mechanisms in a Multimodal Setting." Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/341874.

Abstract:
Humans are able to develop a solid knowledge of the world around them: they can leverage information coming from different sources (e.g., language, vision), focus on the most relevant information from the input they receive in a given life situation, and exploit what they have learned before without forgetting it. In the field of Artificial Intelligence and Computational Linguistics, replicating these human abilities in artificial models is a major challenge. Recently, models based on pre-training and on attention mechanisms, namely pre-trained multimodal Transformers, have been developed. They seem to perform tasks surprisingly well compared to other computational models in multiple contexts. They simulate a human-like cognition in that they supposedly rely on previously acquired knowledge (transfer learning) and focus on the most important information (attention mechanisms) of the input. Nevertheless, we still do not know whether these models can deal with multimodal tasks that require merging different types of information simultaneously to be solved, as humans would do. This thesis attempts to fill this crucial gap in our knowledge of multimodal models by investigating the ability of pre-trained Transformers to encode multimodal information; and the ability of attention-based models to remember how to deal with previously-solved tasks. With regards to pre-trained Transformers, we focused on their ability to rely on pre-training and on attention while dealing with tasks requiring to merge information coming from language and vision. More precisely, we investigate if pre-trained multimodal Transformers are able to understand the internal structure of a dialogue (e.g., organization of the turns); to effectively solve complex spatial questions requiring to process different spatial elements (e.g., regions of the image, proximity between elements, etc.); and to make predictions based on complementary multimodal cues (e.g., guessing the most plausible action by leveraging the content of a sentence and of an image). The results of this thesis indicate that pre-trained Transformers outperform other models. Indeed, they are able to some extent to integrate complementary multimodal information; they manage to pinpoint both the relevant turns in a dialogue and the most important regions in an image. These results suggest that pre-training and attention play a key role in pre-trained Transformers’ encoding. Nevertheless, their way of processing information cannot be considered as human-like. Indeed, when compared to humans, they struggle (as non-pre-trained models do) to understand negative answers, to merge spatial information in difficult questions, and to predict actions based on complementary linguistic and visual cues. With regards to attention-based models, we found out that these kinds of models tend to forget what they have learned in previously-solved tasks. However, training these models on easy tasks before more complex ones seems to mitigate this catastrophic forgetting phenomenon. These results indicate that, at least in this context, attention-based models (and, supposedly, pre-trained Transformers too) are sensitive to tasks’ order. A better control of this variable may therefore help multimodal models learn sequentially and continuously as humans do.
3

Mills, Kathy Ann. "Multiliteracies : a critical ethnography : pedagogy, power, discourse and access to multiliteracies." Thesis, Queensland University of Technology, 2006. https://eprints.qut.edu.au/16244/1/Kathy_Mills_Thesis.pdf.

Abstract:
The multiliteracies pedagogy of the New London Group is a response to the emergence of new literacies and changing forms of meaning-making in contemporary contexts of increased cultural and linguistic diversity. This critical ethnographic research investigates the interactions between pedagogy, power, discourses, and differential access to multiliteracies, among a group of culturally and linguistically diverse learners in a mainstream Australian classroom. The study documents the way in which a teacher enacted the multiliteracies pedagogy through a series of mediabased lessons with her year six (aged 11-12 years) class. The reporting of this research is timely because the multiliteracies pedagogy has become a key feature of Australian educational policy initiatives and syllabus requirements. The methodology of this study was based on Carspecken's critical ethnography. This method includes five stages: Stage One involved eighteen days of observational data collection over the course of ten weeks in the classroom. The multiliteracies lessons aimed to enable learners to collaboratively design a claymation movie. Stage Two was the initial analysis of data, including verbatim transcribing, coding, and applying analytic tools to the data. Stage Three involved semi-structured, forty-five minute interviews with the principal, teacher, and four culturally and linguistically diverse students. In Stages Four and Five, the results of micro-level data analysis were compared with macro-level phenomena using structuration theory and extant literature about access to multiliteracies. The key finding was that students' access to multiliteracies differed among the culturally and linguistically diverse group. Existing degrees of access were reproduced, based on the learners' relation to the dominant culture. In the context of the media-based lessons in which students designed claymation movies, students from Anglo-Australian, middle-class backgrounds had greater access to transformed designing than those who were culturally marginalised. These experiences were mediated by pedagogy, power, and discourses in the classroom, which were in turn influenced by the agency of individuals. The individuals were both enabled and constrained by structures of power within the school and the wider educational and social systems. Recommendations arising from the study were provided for teachers, principals, policy makers and researchers who seek to monitor and facilitate the success of the multiliteracies pedagogy in culturally and linguistically diverse educational contexts.

Book chapters on the topic "Transformers Multimodaux"

1

Revanur, Ambareesh, Ananyananda Dasari, Conrad S. Tucker, and László A. Jeni. "Instantaneous Physiological Estimation Using Video Transformers." In Multimodal AI in Healthcare, 307–19. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-14771-5_22.

2

Kant, Yash, Dhruv Batra, Peter Anderson, Alexander Schwing, Devi Parikh, Jiasen Lu, and Harsh Agrawal. "Spatially Aware Multimodal Transformers for TextVQA." In Computer Vision – ECCV 2020, 715–32. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58545-7_41.

3

Mojtahedi, Ramtin, Mohammad Hamghalam, Richard K. G. Do, and Amber L. Simpson. "Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation." In Multiscale Multimodal Medical Imaging, 110–20. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-18814-5_11.

4

Sun, Zhengxiao, Feiyu Chen, and Jie Shao. "Synesthesia Transformer with Contrastive Multimodal Learning." In Neural Information Processing, 431–42. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-30105-6_36.

5

Ramesh, Krithik, and Yun Sing Koh. "Investigation of Explainability Techniques for Multimodal Transformers." In Communications in Computer and Information Science, 90–98. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-8746-5_7.

6

Xie, Long-Fei, and Xu-Yao Zhang. "Gate-Fusion Transformer for Multimodal Sentiment Analysis." In Pattern Recognition and Artificial Intelligence, 28–40. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-59830-3_3.

7

Wang, Wenxuan, Chen Chen, Meng Ding, Hong Yu, Sen Zha, and Jiangyun Li. "TransBTS: Multimodal Brain Tumor Segmentation Using Transformer." In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, 109–19. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-87193-2_11.

8

Liu, Dan, Wei Song, and Xiaobing Zhao. "Pedestrian Attribute Recognition Based on Multimodal Transformer." In Pattern Recognition and Computer Vision, 422–33. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8429-9_34.

9

Reyes, Abel A., Sidike Paheding, Makarand Deo, and Michel Audette. "Gabor Filter-Embedded U-Net with Transformer-Based Encoding for Biomedical Image Segmentation." In Multiscale Multimodal Medical Imaging, 76–88. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-18814-5_8.

10

Santhirasekaram, Ainkaran, Karen Pinto, Mathias Winkler, Eric Aboagye, Ben Glocker, and Andrea Rockall. "Multi-scale Hybrid Transformer Networks: Application to Prostate Disease Classification." In Multimodal Learning for Clinical Decision Support, 12–21. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-89847-2_2.


Conference papers on the topic "Transformers Multimodaux"

1

Yao, Shaowei, and Xiaojun Wan. "Multimodal Transformer for Multimodal Machine Translation." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.acl-main.400.

2

Tang, Jiajia, Kang Li, Ming Hou, Xuanyu Jin, Wanzeng Kong, Yu Ding, and Qibin Zhao. "MMT: Multi-way Multi-modal Transformer for Multimodal Learning." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/480.

Abstract:
The heart of multimodal learning research lies in the challenge of effectively exploiting fusion representations among multiple modalities. However, existing two-way cross-modality unidirectional attention could only exploit the intermodal interactions from one source to one target modality. This indeed fails to unleash the complete expressive power of multimodal fusion with a restricted number of modalities and fixed interactive direction. In this work, the multiway multimodal transformer (MMT) is proposed to simultaneously explore multiway multimodal intercorrelations for each modality via a single block rather than multiple stacked cross-modality blocks. The core idea of MMT is the multiway multimodal attention, where the multiple modalities are leveraged to compute the multiway attention tensor. This naturally benefits us to exploit comprehensive many-to-many multimodal interactive paths. Specifically, the multiway tensor is comprised of multiple interconnected modality-aware core tensors that consist of the intramodal interactions. Additionally, the tensor contraction operation is utilized to investigate intermodal dependencies between distinct core tensors. Essentially, our tensor-based multiway structure allows for easily extending MMT to the case associated with an arbitrary number of modalities. Taking MMT as the basis, the hierarchical network is further established to recursively transmit the low-level multiway multimodal interactions to high-level ones. The experiments demonstrate that MMT can achieve state-of-the-art or comparable performance.
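The multiway attention tensor can be pictured with a toy einsum-based contraction over token triples from three modalities, as in the rough sketch below; this is only an intuition-level assumption of how such a tensor might be formed and contracted, not the MMT implementation.

```python
# Very rough, intuition-level sketch of a multiway attention tensor: scores are computed
# jointly over token triples from three modalities with einsum contractions, rather than
# with pairwise cross-attention. An assumption for illustration, not the MMT code.
import torch

B, La, Lb, Lc, d = 2, 4, 5, 6, 32
a = torch.randn(B, La, d)   # e.g. text tokens
b = torch.randn(B, Lb, d)   # e.g. audio tokens
c = torch.randn(B, Lc, d)   # e.g. visual tokens

# Three-way interaction tensor: one score for every (text, audio, visual) token triple.
scores = torch.einsum("bid,bjd,bkd->bijk", a, b, c) / d ** 0.5
weights = scores.flatten(1).softmax(dim=-1).view(B, La, Lb, Lc)

# Contract the attention tensor with the audio and visual tokens to update the text tokens.
a_updated = torch.einsum("bijk,bjd,bkd->bid", weights, b, c)
print(a_updated.shape)  # torch.Size([2, 4, 32])
```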
3

Parthasarathy, Srinivas, and Shiva Sundaram. "Detecting Expressions with Multimodal Transformers." In 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021. http://dx.doi.org/10.1109/slt48900.2021.9383573.

4

Chua, Watson W. K., Lu Li, and Alvina Goh. "Classifying Multimodal Data Using Transformers." In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3534678.3542634.

5

Tsai, Yao-Hung Hubert, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. "Multimodal Transformer for Unaligned Multimodal Language Sequences." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/p19-1656.

6

He, Xuehai, and Xin Wang. "Multimodal Graph Transformer for Multimodal Question Answering." In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.eacl-main.15.

7

Jin, Tao, Siyu Huang, Ming Chen, Yingming Li, and Zhongfei Zhang. "SBAT: Video Captioning with Sparse Boundary-Aware Transformer." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/88.

Abstract:
In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer is proposed for uni-modal language generation task such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs boundary-aware pooling operation for scores from multihead attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.
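The boundary-aware selection idea can be approximated by keeping the frames whose features change most between consecutive time steps, as in the toy sketch below; the scoring and selection rules are assumptions for illustration, not the SBAT method itself.

```python
# Toy version of the boundary-aware idea, under assumptions: keep only the frames whose
# features change the most between consecutive time steps, so the transformer attends over a
# sparse, scene-boundary-like subset of the sequence. Not the SBAT method itself.
import torch

def select_boundary_frames(features: torch.Tensor, keep: int) -> torch.Tensor:
    """features: (T, d) per-frame features; returns indices of the `keep` most changed frames."""
    diffs = (features[1:] - features[:-1]).norm(dim=-1)       # change between neighbouring frames
    boundary_scores = torch.cat([diffs.new_zeros(1), diffs])  # the first frame scores zero
    return boundary_scores.topk(keep).indices.sort().values   # keep temporal order

frames = torch.randn(40, 512)                  # e.g. 40 frames of CNN features
sparse_idx = select_boundary_frames(frames, keep=8)
sparse_frames = frames[sparse_idx]             # fed to the transformer instead of all 40 frames
print(sparse_idx.tolist())
```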
8

Wang, Yikai, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, and Yunhe Wang. "Multimodal Token Fusion for Vision Transformers." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.01187.

9

Tang, Wenzhuo, Hongzhi Wen, Renming Liu, Jiayuan Ding, Wei Jin, Yuying Xie, Hui Liu, and Jiliang Tang. "Single-Cell Multimodal Prediction via Transformers." In CIKM '23: The 32nd ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3583780.3615061.

10

Liu, Yicheng, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. "Multimodal Motion Prediction with Stacked Transformers." In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021. http://dx.doi.org/10.1109/cvpr46437.2021.00749.
