A selection of scholarly literature on the topic "Apprentissage multi-Modal"
Browse lists of current articles, books, dissertations, conference abstracts, and other scholarly sources on the topic "Apprentissage multi-Modal".
Dissertations on the topic "Apprentissage multi-Modal":
Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.
The number of images on the Internet is increasing dramatically. It is becoming critically important to develop technology for precise and automatic understanding of visual content. As image recognition systems become more and more relevant, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly evaluated experimentally on standard datasets, and the results are competitive with the literature.
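The idea of a tractable factorization of a bilinear fusion can be illustrated with a minimal sketch. It assumes a simple rank-R (CP-style) factorization, which is one common choice and not necessarily the decomposition used in the thesis. All names (`low_rank_bilinear_fusion`, `w_q`, `w_v`, `w_o`) are hypothetical. The second function builds the full interaction tensor explicitly, only to show that both forms compute the same output.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def low_rank_bilinear_fusion(q, v, w_q, w_v, w_o):
    """Fuse question vector q and image vector v with a rank-R factorization.

    w_q: R x len(q), w_v: R x len(v), w_o: K x R (all hypothetical parameters).
    Cost is linear in len(q) + len(v) instead of len(q) * len(v).
    """
    q_proj = matvec(w_q, q)                           # project question to R dims
    v_proj = matvec(w_v, v)                           # project image to R dims
    joint = [a * b for a, b in zip(q_proj, v_proj)]   # element-wise interaction
    return matvec(w_o, joint)                         # map to K output dims


def full_bilinear_fusion(q, v, w_q, w_v, w_o):
    """Same computation with the explicit tensor T[k][i][j] (for checking only)."""
    rank = len(w_q)
    output = []
    for k in range(len(w_o)):
        total = 0.0
        for i, q_i in enumerate(q):
            for j, v_j in enumerate(v):
                t_kij = sum(w_o[k][r] * w_q[r][i] * w_v[r][j] for r in range(rank))
                total += t_kij * q_i * v_j
        output.append(total)
    return output


def random_matrix(rows, cols, rng):
    """Small random matrix used only for the demonstration."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
```

With question dimension 4, image dimension 5, rank 3 and 2 outputs, the factorized form stores 4·3 + 5·3 + 2·3 = 33 parameters, while the full tensor would need 2·4·5 = 40. The gap grows quickly with realistic feature sizes.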
Michel, Fabrice. "Multi-Modal Similarity Learning for 3D Deformable Registration of Medical Images." Phd thesis, Ecole Centrale Paris, 2013. http://tel.archives-ouvertes.fr/tel-01005141.
Tahoun, Mohamed. "Object Shape Perception for Autonomous Dexterous Manipulation Based on Multi-Modal Learning Models." Electronic Thesis or Diss., Bourges, INSA Centre Val de Loire, 2021. http://www.theses.fr/2021ISAB0003.
This thesis proposes 3D object reconstruction methods based on multimodal deep learning strategies. The targeted applications concern robotic manipulation. First, the thesis proposes a 3D visual reconstruction method from a single view of the object obtained by an RGB-D sensor. Then, in order to improve the quality of 3D reconstruction of objects from a single view, a new method combining visual and tactile information is proposed, based on a learned reconstruction model. The proposed method has been validated on a visual-tactile dataset that respects the kinematic constraints of a multi-fingered robotic hand. This dataset was created in the framework of this PhD work; it is unique in the literature and is also a contribution of the thesis. The validation results show that tactile information can contribute substantially to predicting the complete shape of an object, especially the part that is not visible to the RGB-D sensor. They also show that the proposed model yields better results than the best-performing state-of-the-art methods.
Zhang, Yifei. "Real-time multimodal semantic scene understanding for autonomous UGV navigation." Thesis, Bourgogne Franche-Comté, 2021. http://www.theses.fr/2021UBFCK002.
Robust semantic scene understanding is challenging due to complex object types, as well as environmental changes caused by varying illumination and weather conditions. This thesis studies the problem of deep semantic segmentation with multimodal image inputs. Multimodal images captured from various sensory modalities provide complementary information for complete scene understanding. We provide effective solutions for fully-supervised multimodal image segmentation and few-shot semantic segmentation of outdoor road scenes. Regarding the former case, we propose a multi-level fusion network to integrate RGB and polarimetric images. A central fusion framework is also introduced to adaptively learn the joint representations of modality-specific features and reduce model uncertainty via statistical post-processing. In the case of semi-supervised semantic scene understanding, we first propose a novel few-shot segmentation method based on the prototypical network, which employs multiscale feature enhancement and the attention mechanism. We then extend the RGB-centric algorithms to take advantage of supplementary depth cues. Comprehensive empirical evaluations on different benchmark datasets show that all the proposed algorithms achieve superior accuracy and demonstrate the effectiveness of complementary modalities for outdoor scene understanding in autonomous navigation.
Zambra, Matteo. "Méthodes IA multimodales dans des contextes d’observation océanographique et de surveillance maritime multi-capteurs hétérogènes." Electronic Thesis or Diss., Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2024. http://www.theses.fr/2024IMTA0391.
The aim of this thesis is to study the simultaneous use of heterogeneous ocean datasets to improve the performance of predictive models used in scientific and operational fields for the simulation and analysis of the ocean and marine environment. Two distinct case studies were explored in the course of the thesis work. The first study focuses on the local estimation of wind speed at the sea surface from underwater soundscape measurements and atmospheric model products. The second study considers the spatial extension of the problem and the use of observations at different scales and spatial resolutions, from pseudo-observations simulating satellite images to time series measured by in-situ infrastructures. The recurring theme of these investigations is the multi-modality of the data fed into the model, that is, to what extent and how the predictive model can benefit from the use of spatio-temporally heterogeneous information channels. The preferred methodological tool is a simulation system based on variational data assimilation and deep learning concepts.
Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, achieving an average accuracy rate of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge here lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced, which incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets, such as MSVD-QA and MSRVTT-QA, validate the proposed methodology, with an average accuracy of 45% and 42%, respectively, which represents notable improvements over existing approaches. The third contribution of this research addresses the multimodal video captioning problem, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions, considering the temporal dynamics of video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The utilization of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The research validates the proposed methodology through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291 and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques to automate television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
Aissa, Wafa. "Réseaux de modules neuronaux pour un raisonnement visuel compositionnel." Electronic Thesis or Diss., Paris, HESAM, 2023. http://www.theses.fr/2023HESAC033.
The context of this PhD thesis is compositional visual reasoning. When presented with an image and a question pair, our objective is to have neural network models answer the question by following a reasoning chain defined by a program. We assess the model's reasoning ability through a Visual Question Answering (VQA) setup. Compositional VQA breaks down complex questions into easier modular sub-problems. These sub-problems include reasoning skills such as object and attribute detection, relation detection, logical operations, counting, and comparisons. Each sub-problem is assigned to a different module. This approach discourages shortcuts, demanding an explicit understanding of the problem. It also promotes transparency and explainability. Neural module networks (NMN) are used to enable compositional reasoning. The approach is based on a generator-executor framework: the generator learns the translation of the question into its functional program, and the executor instantiates a neural module network where each function is assigned to a specific module. We also design a catalog of neural modules and define the function and the structure of each module. The training and evaluations are conducted using the pre-processed GQA dataset [3], which includes natural language questions, functional programs representing the reasoning chain, images, and corresponding answers. The research contributions revolve around the establishment of an NMN framework for the VQA task. One primary contribution involves the integration of vision and language pre-trained (VLP) representations into modular VQA. This integration serves as a "warm-start" mechanism for initializing the reasoning process. The experiments demonstrate that cross-modal vision and language representations outperform uni-modal ones. This utilization enables the capture of intricate relationships within each individual modality while also facilitating alignment between different modalities, consequently enhancing the overall accuracy of our NMN. Moreover, we explore various training techniques to enhance the learning process and improve cost-efficiency. In addition to optimizing the modules within the reasoning chain to collaboratively produce accurate answers, we introduce a teacher-guidance approach to optimize the intermediate modules in the reasoning chain. This ensures that these modules perform their specific reasoning sub-tasks without taking shortcuts or compromising the integrity of the reasoning process. We propose and implement several teacher-guidance techniques, one of which draws inspiration from the teacher-forcing method commonly used in sequential models. Comparative analyses demonstrate the advantages of our teacher-guidance approach for NMNs, as detailed in our paper [1]. We also introduce a novel Curriculum Learning (CL) strategy tailored for NMNs to reorganize the training examples and define a start-small training strategy. We begin by learning simpler programs and progressively increase the complexity of the training programs. We use several difficulty criteria to define the CL approach. Our findings demonstrate that by selecting the appropriate CL method, we can significantly reduce the training cost and required training data, with only a limited impact on the final VQA accuracy. This significant contribution forms the core of our paper [2].
[1] W. Aissa, M. Ferecatu, and M. Crucianu. Curriculum learning for compositional visual reasoning. In Proceedings of VISIGRAPP 2023, Volume 5: VISAPP, 2023.
[2] W. Aissa, M. Ferecatu, and M. Crucianu. Multimodal representations for teacher-guided compositional visual reasoning. In Advanced Concepts for Intelligent Vision Systems, 21st International Conference (ACIVS 2023). Springer International Publishing, 2023.
[3] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019.
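The start-small curriculum described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: difficulty is taken to be program length (the abstract mentions several criteria without listing them), the training pool grows cumulatively from stage to stage, and the names `program_length`, `curriculum_stages`, and `toy_examples` as well as the toy data are hypothetical.

```python
def program_length(example):
    """Assumed difficulty criterion: number of steps in the functional program."""
    return len(example["program"])


def curriculum_stages(examples, num_stages, difficulty=program_length):
    """Split examples into cumulative training stages, easiest first.

    Stage s contains roughly the easiest s/num_stages fraction of the data,
    so the final stage is the full training set.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(examples, key=difficulty)  # stable sort keeps ties in order
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = max(1, (len(ordered) * stage) // num_stages)
        stages.append(ordered[:cutoff])
    return stages


# Toy question/program pairs (illustrative only, not taken from GQA).
toy_examples = [
    {"question": "q0", "program": ["select", "relate", "filter", "query"]},
    {"question": "q1", "program": ["select"]},
    {"question": "q2", "program": ["select", "query"]},
    {"question": "q3", "program": ["select", "filter", "relate", "select", "verify"]},
    {"question": "q4", "program": ["select", "exist"]},
    {"question": "q5", "program": ["select", "filter", "query"]},
]

for index, stage in enumerate(curriculum_stages(toy_examples, num_stages=3), start=1):
    print(index, [example["question"] for example in stage])
```

A training loop would run a few epochs on each stage in turn. Swapping the `difficulty` argument is enough to try another criterion.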
Robert, Damien. "Efficient learning on large-scale 3D point clouds." Electronic Thesis or Diss., Université Gustave Eiffel, 2024. http://www.theses.fr/2024UEFL2003.
For the past decade, deep learning has been driving progress in the automated understanding of complex data structures as diverse as text, image, audio, and video. In particular, transformer-based models and self-supervised learning have recently ignited a global competition to learn expressive textual and visual representations by training the largest possible model on Internet-scale datasets, with the help of massive computational resources. This thesis takes a different path, by proposing resource-efficient deep learning methods for the analysis of large-scale 3D point clouds. The efficiency of the introduced approaches comes in various flavors: fast training, few parameters, small compute or memory footprint, and leveraging realistically-available data. In doing so, we strive to devise solutions that can be used by researchers and practitioners with minimal hardware requirements. We first introduce a 3D semantic segmentation model which combines the efficiency of superpoint-based methods with the expressivity of transformers. We build a hierarchical data representation which drastically reduces the size of the 3D point cloud parsing problem, facilitating the processing of large point clouds en masse. Our self-attentive network proves to match or even surpass state-of-the-art approaches on a range of sensors and acquisition environments, while boasting orders of magnitude fewer parameters, faster training, and swift inference. We then build upon this framework to tackle panoptic segmentation of large-scale point clouds. Existing instance and panoptic segmentation methods need to solve a complex matching problem between predicted and ground truth instances for computing their supervision loss. Instead, we frame this task as a scalable graph clustering problem, which a small network is trained to address from local objectives only, without computing the actual object instances at train time. Our lightweight model can process ten-million-point scenes at once on a single GPU in a few seconds, opening the door to 3D panoptic segmentation at unprecedented scales. Finally, we propose to exploit the complementarity of image and point cloud modalities to enhance 3D scene understanding. We place ourselves in a realistic acquisition setting where multiple arbitrarily-located images observe the same scene, with potential occlusions. Unlike previous 2D-3D fusion approaches, we learn to select information from various views of the same object based on their respective observation conditions: camera-to-object distance, occlusion rate, optical distortion, etc. Our efficient implementation achieves state-of-the-art results both in indoor and outdoor settings, with minimal requirements: raw point clouds, arbitrarily-positioned images, and their camera poses. Overall, this thesis upholds the principle that in data-scarce regimes, exploiting the structure of the problem unlocks both efficient and performant architectures.
Yang, Yingyu. "Analyse automatique de la fonction cardiaque par intelligence artificielle : approche multimodale pour un dispositif d'échocardiographie portable." Electronic Thesis or Diss., Université Côte d'Azur, 2023. http://www.theses.fr/2023COAZ4107.
According to the 2023 annual report of the World Heart Federation, cardiovascular diseases (CVD) accounted for nearly one third of all global deaths in 2021. Compared to high-income countries, more than 80% of CVD deaths occurred in low- and middle-income countries. The inequitable distribution of CVD diagnosis and treatment resources remains unresolved. In the face of this challenge, affordable point-of-care ultrasound (POCUS) devices demonstrate significant potential to improve the diagnosis of CVDs. Furthermore, by taking advantage of artificial intelligence (AI)-based tools, POCUS enables non-experts to help, thus largely improving access to care, especially in less-served regions. The objective of this thesis is to develop robust and automatic algorithms to analyse cardiac function for POCUS devices, with a focus on echocardiography (ECHO) and electrocardiogram (ECG). Our first goal is to obtain explainable cardiac features from each modality separately. Our second goal is to explore a multi-modal approach by combining ECHO and ECG data. We start by presenting two novel deep learning (DL) frameworks for echocardiography segmentation and motion estimation tasks, respectively. By incorporating shape and motion priors into DL models, we demonstrate through extensive experiments that such priors help improve accuracy and generalise well to different unseen datasets. Furthermore, we are able to extract left ventricle ejection fraction (LVEF), global longitudinal strain (GLS) and other useful indices for myocardial infarction (MI) detection. Next, we propose an explainable DL model for unsupervised electrocardiogram decomposition. This model can extract interpretable information related to different ECG subwaves without manual annotation. We further apply those parameters to a linear classifier for myocardial infarction detection, which showed good generalisation across different datasets. Finally, we combine data from both modalities for trustworthy multi-modal classification. Our approach employs decision-level fusion with uncertainty, allowing training with unpaired multi-modal data. We further evaluate the trained model using paired multi-modal data, showcasing the potential of multi-modal MI detection to surpass that from a single modality. Overall, our proposed robust and generalisable algorithms for ECHO and ECG analysis demonstrate significant potential for portable cardiac function analysis. We anticipate that our novel framework could be further validated using real-world portable devices. We envision that such advanced integrative tools may significantly contribute towards better identification of CVD patients.
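Decision-level fusion with uncertainty, as mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes inverse-variance weighting of per-modality predictions, which is a generic rule chosen here for illustration and not necessarily the thesis's method. The function name `fuse_decisions` and the numbers in the usage example are hypothetical. Because fusion only touches the outputs, each modality's classifier can be trained on its own, unpaired, data.

```python
def fuse_decisions(predictions):
    """Fuse per-modality predictions by inverse-variance weighting.

    predictions: list of (probability, variance) pairs, one per available
    modality, e.g. one from an ECHO model and one from an ECG model.
    Returns (fused_probability, fused_variance).
    """
    if not predictions:
        raise ValueError("at least one modality prediction is required")
    epsilon = 1e-8  # avoids division by zero for overconfident models
    weights = [1.0 / (variance + epsilon) for _, variance in predictions]
    total_weight = sum(weights)
    fused_probability = sum(
        weight * probability
        for weight, (probability, _) in zip(weights, predictions)
    ) / total_weight
    fused_variance = 1.0 / total_weight  # never larger than the smallest input
    return fused_probability, fused_variance


# Hypothetical outputs: a confident ECHO model and a less certain ECG model.
echo_prediction = (0.9, 0.01)
ecg_prediction = (0.4, 0.09)
print(fuse_decisions([echo_prediction, ecg_prediction]))
```

The more certain modality dominates the fused decision, and a sample with a single available modality falls back to that modality's own prediction.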
Liu, Li. "Modélisation pour la reconnaissance continue de la langue française parlée complétée à l'aide de méthodes avancées d'apprentissage automatique." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAT057/document.
This PhD thesis deals with automatic continuous Cued Speech (CS) recognition based on images of subjects without any artificial landmarks. To achieve this objective, we extract high-level features from three information flows (lips, hand positions and shapes), and find an optimal approach to merging them for a robust CS recognition system. We first introduce a novel and powerful deep learning method based on Convolutional Neural Networks (CNNs) for extracting the hand shape and lips features from raw images. Adaptive background mixture models (ABMMs) are also applied to obtain the hand position features for the first time. Meanwhile, based on the Constrained Local Neural Fields (CLNF) method, an advanced machine learning approach, we propose the Modified CLNF to extract the inner lips parameters (A and B), as well as another method named adaptive ellipse model. All these methods make significant contributions to feature extraction in CS. Then, due to the asynchrony of the three feature flows (i.e., lips, hand shape and hand position) in CS, their fusion is a challenging issue. To resolve it, we propose several approaches including feature-level and model-level fusion strategies combined with the context-dependent HMM. To achieve CS recognition, we propose three tandem CNNs-HMM architectures with different fusion types. All these architectures are evaluated on the corpus without any artifice, and the CS recognition performance confirms the efficiency of our proposed methods. The result is comparable with the state of the art using the corpus with artifices. In parallel, we conduct a specific study of the temporal organization of hand movements in CS, especially its temporal segmentation, and the evaluations confirm the superior performance of our methods. In summary, this PhD thesis applies advanced machine learning methods to computer vision, and deep learning methodologies to CS recognition, which makes a significant step towards the general problem of automatic conversion of CS to sound. Future work will mainly focus on an end-to-end CNN-RNN system which incorporates a language model, and an attention mechanism for the multi-modal fusion.