Dissertations / Theses on the topic 'Deep multi-Modal learning'




Consult the top 16 dissertations / theses for your research on the topic 'Deep multi-Modal learning.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Feng, Xue. "Multi-modal and deep learning for robust speech recognition." Ph.D. thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113999.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 105-115).
Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, system performance can degrade dramatically when noise and reverberation are present. In this thesis, speech denoising and model adaptation for robust speech recognition were studied, and four novel methods were introduced to improve ASR robustness. First, we developed an ASR system using multi-channel information from microphone arrays via accurate speaker tracking with Kalman filtering and subsequent beamforming. The system was evaluated on the publicly available Reverb Challenge corpus, and placed second (out of 49 submitted systems) in the recognition task on real data. Second, we explored a speech feature denoising and dereverberation method via deep denoising autoencoders (DDA). The method was evaluated on the CHiME2-WSJ0 corpus and achieved a 16% to 25% absolute improvement in word error rate (WER) compared to the baseline. Third, we developed a method to incorporate heterogeneous multi-modal data with a deep neural network (DNN) based acoustic model. Our experiments on a noisy vehicle-based speech corpus demonstrated that WERs can be reduced by 6.3% relative to the baseline system. Finally, we explored the use of a low-dimensional environmentally-aware feature derived from the total acoustic variability space. Two extraction methods are presented: one via linear discriminant analysis (LDA) projection, and the other via a bottleneck deep neural network (BN-DNN). Our evaluations showed that by adapting ASR systems with the proposed feature, ASR performance was significantly improved. We also demonstrated that the proposed feature yielded promising results on environment identification tasks.
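To make the feature-denoising idea concrete, here is a minimal sketch, assuming a PyTorch implementation, of a deep denoising autoencoder that maps a window of noisy speech features to the clean centre frame. The layer sizes, context width, and training details are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a window of noisy speech feature frames to the clean centre frame."""
    def __init__(self, feat_dim=40, context=5, hidden=1024):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)   # stacked context window
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.decoder = nn.Linear(hidden, feat_dim)  # predict the clean centre frame

    def forward(self, noisy_window):
        return self.decoder(self.encoder(noisy_window))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One hypothetical training step on random stand-in data
noisy = torch.randn(32, 40 * 11)   # batch of stacked noisy frames (11-frame window)
clean = torch.randn(32, 40)        # corresponding clean centre frames
loss = loss_fn(model(noisy), clean)
loss.backward()
optimizer.step()
```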
2

Mali, Shruti Atul. "Multi-Modal Learning for Abdominal Organ Segmentation." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285866.

Abstract:
Deep learning techniques are widely used across various medical imaging applications. However, they are often fine-tuned for a specific modality and do not generalize well to new modalities or datasets. One of the main reasons for this is large data variation; for example, the dynamic range of intensity values differs greatly across multi-modal images. The goal of the project is to develop a method for multi-modal learning that segments the liver from Computed Tomography (CT) images and abdominal organs from Magnetic Resonance (MR) images using deep learning techniques. In this project, a self-supervised approach is adapted to attain domain adaptation across images while retaining important 3D information from medical images, using a simple 3D-UNet with a few auxiliary tasks. The method comprises two main steps: representation learning via self-supervised learning (pre-training) and fully supervised learning (fine-tuning). Pre-training uses a 3D-UNet as a base model along with some auxiliary data-augmentation tasks to learn representations through texture, geometry and appearance. The second step fine-tunes the same network, without the auxiliary tasks, to perform the segmentation tasks on CT and MR images. Annotations for all organs are not available in both modalities; thus the first step is used to learn a general representation from both image modalities, while the second step fine-tunes the representations to the available annotations of each modality. Results for each modality were submitted to an online evaluation, which reported Dice scores: the highest Dice score was 0.966 for CT liver prediction and 0.7 for MR abdominal segmentation. This project shows the potential of combining self-supervised and fully supervised approaches to achieve the desired results.
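The two-step recipe, self-supervised pre-training with auxiliary tasks followed by supervised fine-tuning, can be sketched roughly as below. The tiny 3D network, the reconstruction-style auxiliary task, and all shapes are stand-in assumptions; the thesis uses a full 3D-UNet and its own auxiliary tasks.

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Heavily simplified stand-in for the 3D-UNet backbone."""
    def __init__(self, out_channels):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv3d(32, out_channels, 1)

    def forward(self, x):
        return self.dec(self.enc(x))

# Step 1: self-supervised pre-training on an auxiliary task
# (here, reconstructing the original volume from a corrupted copy).
backbone = TinyUNet3D(out_channels=1)
opt = torch.optim.Adam(backbone.parameters(), lr=1e-3)
volume = torch.randn(2, 1, 32, 32, 32)              # CT or MR patches
corrupted = volume + 0.1 * torch.randn_like(volume)
recon_loss = nn.functional.mse_loss(backbone(corrupted), volume)
recon_loss.backward()
opt.step()

# Step 2: supervised fine-tuning for segmentation, reusing the encoder weights.
seg_model = TinyUNet3D(out_channels=5)               # e.g. background + 4 organs
seg_model.enc.load_state_dict(backbone.enc.state_dict())
labels = torch.randint(0, 5, (2, 32, 32, 32))
seg_loss = nn.functional.cross_entropy(seg_model(volume), labels)
```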
3

Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.

Abstract:
The quantity of images that populate the Internet is increasing dramatically, and it is becoming critically important to develop technology for precise and automatic understanding of visual content. As image recognition systems become more and more capable, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural-language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them, and their answers. Typical approaches to this difficult problem rely on modern deep learning techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of the parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is produced by focusing only on the relevant image regions. In the last part, we move beyond the attention mechanism and build a more advanced scene-understanding architecture in which we consider objects and their spatial and semantic relations. All models are thoroughly evaluated on standard datasets and obtain results competitive with the literature.
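The bilinear fusion with tensor factorisation mentioned above can be illustrated with a Tucker-style sketch: both modalities are projected to a low rank and mixed through a small core tensor. The dimensions and parameterisation below are illustrative assumptions, not the thesis's models.

```python
import torch
import torch.nn as nn

class FactorizedBilinearFusion(nn.Module):
    """Tucker-style bilinear fusion: project question and image features,
    mix them through a small core tensor, then score candidate answers."""
    def __init__(self, q_dim=2400, v_dim=2048, rank=64, n_answers=3000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank)
        self.v_proj = nn.Linear(v_dim, rank)
        self.core = nn.Parameter(torch.randn(rank, rank, rank) * 0.01)
        self.out = nn.Linear(rank, n_answers)

    def forward(self, q, v):
        q, v = self.q_proj(q), self.v_proj(v)
        # z_k = sum_ij core[i, j, k] * q_i * v_j  (full bilinear interaction)
        z = torch.einsum('bi,bj,ijk->bk', q, v, self.core)
        return self.out(torch.tanh(z))

fusion = FactorizedBilinearFusion()
scores = fusion(torch.randn(4, 2400), torch.randn(4, 2048))  # question, image features
print(scores.shape)  # torch.Size([4, 3000])
```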
4

Ahmedt Aristizabal, David Esteban. "Multi-modal analysis for the automatic evaluation of epilepsy." Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/132537/1/David_Ahmedt%20Aristizabal_Thesis.pdf.

Abstract:
Motion-recognition technology is proposed to support neurologists in studying patients' behaviour during epileptic seizures. Such a system can provide clues about the sub-type of epilepsy a patient has, identify unusual manifestations that require further investigation, and improve understanding of the temporal evolution of seizures from onset through to termination. Incorporating quantitative methods would assist in developing and formulating a diagnosis in situations where clinical expertise is unavailable. This research provides important supplementary and unbiased data to assist with seizure localization, and is a vital complementary resource in the era of seizure detection based on electrophysiological data.
5

Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.

Abstract:
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing, as it allows efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The proposed approach employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera-motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for grounded multimodal learning. The challenge lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced that incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets such as MSVD-QA and MSRVTT-QA validate the proposed methodology, with average accuracies of 45% and 42%, respectively, representing notable improvements over existing approaches. The third contribution addresses the multimodal video captioning problem, another critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions by considering the temporal dynamics of the video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The use of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The methodology is validated through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE and METEOR scores of 0.4408, 0.6291 and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep-learning-based techniques for automating television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
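The cross-modality module described for the VideoQA contribution can be pictured as a cross-correlation block in which video features are conditioned on text and text features on video. The snippet below is a rough, hypothetical rendering of that idea, not the thesis implementation.

```python
import torch
import torch.nn as nn

class CrossModalityModule(nn.Module):
    """Reciprocal cross-correlation between video frame features and text token features."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, video, text):
        # video: (B, Tv, D) frame features; text: (B, Tt, D) token features
        corr = torch.einsum('bvd,btd->bvt', self.video_proj(video),
                            self.text_proj(text))          # cross-correlation map
        v2t = torch.softmax(corr, dim=-1)                   # video positions attend to text
        t2v = torch.softmax(corr.transpose(1, 2), dim=-1)   # text tokens attend to video
        text_aware_video = video + torch.bmm(v2t, text)     # text-conditioned visual features
        video_aware_text = text + torch.bmm(t2v, video)     # video-conditioned textual features
        return text_aware_video, video_aware_text

module = CrossModalityModule()
v_out, t_out = module(torch.randn(2, 20, 256), torch.randn(2, 12, 256))
print(v_out.shape, t_out.shape)  # torch.Size([2, 20, 256]) torch.Size([2, 12, 256])
```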
6

Tahoun, Mohamed. "Object Shape Perception for Autonomous Dexterous Manipulation Based on Multi-Modal Learning Models." Electronic Thesis or Diss., Bourges, INSA Centre Val de Loire, 2021. http://www.theses.fr/2021ISAB0003.

Abstract:
This thesis proposes 3D object reconstruction methods based on multimodal deep learning strategies, with target applications in robotic manipulation. First, the thesis proposes a 3D visual reconstruction method from a single view of the object obtained by an RGB-D sensor. Then, in order to improve the quality of 3D reconstruction from a single view, a new method combining visual and tactile information is proposed, based on a learned reconstruction model. The proposed method has been validated on a visuo-tactile dataset respecting the kinematic constraints of a multi-fingered robotic hand. This dataset was created within the framework of this PhD work; it is unique in the literature and is itself a contribution of the thesis. The validation results show that tactile information can contribute significantly to predicting the complete shape of an object, especially the part that is not visible to the RGB-D sensor. They also show that the proposed model obtains better results than the best-performing state-of-the-art methods.
7

Dickens, James. "Depth-Aware Deep Learning Networks for Object Detection and Image Segmentation." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42619.

Abstract:
The rise of convolutional neural networks (CNNs) in computer vision has occurred in tandem with the advancement of depth-sensing technology. Depth cameras yield two-dimensional arrays storing, at each pixel, the distance from the sensor to objects and surfaces in the scene, aligned with a regular color image; together these form so-called RGBD images. Inspired by prior models in the literature, this work develops a suite of RGBD CNN models to tackle the challenging tasks of object detection, instance segmentation, and semantic segmentation. Prominent architectures for object detection and image segmentation are modified to incorporate dual-backbone approaches that take RGB and depth images as input, combining features from both modalities through novel fusion modules. For each task, the models developed are competitive with state-of-the-art RGBD architectures. In particular, the proposed RGBD object detection approach achieves 53.5% mAP on the SUN RGBD 19-class object detection benchmark, while the proposed RGBD semantic segmentation architecture yields 69.4% accuracy on the SUN RGBD 37-class semantic segmentation benchmark. An original 13-class RGBD instance segmentation benchmark is introduced for the SUN RGBD dataset, on which the proposed model achieves 38.4% mAP. Additionally, an original depth-aware panoptic segmentation model is developed, trained, and tested on new benchmarks conceived for the NYUDv2 and SUN RGBD datasets. These benchmarks offer researchers a baseline for RGBD panoptic segmentation on these datasets, where the novel depth-aware model outperforms a comparable RGB counterpart.
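The dual-backbone idea, separate RGB and depth encoders whose feature maps are combined by a fusion module, can be sketched as follows; the tiny backbones and the gated element-wise fusion are simplified stand-ins for the full detection and segmentation architectures used in the thesis.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

class DualBackboneFusion(nn.Module):
    """RGB and depth are encoded separately, then fused with a learned gate."""
    def __init__(self, channels=64):
        super().__init__()
        self.rgb_backbone = conv_block(3, channels)
        self.depth_backbone = conv_block(1, channels)
        # Gated fusion: learn how much depth to mix in at each spatial position.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        fr, fd = self.rgb_backbone(rgb), self.depth_backbone(depth)
        g = self.gate(torch.cat([fr, fd], dim=1))
        return fr + g * fd   # fused features fed to a detection/segmentation head

model = DualBackboneFusion()
fused = model(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print(fused.shape)  # torch.Size([1, 64, 112, 112])
```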
8

Husseini, Orabi Ahmed. "Multi-Modal Technology for User Interface Analysis including Mental State Detection and Eye Tracking Analysis." Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/36451.

Abstract:
We present a set of easy-to-use methods and tools to analyze human attention, behaviour, and physiological responses. A potential application of our work is evaluating user interfaces as they are used in a natural manner. Our approach is designed to be scalable and to work remotely on regular personal computers using inexpensive and noninvasive equipment. The data sources our tool processes are nonintrusive and captured from video, namely eye tracking and facial expressions; for video data retrieval, we use a basic webcam. We investigate combinations of observation modalities to detect and extract affective and mental states. Our tool provides a pipeline-based approach that 1) collects observational data, 2) incorporates and synchronizes the signal modalities mentioned above, 3) detects users' affective and mental states, 4) records user interaction with applications and pinpoints the parts of the screen users are looking at, and 5) analyzes and visualizes the results. We describe the design, implementation, and validation of a novel multimodal signal fusion engine, the Deep Temporal Credence Network (DTCN). The engine uses deep neural networks to provide a generative and probabilistic inference model and to handle multimodal data such that its performance does not degrade when some modalities are absent. We report the recognition accuracy of basic emotions for each modality, and then evaluate our engine on recognizing the six basic emotions and six mental states: agreeing, concentrating, disagreeing, interested, thinking, and unsure. Our principal contributions include 1) the implementation of a multimodal signal fusion engine, 2) real-time recognition of affective and primary mental states from nonintrusive and inexpensive modalities, and 3) novel mental-state-based visualization techniques: 3D heatmaps, 3D scanpaths, and widget heatmaps that find parts of the user interface where users are perhaps unsure, annoyed, frustrated, or satisfied.
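The abstract describes the DTCN only at a high level, so the sketch below shows one generic way a fusion network can tolerate a missing modality, by substituting a zero embedding for the absent stream; this is an illustrative assumption, not the engine's actual mechanism.

```python
import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    """Fuses eye-tracking and facial-expression features and tolerates a
    missing modality by substituting a zero embedding for it."""
    def __init__(self, eye_dim=32, face_dim=128, hidden=64, n_states=6):
        super().__init__()
        self.hidden = hidden
        self.eye_net = nn.Sequential(nn.Linear(eye_dim, hidden), nn.ReLU())
        self.face_net = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, n_states)

    def forward(self, eye=None, face=None, batch_size=1):
        zeros = torch.zeros(batch_size, self.hidden)
        e = self.eye_net(eye) if eye is not None else zeros
        f = self.face_net(face) if face is not None else zeros
        return self.classifier(torch.cat([e, f], dim=1))

model = RobustFusion()
both = model(eye=torch.randn(4, 32), face=torch.randn(4, 128), batch_size=4)
face_only = model(face=torch.randn(4, 128), batch_size=4)  # eye tracker unavailable
```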
9

Siddiqui, Mohammad Faridul Haque. "A Multi-modal Emotion Recognition Framework Through The Fusion Of Speech With Visible And Infrared Images." University of Toledo / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1556459232937498.

10

Zhang, Yifei. "Real-time multimodal semantic scene understanding for autonomous UGV navigation." Thesis, Bourgogne Franche-Comté, 2021. http://www.theses.fr/2021UBFCK002.

Abstract:
Robust semantic scene understanding is challenging due to complex object types as well as environmental changes caused by varying illumination and weather conditions. This thesis studies the problem of deep semantic segmentation with multimodal image inputs. Multimodal images captured from various sensory modalities provide complementary information for complete scene understanding. We provide effective solutions for fully supervised multimodal image segmentation and for few-shot semantic segmentation of outdoor road scenes. Regarding the former, we propose a multi-level fusion network to integrate RGB and polarimetric images. A central fusion framework is also introduced to adaptively learn joint representations of modality-specific features and to reduce model uncertainty via statistical post-processing. For semi-supervised semantic scene understanding, we first propose a novel few-shot segmentation method based on the prototypical network, which employs multiscale feature enhancement and an attention mechanism. We then extend the RGB-centric algorithms to take advantage of supplementary depth cues. Comprehensive empirical evaluations on different benchmark datasets show that the proposed algorithms achieve superior accuracy and demonstrate the effectiveness of complementary modalities for outdoor scene understanding in autonomous navigation.
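The prototypical-network component of the few-shot part can be illustrated with a minimal sketch: class prototypes are the means of support embeddings, and queries are scored by distance to each prototype. This is the generic prototypical mechanism only, not the thesis's multiscale, attention-augmented segmentation model.

```python
import torch

def prototypical_logits(support, support_labels, queries, n_classes):
    """support: (Ns, D) embeddings, support_labels: (Ns,), queries: (Nq, D).
    Returns (Nq, n_classes) negative-distance logits."""
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])        # (C, D)
    dists = torch.cdist(queries, prototypes)                     # (Nq, C)
    return -dists                                                # closer => higher score

support = torch.randn(10, 128)                                   # embedded support samples
support_labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
queries = torch.randn(5, 128)                                    # embedded query samples
print(prototypical_logits(support, support_labels, queries, n_classes=2).argmax(dim=1))
```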
11

Gehlin, Nils, and Martin Antonsson. "Detecting Non-Natural Objects in a Natural Environment using Generative Adversarial Networks with Stereo Data." Thesis, Linköpings universitet, Datorseende, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166619.

Abstract:
This thesis investigates the use of Generative Adversarial Networks (GANs) for detecting images containing non-natural objects in natural environments, and whether the introduction of stereo data can improve performance. The state-of-the-art GAN-based anomaly detection method presented by A. Berg et al. in [5] (BergGAN) was the basis of this thesis. By modifying BergGAN to accept not only three-channel input but also four- and six-channel input, it was possible to investigate the effect of introducing stereo data into the method. The input to the four-channel network was an RGB image and its corresponding disparity map, and the input to the six-channel network was a stereo pair consisting of two RGB images. The three datasets used in the thesis were constructed from a dataset of aerial video sequences provided by SAAB Dynamics, where the scenes were mostly wooded areas. The datasets were divided into training and validation data, the latter being used for the performance evaluation of the respective networks. The evaluation method suggested in [5] was used: each sample was scored on the likelihood of it containing anomalies, Receiver Operating Characteristic (ROC) analysis was then applied, and the area under the ROC curve was calculated. The results showed that BergGAN was able to detect images containing non-natural objects in natural environments using the dataset provided by SAAB Dynamics. Adapting BergGAN to also accept four and six input channels increased the performance of the method, showing that stereo data contains information relevant for GAN-based anomaly detection. There was, however, no substantial performance difference between the network trained with two RGB images and the one trained with an RGB image and its corresponding disparity map.
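The evaluation protocol, scoring each sample for how likely it is to contain an anomaly and then computing the area under the ROC curve, can be reproduced in a few lines with scikit-learn. The reconstruction-error score below is a simple stand-in for the GAN-based score used in the thesis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_scores(images, reconstructions):
    """Higher score = more likely to contain a non-natural object.
    Here: mean per-image reconstruction error, a stand-in for the GAN score."""
    return np.mean((images - reconstructions) ** 2, axis=(1, 2, 3))

# Stand-in data: 100 images of shape (C, H, W) and their reconstructions.
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 3, 64, 64))
recons = images + rng.normal(scale=0.1, size=images.shape)
labels = rng.integers(0, 2, size=100)            # 1 = contains an anomaly

scores = anomaly_scores(images, recons)
print("ROC AUC:", roc_auc_score(labels, scores))  # area under the ROC curve
```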
12

Liu, Li. "Modélisation pour la reconnaissance continue de la langue française parlée complétée à l'aide de méthodes avancées d'apprentissage automatique." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAT057/document.

Abstract:
This PhD thesis deals with automatic continuous Cued Speech (CS) recognition based on images of subjects, without marking any artificial landmarks. To realize this objective, we extract high-level features from three information flows (lips, hand positions and hand shapes) and seek an optimal approach to merging them for a robust CS recognition system. We first introduce a deep learning method based on Convolutional Neural Networks (CNNs) for extracting hand shape and lip features from raw images. Adaptive background mixture models (ABMMs) are also applied, for the first time, to obtain hand position features. Meanwhile, building on the advanced machine learning method Constrained Local Neural Fields (CLNF), we propose a Modified CLNF to extract the inner lip parameters (A and B), as well as another method named the adaptive ellipse model. All these methods make significant contributions to feature extraction in CS. Then, because the three feature flows (lips, hand shape and hand position) are asynchronous in CS, their fusion is a challenging issue. To resolve it, we propose several approaches, including feature-level and model-level fusion strategies combined with context-dependent HMMs. To achieve CS recognition, we propose three tandem CNN-HMM architectures with different fusion types. All these architectures are evaluated on a corpus of continuous CS sentences recorded without any artifice, and the CS recognition performance confirms the efficiency of our proposed methods. The result is comparable with the state of the art, which used corpora with artifices. In parallel, we carry out a specific study of the temporal organization of hand movements in CS, especially their temporal segmentation, and the evaluations confirm the superior performance of our methods. In summary, this PhD thesis applies advanced machine learning methods from computer vision and deep learning methodologies to CS recognition, which constitutes a significant step toward the general problem of automatic conversion of CS to sound. Future work will mainly focus on an end-to-end CNN-RNN system that incorporates a language model, and on an attention mechanism for the multi-modal fusion.
13

Weiss, Martin. "Deep reinforcement learning for multi-modal embodied navigation." Thesis, 2020. http://hdl.handle.net/1866/25106.

Abstract:
This work focuses on an Outdoor Micro-Navigation (OMN) task in which the goal is to navigate to a specified street address using multiple modalities, including images, scene text, and GPS. This task is a significant challenge to many Blind and Visually Impaired (BVI) people, which we demonstrate through interviews and market research. To investigate the feasibility of solving this task with Deep Reinforcement Learning (DRL), we first introduce two partially observable grid-worlds, Grid-Street and Grid City, containing houses, street numbers, and navigable regions. In these environments, we train an agent to find specific houses using local observations under a variety of training procedures. We parameterize our agent with a neural network and train it using reinforcement learning methods. Next, we introduce the Sidewalk Environment for Visual Navigation (SEVN), which contains panoramic images with labels for house numbers, doors, and street name signs, and formulations for several navigation tasks. In SEVN, we train another neural network model using Proximal Policy Optimization (PPO) to fuse multi-modal observations in the form of variable-resolution images, visible text, and simulated GPS data, and to use this representation to navigate to goal doors. Our best model used all available modalities and was able to navigate to over 100 goals with an 85% success rate. We found that models with access to only a subset of these modalities performed significantly worse, supporting the need for a multi-modal approach to the OMN task. We hope that this thesis provides a foundation for further research into the creation of agents that can assist members of the BVI community to navigate safely.
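The multimodal fusion behind the navigation policy can be sketched as a network that encodes image, scene-text, and GPS observations separately and concatenates the embeddings before actor and critic heads for PPO. This is an illustrative stand-in with assumed dimensions, not the SEVN agent.

```python
import torch
import torch.nn as nn

class MultiModalPolicy(nn.Module):
    """Encodes image, scene-text and GPS observations, then outputs
    action logits (actor) and a value estimate (critic) for PPO."""
    def __init__(self, text_dim=64, gps_dim=2, n_actions=4):
        super().__init__()
        self.image_enc = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                       nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
                                       nn.Flatten())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU())
        self.gps_enc = nn.Sequential(nn.Linear(gps_dim, 16), nn.ReLU())
        fused = 32 * 9 * 9 + 64 + 16      # flattened size for 84x84 input images
        self.actor = nn.Linear(fused, n_actions)
        self.critic = nn.Linear(fused, 1)

    def forward(self, image, text, gps):
        h = torch.cat([self.image_enc(image), self.text_enc(text),
                       self.gps_enc(gps)], dim=1)
        return self.actor(h), self.critic(h)

policy = MultiModalPolicy()
logits, value = policy(torch.randn(2, 3, 84, 84), torch.randn(2, 64), torch.randn(2, 2))
print(logits.shape, value.shape)  # torch.Size([2, 4]) torch.Size([2, 1])
```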
14

Lin, Tzu-Ying (林子盈). "BiodesNet: Discriminative Multi-Modal Deep Learning for RGB-D Gesture and Face Recognition." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/22hu56.

Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 106 (2017-2018)
In recent years, owing to the rapid development of depth sensors and their wide range of application scenarios, using depth images for face and gesture recognition has become an important technology. Depth data provides more information about appearance and object shape, and it is invariant to lighting and color variations. For example, in AR/VR applications, depth information is especially valuable for gesture recognition because of complex lighting changes. Hence, with depth information, both recognition performance and safety can be expected to improve. Although many studies have been conducted, most prior work used hand-crafted features, which do not readily extend to different datasets and require extra effort to extract features from depth images. More recently, with the growth of deep convolutional neural networks, most approaches either treat RGB-D as undifferentiated four-channel data or learn features from color and depth separately, and therefore cannot adequately exploit the differences between the two feature streams. In this work, we propose a CNN-based multi-modal learning framework for RGB-D gesture and face recognition. After training the color and depth networks separately, we apply feature fusion and add our own discriminative and associative loss functions to strengthen the complementarity and discrimination between the two modalities and improve performance. We performed experiments on RGB-D gesture and face datasets. The results show that our multi-modal learning method achieved 97.8% and 99.7% classification accuracy on the ASL Finger Spelling and IIITD Face RGB-D datasets, respectively. In addition, we map a face image with color and its corresponding depth into a feature vector and convert it into a global descriptor. The equal error rate is 5.663% with 256-bit global descriptors. Extracting a global descriptor from an image takes only 0.83 seconds with GPU acceleration, and comparing two 256-bit global descriptors takes just 22.7 microseconds without GPU acceleration.
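The descriptor-matching step, comparing two 256-bit global descriptors, amounts to a Hamming distance between binary codes; a minimal sketch with a hypothetical decision threshold is shown below.

```python
import numpy as np

def hamming_distance(desc_a: np.ndarray, desc_b: np.ndarray) -> int:
    """desc_a, desc_b: arrays of 256 bits (0/1). Returns the number of differing bits."""
    return int(np.count_nonzero(desc_a != desc_b))

rng = np.random.default_rng(0)
enrolled = rng.integers(0, 2, size=256)                 # descriptor stored at enrolment
probe = enrolled.copy()
probe[rng.choice(256, size=10, replace=False)] ^= 1     # flip a few bits (same person, noise)

THRESHOLD = 40   # hypothetical decision threshold, tuned to the desired error rate
print("match" if hamming_distance(enrolled, probe) <= THRESHOLD else "no match")
```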
15

Kumari, K., J. P. Singh, Y. K. Dwivedi, and Nripendra P. Rana. "Towards Cyberbullying-free social media in smart cities: a unified multi-modal approach." 2019. http://hdl.handle.net/10454/18116.

Abstract:
Smart cities are shifting people's presence from the physical world to the cyber world (cyberspace). Along with the benefits for society, the troubles of the physical world, such as bullying, aggression, and hate speech, are also making their presence felt in cyberspace. This paper aims to mine social media posts to identify bullying comments containing text as well as images. We propose a unified representation of text and image together to eliminate the need for separate learning modules for the two. A single-layer Convolutional Neural Network model is used with this unified representation. A major finding of this research is that text represented as an image is a better way to encode the information. We also found that a single-layer Convolutional Neural Network gives better results with the two-dimensional representation. In the current setup, we use three layers of text and three layers of a colour image to represent the input, which gives a recall of 74% on the bullying class with one layer of Convolutional Neural Network.
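The unified representation can be pictured as stacking a rasterised image of the comment text (three channels) with the attached colour image (three channels) into a single six-channel input for a single convolutional layer. The sketch below assumes the text has already been rendered to an image, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class UnifiedBullyingClassifier(nn.Module):
    """Single convolutional layer over a 6-channel input:
    3 channels of rasterised text + 3 channels of the posted colour image."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(32, n_classes)

    def forward(self, text_image, photo):
        x = torch.cat([text_image, photo], dim=1)   # (B, 6, H, W) unified input
        return self.fc(self.conv(x))

model = UnifiedBullyingClassifier()
logits = model(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
print(logits.shape)  # torch.Size([8, 2])
```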
Ministry of Electronics and Information Technology (MeitY), Government of India
16

Sylvain, Tristan. "Locality and compositionality in representation learning for complex visual tasks." Thesis, 2021. http://hdl.handle.net/1866/25594.

Abstract:
The use of deep neural architectures coupled with specific innovations, such as adversarial methods, pre-training on large datasets, and mutual information estimation, has in recent years allowed rapid progress in many complex vision tasks such as zero-shot learning, scene generation, and multi-modal classification. Despite such progress, it is still not clear whether current representation learning methods will be enough to attain human-level performance on arbitrary visual tasks, and if not, what direction future research should take. In this thesis, we focus on two aspects of representations that seem necessary to achieve good downstream performance in representation learning: locality and compositionality. Locality can be understood as a representation's ability to retain local information. This is relevant in many cases and specifically benefits computer vision, where natural images inherently feature local information, e.g. relevant patches of an image or multiple objects present in a scene. On the other hand, a compositional representation can be understood as one that arises from a combination of simpler parts. Convolutional neural networks are inherently compositional, and many complex images can be seen as compositions of relevant sub-components: individual objects and attributes in a scene, and semantic attributes in zero-shot learning, are two examples. We believe both properties hold the key to designing better representation learning methods. In this thesis, we present three articles dealing with locality and/or compositionality and their application to representation learning for complex visual tasks. In the first article, we introduce ways of measuring locality and compositionality for image representations, and demonstrate that local and compositional representations perform better at zero-shot learning. We also use these two notions as the basis for designing class-matching deep info-max, a novel representation learning algorithm that achieves state-of-the-art performance on our proposed "zero-shot from scratch" setting, a harder zero-shot setting in which external information, e.g. pre-training on other image datasets, is not allowed. In the second article, we show that by encouraging a generator to retain local object-level information, using a scene-graph similarity module, we can improve scene generation performance. This model also showcases the importance of compositionality, as many components operate individually on each object present. To fully demonstrate the reach of our approach, we perform a detailed analysis and propose a new framework for evaluating scene generation models. Finally, in the third article, we show that encouraging high mutual information between local and global multi-modal representations of 2D and 3D medical images can lead to improvements in image classification and segmentation. This general framework can be applied to a wide variety of settings and demonstrates the benefits not only of locality, but also of compositionality, as multi-modal representations are combined to obtain a more general one.
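The local-global mutual information idea from the third article can be sketched with a generic InfoNCE-style objective in which each image's local feature vectors are scored against its own global vector (positives) and those of other images in the batch (negatives); this DIM-style sketch is an illustration, not the article's exact objective.

```python
import torch
import torch.nn.functional as F

def local_global_infonce(local_feats, global_feats, temperature=0.1):
    """local_feats: (B, L, D) per-location features; global_feats: (B, D).
    Maximises agreement between each image's locals and its own global vector."""
    local_feats = F.normalize(local_feats, dim=-1)
    global_feats = F.normalize(global_feats, dim=-1)
    # Similarity of every local vector to every global vector: (B, L, B)
    sims = torch.einsum('bld,kd->blk', local_feats, global_feats) / temperature
    B, L, _ = sims.shape
    targets = torch.arange(B).repeat_interleave(L)   # each local's positive is its own image
    return F.cross_entropy(sims.reshape(B * L, B), targets)

loss = local_global_infonce(torch.randn(4, 49, 128), torch.randn(4, 128))
print(loss.item())
```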
