Selection of scholarly literature on the topic "Deep multi-Modal learning"
Cite a source in APA, MLA, Chicago, Harvard, and other citation styles
Table of contents
Consult the lists of current articles, books, dissertations, reports, and other scholarly sources on the topic "Deep multi-Modal learning".
Next to every work in the list of references there is an "Add to bibliography" option. Use it, and the bibliographic reference for the chosen work will be formatted automatically in the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).
You can also download the full text of the scholarly publication as a PDF and read an online annotation of the work, if the relevant parameters are available in the metadata.
Journal articles on the topic "Deep multi-Modal learning"
Shetty D S, Radhika. "Multi-Modal Fusion Techniques in Deep Learning". International Journal of Science and Research (IJSR) 12, no. 9 (September 5, 2023): 526–32. http://dx.doi.org/10.21275/sr23905100554.
Roostaiyan, Seyed Mahdi, Ehsan Imani, and Mahdieh Soleymani Baghshah. "Multi-modal deep distance metric learning". Intelligent Data Analysis 21, no. 6 (November 15, 2017): 1351–69. http://dx.doi.org/10.3233/ida-163196.
Zhu, Xinghui, Liewu Cai, Zhuoyang Zou, and Lei Zhu. "Deep Multi-Semantic Fusion-Based Cross-Modal Hashing". Mathematics 10, no. 3 (January 29, 2022): 430. http://dx.doi.org/10.3390/math10030430.
Du, Lin, Xiong You, Ke Li, Liqiu Meng, Gong Cheng, Liyang Xiong, and Guangxia Wang. "Multi-modal deep learning for landform recognition". ISPRS Journal of Photogrammetry and Remote Sensing 158 (December 2019): 63–75. http://dx.doi.org/10.1016/j.isprsjprs.2019.09.018.
Wang, Wei, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, and Yueting Zhuang. "Effective deep learning-based multi-modal retrieval". VLDB Journal 25, no. 1 (July 19, 2015): 79–101. http://dx.doi.org/10.1007/s00778-015-0391-4.
Jeong, Changhoon, Sung-Eun Jang, Sanghyuck Na, and Juntae Kim. "Korean Tourist Spot Multi-Modal Dataset for Deep Learning Applications". Data 4, no. 4 (October 12, 2019): 139. http://dx.doi.org/10.3390/data4040139.
Yang, Yang, Yi-Feng Wu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang. "Deep Robust Unsupervised Multi-Modal Network". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 5652–59. http://dx.doi.org/10.1609/aaai.v33i01.33015652.
Hua, Yan, Yingyun Yang, and Jianhe Du. "Deep Multi-Modal Metric Learning with Multi-Scale Correlation for Image-Text Retrieval". Electronics 9, no. 3 (March 10, 2020): 466. http://dx.doi.org/10.3390/electronics9030466.
Han, Dong, Hong Nie, Jinbao Chen, Meng Chen, Zhen Deng, and Jianwei Zhang. "Multi-modal haptic image recognition based on deep learning". Sensor Review 38, no. 4 (September 17, 2018): 486–93. http://dx.doi.org/10.1108/sr-08-2017-0160.
Pyrovolakis, Konstantinos, Paraskevi Tzouveli, and Giorgos Stamou. "Multi-Modal Song Mood Detection with Deep Learning". Sensors 22, no. 3 (January 29, 2022): 1065. http://dx.doi.org/10.3390/s22031065.
Dissertations on the topic "Deep multi-Modal learning"
Feng, Xue. "Multi-modal and deep learning for robust speech recognition". Ph.D. thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113999.
Automatic speech recognition (ASR) decodes speech signals into text. While ASR can produce accurate word recognition in clean environments, system performance can degrade dramatically when noise and reverberation are present. In this thesis, speech denoising and model adaptation for robust speech recognition were studied, and four novel methods were introduced to improve ASR robustness. First, we developed an ASR system using multi-channel information from microphone arrays via accurate speaker tracking with Kalman filtering and subsequent beamforming. The system was evaluated on the publicly available Reverb Challenge corpus, and placed second (out of 49 submitted systems) in the recognition task on real data. Second, we explored a speech feature denoising and dereverberation method via deep denoising autoencoders (DDA). The method was evaluated on the CHiME2-WSJ0 corpus and achieved a 16% to 25% absolute improvement in word error rate (WER) compared to the baseline. Third, we developed a method to incorporate heterogeneous multi-modal data with a deep neural network (DNN) based acoustic model. Our experiments on a noisy vehicle-based speech corpus demonstrated that WERs can be reduced by 6.3% relative to the baseline system. Finally, we explored the use of a low-dimensional environmentally-aware feature derived from the total acoustic variability space. Two extraction methods are presented: one via linear discriminant analysis (LDA) projection, and the other via a bottleneck deep neural network (BN-DNN). Our evaluations showed that by adapting ASR systems with the proposed feature, ASR performance was significantly improved. We also demonstrated that the proposed feature yielded promising results on environment identification tasks.
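To make the feature-denoising approach concrete, here is a minimal sketch of a deep denoising autoencoder (DDA) that maps spliced noisy feature frames to clean center frames. PyTorch is assumed, and the feature dimension, context window, and layer widths are illustrative choices, not values taken from the thesis.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a window of noisy/reverberant feature frames to a clean center frame."""
    def __init__(self, feat_dim=40, context=5, hidden=1024):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)  # spliced context window
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.decoder = nn.Linear(hidden, feat_dim)

    def forward(self, noisy_spliced):
        return self.decoder(self.encoder(noisy_spliced))

# Training pairs noisy inputs with parallel clean targets (random here).
model = DenoisingAutoencoder()
noisy = torch.randn(32, 40 * 11)  # batch of spliced noisy frames
clean = torch.randn(32, 40)       # corresponding clean center frames
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
```

Once trained on a parallel noisy/clean corpus such as CHiME2-WSJ0, the denoised features replace the raw ones as input to the recognizer.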
Mali, Shruti Atul. "Multi-Modal Learning for Abdominal Organ Segmentation". Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285866.
Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning". Electronic thesis or dissertation, Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.
Der volle Inhalt der QuelleThe quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek for the next generation vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered as a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developping multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly experimentally evaluated on standard datasets and the results are competitive with the literature
Ahmedt Aristizabal, David Esteban. "Multi-modal analysis for the automatic evaluation of epilepsy". Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/132537/1/David_Ahmedt%20Aristizabal_Thesis.pdf.
Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production". Electronic thesis or dissertation, Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.
Within the dynamic landscape of television content, the need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse.

The first contribution revolves around the classification of camera-motion types. This is a crucial aspect of content indexing, as it allows video content to be efficiently categorized and retrieved based on the visual dynamics it exhibits. The proposed approach employs 3D convolutional neural networks with residual blocks, a technique inspired by action-recognition methods. A semi-automatic approach for constructing a reliable camera-motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy of 94%.

The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for grounded multimodal learning. The challenge lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced that incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets such as MSVD-QA and MSRVTT-QA validate the proposed methodology, with average accuracies of 45% and 42%, respectively, a notable improvement over existing approaches.

The third contribution addresses multimodal video captioning, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions that account for the temporal dynamics of video content. The work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The use of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The methodology is validated through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered.

In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques for automating television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera-motion classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
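For orientation, the following sketch shows the reciprocal-conditioning idea behind such a cross-modality module, here approximated with standard multi-head cross-attention rather than the cross-correlation operation described in the abstract. PyTorch is assumed and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalityModule(nn.Module):
    """Each modality attends to the other: text-conditioned visual features
    and video-conditioned textual features are produced in one pass."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_to_vid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vid_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vid, txt):
        vid_cond, _ = self.txt_to_vid(vid, txt, txt)  # video queries, text keys/values
        txt_cond, _ = self.vid_to_txt(txt, vid, vid)  # text queries, video keys/values
        return vid_cond, txt_cond

module = CrossModalityModule()
video_tokens = torch.randn(2, 32, 256)  # 32 frame features per clip
text_tokens = torch.randn(2, 12, 256)   # 12 token embeddings per question
v_cond, t_cond = module(video_tokens, text_tokens)
```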
Tahoun, Mohamed. "Object Shape Perception for Autonomous Dexterous Manipulation Based on Multi-Modal Learning Models". Electronic thesis or dissertation, Bourges, INSA Centre Val de Loire, 2021. http://www.theses.fr/2021ISAB0003.
This thesis proposes 3D object-reconstruction methods based on multimodal deep learning strategies, targeting applications in robotic manipulation. First, the thesis proposes a method for 3D visual reconstruction from a single view of the object obtained by an RGB-D sensor. Then, to improve the quality of single-view 3D reconstruction, a new method combining visual and tactile information is proposed on the basis of a learned reconstruction model. The proposed method was validated on a visual-tactile dataset that respects the kinematic constraints of a multi-fingered robotic hand and was created in the framework of this PhD work; this dataset is unique in the literature and is itself a contribution of the thesis. The validation results show that tactile information can contribute substantially to predicting the complete shape of an object, especially the part that is not visible to the RGB-D sensor. They also show that the proposed model obtains better results than the best-performing state-of-the-art methods.
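As a rough intuition for visual-tactile fusion ahead of shape completion, the sketch below concatenates a visual embedding of the partial view with a tactile embedding before projecting to a shared latent code. PyTorch is assumed; this simple late-fusion encoder and every dimension in it are hypothetical, not the architecture of the thesis.

```python
import torch
import torch.nn as nn

class VisuoTactileEncoder(nn.Module):
    """Fuses a partial-view visual embedding with tactile contact features
    into one latent code for a downstream shape-completion decoder (omitted)."""
    def __init__(self, vis_dim=512, tac_dim=64, latent=256):
        super().__init__()
        self.vis_mlp = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU())
        self.tac_mlp = nn.Sequential(nn.Linear(tac_dim, 64), nn.ReLU())
        self.fuse = nn.Linear(256 + 64, latent)

    def forward(self, vis_feat, tac_feat):
        fused = torch.cat([self.vis_mlp(vis_feat), self.tac_mlp(tac_feat)], dim=-1)
        return self.fuse(fused)

vis = torch.randn(4, 512)  # embedding of the RGB-D partial view
tac = torch.randn(4, 64)   # embedding of fingertip contact readings
latent = VisuoTactileEncoder()(vis, tac)  # (4, 256)
```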
Dickens, James. "Depth-Aware Deep Learning Networks for Object Detection and Image Segmentation". Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42619.
Husseini Orabi, Ahmed. "Multi-Modal Technology for User Interface Analysis including Mental State Detection and Eye Tracking Analysis". Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/36451.
Siddiqui, Mohammad Faridul Haque. "A Multi-modal Emotion Recognition Framework Through The Fusion Of Speech With Visible And Infrared Images". University of Toledo / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1556459232937498.
Zhang, Yifei. "Real-time multimodal semantic scene understanding for autonomous UGV navigation". Thesis, Bourgogne Franche-Comté, 2021. http://www.theses.fr/2021UBFCK002.
Robust semantic scene understanding is challenging due to complex object types as well as environmental changes caused by varying illumination and weather conditions. This thesis studies the problem of deep semantic segmentation with multimodal image inputs. Multimodal images captured by different sensory modalities provide complementary information for complete scene understanding. We provide effective solutions for fully supervised multimodal image segmentation and for few-shot semantic segmentation of outdoor road scenes. For the former, we propose a multi-level fusion network to integrate RGB and polarimetric images; a central fusion framework is also introduced to adaptively learn joint representations of modality-specific features and to reduce model uncertainty via statistical post-processing. For semi-supervised semantic scene understanding, we first propose a novel few-shot segmentation method based on the prototypical network, which employs multiscale feature enhancement and an attention mechanism; we then extend the RGB-centric algorithms to take advantage of supplementary depth cues. Comprehensive empirical evaluations on different benchmark datasets demonstrate that all the proposed algorithms achieve superior accuracy and confirm the effectiveness of complementary modalities for outdoor scene understanding in autonomous navigation.
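To picture what multi-level fusion of RGB and polarimetric streams can look like, here is a compact sketch that sums modality-specific feature maps at two encoder levels before a segmentation head. PyTorch is assumed; the three-channel polarimetric input, the layer widths, and fusion by summation are assumptions for illustration, not the network proposed in the thesis.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    """Strided conv block: halves spatial resolution."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class MultiLevelFusionSeg(nn.Module):
    """Two parallel encoders fused by summation at each level, then decoded
    to per-pixel class scores."""
    def __init__(self, classes=19):
        super().__init__()
        self.rgb1, self.rgb2 = block(3, 32), block(32, 64)
        self.pol1, self.pol2 = block(3, 32), block(32, 64)
        self.head = nn.Sequential(
            nn.Conv2d(64, classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))

    def forward(self, rgb, pol):
        p1 = self.pol1(pol)
        f1 = self.rgb1(rgb) + p1            # level-1 fusion
        f2 = self.rgb2(f1) + self.pol2(p1)  # level-2 fusion
        return self.head(f2)

net = MultiLevelFusionSeg()
rgb = torch.randn(1, 3, 128, 256)
pol = torch.randn(1, 3, 128, 256)  # e.g. stacked polarization channels
logits = net(rgb, pol)             # (1, 19, 128, 256)
```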
Book chapters on the topic "Deep multi-Modal learning"
Hiriyannaiah, Srinidhi, G. M. Siddesh, and K. G. Srinivasa. "Overview of Deep Learning". In Cloud-based Multi-Modal Information Analytics, 39–55. Boca Raton: Chapman and Hall/CRC, 2023. http://dx.doi.org/10.1201/9781003215974-4.
Hiriyannaiah, Srinidhi, G. M. Siddesh, and K. G. Srinivasa. "Cloud and Deep Learning". In Cloud-based Multi-Modal Information Analytics, 19–38. Boca Raton: Chapman and Hall/CRC, 2023. http://dx.doi.org/10.1201/9781003215974-3.
Hiriyannaiah, Srinidhi, G. M. Siddesh, and K. G. Srinivasa. "Deep Learning Platforms and Cloud". In Cloud-based Multi-Modal Information Analytics, 57–70. Boca Raton: Chapman and Hall/CRC, 2023. http://dx.doi.org/10.1201/9781003215974-5.
Yang, Yang, Yi-Feng Wu, De-Chuan Zhan, and Yuan Jiang. "Deep Multi-modal Learning with Cascade Consensus". In Lecture Notes in Computer Science, 64–72. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-97310-4_8.
Varsavsky, Thomas, Zach Eaton-Rosen, Carole H. Sudre, Parashkev Nachev, and M. Jorge Cardoso. "PIMMS: Permutation Invariant Multi-modal Segmentation". In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 201–9. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-00889-5_23.
Li, Cheng, Hui Sun, Zaiyi Liu, Meiyun Wang, Hairong Zheng, and Shanshan Wang. "Learning Cross-Modal Deep Representations for Multi-Modal MR Image Segmentation". In Lecture Notes in Computer Science, 57–65. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-32245-8_7.
Lin, Yu. "Sentiment Analysis of Painting Based on Deep Learning". In Application of Intelligent Systems in Multi-modal Information Analytics, 651–55. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-51556-0_96.
Yang, Liang, Huajun Wang, and Xiaolin Zhang. "A Deep Learning Method for Salient Object Detection". In Application of Intelligent Systems in Multi-modal Information Analytics, 894–99. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-05484-6_118.
Luo, Yanling, Jiawei Wan, and Shengqin She. "Software Security Vulnerability Mining Based on Deep Learning". In Application of Intelligent Systems in Multi-modal Information Analytics, 536–43. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-05237-8_66.
Zhang, Sen, Changzheng Zhang, Lanjun Wang, Cixing Li, Dandan Tu, Rui Luo, Guojun Qi, and Jiebo Luo. "MSAFusionNet: Multiple Subspace Attention Based Deep Multi-modal Fusion Network". In Machine Learning in Medical Imaging, 54–62. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-32692-0_7.
Der volle Inhalt der QuelleKonferenzberichte zum Thema "Deep multi-Modal learning"
Iyer, Vasanth, Alex J. Aved, Todd B. Howlett, Jeffrey T. Carlo, Asif Mehmood, Niki Pissinou, and S. Sitharama Iyengar. "Fast multi-modal reuse: co-occurrence pre-trained deep learning models". In Real-Time Image Processing and Deep Learning 2019, edited by Nasser Kehtarnavaz and Matthias F. Carlsohn. SPIE, 2019. http://dx.doi.org/10.1117/12.2519546.
Kulkarni, Karthik, Prakash Patil, and Suvarna G. Kanakaraddi. "Multi-Modal Colour Extraction Using Deep Learning Techniques". In 2022 Fourth International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT). IEEE, 2022. http://dx.doi.org/10.1109/icerect56837.2022.10060086.
Müller, K. R., and S. M. Hofmann. "Interpreting Deep Learning Models for Multi-modal Neuroimaging". In 2023 11th International Winter Conference on Brain-Computer Interface (BCI). IEEE, 2023. http://dx.doi.org/10.1109/bci57258.2023.10078502.
Haritha, D., and B. Sandhya. "Multi-modal Medical Data Fusion using Deep Learning". In 2022 9th International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, 2022. http://dx.doi.org/10.23919/indiacom54597.2022.9763296.
You, Bihao, Jiahao Qin, Yitao Xu, Yunfeng Wu, Yize Liu, and Sijia Pan. "Multi-Modal Deep Learning Model for Stock Crises". In 2023 2nd International Conference on Frontiers of Communications, Information System and Data Science (CISDS). IEEE, 2023. http://dx.doi.org/10.1109/cisds61173.2023.00017.
Vijayaraghavan, Prashanth, Soroush Vosoughi, and Deb Roy. "Twitter Demographic Classification Using Deep Multi-modal Multi-task Learning". In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. http://dx.doi.org/10.18653/v1/p17-2076.
Zhang, Xiao, and Xiaoling Liu. "Interference Signal Recognition Based on Multi-Modal Deep Learning". In 2020 7th International Conference on Dependable Systems and Their Applications (DSA). IEEE, 2020. http://dx.doi.org/10.1109/dsa51864.2020.00055.
Liu, Bao-Yun, Yi-Hsin Jen, Shih-Wei Sun, Li Su, and Pao-Chi Chang. "Multi-Modal Deep Learning-Based Violin Bowing Action Recognition". In 2020 IEEE International Conference on Consumer Electronics - Taiwan (ICCE-Taiwan). IEEE, 2020. http://dx.doi.org/10.1109/icce-taiwan49838.2020.9257995.
Huang, Xin, and Yuxin Peng. "Cross-modal deep metric learning with multi-task regularization". In 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017. http://dx.doi.org/10.1109/icme.2017.8019340.
Lam, Genevieve, Huang Dongyan, and Weisi Lin. "Context-aware Deep Learning for Multi-modal Depression Detection". In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. http://dx.doi.org/10.1109/icassp.2019.8683027.