Dissertations / Theses on the topic "Multimodal data processing"

Follow this link to see other types of publications on the topic: Multimodal data processing.

Cite a source in APA, MLA, Chicago, Harvard, and many other styles.


Consult the top 17 dissertations (bachelor's, master's, or doctoral theses) for research on the topic "Multimodal data processing".

Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf file and read the abstract (summary) of the work online if it is available in the metadata.

Browse dissertations from many scientific fields and compile an accurate bibliography.

1

Cadène, Rémi. "Deep Multimodal Learning for Vision and Language Processing". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS277.

Abstract:
Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on large datasets. Advances in both communities built the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants. In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between each region of the image. In the last part, we address the issues related to biases in the modeling. We propose a learning strategy to reduce the language biases which are commonly present in visual question answering systems.
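To make the fusion step concrete: below is a minimal NumPy sketch of a factorized (low-rank) bilinear product between a text feature and a visual feature, the general idea behind fusion modules like the one described above. All dimensions, weights, and names are illustrative, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_v, rank, d_out = 64, 64, 16, 32  # illustrative sizes

# Low-rank factors replacing a full d_q x d_v x d_out bilinear tensor
W_q = rng.normal(size=(d_q, rank))
W_v = rng.normal(size=(d_v, rank))
W_o = rng.normal(size=(rank, d_out))

def factorized_bilinear(q, v):
    """Fuse a question vector q and an image-region vector v.

    Projects both modalities into a shared rank-dimensional space,
    takes an element-wise (Hadamard) product, then projects to the
    output space. This approximates a full bilinear interaction at a
    fraction of the parameter count.
    """
    return ((q @ W_q) * (v @ W_v)) @ W_o

q = rng.normal(size=d_q)   # text feature (e.g., from a question encoder)
v = rng.normal(size=d_v)   # visual feature (e.g., one image region)
print(factorized_bilinear(q, v).shape)  # (32,)
```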
2

Lizarraga, Gabriel M. "A Neuroimaging Web Interface for Data Acquisition, Processing and Visualization of Multimodal Brain Images". FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3855.

Abstract:
Structural and functional brain images are generated as essential modalities for medical experts to learn about the different functions of the brain. These images are typically visually inspected by experts. Many software packages are available to process medical images, but they are complex, difficult to use, and hardware intensive. As a consequence, this dissertation proposes a novel Neuroimaging Web Services Interface (NWSI) as a series of processing pipelines for a common platform to store, process, visualize, and share data. The NWSI system is made up of password-protected interconnected servers accessible through a web interface. The web interface driving the NWSI is based on Drupal, a popular open-source content management system. Drupal provides a user-based platform in which the core code for the security and design tools is updated and patched frequently; new features can be added via modules while keeping the core software secure and intact. The webserver architecture allows for the visualization of results and the downloading of tabulated data. Several forms are available to capture clinical data. The processing pipeline starts with a FreeSurfer (FS) reconstruction of T1-weighted MRI images. Subsequently, PET, DTI, and fMRI images can be uploaded. The webserver captures uploaded images and performs essential functionalities, while processing occurs in supporting servers. The computational platform is responsive and scalable. The current pipeline for PET processing calculates all regional Standardized Uptake Value ratios (SUVRs). The FS and SUVR calculations have been validated against Alzheimer's Disease Neuroimaging Initiative (ADNI) results posted at the Laboratory of Neuro Imaging (LONI). The NWSI system provides access to a calibration process through the centiloid scale, consolidating Florbetapir and Florbetaben tracers in amyloid PET images. The interface also offers onsite access to machine learning algorithms and introduces new heat maps that augment expert visual rating of PET images. NWSI has been piloted using data and expertise from Mount Sinai Medical Center, the 1Florida Alzheimer's Disease Research Center (ADRC), Baptist Health South Florida, Nicklaus Children's Hospital, and the University of Miami. All results were obtained using our processing servers in order to maintain data validity, consistency, and minimal processing bias.
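The regional SUVR computed by the pipeline has a simple definition: the mean tracer uptake in a target region divided by the mean uptake in a reference region. A hedged sketch of that calculation follows; the masks and data are toy placeholders, not NWSI code.

```python
import numpy as np

def suvr(pet, target_mask, reference_mask):
    """Standardized Uptake Value Ratio: mean uptake in the target
    region divided by mean uptake in the reference region."""
    return pet[target_mask].mean() / pet[reference_mask].mean()

# Toy 3-D PET volume with two made-up cubic regions
pet = np.random.default_rng(1).random((16, 16, 16))
target = np.zeros_like(pet, dtype=bool); target[4:8, 4:8, 4:8] = True
reference = np.zeros_like(pet, dtype=bool); reference[10:14, 10:14, 10:14] = True
print(round(suvr(pet, target, reference), 3))
```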
3

Gimenes, Gabriel Perri. "Advanced techniques for graph analysis: a multimodal approach over planetary-scale data". Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-26062015-105026/.

Abstract:
Applications such as electronic commerce, computer networks, social networks, and biology (protein interaction), to name a few, have led to the production of graph-like data on a planetary scale, possibly with millions of nodes and billions of edges. These applications pose challenging problems when the task is to use their data to support decision-making processes by means of non-obvious and potentially useful patterns. In order to process such data for pattern discovery, researchers and practitioners have used distributed processing resources organized in computational clusters. However, building and managing such clusters can be complex, bringing technical and financial issues that can be prohibitive in a variety of scenarios. Alternatively, it is desirable to process large-scale graphs using only one computational node. To do so, we developed processes and algorithms according to three different approaches, building up toward an analytical toolset capable of revealing patterns, supporting comprehension, and helping with the decision-making process over planetary-scale graphs.
4

Rabhi, Sara. "Optimized deep learning-based multimodal method for irregular medical timestamped data". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS003.

Abstract:
The wide adoption of Electronic Health Records in hospitals' information systems has led to the definition of large databases grouping various types of data, such as textual notes, longitudinal medical events, and tabular patient information. However, the records are only filled in during consultations or hospital stays, which depend on the patient's state and local habits. A system that can leverage the different types of data collected at different time scales is critical for reconstructing the patient's health trajectory, analyzing their history, and consequently delivering more adapted care. This thesis work addresses two main challenges of medical data processing: learning to represent the sequence of medical observations with irregular elapsed time between consecutive visits, and optimizing the extraction of medical events from clinical notes. Our main goal is to design a multimodal representation of the patient's health trajectory to solve clinical prediction problems. Our first work built a framework for modeling irregular medical time series to evaluate the importance of considering the time gaps between medical episodes when representing a patient's health trajectory. To that end, we conducted a comparative study of sequential neural networks and irregular time representation techniques. The clinical objective was to predict retinopathy complications for type 1 diabetes patients in the French CaRéDIAB database (Champagne Ardenne Réseau Diabète) using their history of HbA1c measurements. The study results showed that the attention-based model combined with the soft one-hot representation of time gaps led to an AUROC score of 88.65% (specificity of 85.56%, sensitivity of 83.33%), an improvement of 4.3% compared to the LSTM-based model. Motivated by these results, we extended our framework to shorter multivariate time series and predicted in-hospital mortality for critical care patients of the MIMIC-III dataset. The proposed architecture, HiTT, improved the AUC score by 5% over the Transformer baseline. In the second step, we focused on extracting relevant medical information from clinical notes to enrich the patient's health trajectory. In particular, Transformer-based architectures have shown encouraging results in medical information extraction tasks. However, these complex models require a large annotated corpus, a requirement that is hard to meet in the medical field as it necessitates access to private patient data and expert annotators. To reduce the annotation cost, we explored active learning strategies, which have been shown to be effective in tasks such as text classification, information extraction, and speech recognition. In addition to existing methods, we defined a Hybrid Weighted Uncertainty Sampling active learning strategy that takes advantage of the contextual embeddings learned by the Transformer-based approach to measure the representativeness of samples. A simulation using the i2b2-2010 challenge dataset showed that the proposed metric reduces the annotation cost by 70% to achieve the same score as passive learning. Lastly, we combined multivariate medical time series and medical concepts extracted from clinical notes of the MIMIC-III database to train a multimodal Transformer-based architecture. The test results on the in-hospital mortality task showed an improvement of 5.3% when considering the additional text data. This thesis thus contributes to patient health trajectory representation by alleviating the burden of episodic medical records and the manual annotation of free-text notes.
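One plausible reading of the "soft one-hot" representation of time gaps mentioned above is a soft assignment over elapsed-time bins, so that similar gaps receive similar codes. The sketch below illustrates that reading; the bin centers and temperature are illustrative assumptions, not the thesis's exact encoding.

```python
import numpy as np

def soft_one_hot(gap_days, centers, tau=30.0):
    """Encode an elapsed-time gap as a soft assignment over bins.

    Instead of a hard one-hot on the nearest bin, weights decay with
    the distance to each bin center (a softmax over negative
    distances), so a 45-day gap looks similar to a 30-day gap.
    """
    logits = -np.abs(gap_days - centers) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

centers = np.array([7, 30, 90, 180, 365], float)  # illustrative bins (days)
print(soft_one_hot(45.0, centers).round(3))
```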
5

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.

Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. The thesis investigates a number of novel techniques for training and decoding SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques for normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
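As background for the stream-combination issue discussed above: synchronous multistream HMMs commonly combine per-stream state likelihoods with an exponent weight per modality. The snippet below shows only that standard combination rule in the log domain; it is not the thesis's novel normalisation techniques or its FHMM adaptation.

```python
import numpy as np

def combined_log_likelihood(log_b_audio, log_b_video, lam=0.7):
    """Standard multistream combination of state log-likelihoods:
    b(o_a, o_v) = b(o_a)**lam * b(o_v)**(1 - lam), i.e. a weighted
    sum in the log domain. lam is typically lowered when the audio
    stream is noisy so the visual stream carries more weight."""
    return lam * log_b_audio + (1.0 - lam) * log_b_video

# Per-state log-likelihoods for one frame (toy numbers)
log_a = np.array([-4.2, -3.1, -5.0])
log_v = np.array([-3.8, -4.4, -3.5])
print(combined_log_likelihood(log_a, log_v).round(2))
```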
6

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.

Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. The thesis investigates a number of novel techniques for training and decoding SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques for normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
7

Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.

Abstract:
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing, as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy rate of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge here lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced, which incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets, such as MSVD-QA and MSRVTT-QA, validate the proposed methodology, with average accuracies of 45% and 42%, respectively, representing notable improvements over existing approaches. The third contribution of this research addresses the multimodal video captioning problem, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions by considering the temporal dynamics of video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The utilization of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The research validates the proposed methodology through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques to automate television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
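A minimal sketch of one common way to realize the reciprocal text/video conditioning described above, namely bidirectional cross-attention between the two token sets. The shapes are illustrative, and this is a generic mechanism, not the thesis's cross-correlation module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Each query token attends over the other modality's tokens."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

rng = np.random.default_rng(2)
text = rng.normal(size=(8, 64))    # 8 word tokens
video = rng.normal(size=(20, 64))  # 20 frame tokens

text_conditioned_video = cross_attend(video, text)  # video attends to text
video_conditioned_text = cross_attend(text, video)  # text attends to video
print(text_conditioned_video.shape, video_conditioned_text.shape)
```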
8

Bernardi, Dario. "A feasibility study on pairing a smartwatch and a mobile device through multi-modal gestures". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254387.

Abstract:
Pairing is the process of establishing an association between two personal devices. Although such a process is intuitively very simple, achieving a straightforward and secure association is challenging due to several possible attacks and usability-related issues. Indeed, malicious attackers might want to spoof the communication between devices in order to gather sensitive information or harm the devices. Moreover, offering users simple and usable schemes that attain a high level of security remains a major issue. In addition, due to the great diversity of pairing scenarios and equipment, achieving a single, usable, secure association for all possible devices and use cases is simply not possible. In this thesis, we study the feasibility of a novel pairing scheme based on multi-modal gestures, namely, gestures involving drawing supported by accelerometer data. In particular, a user can pair a smart-watch worn on the wrist and a mobile device (e.g., a smart-phone) by simply drawing with a finger on the screen of the device. To this end, we developed mobile applications for the smart-watch and smart-phone to sample and process sensed data in support of a secure commitment-based protocol. Furthermore, we performed experiments to verify whether encoded matching movements show a clear similarity compared to non-matching movements. The results proved that it is feasible to implement such a scheme, which also offers users a natural way to perform secure pairing. This innovative scheme may be adopted by a large number of mobile devices (e.g., smart-watches, smart-phones, tablets) in different scenarios.
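The feasibility question above boils down to whether matching movements score measurably higher than non-matching ones. One standard way to quantify this is the peak normalized cross-correlation between the two sensor streams, sketched here on synthetic signals; the thesis itself wraps such comparisons in a commitment-based protocol, which this sketch does not reproduce.

```python
import numpy as np

def peak_xcorr(a, b):
    """Peak normalized cross-correlation between two 1-D signals;
    close to 1 for matching movements, lower for non-matching ones."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return np.abs(np.correlate(a, b, mode="full")).max()

rng = np.random.default_rng(3)
t = np.linspace(0, 2, 200)
finger = np.sin(2 * np.pi * 1.5 * t)          # touch-speed profile on the phone
wrist = np.sin(2 * np.pi * 1.5 * (t - 0.05))  # same gesture seen by the watch,
wrist += 0.1 * rng.normal(size=t.size)        # slightly delayed and noisy
other = np.sin(2 * np.pi * 0.4 * t)           # an unrelated movement
print(round(peak_xcorr(finger, wrist), 2), round(peak_xcorr(finger, other), 2))
```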
9

Mozaffari, Maaref Mohammad Hamed. "A Real-Time and Automatic Ultrasound-Enhanced Multimodal Second Language Training System: A Deep Learning Approach". Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/40477.

Abstract:
Pronunciation plays a critical role in communicative competence, especially for second language learners. Despite renewed awareness of the importance of articulation, it remains a challenge for instructors to handle the pronunciation needs of language learners. Pedagogical tools for pronunciation teaching and learning are relatively scarce, and traditional instruction, such as listening and repeating, is inefficient. Recently, electronic visual feedback (EVF) systems (e.g., medical ultrasound imaging) have been exploited in new approaches that can be effectively incorporated into a range of teaching and learning contexts. Evaluations of ultrasound-enhanced methods for pronunciation training, such as multimodal methods, have asserted that visualizing the articulatory system as biofeedback to language learners may improve the efficiency of articulation learning. Despite the recent successful use of multimodal techniques for pronunciation training, manual work and human intervention remain inevitable at many stages of those systems. Furthermore, recognizing tongue shape in noisy and low-contrast ultrasound images is challenging, especially for non-expert users in real-time applications. On the other hand, our user study revealed that users could not comfortably perceive the placement of their tongue inside the mouth just by watching pre-recorded videos. Machine learning is a subset of Artificial Intelligence (AI) in which machines learn by experiencing and acquiring skills without human involvement. Inspired by the functionality of the human brain, deep artificial neural networks learn from large amounts of data to perform a task repeatedly. Deep learning-based methods have emerged as the dominant paradigm in many computer vision tasks in recent years. Deep learning methods are powerful in automatically learning a new task and, unlike traditional image processing methods, are capable of dealing with many challenges such as object occlusion, transformation variance, and background artifacts. In this dissertation, we implemented a guided language pronunciation training system that benefits from the strengths of deep learning techniques. Our modular system attempts to provide a fully automatic and real-time language pronunciation training tool using ultrasound-enhanced augmented reality. Qualitative and quantitative assessments indicate an exceptional performance for our system in terms of flexibility, generalization, robustness, and autonomy, outperforming previous techniques. Using our ultrasound-enhanced system, a language learner can observe her/his tongue movements during real-time speech, automatically superimposed on her/his face.
10

Benmoussat, Mohammed Seghir. "Hyperspectral imagery algorithms for the processing of multimodal data : application for metal surface inspection in an industrial context by means of multispectral imagery, infrared thermography and stripe projection techniques". Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM4347/document.

Abstract:
The work presented in this thesis deals with the quality control and inspection of industrial metallic surfaces. The purpose is the generalization and application of hyperspectral imagery methods to multimodal data such as multi-channel optical images and multi-temporal thermographic images. In the first application, data cubes are built from multi-component images to detect surface defects within flat metallic parts. The best performances are obtained with multi-wavelength illuminations in the visible and near-infrared ranges, and detection using the spectral angle mapper with the mean spectrum as a reference. The second application concerns the use of thermographic imaging for the inspection of nuclear metal components to detect surface and subsurface defects. A 1D approach is proposed, based on using the kurtosis to select one principal component (PC) from the first PCs obtained after reducing the original data cube with the principal component analysis (PCA) algorithm. The proposed PCA-1PC method performs well with non-noisy and homogeneous data, while SVD combined with anomaly detection algorithms gives the most consistent results and is quite robust to perturbations such as an inhomogeneous background. Finally, an approach based on fringe analysis and structured light techniques in the case of deflectometric recordings is presented for the inspection of free-form metal surfaces. After determining the parameters describing the sinusoidal stripe patterns, the proposed approach consists in projecting a list of phase-shifted patterns and calculating the corresponding phase images. Defect localization is based on detecting and analyzing the stripes within the phase images.
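The spectral-angle detection reported above as the best performer reduces to measuring the angle between each pixel spectrum and a reference spectrum, here taken as the image mean. A minimal sketch; the data, band count, and threshold are illustrative.

```python
import numpy as np

def spectral_angle_map(cube, reference):
    """Angle (radians) between each pixel spectrum and a reference
    spectrum; large angles flag pixels that deviate from the norm."""
    dots = cube @ reference
    norms = np.linalg.norm(cube, axis=-1) * np.linalg.norm(reference)
    return np.arccos(np.clip(dots / norms, -1.0, 1.0))

rng = np.random.default_rng(4)
cube = rng.random((64, 64, 10)) + 1.0   # toy 10-band image cube
cube[30:34, 30:34, :5] *= 3.0           # synthetic "defect" region
mean_spectrum = cube.reshape(-1, 10).mean(axis=0)
angles = spectral_angle_map(cube, mean_spectrum)
defect_mask = angles > angles.mean() + 3 * angles.std()  # illustrative threshold
print(defect_mask.sum(), "pixels flagged")
```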
11

Neumann, Markus. "Automatic multimodal real-time tracking for image plane alignment in interventional Magnetic Resonance Imaging". PhD thesis, Université de Strasbourg, 2014. http://tel.archives-ouvertes.fr/tel-01038023.

Abstract:
Interventional magnetic resonance imaging (MRI) aims at performing minimally invasive percutaneous interventions, such as tumor ablations and biopsies, under MRI guidance. During such interventions, the acquired MR image planes are typically aligned to the surgical instrument (needle) axis and to surrounding anatomical structures of interest in order to efficiently monitor the advancement of the instrument inside the patient's body in real time. Object tracking inside the MRI scanner is expected to facilitate and accelerate MR-guided interventions by making it possible to automatically align the image planes to the surgical instrument. In this PhD thesis, an image-based workflow is proposed and refined for automatic image plane alignment. An automatic tracking workflow was developed, performing detection and tracking of a passive marker directly in clinical real-time images. This tracking workflow is designed for fully automated image plane alignment, with minimization of tracking-dedicated time. Its main drawback is its inherent dependence on the slow clinical MRI update rate. First, the addition of motion estimation and prediction with a Kalman filter was investigated and improved the workflow's tracking performance. Second, a complementary optical sensor was used for multi-sensor tracking in order to decouple the tracking update rate from the MR image acquisition rate. The performance of the workflow was evaluated with both computer simulations and experiments using an MR-compatible testbed. Results show a high robustness of the multi-sensor tracking approach for dynamic image plane alignment, due to the combination of the individual strengths of each sensor.
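The Kalman-filter motion prediction added to the tracking workflow can be illustrated with a one-dimensional constant-velocity model that bridges the slow MRI update rate. A minimal sketch under those assumptions; the noise values and update period are illustrative, not the thesis implementation.

```python
import numpy as np

dt = 0.5                                 # slow MRI update period (s), illustrative
F = np.array([[1, dt], [0, 1]])          # constant-velocity motion model
H = np.array([[1.0, 0.0]])               # we only measure marker position
Q = 1e-3 * np.eye(2)                     # process noise
R = np.array([[1e-2]])                   # measurement noise

x = np.zeros(2)                          # state: [position, velocity]
P = np.eye(2)

def kalman_step(x, P, z):
    # Predict marker motion between image acquisitions...
    x, P = F @ x, F @ P @ F.T + Q
    # ...then correct with the new image-based marker position z.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.atleast_1d(z) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

for z in [0.0, 0.9, 2.1, 3.0]:           # toy marker positions (mm)
    x, P = kalman_step(x, P, z)
print(x.round(2))                        # estimated [position, velocity]
```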
12

Muliukov, Artem. "Étude croisée des cartes auto-organisatrices et des réseaux de neurones profonds pour l'apprentissage multimodal inspiré du cerveau". Electronic Thesis or Diss., Université Côte d'Azur, 2024. https://intranet-theses.unice.fr/2024COAZ4008.

Abstract:
Cortical plasticity is one of the main features that enable our capability to learn and adapt in our environment. Indeed, the cerebral cortex has the ability to self-organize through two distinct forms of plasticity: structural plasticity and synaptic plasticity. These mechanisms are very likely at the basis of an extremely interesting characteristic of human brain development: multimodal association. The brain uses spatio-temporal correlations between several modalities to structure the data and create sense from observations. Moreover, biological observations show that one modality can activate the internal representation of another modality when the two are correlated. To model such behavior, Edelman and Damasio proposed respectively the Reentry and the Convergence Divergence Zone frameworks, in which bi-directional neural communications can lead to both multimodal fusion (convergence) and inter-modal activation (divergence). Nevertheless, these frameworks do not provide a computational model at the neuron level, and only a few works tackle this issue of bio-inspired multimodal association, which is nevertheless necessary for a complete representation of the environment, especially when targeting autonomous and embedded intelligent systems. In this doctoral project, we propose to pursue the exploration of brain-inspired computational models of self-organization for multimodal unsupervised learning in neuromorphic systems. These neuromorphic architectures derive their energy efficiency from the bio-inspired models they support, and for that reason we only consider learning rules based on local and distributed processing.
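Self-organizing maps are the classic computational building block for this kind of cortical self-organization, relying only on local, distributed updates. Below is a minimal Kohonen update step as a generic sketch; it is not the thesis's multimodal architecture, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
grid = rng.random((8, 8, 3))             # 8x8 map of 3-D weight vectors
ii, jj = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")

def som_update(grid, x, lr=0.1, sigma=1.5):
    """One Kohonen step: find the best-matching unit (BMU), then pull
    its grid neighbours toward the input, weighted by a Gaussian
    neighbourhood over grid distance (a purely local update rule)."""
    bmu = np.unravel_index(np.argmin(((grid - x) ** 2).sum(-1)), grid.shape[:2])
    dist2 = (ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2
    h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
    return grid + lr * h * (x - grid)

for _ in range(100):                     # toy training loop on random inputs
    grid = som_update(grid, rng.random(3))
print(grid.shape)
```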
13

Boscaro, Anthony. "Analyse multimodale et multicritères pour l'expertise et la localisation de défauts dans les composants électriques modernes". Thesis, Bourgogne Franche-Comté, 2017. http://www.theses.fr/2017UBFCK014/document.

Abstract:
The purpose of this manuscript is to present research work addressing the processing of data stemming from defect localization techniques. Since this localization phase is a decisive step in the failure analysis process for submicron circuits, analysts have to harness data coming from light emission and laser probing techniques. Nevertheless, this analysis process is sequential and depends only on the expert's judgment, which leads to an unquantified probability of localization. To solve these issues, a multimodal and multicriteria analysis has been developed, taking advantage of the heterogeneous and complementary nature of light emission and laser probing techniques. This kind of process is based on advanced tools such as signal/image processing and data fusion, the final aim being to provide quantitative and qualitative decision support for the expert. The first part of this manuscript is dedicated to the description of the post-acquisition processing used to enhance 1D and 2D data. Thereafter, the spatio-temporal analysis of laser probing waveforms is tackled. Finally, the last part presents the decision support brought by data fusion, illustrating the fusion method used as well as validation results.
14

"Graph-based approaches for multimodal brain imaging data analysis". Tulane University, 2021.

15

Chen, I.-Wei, and 陳弈暐. "An Integrated Electrocardiography and Photoplethysmography Signal Processing System Based on Ensemble Empirical Mode Decomposition Method for Multimodal Physiological Data Monitoring". Thesis, 2018. http://ndltd.ncl.edu.tw/handle/yk4fna.

16

Fürbach, Radek. "Metody lokalizace rozdílů v různých modálitách malířských děl". Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-328544.

Abstract:
The work focuses on the analysis of paintings to determine the painting techniques used. Specifically, it focuses on the localization of underdrawing by comparing images captured in spectra with different penetration depths. It defines the problems associated with capturing the compared images in different spectra. It specifies methods that determine the dependence between two parts of the spectrum (mainly RGB and IR) and, based on that dependence, approximates the conversion between these two parts of the spectrum (Red spectral component projection, Colour intensity, Weighted average of spectral components, Table conversion, Linear regression, PCA analysis, and Edge decomposition). The work also describes more general problems that complicate the task, such as noise, non-uniform illumination, and the addition of the same type of radiation. These problems are thoroughly analyzed. We design a Calculation of illumination parameters using a neural network, Approximation of illumination by blur, Polynomial approximation of illumination, and TWMJ approximation of illumination for suppressing non-uniform illumination. We define the methods Estimation by edge decomposition and Local least squares to address the addition of the same type of radiation. In addition, we describe the Gaussian filter, the Averaging, Median filter, Conservative...
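The "Approximation of illumination by blur" listed above is commonly implemented by dividing an image by a heavily blurred copy of itself, which flattens slowly varying lighting while preserving fine detail. A sketch under that assumption, with an illustrative sigma and SciPy assumed available; it is not the thesis's exact method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def flatten_illumination(img, sigma=25.0):
    """Estimate slowly varying illumination with a wide Gaussian blur,
    then divide it out to keep only the painting's fine detail."""
    illumination = gaussian_filter(img, sigma) + 1e-6
    flat = img / illumination
    return flat / flat.max()

rng = np.random.default_rng(6)
img = rng.random((128, 128))
img *= np.linspace(0.3, 1.0, 128)[None, :]   # synthetic lighting gradient
print(flatten_illumination(img).shape)
```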
17

Nouri, Golmaei Sara. "Improving the Performance of Clinical Prediction Tasks by using Structured and Unstructured Data combined with a Patient Network". Thesis, 2021. http://dx.doi.org/10.7912/C2/41.

Abstract:
With the increasing availability of Electronic Health Records (EHRs) and advances in deep learning techniques, developing deep predictive models that use EHR data to solve healthcare problems has gained momentum in recent years. The majority of clinical predictive models benefit from structured data in EHRs (e.g., lab measurements and medications). Still, learning clinical outcomes from all possible information sources is one of the main challenges when building predictive models. This work focuses mainly on two sources of information that have been underused by researchers: unstructured data (e.g., clinical notes) and a patient network. We propose a novel hybrid deep learning model, DeepNote-GNN, that integrates clinical notes information and patient network topological structure to improve 30-day hospital readmission prediction. DeepNote-GNN is a robust deep learning framework consisting of two modules: DeepNote and the patient network. DeepNote extracts deep representations of clinical notes using a feature aggregation unit on top of a state-of-the-art Natural Language Processing (NLP) technique, BERT. By exploiting these deep representations, a patient network is built, and a Graph Neural Network (GNN) is trained on the network for hospital readmission prediction. Performance evaluation on the MIMIC-III dataset demonstrates that DeepNote-GNN achieves superior results compared to state-of-the-art baselines on the 30-day hospital readmission task. We extensively analyze the DeepNote-GNN model to illustrate the effectiveness and contribution of each of its components. The model analysis shows that the patient network makes a significant contribution to the overall performance, and that DeepNote-GNN is robust and can consistently perform well on the 30-day readmission prediction task. To evaluate the generalization of the DeepNote and patient network modules to new prediction tasks, we create a multimodal model and train it on structured and unstructured data from the MIMIC-III dataset to predict patient mortality and Length of Stay (LOS). Our proposed multimodal model consists of four components: DeepNote, the patient network, DeepTemporal, and score aggregation. While DeepNote keeps its functionality and extracts representations of clinical notes, we build a DeepTemporal module using a fully connected layer stacked on top of a one-layer Gated Recurrent Unit (GRU) to extract deep representations of temporal signals. Independently of DeepTemporal, we extract feature vectors of temporal signals and use them to build a patient network. Finally, the DeepNote, DeepTemporal, and patient network scores are linearly aggregated to fit the multimodal model on downstream prediction tasks. Our results are highly competitive with the baseline model. The multimodal model analysis reveals that unstructured text data helps prediction more than temporal signals. Moreover, there is no limitation to applying a patient network to structured data; in comparison to the other modules, the patient network makes a more significant contribution to prediction tasks. We believe that our efforts in this work have opened up a new study area that can be used to enhance the performance of clinical predictive models.
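The score-aggregation component described above linearly combines the per-module scores. A minimal sketch with illustrative fixed weights and a sigmoid output; the actual model would fit the weights during training.

```python
import numpy as np

def aggregate(note_score, temporal_score, network_score, w=(0.5, 0.2, 0.3)):
    """Linear aggregation of per-module prediction scores (logits),
    followed by a sigmoid to produce a probability for the task."""
    z = np.dot(w, [note_score, temporal_score, network_score])
    return 1.0 / (1.0 + np.exp(-z))

# Toy module outputs for one patient
print(round(aggregate(1.2, -0.3, 0.8), 3))
```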
