Doctoral dissertations on the topic "Traitement multimodal"
Create a correct reference in APA, MLA, Chicago, Harvard, and many other styles
Consult the 50 best doctoral dissertations for your research on the topic "Traitement multimodal".
An "Add to bibliography" button is available next to each work in the bibliography. Use it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scientific publication in .pdf format and read its abstract online, when the corresponding information is available in the metadata.
Browse doctoral dissertations from a wide range of disciplines and build a suitable bibliography.
Dourlens, Sébastien. "Multimodal interaction semantic architecture for ambient intelligence". Versailles-St Quentin en Yvelines, 2012. http://www.theses.fr/2012VERS0011.
There still exist many fields in which ways remain to be explored to improve human-system interaction. These systems must be able to take advantage of the environment in order to improve interaction, extending the capability of a system (machine or robot) to come closer to the natural language used by human beings. We propose a methodology to solve the multimodal interaction problem, adapted to several contexts, by defining and modelling a distributed architecture relying on W3C standards and web services (semantic agents and input/output services) working in an ambient intelligence environment. This architecture is embedded in a multi-agent system modelling technique. To achieve this goal, we model the environment using a knowledge representation and communication language (EKRL, ontology). The resulting semantic environment model is used in two main semantic inference processes: fusion and fission of events at different levels of abstraction, considered as two context-aware operations. The fusion operation interprets and understands the environment and detects the ongoing scenario. The multimodal fission operation interprets the scenario, divides it into elementary tasks, and executes these tasks, which requires the discovery, selection and composition of appropriate services in the environment. Adaptation to the environmental context is based on a multilevel reinforcement learning technique. The overall fusion and fission architecture is validated within our framework (agents, services, EKRL concentrator) through performance analyses on use cases such as monitoring and assistance in daily activities at home and in town.
Chlaily, Saloua. "Modèle d'interaction et performances du traitement du signal multimodal". Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAT026/document.
The joint processing of multimodal measurements is expected to lead to better performance than that obtained using a single modality or several modalities independently. However, the literature contains examples showing that this is not always true. In this thesis, we analyze, in terms of mutual information and estimation error, different situations of multimodal analysis in order to determine the conditions under which optimal performance is achieved. In the first part, we consider the simple case of two or three modalities, each associated with a noisy measurement of a signal. These modalities are linked through the correlations between the useful parts of the signals and the correlations between the noises. We show that performance improves when the links between the modalities are exploited. In the second part, we study the impact on performance of incorrect assumptions about the links between modalities. We show that such false assumptions degrade performance, which can fall below that achieved using a single modality. In the general case, we model the multiple modalities as a noisy Gaussian channel. We then extend results from the literature by considering the impact of errors on the signal and noise probability densities on the information transmitted by the channel, and we analyze this relationship for a simple two-modality model. Our results notably show the unexpected fact that a double mismatch of the noise and the signal can sometimes compensate for each other and thus lead to very good performance.
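As a quick illustration of the Gaussian setting analyzed in this thesis, the following minimal numpy sketch (a toy example of our own, not code from the work) computes the mutual information I(s; x1, x2) for two noisy measurements of the same signal, with and without noise correlation:

```python
import numpy as np

# Toy two-modality Gaussian model: x1 = s + n1, x2 = s + n2.
# For jointly Gaussian variables, I(s; x) = 0.5 * ln(det(Cov[x]) / det(Cov[x|s])).
def mutual_information(vs, Sn):
    Sx = vs * np.ones((2, 2)) + Sn  # Cov[x] = vs * 11^T + Sn; Cov[x|s] = Sn
    return 0.5 * np.log(np.linalg.det(Sx) / np.linalg.det(Sn))

Sn_indep = np.diag([1.0, 1.0])                # independent noises
Sn_corr = np.array([[1.0, 0.8], [0.8, 1.0]])  # correlated noises
print(mutual_information(1.0, Sn_indep))      # joint processing, independent noises
print(mutual_information(1.0, Sn_corr))       # the noise link changes I(s; x)
print(0.5 * np.log(1 + 1.0 / 1.0))            # single-modality reference I(s; x1)
```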
Caglayan, Ozan. "Multimodal Machine Translation". Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1016/document.
Machine translation aims at automatically translating documents from one language to another without human intervention. With the advent of deep neural networks (DNN), neural approaches to machine translation started to dominate the field, reaching state-of-the-art performance in many languages. Neural machine translation (NMT) also revived interest in interlingual machine translation, since it naturally fits the task into an encoder-decoder framework which produces a translation by decoding a latent source representation. Combined with the architectural flexibility of DNNs, this framework paved the way for further research in multimodality, with the objective of augmenting the latent representations with other modalities such as vision or speech. This thesis focuses on a multimodal machine translation (MMT) framework that integrates a secondary visual modality to achieve better and visually grounded language understanding. I specifically worked with a dataset containing images and their translated descriptions, where visual context can be useful for word sense disambiguation, missing word imputation, or gender marking when translating from a language with gender-neutral nouns to one with a grammatical gender system, as is the case from English to French. I propose two main approaches to integrate the visual modality: (i) a multimodal attention mechanism that learns to take into account both sentence and convolutional visual representations, and (ii) a method that uses global visual feature vectors to prime the sentence encoders and the decoders. Through automatic and human evaluation conducted on multiple language pairs, the proposed approaches were demonstrated to be beneficial. Finally, I show that by systematically removing certain linguistic information from the input sentences, the true strength of both methods emerges, as they successfully impute missing nouns and colors and can even translate when parts of the source sentences are completely removed.
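The multimodal attention idea in (i) can be pictured with a small numpy sketch (a schematic re-creation under our own simplifying assumptions, not the thesis code): the decoder state attends separately over source-word states and convolutional feature-map locations, and the two context vectors are combined.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, keys):
    # Dot-product attention: weights over keys, then a weighted-sum context.
    weights = softmax(keys @ query / np.sqrt(len(query)))
    return weights @ keys

rng = np.random.default_rng(0)
d = 8
dec_state = rng.normal(size=d)            # current decoder hidden state
txt_states = rng.normal(size=(6, d))      # one state per source word
img_states = rng.normal(size=(49, d))     # e.g. a 7x7 conv feature map projected to d dims

ctx_txt = attend(dec_state, txt_states)   # textual context vector
ctx_img = attend(dec_state, img_states)   # visual context vector
ctx = np.concatenate([ctx_txt, ctx_img])  # fused context fed to the output layer
print(ctx.shape)                          # (16,)
```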
Choumane, Ali. "Traitement générique des références dans le cadre multimodal parole-image-tactile". Rennes 1, 2008. ftp://ftp.irisa.fr/techreports/theses/2008/choumane.pdf.
We are interested in multimodal human-computer communication systems that use the following modes: speech, gesture and vision. The user communicates with the system through oral utterances in natural language and/or through gesture. The user's request contains his or her goal and the designation of the objects (referents) required to realise that goal. The system should identify the designated objects precisely and unambiguously. In this context, we aim to improve the understanding process for multimodal requests. Hence, we propose a generic set of processings of the modalities, for fusion and for reference resolution. The main aspects of the realisation consist in modelling natural language processing in a speech environment, gesture processing and the visual context (use of visual salience), while taking into account the difficulties of the multimodal context: speech recognition errors, natural language ambiguity, gesture imprecision due to the user's performance, and designation ambiguity due to the perception of the displayed objects or to the display topology. To complete the interpretation of the user's request, we propose a method for fusing and verifying the results of the modality processings in order to find the objects designated by the user.
Sarrut, David. "Recalage multimodal et plate-forme d'imagerie médicale à accès distant". Lyon 2, 2000. http://theses.univ-lyon2.fr/documents/lyon2/2000/sarrut_d.
Cadène, Rémi. "Deep Multimodal Learning for Vision and Language Processing". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS277.
Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on big sets of data. Advances in both communities built the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants. In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between each region of the image. In the last part, we address issues related to biases in the modeling. We propose a learning strategy to reduce the language biases which are commonly present in visual question answering systems.
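The factorized bilinear fusion mentioned above can be sketched as follows (a generic low-rank bilinear pooling in the spirit of MLB/MUTAN-style models; dimensions and weights are illustrative, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
dq, dv, dz, rank = 16, 32, 24, 8  # question dim, visual dim, output dim, factor rank

# Low-rank factorization of a bilinear form: instead of a dq x dv x dz tensor,
# project both inputs to a rank-dim space, multiply elementwise, then project out.
Wq = rng.normal(size=(rank, dq))
Wv = rng.normal(size=(rank, dv))
Wz = rng.normal(size=(dz, rank))

def bilinear_fuse(q, v):
    return Wz @ ((Wq @ q) * (Wv @ v))  # elementwise product realizes the bilinear interaction

q = rng.normal(size=dq)  # question embedding
v = rng.normal(size=dv)  # image-region embedding
print(bilinear_fuse(q, v).shape)  # (24,)
```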
Chen, Jianan. "Deep Learning Based Multimodal Retrieval". Electronic Thesis or Diss., Rennes, INSA, 2023. http://www.theses.fr/2023ISAR0019.
Multimodal tasks play a crucial role in the progression towards achieving general artificial intelligence (AI). The primary goal of multimodal retrieval is to employ machine learning algorithms to extract relevant semantic information, bridging the gap between different modalities such as visual images, linguistic text, and other data sources. It is worth noting that the information entropy associated with heterogeneous data carrying the same high-level semantics varies significantly, posing a significant challenge for multimodal models. Deep learning-based multimodal network models provide an effective solution to tackle the difficulties arising from these substantial differences in information entropy. They exhibit impressive accuracy and stability in large-scale cross-modal information matching tasks, such as image-text retrieval. Furthermore, they demonstrate strong transfer learning capabilities, enabling a well-trained model from one multimodal task to be fine-tuned and applied to a new multimodal task, even in scenarios involving few-shot or zero-shot learning. In our research, we develop a novel generative multimodal multi-view database specifically designed for the multimodal referential segmentation task. Additionally, we establish a state-of-the-art (SOTA) benchmark and a multi-view metric for referring expression segmentation models in the multimodal domain. The results of our comparative experiments are presented visually, providing clear and comprehensive insights.
Delecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents". Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634/document.
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents; however, the processings used are most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of captioned images. In the first part of this work, we are interested in the analysis of audiovisual documents from news television channels. We propose an approach that relies in particular on deep neural networks for the representation and fusion of modalities. In the second part, we are interested in approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of captioned images.
Ranisavljević, Elisabeth. "Cloud computing appliqué au traitement multimodal d’images in situ pour l’analyse des dynamiques environnementales". Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20128/document.
Analyzing landscapes, their dynamics and environmental evolutions requires regular data from the sites, specifically for glacier mass balance in Spitsbergen and high mountain areas. Due to poor weather conditions, including the heavy cloud cover common at polar latitudes, and because of its cost, daily satellite imaging is not always accessible. Besides, fast events like floods or snow cover are missed by satellite-based studies, whose sampling rate is too low to observe them. We complement satellite imagery with a set of ground-based autonomous automated digital cameras which take three pictures a day. These pictures form a huge database, and each picture needs several processing steps to extract the information (geometric modifications, atmospheric disturbances, classification, etc.); only a computing infrastructure can store and manage all this information. Cloud computing, which has become more accessible in the last few years, offers IT resources (computing power, storage, applications, etc.) as services. The storage of this huge volume of geographical data could, in itself, be a reason to use cloud computing, but in addition to its storage space, the cloud offers easy access, a scalable architecture and modularity in the available services. As part of the analysis of in situ images, cloud computing makes it possible to set up an automated tool to process all the data despite the variety of disturbances and the data volume. By decomposing image processing into several tasks, implemented as web services, the composition of these services allows us to adapt the processing to the conditions of each piece of data.
Atif, Jamal. "Recalage non-rigide multimodal des images radiologiques par information mutuelle quadratique normalisée". Paris 11, 2004. http://www.theses.fr/2004PA112337.
Pełny tekst źródłaDelecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents". Electronic Thesis or Diss., Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634.
Pełny tekst źródłaThe human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents, however the processings used are most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents through two studies: one on multimodal fusion for the speaker role recognition in television broadcasts, the other on the complementarity of modalities for a task of linguistic analysis on corpora of images with captions. In the first part of this study, we interested in audiovisual documents analysis from news television channels. We propose an approach that uses in particular deep neural networks for representation and fusion of modalities. In the second part of this thesis, we are interested in approaches allowing to use several sources of multimodal information for a monomodal task of natural language processing in order to study their complementarity. We propose a complete system of correction of prepositional attachments using visual information, trained on a multimodal corpus of images with captions
Fares, Mireille. "Multimodal Expressive Gesturing With Style". Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS017.
The generation of expressive gestures allows Embodied Conversational Agents (ECA) to articulate speech intent and content in a human-like fashion. The central theme of the manuscript is to leverage and control the ECAs' behavioral expressivity by modelling the complex multimodal behavior that humans employ during communication. The driving forces of the thesis are twofold: (1) to exploit speech prosody, visual prosody and language with the aim of synthesizing expressive and human-like behaviors for ECAs; (2) to control the style of the synthesized gestures such that we can generate them with the style of any speaker. With these motivations in mind, we first propose a semantically aware and speech-driven facial and head gesture synthesis model trained on the TEDx Corpus which we collected. Then we propose ZS-MSTM 1.0, an approach to synthesize stylized upper-body gestures, driven by the content of a source speaker's speech and corresponding to the style of any target speaker, seen or unseen by our model. It is trained on the PATS Corpus, which includes multimodal data of speakers with different behavioral styles. ZS-MSTM 1.0 is not limited to PATS speakers, and can generate gestures in the style of any newly coming speaker without further training or fine-tuning, rendering our approach zero-shot. Behavioral style is modelled based on multimodal speaker data - language, body gestures, and speech - and independently from the speaker's identity ("ID"). We additionally propose ZS-MSTM 2.0 to generate stylized facial gestures in addition to the upper-body gestures. We train ZS-MSTM 2.0 on the PATS Corpus, which we extended to include dialog acts and 2D facial landmarks.
Aron, Michaël. "Acquisition et modélisation de données articulatoires dans un contexte multimodal". Thesis, Nancy 1, 2009. http://www.theses.fr/2009NAN10097/document.
There is no single technique that allows all relevant behaviour of the speech articulators (lips, tongue, palate...) to be acquired both spatially and temporally. This thesis therefore investigates the fusion of multimodal articulatory data. A framework is described for automatically acquiring and fusing an extensive database of articulatory data, which includes: 2D ultrasound (US) data to recover the dynamics of the tongue, stereovision data to recover the 3D dynamics of the lips, electromagnetic sensors that provide the 3D position of points on the face and the tongue, and 3D Magnetic Resonance Imaging (MRI) that depicts the vocal tract for various sustained articulations. We investigate the problems of temporal synchronization and spatial registration between all these modalities, as well as the extraction of articulator shapes from the data (tongue tracking in US images). We evaluate the uncertainty of our system by quantifying the spatial and temporal inaccuracies of its components, both individually and in combination. Finally, the fused data are evaluated on an existing articulatory model to assess their quality for an application in speech production.
Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing, as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy rate of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge here lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced, which incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets, such as MSVD-QA and MSRVTT-QA, validate the proposed methodology, with an average accuracy of 45% and 42%, respectively, which represents notable improvements over existing approaches. The third contribution of this research addresses the multimodal video captioning problem, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions, considering the temporal dynamics of video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The utilization of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The research validates the proposed methodology through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291 and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques to automate television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
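Auxiliary contrastive objectives of the kind mentioned in the third contribution are typically InfoNCE-style losses; the sketch below (a generic illustration of ours, not the thesis implementation) contrasts matched video-caption embedding pairs against in-batch negatives:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    # L2-normalize, then compare every video to every caption in the batch.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    # The matched pair sits on the diagonal; treat it as the correct "class".
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(2)
loss = info_nce(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
print(loss)
```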
Kruk, Dominika. "Multimodal Imaging of the heart muscle - Analysis and visualization to aided diagnosis". Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCK070.
The heart plays a vital role in the functioning of the human body: it pumps blood throughout the body, supplying oxygen and nutrients to the tissues and removing carbon dioxide and other wastes. Cardiovascular diseases are the first cause of death worldwide. Heart diseases are mainly related to a process called atherosclerosis, which makes blood flow through the arteries harder and can finally stop it, leading to heart attack and stroke. Early and accurate diagnosis of cardiovascular diseases plays an important role in improving the lives of people afflicted by heart disease. Medical imaging, the process of collecting information about a place of interest in the body using a predefined characteristic property displayed in the form of an image, is widely used in the diagnosis and monitoring of cardiovascular diseases. Imaging techniques allow clinicians and scientists to see inside the body and provide a wealth of information. Recent advances in medical imaging, with meaningful contributions from many fields of science such as medical physics, chemistry, electrical and computer engineering, and computer science, have had a large impact on diagnostic radiology. The development of engineering and computer science has made it possible to obtain high-resolution multidimensional images of the place of interest in the body. Such images provide complex information for analyzing the structure and function of organs for computer-aided diagnosis, for more accurate diagnosis, or for developing and directing new therapeutic strategies. The aim of this thesis is to develop a new method to obtain more complete and accurate information about myocardial disease using computer science and image processing methods. The main objective is to develop a complete method for registering cardiac Positron Emission Tomography (PET) and Magnetic Resonance Imaging (MRI) images. The main difficulty of PET-MRI registration lies in the differences between these two modalities. To decrease these differences, segmentation methods were applied to the PET and MRI images; segmentation helps to extract the myocardium from the background and to focus the registration on the myocardium alone, without the influence of surrounding structures.
Meseguer, Brocal Gabriel. "Multimodal analysis : informed content estimation and audio source separation". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS111.
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the audio and lyrics interaction for targeting source separation and informed content estimation. Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Our understanding learns useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods. To develop our multimodal analysis, we first need to address the lack of data containing singing voice with aligned lyrics, which is essential to develop our ideas. We therefore investigate how to create such a dataset automatically, leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions, and we constantly face the classic "chicken or the egg" problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We propose to use the teacher-student paradigm to develop a method where dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. In this process, non-expert karaoke time-aligned lyrics and notes describe the lyrics as a sequence of time-aligned notes with their associated textual information. We then link each annotation to the correct audio and globally align the annotations to it. For this purpose, we use the normalized cross-correlation between the voice annotation sequence and a singing voice probability vector obtained automatically using a deep convolutional neural network. Using the collected data we progressively improve that model: every time we have an improved version, we can in turn correct and enhance the data.
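The global alignment step can be illustrated with a short numpy sketch (a simplified re-creation under our own assumptions, not the thesis code): a binary voice-activity sequence derived from the karaoke annotations is slid over the predicted singing-voice probability curve, and the offset maximizing the normalized cross-correlation is kept.

```python
import numpy as np

def best_offset(annotation, probability):
    # Normalized cross-correlation between a binary annotation sequence and a
    # (longer) singing-voice probability vector; returns the best-matching lag.
    a = (annotation - annotation.mean()) / annotation.std()
    best, best_lag = -np.inf, 0
    for lag in range(len(probability) - len(annotation) + 1):
        w = probability[lag:lag + len(annotation)]
        w = (w - w.mean()) / (w.std() + 1e-9)
        score = np.mean(a * w)
        if score > best:
            best, best_lag = score, lag
    return best_lag

prob = np.concatenate([np.zeros(50), np.ones(30), np.zeros(20)]) + 0.05
ann = np.concatenate([np.ones(30), np.zeros(10)])  # voice active, then silence
print(best_offset(ann, prob))  # 50: the annotation should start at frame 50
```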
Rabhi, Sara. "Optimized deep learning-based multimodal method for irregular medical timestamped data". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS003.
The wide adoption of Electronic Health Records in hospital information systems has led to large databases grouping various types of data, such as textual notes, longitudinal medical events, and tabular patient information. However, records are only filled in during consultations or hospital stays, which depend on the patient's state and local habits. A system that can leverage the different types of data collected at different time scales is critical for reconstructing the patient's health trajectory, analyzing his or her history, and consequently delivering better adapted care. This thesis addresses two main challenges of medical data processing: learning to represent sequences of medical observations with irregular elapsed time between consecutive visits, and optimizing the extraction of medical events from clinical notes. Our main goal is to design a multimodal representation of the patient's health trajectory to solve clinical prediction problems. Our first work built a framework for modeling irregular medical time series in order to evaluate the importance of considering the time gaps between medical episodes when representing a patient's health trajectory. To that end, we conducted a comparative study of sequential neural networks and irregular time representation techniques. The clinical objective was to predict retinopathy complications for type 1 diabetes patients in the French database CaRéDIAB (Champagne Ardenne Réseau Diabète) using their history of HbA1c measurements. The study showed that the attention-based model combined with a soft one-hot representation of time gaps led to an AUROC score of 88.65% (specificity of 85.56%, sensitivity of 83.33%), an improvement of 4.3% compared to the LSTM-based model. Motivated by these results, we extended our framework to shorter multivariate time series and predicted in-hospital mortality for critical care patients of the MIMIC-III dataset. The proposed architecture, HiTT, improved the AUC score by 5% over the Transformer baseline. In a second step, we focused on extracting relevant medical information from clinical notes to enrich the patients' health trajectories. Transformer-based architectures have shown encouraging results in medical information extraction tasks, but these complex models require a large annotated corpus, which is hard to obtain in the medical field as it necessitates access to private patient data and highly expert annotators. To reduce annotation cost, we explored active learning strategies, which have been shown to be effective in tasks such as text classification, information extraction, and speech recognition. In addition to existing methods, we defined a Hybrid Weighted Uncertainty Sampling active learning strategy that takes advantage of the contextual embeddings learned by the Transformer-based approach to measure the representativeness of samples. A simulated study using the i2b2-2010 challenge dataset showed that our proposed metric reduces the annotation cost by 70% to achieve the same score as passive learning. Lastly, we combined multivariate medical time series and medical concepts extracted from clinical notes of the MIMIC-III database to train a multimodal transformer-based architecture. Test results on the in-hospital mortality task showed an improvement of 5.3% when considering additional text data. This thesis thus contributes to patient health trajectory representation by alleviating the burden of episodic medical records and the manual annotation of free-text notes.
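The "soft one-hot" time-gap representation can be pictured as follows (a plausible reading of the idea; the bin edges and Gaussian kernel are our own illustrative choices, not those of the thesis): instead of assigning an elapsed time to a single bin, each gap is encoded as a smooth distribution over neighboring bins.

```python
import numpy as np

def soft_one_hot(gap_days, bin_edges, sigma=0.5):
    # Encode an elapsed time as a smooth distribution over time bins,
    # so that nearby gaps (e.g. 29 vs 31 days) get similar codes.
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    d = (np.log1p(gap_days) - np.log1p(centers)) ** 2  # distance in log-time
    w = np.exp(-d / (2 * sigma ** 2))
    return w / w.sum()

edges = np.array([0, 7, 30, 90, 180, 365, 730], dtype=float)  # days
print(np.round(soft_one_hot(31, edges), 3))   # mass spread over the 7-90 day bins
print(np.round(soft_one_hot(400, edges), 3))  # mass spread over the 180-730 day bins
```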
Pouteau, Xavier. "Dialogue de commande multimodal en milieu opérationnel : une communication naturelle pour l'utilisateur ?" Nancy 1, 1995. http://www.theses.fr/1995NAN10419.
Ma, Ta-Yu. "Modèle dynamique de transport basé sur les activités". Marne-la-vallée, ENPC, 2007. https://pastel.archives-ouvertes.fr/pastel-00003309.
Znaidia, Amel. "Handling Imperfections for Multimodal Image Annotation". Phd thesis, Ecole Centrale Paris, 2014. http://tel.archives-ouvertes.fr/tel-01012009.
Zablocki, Éloi. "Multimodal machine learning : complementarity of textual and visual contexts". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS409.
Research looking at the interaction between language and vision, despite a growing interest, is relatively underexplored. Beyond trivial differences between texts and images, these two modalities have non-overlapping semantics. On the one hand, language can express high-level semantics about the world, but it is biased in the sense that a large portion of its content is implicit (common-sense or implicit knowledge). On the other hand, images are aggregates of lower-level information, but they can depict a more direct view of real-world statistics and can be used to ground the meaning of objects. In this thesis, we exploit connections and leverage complementarity between language and vision. First, natural language understanding capacities can be augmented with the help of the visual modality, as language is known to be grounded in the visual world. In particular, representing language semantics is a long-standing problem for the natural language processing community, and leveraging visual information is crucial to further improve traditional approaches towards that goal. We show that semantic linguistic representations can be enriched by visual information, and we focus especially on visual contexts and the spatial organization of scenes. We present two models to learn grounded word and sentence semantic representations, respectively, with the help of images. Conversely, integrating language with vision brings the possibility of expanding the horizons and tasks of the vision community. Assuming that language contains visual information about objects, and that this can be captured within linguistic semantic representations, we focus on the zero-shot object recognition task, which consists in recognizing objects that have never been seen, thanks to linguistic knowledge acquired about the objects beforehand. In particular, we argue that linguistic representations contain visual information not only about the visual appearance of objects but also about their typical visual surroundings and visual occurrence frequencies. We thus present a model for zero-shot recognition that leverages the visual context of an object and its visual occurrence likelihood, in addition to the region of interest as done in traditional approaches. Finally, we present prospective research directions to further exploit connections between language and images and to better understand the semantic gap between the two modalities.
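A minimal sketch of this kind of zero-shot scoring (a schematic combination of our own; the weights, embeddings and exact score form are illustrative, not the thesis model) might combine region-label similarity, context-label similarity, and a class frequency prior:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_score(region_feat, context_feat, word_emb, log_prior, lam=0.5, mu=0.1):
    # Score an unseen class by how well the region of interest AND its visual
    # surroundings match the class word embedding, plus an occurrence prior.
    return cosine(region_feat, word_emb) + lam * cosine(context_feat, word_emb) + mu * log_prior

rng = np.random.default_rng(3)
region, context = rng.normal(size=64), rng.normal(size=64)
classes = {c: (rng.normal(size=64), np.log(p)) for c, p in [("zebra", 0.01), ("car", 0.2)]}
scores = {c: zero_shot_score(region, context, w, lp) for c, (w, lp) in classes.items()}
print(max(scores, key=scores.get))  # predicted unseen class
```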
Mitra, Jhimli. "Multimodal Image Registration applied to Magnetic Resonance and Ultrasound Prostatic Images". Phd thesis, Université de Bourgogne, 2012. http://tel.archives-ouvertes.fr/tel-00786032.
Bonazza, Pierre. "Système de sécurité biométrique multimodal par imagerie, dédié au contrôle d’accès". Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCK017/document.
The research in this thesis consists in developing efficient and lightweight solutions to the problem of securing sensitive products. Motivated by a collaboration with various stakeholders within the Nuc-Track project, the development of a biometric security system, possibly multimodal, leads to a study of various biometric features such as the face, fingerprints and the vascular network. This thesis focuses on matching algorithms to architectures, with the aim of minimizing the storage size of the learned models while guaranteeing optimal performance. This allows the models to be stored on a personal medium, thus respecting privacy standards.
Toulouse, Tom. "Estimation par stéréovision multimodale de caractéristiques géométriques d’un feu de végétation en propagation". Thesis, Corte, 2015. http://www.theses.fr/2015CORT0009/document.
This thesis presents the measurement of geometrical characteristics of spreading vegetation fires with multimodal stereovision systems. Image processing and 3D registration are used to obtain a three-dimensional model of the fire at each image acquisition instant, and then to compute fire front characteristics such as position, rate of spread, height, width, inclination, surface and volume. The first important contribution of this thesis is fire pixel detection. A benchmark of fire pixel detection algorithms from the literature, and of those developed in this thesis, was carried out on a database of 500 visible-spectrum vegetation fire images characterized according to the fire properties in the image (color, smoke, luminosity). Five fire pixel detection algorithms based on the fusion of data from visible and near-infrared spectrum images were also developed and tested on another database of 100 multimodal images; this image fusion is also used to optimize the number of matching points between the multimodal stereo images. The second important contribution of this thesis is the registration method for the 3D fire points obtained with the stereovision systems. It uses information collected from a housing containing a GPS and an IMU card positioned on each stereovision system. Based on this registration, a method was developed to extract the geometrical characteristics while the fire is spreading. The geometrical characteristics estimation device was evaluated on a car of known dimensions, and the results obtained confirm the good accuracy of the device. Results obtained on vegetation fires are also presented.
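Many of the benchmarked detectors are simple color-rule classifiers; a minimal sketch of that family (a classic RGB rule set of our own choosing, not one of the five algorithms proposed in the thesis) looks like this:

```python
import numpy as np

def fire_pixel_mask(rgb):
    # Classic color-rule detector: fire pixels are red-dominant (R > G > B) and bright.
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    return (r > 190) & (r > g) & (g > b)

image = np.random.default_rng(4).integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
mask = fire_pixel_mask(image)
print(mask.mean())  # fraction of pixels classified as fire
```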
Aderghal, Karim. "Classification of multimodal MRI images using Deep Learning : Application to the diagnosis of Alzheimer’s disease". Thesis, Bordeaux, 2021. http://www.theses.fr/2021BORD0045.
In this thesis, we are interested in the automatic classification of brain MRI images to diagnose Alzheimer's disease (AD). We aim to build intelligent models that provide the clinician with decisions about a patient's disease state based on visual features extracted from MRI images. The goal is to classify patients (subjects) into three main categories: healthy subjects (NC), subjects with mild cognitive impairment (MCI), and subjects with Alzheimer's disease (AD). We use deep learning methods, specifically convolutional neural networks (CNN) based on visual biomarkers from multimodal MRI images (structural MRI and DTI), to detect structural changes in the hippocampal region of the limbic cortex. We propose an approach called "2-D+e" applied to our region of interest (ROI), the hippocampus. This approach extracts 2D slices from three planes (sagittal, coronal, and axial) of the region while preserving the spatial dependencies between adjacent slices along each dimension. We present a complete study of different artificial data augmentation methods and different data balancing approaches to analyze the impact of these conditions on our models during the training phase. We propose methods for combining information from different sources (projections/modalities), including two fusion strategies (early fusion and late fusion). Finally, we present transfer learning schemes through three frameworks: (i) a cross-modal scheme (using sMRI and DTI), (ii) a cross-domain scheme that involves external data (MNIST), and (iii) a hybrid scheme combining (i) and (ii). Our proposed methods are suitable for shallow CNNs on multimodal MRI images, and give encouraging results even when the model is trained on small datasets, which is often the case in medical image analysis.
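The "2-D+e" idea, as we read it, amounts to taking each central slice of the ROI together with its immediate neighbors along each axis; a small numpy sketch (our illustration, with an invented 3D hippocampus crop, not the thesis pipeline) follows:

```python
import numpy as np

def two_d_plus_e(roi, e=1):
    # For each anatomical plane, take the central slice plus its e neighbors,
    # stacking them as channels so adjacent-slice dependencies are preserved.
    views = {}
    for axis, name in enumerate(("sagittal", "coronal", "axial")):
        mid = roi.shape[axis] // 2
        idx = range(mid - e, mid + e + 1)
        views[name] = np.stack([np.take(roi, i, axis=axis) for i in idx], axis=0)
    return views

roi = np.random.default_rng(5).normal(size=(28, 28, 28))  # hypothetical hippocampus crop
for name, v in two_d_plus_e(roi).items():
    print(name, v.shape)  # (3, 28, 28) per plane: 2e+1 slices as channels
```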
Fayech, Besma. "Régulation des réseaux de transport multimodal : systèmes multi-agents et algorithmes évolutionnistes". Lille 1, 2003. https://pepite-depot.univ-lille.fr/LIBRE/Th_Num/2003/50376-2003-323.pdf.
Pełny tekst źródłaDe, goussencourt Timothée. "Système multimodal de prévisualisation “on set” pour le cinéma". Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAT106/document.
Previz on-set is a preview step that takes place directly during the shooting phase of a film with special effects. The aim of previz on-set is to show the film director an assembled view of the final shot in real time. The work presented in this thesis focuses on a specific step of the previz: the compositing. This step consists in mixing multiple images to compose a single, coherent one; in our case, it mixes computer graphics with the image from the main camera. The objective of this thesis is to propose a system for automatic adjustment of the compositing. The method requires measuring the geometry of the filmed scene, so a depth sensor is added to the main camera. The data is sent to a computer that runs an algorithm to merge the data from the depth sensor and the main camera. Through a hardware demonstrator, we formalized an integrated solution in a video game engine. The experiments give encouraging results for compositing in real time. Improved results were observed with the introduction of a joint segmentation method using depth and color information. The main strength of this work lies in the development of a demonstrator that allowed us to obtain effective algorithms in the field of previz on-set.
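The core compositing rule can be sketched in a few lines (a simple per-pixel depth test of our own, standing in for the engine-integrated pipeline described in the thesis):

```python
import numpy as np

def composite(camera_rgb, camera_depth, cg_rgb, cg_depth):
    # Per-pixel depth test: keep the CG pixel wherever the virtual object
    # is closer to the camera than the real scene, else keep the camera pixel.
    cg_in_front = (cg_depth < camera_depth)[..., None]
    return np.where(cg_in_front, cg_rgb, camera_rgb)

h, w = 4, 4
camera_rgb = np.full((h, w, 3), 100, dtype=np.uint8)
cg_rgb = np.full((h, w, 3), 200, dtype=np.uint8)
camera_depth = np.full((h, w), 3.0)  # real scene at 3 m
cg_depth = np.full((h, w), 5.0)
cg_depth[:2] = 1.0                   # virtual object occludes the top half
print(composite(camera_rgb, camera_depth, cg_rgb, cg_depth)[:, :, 0])
```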
Alameda-Pineda, Xavier. "Egocentric Audio-Visual Scene Analysis : a machine learning and signal processing approach". Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM024/document.
Over the past two decades, industry has developed several commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc.). Others, such as Kinect, include depth sensors and/or small microphone arrays, and some mobile phones are equipped with a stereo camera pair. At the same time, many research-oriented systems became available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors are close to each other; therefore, they are not able to capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "egocentric audio-visual scene analysis". This thesis contributes to this field in several aspects. Firstly, by providing a publicly available data set targeting applications such as action/gesture recognition, speaker localization, tracking and diarisation, sound source localization, dialogue modelling, etc. This work has been used later on both inside and outside the thesis. We also investigated the problem of audio-visual event detection. We showed how the trust in one of the modalities (the visual one, to be precise) can be modeled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). Afterwards we modified the approach to target audio-visual speaker detection, yielding an on-line method working in the humanoid robot NAO. In parallel to the work on audio-visual speaker detection, we developed a new approach for audio-visual command recognition. We explored different features and classifiers and confirmed that the use of audio-visual data increases performance when compared to audio-only and video-only classifiers. Later, we sought the best method using tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt to and learn new commands from the user, and such systems need to be operational with a few examples for general public usage. Finally, we contributed to the field of sound source localization, in the particular case of non-coplanar microphone arrays. This is interesting because the geometry of the array can then be arbitrary, which opens the door to dynamic microphone arrays that would adapt their geometry to fit particular tasks, and because the design of commercial systems may be subject to constraints for which circular or linear arrays are not suited.
Guislain, Maximilien. "Traitement joint de nuage de points et d'images pour l'analyse et la visualisation des formes 3D". Thesis, Lyon, 2017. http://www.theses.fr/2017LYSE1219/document.
Recent years have seen a rapid development of city digitization technologies. Acquisition campaigns covering entire cities are now performed using LiDAR (Light Detection And Ranging) scanners embedded aboard mobile vehicles. These campaigns yield point clouds composed of millions of points representing the buildings and the streets, and may also include a set of images of the scene. The subject developed here is the improvement of the point cloud using the information contained in the camera images, and this thesis introduces several contributions to this joint improvement. The position and orientation of the acquired images are usually estimated using devices embedded with the LiDAR scanner, but this information is inaccurate. To obtain a precise registration of an image on a point cloud, we propose a two-step algorithm which uses both mutual information and histograms of oriented gradients. The proposed method yields an accurate camera pose even when the initial estimates are far from the real position and orientation. Once the images have been correctly registered, it is possible to use them to color each point of the cloud while exploiting the variability of the points of view. This is done by minimizing an energy that considers the different colors associated with a point and the potential colors of its neighbors. Illumination changes can also alter the color assigned to a point; notably, this color can be affected by cast shadows. Since cast shadows move with the sun position, it is necessary to detect and correct them. We propose a new method that analyzes the joint variation of the reflectance value obtained by the LiDAR and the color of the points. By detecting enough interfaces between shadow and light, we can characterize the luminance of the scene and remove the cast shadows. The last point developed in this thesis is the densification of a point cloud. Indeed, the local density of a point cloud varies and is sometimes insufficient in certain areas. We propose a directly applicable approach to increase the density of a point cloud using multiple images.
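Mutual information between the camera image and a rendering of the point cloud is the classic similarity measure behind this kind of 2D-3D registration; the compact sketch below (a generic textbook formulation on synthetic data, not the thesis implementation) estimates it from a joint histogram:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    # Estimate MI from the joint intensity histogram of two aligned images:
    # MI = sum p(a,b) * log( p(a,b) / (p(a) p(b)) )
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])).sum())

rng = np.random.default_rng(6)
camera = rng.random((100, 100))
rendered = camera + 0.1 * rng.standard_normal((100, 100))  # roughly aligned rendering
shifted = np.roll(rendered, 15, axis=1)                    # a misregistered version
print(mutual_information(camera, rendered) > mutual_information(camera, shifted))  # True
```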
Garcia, Geoffrey. "Une approche logicielle du traitement de la dyslexie : étude de modèles et applications". Thesis, Clermont-Ferrand 2, 2015. http://www.theses.fr/2015CLF22634/document.
Neuropsychological disorders are widespread and pose real public health problems. In particular, in our modern society where written communication is ubiquitous, dyslexia can be extremely disabling. Nevertheless, the diagnosis and remediation of this pathology remain tedious and lack standardization, which seems inherent in the clinical characterization of dyslexia by exclusion, in the multitude of different practitioners involved in its treatment, and in the lack of objectivity of some existing methods. In this respect, we decided to investigate the possibilities offered by modern computing to overcome these barriers: we assumed that the democratization of computer systems and their computing power could make them an ideal tool to alleviate the difficulties encountered in the treatment of dyslexia. This research led us to study software as well as hardware techniques that can lead to the development of an inexpensive and scalable system able to support a beneficial and progressive change of practices in this pathology's field. With this project we place ourselves firmly within an innovative stream serving the quality of care and the aid provided to people with disabilities. Our work identified different areas of improvement that the use of computers enables; each of these areas could then be the subject of extensive research, modeling and prototype development. We also considered the methodology for designing this kind of system as a whole. In particular, our reflections and achievements allowed us to define a software framework suitable for implementing a software platform that we called PAMMA. This platform should theoretically give access to all the tools required for the flexible and efficient development of medical applications integrating business processes. It is thus expected that this system will allow the development of applications for the care of dyslexic patients, leading to faster and more accurate diagnoses and more appropriate and effective remediation. From our innovation efforts emerge encouraging perspectives; however, such initiatives can only succeed within multidisciplinary collaborations with extensive functional, technical and financial means. Creating such a consortium seems to be the next step required to secure the funding necessary for a first functional prototype of PAMMA and its first applications. Clinical studies may then be conducted to conclusively prove the effectiveness of such an approach for treating dyslexia and possibly other neuropsychological disorders.
Cantisani, Giorgia. "Neuro-steered music source separation". Electronic Thesis or Diss., Institut polytechnique de Paris, 2021. http://www.theses.fr/2021IPPAT038.
In this PhD thesis, we address the challenge of integrating Brain-Computer Interfaces (BCI) and music technologies in the specific application of music source separation, the task of isolating the individual sound sources that are mixed in the audio recording of a musical piece. This problem has been investigated for decades, but never considering BCI as a possible way to guide and inform separation systems. Specifically, we explored how the neural activity characterized by electroencephalographic signals (EEG) reflects information about the attended instrument, and how we can use it to inform a source separation system. First, we studied the problem of EEG-based auditory attention decoding of a target instrument in polyphonic music, showing that the EEG tracks musically relevant features which are highly correlated with the time-frequency representation of the attended source and only weakly correlated with the unattended one. Second, we leveraged this "contrast" to inform an unsupervised source separation model based on a novel non-negative matrix factorisation (NMF) variant, named contrastive-NMF (C-NMF), and automatically separate the attended source. Unsupervised NMF is a powerful approach in applications with no or limited training data, as is the case when neural recordings are involved: the available music-related EEG datasets are still costly and time-consuming to acquire, precluding fully supervised deep learning approaches. Thus, in the last part of the thesis, we explored alternative learning strategies to alleviate this problem. Specifically, we propose to adapt a state-of-the-art music source separation model to a specific mixture using the time activations of the sources derived from the user's neural activity. This paradigm can be referred to as one-shot adaptation, as it acts on the target song instance only. We conducted an extensive evaluation of the proposed system on the MAD-EEG dataset, which was specifically assembled for this study, obtaining encouraging results, especially in difficult cases where non-informed models struggle.
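For readers unfamiliar with NMF-based separation, the sketch below shows the standard multiplicative-update backbone that variants like C-NMF build on (plain Euclidean NMF on a magnitude spectrogram; the EEG-derived contrastive term of the thesis is not reproduced here):

```python
import numpy as np

def nmf(V, k=4, n_iter=200, eps=1e-9):
    # Factorize a nonnegative spectrogram V (freq x time) as W @ H,
    # with Lee-Seung multiplicative updates for the Euclidean cost.
    rng = np.random.default_rng(7)
    F, T = V.shape
    W = rng.random((F, k))  # spectral templates
    H = rng.random((k, T))  # time activations (what the EEG would inform)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(8).standard_normal((257, 100)))
W, H = nmf(V)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```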
Vukotic, Verdran. "Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data". Thesis, Rennes, INSA, 2017. http://www.theses.fr/2017ISAR0015/document.
In this dissertation, the thesis that deep neural networks are suited for the analysis of visual, textual, and fused visual and textual content is discussed. This work evaluates the ability of deep neural networks to learn automatic multimodal representations in either unsupervised or supervised manners, and brings the following main contributions:
1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared for this task with the aim of modeling both the input context and output label dependencies.
2) Action prediction from single images: we propose an architecture that allows us to predict human actions from a single image. The architecture is evaluated on videos, by utilizing solely one frame as input.
3) Bidirectional multimodal encoders: the main contribution of this thesis consists of a neural architecture that translates from one modality to the other and conversely, and offers an improved multimodal representation space where the initially disjoint representations can be translated and fused. This enables improved multimodal fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks within the task of video hyperlinking, where it defined the current state of the art.
4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations; in addition to providing multimodal representations, generative adversarial networks permit visualizing the learned model directly in the image domain.
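The bidirectional encoder of contribution 3 can be pictured with an untrained toy forward pass (our own schematic with made-up dimensions; in the actual approach the cross-translation weights are learned, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(9)
dv, dt, dh = 64, 48, 32

# Two encoders that cross-translate, video -> text and text -> video,
# sharing a central hidden layer so both modalities land in one space.
Wv_in, Wt_in = rng.normal(size=(dh, dv)), rng.normal(size=(dh, dt))

def embed(video=None, text=None):
    # Multimodal embedding: average the shared-layer codes of the available modalities.
    codes = []
    if video is not None:
        codes.append(np.tanh(Wv_in @ video))  # would be trained to reconstruct the text side
    if text is not None:
        codes.append(np.tanh(Wt_in @ text))   # would be trained to reconstruct the video side
    return np.mean(codes, axis=0)

v, t = rng.normal(size=dv), rng.normal(size=dt)
print(embed(video=v, text=t).shape)  # (32,) joint representation
```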
Benaissa, Ezzeddine. "Plate-forme intelligente pour la chaîne logistique : approche basée sur un système multi-agents et les services web sémantiques, cas du transport multimodal des marchandises". Le Havre, 2013. http://www.theses.fr/2013LEHA0002.
In this thesis we have proposed a new approach for the development of a distributed architecture based on a Multi-Agent System (MAS) coupled with Semantic Web Services (SWS), in order to assist collaborative decision-making in the context of the extended enterprise. To validate our approach, the multimodal goods transport chain was taken as the application case. The result of this research work takes the form of an intelligent platform called i-SEET (Intelligent System for the Extended Enterprise).
Maman, Lucien. "Automated analysis of cohesion in small groups interactions". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT030.
Over the last decade, a new multidisciplinary research domain named Social Signal Processing (SSP) has emerged. It aims at enabling machines to sense, recognize, and display human social signals. One of the challenging tasks addressed by SSP is automated group interaction analysis. Recently, particular emphasis has been given to the automated study of emergent states, as they play an important role in group dynamics; these are social processes that develop throughout group members' interactions. In this thesis, we address the automated analysis of cohesion in small group interactions. Cohesion is a multidimensional affective emergent state that can be defined as a dynamic process reflected by the tendency of a group to stick together to pursue goals and/or affective needs. Despite the rich literature available on cohesion from a Social Sciences perspective, its automated analysis is still in its infancy. Building on Social Sciences insights, this thesis develops computational models of cohesion along four research axes, leveraging Machine Learning and Deep Learning techniques. Computational models of cohesion should indeed account for the temporal nature of cohesion and the multidimensionality of this group process, consider how to model cohesion from both individual and group perspectives, integrate the relationships between its dimensions and their development over time, and take heed of the relationships between cohesion and other group processes. In addition, facing a lack of publicly available data, this thesis contributed to the collection of a multimodal dataset specifically designed for studying group cohesion and for explicitly controlling its variations over time. Such a dataset enables, among other perspectives, the further development of computational models integrating the cohesion perceived by group members and/or from external points of view. Our results show the relevance of leveraging Social Sciences insights to develop new computational models of cohesion and confirm the benefits of exploring each of the four research axes.
Malik, Muhammad Usman. "Learning multimodal interaction models in mixed societies A novel focus encoding scheme for addressee detection in multiparty interaction using machine learning algorithms". Thesis, Normandie, 2020. http://www.theses.fr/2020NORMIR18.
Pełny tekst źródłaHuman-Agent Interaction and Machine Learning are two different research domains. Human-agent interaction refers to the techniques and concepts involved in developing smart agents, such as robots or virtual agents, capable of seamless interaction with humans to achieve a common goal. Machine learning, on the other hand, exploits statistical algorithms to learn data patterns. The proposed research work lies at the crossroads of these two research areas. Human interactions involve multiple modalities, which can be verbal, such as speech and text, as well as non-verbal, i.e. facial expressions, gaze, head and hand gestures, etc. To mimic real-time human-human interaction within human-agent interaction, multiple interaction modalities can be exploited. With the availability of multimodal human-human and human-agent interaction corpora, machine learning techniques can be used to develop various interrelated human-agent interaction models. In this regard, our research work proposes original models for addressee detection, turn change and next speaker prediction, and finally visual focus of attention behaviour generation, in multiparty interaction. Our addressee detection model predicts the addressee of an utterance during interactions involving more than two participants. The addressee detection problem has been tackled as a supervised multiclass machine learning problem. Various machine learning algorithms have been trained to develop addressee detection models. The results achieved show that the proposed addressee detection algorithms outperform a baseline. The second model we propose concerns turn change and next speaker prediction in multiparty interaction. Turn change prediction is modeled as a binary classification problem, whereas next speaker prediction is considered a multiclass classification problem. Machine learning algorithms are trained to solve these two interrelated problems. The results show that the proposed models outperform baselines. Finally, the third proposed model concerns visual focus of attention (VFOA) behaviour generation for both speakers and listeners in multiparty interaction. This model is divided into various sub-models that are trained via machine learning as well as heuristic techniques. The results show that our proposed systems yield better performance than the baseline models developed via random and rule-based approaches. The proposed VFOA behaviour generation model is currently implemented as a series of four modules to create different interaction scenarios between multiple virtual agents. For evaluation purposes, recorded videos of the VFOA generation models for speakers and listeners are presented to users, who rate the baseline, the real VFOA behaviour and the proposed VFOA models on various naturalness criteria. The results show that the VFOA behaviour generated via the proposed VFOA model is perceived as more natural than the baselines and as natural as real VFOA behaviour
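As a hedged illustration of the first model, addressee detection cast as supervised multiclass classification might look like the sketch below; the feature vectors and three-participant labels are invented placeholders, and the random forest is just one of the algorithms such a study could compare against a baseline.

```python
# A hedged illustration of addressee detection as supervised multiclass
# classification; features (e.g. speaker gaze, head pose, utterance cues)
# and participant labels here are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # multimodal feature vector per utterance
y = rng.integers(0, 3, size=500)    # addressee: participant 0, 1 or 2

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # to compare against a baseline
```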
Bezivin, Pauline. "Effets du sexe sur la maturation cérébrale et impacts sur la régulation émotionnelle à l’adolescence". Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS205.
Pełny tekst źródłaIn adolescence, brain maturation involves subtle global and regional anatomical changes, and estimating the exact morphology of some structures during post-pubertal development is therefore difficult. The effect of sexual dimorphism on brain maturation has been under-explored prospectively by magnetic resonance imaging. In this context, this work focuses on the longitudinal study of the effects of sex on brain maturation, using two methods to control and analyze the spatial positioning variations of images acquired at different time points. In a first study using a multimodal approach, our goal was to examine sexual dimorphism in the brain maturation of the limbic system to explain the emotional differences between girls and boys during adolescence. We adapted a method of longitudinal processing to anatomical and diffusion images of 335 healthy adolescents between 14 and 16 years of age. We highlighted sex differences in brain maturation of the limbic system, with a later maturation in boys compared to girls. These changes mediated sex differences in emotional regulation, illustrated by an increase in positive personality traits in boys and a decrease in girls. In a second study using an original registration approach, our objective was to estimate and extrapolate maturation trajectories based on sexual dimorphism. We highlighted divergent trajectories between girls and boys between 14 and 16, illustrating a differentiation in maturation rates that increased during this period, specifically in the prefrontal cortex. These differential trajectories made it possible to estimate a maturational advance of 5 months in girls in the prefrontal cortex. All these results provide useful information for a better understanding of the differences in brain maturation between girls and boys, and their links with emotional system dysregulation and hence vulnerability to depression in adolescence
Neumann, Markus. "Automatic multimodal real-time tracking for image plane alignment in interventional Magnetic Resonance Imaging". Phd thesis, Université de Strasbourg, 2014. http://tel.archives-ouvertes.fr/tel-01038023.
Pełny tekst źródła
Hett, Kilian. "Multi-scale and multimodal imaging biomarkers for the early detection of Alzheimer’s disease". Thesis, Bordeaux, 2019. http://www.theses.fr/2019BORD0011/document.
Pełny tekst źródłaAlzheimer’s disease (AD) is the most common dementia, leading to a neurodegenerative process and causing mental dysfunctions. According to the World Health Organization, the number of patients with AD will double in 20 years. Neuroimaging studies performed on AD patients have revealed that structural brain alterations are already advanced when the diagnosis is established. Indeed, the clinical symptoms of AD are preceded by brain changes. This stresses the need to develop new biomarkers to detect the first stages of the disease. The development of such biomarkers can ease the design of clinical trials and therefore accelerate the development of new therapies. Over the past decades, the improvement of magnetic resonance imaging (MRI) has led to the development of new imaging biomarkers. Such biomarkers have demonstrated their relevance for computer-aided diagnosis but have shown limited performance for AD prognosis. Recently, advanced biomarkers were proposed to improve computer-aided prognosis. Among them, patch-based grading methods demonstrated competitive results in detecting subtle modifications at the earliest stages of AD, and have shown their ability to predict AD several years before the conversion to dementia. For these reasons, we took a particular interest in patch-based grading methods. First, we studied patch-based grading methods at different anatomical scales (i.e., whole brain, hippocampus, and hippocampal subfields). We adapted the patch-based grading method to different MRI modalities (i.e., anatomical MRI and diffusion-weighted MRI) and developed an adaptive fusion scheme. Then, we showed that patch comparisons are improved with the use of multi-directional derivative features. Finally, we proposed a new method based on graph modeling that combines information from inter-subject similarities and intra-subject variability. The conducted experiments demonstrate that our proposed method improves AD detection and prediction
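The patch-based grading principle mentioned above can be summarized in a few lines: each subject patch receives a grade between -1 (AD-like) and +1 (normal-like) as a similarity-weighted average of labelled template patches. The sketch below assumes hypothetical patch sizes, bandwidth and data, and omits the thesis's multi-scale, multimodal and graph extensions.

```python
# A minimal sketch of the patch-based grading principle: a grade is the
# similarity-weighted average of the labels of template patches, with
# Gaussian weights on the squared patch distance. Data are hypothetical.
import numpy as np

def patch_grade(patch, template_patches, template_labels, h=1.0):
    # template_labels: +1 for cognitively normal templates, -1 for AD
    d2 = np.sum((template_patches - patch) ** 2, axis=1)
    w = np.exp(-d2 / h ** 2)                 # patch similarity weights
    return np.sum(w * template_labels) / (np.sum(w) + 1e-12)

rng = np.random.default_rng(1)
templates = rng.normal(size=(200, 27))       # 200 library patches (3x3x3 voxels)
labels = rng.choice([-1.0, 1.0], size=200)
print(patch_grade(rng.normal(size=27), templates, labels))
```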
Andriamanampisoa, Fenohery Tiana. "Recalage multimodal 3D utilisant le modèle élastique, la méthode des éléments finis et l'information mutuelle dans un environnement parallèle". Toulouse 3, 2008. http://thesesups.ups-tlse.fr/332/.
Pełny tekst źródłaTo superpose and fuse images in medical imaging, it is indispensable that the images be set in correspondence. This work deals with rigid and non-rigid multimodal registration. As the geometric transformation we chose a centered rotation, translation and scaling. For non-rigid registration, the modelling is based on an isotropic linear elastic material model, discretized with the finite element method on a uniform grid mesh. We use Mattes mutual information as the similarity criterion and conjugate gradient as the optimization method. This work also addresses porting the registration algorithms to a parallel environment, using an SPMD-DM architecture; the experimentation is done on a supercomputer and a large-scale network
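For readers who want to experiment with the Mattes mutual information criterion named above, here is a sketch using SimpleITK as a stand-in pipeline; the file paths are hypothetical, and plain gradient descent replaces the thesis's conjugate gradient and elastic FEM model.

```python
# A sketch of rigid multimodal registration driven by Mattes mutual
# information, using SimpleITK as a stand-in for the thesis's own pipeline.
import SimpleITK as sitk

fixed = sitk.ReadImage("fixed.nii", sitk.sitkFloat32)    # hypothetical paths
moving = sitk.ReadImage("moving.nii", sitk.sitkFloat32)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetInitialTransform(
    sitk.CenteredTransformInitializer(fixed, moving, sitk.Euler3DTransform())
)
transform = reg.Execute(fixed, moving)
# resample the moving image into the fixed image's frame for fusion
resampled = sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)
```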
Mihoub, Alaeddine. "Apprentissage statistique de modèles de comportement multimodal pour les agents conversationnels interactifs". Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAT079/document.
Pełny tekst źródłaFace-to-face interaction is one of the most fundamental forms of human communication. It is a complex, multimodal and coupled dynamic system involving not only speech but also numerous body segments, among which gaze, the orientation of the head, the chest and the body, and facial and brachiomanual movements. Understanding and modeling this type of communication is a crucial stage for designing interactive agents capable of engaging in credible conversations with human partners. Concretely, a model of multimodal behavior for interactive social agents faces the complex task of generating gestural scores given an analysis of the scene and an incremental estimation of the joint objectives pursued during the conversation. The objective of this thesis is to develop models of multimodal behavior that allow artificial agents to engage in relevant co-verbal communication with a human partner. While the immense majority of works in the field of human-agent interaction (HAI) are scripted using rule-based models, our approach relies on training statistical models from tracks collected during exemplary interactions demonstrated by human trainers. In this context, we introduce "sensorimotor" models of behavior, which perform at the same time the recognition of joint cognitive states and the generation of social signals in an incremental way. In particular, the proposed models of behavior have to estimate the current interaction unit (IU) in which the interlocutors are jointly committed and to predict the co-verbal behavior of the human trainer given the behavior of the interlocutor(s). The proposed models are all graphical models, i.e. Hidden Markov Models (HMM) and Dynamic Bayesian Networks (DBN). The models were trained and evaluated - in particular compared with classic classifiers - using datasets collected during two different interactions. Both interactions were carefully designed so as to collect, in a minimum amount of time, a sufficient number of exemplars of mutual attention and multimodal deixis of objects and places. Our contributions are completed by original methods for the interpretation and comparative evaluation of the properties of the proposed models. By comparing the output of the models with the original scores, we show that the HMM, thanks to its sequential modeling properties, outperforms the simple classifiers in terms of performance. The semi-Markovian models (HSMM) further improve the estimation of sensorimotor states thanks to duration modeling. Finally, thanks to a rich structure of dependencies between variables learnt from the data, the DBN demonstrates both the best performance and the multimodal coordination most faithful to the original multimodal events
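As a small illustration of the HMM side of these graphical models, the sketch below fits a Gaussian HMM whose hidden states play the role of interaction units decoded from multimodal observation streams; the feature dimension, state count and data are assumptions, not the thesis's trained models.

```python
# A hedged sketch of the graphical-model approach: a Gaussian HMM whose
# hidden states stand in for interaction units (IUs), decoded from streams
# of multimodal observations. Dimensions and data are hypothetical.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))           # e.g. gaze, head and gesture features
lengths = [100, 100, 100]               # three recorded interactions

model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
model.fit(X, lengths)
states = model.predict(X)               # estimated interaction unit per frame
```

An HSMM, as noted above, would additionally model how long each state lasts instead of assuming geometric durations.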
Harrando, Ismail. "Representation, information extraction, and summarization for automatic multimedia understanding". Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS097.
Pełny tekst źródłaWhether on TV or on the internet, video content production is seeing an unprecedented rise. Not only is video the dominant medium for entertainment purposes, but it is also reckoned to be the future of education, information and leisure. Nevertheless, the traditional paradigm for multimedia management proves incapable of keeping pace with the scale brought about by the sheer volume of content created every day across disparate distribution channels. Thus, routine tasks like archiving, editing, content organization and retrieval become prohibitively costly for multimedia creators. On the user side, too, the amount of multimedia content produced daily can be simply overwhelming; the need for shorter and more personalized content has never been more pronounced. To advance the state of the art on both fronts, a certain level of multimedia understanding has to be achieved by our computers. In this research thesis, we address the multiple challenges facing automatic media content processing and analysis, mainly gearing our exploration towards three axes:
1. Representing multimedia: with all its richness and variety, modeling and representing multimedia content can be a challenge in itself.
2. Describing multimedia: the textual component of multimedia can be capitalized on to generate high-level descriptors, or annotations, for the content at hand (a small illustration follows this abstract).
3. Summarizing multimedia: we investigate the possibility of extracting highlights from media content, both for narrative-focused summarization and for maximising memorability
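As a small illustration of axis 2, high-level textual descriptors can be derived from transcripts with something as simple as TF-IDF term ranking; the transcripts below are invented placeholders and the method is a common baseline, not the thesis's approach.

```python
# A toy baseline for generating textual annotations: rank the TF-IDF terms
# of each (hypothetical) transcript and keep the top ones as descriptors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

transcripts = [
    "the president discussed the climate summit and new emission targets",
    "the match ended with a late goal and celebrations in the stadium",
]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(transcripts)
terms = np.array(vec.get_feature_names_out())
for row in tfidf.toarray():
    print(terms[np.argsort(row)[::-1][:3]])   # top-3 descriptors per video
```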
Sutour, Camille. "Vision nocturne numérique : restauration automatique et recalage multimodal des images à bas niveau de lumière". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0099/document.
Pełny tekst źródłaNight vision for helicopter pilots is artificially enhanced by a night vision system. It consists of a light intensifier (LI) coupled with a digital camera, and an infrared camera. The goal of this thesis is to improve this device by analyzing its defects in order to correct them. The first part consists in reducing the noise level in the LI images. This requires evaluating the nature of the noise corrupting these images, so an automatic noise estimation method has been developed. The estimation is based on a non-parametric detection of homogeneous areas. The noise statistics are then estimated from these homogeneous regions by performing a robust ℓ1 estimation of the noise level function. The LI images can then be denoised using this noise estimate. In the second part we have developed a denoising algorithm that combines non-local means with variational methods, applying an adaptive regularization weighted by a non-local data fidelity term. This algorithm is then adapted to video denoising using the redundancy provided by the sequences, hence guaranteeing temporal stability and preservation of fine structures. Finally, in the third part, data from the optical and infrared sensors are registered. We propose an edge-based multimodal registration metric. Combined with gradient ascent optimization and a temporal scheme, the proposed method allows robust registration of the two modalities for later fusion
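The two denoising building blocks named above, noise level estimation and non-local means, are available off the shelf; the sketch below uses scikit-image as a stand-in, with a hypothetical image path. The thesis's actual method goes further by combining non-local means with an adaptively weighted variational regularization.

```python
# A sketch of the standard building blocks via scikit-image: noise level
# estimation followed by non-local means denoising. The image path and
# parameter choices are hypothetical.
from skimage import io, img_as_float
from skimage.restoration import denoise_nl_means, estimate_sigma

img = img_as_float(io.imread("li_frame.png", as_gray=True))
sigma = estimate_sigma(img)                    # rough noise level estimate
denoised = denoise_nl_means(img, h=1.15 * sigma, patch_size=5,
                            patch_distance=6, fast_mode=True)
```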
Commandeur, Frédéric. "Fusion d'images multimodales pour la caractérisation du cancer de la prostate". Thesis, Rennes 1, 2016. http://www.theses.fr/2016REN1S038/document.
Pełny tekst źródłaThis thesis concerns prostate cancer characterization based on multimodal imaging data. The purpose is to identify and characterize tumors using in-vivo observations, including mMRI and PET/CT, with a biological reference obtained from the anatomopathological analysis of radical prostatectomy specimens providing histological slices. Firstly, we propose two registration methods to match the multimodal images in the spatial reference defined by MRI. The first algorithm aims at aligning PET/CT images with MRI by combining contour information and a presence probability of the prostate. The objective of the second is to register the histological slices with the MRI. Based on the Stanford protocol, a thinner slicing of the radical prostatectomy specimen is done, providing more slices than in clinical routine. The correspondence between histological and MRI slices is then estimated using a combination of prior information about the slicing and salient points (SURF) extracted in both modalities. This initialization step allows for an affine and non-rigid registration based on mutual information and distance maps of intraprostatic structures. Secondly, structural (Haar, Gabor, etc.) and functional (Ktrans, Kep, SUV, TLG, etc.) descriptors are extracted for each prostate voxel from the MRI and PET images. Corresponding biological labels obtained from the anatomopathological analysis are associated with the feature vectors. The biological labels comprise the Gleason score, providing information on aggressiveness, and immunohistochemistry grades, providing a quantification of biological processes such as hypoxia and cell growth. Finally, these pairs (feature vectors/biological information) are used as training data to build RF and SVM classifiers to characterize tumors from new in-vivo observations. In this work, we perform a feasibility study with nine patients
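A hedged sketch of the final classification step follows: RF and SVM classifiers trained on (feature vector, biological label) pairs. The feature content and binary labels are placeholders for the mMRI/PET descriptors and anatomopathological labels described above.

```python
# A hedged sketch of the last step: RF and SVM classifiers trained on
# per-voxel feature vectors with biological labels. Data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 12))       # per-voxel structural+functional features
y = rng.integers(0, 2, size=1000)     # e.g. a tumour aggressiveness class

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)
```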
Soury, Mariette. "Détection multimodale du stress pour la conception de logiciels de remédiation". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112278/document.
Pełny tekst źródłaThis thesis focuses on the automatic recognition of human stress during stress-inducing interactions (public speaking, job interviews and serious games), using audio and visual cues. In order to build automatic stress recognition models, we used audio cues computed from subjects' voices captured via a lapel microphone, and visual cues computed either from subjects' facial expressions captured via a webcam, or from subjects' posture captured via a Kinect. Part of this work is dedicated to the study of information fusion from these various modalities. Stress expression and coping are influenced both by interpersonal differences (personality traits, past experiences, cultural background) and contextual differences (type of stressor, situation's stakes). We evaluated stress in various populations in data corpora collected during this thesis: social phobics in anxiety-inducing situations in interaction with a machine and with humans; non-clinical subjects in a mock job interview; and non-clinical subjects interacting with a computer and with the humanoid robot Nao. Inter-individual and inter-corpora comparisons highlight the variability of stress expression. A possible application of this work could be the elaboration of therapeutic software to teach stress coping strategies, particularly for social phobics. Key words: stress, social phobia, multimodal stress detection, stress audio cues, stress facial cues, stress postural cues, multimodal fusion
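To illustrate one audio leg of such a system and the score-level fusion it can feed, here is a minimal sketch assuming a hypothetical recording, fabricated per-modality scores and an equal-weight fusion rule; it is not the thesis's feature set.

```python
# A sketch of MFCC statistics as audio stress cues, fused late with a
# (here fabricated) visual score. File name, scores and weights are
# hypothetical assumptions.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
audio_feats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

audio_score, visual_score = 0.62, 0.48           # per-modality stress scores
fused = 0.5 * audio_score + 0.5 * visual_score   # simple late (score) fusion
```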
Yoo, Thomas. "Application of a Multimodal Polarimetric Imager to Study the Polarimetric Response of Scattering Media and Microstructures". Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLX106/document.
Pełny tekst źródłaThe work carried out during this thesis aimed to study the interaction of polarized light with scattering media and particles. This work is part of a strong collaborative context between the LPICM and various private and public laboratories. A wide variety of aspects has been treated in depth, including instrumental development, advanced numerical simulation and the creation of measurement protocols for the interpretation of complex data. The instrumental part of the thesis was devoted to the development of an innovative instrument, suitable for taking polarimetric images at different scales (from millimeters to microns), that can be quickly reconfigured to offer different imaging modes of the same sample. The two main aspects that characterize the instrument are i) the possibility of obtaining real polarimetric images of the sample and of the angular distribution of light scattered by an illuminated zone whose size and position can be controlled, and ii) total control of the polarization state, size and divergence of the beams. These two aspects are not found together in any other commercial or experimental apparatus today. The first object of study using the multimodal imaging polarimeter was the effect of the thickness of a scattering medium on its optical response. In medical imaging, there is a broad consensus on the benefits of using different polarimetric properties to improve the effectiveness of optical screening techniques for different diseases. Despite these advantages, the interpretation of polarimetric responses in terms of the physiological properties of tissues has been obscured by the influence of the unknown thickness of the sample. The objective of the work was, therefore, to better understand the dependence of the polarimetric properties of different scattering materials on their known thickness. In conclusion, it is possible to show that the polarimetric properties of scattering media vary proportionally with the optical path that the light has traveled inside the medium, whereas the degree of polarization depends quadratically on the optical path. This discovery could be used to develop a method of data analysis that overcomes the effect of thickness variations, thus making the measurements very robust and related only to the intrinsic properties of the samples studied. The second object of study was the polarimetric response of particles of micrometric size. The particles were selected by analogy to the size of the cells that form biological tissues and are responsible for the scattering of light. By means of the polarimetric measurements, it was discovered that when the microparticles are illuminated at oblique incidence with respect to the optical axis of the microscope, they appear to behave as if they were optically active. Moreover, it was found that the value of this apparent optical activity depends on the shape of the particles. The explanation of this phenomenon is based on the appearance of a topological phase of the beam, which depends on the path of the light scattered inside the microscope. The unprecedented observation of this topological phase was made possible by the fact that the multimodal polarimetric imager allows illumination of the samples at oblique incidence. This discovery can significantly improve the efficiency of optical methods for determining the shape of micro-objects
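As a small numerical aside on the quantity discussed above, the degree of polarization can be computed from a Stokes vector as DOP = sqrt(S1^2 + S2^2 + S3^2) / S0; the example vector below is arbitrary.

```python
# The degree of polarization (DOP) from a Stokes vector (S0, S1, S2, S3),
# the quantity reported above as depending quadratically on the optical
# path. The example vector is arbitrary.
import numpy as np

def degree_of_polarization(S):
    S0, S1, S2, S3 = S
    return np.sqrt(S1**2 + S2**2 + S3**2) / S0

print(degree_of_polarization(np.array([1.0, 0.3, 0.2, 0.1])))  # about 0.374
```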
Rocher, Pierre-Olivier. "Transmodalité de flux d'images de synthèse". Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET2026/document.
Pełny tekst źródłaThe use of video as an information dissemination medium has become preponderant over the last few years. According to some analysts, by 2017 approximately 90% of the world's bandwidth will be consumed by video streaming services. These services have helped cloud gaming solutions to become more widespread. Such solutions have been devised in the context of the strong development of the cloud-computing paradigm, and they were driven by the proliferation of mobile devices as well as growing network quality. The technologies used in this kind of solution are referred to as remote rendering. They allow the execution of multiple applications while maximizing the number of clients per server. Thus, it is essential to control the bandwidth necessary to allow the required functionality of the various services. The existing cloud gaming solutions in the literature use various methods of video compression to transmit images between server and clients (the pixel reigns supreme). However, there are various other ways of encoding digital images, including the storage of parametric maps, and a number of studies encourage this approach (for both image and video). In this thesis, we propose a hybrid representation of space in order to reduce the bit rate. Our approach utilizes both pixel-based and parametric approaches for the compression of the video stream. The use of two compression techniques requires defining the areas to be covered by the different encoders. This is accomplished by including the user in the rendering loop and attending to the area of most interest to the user. In order to identify this area, an eye-tracker device was used on several games with several testers. We also establish a correlation between the characteristics of the images and the type of game. This helps to identify areas that the player looks at directly or indirectly ("maps of selective attention"), and the encoders are managed accordingly. In this thesis, we detail and implement the architecture and algorithms of such a multi-model encoder (which we call the "transmodeur") as a proof of concept. We also provide an analytical study of our model and of the influence of various parameters on the transmodeur, and demonstrate its effectiveness through an objective study. Our transmodeur (rendering system) has been successfully integrated into the XLcloud project for rendering purposes. A number of improvements (especially in performance) will be required for production use, but it is already possible to use it smoothly at spatial resolutions slightly lower than 720p at 30 frames per second
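A toy version of the gaze-driven idea, not the transmodeur itself, is sketched below: a gaze-centred region is encoded at high JPEG quality while the whole frame is encoded at low quality, trading bit rate against perceived fidelity. The frame, gaze point and radius are hypothetical.

```python
# A toy gaze-driven hybrid encoding: a gaze-centred region at high JPEG
# quality, the rest of the frame at low quality. All values are hypothetical.
import cv2
import numpy as np

frame = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
gx, gy, r = 640, 360, 150                          # gaze centre and ROI radius

roi = frame[gy - r:gy + r, gx - r:gx + r]
_, roi_hq = cv2.imencode(".jpg", roi, [cv2.IMWRITE_JPEG_QUALITY, 90])
_, rest_lq = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 30])
print(len(roi_hq) + len(rest_lq), "bytes vs",
      len(cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 90])[1]))
```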
Poinsot, Audrey. "Traitements pour la reconnaissance biométrique multimodale : algorithmes et architectures". Thesis, Dijon, 2011. http://www.theses.fr/2011DIJOS010.
Pełny tekst źródłaIncluding multiple sources of information in personal identity recognition reduces the limitations of each characteristic used and offers the opportunity to greatly improve performance. This thesis presents the design work done in order to build an efficient general-public recognition system that can be implemented on a low-cost hardware platform. The chosen solution explores the possibilities offered by multimodality, and in particular by the fusion of face and palmprint. The algorithmic chain consists of processing based on Gabor filters and score fusion. A real database of 130 subjects has been designed and built for the study. High performance has been obtained and confirmed on a virtual database consisting of two common public biometric databases (AR and PolyU). Thanks to a comprehensive study of the architecture of DSP components and several implementations carried out on a DSP of the TMS320c64x family, it has been proved that it is possible to implement the system on a single DSP with short processing times. Moreover, work on algorithms and architectures for FPGA implementation has demonstrated that these times can be significantly reduced
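The two stages named above, Gabor filtering and score fusion, can be sketched as follows; kernel parameters, the matcher scores and their ranges are illustrative assumptions, not the thesis's tuned values.

```python
# A sketch of the two stages: a small Gabor filter bank applied to a
# biometric image, then min-max normalized fusion of two matcher scores.
# Kernel parameters and example scores are hypothetical.
import cv2
import numpy as np

img = np.random.randint(0, 255, (128, 128), dtype=np.uint8).astype(np.float32)
feats = []
for theta in np.arange(0, np.pi, np.pi / 4):       # 4 orientations
    k = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                           lambd=10.0, gamma=0.5, psi=0)
    feats.append(cv2.filter2D(img, cv2.CV_32F, k).mean())

def minmax(s, lo, hi):                              # score normalization
    return (s - lo) / (hi - lo)

# equal-weight fusion of a face score (0-100) and a palmprint score (0-1)
fused = 0.5 * minmax(42.0, 0, 100) + 0.5 * minmax(0.71, 0, 1)
```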
Courtial, Nicolas. "Fusion d’images multimodales pour l’assistance de procédures d’électrophysiologie cardiaque". Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S015.
Pełny tekst źródłaCardiac electrophysiology procedures have been proved to be effective in suppressing arrhythmia and heart failure symptoms. Their success rate depends on knowledge of the patient's heart condition, including electrical and mechanical function and tissue quality, which is a major clinical concern for these therapies. This work focuses on the development of patient-specific multimodal models to plan and assist radio-frequency ablation (RFA) and cardiac resynchronization therapy (CRT). First, segmentation, registration and fusion methods have been developed to create these models, allowing the planning of these interventional procedures. For each therapy, specific means of integration within the surgical room have been established for assistance purposes. Finally, a new multimodal descriptor has been synthesized in a post-procedure analysis, aiming to predict the response to CRT depending on the left ventricular stimulation site. These studies have been applied and validated on patients who were candidates for CRT and RFA. They showed the feasibility and interest of integrating such multimodal models into the clinical workflow to assist these procedures