Doctoral dissertations on the topic "Multimodal processing"
Create an accurate reference in APA, MLA, Chicago, Harvard, and many other styles
Consult the top 50 doctoral dissertations for your research on the topic "Multimodal processing".
Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a .pdf file and read its abstract online, whenever the corresponding details are provided in the source's metadata.
Browse doctoral dissertations from a wide variety of disciplines and organise your bibliography correctly.
Cadène, Rémi. "Deep Multimodal Learning for Vision and Language Processing". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS277.
Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on big sets of data. Advances in both communities built the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants. In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between each region of the image. In the last part, we address the issues related to biases in the modeling. We propose a learning strategy to reduce the language biases which are commonly present in visual question answering systems.
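To make the factorized bilinear fusion idea concrete, here is a minimal PyTorch sketch of a rank-constrained bilinear interaction between a question vector and a visual feature vector, in the spirit of factorized bilinear VQA models; the dimensions, activations, and class name are illustrative assumptions, not the exact architecture of the thesis.

```python
import torch
import torch.nn as nn

class FactorizedBilinearFusion(nn.Module):
    """Rank-constrained approximation of a bilinear interaction q^T W v
    (a generic sketch of factorized bilinear pooling, e.g. MUTAN/MLB-style)."""

    def __init__(self, dq: int, dv: int, rank: int = 320, dout: int = 512):
        super().__init__()
        self.proj_q = nn.Linear(dq, rank)   # project the question embedding
        self.proj_v = nn.Linear(dv, rank)   # project the visual features
        self.out = nn.Linear(rank, dout)    # map the fused code to the output space

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # elementwise product in the rank-R space realizes the bilinear interaction
        z = torch.tanh(self.proj_q(q)) * torch.tanh(self.proj_v(v))
        return self.out(z)

fusion = FactorizedBilinearFusion(dq=2400, dv=2048)
answer_logits = fusion(torch.randn(8, 2400), torch.randn(8, 2048))  # shape (8, 512)
```

The elementwise product in the rank-R space is what keeps the parameter count far below that of a full bilinear tensor.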
Hu, Yongtao, and 胡永涛. "Multimodal speaker localization and identification for video processing". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2014. http://hdl.handle.net/10722/212633.
Chen, Xun. "Multimodal biomedical signal processing for corticomuscular coupling analysis". Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/45811.
Sadr, Lahijany Nadi. "Multimodal Signal Processing for Diagnosis of Cardiorespiratory Disorders". Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17636.
Elshaw, Mark. "Multimodal neural grounding of language processing for robot actions". Thesis, University of Sunderland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.420517.
Friedel, Paul. "Sensory information processing: detection, feature extraction, & multimodal integration". 2008. http://mediatum2.ub.tum.de/doc/651333/651333.pdf.
Sadeghi Ghandehari, Soroush. "Multimodal signal processing in the peripheral and central vestibular pathways". Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=95559.
The vestibular sensory organs of the inner ear detect movements of the head in space. This information is sent to the central vestibular neurons located in the brainstem, where signals coming from the cortex, the cerebellum, the spinal cord and various brainstem nuclei also converge. The studies presented here aim to understand the coding scheme and the nature of the signals generated by peripheral vestibular neurons, as well as the processing capabilities of the central vestibular neurons, which are true centers of sensorimotor integration. This work was conducted under both physiological and pathophysiological conditions, using the model of vestibular compensation. Using measures from information theory, we first examined the coding performed by regular and irregular vestibular afferents, two neuronal types that differ notably in the variability of their spontaneous discharge rate (noise) and in their sensitivity (signal). We showed that regular afferent fibers use a temporal code, whereas irregular fibers essentially rely on frequency-modulation coding, increasingly so at higher frequencies, thus acting as true event detectors. We then studied the responses of the afferents to direct vestibular stimulation or to activation of the efferent system. Under physiological conditions, we first demonstrated that the efferent system is indeed functional in the awake monkey.
Fateri, Sina. "Advanced signal processing techniques for multimodal ultrasonic guided wave response". Thesis, Brunel University, 2015. http://bura.brunel.ac.uk/handle/2438/11657.
Caglayan, Ozan. "Multimodal Machine Translation". Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1016/document.
Machine translation aims at automatically translating documents from one language to another without human intervention. With the advent of deep neural networks (DNN), neural approaches to machine translation started to dominate the field, reaching state-of-the-art performance in many languages. Neural machine translation (NMT) also revived the interest in interlingual machine translation due to how it naturally fits the task into an encoder-decoder framework which produces a translation by decoding a latent source representation. Combined with the architectural flexibility of DNNs, this framework paved the way for further research in multimodality with the objective of augmenting the latent representations with other modalities such as vision or speech. This thesis focuses on a multimodal machine translation (MMT) framework that integrates a secondary visual modality to achieve better and visually grounded language understanding. I specifically worked with a dataset containing images and their translated descriptions, where visual context can be useful for word sense disambiguation, missing word imputation, or gender marking when translating from a language with gender-neutral nouns to one with a grammatical gender system, as is the case from English to French. I propose two main approaches to integrate the visual modality: (i) a multimodal attention mechanism that learns to take into account both sentence and convolutional visual representations, (ii) a method that uses global visual feature vectors to prime the sentence encoders and the decoders. Through automatic and human evaluation conducted on multiple language pairs, the proposed approaches were demonstrated to be beneficial. Finally, I further show that by systematically removing certain linguistic information from the input sentences, the true strength of both methods emerges as they successfully impute missing nouns and colors, and can even translate when parts of the source sentences are completely removed.
Fridman, Linnea, i Victoria Nordberg. "Two Multimodal Image Registration Approaches for Positioning Purposes". Thesis, Linköpings universitet, Datorseende, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157424.
Zamzmi, Ghada. "Automatic Multimodal Assessment of Neonatal Pain". Scholar Commons, 2018. https://scholarcommons.usf.edu/etd/7662.
Baum, Karl G. "Multimodal breast imaging: registration, visualization, and image synthesis". Online version of thesis, 2008. http://hdl.handle.net/1850/7063.
Panchev, Christo. "Spatio-temporal and multimodal processing in a spiking neural mind of a robot". Thesis, University of Sunderland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.420478.
Sanders, Teresa H. "Multimodal assessment of Parkinson's disease using electrophysiology and automated motor scoring". Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/51970.
Whitehurst, Daniel Scott. "Techniques for Processing Airborne Imagery for Multimodal Crop Health Monitoring and Early Insect Detection". Thesis, Virginia Tech, 2016. http://hdl.handle.net/10919/73048.
Pełny tekst źródłaMaster of Science
Lizarraga, Gabriel M. "A Neuroimaging Web Interface for Data Acquisition, Processing and Visualization of Multimodal Brain Images". FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3855.
Alameda-Pineda, Xavier. "Egocentric Audio-Visual Scene Analysis: a machine learning and signal processing approach". Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM024/document.
Over the past two decades, industry has developed several commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc.). Others, such as Kinect, include depth sensors and/or small microphone arrays. There are also mobile phones equipped with a stereo camera pair. At the same time, many research-oriented systems became available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors are close to each other. Therefore, they are not able to capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "Egocentric Audio-Visual Scene Analysis". This thesis contributes to this field in several aspects. Firstly, by providing a publicly available data set targeting applications such as action/gesture recognition, speaker localization, tracking and diarisation, sound source localization, dialogue modelling, etc. This work has since been used both inside and outside the thesis. We also investigated the problem of audio-visual event detection. We showed how the trust in one of the modalities (the visual one, to be precise) can be modeled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). Afterwards we modified the approach to target audio-visual speaker detection, yielding an on-line method working in the humanoid robot NAO. In parallel to the work on audio-visual speaker detection, we developed a new approach for audio-visual command recognition. We explored different features and classifiers and confirmed that the use of audio-visual data increases the performance when compared to audio-only and video-only classifiers. Later, we sought the best method using tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt and learn new commands from the user, and must be operational with only a few examples for general public usage. Finally, we contributed to the field of sound source localization, in the particular case of non-coplanar microphone arrays. This is interesting because the geometry of the microphone array can be arbitrary, which opens the door to dynamic microphone arrays that would adapt their geometry to fit particular tasks; it also matters because the design of commercial systems may be subject to constraints for which circular or linear arrays are not suited.
Karvonen, Tuukka Matias. "Towards Visuocomputational Endoscopy: Visual Computing for Multimodal and Multi-Articulated Endoscopy". Kyoto University, 2017. http://hdl.handle.net/2433/227661.
Koubaroulis, D. A. "The multimodal neighbourhood signature for modelling object colour appearance and applications in computer vision". Thesis, University of Surrey, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.365142.
Ladwig, Stefan [Verfasser]. "About multimodal information processing and the relation of proximal and distal action effects / Stefan Ladwig". Aachen: Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2015. http://d-nb.info/1066812535/34.
Leatherday, Christopher. "Evaluation of recurrent glioma and Alzheimer's disease using novel multimodal brain image processing and analysis". Thesis, Curtin University, 2016. http://hdl.handle.net/20.500.11937/2238.
Appelstål, Michael. "Multimodal Model for Construction Site Aversion Classification". Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-421011.
Delecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents". Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634/document.
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents; however, the processing pipelines used are most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of images with captions. In the first part of this study, we are interested in the analysis of audiovisual documents from news television channels. We propose an approach that uses deep neural networks for the representation and fusion of modalities. In the second part of this thesis, we are interested in approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for the correction of prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
Jaime, Mark. "The Role of Temporal Synchrony in the Facilitation of Perceptual Learning during Prenatal Development". FIU Digital Commons, 2007. http://digitalcommons.fiu.edu/etd/58.
Caixeta, Fabio Viegas. "Atividade multimodal no córtex sensorial primário de ratos". Universidade Federal do Rio Grande do Norte, 2010. http://repositorio.ufrn.br:8080/jspui/handle/123456789/17290.
The currently accepted model of sensory processing states that different senses are processed in parallel, and that the activity of specific cortical regions defines the sensory modality perceived by the subject. In this work we used chronic multielectrode extracellular recordings to investigate to which extent neurons in the visual and tactile primary cortices (V1 and S1) of anesthetized rats would respond to sensory modalities not traditionally associated with these cortices. Visual stimulation yielded 87% of responsive neurons in V1, while 82% of S1 neurons responded to tactile stimulation. In the same stimulation sessions, we found 23% of V1 neurons responding to tactile stimuli and 22% of S1 neurons responding to visual stimuli. Our data support an increasing body of evidence that indicates the existence of multimodal processing in primary sensory cortices. They challenge the unimodal sensory processing paradigm and suggest the need for a reinterpretation of the currently accepted model of cortical hierarchy.
Padula, Claudia B. "The Functional and Structural Neural Connectivity of Affective Processing in Alcohol Dependence: A Multimodal Imaging Study". University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1377869730.
Poria, Soujanya. "Novel symbolic and machine-learning approaches for text-based and multimodal sentiment analysis". Thesis, University of Stirling, 2017. http://hdl.handle.net/1893/25396.
Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, which achieves an average accuracy rate of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge here lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced, which incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets, such as MSVD-QA and MSRVTT-QA, validate the proposed methodology, with average accuracies of 45% and 42%, respectively, which represent notable improvements over existing approaches. The third contribution of this research addresses the multimodal video captioning problem, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions, considering the temporal dynamics of video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The utilization of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The research validates the proposed methodology through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291 and 0.3082, respectively. In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques to automate television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
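The cross-correlation-based reciprocal learning described above can be pictured with a generic co-attention step such as the sketch below; the function name and the softmax re-weighting are assumptions for illustration, not the thesis's exact module.

```python
import torch

def cross_modal_reweighting(text_feats: torch.Tensor, video_feats: torch.Tensor):
    """Compute a text/video correlation matrix and let each modality
    attend to the other (a generic co-attention step).

    text_feats: (B, T, d); video_feats: (B, F, d)."""
    corr = torch.einsum("btd,bfd->btf", text_feats, video_feats)            # (B, T, F)
    video_cond_text = torch.softmax(corr, dim=-1) @ video_feats             # (B, T, d)
    text_cond_video = torch.softmax(corr.transpose(1, 2), dim=-1) @ text_feats  # (B, F, d)
    return video_cond_text, text_cond_video

t, v = torch.randn(2, 10, 64), torch.randn(2, 20, 64)
vt, tv = cross_modal_reweighting(t, v)  # each modality re-weighted by the other
```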
Harris, Matthew Joshua. "Accelerating Reverse Engineering Image Processing Using FPGA". Wright State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright155535529307322.
Pełny tekst źródłaGimenes, Gabriel Perri. "Advanced techniques for graph analysis: a multimodal approach over planetary-scale data". Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-26062015-105026/.
Applications such as e-commerce, computer networks, social networks, and biology (protein interaction), among others, have led to the production of data that can be represented as planetary-scale graphs, with millions of nodes and billions of edges. Such applications pose challenging problems when the task is to use the information contained in the graphs to support decision-making processes through the discovery of non-trivial and potentially useful patterns. To process these graphs in search of patterns, both researchers and industry have relied on distributed processing resources organized in computational clusters. However, building and maintaining such clusters can be complex, posing both technical and financial problems that can be prohibitive in several cases. It thus becomes desirable to be able to process large-scale graphs using a single computational node. To this end, processes and algorithms were developed following three different approaches, aiming at an analysis framework capable of revealing patterns, providing insight, and supporting decision making over planetary-scale graphs.
Clark, Rebecca A. "Multimodal flavour perception : the impact of sweetness, bitterness, alcohol content and carbonation level on flavour perception". Thesis, University of Nottingham, 2011. http://eprints.nottingham.ac.uk/13432/.
Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.
Warraich, Daud Sana. "Ultrasonic stochastic localization of hidden discontinuities in composites using multimodal probability beliefs". Thesis, University of New South Wales, Mechanical & Manufacturing Engineering, 2009. http://handle.unsw.edu.au/1959.4/43719.
Toulouse, Tom. "Estimation par stéréovision multimodale de caractéristiques géométriques d'un feu de végétation en propagation". Thesis, Corte, 2015. http://www.theses.fr/2015CORT0009/document.
Pełny tekst źródłaThis thesis presents the geometrical characteristics measurement of spreading vegetation fires with multimodal stereovision systems. Image processing and 3D registration are used in order to obtain a three-dimensional modeling of the fire at each instant of image acquisition and then to compute fire front characteristics like its position, its rate of spread, its height, its width, its inclination, its surface and its volume. The first important contribution of this thesis is the fire pixel detection. A benchmark of fire pixel detection algorithms and of those that are developed in this thesis have been on a database of 500 vegetation fire images of the visible spectra which have been characterized according to the fire properties in the image (color, smoke, luminosity). Five fire pixel detection algorithms based on fusion of data from visible and near-infrared spectra images have also been developed and tested on another database of 100 multimodal images. The second important contribution of this thesis is about the use of images fusion for the optimization of the matching point’s number between the multimodal stereo images.The second important contribution of this thesis is the registration method of 3D fire points obtained with stereovision systems. It uses information collected from a housing containing a GPS and an IMU card which is positioned on each stereovision systems. With this registration, a method have been developed to extract the geometrical characteristics when the fire is spreading.The geometrical characteristics estimation device have been evaluated on a car of known dimensions and the results obtained confirm the good accuracy of the device. The results obtained from vegetation fires are also presented
Pérez-Rosas, Verónica. "Exploration of Visual, Acoustic, and Physiological Modalities to Complement Linguistic Representations for Sentiment Analysis". Thesis, University of North Texas, 2014. https://digital.library.unt.edu/ark:/67531/metadc699996/.
Bonazza, Pierre. "Système de sécurité biométrique multimodal par imagerie, dédié au contrôle d'accès". Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCK017/document.
Pełny tekst źródłaResearch of this thesis consists in setting up efficient and light solutions to answer the problems of securing sensitive products. Motivated by a collaboration with various stakeholders within the Nuc-Track project, the development of a biometric security system, possibly multimodal, will lead to a study on various biometric features such as the face, fingerprints and the vascular network. This thesis will focus on an algorithm and architecture matching, with the aim of minimizing the storage size of the learning models while guaranteeing optimal performances. This will allow it to be stored on a personal support, thus respecting privacy standards
Mani, Gayathri. "Smells and multimodal learning: The role of congruency in the processing of olfactory, visual and verbal elements of product offerings". Diss., The University of Arizona, 1999. http://hdl.handle.net/10150/283973.
Simonetta, Federico. "Music Interpretation Analysis. A Multimodal Approach to Score-Informed Resynthesis of Piano Recordings". Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.
Olsheski, Julia DeBlasio. "The role of synesthetic correspondence in intersensory binding: investigating an unrecognized confound in multimodal perception research". Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50215.
Meseguer Brocal, Gabriel. "Multimodal analysis: informed content estimation and audio source separation". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS111.
Pełny tekst źródłaThis dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the audio and lyrics interaction for targeting source separation and informed content estimation. Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Our understanding learns useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of the current machine learning methods. To develop our multimodal analysis, we need first to address the lack of data containing singing voice with aligned lyrics. This data is mandatory to develop our ideas. Therefore, we investigate how to create such a dataset automatically leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We are constantly working with the classic ``chicken or the egg'' problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We propose to use the teacher-student paradigm to develop a method where dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. In this process, non-expert karaoke time-aligned lyrics and notes describe the lyrics as a sequence of time-aligned notes with their associated textual information. We then link each annotation to the correct audio and globally align the annotations to it. For this purpose, we use the normalized cross-correlation between the voice annotation sequence and the singing voice probability vector automatically, which is obtained using a deep convolutional neural network. Using the collected data we progressively improve that model. Every time we have an improved version, we can in turn correct and enhance the data
Calumby, Rodrigo Tripodi. "Recuperação multimodal de imagens com realimentação de relevância baseada em programação genética". Master's thesis, Universidade Estadual de Campinas, Instituto de Computação, 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275814.
This work presents an approach for multimodal content-based image retrieval with relevance feedback based on genetic programming. We assume that there is textual information (e.g., metadata, textual descriptions) associated with collection images. Furthermore, image content properties (e.g., color and texture) are characterized by image descriptors. Given the information obtained over the relevance feedback iterations, genetic programming is used to create effective combination functions that combine the similarities associated with different features. Using these new functions, the different similarities are combined into a unique measure that more properly meets the user needs. The main contribution of this work is the proposal and implementation of two frameworks. The first one, RFCore, is a generic framework for relevance feedback tasks over digital objects. The second one, MMRF-GP, is a framework for digital object retrieval with relevance feedback based on genetic programming, built on top of RFCore. We have validated the proposed multimodal image retrieval approach over two datasets, one from the University of Washington and another from the ImageCLEF Photographic Retrieval Task. Our approach has yielded the best results for multimodal image retrieval when compared with one-modality approaches. Furthermore, it has achieved better results for visual and multimodal image retrieval than the best submissions to the ImageCLEF Photographic Retrieval Task 2008.
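A toy sketch of the genetic-programming idea follows: an individual is an arithmetic tree over per-modality similarity scores, and relevance feedback would supply the fitness used to select better trees over generations; all names and operators here are illustrative, not the framework's actual code.

```python
import random

# A candidate combination function is a small expression tree whose leaves
# are per-modality similarity scores for a given image.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b, "max": max}
LEAVES = ["color_sim", "texture_sim", "text_sim"]

def random_tree(depth: int = 2):
    """Grow a random expression tree of the given depth."""
    if depth == 0:
        return random.choice(LEAVES)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def combined_similarity(tree, sims: dict) -> float:
    """Evaluate a tree on one image's per-modality similarity scores."""
    if isinstance(tree, str):
        return sims[tree]
    op, left, right = tree
    return OPS[op](combined_similarity(left, sims), combined_similarity(right, sims))

sims = {"color_sim": 0.4, "texture_sim": 0.7, "text_sim": 0.9}
individual = random_tree()
print(individual, "->", combined_similarity(individual, sims))
```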
Mozaffari, Maaref Mohammad Hamed. "A Real-Time and Automatic Ultrasound-Enhanced Multimodal Second Language Training System: A Deep Learning Approach". Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/40477.
Pełny tekst źródłaItani, Sara T. "EduCase : an automated lecture video recording, post-processing, and viewing system that utilizes multimodal inputs to provide a dynamic student experience". Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85426.
This thesis describes the design, implementation, and evaluation of EduCase: an inexpensive automated lecture video recording, post-processing, and viewing system. The EduCase recording system consists of three devices, one per lecture hall board. Each recording device records color, depth, skeletal, and audio inputs. The Post-Processor automatically processes the recordings to produce an output file usable by the Viewer, which provides a more dynamic student experience than traditional video playback systems. In particular, it allows students to flip back to view a previous board while the lecture continues to play in the background. It also allows students to toggle the professor's visibility to see the board they might be blocking. The system was successfully evaluated in blackboard-heavy lectures at MIT and Harvard. We hope that EduCase will be the quickest, least expensive, and most student-friendly lecture capture system, and contribute to our overarching goal of education for all.
Mitra, Jhimli. "Multimodal Image Registration applied to Magnetic Resonance and Ultrasound Prostatic Images". PhD thesis, Université de Bourgogne, 2012. http://tel.archives-ouvertes.fr/tel-00786032.
Rabhi, Sara. "Optimized deep learning-based multimodal method for irregular medical timestamped data". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS003.
Pełny tekst źródłaThe wide adoption of Electronic Health Records in hospitals’ information systems has led to the definition of large databases grouping various types of data such as textual notes, longitudinal medical events, and tabular patient information. However, the records are only filled during consultations or hospital stays that depend on the patient’s state, and local habits. A system that can leverage the different types of data collected at different time scales is critical for reconstructing the patient’s health trajectory, analyzing his history, and consequently delivering more adapted care.This thesis work addresses two main challenges of medical data processing: learning to represent the sequence of medical observations with irregular elapsed time between consecutive visits and optimizing the extraction of medical events from clinical notes. Our main goal is to design a multimodal representation of the patient’s health trajectory to solve clinical prediction problems. Our first work built a framework for modeling irregular medical time series to evaluate the importance of considering the time gaps between medical episodes when representing a patient’s health trajectory. To that end, we conducted a comparative study of sequential neural networks and irregular time representation techniques. The clinical objective was to predict retinopathy complications for type 1 diabetes patients in the French database CaRéDIAB (Champagne Ardenne Réseau Diabetes) using their history of HbA1c measurements. The study results showed that the attention-based model combined with the soft one-hot representation of time gaps led to AUROC score of 88.65% (specificity of 85.56%, sensitivity of 83.33%), an improvement of 4.3% when compared to the LSTM-based model. Motivated by these results, we extended our framework to shorter multivariate time series and predicted in-hospital mortality for critical care patients of the MIMIC-III dataset. The proposed architecture, HiTT, improved the AUC score by 5% over the Transformer baseline. In the second step, we focused on extracting relevant medical information from clinical notes to enrich the patient’s health trajectories. Particularly, Transformer-based architectures showed encouraging results in medical information extraction tasks. However, these complex models require a large, annotated corpus. This requirement is hard to achieve in the medical field as it necessitates access to private patient data and high expert annotators. To reduce annotation cost, we explored active learning strategies that have been shown to be effective in tasks such as text classification, information extraction, and speech recognition. In addition to existing methods, we defined a Hybrid Weighted Uncertainty Sampling active learning strategy that takes advantage of the contextual embeddings learned by the Transformer-based approach to measuring the representativeness of samples. A simulated study using the i2b2-2010 challenge dataset showed that our proposed metric reduces the annotation cost by 70% to achieve the same score as passive learning. Lastly, we combined multivariate medical time series and medical concepts extracted from clinical notes of the MIMIC-III database to train a multimodal transformer-based architecture. The test results of the in-hospital mortality task showed an improvement of 5.3% when considering additional text data. 
This thesis contributes to patient health trajectory representation by alleviating the burden of episodic medical records and the manual annotation of free-text notes
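The soft one-hot representation of time gaps mentioned above can be illustrated as follows; the log-scale distance kernel and the bin centers are assumptions, since the abstract does not specify the exact construction.

```python
import numpy as np

def soft_one_hot_gap(delta_days: float, bin_centers: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Soft one-hot code for the elapsed time between two medical visits.

    Rather than snapping the gap to a single bin, every bin receives a
    weight decreasing with its (log-scale) distance to the gap, so that
    similar durations produce similar codes."""
    logits = -np.abs(np.log1p(delta_days) - np.log1p(bin_centers)) / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

bins = np.array([7.0, 30.0, 90.0, 180.0, 365.0])   # one week up to one year
print(soft_one_hot_gap(45.0, bins).round(3))       # mass spread over the 30/90-day bins
```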
Prates, Jonathan Simon. "Gerenciamento de diálogo baseado em modelo cognitivo para sistemas de interação multimodal". Universidade do Vale do Rio dos Sinos, 2015. http://www.repositorio.jesuita.org.br/handle/UNISINOS/3348.
Multimodal interaction systems allow a friendlier use of computing systems. They let users receive information and express their needs with ease, supported by increasingly diverse interaction resources. In this context, the central element is the dialogue established between users and these systems. The dialogue management of these systems involves various activities associated with the representation of the subjects treated, the choice of possible answers, and the handling of task and user models. Across the known implementation approaches, there is a demand for dialogue models that bring the interactions generated by these systems closer to what would be expected in natural language interaction. One possible line of action to obtain improvements in this respect is the use of cognitive psychology studies on working memory and information integration. This work presents the results obtained with a dialogue-handling model for multimodal interaction systems based on a cognitive model, which aims to generate dialogues that come closer to natural language dialogue situations. The studies that supported this proposal and the justification for its use in the described model are presented, together with preliminary results obtained with prototypes built to validate the model. The evaluations carried out show good potential for the proposed model.
Leong, Chee Wee. "Modeling Synergistic Relationships Between Words and Images". Thesis, University of North Texas, 2012. https://digital.library.unt.edu/ark:/67531/metadc177223/.
Maman, Lucien. "Automated analysis of cohesion in small groups interactions". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT030.
Pełny tekst źródłaOver the last decade, a new multidisciplinary research domain named Social Signal Processing (SSP) emerged. It is aimed at enabling machines to sense, recognize, and display human social signals. One of the challenging tasks addressed by SSP is the automated group interaction analysis. Recently, a particular emphasis is given to the automated study of emergent states as they play an important role in group dynamics. These are social processes that develop throughout group members' interactions.In this Thesis, we address the automated analysis of cohesion in small groups interactions. Cohesion is a multidimensional affective emergent state that can be defined as a dynamic process reflected by the tendency of a group to stick together to pursue goals and/or affective needs. Despite the rich literature available on cohesion from a Social Sciences perspective, its automated analysis is still in its infancy. Grounding on Social Sciences' insights, this Thesis aims to develop computational models of cohesion following four axes research axes, leveraging Machine Learning and Deep Learning techniques. Computational models of cohesion, indeed, should account for the temporal nature of cohesion, the multidimensionality of this group process, take into account how to model cohesion from both individuals and group perspectives, integrate the relationships between its dimensions and their development over time, and take heed of the relationships between cohesion and other group processes.In addition, facing a lack of publicly available data, this Thesis contributed to the collection of a multimodal dataset specifically designed for studying group cohesion and for explicitly controlling its variations over time. Such a dataset enables, among other perspectives, further development of computational models integrating the perceived cohesion from group members and/or external points of view. Our results show the relevance of leveraging Social Sciences' insights to develop new computational models of cohesion and confirm the benefits of exploring each of the four research axes