Theses on the topic "Multimodal processing"

Follow this link to see other types of publications on the topic: Multimodal processing.

Cite a source in APA, MLA, Chicago, Harvard, and many other citation styles.

Choose the type of source:

Browse the top 50 dissertations (graduate or doctoral theses) on the research topic "Multimodal processing".

Next to every source in the reference list there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic citation of the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf and read its abstract online, when one is included in the metadata.

Browse dissertations from many scientific disciplines and compile an accurate bibliography.

1

Cadène, Rémi. "Deep Multimodal Learning for Vision and Language Processing". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS277.

Abstract:
Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on big sets of data. Advances in both communities built the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants. In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between each region of the image. In the last part, we address the issues related to biases in the modeling. We propose a learning strategy to reduce the language biases which are commonly present in visual question answering systems.
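The factorized bilinear fusion named in this abstract can be illustrated with a minimal sketch. This is not the author's implementation; the dimensions, projection matrices, and function name below are hypothetical placeholders, and a real VQA model would learn the projections end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: question embedding, image-region feature, joint space, answer vocabulary.
d_q, d_v, d_joint, n_answers = 310, 2048, 512, 3000

# In a trained model these projections are learned; here they are random stand-ins.
W_q = rng.normal(scale=0.02, size=(d_q, d_joint))
W_v = rng.normal(scale=0.02, size=(d_v, d_joint))
W_out = rng.normal(scale=0.02, size=(d_joint, n_answers))

def factorized_bilinear_fusion(q_feat: np.ndarray, v_feat: np.ndarray) -> np.ndarray:
    """Approximate the full bilinear interaction q^T W v with two low-rank projections
    followed by an element-wise product, then map the fused vector to answer scores."""
    joint = np.tanh(q_feat @ W_q) * np.tanh(v_feat @ W_v)
    return joint @ W_out

scores = factorized_bilinear_fusion(rng.normal(size=d_q), rng.normal(size=d_v))
print(scores.shape)  # (3000,)
```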
2

Hu, Yongtao, and 胡永涛. "Multimodal speaker localization and identification for video processing". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2014. http://hdl.handle.net/10722/212633.

3

Chen, Xun. "Multimodal biomedical signal processing for corticomuscular coupling analysis". Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/45811.

Abstract:
Corticomuscular coupling analysis using multiple data sets such as electroencephalogram (EEG) and electromyogram (EMG) signals provides a useful tool for understanding human motor control systems. A popular conventional method to assess corticomuscular coupling is the pair-wise magnitude-squared coherence (MSC). However, there are certain limitations associated with MSC, including the difficulty of robustly assessing group inference, the restriction to only two types of data sets at a time, and the biologically implausible assumption of pair-wise interactions. In this thesis, we propose several novel signal processing techniques to overcome the disadvantages of current coupling analysis methods. We propose combining partial least squares (PLS) and canonical correlation analysis (CCA) to take advantage of both techniques to ensure that the extracted components are maximally correlated across two data sets and meanwhile can well explain the information within each data set. Furthermore, we propose jointly incorporating response-relevance and statistical independence into a multi-objective optimization function, meaningfully combining the goals of independent component analysis (ICA) and PLS under the same mathematical umbrella. In addition, we extend the coupling analysis to multiple data sets by proposing a joint multimodal group analysis framework. Finally, to acquire independent components rather than merely uncorrelated ones, we improve the multimodal framework by exploiting the complementary properties of multiset canonical correlation analysis (M-CCA) and joint ICA. Simulations show that our proposed methods achieve superior performance compared to conventional approaches. We also apply the proposed methods to concurrent EEG, EMG and behavior data collected in a Parkinson's disease (PD) study. The results reveal highly correlated temporal patterns among the multimodal signals and corresponding spatial activation patterns. In addition to the expected motor areas, the corresponding spatial activation patterns demonstrate enhanced occipital connectivity in PD subjects, consistent with previous medical findings.
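As a point of reference for the pair-wise magnitude-squared coherence (MSC) that this thesis takes as its conventional baseline, here is a minimal sketch using SciPy; the sampling rate, window length, and the synthetic EEG/EMG stand-ins are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import coherence

fs = 1000.0                      # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)

# Synthetic stand-ins for EEG and EMG: a shared 20 Hz component plus independent noise.
rng = np.random.default_rng(1)
shared = np.sin(2 * np.pi * 20 * t)
eeg = shared + rng.normal(scale=1.0, size=t.size)
emg = 0.5 * shared + rng.normal(scale=1.0, size=t.size)

# Welch-based magnitude-squared coherence between the two channels.
f, msc = coherence(eeg, emg, fs=fs, nperseg=1024)
print(f[np.argmax(msc)], msc.max())   # coherence peak should sit near 20 Hz
```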
4

Sadr, Lahijany Nadi. "Multimodal Signal Processing for Diagnosis of Cardiorespiratory Disorders". Thesis, The University of Sydney, 2017. http://hdl.handle.net/2123/17636.

Abstract:
This thesis addresses the use of multimodal signal processing to develop algorithms for the automated processing of two cardiorespiratory disorders. The aim of the first application of this thesis was to reduce false alarm rate in an intensive care unit. The goal was to detect five critical arrhythmias using processing of multimodal signals including photoplethysmography, arterial blood pressure, Lead II and augmented right arm electrocardiogram (ECG). A hierarchical approach was used to process the signals as well as a custom signal processing technique for each arrhythmia type. Sleep disorders are a prevalent health issue, currently costly and inconvenient to diagnose, as they normally require an overnight hospital stay by the patient. In the second application of this project, we designed automated signal processing algorithms for the diagnosis of sleep apnoea with a main focus on the ECG signal processing. We estimated the ECG-derived respiratory (EDR) signal using different methods: QRS-complex area, principal component analysis (PCA) and kernel PCA. We proposed two algorithms (segmented PCA and approximated PCA) for EDR estimation to enable applying the PCA method to overnight recordings and rectify the computational issues and memory requirement. We compared the EDR information against the chest respiratory effort signals. The performance was evaluated using three automated machine learning algorithms of linear discriminant analysis (LDA), extreme learning machine (ELM) and support vector machine (SVM) on two databases: the MIT PhysioNet database and the St. Vincent’s database. The results showed that the QRS area method for EDR estimation combined with the LDA classifier was the highest performing method and the EDR signals contain respiratory information useful for discriminating sleep apnoea. As a final step, heart rate variability (HRV) and cardiopulmonary coupling (CPC) features were extracted and combined with the EDR features and temporal optimisation techniques were applied. The cross-validation results of the minute-by-minute apnoea classification achieved an accuracy of 89%, a sensitivity of 90%, a specificity of 88%, and an AUC of 0.95 which is comparable to the best results reported in the literature.
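To make the ECG-derived respiration (EDR) estimation concrete, here is a rough sketch of the QRS-area idea (the best-performing variant reported in the abstract) alongside a beat-wise PCA variant; the window length, the crude peak detector, and the synthetic ECG are assumptions, not the thesis's exact pipeline.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.decomposition import PCA

def edr_from_ecg(ecg: np.ndarray, fs: float, half_win_s: float = 0.05):
    """Return beat times and two EDR series: QRS area and first PCA score per beat."""
    # Crude R-peak detection; a real system would use a dedicated QRS detector.
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=np.percentile(ecg, 95))
    half = int(half_win_s * fs)
    peaks = peaks[(peaks > half) & (peaks < len(ecg) - half)]

    beats = np.stack([ecg[p - half:p + half] for p in peaks])   # one row per QRS complex
    edr_area = np.trapz(np.abs(beats), dx=1 / fs, axis=1)       # QRS-area method
    edr_pca = PCA(n_components=1).fit_transform(beats).ravel()  # PCA-based method
    return peaks / fs, edr_area, edr_pca

# Toy ECG: a Gaussian spike per beat at roughly 75 bpm.
fs = 250.0
t = np.arange(0, 60, 1 / fs)
ecg = sum(np.exp(-0.5 * ((t - bt) / 0.01) ** 2) for bt in np.arange(1, 59, 0.8))
times, area_edr, pca_edr = edr_from_ecg(ecg, fs)
print(len(times), area_edr[:3])
```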
5

Elshaw, Mark. "Multimodal neural grounding of language processing for robot actions". Thesis, University of Sunderland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.420517.

6

Friedel, Paul. "Sensory information processing : detection, feature extraction, & multimodal integration". kostenfrei, 2008. http://mediatum2.ub.tum.de/doc/651333/651333.pdf.

7

Sadeghi, Ghandehari Soroush. "Multimodal signal processing in the peripheral and central vestibular pathways". Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=95559.

Abstract:
The vestibular sensory apparatus and associated vestibular nuclei encode head movements during our daily activities. In addition to direct inputs from vestibular afferents, the vestibular nuclei receive substantial projections from cortical, cerebellar, spinal cord, and other brainstem structures. The present studies were aimed at investigating the coding strategies and the signals carried by the peripheral and central vestibular neurons under normal conditions and following vestibular compensation. In normal animals, we first studied the coding strategies of regular and irregular afferents using information theoretic measures over the range of functionally relevant frequencies and found differences between the two types of afferents as a result of different variability in the resting discharge (i.e., noise) and sensitivity (i.e., signal). We found that regular afferents carry information mostly in the spike times of their discharge, whereas irregular afferents carry information mostly in their firing rate and at the higher frequency range of stimuli, thus acting as event detectors. We next studied the signals carried by the vestibular-nerve afferents either as a result of direct vestibular stimulation or through efferent fibers, in normal conditions and following vestibular lesion. We first showed that the efferent vestibular system is functional in the alert monkey. In order to address the functional role of the efferent system, we then characterized the responses of vestibular afferents evoked by a wide range of stimuli. We found that vestibular afferents did not encode extravestibular signals and that their response properties do not change significantly following lesion. Thus the question of the functional role of the vestibular efferent system remains open. In addition, our findings demonstrate that the vestibular periphery (afferents and efferents) does not show the plasticity required to support vestibular compensation. Finally, we studied the central vestibular
8

Fateri, Sina. "Advanced signal processing techniques for multimodal ultrasonic guided wave response". Thesis, Brunel University, 2015. http://bura.brunel.ac.uk/handle/2438/11657.

Abstract:
Ultrasonic technology is commonly used in the field of Non-Destructive Testing (NDT) of metal structures such as steel, aluminium, etc. Compared to ultrasonic bulk waves that travel in infinite media with no boundary influence, Ultrasonic Guided Waves (UGWs) require a structural boundary for propagation such that they can be used to inspect and monitor long elements of a structure from a single position. The greatest challenges for any UGW system are the plethora of wave modes arising from the geometry of the structural element which propagate with a range of frequency dependent velocities and the interpretation of these combined signals reflected by discontinuities in the structural element. In this thesis, a technique is developed which facilitates the measurement of Time of Arrival (ToA) and group velocity dispersion curves of wave modes for one dimensional structures as far as wave propagation is concerned. A second technique is also presented which employs the dispersion curves to deliver enhanced range measurements in complex multimodal UGW responses. Ultimately, the aforementioned techniques are used as a part of the analysis of previously unreported signals arising from interactions of UGWs with piezoelectric transducers. The first signal processing technique is presented which used a combination of frequency-sweep measurement, sampling rate conversion and the Fourier transform. The technique is applied to synthesized and experimental data in order to identify different wave modes in complex UGW signals. It is demonstrated that the technique has the capability to derive the ToA and group velocity dispersion curve of the wave modes of interest. The second signal processing technique uses broad band excitation, dispersion compensation and cross-correlation. The technique is applied to synthesized and experimental data in order to identify different wave modes in complex UGW signals. It is demonstrated that the technique noticeably improves the Signal to Noise Ratio (SNR) of the UGW response using a priori knowledge of the dispersion curve. It is also able to derive accurate quantitative information about the ToA and the propagation distance. During the development of the aforementioned signal processing techniques, some unwanted wave-packets are identified in the UGW responses which are found to be induced by the coupling of a shear mode piezoelectric transducer at the free edge of the waveguide. Accordingly, the effect of the force on the piezoelectric transducers and the corresponding reflections and mode conversions are studied experimentally. The aforementioned signal processing techniques are also employed as a part of the study. A Finite Element Analysis (FEA) procedure is also presented which can potentially improve the theoretical predictions and converge to results found in experimental routines. The approach enhances the confidence in the FEA models compared to traditional approaches. The outcome of the research conducted in this thesis paves the way to enhance the reliability of UGW inspections by utilizing the signal processing techniques and studying the multimodal responses.
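The cross-correlation step behind the time-of-arrival and range estimates can be sketched as follows; this toy version omits the dispersion-compensation stage described in the thesis, and the sampling rate and toneburst shape are assumptions.

```python
import numpy as np
from scipy.signal import correlate

def estimate_toa(received: np.ndarray, excitation: np.ndarray, fs: float) -> float:
    """Time of arrival (s) of the excitation wave-packet within the received trace."""
    xc = correlate(received, excitation, mode="full")
    lag = np.argmax(np.abs(xc)) - (len(excitation) - 1)   # sample offset of best match
    return lag / fs

fs = 1_000_000.0                                   # assumed 1 MHz sampling
t = np.arange(0, 2e-4, 1 / fs)
pulse = np.sin(2 * np.pi * 50e3 * t) * np.hanning(t.size)   # 50 kHz Hann-windowed toneburst
trace = np.zeros(4000)
trace[1200:1200 + pulse.size] += pulse                       # echo arriving at sample 1200
print(estimate_toa(trace, pulse, fs))                        # ~1.2e-3 s (sample 1200)
```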
9

Caglayan, Ozan. "Multimodal Machine Translation". Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1016/document.

Abstract:
Machine translation aims at automatically translating documents from one language to another without human intervention. With the advent of deep neural networks (DNN), neural approaches to machine translation started to dominate the field, reaching state-of-the-art performance in many languages. Neural machine translation (NMT) also revived the interest in interlingual machine translation due to how it naturally fits the task into an encoder-decoder framework which produces a translation by decoding a latent source representation. Combined with the architectural flexibility of DNNs, this framework paved the way for further research in multimodality with the objective of augmenting the latent representations with other modalities such as vision or speech, for example. This thesis focuses on a multimodal machine translation (MMT) framework that integrates a secondary visual modality to achieve better and visually grounded language understanding. I specifically worked with a dataset containing images and their translated descriptions, where visual context can be useful for word sense disambiguation, missing word imputation, or gender marking when translating from a language with gender-neutral nouns to one with a grammatical gender system, as is the case with English to French. I propose two main approaches to integrate the visual modality: (i) a multimodal attention mechanism that learns to take into account both sentence and convolutional visual representations, (ii) a method that uses global visual feature vectors to prime the sentence encoders and the decoders. Through automatic and human evaluation conducted on multiple language pairs, the proposed approaches were demonstrated to be beneficial. Finally, I further show that by systematically removing certain linguistic information from the input sentences, the true strength of both methods emerges as they successfully impute missing nouns and colors, and can even translate when parts of the source sentences are completely removed.
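The second approach in this abstract, priming recurrent encoders and decoders with a global visual feature, essentially amounts to initialising a hidden state from a projected image vector. The sketch below is a schematic with hypothetical dimensions and a random projection, not the thesis code.

```python
import numpy as np

d_img, d_hid = 2048, 512          # assumed CNN feature size and RNN hidden size
rng = np.random.default_rng(0)
W_init = rng.normal(scale=0.02, size=(d_img, d_hid))   # learned in a real model

def prime_hidden_state(global_visual_feat: np.ndarray) -> np.ndarray:
    """Map a pooled CNN feature to the initial hidden state of a recurrent decoder."""
    return np.tanh(global_visual_feat @ W_init)

h0 = prime_hidden_state(rng.normal(size=d_img))
print(h0.shape)   # (512,)
```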
10

Fridman, Linnea, and Victoria Nordberg. "Two Multimodal Image Registration Approaches for Positioning Purposes". Thesis, Linköpings universitet, Datorseende, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157424.

Abstract:
This report is the result of a master thesis made by two students at Linköping University. The aim was to find an image registration method for visual and infrared images and to find an error measure for grading the registration performance. In practice this could be used for position determination by registering the infrared image taken at the current position to a set of visual images with known positions and determining which visual image matches the best. Two methods were tried, using different image feature extractors and different ways to match the features. The first method used phase information in the images to generate soft features and then minimised the square error of the optical flow equation to estimate the transformation between the visual and infrared image. The second method used the Canny edge detector to extract hard features from the images and Chamfer distance as an error measure. Both methods were evaluated for registration as well as position determination and yielded promising results. However, the performance of both methods was image dependent. The soft edge method proved to be more robust and precise and worked better than the hard edge method for both registration and position determination.
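The Chamfer-distance error measure used by the hard-edge method can be sketched with a distance transform. The edge maps below are synthetic booleans; a real pipeline would obtain them from a Canny detector, as in the thesis.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(ref_edges: np.ndarray, probe_edges: np.ndarray) -> float:
    """Mean distance from each probe edge pixel to the nearest reference edge pixel."""
    # distance_transform_edt measures distance to the nearest zero, so invert the mask.
    dist_to_ref = distance_transform_edt(~ref_edges)
    return float(dist_to_ref[probe_edges].mean())

ref = np.zeros((100, 100), dtype=bool)
ref[50, :] = True                          # a horizontal reference edge
probe = np.zeros_like(ref)
probe[53, :] = True                        # the same edge shifted by 3 pixels
print(chamfer_distance(ref, probe))        # 3.0
```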
11

Zamzmi, Ghada. "Automatic Multimodal Assessment of Neonatal Pain". Scholar Commons, 2018. https://scholarcommons.usf.edu/etd/7662.

Abstract:
For several decades, pediatricians used to believe that neonates do not feel pain. The American Academy of Pediatrics (AAP) recognized neonates' sense of pain in 1987. Since then, there have been many studies reporting a strong association between repeated pain exposure (under-treatment) and alterations in brain structure and function. This association has led to the increased use of anesthetic medications. However, recent studies found that the excessive use of analgesic medications (over-treatment) can cause many side effects. The current standard for assessing neonatal pain is discontinuous and suffers from inter-observer variations, which can lead to over- or under-treatment. Therefore, it is critical to address the shortcomings of the current standard and develop continuous and less subjective pain assessment tools. This dissertation introduces an automatic and comprehensive neonatal pain assessment system. The presented system is different from the previous ones in three principal ways. First, it is specifically designed to assess pain of neonates using data captured while they are hospitalized in the Neonatal Intensive Care Units (NICU). Second, it dynamically analyzes neonatal pain as it unfolds in a particular pattern over time. Third, it combines visual, vocal, and physiological signals to create a system that continues to assess pain even when one or more signals become temporarily unavailable. The presented system has four main components. The first three components consist of novel algorithms for analyzing the visual, vocal, and physiological signals separately. The last component combines all the three signals to create a multimodal pain assessment system. The performance of the system in recognizing pain events is comparable to that of trained nurses; hence, it demonstrates the feasibility of automatic pain assessment in typical neonatal care environments.
12

Baum, Karl G. "Multimodal breast imaging : registration, visualization, and image synthesis /". Online version of thesis, 2008. http://hdl.handle.net/1850/7063.

13

Panchev, Christo. "Spatio-temporal and multimodal processing in a spiking neural mind of a robot". Thesis, University of Sunderland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.420478.

14

Sanders, Teresa H. "Multimodal assessment of Parkinson's disease using electrophysiology and automated motor scoring". Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/51970.

Abstract:
A suite of signal processing algorithms designed for extracting information from brain electrophysiology and movement signals, along with new insights gained by applying these tools to understanding parkinsonism, was presented in this dissertation. The approach taken does not assume any particular stimulus, underlying activity, or synchronizing event, nor does it assume any particular encoding scheme. Instead, novel signal processing applications of complex continuous wavelet transforms, cross-frequency coupling, feature selection, and canonical correlation were developed to discover the most significant electrophysiologic changes in the basal ganglia and cortex of parkinsonian rhesus monkeys and how these changes are related to the motor signs of parkinsonism. The resulting algorithms effectively characterize the severity of parkinsonism and, when combined with motor signal decoding algorithms, allow technology-assisted multi-modal grading of the primary pathological signs. Based on these results, parallel data collection algorithms were implemented in real-time embedded software and off-the-shelf hardware to develop a new system to facilitate monitoring of the severity of Parkinson's disease signs and symptoms in human patients. Off-line analysis of data collected with the system was subsequently shown to allow discrimination between normal and simulated parkinsonian conditions. The main contributions of the work were in three areas: 1) evidence of the importance of optimally selecting multiple, non-redundant features for understanding neural information, 2) discovery of significant correlations between certain pathological motor signs and brain electrophysiology in different brain regions, and 3) implementation and human subject testing of multi-modal monitoring technology.
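One common way to quantify the cross-frequency coupling mentioned in this abstract is a mean-vector-length modulation index between a low-frequency phase and a high-frequency amplitude envelope. The frequency bands, filter order, and synthetic signal below are illustrative assumptions, not the dissertation's exact settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def phase_amplitude_coupling(x, fs, phase_band=(13, 30), amp_band=(50, 150)):
    """Mean-vector-length index coupling beta-band phase to high-gamma amplitude."""
    phase = np.angle(hilbert(bandpass(x, phase_band[0], phase_band[1], fs)))
    amp = np.abs(hilbert(bandpass(x, amp_band[0], amp_band[1], fs)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)

fs = 1000.0
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(2)
beta = np.sin(2 * np.pi * 20 * t)
# Toy LFP: 80 Hz activity whose amplitude follows the 20 Hz (beta) phase, plus noise.
lfp = beta + (1 + 0.5 * beta) * np.sin(2 * np.pi * 80 * t) + rng.normal(scale=0.5, size=t.size)
print(phase_amplitude_coupling(lfp, fs))   # larger when gamma amplitude tracks beta phase
```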
15

Whitehurst, Daniel Scott. "Techniques for Processing Airborne Imagery for Multimodal Crop Health Monitoring and Early Insect Detection". Thesis, Virginia Tech, 2016. http://hdl.handle.net/10919/73048.

Abstract:
During their growth, crops may experience a variety of health issues, which often lead to a reduction in crop yield. In order to avoid financial loss and sustain crop survival, it is imperative for farmers to detect and treat crop health issues. Interest in the use of unmanned aerial vehicles (UAVs) for precision agriculture has continued to grow as the cost of these platforms and sensing payloads has decreased. The increase in availability of this technology may enable farmers to scout their fields and react to issues more quickly and inexpensively than current satellite and other airborne methods. In the work of this thesis, methods have been developed for applications of UAV remote sensing using visible spectrum and multispectral imagery. An algorithm has been developed to work on a server for the remote processing of images acquired of a crop field with a UAV. This algorithm first enhances the images to adjust the contrast and then classifies areas of the image based upon the vigor and greenness of the crop. The classification is performed using a support vector machine with a Gaussian kernel, which achieved a classification accuracy of 86.4%. Additionally, an analysis of multispectral imagery was performed to determine indices which correlate with the health of corn crops. Through this process, a method for correcting hyperspectral images for lighting issues was developed. The Normalized Difference Vegetation Index values did not show a significant correlation with the health, but several indices were created from the hyperspectral data. Optimal correlation was achieved by using the reflectance values for 740 nm and 760 nm wavelengths, which produced a correlation coefficient of 0.84 with the yield of corn. In addition to this, two algorithms were created to detect stink bugs on crops with aerial visible spectrum images. The first method used a superpixel segmentation approach and achieved a recognition rate of 93.9%, although the processing time was high. The second method used an approach based upon texture and color and achieved a recognition rate of 95.2% while improving upon the processing speed of the first method. While both methods achieved similar accuracy, the superpixel approach allows for detection from higher altitudes, but this comes at the cost of extra processing time.
Master of Science
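The vegetation indices discussed in the abstract above reduce to simple band arithmetic on reflectance images. The sketch below shows NDVI and one plausible normalized-difference form of a 740/760 nm index; the exact band combination used in the thesis is not specified here, so treat the second function as a hypothetical example.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Normalized Difference Vegetation Index from NIR and red reflectance bands."""
    return (nir - red) / (nir + red + eps)

def red_edge_index(r760: np.ndarray, r740: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Hypothetical normalized difference of the 760 nm and 740 nm reflectance bands."""
    return (r760 - r740) / (r760 + r740 + eps)

# Reflectance values in [0, 1]; in practice these come from calibrated imagery.
rng = np.random.default_rng(3)
nir, red = rng.uniform(0.3, 0.6, (64, 64)), rng.uniform(0.05, 0.15, (64, 64))
print(ndvi(nir, red).mean())
```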
16

Lizarraga, Gabriel M. "A Neuroimaging Web Interface for Data Acquisition, Processing and Visualization of Multimodal Brain Images". FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3855.

Abstract:
Structural and functional brain images are generated as essential modalities for medical experts to learn about the different functions of the brain. These images are typically visually inspected by experts. Many software packages are available to process medical images, but they are complex and difficult to use. The software packages are also hardware intensive. As a consequence, this dissertation proposes a novel Neuroimaging Web Services Interface (NWSI) as a series of processing pipelines for a common platform to store, process, visualize and share data. The NWSI system is made up of password-protected interconnected servers accessible through a web interface. The web-interface driving the NWSI is based on Drupal, a popular open source content management system. Drupal provides a user-based platform, in which the core code for the security and design tools are updated and patched frequently. New features can be added via modules, while maintaining the core software secure and intact. The webserver architecture allows for the visualization of results and the downloading of tabulated data. Several forms are available to capture clinical data. The processing pipeline starts with a FreeSurfer (FS) reconstruction of T1-weighted MRI images. Subsequently, PET, DTI, and fMRI images can be uploaded. The Webserver captures uploaded images and performs essential functionalities, while processing occurs in supporting servers. The computational platform is responsive and scalable. The current pipeline for PET processing calculates all regional Standardized Uptake Value ratios (SUVRs). The FS and SUVR calculations have been validated using Alzheimer's Disease Neuroimaging Initiative (ADNI) results posted at Laboratory of Neuro Imaging (LONI). The NWSI system provides access to a calibration process through the centiloid scale, consolidating Florbetapir and Florbetaben tracers in amyloid PET images. The interface also offers onsite access to machine learning algorithms, and introduces new heat maps that augment expert visual rating of PET images. NWSI has been piloted using data and expertise from Mount Sinai Medical Center, the 1Florida Alzheimer's Disease Research Center (ADRC), Baptist Health South Florida, Nicklaus Children's Hospital, and the University of Miami. All results were obtained using our processing servers in order to maintain data validity, consistency, and minimal processing bias.
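The regional SUVR computation in the PET pipeline is, at its core, a ratio of mean uptake in a target region to mean uptake in a reference region. The sketch below assumes a labeled atlas volume and a cerebellar reference, which are illustrative choices rather than the exact NWSI configuration.

```python
import numpy as np

def regional_suvr(pet: np.ndarray, labels: np.ndarray, region_ids, reference_id) -> dict:
    """Standardized Uptake Value ratios: mean uptake per region / mean uptake in reference."""
    ref_mean = pet[labels == reference_id].mean()
    return {rid: float(pet[labels == rid].mean() / ref_mean) for rid in region_ids}

# Toy volume: label 1 = a target cortical ROI, label 99 = reference region (e.g., cerebellum).
rng = np.random.default_rng(4)
pet = rng.uniform(0.5, 2.0, size=(32, 32, 32))
labels = np.zeros_like(pet, dtype=int)
labels[:16] = 1
labels[16:] = 99
print(regional_suvr(pet, labels, region_ids=[1], reference_id=99))
```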
17

Alameda-Pineda, Xavier. "Egocentric Audio-Visual Scene Analysis : a machine learning and signal processing approach". Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM024/document.

Abstract:
Over the past two decades, the industry has developed several commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc). Others, such as Kinect, include depth sensors and/or small microphone arrays. Also, there are some mobile phones equipped with a stereo camera pair. At the same time, many research-oriented systems became available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors are close to each other. Therefore, they are not able to capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "Egocentric Audio-Visual Scene Analysis". This thesis contributes to this field in several aspects. Firstly, by providing a publicly available data set targeting applications such as action/gesture recognition, speaker localization, tracking and diarisation, sound source localization, dialogue modelling, etc. This work has since been used both within and outside the thesis. We also investigated the problem of audio-visual event detection. We showed how the trust in one of the modalities (the visual one, to be precise) can be modeled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). Afterwards we modified the approach to target audio-visual speaker detection, yielding an on-line method working in the humanoid robot NAO. In parallel to the work on audio-visual speaker detection, we developed a new approach for audio-visual command recognition. We explored different features and classifiers and confirmed that the use of audio-visual data increases the performance when compared to auditory-only and video-only classifiers. Later, we sought the best method using tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt and learn new commands from the user; such systems need to be operational with only a few examples for general public usage. Finally, we contributed to the field of sound source localization, in the particular case of non-coplanar microphone arrays. This is interesting because the geometry of the microphone array can be arbitrary. Consequently, this opens the door to dynamic microphone arrays that would adapt their geometry to better fit particular tasks. Also, the design of commercial products may be subject to certain constraints for which circular or linear arrays are not well suited.
18

Karvonen, Tuukka Matias. "Towards Visuocomputational Endoscopy: Visual Computing for Multimodal and Multi-Articulated Endoscopy". Kyoto University, 2017. http://hdl.handle.net/2433/227661.

19

Koubaroulis, D. A. "The multimodal neighbourhood signature for modelling object colour appearance and applications in computer vision". Thesis, University of Surrey, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.365142.

20

Ladwig, Stefan [Verfasser]. "About multimodal information processing and the relation of proximal and distal action effects / Stefan Ladwig". Aachen : Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2015. http://d-nb.info/1066812535/34.

21

Leatherday, Christopher. "Evaluation of recurrent glioma and Alzheimer’s disease using novel multimodal brain image processing and analysis". Thesis, Curtin University, 2016. http://hdl.handle.net/20.500.11937/2238.

Abstract:
Novel analysis techniques were applied to two different sets of multi-modality brain images. Localised metabolic rate within the hippocampus was assessed for its ability to differentiate between groups of healthy, mildly cognitively impaired, and Alzheimer’s disease brains, and an investigation of its potential clinical diagnostic utility was conducted. Relative uptake and retention of two PET tracers (11Carbon Methionine and 18Fluoro Thymidine) in a post-treatment glioma patient cohort was utilized to perform survival prediction analysis.
22

Appelstål, Michael. "Multimodal Model for Construction Site Aversion Classification". Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-421011.

Abstract:
Aversion on construction sites can be everything from missing material, fire hazards, or insufficient cleaning. These aversions appear very often on construction sites and the construction company needs to report and take care of them in order for the site to run correctly. The reports consist of an image of the aversion and a text describing the aversion. Report categorization is currently done manually which is both time and cost-ineffective. The task for this thesis was to implement and evaluate an automatic multimodal machine learning classifier for the reported aversions that utilized both the image and text data from the reports. The model presented is a late-fusion model consisting of a Swedish BERT text classifier and a VGG16 for image classification. The results showed that an automated classifier is feasible for this task and could be used in real life to make the classification task more time and cost-efficient. The model scored a 66.2% accuracy and 89.7% top-5 accuracy on the task and the experiments revealed some areas of improvement on the data and model that could be further explored to potentially improve the performance.
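The late-fusion step described above can be reduced to combining the class-probability vectors produced by the two unimodal models. The weighting and the toy probabilities below are assumptions; the thesis's Swedish BERT and VGG16 branches are only referenced here, not reimplemented.

```python
import numpy as np

def late_fusion(text_probs: np.ndarray, image_probs: np.ndarray, w_text: float = 0.5) -> int:
    """Weighted average of per-modality class probabilities, then argmax."""
    fused = w_text * text_probs + (1.0 - w_text) * image_probs
    return int(np.argmax(fused))

# Toy outputs for a hypothetical 4-class aversion taxonomy
# (e.g., missing material, fire hazard, insufficient cleaning, other).
text_probs = np.array([0.10, 0.60, 0.20, 0.10])    # from the text classifier
image_probs = np.array([0.05, 0.30, 0.55, 0.10])   # from the image classifier
print(late_fusion(text_probs, image_probs, w_text=0.6))
```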
23

Delecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents". Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634/document.

Abstract:
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents; however, the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a task of linguistic analysis on corpora of images with captions. In the first part of this study, we are interested in the analysis of audiovisual documents from news television channels. We propose an approach that uses, in particular, deep neural networks for the representation and fusion of modalities. In the second part of this thesis, we are interested in approaches that allow several sources of multimodal information to be used for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
24

Jaime, Mark. "The Role of Temporal Synchrony in the Facilitation of Perceptual Learning during Prenatal Development". FIU Digital Commons, 2007. http://digitalcommons.fiu.edu/etd/58.

Abstract:
This study explored the critical features of temporal synchrony for the facilitation of prenatal perceptual learning with respect to unimodal stimulation using an animal model, the bobwhite quail. The following related hypotheses were examined: (1) the availability of temporal synchrony is a critical feature to facilitate prenatal perceptual learning, (2) a single temporally synchronous note is sufficient to facilitate prenatal perceptual learning, with respect to unimodal stimulation, and (3) in situations where embryos are exposed to a single temporally synchronous note, facilitated perceptual learning, with respect to unimodal stimulation, will be optimal when the temporally synchronous note occurs at the onset of the stimulation bout. To assess these hypotheses, two experiments were conducted in which quail embryos were exposed to various audio-visual configurations of a bobwhite maternal call and tested at 24 hr after hatching for evidence of facilitated prenatal perceptual learning with respect to unimodal stimulation. Experiment 1 explored if intermodal equivalence was sufficient to facilitate prenatal perceptual learning with respect to unimodal stimulation. A Bimodal Sequential Temporal Equivalence (BSTE) condition was created that provided embryos with sequential auditory and visual stimulation in which the same amodal properties (rate, duration, rhythm) were made available across modalities. Experiment 2 assessed: (a) whether a limited number of temporally synchronous notes are sufficient for facilitated prenatal perceptual learning with respect to unimodal stimulation, and (b) whether there is a relationship between timing of occurrence of a temporally synchronous note and the facilitation of prenatal perceptual learning. Results revealed that prenatal exposure to BSTE was not sufficient to facilitate perceptual learning. In contrast, a maternal call that contained a single temporally synchronous note was sufficient to facilitate embryos’ prenatal perceptual learning with respect to unimodal stimulation. Furthermore, the most salient prenatal condition was that which contained the synchronous note at the onset of the call burst. Embryos’ prenatal perceptual learning of the call was four times faster in this condition than when exposed to a unimodal call. Taken together, bobwhite quail embryos’ remarkable sensitivity to temporal synchrony suggests that this amodal property plays a key role in attention and learning during prenatal development.
25

Caixeta, Fabio Viegas. "Atividade multimodal no córtex sensorial primário de ratos". Universidade Federal do Rio Grande do Norte, 2010. http://repositorio.ufrn.br:8080/jspui/handle/123456789/17290.

Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The currently accepted model of sensory processing states that different senses are processed in parallel, and that the activity of specific cortical regions defines the sensory modality perceived by the subject. In this work we used chronic multielectrode extracellular recordings to investigate to which extent neurons in the visual and tactile primary cortices (V1 and S1) of anesthetized rats would respond to sensory modalities not traditionally associated with these cortices. Visual stimulation yielded 87% of responsive neurons in V1, while 82% of S1 neurons responded to tactile stimulation. In the same stimulation sessions, we found 23% of V1 neurons responding to tactile stimuli and 22% of S1 neurons responding to visual stimuli. Our data support an increasing body of evidence that indicates the existence of multimodal processing in primary sensory cortices. Our data challenge the unimodal sensory processing paradigm, and suggest the need for a reinterpretation of the currently accepted model of cortical hierarchy.
26

Padula, Claudia B. "The Functional and Structural Neural Connectivity of Affective Processing in Alcohol Dependence: A Multimodal Imaging Study". University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1377869730.

27

Poria, Soujanya. "Novel symbolic and machine-learning approaches for text-based and multimodal sentiment analysis". Thesis, University of Stirling, 2017. http://hdl.handle.net/1893/25396.

Abstract:
Emotions and sentiments play a crucial role in our everyday lives. They aid decision-making, learning, communication, and situation awareness in human-centric environments. Over the past two decades, researchers in artificial intelligence have been attempting to endow machines with cognitive capabilities to recognize, infer, interpret and express emotions and sentiments. All such efforts can be attributed to affective computing, an interdisciplinary field spanning computer science, psychology, social sciences and cognitive science. Sentiment analysis and emotion recognition have also become a new trend in social media, avidly helping users understand the opinions being expressed on different platforms on the web. In this thesis, we focus on developing novel methods for text-based sentiment analysis. As an application of the developed methods, we employ them to improve multimodal polarity detection and emotion recognition. Specifically, we develop innovative text and visual-based sentiment-analysis engines and use them to improve the performance of multimodal sentiment analysis. We begin by discussing challenges involved in both text-based and multimodal sentiment analysis. Next, we present a number of novel techniques to address these challenges. In particular, in the context of concept-based sentiment analysis, a paradigm gaining increasing interest recently, it is important to identify concepts in text; accordingly, we design a syntax-based concept-extraction engine. We then exploit the extracted concepts to develop a concept-based affective vector space which we term EmoSenticSpace. We then use this for deep learning-based sentiment analysis, in combination with our novel linguistic pattern-based affective reasoning method termed sentiment flow. Finally, we integrate all our text-based techniques and combine them with a novel deep learning-based visual feature extractor for multimodal sentiment analysis and emotion recognition. Comparative experimental results using a range of benchmark datasets have demonstrated the effectiveness of the proposed approach.
28

Ouenniche, Kaouther. "Multimodal deep learning for audiovisual production". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS020.

Abstract:
Dans le contexte en constante évolution du contenu audiovisuel, la nécessité cruciale d'automatiser l'indexation et l'organisation des archives s'est imposée comme un objectif primordial. En réponse, cette recherche explore l'utilisation de techniques d'apprentissage profond pour automatiser l'extraction de métadonnées diverses dans les archives, améliorant ainsi leur accessibilité et leur réutilisation. La première contribution de cette recherche concerne la classification des mouvements de caméra. Il s'agit d'un aspect crucial de l'indexation du contenu, car il permet une catégorisation efficace et une récupération du contenu vidéo en fonction de la dynamique visuelle qu'il présente. L'approche proposée utilise des réseaux neuronaux convolutionnels 3D avec des blocs résiduels. Une approche semi-automatique pour la construction d'un ensemble de données fiable sur les mouvements de caméra à partir de vidéos disponibles au public est également présentée, réduisant au minimum le besoin d'intervention manuelle. De plus, la création d'un ensemble de données d'évaluation exigeant, comprenant des vidéos de la vie réelle tournées avec des caméras professionnelles à différentes résolutions, met en évidence la robustesse et la capacité de généralisation de la technique proposée, atteignant un taux de précision moyen de 94 %.La deuxième contribution se concentre sur la tâche de Vidéo Question Answering. Dans ce contexte, notre Framework intègre un Transformers léger et un module de cross modalité. Ce module utilise une corrélation croisée pour permettre un apprentissage réciproque entre les caractéristiques visuelles conditionnées par le texte et les caractéristiques textuelles conditionnées par la vidéo. De plus, un scénario de test adversarial avec des questions reformulées met en évidence la robustesse du modèle et son applicabilité dans le monde réel. Les résultats expérimentaux sur MSVD-QA et MSRVTT-QA, valident la méthodologie proposée, avec une précision moyenne de 45 % et 42 % respectivement. La troisième contribution de cette recherche aborde le problème de vidéo captioning. Le travail introduit intègre un module de modality attention qui capture les relations complexes entre les données visuelles et textuelles à l'aide d'une corrélation croisée. De plus, l'intégration de l'attention temporelle améliore la capacité du modèle à produire des légendes significatives en tenant compte de la dynamique temporelle du contenu vidéo. Notre travail intègre également une tâche auxiliaire utilisant une fonction de perte contrastive, ce qui favorise la généralisation du modèle et une compréhension plus approfondie des relations intermodales et des sémantiques sous-jacentes. L'utilisation d'une architecture de transformer pour l'encodage et le décodage améliore considérablement la capacité du modèle à capturer les interdépendances entre les données textuelles et vidéo. La recherche valide la méthodologie proposée par une évaluation rigoureuse sur MSRVTT, atteignant des scores BLEU4, ROUGE et METEOR de 0,4408, 0,6291 et 0,3082 respectivement. Notre approche surpasse les méthodes de l'état de l'art, avec des gains de performance allant de 1,21 % à 1,52 % pour les trois métriques considérées. En conclusion, ce manuscrit offre une exploration holistique des techniques basées sur l'apprentissage profond pour automatiser l'indexation du contenu télévisuel, en abordant la nature laborieuse et chronophage de l'indexation manuelle. 
Les contributions englobent la classification des types de mouvements de caméra, la vidéo question answering et la vidéo captioning, faisant avancer collectivement l'état de l'art et fournissant des informations précieuses pour les chercheurs dans le domaine. Ces découvertes ont non seulement des applications pratiques pour la recherche et l'indexation de contenu, mais contribuent également à l'avancement plus large des méthodologies d'apprentissage profond dans le contexte multimodal
Within the dynamic landscape of television content, the critical need to automate the indexing and organization of archives has emerged as a paramount objective. In response, this research explores the use of deep learning techniques to automate the extraction of diverse metadata from television archives, improving their accessibility and reuse. The first contribution of this research revolves around the classification of camera motion types. This is a crucial aspect of content indexing as it allows for efficient categorization and retrieval of video content based on the visual dynamics it exhibits. The novel approach proposed employs 3D convolutional neural networks with residual blocks, a technique inspired by action recognition methods. A semi-automatic approach for constructing a reliable camera motion dataset from publicly available videos is also presented, minimizing the need for manual intervention. Additionally, the creation of a challenging evaluation dataset, comprising real-life videos shot with professional cameras at varying resolutions, underlines the robustness and generalization power of the proposed technique, achieving an average accuracy rate of 94%. The second contribution centers on the demanding task of Video Question Answering. In this context, we explore the effectiveness of attention-based transformers for facilitating grounded multimodal learning. The challenge here lies in bridging the gap between the visual and textual modalities and mitigating the quadratic complexity of transformer models. To address these issues, a novel framework is introduced, which incorporates a lightweight transformer and a cross-modality module. This module leverages cross-correlation to enable reciprocal learning between text-conditioned visual features and video-conditioned textual features. Furthermore, an adversarial testing scenario with rephrased questions highlights the model's robustness and real-world applicability. Experimental results on benchmark datasets, such as MSVD-QA and MSRVTT-QA, validate the proposed methodology, with an average accuracy of 45% and 42%, respectively, which represents notable improvements over existing approaches. The third contribution of this research addresses the multimodal video captioning problem, a critical aspect of content indexing. The introduced framework incorporates a modality-attention module that captures the intricate relationships between visual and textual data using cross-correlation. Moreover, the integration of temporal attention enhances the model's ability to produce meaningful captions, considering the temporal dynamics of video content. Our work also incorporates an auxiliary task employing a contrastive loss function, which promotes model generalization and a deeper understanding of inter-modal relationships and underlying semantics. The utilization of a transformer architecture for encoding and decoding significantly enhances the model's capacity to capture interdependencies between text and video data. The research validates the proposed methodology through rigorous evaluation on the MSRVTT benchmark, achieving BLEU4, ROUGE, and METEOR scores of 0.4408, 0.6291, and 0.3082, respectively.
In comparison to state-of-the-art methods, this approach consistently outperforms them, with performance gains ranging from 1.21% to 1.52% across the three metrics considered. In conclusion, this manuscript offers a holistic exploration of deep learning-based techniques to automate television content indexing, addressing the labor-intensive and time-consuming nature of manual indexing. The contributions encompass camera motion type classification, VideoQA, and multimodal video captioning, collectively advancing the state of the art and providing valuable insights for researchers in the field. These findings not only have practical applications for content retrieval and indexing but also contribute to the broader advancement of deep learning methodologies in the multimodal context.
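The cross-modality module is only summarized in the abstract above, so the following is a minimal, hypothetical PyTorch sketch of the general idea of reciprocal learning between text-conditioned visual features and video-conditioned textual features via cross-correlation; the class name, dimensions and residual connections are assumptions, not the thesis' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityBlock(nn.Module):
    """Hypothetical cross-correlation block: each modality attends over the other."""
    def __init__(self, dim):
        super().__init__()
        self.proj_video = nn.Linear(dim, dim)  # projection applied to frame features
        self.proj_text = nn.Linear(dim, dim)   # projection applied to token features

    def forward(self, video, text):
        # video: (B, T, dim) frame features; text: (B, L, dim) question-token features
        corr = torch.bmm(self.proj_video(video), self.proj_text(text).transpose(1, 2))  # (B, T, L)
        video_cond_text = torch.bmm(F.softmax(corr, dim=2), text)                       # text-conditioned video
        text_cond_video = torch.bmm(F.softmax(corr, dim=1).transpose(1, 2), video)      # video-conditioned text
        return video + video_cond_text, text + text_cond_video

# Shape check on random features only.
block = CrossModalityBlock(dim=256)
v, t = torch.randn(2, 16, 256), torch.randn(2, 12, 256)
v_out, t_out = block(v, t)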
29

Harris, Matthew Joshua. "Accelerating Reverse Engineering Image Processing Using FPGA". Wright State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright155535529307322.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
30

Gimenes, Gabriel Perri. "Advanced techniques for graph analysis: a multimodal approach over planetary-scale data". Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-26062015-105026/.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Applications such as electronic commerce, computer networks, social networks, and biology (protein interaction), to name a few, have led to the production of graph-like data at planetary scale, possibly with millions of nodes and billions of edges. These applications pose challenging problems when the task is to use their data to support decision-making processes by means of non-obvious and potentially useful patterns. In order to process such data for pattern discovery, researchers and practitioners have used distributed processing resources organized in computational clusters. However, building and managing such clusters can be complex, bringing technical and financial issues that can be prohibitive in a variety of scenarios. Alternatively, it is desirable to process large-scale graphs using only one computational node. To do so, we developed processes and algorithms following three different approaches, building toward an analytical toolset capable of revealing patterns, supporting comprehension, and helping with decision-making over planetary-scale graphs.
Aplicações como comércio eletrônico, redes de computadores, redes sociais e biologia (interação proteica), entre outras, levaram à produção de dados que podem ser representados como grafos em escala planetária, podendo possuir milhões de nós e bilhões de arestas. Tais aplicações apresentam problemas desafiadores quando a tarefa consiste em usar as informações contidas nos grafos para auxiliar processos de tomada de decisão através da descoberta de padrões não triviais e potencialmente úteis. Para processar esses grafos em busca de padrões, tanto pesquisadores como a indústria têm usado recursos de processamento distribuído organizados em clusters computacionais. Entretanto, a construção e manutenção desses clusters pode ser complexa, trazendo tanto problemas técnicos como financeiros que podem ser proibitivos em diversos casos. Por isso, torna-se desejável a capacidade de se processar grafos em larga escala usando somente um nó computacional. Para isso, foram desenvolvidos processos e algoritmos seguindo três abordagens diferentes, visando a definição de um arcabouço de análise capaz de revelar padrões, facilitar a compreensão e auxiliar na tomada de decisão sobre grafos em escala planetária.
31

Clark, Rebecca A. "Multimodal flavour perception : the impact of sweetness, bitterness, alcohol content and carbonation level on flavour perception". Thesis, University of Nottingham, 2011. http://eprints.nottingham.ac.uk/13432/.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Flavour perception of food and beverages is a complex multisensory experience involving the gustatory, olfactory, trigeminal, auditory and visual senses. Thus, investigations into multimodal flavour perception require a multidisciplinary design of experiments approach. This research has focussed on beer flavour perception and the fundamental interactions between the main flavour components - sweetness, bitterness (from hop acids), alcohol content and carbonation level. A model beer was developed using representative ingredients which could be manipulated to systematically vary the concentration of the main flavour components in beer and was used in the following experiments. Using a full factorial design, the physical effect of ethanol, CO2 and hop acid addition was determined by headspace analysis and in-nose expired breath (in-vivo) measurements. Results from headspace and in-vivo methods differed and highlighted the importance of in-vivo measures when correlating to sensory experience. Ethanol and CO2 significantly increased volatile partitioning during model beverage consumption. The effects of ethanol and CO2 appeared to be independent and therefore additive, which could account for up to an 86% increase in volatile partitioning. This would increase volatile delivery to the olfactory bulb and thus potentially enhance aroma and flavour perception. This was investigated using quantitative descriptive analysis. Results showed that CO2 significantly impacted all discriminating attributes, either directly or as a result of complex interactions with other design factors. CO2 suppressed the sweetness of dextrose and interacted with hop acids to modify bitterness and tingly perception. Ethanol was the main driver of complexity of flavour and enhanced sweet perception. In a first study of its kind, the impact of CO2 on gustatory perception was further investigated using functional magnetic resonance imaging (fMRI) to understand cortical response. In addition, classification of subjects into PROP taster status groups and thermal taster status groups was carried out. Groups were tested for their sensitivity to oral stimuli using sensory techniques and, for the first time, cortical response to taste and CO2 was investigated between groups using fMRI techniques and behavioural data. There was no correlation between PROP taster status and thermal taster status. PROP taster status groups varied in their cortical response to stimuli, with PROP super-tasters showing significantly higher cortical activation to samples than PROP non-tasters. The mechanism for thermal taster status is not currently known, but thermal tasters were found to have higher cortical activation in response to the samples. The difference in cortical activation between thermal taster groups was supported by behavioural data, as thermal tasters least preferred the high-CO2 sample but were better able to discriminate it than thermal non-tasters. This research has provided an in-depth study of the importance of flavour components in beer. It advances the limited data available on the effects of CO2 on sensory perception in a carbonated beverage, providing sound data for the successful development of products with reduced ethanol or CO2 levels. The use of functional magnetic resonance imaging has revealed for the first time that oral CO2 significantly increases activation in the somatosensory cortex. However, CO2 seemed to have a limited impact on activation strength in 'taste' areas, such as the anterior insula.
Research comparing data from PROP taster status groups and thermal taster status groups has given insight into the possible mechanisms accounting for differences in oral intensity of stimuli.
32

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide for an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques of normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
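The synchronous HMM discussed above scores time-aligned acoustic and visual observations inside each state. The thesis details its own training and decoding techniques; purely as an illustrative sketch of the usual multi-stream state score (a stream-weighted sum of per-stream log-likelihoods), assuming toy scorers, feature sizes and weights that are not the thesis' models:

import numpy as np

def shmm_state_log_likelihood(audio_obs, visual_obs, audio_loglik, visual_loglik,
                              audio_weight=0.7, visual_weight=0.3):
    # Stream-weighted combination of per-state acoustic and visual log-likelihoods,
    # the standard way a synchronous/multi-stream HMM state scores a time-aligned pair.
    return audio_weight * audio_loglik(audio_obs) + visual_weight * visual_loglik(visual_obs)

# Toy scorers standing in for trained per-state models (spherical Gaussians here).
audio_score = lambda x: -0.5 * float(np.sum((x - 1.0) ** 2))
visual_score = lambda x: -0.5 * float(np.sum((x + 1.0) ** 2))
print(shmm_state_log_likelihood(np.zeros(39), np.zeros(20), audio_score, visual_score))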
33

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Both human perceptual studies and automaticmachine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer-interfaces involve the recognition of the speech events themselves, and the recognition of speaker's identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide for an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted within to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques of normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing wellperforming acoustic hidden Markovmodels (HMMs). This technique is demonstrated to provide improved audio-visualmodelling over the jointly-trained SHMMapproach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach will be shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
34

Delecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents". Electronic Thesis or Diss., Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Les mécanismes de compréhension chez l'être humain sont par essence multimodaux. Comprendre le monde qui l'entoure revient chez l'être humain à fusionner l'information issue de l'ensemble de ses récepteurs sensoriels. La plupart des documents utilisés en traitement automatique de l'information sont multimodaux. Par exemple, du texte et des images dans des documents textuels ou des images et du son dans des documents vidéo. Cependant, les traitements qui leur sont appliqués sont le plus souvent monomodaux. Le but de cette thèse est de proposer des traitements joints s'appliquant principalement au texte et à l'image pour le traitement de documents multimodaux à travers deux études : l'une portant sur la fusion multimodale pour la reconnaissance du rôle du locuteur dans des émissions télévisuelles, l'autre portant sur la complémentarité des modalités pour une tâche d'analyse linguistique sur des corpus d'images avec légendes. Pour la première étude nous nous intéressons à l'analyse de documents audiovisuels provenant de chaînes d'information télévisuelle. Nous proposons une approche utilisant des réseaux de neurones profonds pour la création d'une représentation jointe multimodale pour les représentations et la fusion des modalités. Dans la seconde partie de cette thèse nous nous intéressons aux approches permettant d'utiliser plusieurs sources d'informations multimodales pour une tâche monomodale de traitement automatique du langage, afin d'étudier leur complémentarité. Nous proposons un système complet de correction de rattachements prépositionnels utilisant de l'information visuelle, entraîné sur un corpus multimodal d'images avec légendes
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents; however, the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of images with captions. In the first part of this work, we are interested in the analysis of audiovisual documents from news television channels. We propose an approach that uses deep neural networks for the representation and fusion of modalities. In the second part of this thesis, we are interested in approaches allowing the use of several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for the correction of prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
35

Warraich, Daud Sana Mechanical &amp Manufacturing Engineering Faculty of Engineering UNSW. "Ultrasonic stochastic localization of hidden discontinuities in composites using multimodal probability beliefs". Publisher:University of New South Wales. Mechanical & Manufacturing Engineering, 2009. http://handle.unsw.edu.au/1959.4/43719.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
This thesis presents a technique used to stochastically estimate the location of hidden discontinuities in carbon fiber composite materials. Composites pose a challenge to signal processing because speckle noise, resulting from reflections from impregnated laminas, masks useful information and impedes detection of hidden discontinuities. Although digital signal processing techniques have been exploited to lessen speckle noise and help localize discontinuities, uncertainty in ultrasonic wave propagation and broadband-frequency-based inspections of composites still make it a difficult task. The technique proposed in this thesis estimates the location of hidden discontinuities stochastically in one and two dimensions based on statistical data from A-Scans and C-Scans. Multiple experiments have been performed on carbon fiber reinforced plastics including artificial delaminations and porosity at different depths in the thickness of the material. A probabilistic approach, which precisely localizes discontinuities in high- and low-amplitude signals, has been used to present this method. Compared to conventional techniques, the proposed technique offers a more reliable package, with the ability to detect discontinuities in signals with lower intensities by utilizing the repetitive amplitudes in multiple sensor observations obtained from one-dimensional A-Scan or two-dimensional C-Scan data sets. The thesis presents the methodology encompassing the proposed technique and the implementation of a system to process real ultrasonic signals and images for effective discontinuity detection and localization.
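The abstract does not spell out the exact belief-update rule, so the following is only a generic sketch of the underlying idea: converting repeated A-scan amplitudes at each depth sample into detection beliefs and fusing them multiplicatively across observations. The noise model, thresholds and function names are assumptions, not the thesis' method.

import numpy as np

def localize_from_ascans(ascans, noise_std):
    # ascans: (n_scans, n_samples) rectified A-scan amplitudes at the same lateral position.
    z = np.abs(ascans) / noise_std                           # amplitude in units of the noise level
    per_scan = 1.0 - np.exp(-0.5 * z ** 2)                   # crude per-sample detection belief in [0, 1)
    log_belief = np.sum(np.log(per_scan + 1e-12), axis=0)    # fuse scans (independence assumption)
    belief = np.exp(log_belief - log_belief.max())
    belief /= belief.sum()                                   # normalized belief profile over depth samples
    return int(np.argmax(belief)), belief

# Toy example: 8 noisy scans with a simulated echo from a hidden flaw at sample 120.
scans = np.random.normal(0.0, 0.1, (8, 200))
scans[:, 120] += 1.0
depth_index, belief = localize_from_ascans(scans, noise_std=0.1)
print("estimated depth sample:", depth_index)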
36

Toulouse, Tom. "Estimation par stéréovision multimodale de caractéristiques géométriques d’un feu de végétation en propagation". Thesis, Corte, 2015. http://www.theses.fr/2015CORT0009/document.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Les travaux menés dans cette thèse concernent le développement d'un dispositif de vision permettant l'estimation de caractéristiques géométriques d'un feu de végétation en propagation. Ce dispositif est composé de plusieurs systèmes de stéréovision multimodaux générant des paires d'images stéréoscopiques à partir desquelles des points tridimensionnels sont calculés et les caractéristiques géométriques de feu telles que sa position, vitesse, hauteur, profondeur, inclinaison, surface et volume sont estimées. La première contribution importante de cette thèse est la détection de pixels de feu de végétation. Tous les algorithmes de détection de pixels de feu de la littérature ainsi que ceux développés dans le cadre de cette thèse ont été évalués sur une base de 500 images de feux de végétation acquises dans le domaine du visible et caractérisées en fonction des propriétés du feu dans l'image (couleur, fumée, luminosité). Cinq algorithmes de détection de pixels de feu de végétation basés sur la fusion de données issues d'images acquises dans le domaine du visible et du proche-infrarouge ont également été développés et évalués sur une autre base de données composée de 100 images multimodales caractérisées. La deuxième contribution importante de cette thèse concerne l'utilisation de méthodes de fusion d'images pour l'optimisation des points appariés entre les images multimodales stéréoscopiques. La troisième contribution importante de cette thèse est l'estimation des caractéristiques géométriques de feu à partir de points tridimensionnels obtenus depuis plusieurs paires d'images stéréoscopiques et recalés à l'aide de relevés GPS et d'inclinaison de tous les dispositifs de vision. Le dispositif d'estimation de caractéristiques géométriques à partir de systèmes de stéréovision a été évalué sur des objets rigides de dimensions connues et a permis d'obtenir les informations souhaitées avec une bonne précision. Les résultats des données obtenues pour des feux de végétation en propagation sont aussi présentés
This thesis presents the geometrical characteristics measurement of spreading vegetation fires with multimodal stereovision systems. Image processing and 3D registration are used in order to obtain a three-dimensional modeling of the fire at each instant of image acquisition and then to compute fire front characteristics such as its position, its rate of spread, its height, its width, its inclination, its surface and its volume. The first important contribution of this thesis is fire pixel detection. A benchmark of fire pixel detection algorithms from the literature and of those developed in this thesis has been carried out on a database of 500 vegetation fire images in the visible spectrum, which have been characterized according to the fire properties in the image (color, smoke, luminosity). Five fire pixel detection algorithms based on the fusion of data from visible and near-infrared spectrum images have also been developed and tested on another database of 100 multimodal images. The second important contribution of this thesis concerns the use of image fusion for optimizing the number of matched points between the multimodal stereo images. The third important contribution of this thesis is the registration method for the 3D fire points obtained with the stereovision systems. It uses information collected from a housing containing a GPS and an IMU card which is positioned on each stereovision system. With this registration, a method has been developed to extract the geometrical characteristics while the fire is spreading. The geometrical characteristics estimation device has been evaluated on a car of known dimensions, and the results obtained confirm the good accuracy of the device. The results obtained from vegetation fires are also presented
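The abstract names five visible/near-infrared fusion algorithms without detailing them, so the snippet below is only a generic illustration of the kind of pixel-level fusion rule involved: a visible-spectrum color test combined with a near-infrared intensity test. The specific rules and thresholds are assumptions, not the algorithms benchmarked in the thesis.

import numpy as np

def fused_fire_pixel_mask(rgb, nir, nir_threshold=0.6):
    # rgb: (H, W, 3) visible image in [0, 1]; nir: (H, W) near-infrared intensity in [0, 1].
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    visible_rule = (r > g) & (g > b) & (r > 0.5)   # fire pixels tend to be red-dominant and bright
    nir_rule = nir > nir_threshold                 # flames radiate strongly in the near-infrared
    return visible_rule & nir_rule

rgb = np.random.rand(240, 320, 3)
nir = np.random.rand(240, 320)
mask = fused_fire_pixel_mask(rgb, nir)
print(int(mask.sum()), "candidate fire pixels")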
37

Pérez-Rosas, Verónica. "Exploration of Visual, Acoustic, and Physiological Modalities to Complement Linguistic Representations for Sentiment Analysis". Thesis, University of North Texas, 2014. https://digital.library.unt.edu/ark:/67531/metadc699996/.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
This research is concerned with the identification of sentiment in multimodal content. This is of particular interest given the increasing presence of subjective multimodal content on the web and other sources, which contains a rich and vast source of people's opinions, feelings, and experiences. Despite the need for tools that can identify opinions in the presence of diverse modalities, most current methods for sentiment analysis are designed for textual data only, and few attempts have been made to address this problem. The dissertation investigates techniques for augmenting linguistic representations with acoustic, visual, and physiological features. The potential benefits of using these modalities include linguistic disambiguation, visual grounding, and the integration of information about people's internal states. The main goal of this work is to build computational resources and tools that allow sentiment analysis to be applied to multimodal data. This thesis makes three important contributions. First, it shows that modalities such as audio, video, and physiological data can be successfully used to improve existing linguistic representations for sentiment analysis. We present a method that integrates linguistic features with features extracted from these modalities. Features are derived from verbal statements, audiovisual recordings, thermal recordings, and physiological sensor signals. The resulting multimodal sentiment analysis system is shown to significantly outperform the use of language alone. Using this system, we were able to predict the sentiment expressed in video reviews and also the sentiment experienced by viewers while exposed to emotionally loaded content. Second, the thesis provides evidence of the portability of the developed strategies to other affect recognition problems. We provide support for this by studying the deception detection problem. Third, this thesis contributes several multimodal datasets that will enable further research in sentiment and deception detection.
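As a generic illustration of the feature-level integration described above (not the dissertation's actual feature sets, descriptors or classifier), one could concatenate per-segment descriptors from each modality and train a single classifier; the feature sizes and the linear SVM choice below are assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def early_fusion(linguistic, acoustic, visual, physiological):
    # Feature-level (early) fusion: one row per segment, all modality descriptors concatenated.
    return np.concatenate([linguistic, acoustic, visual, physiological], axis=1)

# Toy data: 100 segments with arbitrary per-modality feature sizes and binary sentiment labels.
rng = np.random.default_rng(0)
X = early_fusion(rng.normal(size=(100, 300)), rng.normal(size=(100, 40)),
                 rng.normal(size=(100, 20)), rng.normal(size=(100, 8)))
y = rng.integers(0, 2, size=100)
clf = make_pipeline(StandardScaler(), LinearSVC(dual=False)).fit(X, y)
print(clf.score(X, y))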
38

Bonazza, Pierre. "Système de sécurité biométrique multimodal par imagerie, dédié au contrôle d’accès". Thesis, Bourgogne Franche-Comté, 2019. http://www.theses.fr/2019UBFCK017/document.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Les travaux de recherche de cette thèse consistent à mettre en place des solutions performantes et légères permettant de répondre aux problèmes de sécurisation de produits sensibles. Motivé par une collaboration avec différents acteurs au sein du projet Nuc-Track, le développement d'un système de sécurité biométrique, possiblement multimodal, mènera à une étude sur différentes caractéristiques biométriques telles que le visage, les empreintes digitales et le réseau vasculaire. Cette thèse sera axée sur une adéquation algorithme et architecture, dans le but de minimiser la taille de stockage des modèles d'apprentissage tout en garantissant des performances optimales. Cela permettra leur stockage sur un support personnel, respectant ainsi les normes de vie privée
The research in this thesis consists of designing efficient and lightweight solutions to address the problem of securing sensitive products. Motivated by a collaboration with various stakeholders within the Nuc-Track project, the development of a biometric security system, possibly multimodal, will lead to a study of various biometric features such as the face, fingerprints and the vascular network. This thesis will focus on matching algorithms to architectures, with the aim of minimizing the storage size of the learning models while guaranteeing optimal performance. This will allow them to be stored on a personal device, thus complying with privacy standards
39

Mani, Gayathri. "Smells and multimodal learning: The role of congruency in the processing of olfactory, visual and verbal elements of product offerings". Diss., The University of Arizona, 1999. http://hdl.handle.net/10150/283973.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Smells are being included as key components of product offerings in an ever increasing number of product categories. However, this practice is guided only by intuitive beliefs that the addition of smells might lead to richer brand identities, help brand preference etc. This is because olfaction research in marketing is in its infancy while studies in branding have focused on strategies to extend a brand's existing equity rather than on issues relating to the initial formation of brand knowledge structures. Thus, there is little understanding of the processes that govern consumer learning of products that involve olfactory in addition to visual and verbal elements. This research examines the role of smells vs. visual/verbal elements in the encoding process of such multimodal brands. Our primary focus is on exploring the effects of congruency among the various elements on the derivation of olfactory associations and learning of the brand. Subjects in the study were exposed to fictitious brands of bath oils and asked to rate the appeal of each brand. Subjects examined the triads of brand elements (i.e., smells, colors and labels) in one of two sequences and the combinations that represented each brand differed based on various congruency conditions. Subjects then undertook a recognition task that was devised to test their learning of the associations between the brand elements. The results suggest that visual/verbal elements play a dominant role in shaping encoding of the product offering. Visual/verbal associations were learned quite easily, regardless of congruency. By contrast, associations between the odors and the labels or colors were learned more accurately when the relevant pair was congruent. Further, the labels and colors seemed to guide the learning of smells. Thus, when the smell was the sole incongruent element and the visual/verbal cues consistently pointed in a different direction, the odor was aligned with the other elements. Consequently, overall brand learning was contingent on the number of congruent cues that were present to assist in the derivation of olfactory associations. These findings provide guidelines to marketers faced with various branding decisions relating to product offerings that incorporate smells.
40

SIMONETTA, FEDERICO. "MUSIC INTERPRETATION ANALYSIS. A MULTIMODAL APPROACH TO SCORE-INFORMED RESYNTHESIS OF PIANO RECORDINGS". Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
This Thesis discusses the development of technologies for the automatic resynthesis of music recordings using digital synthesizers. First, the main issue is identified in the understanding of how Music Information Processing (MIP) methods can take into consideration the influence of the acoustic context on the music performance. For this, a novel conceptual and mathematical framework named “Music Interpretation Analysis” (MIA) is presented. In the proposed framework, a distinction is made between the “performance” – the physical action of playing – and the “interpretation” – the action that the performer wishes to achieve. Second, the Thesis describes further works aiming at the democratization of music production tools via automatic resynthesis: 1) it elaborates software and file formats for historical music archiving and multimodal machine-learning datasets; 2) it explores and extends MIP technologies; 3) it presents the mathematical foundations of the MIA framework and shows preliminary evaluations to demonstrate the effectiveness of the approach
41

Olsheski, Julia DeBlasio. "The role of synesthetic correspondence in intersensory binding: investigating an unrecognized confound in multimodal perception research". Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50215.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
The current program of research tests the following main hypotheses: 1) Synesthetic correspondence is an amodal property that serves to bind intersensory signals and manipulating this correspondence between pairs of audiovisual signals will affect performance on a temporal order judgment (TOJ) task; 2) Manipulating emphasis during a TOJ task from spatial to temporal aspects will strengthen the influence of task-irrelevant auditory signals; 3) The degree of dimensional overlap between audiovisual pairs will moderate the effect of synesthetic correspondence on the TOJ task; and 4) There are gaps in current perceptual theory due to the fact that synesthetic correspondence is a potential confound that has not been sufficiently considered in the design of perception research. The results support these main hypotheses. Finally, potential applications for the findings presented here are discussed.
42

Meseguer, Brocal Gabriel. "Multimodal analysis : informed content estimation and audio source separation". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS111.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Cette thèse propose l'étude de l'apprentissage multimodal dans le contexte de signaux musicaux. Tout au long de ce manuscrit, nous nous concentrerons sur l'interaction entre les signaux audio et les informations textuelles. Parmi les nombreuses sources de texte liées à la musique qui peuvent être utilisées (par exemple les critiques, les métadonnées ou les commentaires des réseaux sociaux), nous nous concentrerons sur les paroles. La voix chantée relie directement le signal audio et les informations textuelles d'une manière unique, combinant mélodie et paroles où une dimension linguistique complète l'abstraction des instruments de musique. Notre étude se focalise sur l'interaction audio et paroles pour cibler la séparation de sources et l'estimation de contenu informé. Les stimuli du monde réel sont produits par des phénomènes complexes et leur interaction constante dans divers domaines. Notre compréhension apprend des abstractions utiles qui fusionnent différentes modalités en une représentation conjointe. L'apprentissage multimodal décrit des méthodes qui analysent les phénomènes de différentes modalités et leur interaction afin de s'attaquer à des tâches complexes. Il en résulte des représentations meilleures et plus riches qui améliorent les performances des méthodes d'apprentissage automatique actuelles. Pour développer notre analyse multimodale, nous devons d'abord remédier au manque de données contenant une voix chantée avec des paroles alignées. Ces données sont obligatoires pour développer nos idées. Par conséquent, nous étudierons comment créer une telle base de données en exploitant automatiquement les ressources du World Wide Web. La création de ce type de base de données est un défi en soi qui soulève de nombreuses questions de recherche. Nous travaillons constamment avec le paradoxe classique de la « poule ou l'œuf » : l'acquisition et le nettoyage de ces données nécessitent des modèles précis, mais il est difficile de former des modèles sans données. Nous proposons d'utiliser le paradigme enseignant-élève pour développer une méthode où la création de bases de données et l'apprentissage de modèles ne sont pas considérés comme des tâches indépendantes mais plutôt comme des efforts complémentaires. Dans ce processus, les annotations karaoké non expertes décrivent les paroles comme une séquence de notes alignées dans le temps avec leurs informations textuelles associées. Nous lions ensuite chaque annotation à l'audio correspondant et alignons globalement les annotations sur celui-ci
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the audio and lyrics interaction for targeting source separation and informed content estimation. Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Our understanding learns useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods. To develop our multimodal analysis, we first need to address the lack of data containing singing voice with aligned lyrics. This data is mandatory to develop our ideas. Therefore, we investigate how to create such a dataset automatically by leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We are constantly working with the classic "chicken or the egg" problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We propose to use the teacher-student paradigm to develop a method where dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. In this process, non-expert karaoke time-aligned lyrics and notes describe the lyrics as a sequence of time-aligned notes with their associated textual information. We then link each annotation to the correct audio and globally align the annotations to it. For this purpose, we use the normalized cross-correlation between the voice annotation sequence and the singing voice probability vector, which is obtained automatically using a deep convolutional neural network. Using the collected data, we progressively improve that model. Every time we have an improved version, we can in turn correct and enhance the data
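The final alignment step mentioned above (matching karaoke annotations to the audio via normalized cross-correlation against a singing-voice probability vector) can be sketched as follows; the variable names, frame rates and the toy check are assumptions rather than the dissertation's code.

import numpy as np

def best_alignment_offset(annotation_activity, voice_probability):
    # annotation_activity: 0/1 per frame, 1 where the karaoke annotation marks singing.
    # voice_probability: per-frame singing-voice probability from a CNN, same frame rate.
    a = (annotation_activity - annotation_activity.mean()) / (annotation_activity.std() + 1e-9)
    v = (voice_probability - voice_probability.mean()) / (voice_probability.std() + 1e-9)
    corr = np.correlate(v, a, mode="full") / len(a)   # normalized cross-correlation over all lags
    lags = np.arange(-(len(a) - 1), len(v))
    return int(lags[np.argmax(corr)])                 # lag of the audio relative to the annotation

# Toy check: a 10-frame delay of the "audio" should be recovered (approximately, given the wrap-around).
annotation = (np.random.rand(500) > 0.5).astype(float)
voice_prob = np.roll(annotation, 10) + 0.05 * np.random.rand(500)
print(best_alignment_offset(annotation, voice_prob))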
43

Calumby, Rodrigo Tripodi 1985. "Recuperação multimodal de imagens com realimentação de relevância baseada em programação genética". [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275814.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Orientador: Ricardo da Silva Torres
Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Made available in DSpace on 2018-08-16T05:18:58Z (GMT). No. of bitstreams: 1 Calumby_RodrigoTripodi_M.pdf: 15749586 bytes, checksum: 2493b0b703adc1973eeabf7eb70ad21c (MD5) Previous issue date: 2010
Resumo: Este trabalho apresenta uma abordagem para recuperação multimodal de imagens com realimentação de relevância baseada em programação genética. Supõe-se que cada imagem da coleção possui informação textual associada (metadado, descrição textual, etc.), além de ter suas propriedades visuais (por exemplo, cor e textura) codificadas em vetores de características. A partir da informação obtida ao longo das iterações de realimentação de relevância, programação genética é utilizada para a criação de funções de combinação de medidas de similaridades eficazes. Com essas novas funções, valores de similaridades diversos são combinados em uma única medida, que mais adequadamente reflete as necessidades do usuário. As principais contribuições deste trabalho consistem na proposta e implementação de dois arcabouços. O primeiro, RFCore, é um arcabouço genérico para atividades de realimentação de relevância para manipulação de objetos digitais. O segundo, MMRFGP, é um arcabouço para recuperação de objetos digitais com realimentação de relevância baseada em programação genética, construído sobre o RFCore. O método proposto de recuperação multimodal de imagens foi validado sobre duas coleções de imagens, uma desenvolvida pela Universidade de Washington e outra da ImageCLEF Photographic Retrieval Task. A abordagem proposta mostrou melhores resultados para recuperação multimodal frente a utilização das modalidades isoladas. Além disso, foram obtidos resultados para recuperação visual e multimodal melhores do que as melhores submissões para a ImageCLEF Photographic Retrieval Task 2008
Abstract: This work presents an approach for multimodal content-based image retrieval with relevance feedback based on genetic programming. We assume that there is textual information (e.g., metadata, textual descriptions) associated with the collection images. Furthermore, image content properties (e.g., color and texture) are characterized by image descriptors. Given the information obtained over the relevance feedback iterations, genetic programming is used to create effective combination functions that combine the similarities associated with different features. Hence, using these new functions, the different similarities are combined into a single measure that more properly meets the user's needs. The main contribution of this work is the proposal and implementation of two frameworks. The first one, RFCore, is a generic framework for relevance feedback tasks over digital objects. The second one, MMRF-GP, is a framework for digital object retrieval with relevance feedback based on genetic programming, and it was built on top of RFCore. We have validated the proposed multimodal image retrieval approach on two datasets, one from the University of Washington and another from the ImageCLEF Photographic Retrieval Task. Our approach has yielded the best results for multimodal image retrieval when compared with single-modality approaches. Furthermore, it has achieved better results for visual and multimodal image retrieval than the best submissions to the ImageCLEF Photographic Retrieval Task 2008
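As a toy illustration of the idea of evolving a combination function over per-descriptor similarities from relevance feedback, the sketch below uses a simple random-expression search rather than the full genetic programming machinery of MMRF-GP; all names and the fitness definition are assumptions, not the dissertation's implementation.

import operator
import random
import numpy as np

OPS = [operator.add, operator.mul, max, min]

def random_tree(depth=2, n_inputs=4):
    # A random expression tree over similarity inputs s[0..n_inputs-1] (e.g. text, color, texture).
    if depth == 0:
        return ('leaf', random.randrange(n_inputs))
    return ('op', random.choice(OPS), random_tree(depth - 1, n_inputs), random_tree(depth - 1, n_inputs))

def evaluate(tree, sims):
    if tree[0] == 'leaf':
        return sims[tree[1]]
    _, op, left, right = tree
    return op(evaluate(left, sims), evaluate(right, sims))

def fitness(tree, feedback):
    # feedback: list of (similarities per candidate image, boolean relevance labels) pairs.
    score = 0.0
    for sims, relevant in feedback:
        combined = np.array([evaluate(tree, s) for s in sims])
        score += combined[relevant].mean() - combined[~relevant].mean()
    return score / len(feedback)

# Toy search: keep the best of a few random candidates (real GP would also apply crossover and mutation).
random.seed(0)
rng = np.random.default_rng(0)
feedback = [(rng.random((20, 4)), rng.random(20) > 0.6) for _ in range(5)]
best = max((random_tree() for _ in range(200)), key=lambda t: fitness(t, feedback))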
Mestrado
Sistemas de Recuperação da Informação
Mestre em Ciência da Computação
44

Mozaffari, Maaref Mohammad Hamed. "A Real-Time and Automatic Ultrasound-Enhanced Multimodal Second Language Training System: A Deep Learning Approach". Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/40477.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
The critical role of language pronunciation in communicative competence is significant, especially for second language learners. Despite renewed awareness of the importance of articulation, it remains a challenge for instructors to handle the pronunciation needs of language learners. There are relatively scarce pedagogical tools for pronunciation teaching and learning beyond inefficient, traditional pronunciation instruction such as listening and repeating. Recently, electronic visual feedback (EVF) systems (e.g., medical ultrasound imaging) have been exploited in new approaches in such a way that they could be effectively incorporated in a range of teaching and learning contexts. Evaluation of ultrasound-enhanced methods for pronunciation training, such as multimodal methods, has asserted that visualizing the articulatory system as biofeedback to language learners might improve the efficiency of articulation learning. Despite the recent successful usage of multimodal techniques for pronunciation training, manual work and human manipulation are inevitable in many stages of those systems. Furthermore, recognizing tongue shape in noisy and low-contrast ultrasound images is a challenging job, especially for non-expert users in real-time applications. On the other hand, our user study revealed that users could not comfortably perceive the placement of their tongue inside the mouth just by watching pre-recorded videos. Machine learning is a subset of Artificial Intelligence (AI), where machines can learn by experiencing and acquiring skills without human involvement. Inspired by the functionality of the human brain, deep artificial neural networks learn from large amounts of data to perform a task repeatedly. Deep learning-based methods have emerged as the dominant paradigm in many computer vision tasks in recent years. Deep learning methods are powerful in automatically learning a new task, while, unlike traditional image processing methods, they are capable of dealing with many challenges such as object occlusion, transformation variance, and background artifacts. In this dissertation, we implemented a guided language pronunciation training system that benefits from the strengths of deep learning techniques. Our modular system attempts to provide a fully automatic and real-time language pronunciation training tool using ultrasound-enhanced augmented reality. Qualitative and quantitative assessments indicate exceptional performance for our system in terms of flexibility, generalization, robustness, and autonomy, outperforming previous techniques. Using our ultrasound-enhanced system, a language learner can observe her/his tongue movements during real-time speech, superimposed on her/his face automatically.
45

Itani, Sara T. "EduCase : an automated lecture video recording, post-processing, and viewing system that utilizes multimodal inputs to provide a dynamic student experience". Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/85426.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 59).
This thesis describes the design, implementation, and evaluation of EduCase: an inexpensive automated lecture video recording, post-processing, and viewing system. The EduCase recording system consists of three devices, one per lecture hall board. Each recording device records color, depth, skeletal, and audio inputs. The Post-Processor automatically processes the recordings to produce an output file usable by the Viewer, which provides a more dynamic student experience than traditional video playback systems. In particular, it allows students to flip back to view a previous board while the lecture continues to play in the background. It also allows students to toggle the professor's visibility in and out to see the board they might be blocking. The system was successfully evaluated in blackboard-heavy lectures at MIT and Harvard. We hope that EduCase will be the quickest, most inexpensive, and student-friendly lecture capture system, and contribute to our overarching goal of education for all.
by Sara T. Itani.
M. Eng.
46

Mitra, Jhimli. "Multimodal Image Registration applied to Magnetic Resonance and Ultrasound Prostatic Images". Phd thesis, Université de Bourgogne, 2012. http://tel.archives-ouvertes.fr/tel-00786032.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
This thesis investigates the employment of different deformable registration techniques to register pre-operative magnetic resonance and inter-operative ultrasound images during prostate biopsy. Accurate registration ensures appropriate biopsy sampling of malignant prostate tissues and reduces the rate of re-biopsies. Therefore, we provide comparisons and experimental results for some landmark- and intensity-based registration methods: thin-plate splines and free-form deformation with B-splines. The primary contribution of this thesis is a new spline-based diffeomorphic registration framework for multimodal images. In this framework, we ensure diffeomorphism of the thin-plate spline-based transformation by incorporating a set of non-linear polynomial functions. In order to ensure clinically meaningful deformations, we also introduce approximating thin-plate splines so that the solution is obtained by a joint minimization of the surface similarities of the segmented prostate regions and the thin-plate spline bending energy. The method used to establish point correspondences for the thin-plate spline-based registration is a geometric method based on prostate shape symmetry, but a further improvement is suggested by computing the Bhattacharyya metric on a shape-context-based representation of the segmented prostate contours. The proposed deformable framework is computationally expensive and is not well-suited for the registration of inter-operative images during prostate biopsy. Therefore, we further investigate an off-line learning procedure to learn the deformation parameters of a thin-plate spline from a training set of pre-operative magnetic resonance images and their corresponding inter-operative ultrasound images, and we build deformation models by applying spectral clustering on the deformation parameters. Linear estimations of these deformation models are then applied to a test set of inter-operative ultrasound and pre-operative magnetic resonance images respectively. The problem of finding the pre-operative magnetic resonance image slice from a volume that matches the inter-operative ultrasound image has further motivated us to investigate shape-based and image-based similarity measures and to propose a slice-to-slice correspondence based on joint maximization of the similarity measures.
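As a small, self-contained illustration of comparing two contours via the Bhattacharyya metric on shape-context histograms, the sketch below uses a crude log-polar binning and a Hellinger-style distance; the binning, reference-point choice and distance variant are assumptions, not the thesis' exact formulation.

import numpy as np

def shape_context(points, ref_idx, n_r=5, n_theta=12):
    # Crude log-polar shape-context histogram of contour `points` (N, 2) around points[ref_idx].
    d = np.delete(points - points[ref_idx], ref_idx, axis=0)
    r = np.log1p(np.hypot(d[:, 0], d[:, 1]))
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    hist, _, _ = np.histogram2d(r, theta, bins=[n_r, n_theta],
                                range=[[0, r.max() + 1e-9], [0, 2 * np.pi]])
    return hist.ravel() / hist.sum()

def bhattacharyya_distance(p, q):
    # Distance derived from the Bhattacharyya coefficient between two normalized histograms.
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(0.0, 1.0 - bc))

# Toy contours: a circle and a slightly deformed copy.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
c1 = np.c_[np.cos(t), np.sin(t)]
c2 = np.c_[1.1 * np.cos(t), 0.9 * np.sin(t)]
print(bhattacharyya_distance(shape_context(c1, 0), shape_context(c2, 0)))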
47

Rabhi, Sara. "Optimized deep learning-based multimodal method for irregular medical timestamped data". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS003.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
L'adoption des dossiers médicaux électroniques dans les systèmes d'information des hôpitaux a conduit à la définition de bases de données regroupant divers types de données telles que des notes cliniques textuelles, des événements médicaux longitudinaux et des informations statiques sur les patients. Toutefois, les données ne sont renseignées que lors des consultations médicales ou des séjours hospitaliers. La fréquence de ces visites varie selon l'état de santé du patient. Ainsi, un système capable d'exploiter les différents types de données collectées à différentes échelles de temps est essentiel pour reconstruire la trajectoire de soin du patient, analyser son historique et délivrer des soins adaptés. Ce travail de thèse aborde deux défis principaux du traitement des données médicales : représenter la séquence des observations médicales à échantillonnage irrégulier et optimiser l'extraction des événements médicaux à partir des textes de notes cliniques. Notre objectif principal est de concevoir une représentation multimodale de la trajectoire de soin du patient afin de résoudre les problèmes de prédiction clinique. Notre premier travail porte sur la modélisation des séries temporelles médicales irrégulières afin d'évaluer l'importance de considérer les écarts de temps entre les visites médicales dans la représentation de la trajectoire de soin d'un patient donné. À cette fin, nous avons mené une étude comparative entre les réseaux de neurones récurrents, les modèles basés sur l'architecture « Transformer » et les techniques de représentation du temps. De plus, l'objectif clinique était de prédire les complications de la rétinopathie chez les patients diabétiques de type 1 de la base de données française CaRéDIAB (Champagne Ardenne Réseau Diabète) en utilisant leur historique de mesures HbA1c. Les résultats de l'étude ont montré que le modèle « Transformer », combiné à la représentation « Soft-One-Hot » des écarts temporels, a conduit à un score AUC de 88,65 % (spécificité de 85,56 %, sensibilité de 83,33 %), soit une amélioration de 4,3 % par rapport au modèle « LSTM ». Motivés par ces résultats, nous avons étendu notre étude à des séries temporelles multivariées plus courtes et avons prédit le risque de mortalité à l'hôpital pour les patients présents dans la base de données MIMIC-III. L'architecture proposée, HiTT, a amélioré le score AUC de 5 % par rapport à l'architecture « Transformer ». Dans la deuxième étape, nous nous sommes intéressés à l'extraction d'informations médicales à partir des comptes rendus médicaux afin d'enrichir la trajectoire de soin du patient. En particulier, les réseaux de neurones basés sur le module « Transformer » ont montré des résultats encourageants dans l'extraction d'informations médicales. Cependant, ces modèles complexes nécessitent souvent un grand corpus annoté. Cette exigence est difficile à atteindre dans le domaine médical car elle nécessite l'accès à des données privées de patients et à des annotateurs experts. Pour réduire les coûts d'annotation, nous avons exploré les stratégies d'apprentissage actif qui se sont avérées efficaces dans de nombreuses tâches, notamment la classification de textes, l'analyse d'image et la reconnaissance vocale. En plus des méthodes existantes, nous avons défini une stratégie d'apprentissage actif, Hybrid Weighted Uncertainty Sampling, qui utilise la représentation cachée du texte donnée par le modèle pour mesurer la représentativité des échantillons.
Une simulation utilisant les données du challenge i2b2-2010 a montré que la métrique proposée réduit le coût d'annotation de 70% pour atteindre le même score de performance que l'apprentissage passif. Enfin, nous avons combiné des séries temporelles médicales multivariées et des concepts médicaux extraits des notes cliniques de la base de données MIMIC-III pour entraîner une architecture multimodale. Les résultats du test ont montré une amélioration de 5,3% en considérant les informations textuelles
The wide adoption of Electronic Health Records in hospitals' information systems has led to the definition of large databases grouping various types of data such as textual notes, longitudinal medical events, and tabular patient information. However, the records are only filled during consultations or hospital stays, whose frequency depends on the patient's state and on local habits. A system that can leverage the different types of data collected at different time scales is critical for reconstructing the patient's health trajectory, analyzing his history, and consequently delivering more adapted care. This thesis work addresses two main challenges of medical data processing: learning to represent the sequence of medical observations with irregular elapsed time between consecutive visits, and optimizing the extraction of medical events from clinical notes. Our main goal is to design a multimodal representation of the patient's health trajectory to solve clinical prediction problems. Our first work built a framework for modeling irregular medical time series to evaluate the importance of considering the time gaps between medical episodes when representing a patient's health trajectory. To that end, we conducted a comparative study of sequential neural networks and irregular time representation techniques. The clinical objective was to predict retinopathy complications for type 1 diabetes patients in the French database CaRéDIAB (Champagne Ardenne Réseau Diabète) using their history of HbA1c measurements. The study results showed that the attention-based model combined with the soft one-hot representation of time gaps led to an AUROC score of 88.65% (specificity of 85.56%, sensitivity of 83.33%), an improvement of 4.3% when compared to the LSTM-based model. Motivated by these results, we extended our framework to shorter multivariate time series and predicted in-hospital mortality for critical care patients of the MIMIC-III dataset. The proposed architecture, HiTT, improved the AUC score by 5% over the Transformer baseline. In the second step, we focused on extracting relevant medical information from clinical notes to enrich the patient's health trajectories. In particular, Transformer-based architectures have shown encouraging results in medical information extraction tasks. However, these complex models require a large annotated corpus. This requirement is hard to achieve in the medical field as it necessitates access to private patient data and to expert annotators. To reduce annotation cost, we explored active learning strategies that have been shown to be effective in tasks such as text classification, information extraction, and speech recognition. In addition to existing methods, we defined a Hybrid Weighted Uncertainty Sampling active learning strategy that takes advantage of the contextual embeddings learned by the Transformer-based approach to measure the representativeness of samples. A simulated study using the i2b2-2010 challenge dataset showed that our proposed metric reduces the annotation cost by 70% to achieve the same score as passive learning. Lastly, we combined multivariate medical time series and medical concepts extracted from clinical notes of the MIMIC-III database to train a multimodal transformer-based architecture. The test results of the in-hospital mortality task showed an improvement of 5.3% when considering additional text data.
This thesis contributes to patient health trajectory representation by alleviating the burden of episodic medical records and the manual annotation of free-text notes
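The Hybrid Weighted Uncertainty Sampling strategy mentioned in this abstract combines a model-uncertainty term with a representativeness term computed from the Transformer's contextual embeddings. The sketch below illustrates one plausible form of such a hybrid criterion; the entropy and centroid-similarity choices, the `alpha` trade-off, and the function name are illustrative assumptions, not the formulation actually used in the thesis.

```python
import numpy as np

def hybrid_weighted_uncertainty(probs: np.ndarray,
                                embeddings: np.ndarray,
                                alpha: float = 0.5) -> np.ndarray:
    """Score unlabeled samples for annotation (higher = more worth labeling).

    probs:      (n_samples, n_classes) predicted class probabilities.
    embeddings: (n_samples, dim) contextual embeddings of the same samples.
    alpha:      trade-off between uncertainty and representativeness (assumed).
    """
    # Uncertainty term: predictive entropy, normalized to [0, 1].
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    entropy = entropy / np.log(probs.shape[1])

    # Representativeness term: cosine similarity to the pool centroid,
    # rescaled to [0, 1] so the two terms are comparable.
    centroid = embeddings.mean(axis=0)
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid) + 1e-12
    rep = (embeddings @ centroid) / denom
    rep = (rep - rep.min()) / (rep.max() - rep.min() + 1e-12)

    return alpha * entropy + (1 - alpha) * rep

# Toy usage: select the two highest-scoring samples from a pool of five.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=5)   # stand-in class probabilities
embs = rng.normal(size=(5, 32))             # stand-in contextual embeddings
scores = hybrid_weighted_uncertainty(probs, embs)
print(np.argsort(scores)[::-1][:2])
```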
48

Prates, Jonathan Simon. "Gerenciamento de diálogo baseado em modelo cognitivo para sistemas de interação multimodal". Universidade do Vale do Rio dos Sinos, 2015. http://www.repositorio.jesuita.org.br/handle/UNISINOS/3348.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Multimodal Interaction Systems make the use of computing systems friendlier. They allow users to receive information and express their needs more easily, supported by increasingly diverse interaction resources. In this context, a central element is the dialogue established between users and these systems. Some of the challenges observed in the Multimodal Interaction field are related to the integration of the various stimuli to be handled, while others concern the generation of adequate responses to these stimuli. Dialogue management in these systems involves several activities associated with representing the topics being discussed, choosing among response alternatives, and handling models that represent tasks and users. Considering the various known approaches to these implementations, there is a demand for dialogue models that bring the interactions generated by such systems closer to the interactions that would be expected in natural-language settings. One possible line of action for obtaining improvements in this respect is the use of cognitive psychology studies on working memory and information integration. This work presents the results obtained with a dialogue-handling model for Multimodal Interaction systems based on a cognitive model, which aims to generate dialogues that approach natural-language dialogue situations. The studies that supported this proposal and the justification for its use in the described model are presented, along with preliminary results obtained with prototypes used to validate the model. The evaluations carried out show good potential for the proposed model.
Multimodal interaction systems make the use of computing systems friendlier. They allow users to receive information and express their needs with ease, supported by new interaction resources. In this context, the central element is the dialogue established between users and these systems. Dialogue management in these systems involves various activities associated with representing the subjects under discussion, selecting possible answers, and handling task and user models. Across known implementations of these approaches, there is a demand for dialogue models that bring the interactions generated by such systems closer to what would be expected in natural-language interaction. One possible line of action for obtaining improvements in this respect is the use of cognitive psychology studies on working memory and information integration. This work presents results obtained with a memory-handling model for multimodal dialogue interaction based on a cognitive model, which aims to support dialogue generation that is closer to natural-language dialogue situations. The studies that supported this proposal and the justification for its use in the described model are presented. Finally, results obtained with two prototypes used to validate the model are also shown.
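The cognitive model referenced in this abstract centers on working memory, i.e. a limited-capacity store that keeps the most active dialogue elements in focus. As a rough illustration only (the class names, the capacity of 7, and the decay rule below are assumptions, not details taken from the thesis), such a store might be sketched as follows.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    content: str          # e.g. a dialogue act, entity, or multimodal event
    activation: float = 1.0

@dataclass
class WorkingMemory:
    """Bounded, activation-based store loosely inspired by working-memory models."""
    capacity: int = 7      # illustrative capacity limit
    decay: float = 0.8     # illustrative decay factor per new perception
    items: list = field(default_factory=list)

    def perceive(self, content: str) -> None:
        # Decay existing items, then add or refresh the perceived one.
        for item in self.items:
            item.activation *= self.decay
        for item in self.items:
            if item.content == content:
                item.activation = 1.0
                break
        else:
            self.items.append(MemoryItem(content))
        # Keep only the most activated items, mimicking a capacity limit.
        self.items.sort(key=lambda i: i.activation, reverse=True)
        del self.items[self.capacity:]

    def focus(self, n: int = 3) -> list:
        """Return the n most activated items, i.e. the current dialogue focus."""
        return [i.content for i in self.items[:n]]

wm = WorkingMemory()
for event in ["greet", "ask(weather)", "point_at(map)", "ask(weather, city=Porto)"]:
    wm.perceive(event)
print(wm.focus())
```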
49

Leong, Chee Wee. "Modeling Synergistic Relationships Between Words and Images". Thesis, University of North Texas, 2012. https://digital.library.unt.edu/ark:/67531/metadc177223/.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept. By uncovering synergistic, semantic relationships that exist between words and images, I am working to develop novel techniques that can help improve tasks in natural language processing, as well as effective models for text-to-image synthesis, image retrieval, and automatic image annotation. Specifically, in my dissertation, I will explore the interoperability of features between language and vision tasks. In the first part, I will show how it is possible to apply features generated using evidence gathered from text corpora to solve the image annotation problem in computer vision, without the use of any visual information. In the second part, I will address research in the reverse direction, and show how visual cues can be used to improve tasks in natural language processing. Importantly, I propose a novel metric to estimate the similarity of words by comparing the visual similarity of the concepts invoked by these words, and show that it can be used to further advance state-of-the-art methods that employ corpus-based and knowledge-based semantic similarity measures. Finally, I attempt to construct a joint semantic space connecting words with images, and synthesize an evaluation framework to quantify the cross-modal semantic relationships that exist between arbitrary pairs of words and images. I study the effectiveness of unsupervised, corpus-based approaches to automatically derive the semantic relatedness between words and images, and perform empirical evaluations by measuring their correlation with judgments from human annotators.
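The visual similarity metric for words described in this abstract can be pictured with a small sketch: retrieve images for each word, embed them with any visual feature extractor, and aggregate pairwise image similarities. The feature shapes, the best-match aggregation, and the function names below are illustrative assumptions rather than the dissertation's actual pipeline.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def visual_word_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Estimate the relatedness of two words from images that depict them.

    feats_a, feats_b: arrays of shape (n_images, dim) holding visual feature
    vectors (e.g. from any pretrained image encoder) for images retrieved for
    word A and word B. Averaging per-image best matches is one plausible
    aggregation among many.
    """
    sims = np.array([[cosine(a, b) for b in feats_b] for a in feats_a])
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

# Toy usage with random "features" standing in for real image descriptors.
rng = np.random.default_rng(0)
cat_imgs, dog_imgs = rng.normal(size=(5, 128)), rng.normal(size=(4, 128))
print(round(visual_word_similarity(cat_imgs, dog_imgs), 3))
```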
50

Maman, Lucien. "Automated analysis of cohesion in small groups interactions". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT030.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
Abstract (sommario):
Over the last decade, a new multidisciplinary research field called Social Signal Processing (SSP) has emerged. It aims to enable machines to detect, recognize, and display human social signals. The automated analysis of group interactions is one of the most complex tasks addressed by this research field. Recently, particular attention has been devoted to the automated study of emergent states. These indeed play an important role in group dynamics, since they result from the interactions between group members. In this Thesis, we address the automated analysis of cohesion in small-group interactions. Cohesion is a multidimensional affective emergent state that can be defined as a dynamic process reflected by the tendency of a group to stay together to pursue goals and/or affective needs. Despite the rich literature available on cohesion from a Social Sciences perspective, its automated analysis is still in its early stages. Drawing on insights from the Social Sciences, this Thesis aims to develop computational models of cohesion along four research axes, relying on Machine Learning and Deep Learning techniques. These models must indeed account for the temporal nature of cohesion and its multidimensionality, consider how to model cohesion from both the individual and the group perspective, integrate the relationships between its dimensions and their evolution over time, and take into account the relationships between cohesion and other group processes. In addition, facing a lack of publicly available data, this Thesis contributed to the collection of a multimodal database specifically designed to study cohesion and to explicitly control its variations over time. Such a database makes it possible, among other things, to develop computational models integrating the cohesion perceived by group members and/or by external observers. Our results show the relevance of drawing on theories from the Social Sciences to develop new computational models of cohesion and confirm the benefits of exploring each of the four research axes
Over the last decade, a new multidisciplinary research domain named Social Signal Processing (SSP) emerged. It is aimed at enabling machines to sense, recognize, and display human social signals. One of the challenging tasks addressed by SSP is the automated analysis of group interactions. Recently, particular emphasis has been given to the automated study of emergent states, as they play an important role in group dynamics. These are social processes that develop throughout group members' interactions. In this Thesis, we address the automated analysis of cohesion in small-group interactions. Cohesion is a multidimensional affective emergent state that can be defined as a dynamic process reflected by the tendency of a group to stick together to pursue goals and/or affective needs. Despite the rich literature available on cohesion from a Social Sciences perspective, its automated analysis is still in its infancy. Grounded in Social Sciences insights, this Thesis aims to develop computational models of cohesion along four research axes, leveraging Machine Learning and Deep Learning techniques. Computational models of cohesion should indeed account for the temporal nature of cohesion and the multidimensionality of this group process, consider how to model cohesion from both the individual and the group perspective, integrate the relationships between its dimensions and their development over time, and take into account the relationships between cohesion and other group processes. In addition, facing a lack of publicly available data, this Thesis contributed to the collection of a multimodal dataset specifically designed for studying group cohesion and for explicitly controlling its variations over time. Such a dataset enables, among other perspectives, further development of computational models integrating the cohesion perceived by group members and/or by external points of view. Our results show the relevance of leveraging Social Sciences insights to develop new computational models of cohesion and confirm the benefits of exploring each of the four research axes

Vai alla bibliografia