Selected scientific literature on the topic "Audiovisual speech processing"

Cite a source in APA, MLA, Chicago, Harvard and many other citation styles


Consult the list of current articles, books, theses, conference proceedings and other scholarly sources on the topic "Audiovisual speech processing".

Next to each source in the list of references there is an "Add to bibliography" button. Click it and we will automatically generate the bibliographic citation of the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf file and read the abstract online, if it is included in the metadata.

Journal articles on the topic "Audiovisual speech processing":

1

Chen, Tsuhan. "Audiovisual speech processing". IEEE Signal Processing Magazine 18, no. 1 (2001): 9–21. http://dx.doi.org/10.1109/79.911195.

2

Vatikiotis-Bateson, Eric, and Takaaki Kuratate. "Overview of audiovisual speech processing". Acoustical Science and Technology 33, no. 3 (2012): 135–41. http://dx.doi.org/10.1250/ast.33.135.

3

Francisco, Ana A., Alexandra Jesse, Margriet A. Groen and James M. McQueen. "A General Audiovisual Temporal Processing Deficit in Adult Readers With Dyslexia". Journal of Speech, Language, and Hearing Research 60, no. 1 (January 2017): 144–58. http://dx.doi.org/10.1044/2016_jslhr-h-15-0375.

Abstract:
Purpose: Because reading is an audiovisual process, reading impairment may reflect an audiovisual processing deficit. The aim of the present study was to test the existence and scope of such a deficit in adult readers with dyslexia. Method: We tested 39 typical readers and 51 adult readers with dyslexia on their sensitivity to the simultaneity of audiovisual speech and nonspeech stimuli, their time window of audiovisual integration for speech (using incongruent /aCa/ syllables), and their audiovisual perception of phonetic categories. Results: Adult readers with dyslexia showed less sensitivity to audiovisual simultaneity than typical readers for both speech and nonspeech events. We found no differences between readers with dyslexia and typical readers in the temporal window of integration for audiovisual speech or in the audiovisual perception of phonetic categories. Conclusions: The results suggest an audiovisual temporal deficit in dyslexia that is not specific to speech-related events. But the differences found for audiovisual temporal sensitivity did not translate into a deficit in audiovisual speech perception. Hence, there seems to be a hiatus between simultaneity judgment and perception, suggesting a multisensory system that uses different mechanisms across tasks. Alternatively, it is possible that the audiovisual deficit in dyslexia is only observable when explicit judgments about audiovisual simultaneity are required.
4

Bernstein, Lynne E., Edward T. Auer, Michael Wagner and Curtis W. Ponton. "Spatiotemporal dynamics of audiovisual speech processing". NeuroImage 39, no. 1 (January 2008): 423–35. http://dx.doi.org/10.1016/j.neuroimage.2007.08.035.

5

Sams, M. "Audiovisual Speech Perception". Perception 26, no. 1_suppl (August 1997): 347. http://dx.doi.org/10.1068/v970029.

Abstract:
Persons with hearing loss use visual information from articulation to improve their speech perception. Even persons with normal hearing utilise visual information, especially when the stimulus-to-noise ratio is poor. A dramatic demonstration of the role of vision in speech perception is the audiovisual fusion called the ‘McGurk effect’. When the auditory syllable /pa/ is presented in synchrony with the face articulating the syllable /ka/, the subject usually perceives /ta/ or /ka/. The illusory perception is clearly auditory in nature. We recently studied the audiovisual fusion (acoustical /p/, visual /k/) for Finnish (1) syllables, and (2) words. Only 3% of the subjects perceived the syllables according to the acoustical input, ie in 97% of the subjects the perception was influenced by the visual information. For words the percentage of acoustical identifications was 10%. The results demonstrate a very strong influence of visual information of articulation in face-to-face speech perception. Word meaning and sentence context have a negligible influence on the fusion. We have also recorded neuromagnetic responses of the human cortex when the subjects both heard and saw speech. Some subjects showed a distinct response to a ‘McGurk’ stimulus. The response was rather late, emerging about 200 ms from the onset of the auditory stimulus. We suggest that the perisylvian cortex, close to the source area for the auditory 100 ms response (M100), may be activated by the discordant stimuli. The behavioural and neuromagnetic results suggest a precognitive audiovisual speech integration occurring at a relatively early processing level.
6

Ojanen, Ville, Riikka Möttönen, Johanna Pekkola, Iiro P. Jääskeläinen, Raimo Joensuu, Taina Autti and Mikko Sams. "Processing of audiovisual speech in Broca's area". NeuroImage 25, no. 2 (April 2005): 333–38. http://dx.doi.org/10.1016/j.neuroimage.2004.12.001.

7

Stevenson, Ryan A., Nicholas A. Altieri, Sunah Kim, David B. Pisoni and Thomas W. James. "Neural processing of asynchronous audiovisual speech perception". NeuroImage 49, no. 4 (February 2010): 3308–18. http://dx.doi.org/10.1016/j.neuroimage.2009.12.001.

8

Hamilton, Roy H., Jeffrey T. Shenton and H. Branch Coslett. "An acquired deficit of audiovisual speech processing". Brain and Language 98, no. 1 (July 2006): 66–73. http://dx.doi.org/10.1016/j.bandl.2006.02.001.

9

Dunham-Carr, Kacie, Jacob I. Feldman, David M. Simon, Sarah R. Edmunds, Alexander Tu, Wayne Kuang, Julie G. Conrad, Pooja Santapuram, Mark T. Wallace and Tiffany G. Woynaroski. "The Processing of Audiovisual Speech Is Linked with Vocabulary in Autistic and Nonautistic Children: An ERP Study". Brain Sciences 13, no. 7 (8 July 2023): 1043. http://dx.doi.org/10.3390/brainsci13071043.

Abstract:
Explaining individual differences in vocabulary in autism is critical, as understanding and using words to communicate are key predictors of long-term outcomes for autistic individuals. Differences in audiovisual speech processing may explain variability in vocabulary in autism. The efficiency of audiovisual speech processing can be indexed via amplitude suppression, wherein the amplitude of the event-related potential (ERP) is reduced at the P2 component in response to audiovisual speech compared to auditory-only speech. This study used electroencephalography (EEG) to measure P2 amplitudes in response to auditory-only and audiovisual speech and norm-referenced, standardized assessments to measure vocabulary in 25 autistic and 25 nonautistic children to determine whether amplitude suppression (a) differs or (b) explains variability in vocabulary in autistic and nonautistic children. A series of regression analyses evaluated associations between amplitude suppression and vocabulary scores. Both groups demonstrated P2 amplitude suppression, on average, in response to audiovisual speech relative to auditory-only speech. Between-group differences in mean amplitude suppression were nonsignificant. Individual differences in amplitude suppression were positively associated with expressive vocabulary through receptive vocabulary, as evidenced by a significant indirect effect observed across groups. The results suggest that efficiency of audiovisual speech processing may explain variance in vocabulary in autism.
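To make the amplitude-suppression index described above concrete, here is a minimal sketch (not the authors' analysis code): a P2 suppression score is computed from trial-averaged ERPs as the auditory-only minus audiovisual mean amplitude in an assumed 150–250 ms window; the sampling rate and synthetic waveforms are illustrative assumptions.

```python
import numpy as np

def p2_amplitude(erp, times, window=(0.150, 0.250)):
    """Mean amplitude of a trial-averaged ERP within an assumed P2 window (seconds)."""
    mask = (times >= window[0]) & (times <= window[1])
    return erp[mask].mean()

def amplitude_suppression(erp_auditory, erp_audiovisual, times):
    """Positive values indicate a smaller (suppressed) P2 for audiovisual speech."""
    return p2_amplitude(erp_auditory, times) - p2_amplitude(erp_audiovisual, times)

# Synthetic example: averages sampled at 500 Hz from -0.1 to 0.5 s, with a
# Gaussian "P2" peak at 200 ms that is smaller in the audiovisual condition.
times = np.arange(-0.1, 0.5, 1 / 500)
erp_a = 2.0 * np.exp(-((times - 0.2) ** 2) / 0.002)
erp_av = 1.4 * np.exp(-((times - 0.2) ** 2) / 0.002)
print(amplitude_suppression(erp_a, erp_av, times))  # > 0, i.e. suppression
```

A per-participant score of this kind could then feed regression analyses like those described in the abstract.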
10

Tomalski, Przemysław. "Developmental Trajectory of Audiovisual Speech Integration in Early Infancy. A Review of Studies Using the McGurk Paradigm". Psychology of Language and Communication 19, no. 2 (1 October 2015): 77–100. http://dx.doi.org/10.1515/plc-2015-0006.

Abstract:
Apart from their remarkable phonological skills, young infants prior to their first birthday show the ability to match the mouth articulation they see with the speech sounds they hear. They are able to detect the audiovisual conflict of speech and to selectively attend to the articulating mouth depending on audiovisual congruency. Early audiovisual speech processing is an important aspect of language development, related not only to phonological knowledge, but also to language production during subsequent years. This article reviews recent experimental work delineating the complex developmental trajectory of audiovisual mismatch detection. The central issue is the role of age-related changes in visual scanning of audiovisual speech and the corresponding changes in neural signatures of audiovisual speech processing in the second half of the first year of life. This phenomenon is discussed in the context of recent theories of perceptual development and existing data on the neural organisation of the infant 'social brain'.

Theses on the topic "Audiovisual speech processing":

1

Morís Fernández, Luis, 1982. "Audiovisual speech processing: the role of attention and conflict". Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/385348.

Abstract:
Events in our environment rarely excite only one sensory pathway; they usually involve several modalities offering complementary information. These different streams of information are usually integrated into a single percept through the process of multisensory integration. The present dissertation addresses how and under what circumstances this multisensory integration process occurs in the context of audiovisual speech. The findings of this dissertation challenge previous views of audiovisual integration in speech as a low-level automatic process by providing evidence, first, of the influence of the attentional focus of the participant on the multisensory integration process, in particular the need for both modalities to be attended for them to be integrated; and second, of the engagement of high-level processes (i.e., conflict detection and resolution) when incongruent audiovisual speech is presented, particularly in the case of the McGurk effect.
2

Copeland, Laura. "Audiovisual processing of affective and linguistic prosody : an event-related fMRI study". Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=111605.

Abstract:
This study was designed to clarify some of the issues surrounding the nature of hemispheric contributions to the processing of emotional and linguistic prosody, as well as to examine the relative contribution of different sensory modalities in processing prosodic structures. Ten healthy young participants were presented with semantically neutral sentences expressing affective or linguistic prosody solely through the use of non-verbal cues (intonation, facial expressions) while undergoing fMRI. The sentences were presented under auditory, visual, as well as audio-visual conditions. The emotional prosody task required participants to identify the emotion of the utterance (happy or angry) and the linguistic prosody task required participants to identify the type of utterance (question or statement). Core peri-sylvian, frontal and occipital areas were activated bilaterally in all conditions, suggesting that processing of affective and linguistic prosodic structures is supported by overlapping networks. The strength of these activations may, in part, be modulated by task and modality of presentation.
3

Krause, Hanna [author], and Andreas K. [academic supervisor] Engel. "Audiovisual processing in Schizophrenia : neural responses in audiovisual speech interference and semantic priming / Hanna Krause. Betreuer: Andreas K. Engel". Hamburg : Staats- und Universitätsbibliothek Hamburg, 2015. http://d-nb.info/1075858569/34.

4

Sadok, Samir. "Audiovisual speech representation learning applied to emotion recognition". Electronic Thesis or Diss., CentraleSupélec, 2024. http://www.theses.fr/2024CSUP0003.

Abstract:
Emotions are vital in our daily lives, becoming a primary focus of ongoing research. Automatic emotion recognition has gained considerable attention owing to its wide-ranging applications across sectors such as healthcare, education, entertainment, and marketing. This advancement in emotion recognition is pivotal for fostering the development of human-centric artificial intelligence. Supervised emotion recognition systems have significantly improved over traditional machine learning approaches. However, this progress encounters limitations due to the complexity and ambiguous nature of emotions. Acquiring extensive emotionally labeled datasets is costly, time-intensive, and often impractical. Moreover, the subjective nature of emotions results in biased datasets, impacting the learning models' applicability in real-world scenarios. Motivated by how humans learn and conceptualize complex representations from an early age with minimal supervision, this approach demonstrates the effectiveness of leveraging prior experience to adapt to new situations. Unsupervised or self-supervised learning models draw inspiration from this paradigm. Initially, they aim to establish a general representation learned from unlabeled data, akin to the foundational prior experience in human learning. These representations should adhere to criteria like invariance, interpretability, and effectiveness. Subsequently, these learned representations are applied to downstream tasks with limited labeled data, such as emotion recognition. This mirrors the assimilation of new situations in human learning. In this thesis, we aim to propose unsupervised and self-supervised representation learning methods designed explicitly for multimodal and sequential data and to explore their potential advantages in the context of emotion recognition tasks. The main contributions of this thesis encompass: 1. Developing generative models via unsupervised or self-supervised learning for audiovisual speech representation learning, incorporating joint temporal and multimodal (audiovisual) modeling. 2. Structuring the latent space to enable disentangled representations, enhancing interpretability by controlling human-interpretable latent factors. 3. Validating the effectiveness of our approaches through both qualitative and quantitative analyses, in particular on the emotion recognition task. Our methods facilitate signal analysis, transformation, and generation.
5

Biau, Emmanuel, 1985. "Beat gestures and speech processing: when prosody extends to the speaker's hands". Doctoral thesis, Universitat Pompeu Fabra, 2015. http://hdl.handle.net/10803/325429.

Abstract:
Speakers naturally accompany their speech with hand gestures and extend the auditory prosody to visual modality through rapid beat gestures that help them to structure their narrative and emphasize relevant information. The present thesis aimed to investigate beat gestures and their neural correlates on the listener’s side. We developed a naturalistic approach combining political discourse presentations with neuroimaging techniques (ERPs, EEG and fMRI) and behavioral measures. The main findings of the thesis first revealed that beat-speech processing engaged language-related areas, suggesting that gestures and auditory speech are part of the same language system. Second, the presence of beats modulated the auditory processing of affiliated words around their onsets and later at phonological stages. We concluded that listeners perceive beats as visual prosody and rely on their predictive value to anticipate relevant acoustic cues of their corresponding words, engaging local attentional processes.
6

Blomberg, Rina. "CORTICAL PHASE SYNCHRONISATION MEDIATES NATURAL FACE-SPEECH PERCEPTION". Thesis, Linköpings universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-122825.

Abstract:
It is a challenging task for researchers to determine how the brain solves multisensory perception, and the neural mechanisms involved remain subject to theoretical conjecture.  According to a hypothesised cortical model for natural audiovisual stimulation, phase synchronised communications between participating brain regions play a mechanistic role in natural audiovisual perception.  The purpose of this study was to test the hypothesis by investigating oscillatory dynamics from ongoing EEG recordings whilst participants passively viewed ecologically realistic face-speech interactions in film.  Lagged-phase synchronisation measures were computed for conditions of eye-closed rest (REST), speech-only (auditory-only, A), face-only (visual-only, V) and face-speech (audio-visual, AV) stimulation. Statistical contrasts examined AV > REST, AV > A, AV > V and AV-REST > sum(A,V)-REST effects.  Results indicated that cross-communications between the frontal lobes, intraparietal associative areas and primary auditory and occipital cortices are specifically enhanced during natural face-speech perception and that phase synchronisation mediates the functional exchange of information associated with face-speech processing between both sensory and associative regions in both hemispheres.  Furthermore, phase synchronisation between cortical regions was modulated in parallel within multiple frequency bands.
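As a generic illustration of the kind of phase-synchronisation measure discussed above (a textbook phase-locking value, not the lagged-phase-synchronisation estimator used in the thesis), the following sketch band-passes two channels and compares their instantaneous Hilbert phases; the alpha band, sampling rate and synthetic signals are assumptions introduced for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking_value(x, y, fs, band=(8.0, 12.0), order=4):
    """PLV = |mean over time of exp(i * (phase_x - phase_y))|, between 0 and 1."""
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    xf, yf = filtfilt(b, a, x), filtfilt(b, a, y)
    dphi = np.angle(hilbert(xf)) - np.angle(hilbert(yf))
    return np.abs(np.mean(np.exp(1j * dphi)))

# Synthetic example: two noisy 10 Hz oscillations with a fixed phase offset.
fs = 250
t = np.arange(0, 2.0, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
y = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * np.random.randn(t.size)
print(phase_locking_value(x, y, fs))  # close to 1 for consistently locked phases
```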
7

Girin, Laurent. "Débruitage de parole par un filtrage utilisant l'image du locuteur". Grenoble INPG, 1997. http://www.theses.fr/1997INPG0207.

Abstract:
A major problem for telecommunication systems is speech denoising, that is, attenuating the effects of interfering noise in order to improve the intelligibility and quality of the message. Humans have a particular competence in this domain: they can extract the auditory information thanks to the signals captured visually from the speaker's face. In other words, humans exploit the auditory and visual bimodality of speech to enhance it. The automated use of visual information has already improved the robustness of speech recognition systems, mainly in noisy environments. This thesis addresses the novel problem of audiovisual denoising that involves not a classification task but the generation of intelligible acoustic speech from degraded acoustic speech and complementary optical information. Approached from a filtering perspective, the objective is to develop a reliable, simple and effective denoising system using enhancement filters estimated to a large extent from visual information. Chapter 1 of the manuscript shows the contribution of visual information to speech perception and describes its extraction (measurements of the lip contour). Chapter 2 reviews classical denoising techniques and introduces a family of filters usable in our study. Chapter 3 describes the main stages of the system design: two structures are proposed and their modules are presented, in particular the associator linking visual data and filters. Chapter 4 evaluates the first structure, implemented on a corpus of stationary vowels. Chapters 5 and 6 deal with the second structure, implemented on corpora of vowel-to-vowel and vowel-to-plosive transitions. Finally, chapter 7 draws conclusions and outlines perspectives in denoising and in the joint processing of sound and image in speech.
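To make the filtering formulation concrete, here is a minimal, hypothetical sketch of visually driven spectral enhancement (an illustration of the general idea, not the system developed in the thesis): per-bin gains, assumed to come from some learned lip-to-filter associator and represented here by a placeholder returning all-pass gains, are applied to the short-time spectrum of the noisy signal.

```python
import numpy as np
from scipy.signal import stft, istft

def estimate_gains_from_lips(lip_params, n_bins, n_frames):
    """Placeholder for a learned visual-to-filter associator: a real system would
    map lip-contour features to spectral gains; here we return all-pass gains."""
    return np.ones((n_bins, n_frames))

def enhance_with_visual_gains(noisy, lip_params, fs=16000, nperseg=512):
    """Apply visually estimated per-bin gains (in [0, 1]) to the noisy STFT."""
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    gains = estimate_gains_from_lips(lip_params, Z.shape[0], Z.shape[1])
    _, enhanced = istft(gains * Z, fs=fs, nperseg=nperseg)
    return enhanced

# Toy usage: one second of noise and dummy lip parameters (30 video frames, 3 features).
noisy = np.random.randn(16000)
lip_params = np.zeros((30, 3))
clean_estimate = enhance_with_visual_gains(noisy, lip_params)
```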
8

Teissier, Pascal. "Fusion de capteurs avec contrôle du contexte : application a la reconnaissance de parole dans le bruit". Grenoble INPG, 1999. http://www.theses.fr/1999INPG0023.

Abstract:
This thesis is devoted to sensor fusion with control by contextual information. The target application is audiovisual speech recognition in noise. We first review the literature on existing audiovisual automatic speech recognition systems, as well as the broader fields of sensor fusion and speech perception. From this state of the art, which reveals a considerable diversity of approaches, we set up a strategy and a methodology for properly studying and comparing, on a simple recognition task (static vowels), the main elements that determine the efficiency of audiovisual recognition systems: conditioning of the input data, choice of the fusion architecture, and introduction of control mechanisms. A comparison of four architectures, each with an original proposal for a control process driven by external (contextual) information, brought out two integration models with similar performance. The introduction of different data pre-processing steps, too often neglected in the literature, shows the importance of data conditioning in a fusion process; a proposed supervised learning algorithm that unfolds the input data improves performance very significantly. A study of the estimation of contextual information, which is essential to drive the fusion system according to the external conditions (noise), favours evaluating this variable on the pre-processed data before any classification process. Finally, in a second series of experiments, we continue the comparison of architectures on a more complex recognition task (dynamic stimuli) for the two best integration models.
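Purely as an illustration of context-controlled fusion (not one of the four architectures compared in the thesis), the sketch below lets an estimated context variable, here an SNR in dB, weight the audio stream against the visual stream when combining per-class log-likelihoods; the SNR-to-weight mapping and the toy scores are assumptions.

```python
import numpy as np

def snr_to_audio_weight(snr_db, lo=-5.0, hi=20.0):
    """Hypothetical linear mapping from estimated SNR (dB) to an audio weight in [0, 1]."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

def fuse(audio_ll, visual_ll, snr_db):
    """Weighted log-linear fusion: w * log p(a|c) + (1 - w) * log p(v|c)."""
    w = snr_to_audio_weight(snr_db)
    scores = w * np.asarray(audio_ll) + (1.0 - w) * np.asarray(visual_ll)
    return int(np.argmax(scores)), scores

# At 0 dB SNR the visual stream dominates and class 2 wins despite the audio scores.
audio_ll = np.log([0.40, 0.35, 0.25])
visual_ll = np.log([0.10, 0.20, 0.70])
best, _ = fuse(audio_ll, visual_ll, snr_db=0.0)
```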
9

Decroix, François-Xavier. "Apprentissage en ligne de signatures audiovisuelles pour la reconnaissance et le suivi de personnes au sein d'un réseau de capteurs ambiants". Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30298/document.

Abstract:
The neOCampus operation, started in 2013 by Paul Sabatier University in Toulouse, aims to create a connected, innovative, intelligent and sustainable campus by exploiting the skills of 11 laboratories and several industrial partners. These multidisciplinary skills are combined in order to improve the daily comfort of users (students, teachers, administrative staff) and to reduce the ecological footprint of the campus. The intelligence we want to bring to the campus of the future requires providing its buildings with a perception of their internal activity. Indeed, optimizing energy resources requires a characterization of the users' activities so that the building can automatically adapt itself to them. Since human activity is open to multiple levels of interpretation, our work focuses on extracting people's trajectories, its most elementary component. Characterizing users' activities, in terms of movement, uses data extracted from cameras and microphones distributed in a room, forming a sparse network of heterogeneous sensors. From these data, we then seek to extract audiovisual signatures and rough localizations of the people transiting through this network of sensors. While protecting personal privacy, signatures must be discriminative, to distinguish one person from another, and compact, to limit computational costs and enable the building to adapt itself. Given these constraints, the characteristics we model are the speaker's vocal timbre and his or her appearance, in terms of colorimetric distribution. The scientific contributions of this thesis thus lie at the intersection of speech processing and computer vision, introducing new methods for fusing audio and visual signatures of individuals. To achieve this fusion, new sound-source localization cues as well as an audiovisual adaptation of a multi-target tracking method are introduced, representing the main contributions of this work. The thesis is structured in 4 chapters. Chapter 1 presents the state of the art on visual re-identification of persons and on speaker recognition. Since the acoustic and visual modalities are not correlated, two signatures are computed separately, one for video and one for audio, using existing methods from the literature; the details of the computation of these signatures are given in chapter 2. The fusion of the signatures is then treated as a problem of matching audio and video observations whose corresponding detections are spatially coherent and compatible, for which two novel association strategies are introduced in chapter 3. The spatio-temporal coherence of the bimodal observations is then addressed in chapter 4, in a multi-target tracking context.
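As a purely illustrative sketch of the matching step mentioned above (not the association strategies introduced in the thesis), audio-video pairing can be posed as an assignment problem over a spatial-distance cost matrix; the coordinates and the distance threshold below are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_audio_video(audio_xy, video_xy, max_dist=1.0):
    """Pair sound-source localisations with visual detections by minimising total
    spatial distance; pairs farther apart than `max_dist` are rejected."""
    cost = np.linalg.norm(audio_xy[:, None, :] - video_xy[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Two estimated sound sources and three visual detections (room coordinates, metres).
audio_xy = np.array([[1.0, 2.0], [3.5, 0.5]])
video_xy = np.array([[1.1, 1.9], [5.0, 5.0], [3.4, 0.7]])
print(match_audio_video(audio_xy, video_xy))  # [(0, 0), (1, 2)]
```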
10

Robert-Ribes, Jordi. "Modèles d'intégration audiovisuelle de signaux linguistiques : de la perception humaine a la reconnaissance automatique des voyelles". Grenoble INPG, 1995. http://www.theses.fr/1995INPG0032.

Abstract:
This thesis concerns the study of models for integrating auditory and visual information, with the goal of obtaining a plausible and functional model for the audiovisual recognition of French vowels. We review the literature on audiovisual integration in speech perception. We then present four models and classify them according to principles drawn both from experimental psychology and from the sensor-fusion literature. Our plausibility constraint (consistency with experimental data) allows two models to be eliminated. The two remaining models are compared against our own results on audiovisual vowel perception, as well as in terms of their functional performance. This comparison shows that a model that projects the two inputs (auditory and visual) into an intermediate space of a motor nature is both the most consistent with the experimental data and the most efficient for audiovisual vowel recognition in acoustic noise.
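The "intermediate space" idea can be illustrated with a deliberately simplified sketch (not the thesis's model): each modality is linearly regressed onto a common target space standing in for motor/articulatory coordinates, the two projections are averaged, and vowels are classified by the nearest class centroid. All dimensions and data below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: audio (13-d) and visual (6-d) features for 300 vowel tokens,
# a 3-d stand-in "articulatory" space, and labels for three vowel classes.
N, d_a, d_v, d_m, n_classes = 300, 13, 6, 3, 3
audio = rng.normal(size=(N, d_a))
visual = rng.normal(size=(N, d_v))
motor = rng.normal(size=(N, d_m))
labels = rng.integers(0, n_classes, size=N)

# Least-squares projections of each modality into the shared space.
W_a, *_ = np.linalg.lstsq(audio, motor, rcond=None)
W_v, *_ = np.linalg.lstsq(visual, motor, rcond=None)

# One centroid per vowel class in the shared space.
centroids = np.stack([motor[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(a_feat, v_feat):
    """Project both streams into the shared space, fuse by averaging,
    then pick the nearest class centroid."""
    z = 0.5 * (a_feat @ W_a + v_feat @ W_v)
    return int(np.argmin(np.linalg.norm(centroids - z, axis=1)))

print(classify(audio[0], visual[0]))
```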

Books on the topic "Audiovisual speech processing":

1

Bailly, Gérard, Pascal Perrier and Eric Vatikiotis-Bateson, eds. Audiovisual Speech Processing. Cambridge: Cambridge University Press, 2012. http://dx.doi.org/10.1017/cbo9780511843891.

2

Bailly, G., Eric Vatikiotis-Bateson and Pascal Perrier. Audiovisual speech processing. Cambridge: Cambridge University Press, 2012.

3

Randazzo, Melissa. Audiovisual Integration in Apraxia of Speech: EEG Evidence for Processing Differences. [New York, N.Y.?]: [publisher not identified], 2016.

4

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.

5

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.

6

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.

7

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2015.

8

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.

9

Abel, Andrew, and Amir Hussain. Cognitively Inspired Audiovisual Speech Filtering: Towards an Intelligent, Fuzzy Based, Multimodal, Two-Stage Speech Enhancement System. Springer, 2015.

10

Abel, Andrew, and Amir Hussain. Cognitively Inspired Audiovisual Speech Filtering: Towards an Intelligent, Fuzzy Based, Multimodal, Two-Stage Speech Enhancement System. Springer International Publishing AG, 2015.


Book chapters on the topic "Audiovisual speech processing":

1

Riekhakaynen, Elena, and Elena Zatevalova. "Should We Believe Our Eyes or Our Ears? Processing Incongruent Audiovisual Stimuli by Russian Listeners". In Speech and Computer, 604–15. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-20980-2_51.

2

Aleksic, Petar S., Gerasimos Potamianos and Aggelos K. Katsaggelos. "Audiovisual Speech Processing". In The Essential Guide to Video Processing, 689–737. Elsevier, 2009. http://dx.doi.org/10.1016/b978-0-12-374456-2.00024-4.

3

Pantic, Maja. "Face for Interface". In Encyclopedia of Multimedia Technology and Networking, Second Edition, 560–67. IGI Global, 2009. http://dx.doi.org/10.4018/978-1-60566-014-1.ch075.

Abstract:
The human face is involved in an impressive variety of different activities. It houses the majority of our sensory apparatus: eyes, ears, mouth, and nose, allowing the bearer to see, hear, taste, and smell. Apart from these biological functions, the human face provides a number of signals essential for interpersonal communication in our social life. The face houses the speech production apparatus and is used to identify other members of the species, to regulate the conversation by gazing or nodding, and to interpret what has been said by lip reading. It is our direct and naturally preeminent means of communicating and understanding somebody’s affective state and intentions on the basis of the shown facial expression (Lewis & Haviland-Jones, 2000). Personality, attractiveness, age, and gender can also be seen from someone’s face. Thus the face is a multisignal sender/receiver capable of tremendous flexibility and specificity. In general, the face conveys information via four kinds of signals listed in Table 1. Automating the analysis of facial signals, especially rapid facial signals, would be highly beneficial for fields as diverse as security, behavioral science, medicine, communication, and education. In security contexts, facial expressions play a crucial role in establishing or detracting from credibility. In medicine, facial expressions are the direct means to identify when specific mental processes are occurring. In education, pupils’ facial expressions inform the teacher of the need to adjust the instructional message. As far as natural user interfaces between humans and computers (PCs/robots/machines) are concerned, facial expressions provide a way to communicate basic information about needs and demands to the machine. In fact, automatic analysis of rapid facial signals seem to have a natural place in various vision subsystems and vision-based interfaces (face-for-interface tools), including automated tools for gaze and focus of attention tracking, lip reading, bimodal speech processing, face/visual speech synthesis, face-based command issuing, and facial affect processing. Where the user is looking (i.e., gaze tracking) can be effectively used to free computer users from the classic keyboard and mouse. Also, certain facial signals (e.g., a wink) can be associated with certain commands (e.g., a mouse click) offering an alternative to traditional keyboard and mouse commands. The human capability to “hear” in noisy environments by means of lip reading is the basis for bimodal (audiovisual) speech processing that can lead to the realization of robust speech-driven interfaces. To make a believable “talking head” (avatar) representing a real person, tracking the person’s facial signals and making the avatar mimic those using synthesized speech and facial expressions is compulsory. The human ability to read emotions from someone’s facial expressions is the basis of facial affect processing that can lead to expanding user interfaces with emotional communication and, in turn, to obtaining a more flexible, adaptable, and natural affective interfaces between humans and machines. More specifically, the information about when the existing interaction/processing should be adapted, the importance of such an adaptation, and how the interaction/ reasoning should be adapted, involves information about how the user feels (e.g., confused, irritated, tired, interested). 
Examples of affect-sensitive user interfaces are still rare, unfortunately, and include the systems of Lisetti and Nasoz (2002), Maat and Pantic (2006), and Kapoor, Burleson, and Picard (2007). It is this wide range of principal driving applications that has lent a special impetus to the research problem of automatic facial expression analysis and produced a surge of interest in this research topic.

Conference papers on the topic "Audiovisual speech processing":

1

Vatikiotis-Bateson, E., K. G. Munhall, Y. Kasahara, F. Garcia and H. Yehia. "Characterizing audiovisual information during speech". In 4th International Conference on Spoken Language Processing (ICSLP 1996). ISCA: ISCA, 1996. http://dx.doi.org/10.21437/icslp.1996-379.

2

Petridis, Stavros, and Maja Pantic. "Audiovisual discrimination between laughter and speech". In ICASSP 2008 - 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008. http://dx.doi.org/10.1109/icassp.2008.4518810.

3

Petridis, Stavros, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos and Maja Pantic. "End-to-End Audiovisual Speech Recognition". In ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018. http://dx.doi.org/10.1109/icassp.2018.8461326.

4

Tran, Tam, Soroosh Mariooryad and Carlos Busso. "Audiovisual corpus to analyze whisper speech". In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6639243.

5

Katsamanis, Athanassios, George Papandreou and Petros Maragos. "Audiovisual-to-Articulatory Speech Inversion Using HMMs". In 2007 IEEE 9th Workshop on Multimedia Signal Processing. IEEE, 2007. http://dx.doi.org/10.1109/mmsp.2007.4412915.

6

Rosenblum, Lawrence D. "The perceptual basis for audiovisual speech integration". In 7th International Conference on Spoken Language Processing (ICSLP 2002). ISCA: ISCA, 2002. http://dx.doi.org/10.21437/icslp.2002-424.

7

Silva, Samuel, and António Teixeira. "An Anthropomorphic Perspective for Audiovisual Speech Synthesis". In 10th International Conference on Bio-inspired Systems and Signal Processing. SCITEPRESS - Science and Technology Publications, 2017. http://dx.doi.org/10.5220/0006150201630172.

8

Holt, Rebecca, Laurence Bruggeman and Katherine Demuth. "Audiovisual benefits for speech processing speed among children with hearing loss". In The 15th International Conference on Auditory-Visual Speech Processing. ISCA: ISCA, 2019. http://dx.doi.org/10.21437/avsp.2019-10.

9

Matthews, I. A. "Scale based features for audiovisual speech recognition". In IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication. IEE, 1996. http://dx.doi.org/10.1049/ic:19961152.

10

Ma, Pingchuan, Stavros Petridis and Maja Pantic. "Detecting Adversarial Attacks on Audiovisual Speech Recognition". In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413661.

