Selected scholarly literature on the topic "Audiovisual speech processing"

Author: Grafiati

Published on June 22, 2024

Browse the lists of current articles, books, dissertations, reports, and other scholarly sources on the topic "Audiovisual speech processing".

Next to each work in the bibliography there is an "Add to bibliography" option. If you use it, the reference for the chosen work is formatted automatically in the citation style you need (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the publication as a PDF and read its online annotation, provided the relevant parameters are available in the metadata.

Table of contents

Journal articles
Dissertations
Books
Book chapters
Conference papers

Journal articles on the topic "Audiovisual speech processing":

1

Tsuhan Chen. "Audiovisual speech processing". IEEE Signal Processing Magazine 18, no. 1 (2001): 9–21. http://dx.doi.org/10.1109/79.911195.

2

Vatikiotis-Bateson, Eric, and Takaaki Kuratate. "Overview of audiovisual speech processing". Acoustical Science and Technology 33, no. 3 (2012): 135–41. http://dx.doi.org/10.1250/ast.33.135.

3

Francisco, Ana A., Alexandra Jesse, Margriet A. Groen, and James M. McQueen. "A General Audiovisual Temporal Processing Deficit in Adult Readers With Dyslexia". Journal of Speech, Language, and Hearing Research 60, no. 1 (January 2017): 144–58. http://dx.doi.org/10.1044/2016_jslhr-h-15-0375.

Annotation:

Purpose: Because reading is an audiovisual process, reading impairment may reflect an audiovisual processing deficit. The aim of the present study was to test the existence and scope of such a deficit in adult readers with dyslexia. Method: We tested 39 typical readers and 51 adult readers with dyslexia on their sensitivity to the simultaneity of audiovisual speech and nonspeech stimuli, their time window of audiovisual integration for speech (using incongruent /aCa/ syllables), and their audiovisual perception of phonetic categories. Results: Adult readers with dyslexia showed less sensitivity to audiovisual simultaneity than typical readers for both speech and nonspeech events. We found no differences between readers with dyslexia and typical readers in the temporal window of integration for audiovisual speech or in the audiovisual perception of phonetic categories. Conclusions: The results suggest an audiovisual temporal deficit in dyslexia that is not specific to speech-related events. But the differences found for audiovisual temporal sensitivity did not translate into a deficit in audiovisual speech perception. Hence, there seems to be a hiatus between simultaneity judgment and perception, suggesting a multisensory system that uses different mechanisms across tasks. Alternatively, it is possible that the audiovisual deficit in dyslexia is only observable when explicit judgments about audiovisual simultaneity are required.

4

Bernstein, Lynne E., Edward T. Auer, Michael Wagner, and Curtis W. Ponton. "Spatiotemporal dynamics of audiovisual speech processing". NeuroImage 39, no. 1 (January 2008): 423–35. http://dx.doi.org/10.1016/j.neuroimage.2007.08.035.

5

Sams, M. "Audiovisual Speech Perception". Perception 26, no. 1_suppl (August 1997): 347. http://dx.doi.org/10.1068/v970029.

Annotation:

Persons with hearing loss use visual information from articulation to improve their speech perception. Even persons with normal hearing utilise visual information, especially when the stimulus-to-noise ratio is poor. A dramatic demonstration of the role of vision in speech perception is the audiovisual fusion called the ‘McGurk effect’. When the auditory syllable /pa/ is presented in synchrony with the face articulating the syllable /ka/, the subject usually perceives /ta/ or /ka/. The illusory perception is clearly auditory in nature. We recently studied the audiovisual fusion (acoustical /p/, visual /k/) for Finnish (1) syllables, and (2) words. Only 3% of the subjects perceived the syllables according to the acoustical input, ie in 97% of the subjects the perception was influenced by the visual information. For words the percentage of acoustical identifications was 10%. The results demonstrate a very strong influence of visual information of articulation in face-to-face speech perception. Word meaning and sentence context have a negligible influence on the fusion. We have also recorded neuromagnetic responses of the human cortex when the subjects both heard and saw speech. Some subjects showed a distinct response to a ‘McGurk’ stimulus. The response was rather late, emerging about 200 ms from the onset of the auditory stimulus. We suggest that the perisylvian cortex, close to the source area for the auditory 100 ms response (M100), may be activated by the discordant stimuli. The behavioural and neuromagnetic results suggest a precognitive audiovisual speech integration occurring at a relatively early processing level.

6

Ojanen, Ville, Riikka Möttönen, Johanna Pekkola, Iiro P. Jääskeläinen, Raimo Joensuu, Taina Autti, and Mikko Sams. "Processing of audiovisual speech in Broca's area". NeuroImage 25, no. 2 (April 2005): 333–38. http://dx.doi.org/10.1016/j.neuroimage.2004.12.001.

7

Stevenson, Ryan A., Nicholas A. Altieri, Sunah Kim, David B. Pisoni, and Thomas W. James. "Neural processing of asynchronous audiovisual speech perception". NeuroImage 49, no. 4 (February 2010): 3308–18. http://dx.doi.org/10.1016/j.neuroimage.2009.12.001.

8

Hamilton, Roy H., Jeffrey T. Shenton, and H. Branch Coslett. "An acquired deficit of audiovisual speech processing". Brain and Language 98, no. 1 (July 2006): 66–73. http://dx.doi.org/10.1016/j.bandl.2006.02.001.

9

Dunham-Carr, Kacie, Jacob I. Feldman, David M. Simon, Sarah R. Edmunds, Alexander Tu, Wayne Kuang, Julie G. Conrad, Pooja Santapuram, Mark T. Wallace, and Tiffany G. Woynaroski. "The Processing of Audiovisual Speech Is Linked with Vocabulary in Autistic and Nonautistic Children: An ERP Study". Brain Sciences 13, no. 7 (July 8, 2023): 1043. http://dx.doi.org/10.3390/brainsci13071043.

Annotation:

Explaining individual differences in vocabulary in autism is critical, as understanding and using words to communicate are key predictors of long-term outcomes for autistic individuals. Differences in audiovisual speech processing may explain variability in vocabulary in autism. The efficiency of audiovisual speech processing can be indexed via amplitude suppression, wherein the amplitude of the event-related potential (ERP) is reduced at the P2 component in response to audiovisual speech compared to auditory-only speech. This study used electroencephalography (EEG) to measure P2 amplitudes in response to auditory-only and audiovisual speech and norm-referenced, standardized assessments to measure vocabulary in 25 autistic and 25 nonautistic children to determine whether amplitude suppression (a) differs or (b) explains variability in vocabulary in autistic and nonautistic children. A series of regression analyses evaluated associations between amplitude suppression and vocabulary scores. Both groups demonstrated P2 amplitude suppression, on average, in response to audiovisual speech relative to auditory-only speech. Between-group differences in mean amplitude suppression were nonsignificant. Individual differences in amplitude suppression were positively associated with expressive vocabulary through receptive vocabulary, as evidenced by a significant indirect effect observed across groups. The results suggest that efficiency of audiovisual speech processing may explain variance in vocabulary in autism.

10

Tomalski, Przemysław. "Developmental Trajectory of Audiovisual Speech Integration in Early Infancy. A Review of Studies Using the McGurk Paradigm". Psychology of Language and Communication 19, no. 2 (October 1, 2015): 77–100. http://dx.doi.org/10.1515/plc-2015-0006.

Annotation:

Apart from their remarkable phonological skills, young infants prior to their first birthday show the ability to match the mouth articulation they see with the speech sounds they hear. They are able to detect the audiovisual conflict of speech and to selectively attend to the articulating mouth depending on audiovisual congruency. Early audiovisual speech processing is an important aspect of language development, related not only to phonological knowledge, but also to language production during subsequent years. This article reviews recent experimental work delineating the complex developmental trajectory of audiovisual mismatch detection. The central issue is the role of age-related changes in visual scanning of audiovisual speech and the corresponding changes in neural signatures of audiovisual speech processing in the second half of the first year of life. This phenomenon is discussed in the context of recent theories of perceptual development and existing data on the neural organisation of the infant ‘social brain’.

Dissertations on the topic "Audiovisual speech processing":

1

Morís, Fernández Luis 1982. "Audiovisual speech processing: the role of attention and conflict". Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/385348.

Annotation:

Events in our environment rarely excite only one sensory pathway; they usually involve several modalities that offer complementary information. This information is usually integrated into a single percept through the process of multisensory integration. The present dissertation addresses how and under what circumstances this multisensory integration process occurs in the context of audiovisual speech. The findings of this dissertation challenge previous views of audiovisual integration in speech as a low-level automatic process by providing evidence, first, of the influence of the participant's attentional focus on the multisensory integration process, particularly the need for both modalities to be attended for them to be integrated; and, second, of the engagement of high-level processes (i.e., conflict detection and resolution) when incongruent audiovisual speech is presented, particularly in the case of the McGurk effect.

2

Copeland, Laura. "Audiovisual processing of affective and linguistic prosody : an event-related fMRI study". Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=111605.

Annotation:

This study was designed to clarify some of the issues surrounding the nature of hemispheric contributions to the processing of emotional and linguistic prosody, as well as to examine the relative contribution of different sensory modalities in processing prosodic structures. Ten healthy young participants were presented with semantically neutral sentences expressing affective or linguistic prosody solely through the use of non-verbal cues (intonation, facial expressions) while undergoing fMRI. The sentences were presented under auditory, visual, as well as audio-visual conditions. The emotional prosody task required participants to identify the emotion of the utterance (happy or angry) and the linguistic prosody task required participants to identify the type of utterance (question or statement). Core peri-sylvian, frontal and occipital areas were activated bilaterally in all conditions, suggesting that processing of affective and linguistic prosodic structures is supported by overlapping networks. The strength of these activations may, in part, be modulated by task and modality of presentation.

3

Krause, Hanna [author], and Andreas K. [academic supervisor] Engel. "Audiovisual processing in Schizophrenia : neural responses in audiovisual speech interference and semantic priming / Hanna Krause. Supervisor: Andreas K. Engel". Hamburg : Staats- und Universitätsbibliothek Hamburg, 2015. http://d-nb.info/1075858569/34.

4

Sadok, Samir. "Audiovisual speech representation learning applied to emotion recognition". Electronic Thesis or Diss., CentraleSupélec, 2024. http://www.theses.fr/2024CSUP0003.

Annotation:

Emotions are vital in our daily lives and have become a primary focus of ongoing research. Automatic emotion recognition has gained considerable attention owing to its wide-ranging applications across sectors such as healthcare, education, entertainment, and marketing. This advancement in emotion recognition is pivotal for fostering the development of human-centric artificial intelligence. Supervised emotion recognition systems have significantly improved over traditional machine learning approaches. However, this progress encounters limitations due to the complexity and ambiguous nature of emotions. Acquiring extensive emotionally labeled datasets is costly, time-intensive, and often impractical. Moreover, the subjective nature of emotions results in biased datasets, impacting the learning models' applicability in real-world scenarios. Motivated by how humans learn and conceptualize complex representations from an early age with minimal supervision, this approach demonstrates the effectiveness of leveraging prior experience to adapt to new situations. Unsupervised or self-supervised learning models draw inspiration from this paradigm. Initially, they aim to establish a general representation from unlabeled data, akin to the foundational prior experience in human learning. These representations should adhere to criteria like invariance, interpretability, and effectiveness. Subsequently, these learned representations are applied to downstream tasks with limited labeled data, such as emotion recognition. This mirrors the assimilation of new situations in human learning. In this thesis, we aim to propose unsupervised and self-supervised representation learning methods designed explicitly for multimodal and sequential data and to explore their potential advantages in the context of emotion recognition tasks. The main contributions of this thesis encompass: (1) developing generative models via unsupervised or self-supervised learning for audiovisual speech representation learning, incorporating joint temporal and multimodal (audiovisual) modeling; (2) structuring the latent space to enable disentangled representations, enhancing interpretability by controlling human-interpretable latent factors; (3) validating the effectiveness of our approaches through both qualitative and quantitative analyses, in particular on the emotion recognition task. Our methods facilitate signal analysis, transformation, and generation

5

Biau, Emmanuel 1985. "Beat gestures and speech processing: when prosody extends to the speaker's hands". Doctoral thesis, Universitat Pompeu Fabra, 2015. http://hdl.handle.net/10803/325429.

Annotation:

Speakers naturally accompany their speech with hand gestures and extend the auditory prosody to the visual modality through rapid beat gestures that help them to structure their narrative and emphasize relevant information. The present thesis aimed to investigate beat gestures and their neural correlates on the listener’s side. We developed a naturalistic approach combining political discourse presentations with neuroimaging techniques (ERPs, EEG and fMRI) and behavioral measures. The main findings of the thesis first revealed that beat-speech processing engaged language-related areas, suggesting that gestures and auditory speech are part of the same language system. Second, the presence of beats modulated the auditory processing of affiliated words around their onsets and later at phonological stages. We concluded that listeners perceive beats as visual prosody and rely on their predictive value to anticipate relevant acoustic cues of their corresponding words, engaging local attentional processes.

6

Blomberg, Rina. "CORTICAL PHASE SYNCHRONISATION MEDIATES NATURAL FACE-SPEECH PERCEPTION". Thesis, Linköpings universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-122825.

Annotation:

It is a challenging task for researchers to determine how the brain solves multisensory perception, and the neural mechanisms involved remain subject to theoretical conjecture. According to a hypothesised cortical model for natural audiovisual stimulation, phase synchronised communications between participating brain regions play a mechanistic role in natural audiovisual perception. The purpose of this study was to test the hypothesis by investigating oscillatory dynamics from ongoing EEG recordings whilst participants passively viewed ecologically realistic face-speech interactions in film. Lagged-phase synchronisation measures were computed for conditions of eye-closed rest (REST), speech-only (auditory-only, A), face-only (visual-only, V) and face-speech (audio-visual, AV) stimulation. Statistical contrasts examined AV > REST, AV > A, AV > V and AV-REST > sum(A,V)-REST effects. Results indicated that cross-communications between the frontal lobes, intraparietal associative areas and primary auditory and occipital cortices are specifically enhanced during natural face-speech perception and that phase synchronisation mediates the functional exchange of information associated with face-speech processing between both sensory and associative regions in both hemispheres. Furthermore, phase synchronisation between cortical regions was modulated in parallel within multiple frequency bands.

7

Girin, Laurent. "Débruitage de parole par un filtrage utilisant l'image du locuteur". Grenoble INPG, 1997. http://www.theses.fr/1997INPG0207.

Annotation:

A major problem for telecommunication systems is speech denoising, that is, attenuating the effects of interfering noise in order to improve the intelligibility and quality of the message. Humans have a particular skill in this area: they can extract auditory information with the help of signals picked up visually from the interlocutor's face. In other words, humans know how to use the auditory and visual bimodality of speech to enhance it. The automated use of visual information has already improved the robustness of speech recognition systems, mainly in noisy environments. This thesis addresses the novel problem of audiovisual denoising that does not involve a classification task, but rather the generation of intelligible acoustic speech from degraded acoustic speech and complementary optical information. Approached from a filtering perspective, the objective is to develop a reliable, simple, and effective denoising system that uses enhancement filters estimated largely from visual information. Chapter 1 of the manuscript shows the contribution of visual information to speech perception and describes its extraction (measurements on the lip contour). Chapter 2 covers classical denoising techniques and introduces a family of filters usable in our study. Chapter 3 describes the main stages in the development of the system. Two structures are proposed and their modules are presented, notably the associator linking visual data and filters. Chapter 4 evaluates the first structure, implemented on a corpus of stationary vowels. Chapters 5 and 6 deal with the second structure, implemented on corpora of vowel-to-vowel and vowel-to-plosive-consonant transitions. Finally, chapter 7 is an assessment that identifies various perspectives in denoising and in the joint processing of sound and image in speech.

8

Teissier, Pascal. "Fusion de capteurs avec contrôle du contexte : application a la reconnaissance de parole dans le bruit". Grenoble INPG, 1999. http://www.theses.fr/1999INPG0023.

Annotation:

This thesis is devoted to sensor fusion that includes control by contextual information. The target application is audiovisual speech recognition in noise. First, we review the literature on existing automatic audiovisual speech recognition systems, as well as more general fields such as sensor fusion and speech perception. From this review of the state of the art, which reveals a considerable diversity of approaches, we set up a strategy and a methodology for properly studying and comparing, on a simple recognition task (static vowels), the main elements that determine the effectiveness of audiovisual recognition systems: the shaping of input data, the choice of fusion architecture, and the introduction of control mechanisms. A comparison of four architectures, with an original proposal of a control process driven by external information (context) for each of them, brought out two integration models that give similar performance. The introduction of different data preprocessing steps, too often neglected in the literature, shows the importance of data shaping in a fusion process; the proposal of a supervised learning algorithm for unfolding the input data improves performance very significantly. A study on the estimation of contextual information, which is indispensable for steering the fusion system according to external conditions (noise), favours evaluating this variable on the preprocessed data before any classification process. Finally, in a second series of experiments, we continue the comparison of architectures on a more complex recognition task (dynamic stimuli) for the two best integration models.

9

Decroix, François-Xavier. "Apprentissage en ligne de signatures audiovisuelles pour la reconnaissance et le suivi de personnes au sein d'un réseau de capteurs ambiants". Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30298/document.

Annotation:

The neOCampus operation, started in 2013 by Paul Sabatier University in Toulouse, aims to create a connected, innovative, intelligent and sustainable campus by exploiting the skills of 11 laboratories and several industrial partners. These multidisciplinary skills are combined in order to improve the daily comfort of campus users (students, teachers, administrative staff) and to reduce the ecological footprint of the campus. The intelligence we want to bring to the campus of the future requires giving its buildings a perception of their internal activity. Indeed, optimizing energy resources requires a characterization of users' activities so that the building can automatically adapt to them. Since human activity is open to multiple levels of interpretation, our work focuses on extracting people's trajectories, its most elementary component. Characterizing users' activities in terms of movement uses data extracted from cameras and microphones distributed in a room, forming a sparse network of heterogeneous sensors. From these data, we then seek to extract audiovisual signatures and rough localizations of the people transiting through this network of sensors. While protecting personal privacy, signatures must be discriminative, to distinguish one person from another, and compact, to optimize computational costs and enable the building to adapt itself. Given these constraints, the characteristics we model are the timbre of the speaker's voice and the speaker's clothing appearance in terms of colorimetric distribution. The scientific contributions of this thesis are thus at the intersection of the fields of speech processing and computer vision, introducing new methods for fusing audio and visual signatures of individuals. To achieve this fusion, new sound source localization cues as well as an audiovisual adaptation of a multi-target tracking method were introduced, representing the main contributions of this work. The thesis is structured in 4 chapters. The first presents the state of the art on visual re-identification of persons and speaker recognition. Since the acoustic and visual modalities are not correlated, two signatures are computed separately, one for video and one for audio, using existing methods from the literature; the details of their computation are the subject of chapter 2. The fusion of the signatures is then treated as a problem of matching audio and video observations whose corresponding detections are spatially coherent and compatible, for which two novel association strategies are introduced in chapter 3. The spatio-temporal coherence of the bimodal observations is then addressed in chapter 4, in a context of multi-target tracking

10

Robert-Ribes, Jordi. „Modèles d'intégration audiovisuelle de signaux linguistiques : de la perception humaine a la reconnaissance automatique des voyelles“. Grenoble INPG, 1995. http://www.theses.fr/1995INPG0032.


Annotation:

This thesis studies models for integrating auditory and visual information, with the aim of obtaining a plausible and functional model for the audiovisual recognition of French vowels. We review the data in the literature on audiovisual integration in speech perception. We then present four models and classify them according to principles drawn from both experimental psychology and the sensor fusion literature. Our plausibility constraint (conformity with experimental data) eliminates two of the models. The two remaining models are compared against our own results on audiovisual vowel perception and on their functional performance. This comparison shows that a model which projects both inputs (auditory and visual) into an intermediate space of a motor nature is both the most consistent with the experimental data and the best performing for audiovisual vowel recognition in acoustic noise.

Books on the topic "Audiovisual speech processing":

1

Bailly, Gérard, Pascal Perrier and Eric Vatikiotis-Bateson, eds. Audiovisual Speech Processing. Cambridge: Cambridge University Press, 2012. http://dx.doi.org/10.1017/cbo9780511843891.


2

Bailly, G., Eric Vatikiotis-Bateson and Pascal Perrier. Audiovisual speech processing. Cambridge: Cambridge University Press, 2012.


3

Randazzo, Melissa. Audiovisual Integration in Apraxia of Speech: EEG Evidence for Processing Differences. [New York, N.Y.?]: [publisher not identified], 2016.


4

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.


5

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.


6

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.


7

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2015.


8

Vatikiotis-Bateson, Eric, Pascal Perrier and Gérard Bailly. Audiovisual Speech Processing. Cambridge University Press, 2012.


9

Abel, Andrew, and Amir Hussain. Cognitively Inspired Audiovisual Speech Filtering: Towards an Intelligent, Fuzzy Based, Multimodal, Two-Stage Speech Enhancement System. Springer, 2015.


10

Abel, Andrew, and Amir Hussain. Cognitively Inspired Audiovisual Speech Filtering: Towards an Intelligent, Fuzzy Based, Multimodal, Two-Stage Speech Enhancement System. Springer International Publishing AG, 2015.


Book chapters on the topic "Audiovisual speech processing":

1

Riekhakaynen, Elena, and Elena Zatevalova. „Should We Believe Our Eyes or Our Ears? Processing Incongruent Audiovisual Stimuli by Russian Listeners“. In Speech and Computer, 604–15. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-20980-2_51.


2

Aleksic, Petar S., Gerasimos Potamianos and Aggelos K. Katsaggelos. „Audiovisual Speech Processing“. In The Essential Guide to Video Processing, 689–737. Elsevier, 2009. http://dx.doi.org/10.1016/b978-0-12-374456-2.00024-4.


3

Pantic, Maja. „Face for Interface“. In Encyclopedia of Multimedia Technology and Networking, Second Edition, 560–67. IGI Global, 2009. http://dx.doi.org/10.4018/978-1-60566-014-1.ch075.


Annotation:

The human face is involved in an impressive variety of different activities. It houses the majority of our sensory apparatus: eyes, ears, mouth, and nose, allowing the bearer to see, hear, taste, and smell. Apart from these biological functions, the human face provides a number of signals essential for interpersonal communication in our social life. The face houses the speech production apparatus and is used to identify other members of the species, to regulate the conversation by gazing or nodding, and to interpret what has been said by lip reading. It is our direct and naturally preeminent means of communicating and understanding somebody’s affective state and intentions on the basis of the shown facial expression (Lewis & Haviland-Jones, 2000). Personality, attractiveness, age, and gender can also be seen from someone’s face. Thus the face is a multisignal sender/receiver capable of tremendous flexibility and specificity. In general, the face conveys information via four kinds of signals listed in Table 1. Automating the analysis of facial signals, especially rapid facial signals, would be highly beneficial for fields as diverse as security, behavioral science, medicine, communication, and education. In security contexts, facial expressions play a crucial role in establishing or detracting from credibility. In medicine, facial expressions are the direct means to identify when specific mental processes are occurring. In education, pupils’ facial expressions inform the teacher of the need to adjust the instructional message. As far as natural user interfaces between humans and computers (PCs/robots/machines) are concerned, facial expressions provide a way to communicate basic information about needs and demands to the machine. 
In fact, automatic analysis of rapid facial signals seems to have a natural place in various vision subsystems and vision-based interfaces (face-for-interface tools), including automated tools for gaze and focus of attention tracking, lip reading, bimodal speech processing, face/visual speech synthesis, face-based command issuing, and facial affect processing. Where the user is looking (i.e., gaze tracking) can be effectively used to free computer users from the classic keyboard and mouse. Also, certain facial signals (e.g., a wink) can be associated with certain commands (e.g., a mouse click), offering an alternative to traditional keyboard and mouse commands. The human capability to “hear” in noisy environments by means of lip reading is the basis for bimodal (audiovisual) speech processing that can lead to the realization of robust speech-driven interfaces. To make a believable “talking head” (avatar) representing a real person, tracking the person’s facial signals and making the avatar mimic those using synthesized speech and facial expressions is compulsory. The human ability to read emotions from someone’s facial expressions is the basis of facial affect processing that can lead to expanding user interfaces with emotional communication and, in turn, to obtaining more flexible, adaptable, and natural affective interfaces between humans and machines. More specifically, the information about when the existing interaction/processing should be adapted, the importance of such an adaptation, and how the interaction/reasoning should be adapted involves information about how the user feels (e.g., confused, irritated, tired, interested). Examples of affect-sensitive user interfaces are still rare, unfortunately, and include the systems of Lisetti and Nasoz (2002), Maat and Pantic (2006), and Kapoor, Burleson, and Picard (2007).
It is this wide range of principal driving applications that has lent a special impetus to the research problem of automatic facial expression analysis and produced a surge of interest in this research topic.

Conference papers on the topic "Audiovisual speech processing":

1

Vatikiotis-Bateson, E., K. G. Munhall, Y. Kasahara, F. Garcia and H. Yehia. „Characterizing audiovisual information during speech“. In 4th International Conference on Spoken Language Processing (ICSLP 1996). ISCA, 1996. http://dx.doi.org/10.21437/icslp.1996-379.


2

Petridis, Stavros, and Maja Pantic. „Audiovisual discrimination between laughter and speech“. In ICASSP 2008 - 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008. http://dx.doi.org/10.1109/icassp.2008.4518810.


3

Petridis, Stavros, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos and Maja Pantic. „End-to-End Audiovisual Speech Recognition“. In ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018. http://dx.doi.org/10.1109/icassp.2018.8461326.


4

Tran, Tam, Soroosh Mariooryad and Carlos Busso. „Audiovisual corpus to analyze whisper speech“. In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6639243.


5

Katsamanis, Athanassios, George Papandreou and Petros Maragos. „Audiovisual-to-Articulatory Speech Inversion Using HMMs“. In 2007 IEEE 9th Workshop on Multimedia Signal Processing. IEEE, 2007. http://dx.doi.org/10.1109/mmsp.2007.4412915.


6

Rosenblum, Lawrence D. „The perceptual basis for audiovisual speech integration“. In 7th International Conference on Spoken Language Processing (ICSLP 2002). ISCA, 2002. http://dx.doi.org/10.21437/icslp.2002-424.


7

Silva, Samuel, and António Teixeira. „An Anthropomorphic Perspective for Audiovisual Speech Synthesis“. In 10th International Conference on Bio-inspired Systems and Signal Processing. SCITEPRESS - Science and Technology Publications, 2017. http://dx.doi.org/10.5220/0006150201630172.


8

Holt, Rebecca, Laurence Bruggeman and Katherine Demuth. „Audiovisual benefits for speech processing speed among children with hearing loss“. In The 15th International Conference on Auditory-Visual Speech Processing. ISCA, 2019. http://dx.doi.org/10.21437/avsp.2019-10.


9

Matthews, I. A. „Scale based features for audiovisual speech recognition“. In IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication. IEE, 1996. http://dx.doi.org/10.1049/ic:19961152.


10

Ma, Pingchuan, Stavros Petridis and Maja Pantic. „Detecting Adversarial Attacks on Audiovisual Speech Recognition“. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413661.
