Journal articles on the topic "Visual speech model"

To see the other types of publications on this topic, follow the link: Visual speech model.

Create a correct reference in APA, MLA, Chicago, Harvard, and several other citation styles.

Consult the 50 best journal articles for your research on the topic "Visual speech model."

Next to each source in the list of references there is an "Add to bibliography" button. Click on it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever this information is included in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Jia, Xi Bin, et Mei Xia Zheng. « Video Based Visual Speech Feature Model Construction ». Applied Mechanics and Materials 182-183 (juin 2012) : 1367–71. http://dx.doi.org/10.4028/www.scientific.net/amm.182-183.1367.

Abstract:
This paper proposes a solution for constructing a Chinese visual speech feature model based on HMMs. We propose and discuss three representation models of visual speech: lip geometrical features, lip motion features, and lip texture features. A model that combines local LBP and global DCT texture information performs better than either single feature, and likewise a model that combines local LBP and geometrical information outperforms the single features. By computing viseme recognition rates, the paper shows that an HMM describing the dynamics of speech, coupled with a combined feature describing global and local texture, is the best model.
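As an illustration of the kind of pipeline this abstract describes (not the authors' implementation), the following minimal Python sketch extracts LBP and DCT texture features from lip-region frames and trains one Gaussian HMM per viseme class with hmmlearn; the frame data, viseme labels, and all parameter values are hypothetical placeholders.

import numpy as np
from scipy.fft import dct
from skimage.feature import local_binary_pattern
from hmmlearn import hmm

def frame_features(frame, block=4):
    """Concatenate a local LBP histogram with low-frequency global DCT coefficients."""
    lbp = local_binary_pattern(frame, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    coeffs = dct(dct(frame.astype(float), axis=0, norm="ortho"), axis=1, norm="ortho")
    return np.concatenate([lbp_hist, coeffs[:block, :block].ravel()])

def train_viseme_models(sequences_by_viseme, n_states=3):
    """Fit one Gaussian HMM per viseme on that viseme's training sequences."""
    models = {}
    for viseme, seqs in sequences_by_viseme.items():
        X = np.vstack([np.array([frame_features(f) for f in seq]) for seq in seqs])
        lengths = [len(seq) for seq in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[viseme] = m
    return models

def classify(seq, models):
    """Return the viseme whose HMM gives the highest log-likelihood."""
    feats = np.array([frame_features(f) for f in seq])
    return max(models, key=lambda v: models[v].score(feats))

# Toy example: random 32x32 "lip" frames stand in for real video data.
rng = np.random.default_rng(0)
toy = {v: [rng.integers(0, 256, (10, 32, 32)) for _ in range(3)] for v in "aiu"}
models = train_viseme_models(toy)
print(classify(toy["a"][0], models))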
2

Mishra, Saumya, Anup Kumar Gupta et Puneet Gupta. « DARE : Deceiving Audio–Visual speech Recognition model ». Knowledge-Based Systems 232 (novembre 2021) : 107503. http://dx.doi.org/10.1016/j.knosys.2021.107503.

3

Brahme, Aparna, et Umesh Bhadade. « Effect of Various Visual Speech Units on Language Identification Using Visual Speech Recognition ». International Journal of Image and Graphics 20, no 04 (octobre 2020) : 2050029. http://dx.doi.org/10.1142/s0219467820500291.

Abstract:
In this paper, we describe our work on spoken language identification using Visual Speech Recognition (VSR) and analyze the effect of the visual speech units used to transcribe visual speech on language recognition. We propose a new approach of word recognition followed by a word N-gram language model (WRWLM), which uses high-level syntactic features and a word bigram language model for language discrimination. Also, as opposed to the traditional visemic approach, we propose a holistic approach of using the signature of a whole word, referred to as a "Visual Word", as the visual speech unit for transcribing visual speech. The results show a Word Recognition Rate (WRR) of 88% and a Language Recognition Rate (LRR) of 94% in speaker-dependent cases, and 58% WRR and 77% LRR in speaker-independent cases, for an English and Marathi digit classification task. The proposed approach is also evaluated on continuous speech input. The results show that a spoken language identification rate of 50% is possible even though the WRR of visual speech recognition is below 10%, using only 1 s of speech. There is also an improvement of about 5% in language discrimination compared to traditional visemic approaches.
4

Metzger, Brian A. ,., John F. ,. Magnotti, Elizabeth Nesbitt, Daniel Yoshor et Michael S. ,. Beauchamp. « Cross-modal suppression model of speech perception : Visual information drives suppressive interactions between visual and auditory speech in pSTG ». Journal of Vision 20, no 11 (20 octobre 2020) : 434. http://dx.doi.org/10.1167/jov.20.11.434.

5

Hazen, T. J. « Visual model structures and synchrony constraints for audio-visual speech recognition ». IEEE Transactions on Audio, Speech and Language Processing 14, no 3 (mai 2006) : 1082–89. http://dx.doi.org/10.1109/tsa.2005.857572.

6

Fagel, Sascha. « Merging methods of speech visualization ». ZAS Papers in Linguistics 40 (1 janvier 2005) : 19–32. http://dx.doi.org/10.21248/zaspil.40.2005.255.

Abstract:
The author presents MASSY, the MODULAR AUDIOVISUAL SPEECH SYNTHESIZER. The system combines two approaches of visual speech synthesis. Two control models are implemented: a (data based) di-viseme model and a (rule based) dominance model where both produce control commands in a parameterized articulation space. Analogously two visualization methods are implemented: an image based (video-realistic) face model and a 3D synthetic head. Both face models can be driven by both the data based and the rule based articulation model. The high-level visual speech synthesis generates a sequence of control commands for the visible articulation. For every virtual articulator (articulation parameter) the 3D synthetic face model defines a set of displacement vectors for the vertices of the 3D objects of the head. The vertices of the 3D synthetic head then are moved by linear combinations of these displacement vectors to visualize articulation movements. For the image based video synthesis a single reference image is deformed to fit the facial properties derived from the control commands. Facial feature points and facial displacements have to be defined for the reference image. The algorithm can also use an image database with appropriately annotated facial properties. An example database was built automatically from video recordings. Both the 3D synthetic face and the image based face generate visual speech that is capable to increase the intelligibility of audible speech. Other well known image based audiovisual speech synthesis systems like MIKETALK and VIDEO REWRITE concatenate pre-recorded single images or video sequences, respectively. Parametric talking heads like BALDI control a parametric face with a parametric articulation model. The presented system demonstrates the compatibility of parametric and data based visual speech synthesis approaches.
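A minimal sketch of the displacement-vector idea described above, assuming a hypothetical toy mesh and two made-up articulation parameters: the vertices of a neutral head are moved by a weighted linear combination of per-parameter displacement vectors, which is the core operation the 3D face model performs (the real system's meshes and parameters differ).

import numpy as np

n_vertices = 4  # toy mesh; a real head model has thousands of vertices
neutral = np.zeros((n_vertices, 3))  # neutral vertex positions (x, y, z)

# One displacement field per articulation parameter (hypothetical values).
displacements = {
    "lip_opening": np.array([[0, -0.5, 0], [0, 0.5, 0], [0, 0, 0], [0, 0, 0]], float),
    "lip_rounding": np.array([[0.2, 0, 0.1], [0.2, 0, 0.1], [0, 0, 0], [0, 0, 0]], float),
}

def deform(weights):
    """Vertex positions = neutral + sum_i weight_i * displacement_i."""
    out = neutral.copy()
    for name, w in weights.items():
        out += w * displacements[name]
    return out

# Control commands for one animation frame, e.g. from a di-viseme or dominance model.
print(deform({"lip_opening": 0.8, "lip_rounding": 0.3}))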
7

Loh, Marco, Gabriele Schmid, Gustavo Deco et Wolfram Ziegler. « Audiovisual Matching in Speech and Nonspeech Sounds : A Neurodynamical Model ». Journal of Cognitive Neuroscience 22, no 2 (février 2010) : 240–47. http://dx.doi.org/10.1162/jocn.2009.21202.

Abstract:
Audiovisual speech perception provides an opportunity to investigate the mechanisms underlying multimodal processing. By using nonspeech stimuli, it is possible to investigate the degree to which audiovisual processing is specific to the speech domain. It has been shown in a match-to-sample design that matching across modalities is more difficult in the nonspeech domain as compared to the speech domain. We constructed a biophysically realistic neural network model simulating this experimental evidence. We propose that a stronger connection between modalities in speech underlies the behavioral difference between the speech and the nonspeech domain. This could be the result of more extensive experience with speech stimuli. Because the match-to-sample paradigm does not allow us to draw conclusions concerning the integration of auditory and visual information, we also simulated two further conditions based on the same paradigm, which tested the integration of auditory and visual information within a single stimulus. New experimental data for these two conditions support the simulation results and suggest that audiovisual integration of discordant stimuli is stronger in speech than in nonspeech stimuli. According to the simulations, the connection strength between auditory and visual information, on the one hand, determines how well auditory information can be assigned to visual information, and on the other hand, it influences the magnitude of multimodal integration.
8

Yu, Wentao, Steffen Zeiler et Dorothea Kolossa. « Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition ». Sensors 22, no 15 (23 juillet 2022) : 5501. http://dx.doi.org/10.3390/s22155501.

Abstract:
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture—the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
9

How, Chun Kit, Ismail Mohd Khairuddin, Mohd Azraai Mohd Razman, Anwar P. P. Abdul Majeed et Wan Hasbullah Mohd Isa. « Development of Audio-Visual Speech Recognition using Deep-Learning Technique ». MEKATRONIKA 4, no 1 (27 juin 2022) : 88–95. http://dx.doi.org/10.15282/mekatronika.v4i1.8625.

Abstract:
Deep learning is an artificial intelligence (AI) technique that simulates human learning behavior. Audio-visual speech recognition is important for listeners to truly understand the emotions behind spoken words. In this work, two different deep learning models, a Convolutional Neural Network (CNN) and a Deep Neural Network (DNN), were developed to recognize speech emotion from the dataset. The PyTorch framework with the torchaudio library was used. Both models were given the same training, validation, testing, and augmented datasets. Training stops when the training loop reaches ten epochs or when the validation loss does not improve for five epochs. The highest accuracy and lowest loss of the CNN model on the training dataset were 76.50% and 0.006029 respectively, while the DNN model achieved 75.42% and 0.086643. Both models were evaluated using confusion matrices. In conclusion, the CNN model performs better than the DNN model, but still needs improvement, as the accuracy on the testing dataset is low and the loss is high.
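The stopping rule mentioned in the abstract (at most ten epochs, or five epochs without validation improvement) can be written as a small PyTorch loop; everything below, including the tiny model and the random tensors standing in for audio features, is a hypothetical sketch rather than the authors' code.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Random tensors stand in for extracted audio features and emotion labels.
X_train, y_train = torch.randn(256, 40), torch.randint(0, 7, (256,))
X_val, y_val = torch.randn(64, 40), torch.randint(0, 7, (64,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 7))  # toy "DNN"
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, max_epochs = float("inf"), 5, 10
epochs_without_improvement = 0
for epoch in range(max_epochs):          # hard cap of ten epochs
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:   # five epochs with no improvement
        break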
10

Holubenko, Nataliia. « Cognitive and Intersemiotic Model of the Visual and Verbal Modes in a Screen Adaptation to Literary Texts ». World Journal of English Language 12, no 6 (18 juillet 2022) : 129. http://dx.doi.org/10.5430/wjel.v12n6p129.

Abstract:
The aim of the study is to examine screen adaptations from the perspective of cognitive and intersemiotic models of the visual and verbal modes. The purpose of the study is to express the specificity of a screen text which is defined as a combination of three media: speech, image, and music. The scope is to demonstrate the general framework of an intersemiotic translation from a new point of view – like a transliteration. The method of the research refers to semiotic and stylistic analyzes – methods of transformation from one sign system into another from prose works with regard to their cognitive as well as narrative and stylistic features (Zhong, Chen, & Xuan, 2021). Thus, the study analyses such specific relations between the verbal and visual modes in film adaptations of prose literature as a more detailed description of event episodes, events’ temporal structure, presentation of author’s thoughts and characters’ thoughts; their mental activity formulated indirect speech and inner speech that is shown only by the actor’s intonation. The results of the study made possible to show the types of inner speech in their adaptations: author’s thoughts, characters’ thoughts which are presented only by the verbal mode, and visual modes’ inner speeches that combine the modes of character’s voice and image. One can conclude, that taking into account intersemiotic relations between the visual and verbal spaces, it is possible to explain, for instance, how the words of characters are replaced by their facial expressions, gestures, or intonations.
11

Kröger, Bernd J., Julia Gotto, Susanne Albert et Christiane Neuschaefer-Rube. « visual articulatory model and its application to therapy of speech disorders : a pilot study ». ZAS Papers in Linguistics 40 (1 janvier 2005) : 79–94. http://dx.doi.org/10.21248/zaspil.40.2005.259.

Abstract:
A visual articulatory model based on static MRI-data of isolated sounds and its application in therapy of speech disorders is described. The model is capable of generating video sequences of articulatory movements or still images of articulatory target positions within the midsagittal plane. On the basis of this model (1) a visual stimulation technique for the therapy of patients suffering from speech disorders and (2) a rating test for visual recognition of speech movements was developed. Results indicate that patients produce recognition rates above level of chance already without any training and that patients are capable of increasing their recognition rate over the time course of therapy significantly.
12

Li, Dengshi, Yu Gao, Chenyi Zhu, Qianrui Wang et Ruoxi Wang. « Improving Speech Recognition Performance in Noisy Environments by Enhancing Lip Reading Accuracy ». Sensors 23, no 4 (11 février 2023) : 2053. http://dx.doi.org/10.3390/s23042053.

Abstract:
The accuracy of speech recognition now exceeds 97% on various datasets, but it drops sharply in noisy environments, and improving it there is a challenging task. Because visual information is not affected by acoustic noise, researchers often use lip information to help improve speech recognition performance, which makes lip-reading performance and the effectiveness of cross-modal fusion particularly important. In this paper, we try to improve the accuracy of speech recognition in noisy environments by improving both. First, because the same lip movements may correspond to multiple meanings, we construct a one-to-many mapping model between lips and speech, allowing the lip-reading model to consider which articulations could plausibly be represented by the input lip movements. Audio representations are also preserved by modeling the inter-relationships between paired audio-visual representations; at the inference stage, the preserved audio representations can be retrieved from memory through the learned inter-relationships using only video input. Second, a joint cross-fusion model using an attention mechanism can effectively exploit complementary inter-modal relationships; the model calculates cross-attention weights based on the correlations between the joint feature representation and the individual modalities. Finally, our proposed model reduces WER by 4.0% in a −15 dB SNR environment compared to the baseline method, and by 10.1% compared to speech recognition alone. The experimental results show that our method significantly improves on speech recognition models in different noise environments.
13

Yuan, Yuan, Chunlin Tian et Xiaoqiang Lu. « Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition ». IEEE Access 6 (2018) : 5573–83. http://dx.doi.org/10.1109/access.2018.2796118.

14

Slaney, Malcolm, et Richard F. Lyon. « Visual representations of speech—A computer model based on correlation ». Journal of the Acoustical Society of America 88, S1 (novembre 1990) : S23. http://dx.doi.org/10.1121/1.2028916.

15

Edge, James D., Adrian Hilton et Philip Jackson. « Model-Based Synthesis of Visual Speech Movements from 3D Video ». EURASIP Journal on Audio, Speech, and Music Processing 2009 (2009) : 1–12. http://dx.doi.org/10.1155/2009/597267.

16

Sharma, Usha, Sushila Maheshkar, A. N. Mishra et Rahul Kaushik. « Visual Speech Recognition Using Optical Flow and Hidden Markov Model ». Wireless Personal Communications 106, no 4 (10 septembre 2018) : 2129–47. http://dx.doi.org/10.1007/s11277-018-5930-z.

17

Setyati, Endang, Mauridhi Hery Purnomo, Surya Sumpeno et Joan Santoso. « HIDDEN MARKOV MODELS BASED INDONESIAN VISEME MODEL FOR NATURAL SPEECH WITH AFFECTION ». Kursor 8, no 3 (13 décembre 2016) : 102. http://dx.doi.org/10.28961/kursor.v8i3.61.

Abstract:
In communication driven by text input, a viseme (visual phoneme) is derived from a group of phonemes with similar visual appearances. The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification tasks such as speech recognition. For speech emotion recognition, an HMM is trained for each emotion and an unknown sample is classified according to the model that best explains the derived feature sequence; the Viterbi algorithm is used with the HMM to estimate the most probable sequence of hidden states given the observations. In this work, the first stage defines an Indonesian viseme set and the associated mouth shapes, i.e., a system for segmenting the text input. The second stage selects one affection type as an additional input to the system. The last stage experimentally uses trigram HMMs to generate the viseme sequence used for synchronized mouth shapes and lip movements. The whole system is interconnected as a pipeline, and the final system produces a viseme sequence for natural speech of Indonesian sentences with affection. We show through various experiments that the proposed approach results in about 82.19% relative improvement in classification accuracy.
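For readers unfamiliar with the decoding step the abstract refers to, here is a generic Viterbi implementation in plain NumPy (log domain); the toy start, transition, and emission matrices are hypothetical and unrelated to the paper's Indonesian viseme model.

import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable hidden-state path for a sequence of observation symbols."""
    n_states = log_start.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)  # best log-score ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                    # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol model.
log_start = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], log_start, log_trans, log_emit))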
18

Seo, Minji, et Myungho Kim. « Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition ». Sensors 20, no 19 (28 septembre 2020) : 5559. http://dx.doi.org/10.3390/s20195559.

Abstract:
Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested using different datasets, they have shown performance reduction. Cross-corpus SER research identifies speech emotion using different corpora and languages. Recent cross-corpus SER research has been conducted to improve generalization. To improve the cross-corpus SER performance, we pretrained the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train the target dataset, we extracted the feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% compared to existing state-of-the-art cross-corpus SER approaches.
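A minimal, hypothetical sketch of the bag-of-visual-words step described above: local patches of a log-mel spectrogram are clustered with k-means to form a visual vocabulary, and each spectrogram is then represented by a frequency histogram of its patches' cluster assignments (the real VACNN features, patch sizes, and vocabulary size in the paper differ).

import numpy as np
from sklearn.cluster import KMeans

def extract_patches(spec, patch=8, stride=4):
    """Flatten overlapping patch x patch blocks of a log-mel spectrogram."""
    H, W = spec.shape
    return np.array([spec[i:i + patch, j:j + patch].ravel()
                     for i in range(0, H - patch + 1, stride)
                     for j in range(0, W - patch + 1, stride)])

rng = np.random.default_rng(0)
specs = [rng.standard_normal((64, 128)) for _ in range(20)]  # fake log-mel spectrograms

# 1) Build the visual vocabulary from all training patches.
vocab_size = 32
kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=0)
kmeans.fit(np.vstack([extract_patches(s) for s in specs]))

# 2) Encode each spectrogram as a normalized histogram of visual words.
def bovw_histogram(spec):
    words = kmeans.predict(extract_patches(spec))
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

print(bovw_histogram(specs[0]).shape)  # (32,) feature vector to assist the fine-tuned CNN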
19

Hertrich, Ingo, Susanne Dietrich et Hermann Ackermann. « Cross-modal Interactions during Perception of Audiovisual Speech and Nonspeech Signals : An fMRI Study ». Journal of Cognitive Neuroscience 23, no 1 (janvier 2011) : 221–37. http://dx.doi.org/10.1162/jocn.2010.21421.

Abstract:
During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream—prior to its fusion with auditory phonological features [Hertrich, I., Mathiak, K., Lutzenberger, W., & Ackermann, H. Time course of early audiovisual interactions during speech and non-speech central-auditory processing: An MEG study. Journal of Cognitive Neuroscience, 21, 259–274, 2009]. Using functional magnetic resonance imaging, the present follow-up study aims to further elucidate the topographic distribution of visual–phonological operations and audiovisual (AV) interactions during speech perception. Ambiguous acoustic syllables—disambiguated to /pa/ or /ta/ by the visual channel (speaking face)—served as test materials, concomitant with various control conditions (nonspeech AV signals, visual-only and acoustic-only speech, and nonspeech stimuli). (i) Visual speech yielded an AV-subadditive activation of primary auditory cortex and the anterior superior temporal gyrus (STG), whereas the posterior STG responded both to speech and nonspeech motion. (ii) The inferior frontal and the fusiform gyrus of the right hemisphere showed a strong phonetic/phonological impact (differential effects of visual /pa/ vs. /ta/) upon hemodynamic activation during presentation of speaking faces. Taken together with the previous MEG data, these results point at a dual-pathway model of visual speech information processing: On the one hand, access to the auditory system via the anterior supratemporal “what” path may give rise to direct activation of “auditory objects.” On the other hand, visual speech information seems to be represented in a right-hemisphere visual working memory, providing a potential basis for later interactions with auditory information such as the McGurk effect.
20

Blackburn, Catherine L., Pádraig T. Kitterick, Gary Jones, Christian J. Sumner et Paula C. Stacey. « Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers ». Trends in Hearing 23 (janvier 2019) : 233121651983786. http://dx.doi.org/10.1177/2331216519837866.

Abstract:
Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker. This is of particular value to hearing impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility. How these factors impact on the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers. For vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that visual speech benefit depended upon the audio intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, similar to audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
21

Fleming, Luke. « Negating speech ». Gesture 14, no 3 (31 décembre 2014) : 263–96. http://dx.doi.org/10.1075/gest.14.3.01fle.

Abstract:
With the exception of Plains Indian Sign Language and Pacific Northwest sawmill sign languages, highly developed alternate sign languages (sign languages typically employed by and for the hearing) share not only common structural linguistic features, but their use is also characterized by convergent ideological commitments concerning communicative medium and linguistic modality. Though both modalities encode comparable denotational content, speaker-signers tend to understand manual-visual sign as a pragmatically appropriate substitute for oral-aural speech. This paper suggests that two understudied clusters of alternate sign languages, Armenian and Cape York Peninsula sign languages, offer a general model for the development of alternate sign languages, one in which the gesture-to-sign continuum is dialectically linked to hypertrophied forms of interactional avoidance up-to-and-including complete silence in the co-presence of affinal relations. These cases illustrate that the pragmatic appropriateness of sign over speech relies upon local semiotic ideologies which tend to conceptualize the manual-visual linguistic modality on analogy to the gestural communication employed in interactional avoidance, and thus as not counting as true language.
22

Nikolaus, Mitja, Afra Alishahi et Grzegorz Chrupała. « Learning English with Peppa Pig ». Transactions of the Association for Computational Linguistics 10 (2022) : 922–36. http://dx.doi.org/10.1162/tacl_a_00498.

Abstract:
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
23

Ryumin, Dmitry, Denis Ivanko et Elena Ryumina. « Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices ». Sensors 23, no 4 (17 février 2023) : 2284. http://dx.doi.org/10.3390/s23042284.

Abstract:
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.
24

Yang, Chih-Chun, Wan-Cyuan Fan, Cheng-Fu Yang et Yu-Chiang Frank Wang. « Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation ». Proceedings of the AAAI Conference on Artificial Intelligence 36, no 3 (28 juin 2022) : 3036–44. http://dx.doi.org/10.1609/aaai.v36i3.20210.

Abstract:
As a key characteristic in audio-visual speech recognition (AVSR), relating linguistic information observed across visual and audio data has been a challenge, benefiting not only audio/visual speech recognition (ASR/VSR) but also for manipulating data within/across modalities. In this paper, we present a feature disentanglement-based framework for jointly addressing the above tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual or audio-based linguistic features into modality-agnostic representations. Such derived linguistic representations not only allow one to perform ASR, VSR, and AVSR, but also to manipulate audio and visual data output based on the desirable subject identity and linguistic content information. We perform extensive experiments on different recognition and synthesis tasks to show that our model performs favorably against state-of-the-art approaches on each individual task, while ours is a unified solution that is able to jointly tackle the aforementioned audio-visual learning tasks.
25

Wang, Dong, Bing Liu, Yong Zhou, Mingming Liu, Peng Liu et Rui Yao. « Separate Syntax and Semantics : Part-of-Speech-Guided Transformer for Image Captioning ». Applied Sciences 12, no 23 (22 novembre 2022) : 11875. http://dx.doi.org/10.3390/app122311875.

Abstract:
Transformer-based image captioning models have recently achieved remarkable performance by using new fully attentive paradigms. However, existing models generally follow the conventional language model of predicting the next word conditioned on the visual features and partially generated words. They treat the predictions of visual and nonvisual words equally and usually tend to produce generic captions. To address these issues, we propose a novel part-of-speech-guided transformer (PoS-Transformer) framework for image captioning. Specifically, a self-attention part-of-speech prediction network is first presented to model the part-of-speech tag sequences for the corresponding image captions. Then, different attention mechanisms are constructed for the decoder to guide the caption generation by using the part-of-speech information. Benefiting from the part-of-speech guiding mechanisms, the proposed framework not only adaptively adjusts the weights between visual features and language signals for the word prediction, but also facilitates the generation of more fine-grained and grounded captions. Finally, a multitask learning is introduced to train the whole PoS-Transformer network in an end-to-end manner. Our model was trained and tested on the MSCOCO and Flickr30k datasets with the experimental evaluation standard CIDEr scores of 1.299 and 0.612, respectively. The qualitative experimental results indicated that the captions generated by our method conformed to the grammatical rules better.
26

Biswas, Astik, P. K. Sahu et Mahesh Chandra. « Multiple cameras audio visual speech recognition using active appearance model visual features in car environment ». International Journal of Speech Technology 19, no 1 (23 janvier 2016) : 159–71. http://dx.doi.org/10.1007/s10772-016-9332-x.

27

Lindborg, Alma, et Tobias S. Andersen. « Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception ». PLOS ONE 16, no 2 (19 février 2021) : e0246986. http://dx.doi.org/10.1371/journal.pone.0246986.

Abstract:
Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is the case of the McGurk illusion, where an auditory stimulus such as e.g. “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion case can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the information reliability of each sensory cue). However, there is to date no study which has accounted for how they each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and the fusion stages simultaneously. This is done by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that they can both account for the enhancement effect in congruent audiovisual speech, as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
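To make the "fusion informed by reliability" idea concrete, here is a tiny sketch of optimal fusion of two Gaussian cues (the standard maximum-likelihood cue-combination rule); it illustrates only the fusion stage, with made-up numbers, and does not reproduce the binding stage or the specific models fitted in this paper.

import numpy as np

def fuse_gaussian_cues(mu_a, var_a, mu_v, var_v):
    """Reliability-weighted fusion of an auditory and a visual estimate.

    Each cue is weighted by its reliability (inverse variance), so the less
    noisy cue dominates the fused percept.
    """
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    w_v = 1 - w_a
    mu_fused = w_a * mu_a + w_v * mu_v
    var_fused = 1 / (1 / var_a + 1 / var_v)
    return mu_fused, var_fused

# Faint (noisy) auditory cue near one category, clearer visual cue near another:
print(fuse_gaussian_cues(mu_a=0.2, var_a=1.0, mu_v=0.8, var_v=0.1))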
28

Zeliang Zhang, Xiongfei Li et Chengjia Yang. « Visual Speech Recognition based on Improved type of Hidden Markov Model ». Journal of Convergence Information Technology 7, no 13 (31 juillet 2012) : 119–26. http://dx.doi.org/10.4156/jcit.vol7.issue13.14.

29

HONG, PENGYU, ZHEN WEN et THOMAS S. HUANG. « iFACE : A 3D SYNTHETIC TALKING FACE ». International Journal of Image and Graphics 01, no 01 (janvier 2001) : 19–26. http://dx.doi.org/10.1142/s0219467801000037.

Abstract:
We present the iFACE system, a visual speech synthesizer that provides a form of virtual face-to-face communication. The system provides an interactive tool for the user to customize a graphic head model for the virtual agent of a person based on his/her range data. The texture is mapped onto the customized model to achieve a realistic appearance. Face animations are produced by using text stream or speech stream to drive the model. A set of basic facial shapes and head action is manually built and used to synthesize expressive visual speech based on rules.
30

Zhou, Hang, Yu Liu, Ziwei Liu, Ping Luo et Xiaogang Wang. « Talking Face Generation by Adversarially Disentangled Audio-Visual Representation ». Proceedings of the AAAI Conference on Artificial Intelligence 33 (17 juillet 2019) : 9299–306. http://dx.doi.org/10.1609/aaai.v33i01.33019299.

Abstract:
Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance model on specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.
31

He, Yibo, Kah Phooi Seng et Li Minn Ang. « Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild ». Sensors 23, no 4 (7 février 2023) : 1834. http://dx.doi.org/10.3390/s23041834.

Abstract:
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2—Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.
32

Jeon, Sanghun, et Mun Sang Kim. « Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications ». Sensors 22, no 20 (12 octobre 2022) : 7738. http://dx.doi.org/10.3390/s22207738.

Abstract:
Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user–system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.
33

Hertrich, Ingo, Klaus Mathiak, Werner Lutzenberger et Hermann Ackermann. « Time Course of Early Audiovisual Interactions during Speech and Nonspeech Central Auditory Processing : A Magnetoencephalography Study ». Journal of Cognitive Neuroscience 21, no 2 (février 2009) : 259–74. http://dx.doi.org/10.1162/jocn.2008.21019.

Abstract:
Cross-modal fusion phenomena suggest specific interactions of auditory and visual sensory information both within the speech and nonspeech domains. Using whole-head magnetoencephalography, this study recorded M50 and M100 fields evoked by ambiguous acoustic stimuli that were visually disambiguated to perceived /ta/ or /pa/ syllables. As in natural speech, visual motion onset preceded the acoustic signal by 150 msec. Control conditions included visual and acoustic nonspeech signals as well as visual-only and acoustic-only stimuli. (a) Both speech and nonspeech motion yielded a consistent attenuation of the auditory M50 field, suggesting a visually induced “preparatory baseline shift” at the level of the auditory cortex. (b) Within the temporal domain of the auditory M100 field, visual speech and nonspeech motion gave rise to different response patterns (nonspeech: M100 attenuation; visual /pa/: left-hemisphere M100 enhancement; /ta/: no effect). (c) These interactions could be further decomposed using a six-dipole model. One of these three pairs of dipoles (V270) was fitted to motion-induced activity at a latency of 270 msec after motion onset, that is, the time domain of the auditory M100 field, and could be attributed to the posterior insula. This dipole source responded to nonspeech motion and visual /pa/, but was found suppressed in the case of visual /ta/. Such a nonlinear interaction might reflect the operation of a binary distinction between the marked phonological feature “labial” versus its underspecified competitor “coronal.” Thus, visual processing seems to be shaped by linguistic data structures even prior to its fusion with auditory information channel.
34

Bielski, Lynn M., et Charissa R. Lansing. « Utility of the Baddeley and Hitch Model of Short-Term Working Memory To Investigate Spoken Language Understanding : A Tutorial. » Perspectives on Aural Rehabilitation and Its Instrumentation 19, no 1 (mai 2012) : 25–33. http://dx.doi.org/10.1044/arii19.1.25.

Abstract:
Spoken speech understanding can be challenging, particularly in the presence of competing information such as background noise. Researchers have shown that dynamic observable phonetic facial cues improve speech understanding in both quiet and noise. Additionally, cognitive functions such as short-term working memory influence spoken language understanding. Currently, we do not know the utility of visual cues for the improvement of spoken language understanding. Although there are many theoretical models of short-term memory, the Baddeley and Hitch (1974) multicomponent model of short-term working memory is well-suited as a cognitive framework through which the utility of visual cues in spoken language understanding could be investigated. In this tutorial, we will describe the components of the Baddeley and Hitch model, illustrate their contributions to spoken language understanding, and provide possible applications for the model.
35

Anwar, Miftahulkhairah, Fathiaty Murtadho, Endry Boeriswati, Gusti Yarmi et Helvy Tiana Rosa. « analysis model of impolite Indonesian language use ». Linguistics and Culture Review 5, S3 (5 décembre 2021) : 1426–41. http://dx.doi.org/10.21744/lingcure.v5ns3.1840.

Abstract:
This research was based on the reality of the use of Indonesian language on social media that was vulgar, destructive, full of blasphemy, scorn, sarcasm, and tended to be provocative. This condition has destructive power because it spreads very quickly and is capable of arousing very strong emotions. This article aimed at presenting the results of research on the analysis model of impolite Indonesian language use. This model was developed from tracing status on social media which included language impoliteness in 2019. The novelty of this analysis model was that it involved a factor of power that allowed the appearance of such impolite speech. Therefore, this model is composed of several stages. First, presenting text in the form of spoken, written, and visual texts. Second, transcribing texts. Third, interpreting language impoliteness. At the interpreting stage, the impoliteness of the speeches was carried out by: (1) analyzing the contexts, (2) analyzing the power, (3) analyzing the dictions and language styles that contained impoliteness, (4) analyzing ethical speech acts, and (5) manipulating language politeness. From these language manipulation efforts, they were made to habituate language discipline to create a polite language society.
36

Handa, Anand, Rashi Agarwal et Narendra Kohli. « Audio-Visual Emotion Recognition System Using Multi-Modal Features ». International Journal of Cognitive Informatics and Natural Intelligence 15, no 4 (octobre 2021) : 1–14. http://dx.doi.org/10.4018/ijcini.20211001.oa34.

Abstract:
Due to the highly variant face geometry and appearances, Facial Expression Recognition (FER) is still a challenging problem. CNN can characterize 2-D signals. Therefore, for emotion recognition in a video, the authors propose a feature selection model in AlexNet architecture to extract and filter facial features automatically. Similarly, for emotion recognition in audio, the authors use a deep LSTM-RNN. Finally, they propose a probabilistic model for the fusion of audio and visual models using facial features and speech of a subject. The model combines all the extracted features and use them to train the linear SVM (Support Vector Machine) classifiers. The proposed model outperforms the other existing models and achieves state-of-the-art performance for audio, visual and fusion models. The model classifies the seven known facial expressions, namely anger, happy, surprise, fear, disgust, sad, and neutral on the eNTERFACE’05 dataset with an overall accuracy of 76.61%.
37

Miller, Christi W., Erin K. Stewart, Yu-Hsiang Wu, Christopher Bishop, Ruth A. Bentler et Kelly Tremblay. « Working Memory and Speech Recognition in Noise Under Ecologically Relevant Listening Conditions : Effects of Visual Cues and Noise Type Among Adults With Hearing Loss ». Journal of Speech, Language, and Hearing Research 60, no 8 (18 août 2017) : 2310–20. http://dx.doi.org/10.1044/2017_jslhr-h-16-0284.

Abstract:
Purpose This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Method Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. Results A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. Conclusion The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed.
38

Kusmana, Suherli, Endang Kasupardi et Nunu Nurasa. « PENGARUH MODEL PEMBELAJARAN BERBASIS MASALAH MELALUI MEDIA AUDIO VISUAL TERHADAP PENINGKATAN KEMAMPUAN BERPIDATO SISWA KELAS IX SMP NEGERI 1 NUSAHERANG KABUPATEN KUNINGAN ». Jurnal Tuturan 3, no 1 (28 novembre 2017) : 419. http://dx.doi.org/10.33603/jt.v3i1.776.

Abstract:
The speech (oration) ability of grade IX students at SMP Negeri 1 Nusaherang, Kuningan Regency, is not yet well developed and has not reached optimal results; the students' speaking ability is still low. This is caused by teaching that is not well matched to the students' characteristics. The aims of this research are to describe the effectiveness of a problem-based learning model delivered through audio-visual media on the speech ability of grade IX students at SMP Negeri 1 Nusaherang, to describe its influence on that ability, and to describe the students' responses to its use. The research used an experimental method with a pretest-posttest control group design consisting of two groups: problem-based learning was applied in the experimental group, while the control group was taught by demonstration, and measurements were taken after the treatments. The results indicate that problem-based learning using audio-visual media is more effective in improving students' speech ability; the students learned the material more cooperatively, and the coefficient of 0.875^2 = 0.76 (76%) indicates that the students' speech ability was influenced by implementing problem-based learning through audio-visual media. Most of the students agreed with and responded positively to this approach. The benefits of using it are: (1) increased student motivation, (2) increased student creativity, (3) avoidance of boredom in learning, and (4) improved respect for other people's opinions.
39

Uhler, Kristin M., Rosalinda Baca, Emily Dudas et Tammy Fredrickson. « Refining Stimulus Parameters in Assessing Infant Speech Perception Using Visual Reinforcement Infant Speech Discrimination : Sensation Level ». Journal of the American Academy of Audiology 26, no 10 (novembre 2015) : 807–14. http://dx.doi.org/10.3766/jaaa.14093.

Abstract:
Background: Speech perception measures have long been considered an integral piece of the audiological assessment battery. Currently, a prelinguistic, standardized measure of speech perception is missing in the clinical assessment battery for infants and young toddlers. Such a measure would allow systematic assessment of speech perception abilities of infants as well as the potential to investigate the impact early identification of hearing loss and early fitting of amplification have on the auditory pathways. Purpose: To investigate the impact of sensation level (SL) on the ability of infants with normal hearing (NH) to discriminate /a-i/ and /ba-da/ and to determine if performance on the two contrasts are significantly different in predicting the discrimination criterion. Research Design: The design was based on a survival analysis model for event occurrence and a repeated measures logistic model for binary outcomes. The outcome for survival analysis was the minimum SL for criterion and the outcome for the logistic regression model was the presence/absence of achieving the criterion. Criterion achievement was designated when an infant’s proportion correct score was >0.75 on the discrimination performance task. Study Sample: Twenty-two infants with NH sensitivity participated in this study. There were 9 males and 13 females, aged 6–14 mo. Data Collection and Analysis: Testing took place over two to three sessions. The first session consisted of a hearing test, threshold assessment of the two speech sounds (/a/ and /i/), and if time and attention allowed, visual reinforcement infant speech discrimination (VRISD). The second session consisted of VRISD assessment for the two test contrasts (/a-i/ and /ba-da/). The presentation level started at 50 dBA. If the infant was unable to successfully achieve criterion (>0.75) at 50 dBA, the presentation level was increased to 70 dBA followed by 60 dBA. Data examination included an event analysis, which provided the probability of criterion distribution across SL. The second stage of the analysis was a repeated measures logistic regression where SL and contrast were used to predict the likelihood of speech discrimination criterion. Results: Infants were able to reach criterion for the /a-i/ contrast at statistically lower SLs when compared to /ba-da/. There were six infants who never reached criterion for /ba-da/ and one never reached criterion for /a-i/. The conditional probability of not reaching criterion by 70 dB SL was 0% for /a-i/ and 21% for /ba-da/. The predictive logistic regression model showed that children were more likely to discriminate the /a-i/ even when controlling for SL. Conclusions: Nearly all normal-hearing infants can demonstrate discrimination criterion of a vowel contrast at 60 dB SL, while a level of ≥70 dB SL may be needed to allow all infants to demonstrate discrimination criterion of a difficult consonant contrast.
40

Gao, Ying, Yuqin Liu et Chunyue Zhou. « Production and Interaction between Gesture and Speech : A Review ». International Journal of English Linguistics 6, no 2 (29 mars 2016) : 131. http://dx.doi.org/10.5539/ijel.v6n2p131.

Abstract:
Gesture has been widely studied in multimodal research recently, and how gesture interacts with speech in communication is the focus of most of this work. This paper introduces and compares several hypotheses and models of the production of, and interaction between, gesture and speech. We find that it is generally agreed that the speech production mechanism can be explained by Levelt's Model, while there is no consensus about gesture production or the interaction between gesture and speech. Most theories argue that gesture stems from visual-spatial images in working memory; some models assume an interactive relationship, while others consider there to be no interaction between gesture and speech. Further research is needed on both the theoretical and the applied aspects.
41

Massaro, Dominic W., et Michael M. Cohen. « Perception of Synthesized Audible and Visible Speech ». Psychological Science 1, no 1 (janvier 1990) : 55–63. http://dx.doi.org/10.1111/j.1467-9280.1990.tb00068.x.

Texte intégral
Résumé :
The research reported in this paper uses novel stimuli to study how speech perception is influenced by information presented to ear and eye. Auditory and visual sources of information (syllables) were synthesized and presented in isolation or in factorial combination. A five-step continuum between the syllables /ba/ and /da/ was synthesized along both auditory and visual dimensions, by varying properties of the syllable at its onset. The onsets of the second and third formants were manipulated in the audible speech. For the visible speech, the shape of the lips and the jaw position at the onset of the syllable were manipulated. Subjects’ identification judgments of the test syllables presented on videotape were influenced by both auditory and visual information. The results were used to test between a fuzzy logical model of speech perception (FLMP) and a categorical model of perception (CMP). These tests indicate that evaluation and integration of the two sources of information makes available continuous as opposed to just categorical information. In addition, the integration of the two sources appears to be nonadditive in that the least ambiguous source has the largest impact on the judgment. The two sources of information appear to be evaluated, integrated, and identified as described by the FLMP, an optimal algorithm for combining information from multiple sources. The research provides a theoretical framework for understanding the improvement in speech perception by hearing-impaired listeners when auditory speech is supplemented with other sources of information.
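For a two-alternative /ba/–/da/ task, the FLMP integration rule referred to in this abstract is commonly written as a multiplicative combination of auditory and visual support followed by relative-goodness normalization. The short sketch below illustrates that standard formulation, not the authors' implementation, using illustrative (not fitted) support values.

```python
# Minimal sketch of the FLMP integration rule for a two-alternative task.
def flmp_p_da(a_da: float, v_da: float) -> float:
    """Probability of a /da/ response given auditory support a_da and visual
    support v_da (fuzzy truth values in [0, 1]); support for /ba/ is taken
    as the complement of support for /da/."""
    support_da = a_da * v_da
    support_ba = (1.0 - a_da) * (1.0 - v_da)
    return support_da / (support_da + support_ba)

# An ambiguous auditory token (0.5) paired with clearly /da/-like visible
# articulation (0.9) yields a strongly /da/ response, illustrating how the
# least ambiguous source dominates the judgment.
print(flmp_p_da(0.5, 0.9))   # 0.9
print(flmp_p_da(0.2, 0.9))   # ~0.69
```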
Styles APA, Harvard, Vancouver, ISO, etc.
42

J., Esra, et Diyar H. « Audio Visual Arabic Speech Recognition using KNN Model by Testing different Audio Features ». International Journal of Computer Applications 180, no 1 (15 décembre 2017) : 33–38. http://dx.doi.org/10.5120/ijca2017915901.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
43

Lü, Guo-yun, Dong-mei Jiang, Yan-ning Zhang, Rong-chun Zhao, H. Sahli, Ilse Ravyse et W. Verhelst. « DBN Based Multi-stream Multi-states Model for Continue Audio-Visual Speech Recognition ». Journal of Electronics & Information Technology 30, no 12 (22 avril 2011) : 2906–11. http://dx.doi.org/10.3724/sp.j.1146.2007.00915.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
44

Deena, Salil, Shaobo Hou et Aphrodite Galata. « Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model ». IEEE Transactions on Multimedia 15, no 8 (décembre 2013) : 1755–68. http://dx.doi.org/10.1109/tmm.2013.2279659.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
45

Gogate, Mandar, Kia Dashtipour, Ahsan Adeel et Amir Hussain. « CochleaNet : A robust language-independent audio-visual model for real-time speech enhancement ». Information Fusion 63 (novembre 2020) : 273–85. http://dx.doi.org/10.1016/j.inffus.2020.04.001.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
46

Hazra, Sumon Kumar, Romana Rahman Ema, Syed Md Galib, Shalauddin Kabir et Nasim Adnan. « Emotion recognition of human speech using deep learning method and MFCC features ». Radioelectronic and Computer Systems, no 4 (29 novembre 2022) : 161–72. http://dx.doi.org/10.32620/reks.2022.4.13.

Texte intégral
Résumé :
Subject matter: Speech emotion recognition (SER) is an ongoing and interesting research topic. Its purpose is to establish interactions between humans and computers through speech and emotion. To recognize speech emotions, five deep learning models are used in this paper: a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), an Artificial Neural Network, a Multi-Layer Perceptron, and a merged CNN-LSTM network. The Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets were used for this system, and the models were trained on three merged combinations: TESS+SAVEE, TESS+RAVDESS, and TESS+SAVEE+RAVDESS. These datasets contain numerous audio recordings spoken by both male and female speakers of English. This paper classifies seven emotions (sadness, happiness, anger, fear, disgust, neutral, and surprise), which is challenging when both male and female data are included: most previous work has used male-only or female-only speech, and combined male-female datasets have yielded low accuracy in emotion detection tasks. Features must be extracted from the audio data to train a deep learning model, and Mel Frequency Cepstral Coefficients (MFCCs) provide the necessary features for speech emotion classification. After training the five models on the three merged datasets, the best accuracy of 84.35 % is achieved by CNN-LSTM on the TESS+SAVEE dataset.
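The MFCC front end mentioned in this abstract can be sketched as follows; this is a minimal illustration using librosa (not the authors' code), with a hypothetical file name and illustrative parameter values.

```python
import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio file and return a fixed-length MFCC feature vector
    by averaging each coefficient over time."""
    signal, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                     # (n_mfcc,)

# features = extract_mfcc("tess_sample.wav")  # hypothetical file; the vector
# would then feed a CNN, LSTM, or merged CNN-LSTM emotion classifier.
```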
Styles APA, Harvard, Vancouver, ISO, etc.
47

Brancazio, Lawrence, et Carol A. Fowler. « Merging auditory and visual phonetic information : A critical test for feedback ? » Behavioral and Brain Sciences 23, no 3 (juin 2000) : 327–28. http://dx.doi.org/10.1017/s0140525x00243240.

Texte intégral
Résumé :
The present description of the Merge model addresses only auditory, not audiovisual, speech perception. However, recent findings in the audiovisual domain are relevant to the model. We outline a test that we are conducting of the adequacy of Merge, modified to accept visual information about articulation.
Styles APA, Harvard, Vancouver, ISO, etc.
48

Vougioukas, Konstantinos, Stavros Petridis et Maja Pantic. « Realistic Speech-Driven Facial Animation with GANs ». International Journal of Computer Vision 128, no 5 (13 octobre 2019) : 1398–413. http://dx.doi.org/10.1007/s11263-019-01251-8.

Texte intégral
Résumé :
Abstract Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic albeit subject dependent results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses 3 discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model using an ablation study and we provide insights into the latent representation of the model. The generated videos are evaluated based on sharpness, reconstruction quality, lip-reading accuracy, synchronization as well as their ability to generate natural blinks.
Styles APA, Harvard, Vancouver, ISO, etc.
49

Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia et Jörg Tiedemann. « Multimodal machine translation through visuals and speech ». Machine Translation 34, no 2-3 (13 août 2020) : 97–147. http://dx.doi.org/10.1007/s10590-020-09250-0.

Texte intégral
Résumé :
Abstract Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
Styles APA, Harvard, Vancouver, ISO, etc.
50

Indira, D. N. V. S. L. S., et al. « An Enhanced CNN-2D for Audio-Visual Emotion Recognition (AVER) Using ADAM Optimizer ». Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no 5 (11 avril 2021) : 1378–88. http://dx.doi.org/10.17762/turcomat.v12i5.2030.

Texte intégral
Résumé :
Recent developments in audio-visual emotion recognition (AVER) have identified the importance of integrating visual components into the speech recognition process to improve robustness. Visual characteristics have strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. CNNs work very well with images, and an audio file can be converted into an image-like representation, such as a spectrogram, whose frequency content can be exploited to extract hidden knowledge. This paper provides a method for emotional expression recognition using spectrograms and a two-dimensional CNN (CNN-2D). Spectrograms formed from the speech signals serve as the CNN-2D input. The proposed model, which consists of three types of CNN layers (convolution, pooling, and fully connected layers), extracts discriminative characteristics from the spectrogram representations and estimates performance for the seven emotions. This article compares the output with existing SER approaches that use audio files and a CNN; accuracy is improved by 6.5% when CNN-2D is used.
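The spectrogram-plus-CNN-2D pipeline described in this abstract can be sketched roughly as follows. The snippet is a minimal illustration (not the paper's implementation) using librosa and tf.keras; the input shape, layer sizes, and learning rate are assumptions.

```python
import numpy as np
import librosa
import tensorflow as tf

def speech_to_spectrogram(path: str) -> np.ndarray:
    """Return a log-mel spectrogram as a single-channel 'image'."""
    signal, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
    return librosa.power_to_db(mel)[..., np.newaxis]  # (128, frames, 1)

# Spectrograms would be cropped or padded to a fixed 128x128 'image' before
# training. A small CNN-2D with convolution, pooling, and fully connected
# layers ends in a 7-way softmax for the seven emotion classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```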
Styles APA, Harvard, Vancouver, ISO, etc.