Journal articles on the topic 'Visual speech model'

To see the other types of publications on this topic, follow the link: Visual speech model.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Visual speech model.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Jia, Xi Bin, and Mei Xia Zheng. "Video Based Visual Speech Feature Model Construction." Applied Mechanics and Materials 182-183 (June 2012): 1367–71. http://dx.doi.org/10.4028/www.scientific.net/amm.182-183.1367.

Abstract:
This paper presents a solution for constructing a Chinese visual speech feature model based on HMMs. We propose and discuss three kinds of representation model for visual speech: lip geometrical features, lip motion features, and lip texture features. A model that combines the advantages of local LBP and global DCT texture information performs better than either single feature; likewise, a model that combines local LBP with geometrical information outperforms the single features. By computing the viseme recognition rates of the models, the paper shows that an HMM, which describes the dynamics of speech, coupled with the combined feature describing global and local texture, is the best model.
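
As a rough illustration of the kind of pipeline this abstract describes (combined local LBP and global DCT lip-texture features whose dynamics are modelled by an HMM), the sketch below is not the authors' implementation; the library choices (scikit-image, SciPy, hmmlearn) and all parameter values are assumptions.

```python
# Illustrative sketch only: combined LBP + DCT lip-texture features modelled by an HMM.
# Library choices and parameters are assumptions, not taken from the cited paper.
import numpy as np
from scipy.fft import dct
from skimage.feature import local_binary_pattern
from hmmlearn import hmm

def lip_frame_features(gray_roi, lbp_points=8, lbp_radius=1, n_dct=36):
    """Concatenate a local LBP histogram with low-frequency global DCT coefficients."""
    lbp = local_binary_pattern(gray_roi, lbp_points, lbp_radius, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=lbp_points + 2, range=(0, lbp_points + 2), density=True)
    coeffs = dct(dct(gray_roi.astype(float), axis=0, norm="ortho"), axis=1, norm="ortho")
    dct_feat = coeffs[:6, :6].flatten()[:n_dct]  # keep the low-frequency block
    return np.concatenate([lbp_hist, dct_feat])

def train_viseme_hmm(sequences, n_states=3):
    """Fit one Gaussian HMM per viseme from a list of frame-feature sequences."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

# Recognition: score a test sequence against each viseme model and pick the best, e.g.
# best_viseme = max(models, key=lambda v: models[v].score(test_sequence))
```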
2

Mishra, Saumya, Anup Kumar Gupta, and Puneet Gupta. "DARE: Deceiving Audio–Visual speech Recognition model." Knowledge-Based Systems 232 (November 2021): 107503. http://dx.doi.org/10.1016/j.knosys.2021.107503.

3

Brahme, Aparna, and Umesh Bhadade. "Effect of Various Visual Speech Units on Language Identification Using Visual Speech Recognition." International Journal of Image and Graphics 20, no. 04 (October 2020): 2050029. http://dx.doi.org/10.1142/s0219467820500291.

Abstract:
In this paper, we describe our work on spoken language identification using Visual Speech Recognition (VSR) and analyze the effect of the visual speech units used to transcribe the visual speech on language recognition. We propose a new approach of word recognition followed by a word N-gram language model (WRWLM), which uses high-level syntactic features and a word bigram language model for language discrimination. Also, as opposed to the traditional visemic approach, we propose a holistic approach that uses the signature of a whole word, referred to as a "Visual Word", as the visual speech unit for transcribing visual speech. The results show a Word Recognition Rate (WRR) of 88% and a Language Recognition Rate (LRR) of 94% in speaker-dependent cases, and 58% WRR and 77% LRR in speaker-independent cases, for an English and Marathi digit classification task. The proposed approach is also evaluated on continuous speech input. The results show that a spoken language identification rate of 50% is possible even though the WRR using visual speech recognition is below 10%, using only 1 s of speech. There is also an improvement of about 5% in language discrimination compared to traditional visemic approaches.
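
The word-bigram language-model stage mentioned in this abstract can be illustrated with a minimal sketch (not the authors' code): recognized word sequences are scored under each language's bigram model and the highest-scoring language is chosen. The add-one smoothing and the example word lists are assumptions.

```python
# Minimal sketch of word-bigram language-model scoring for language identification.
# Smoothing scheme and example data are assumptions for illustration only.
import math
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(words, lm, vocab_size):
    unigrams, bigrams = lm
    padded = ["<s>"] + words + ["</s>"]
    score = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # add-one smoothed bigram probability
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return score

def identify_language(recognized_words, lms, vocab_size=1000):
    return max(lms, key=lambda lang: log_prob(recognized_words, lms[lang], vocab_size))

# Hypothetical digit strings recognized by VSR:
lms = {"english": train_bigram_lm([["one", "two", "three", "four"]]),
       "marathi": train_bigram_lm([["ek", "don", "teen", "char"]])}
print(identify_language(["two", "three"], lms))
```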
4

Metzger, Brian A., John F. Magnotti, Elizabeth Nesbitt, Daniel Yoshor, and Michael S. Beauchamp. "Cross-modal suppression model of speech perception: Visual information drives suppressive interactions between visual and auditory speech in pSTG." Journal of Vision 20, no. 11 (October 20, 2020): 434. http://dx.doi.org/10.1167/jov.20.11.434.

5

Hazen, T. J. "Visual model structures and synchrony constraints for audio-visual speech recognition." IEEE Transactions on Audio, Speech and Language Processing 14, no. 3 (May 2006): 1082–89. http://dx.doi.org/10.1109/tsa.2005.857572.

6

Fagel, Sascha. "Merging methods of speech visualization." ZAS Papers in Linguistics 40 (January 1, 2005): 19–32. http://dx.doi.org/10.21248/zaspil.40.2005.255.

Abstract:
The author presents MASSY, the MODULAR AUDIOVISUAL SPEECH SYNTHESIZER. The system combines two approaches of visual speech synthesis. Two control models are implemented: a (data based) di-viseme model and a (rule based) dominance model where both produce control commands in a parameterized articulation space. Analogously two visualization methods are implemented: an image based (video-realistic) face model and a 3D synthetic head. Both face models can be driven by both the data based and the rule based articulation model. The high-level visual speech synthesis generates a sequence of control commands for the visible articulation. For every virtual articulator (articulation parameter) the 3D synthetic face model defines a set of displacement vectors for the vertices of the 3D objects of the head. The vertices of the 3D synthetic head then are moved by linear combinations of these displacement vectors to visualize articulation movements. For the image based video synthesis a single reference image is deformed to fit the facial properties derived from the control commands. Facial feature points and facial displacements have to be defined for the reference image. The algorithm can also use an image database with appropriately annotated facial properties. An example database was built automatically from video recordings. Both the 3D synthetic face and the image based face generate visual speech that is capable to increase the intelligibility of audible speech. Other well known image based audiovisual speech synthesis systems like MIKETALK and VIDEO REWRITE concatenate pre-recorded single images or video sequences, respectively. Parametric talking heads like BALDI control a parametric face with a parametric articulation model. The presented system demonstrates the compatibility of parametric and data based visual speech synthesis approaches.
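
The articulation scheme described for the 3D synthetic head (per-parameter displacement vectors combined linearly to move the mesh vertices) can be sketched as follows. This is an illustrative reconstruction, not MASSY's code; array names, shapes, and values are assumptions.

```python
# Sketch of the vertex-displacement idea: each articulation parameter has a set of
# per-vertex displacement vectors, and the deformed mesh is the neutral mesh plus a
# weighted linear combination of those displacements. Names and shapes are assumed.
import numpy as np

def deform_head(neutral_vertices, displacement_sets, parameters):
    """
    neutral_vertices : (V, 3) neutral 3D vertex positions
    displacement_sets: (P, V, 3) displacement vectors, one set per articulation parameter
    parameters       : (P,) control values, e.g. from the di-viseme or dominance model
    """
    offsets = np.tensordot(parameters, displacement_sets, axes=1)  # -> (V, 3)
    return neutral_vertices + offsets

# Example with placeholder data: 500 vertices, 4 articulation parameters.
verts = np.zeros((500, 3))
disps = np.random.randn(4, 500, 3) * 0.01
frame = deform_head(verts, disps, np.array([0.8, 0.1, 0.0, 0.3]))
```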
7

Loh, Marco, Gabriele Schmid, Gustavo Deco, and Wolfram Ziegler. "Audiovisual Matching in Speech and Nonspeech Sounds: A Neurodynamical Model." Journal of Cognitive Neuroscience 22, no. 2 (February 2010): 240–47. http://dx.doi.org/10.1162/jocn.2009.21202.

Abstract:
Audiovisual speech perception provides an opportunity to investigate the mechanisms underlying multimodal processing. By using nonspeech stimuli, it is possible to investigate the degree to which audiovisual processing is specific to the speech domain. It has been shown in a match-to-sample design that matching across modalities is more difficult in the nonspeech domain as compared to the speech domain. We constructed a biophysically realistic neural network model simulating this experimental evidence. We propose that a stronger connection between modalities in speech underlies the behavioral difference between the speech and the nonspeech domain. This could be the result of more extensive experience with speech stimuli. Because the match-to-sample paradigm does not allow us to draw conclusions concerning the integration of auditory and visual information, we also simulated two further conditions based on the same paradigm, which tested the integration of auditory and visual information within a single stimulus. New experimental data for these two conditions support the simulation results and suggest that audiovisual integration of discordant stimuli is stronger in speech than in nonspeech stimuli. According to the simulations, the connection strength between auditory and visual information, on the one hand, determines how well auditory information can be assigned to visual information, and on the other hand, it influences the magnitude of multimodal integration.
8

Yu, Wentao, Steffen Zeiler, and Dorothea Kolossa. "Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition." Sensors 22, no. 15 (July 23, 2022): 5501. http://dx.doi.org/10.3390/s22155501.

Abstract:
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture—the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
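
For context, the conventional dynamic stream weighting that the DFN is compared against (and whose oracle variant it reportedly exceeds) can be sketched as below. This is not the DFN itself; the tensor shapes and the per-frame reliability weights are assumptions.

```python
# Sketch of conventional dynamic stream weighting (the baseline discussed in the
# abstract, not the proposed DFN): per-frame log-likelihoods of the audio and video
# streams are combined with a time-varying reliability weight lambda_t in [0, 1].
import numpy as np

def dynamic_stream_weighting(log_p_audio, log_p_video, lambdas):
    """
    log_p_audio, log_p_video : (T, S) frame-wise log-likelihoods over S states/classes
    lambdas                  : (T,) per-frame audio reliability weights in [0, 1]
    Returns combined (T, S) log-scores.
    """
    lam = lambdas[:, None]
    return lam * log_p_audio + (1.0 - lam) * log_p_video

# With an "oracle" choice of lambdas (picked per frame to maximise accuracy), this
# combination gives the upper bound that the hybrid DFN model is reported to exceed.
```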
9

How, Chun Kit, Ismail Mohd Khairuddin, Mohd Azraai Mohd Razman, Anwar P. P. Abdul Majeed, and Wan Hasbullah Mohd Isa. "Development of Audio-Visual Speech Recognition using Deep-Learning Technique." MEKATRONIKA 4, no. 1 (June 27, 2022): 88–95. http://dx.doi.org/10.15282/mekatronika.v4i1.8625.

Abstract:
Deep learning is an artificial intelligence (AI) technique that simulates humans' learning behavior. Audio-visual speech recognition is important for the listener to truly understand the emotions behind the spoken words. In this work, two different deep learning models, a Convolutional Neural Network (CNN) and a Deep Neural Network (DNN), were developed to recognize the emotion in speech from the dataset. The PyTorch framework with the torchaudio library was used. Both models were given the same training, validation, testing, and augmented datasets. Training was stopped when the training loop reached ten epochs or the validation loss did not improve for five epochs. In the end, the highest accuracy and lowest loss of the CNN model on the training dataset were 76.50% and 0.006029 respectively, while the DNN model achieved 75.42% and 0.086643. Both models were evaluated using confusion matrices. In conclusion, the CNN model performs better than the DNN model but still needs improvement, as its accuracy on the testing dataset is low and its loss is high.
10

Holubenko, Nataliia. "Cognitive and Intersemiotic Model of the Visual and Verbal Modes in a Screen Adaptation to Literary Texts." World Journal of English Language 12, no. 6 (July 18, 2022): 129. http://dx.doi.org/10.5430/wjel.v12n6p129.

Abstract:
The aim of the study is to examine screen adaptations from the perspective of cognitive and intersemiotic models of the visual and verbal modes. The purpose of the study is to express the specificity of a screen text, which is defined as a combination of three media: speech, image, and music. The scope is to demonstrate the general framework of an intersemiotic translation from a new point of view, as a kind of transliteration. The research method draws on semiotic and stylistic analyses: methods of transformation from one sign system into another, applied to prose works with regard to their cognitive as well as narrative and stylistic features (Zhong, Chen, & Xuan, 2021). The study thus analyses specific relations between the verbal and visual modes in film adaptations of prose literature, such as a more detailed description of event episodes, the temporal structure of events, and the presentation of the author's and characters' thoughts, that is, mental activity formulated as indirect speech and inner speech conveyed only by the actor's intonation. The results of the study make it possible to show the types of inner speech in the adaptations: the author's thoughts and the characters' thoughts, which are presented only by the verbal mode, and visual-mode inner speech, which combines the character's voice and image. One can conclude that, taking into account the intersemiotic relations between the visual and verbal spaces, it is possible to explain, for instance, how the words of characters are replaced by their facial expressions, gestures, or intonations.
11

Kröger, Bernd J., Julia Gotto, Susanne Albert, and Christiane Neuschaefer-Rube. "visual articulatory model and its application to therapy of speech disorders: a pilot study." ZAS Papers in Linguistics 40 (January 1, 2005): 79–94. http://dx.doi.org/10.21248/zaspil.40.2005.259.

Abstract:
A visual articulatory model based on static MRI data of isolated sounds and its application in the therapy of speech disorders is described. The model is capable of generating video sequences of articulatory movements or still images of articulatory target positions within the midsagittal plane. On the basis of this model, (1) a visual stimulation technique for the therapy of patients suffering from speech disorders and (2) a rating test for the visual recognition of speech movements were developed. Results indicate that patients produce recognition rates above chance level even without any training and that patients are able to increase their recognition rate significantly over the course of therapy.
12

Li, Dengshi, Yu Gao, Chenyi Zhu, Qianrui Wang, and Ruoxi Wang. "Improving Speech Recognition Performance in Noisy Environments by Enhancing Lip Reading Accuracy." Sensors 23, no. 4 (February 11, 2023): 2053. http://dx.doi.org/10.3390/s23042053.

Abstract:
The accuracy of speech recognition can now exceed 97% on various data sets, but it is greatly reduced in noisy environments, and improving speech recognition performance in noise is a challenging task. Because visual information is not affected by acoustic noise, researchers often use lip information to help improve speech recognition performance, which makes the performance of lip reading and the effectiveness of cross-modal fusion particularly important. In this paper, we try to improve the accuracy of speech recognition in noisy environments by improving both lip reading performance and cross-modal fusion. First, because the same lip movements may correspond to multiple meanings, we construct a one-to-many mapping model between lips and speech, allowing the lip-reading model to consider which articulations could plausibly be represented by the input lip movements. Audio representations are also preserved by modeling the inter-relationships between paired audio-visual representations; at the inference stage, the preserved audio representations can be retrieved from memory through the learned interrelationships using only video input. Second, a joint cross-fusion model using an attention mechanism can effectively exploit complementary inter-modal relationships; the model calculates cross-attention weights based on the correlations between joint feature representations and the individual modalities. Finally, our proposed model achieves a 4.0% reduction in WER in a −15 dB SNR environment compared to the baseline method, and a 10.1% reduction in WER compared to audio-only speech recognition. The experimental results show that our method yields a significant improvement over speech recognition models in different noise environments.
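
The cross-attention idea mentioned here (attention weights computed from correlations between a joint representation and an individual modality) can be illustrated with a minimal PyTorch block. This is not the paper's model; the module name, dimensions, and the final fusion step are assumptions.

```python
# Illustrative cross-attention block: the joint audio-visual representation provides
# queries, a single modality provides keys and values, and the attention weights
# reflect their correlation. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, joint_feats, modality_feats):
        # joint_feats, modality_feats: (batch, time, dim)
        q = self.q(joint_feats)
        k = self.k(modality_feats)
        v = self.v(modality_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # modality information re-weighted by cross-modal correlation

# One possible fusion, purely for illustration:
# fused = CrossModalAttention()(joint, audio_feats) + CrossModalAttention()(joint, video_feats)
```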
13

Yuan, Yuan, Chunlin Tian, and Xiaoqiang Lu. "Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition." IEEE Access 6 (2018): 5573–83. http://dx.doi.org/10.1109/access.2018.2796118.

14

Slaney, Malcolm, and Richard F. Lyon. "Visual representations of speech—A computer model based on correlation." Journal of the Acoustical Society of America 88, S1 (November 1990): S23. http://dx.doi.org/10.1121/1.2028916.

15

Edge, James D., Adrian Hilton, and Philip Jackson. "Model-Based Synthesis of Visual Speech Movements from 3D Video." EURASIP Journal on Audio, Speech, and Music Processing 2009 (2009): 1–12. http://dx.doi.org/10.1155/2009/597267.

16

Sharma, Usha, Sushila Maheshkar, A. N. Mishra, and Rahul Kaushik. "Visual Speech Recognition Using Optical Flow and Hidden Markov Model." Wireless Personal Communications 106, no. 4 (September 10, 2018): 2129–47. http://dx.doi.org/10.1007/s11277-018-5930-z.

17

Setyati, Endang, Mauridhi Hery Purnomo, Surya Sumpeno, and Joan Santoso. "HIDDEN MARKOV MODELS BASED INDONESIAN VISEME MODEL FOR NATURAL SPEECH WITH AFFECTION." Kursor 8, no. 3 (December 13, 2016): 102. http://dx.doi.org/10.28961/kursor.v8i3.61.

Abstract:
In communication using text input, a viseme (visual phoneme) is derived from a group of phonemes having similar visual appearances. The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification tasks such as speech recognition. For speech emotion recognition, an HMM is trained for each emotion, and an unknown sample is classified according to the model that describes the derived feature sequence best. With the Viterbi algorithm, the HMM is used to find the most probable sequence of hidden states given the observations. In the first stage of this work, we defined an Indonesian viseme set and the associated mouth shapes, that is, a system for text-input segmentation. In the second stage, we defined the choice of one affection type as an input to the system. In the last stage, we experimented with trigram HMMs for generating the viseme sequence to be used for synchronized mouth shapes and lip movements. The whole system is interconnected in sequence. The final system produces a viseme sequence for natural speech of Indonesian sentences with affection. We show through various experiments that the proposed approach results in about 82.19% relative improvement in classification accuracy.
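
The Viterbi step referred to in this abstract is the standard dynamic-programming search for the most likely hidden-state sequence. The sketch below shows it for a first-order HMM for brevity (the paper uses trigram HMMs); all probabilities are placeholders, not values from the paper.

```python
# Generic Viterbi decoding sketch (first-order HMM for brevity; the cited work uses
# trigram HMMs). All model probabilities here are placeholders.
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][observations[t]]))
            back[t][s] = best_prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path  # most likely viseme/state sequence
```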
18

Seo, Minji, and Myungho Kim. "Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition." Sensors 20, no. 19 (September 28, 2020): 5559. http://dx.doi.org/10.3390/s20195559.

Abstract:
Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested using different datasets, they have shown performance reduction. Cross-corpus SER research identifies speech emotion using different corpora and languages. Recent cross-corpus SER research has been conducted to improve generalization. To improve the cross-corpus SER performance, we pretrained the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train the target dataset, we extracted the feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% compared to existing state-of-the-art cross-corpus SER approaches.
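
The bag-of-visual-words step described in this abstract (a frequency histogram of quantised local descriptors from the log-mel spectrogram) can be sketched roughly as follows. The patch scheme, codebook size, and use of scikit-learn k-means are assumptions, not the authors' exact procedure.

```python
# Sketch of a bag-of-visual-words histogram over log-mel spectrogram patches.
# Patch size, stride, codebook size, and library choice are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(log_mel, patch=8, stride=4):
    """Flatten overlapping patch descriptors from a (mels, frames) log-mel spectrogram."""
    descs = []
    for i in range(0, log_mel.shape[0] - patch + 1, stride):
        for j in range(0, log_mel.shape[1] - patch + 1, stride):
            descs.append(log_mel[i:i + patch, j:j + patch].ravel())
    return np.array(descs)

def build_codebook(all_descriptors, n_words=128):
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

def bovw_histogram(log_mel, codebook):
    words = codebook.predict(extract_patches(log_mel))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalised word-frequency histogram used alongside the CNN
```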
19

Hertrich, Ingo, Susanne Dietrich, and Hermann Ackermann. "Cross-modal Interactions during Perception of Audiovisual Speech and Nonspeech Signals: An fMRI Study." Journal of Cognitive Neuroscience 23, no. 1 (January 2011): 221–37. http://dx.doi.org/10.1162/jocn.2010.21421.

Abstract:
During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream—prior to its fusion with auditory phonological features [Hertrich, I., Mathiak, K., Lutzenberger, W., & Ackermann, H. Time course of early audiovisual interactions during speech and non-speech central-auditory processing: An MEG study. Journal of Cognitive Neuroscience, 21, 259–274, 2009]. Using functional magnetic resonance imaging, the present follow-up study aims to further elucidate the topographic distribution of visual–phonological operations and audiovisual (AV) interactions during speech perception. Ambiguous acoustic syllables—disambiguated to /pa/ or /ta/ by the visual channel (speaking face)—served as test materials, concomitant with various control conditions (nonspeech AV signals, visual-only and acoustic-only speech, and nonspeech stimuli). (i) Visual speech yielded an AV-subadditive activation of primary auditory cortex and the anterior superior temporal gyrus (STG), whereas the posterior STG responded both to speech and nonspeech motion. (ii) The inferior frontal and the fusiform gyrus of the right hemisphere showed a strong phonetic/phonological impact (differential effects of visual /pa/ vs. /ta/) upon hemodynamic activation during presentation of speaking faces. Taken together with the previous MEG data, these results point at a dual-pathway model of visual speech information processing: On the one hand, access to the auditory system via the anterior supratemporal “what” path may give rise to direct activation of “auditory objects.” On the other hand, visual speech information seems to be represented in a right-hemisphere visual working memory, providing a potential basis for later interactions with auditory information such as the McGurk effect.
20

Blackburn, Catherine L., Pádraig T. Kitterick, Gary Jones, Christian J. Sumner, and Paula C. Stacey. "Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers." Trends in Hearing 23 (January 2019): 233121651983786. http://dx.doi.org/10.1177/2331216519837866.

Abstract:
Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker. This is of particular value to hearing impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility. How these factors impact on the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers. For vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that visual speech benefit depended upon the audio intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, similar to audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
21

Fleming, Luke. "Negating speech." Gesture 14, no. 3 (December 31, 2014): 263–96. http://dx.doi.org/10.1075/gest.14.3.01fle.

Abstract:
With the exception of Plains Indian Sign Language and Pacific Northwest sawmill sign languages, highly developed alternate sign languages (sign languages typically employed by and for the hearing) share not only common structural linguistic features, but their use is also characterized by convergent ideological commitments concerning communicative medium and linguistic modality. Though both modalities encode comparable denotational content, speaker-signers tend to understand manual-visual sign as a pragmatically appropriate substitute for oral-aural speech. This paper suggests that two understudied clusters of alternate sign languages, Armenian and Cape York Peninsula sign languages, offer a general model for the development of alternate sign languages, one in which the gesture-to-sign continuum is dialectically linked to hypertrophied forms of interactional avoidance up-to-and-including complete silence in the co-presence of affinal relations. These cases illustrate that the pragmatic appropriateness of sign over speech relies upon local semiotic ideologies which tend to conceptualize the manual-visual linguistic modality on analogy to the gestural communication employed in interactional avoidance, and thus as not counting as true language.
22

Nikolaus, Mitja, Afra Alishahi, and Grzegorz Chrupała. "Learning English with Peppa Pig." Transactions of the Association for Computational Linguistics 10 (2022): 922–36. http://dx.doi.org/10.1162/tacl_a_00498.

Abstract:
Recent computational models of the acquisition of spoken language via grounding in perception exploit associations between spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual data. In the real world the coupling between the linguistic and the visual modality is loose, and often confounded by correlations with non-semantic aspects of the speech signal. Here we address this shortcoming by using a dataset based on the children’s cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
23

Ryumin, Dmitry, Denis Ivanko, and Elena Ryumina. "Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices." Sensors 23, no. 4 (February 17, 2023): 2284. http://dx.doi.org/10.3390/s23042284.

Abstract:
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.
24

Yang, Chih-Chun, Wan-Cyuan Fan, Cheng-Fu Yang, and Yu-Chiang Frank Wang. "Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 3036–44. http://dx.doi.org/10.1609/aaai.v36i3.20210.

Abstract:
As a key characteristic in audio-visual speech recognition (AVSR), relating linguistic information observed across visual and audio data has been a challenge, benefiting not only audio/visual speech recognition (ASR/VSR) but also for manipulating data within/across modalities. In this paper, we present a feature disentanglement-based framework for jointly addressing the above tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual or audio-based linguistic features into modality-agnostic representations. Such derived linguistic representations not only allow one to perform ASR, VSR, and AVSR, but also to manipulate audio and visual data output based on the desirable subject identity and linguistic content information. We perform extensive experiments on different recognition and synthesis tasks to show that our model performs favorably against state-of-the-art approaches on each individual task, while ours is a unified solution that is able to jointly tackle the aforementioned audio-visual learning tasks.
25

Wang, Dong, Bing Liu, Yong Zhou, Mingming Liu, Peng Liu, and Rui Yao. "Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning." Applied Sciences 12, no. 23 (November 22, 2022): 11875. http://dx.doi.org/10.3390/app122311875.

Abstract:
Transformer-based image captioning models have recently achieved remarkable performance by using new fully attentive paradigms. However, existing models generally follow the conventional language model of predicting the next word conditioned on the visual features and partially generated words. They treat the predictions of visual and nonvisual words equally and usually tend to produce generic captions. To address these issues, we propose a novel part-of-speech-guided transformer (PoS-Transformer) framework for image captioning. Specifically, a self-attention part-of-speech prediction network is first presented to model the part-of-speech tag sequences for the corresponding image captions. Then, different attention mechanisms are constructed for the decoder to guide the caption generation by using the part-of-speech information. Benefiting from the part-of-speech guiding mechanisms, the proposed framework not only adaptively adjusts the weights between visual features and language signals for the word prediction, but also facilitates the generation of more fine-grained and grounded captions. Finally, a multitask learning is introduced to train the whole PoS-Transformer network in an end-to-end manner. Our model was trained and tested on the MSCOCO and Flickr30k datasets with the experimental evaluation standard CIDEr scores of 1.299 and 0.612, respectively. The qualitative experimental results indicated that the captions generated by our method conformed to the grammatical rules better.
26

Biswas, Astik, P. K. Sahu, and Mahesh Chandra. "Multiple cameras audio visual speech recognition using active appearance model visual features in car environment." International Journal of Speech Technology 19, no. 1 (January 23, 2016): 159–71. http://dx.doi.org/10.1007/s10772-016-9332-x.

27

Lindborg, Alma, and Tobias S. Andersen. "Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception." PLOS ONE 16, no. 2 (February 19, 2021): e0246986. http://dx.doi.org/10.1371/journal.pone.0246986.

Abstract:
Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is the case of the McGurk illusion, where an auditory stimulus such as e.g. “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion case can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the information reliability of each sensory cue). However, there is to date no study which has accounted for how they each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and the fusion stages simultaneously. This is done by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that they can both account for the enhancement effect in congruent audiovisual speech, as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
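
The "fusion informed by the information reliability of each sensory cue" can be illustrated by the standard precision-weighted (forced-fusion) computation shown below. This is only the fusion stage, not the authors' full two-step binding-plus-fusion model, and the cue values are placeholders.

```python
# Sketch of reliability-weighted (precision-weighted) fusion of two noisy cues, i.e.
# the standard forced-fusion computation. Values are placeholders for illustration.
def fuse_cues(audio_estimate, audio_sigma, visual_estimate, visual_sigma):
    """Maximum-likelihood fusion: each cue is weighted by its precision (1/variance)."""
    w_a = 1.0 / audio_sigma ** 2
    w_v = 1.0 / visual_sigma ** 2
    fused = (w_a * audio_estimate + w_v * visual_estimate) / (w_a + w_v)
    fused_sigma = (1.0 / (w_a + w_v)) ** 0.5
    return fused, fused_sigma

# Reliable vision plus noisy audio: the fused percept is pulled toward the visual cue.
print(fuse_cues(audio_estimate=0.2, audio_sigma=1.0, visual_estimate=0.8, visual_sigma=0.3))
```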
28

Zeliang Zhang, Xiongfei Li, and Chengjia Yang. "Visual Speech Recognition based on Improved type of Hidden Markov Model." Journal of Convergence Information Technology 7, no. 13 (July 31, 2012): 119–26. http://dx.doi.org/10.4156/jcit.vol7.issue13.14.

29

HONG, PENGYU, ZHEN WEN, and THOMAS S. HUANG. "iFACE: A 3D SYNTHETIC TALKING FACE." International Journal of Image and Graphics 01, no. 01 (January 2001): 19–26. http://dx.doi.org/10.1142/s0219467801000037.

Abstract:
We present the iFACE system, a visual speech synthesizer that provides a form of virtual face-to-face communication. The system provides an interactive tool for the user to customize a graphic head model for a person's virtual agent based on his or her range data. The texture is mapped onto the customized model to achieve a realistic appearance. Face animations are produced by using a text stream or speech stream to drive the model. A set of basic facial shapes and head actions is manually built and used to synthesize expressive visual speech based on rules.
30

Zhou, Hang, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. "Talking Face Generation by Adversarially Disentangled Audio-Visual Representation." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9299–306. http://dx.doi.org/10.1609/aaai.v33i01.33019299.

Abstract:
Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance model on specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.
31

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (February 7, 2023): 1834. http://dx.doi.org/10.3390/s23041834.

Abstract:
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2—Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.
32

Jeon, Sanghun, and Mun Sang Kim. "Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications." Sensors 22, no. 20 (October 12, 2022): 7738. http://dx.doi.org/10.3390/s22207738.

Abstract:
Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user–system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.
33

Hertrich, Ingo, Klaus Mathiak, Werner Lutzenberger, and Hermann Ackermann. "Time Course of Early Audiovisual Interactions during Speech and Nonspeech Central Auditory Processing: A Magnetoencephalography Study." Journal of Cognitive Neuroscience 21, no. 2 (February 2009): 259–74. http://dx.doi.org/10.1162/jocn.2008.21019.

Abstract:
Cross-modal fusion phenomena suggest specific interactions of auditory and visual sensory information both within the speech and nonspeech domains. Using whole-head magnetoencephalography, this study recorded M50 and M100 fields evoked by ambiguous acoustic stimuli that were visually disambiguated to perceived /ta/ or /pa/ syllables. As in natural speech, visual motion onset preceded the acoustic signal by 150 msec. Control conditions included visual and acoustic nonspeech signals as well as visual-only and acoustic-only stimuli. (a) Both speech and nonspeech motion yielded a consistent attenuation of the auditory M50 field, suggesting a visually induced “preparatory baseline shift” at the level of the auditory cortex. (b) Within the temporal domain of the auditory M100 field, visual speech and nonspeech motion gave rise to different response patterns (nonspeech: M100 attenuation; visual /pa/: left-hemisphere M100 enhancement; /ta/: no effect). (c) These interactions could be further decomposed using a six-dipole model. One of these three pairs of dipoles (V270) was fitted to motion-induced activity at a latency of 270 msec after motion onset, that is, the time domain of the auditory M100 field, and could be attributed to the posterior insula. This dipole source responded to nonspeech motion and visual /pa/, but was found suppressed in the case of visual /ta/. Such a nonlinear interaction might reflect the operation of a binary distinction between the marked phonological feature “labial” versus its underspecified competitor “coronal.” Thus, visual processing seems to be shaped by linguistic data structures even prior to its fusion with auditory information channel.
34

Bielski, Lynn M., and Charissa R. Lansing. "Utility of the Baddeley and Hitch Model of Short-Term Working Memory To Investigate Spoken Language Understanding: A Tutorial." Perspectives on Aural Rehabilitation and Its Instrumentation 19, no. 1 (May 2012): 25–33. http://dx.doi.org/10.1044/arii19.1.25.

Abstract:
Spoken speech understanding can be challenging, particularly in the presence of competing information such as background noise. Researchers have shown that dynamic observable phonetic facial cues improve speech understanding in both quiet and noise. Additionally, cognitive functions such as short-term working memory influence spoken language understanding. Currently, we do not know the utility of visual cues for the improvement of spoken language understanding. Although there are many theoretical models of short-term memory, the Baddeley and Hitch (1974) multicomponent model of short-term working memory is well-suited as a cognitive framework through which the utility of visual cues in spoken language understanding could be investigated. In this tutorial, we will describe the components of the Baddeley and Hitch model, illustrate their contributions to spoken language understanding, and provide possible applications for the model.
35

Anwar, Miftahulkhairah, Fathiaty Murtadho, Endry Boeriswati, Gusti Yarmi, and Helvy Tiana Rosa. "analysis model of impolite Indonesian language use." Linguistics and Culture Review 5, S3 (December 5, 2021): 1426–41. http://dx.doi.org/10.21744/lingcure.v5ns3.1840.

Abstract:
This research was based on the reality that Indonesian language use on social media is vulgar, destructive, full of blasphemy, scorn, and sarcasm, and tends to be provocative. This condition has destructive power because such language spreads very quickly and is capable of arousing very strong emotions. This article presents the results of research on an analysis model of impolite Indonesian language use. The model was developed by tracing social media status updates that included language impoliteness in 2019. The novelty of this analysis model is that it involves the factor of power that allows such impolite speech to appear. The model is therefore composed of several stages: first, presenting text in the form of spoken, written, and visual texts; second, transcribing the texts; and third, interpreting language impoliteness. At the interpreting stage, the impoliteness of the utterances was analyzed by: (1) analyzing the contexts, (2) analyzing the power, (3) analyzing the diction and language styles that contained impoliteness, (4) analyzing ethical speech acts, and (5) manipulating language politeness. These language-manipulation efforts were made to habituate language discipline and create a polite language society.
36

Handa, Anand, Rashi Agarwal, and Narendra Kohli. "Audio-Visual Emotion Recognition System Using Multi-Modal Features." International Journal of Cognitive Informatics and Natural Intelligence 15, no. 4 (October 2021): 1–14. http://dx.doi.org/10.4018/ijcini.20211001.oa34.

Abstract:
Due to the highly variant face geometry and appearances, Facial Expression Recognition (FER) is still a challenging problem. CNN can characterize 2-D signals. Therefore, for emotion recognition in a video, the authors propose a feature selection model in AlexNet architecture to extract and filter facial features automatically. Similarly, for emotion recognition in audio, the authors use a deep LSTM-RNN. Finally, they propose a probabilistic model for the fusion of audio and visual models using facial features and speech of a subject. The model combines all the extracted features and use them to train the linear SVM (Support Vector Machine) classifiers. The proposed model outperforms the other existing models and achieves state-of-the-art performance for audio, visual and fusion models. The model classifies the seven known facial expressions, namely anger, happy, surprise, fear, disgust, sad, and neutral on the eNTERFACE’05 dataset with an overall accuracy of 76.61%.
37

Miller, Christi W., Erin K. Stewart, Yu-Hsiang Wu, Christopher Bishop, Ruth A. Bentler, and Kelly Tremblay. "Working Memory and Speech Recognition in Noise Under Ecologically Relevant Listening Conditions: Effects of Visual Cues and Noise Type Among Adults With Hearing Loss." Journal of Speech, Language, and Hearing Research 60, no. 8 (August 18, 2017): 2310–20. http://dx.doi.org/10.1044/2017_jslhr-h-16-0284.

Abstract:
Purpose This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Method Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. Results A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. Conclusion The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed.
38

Kusmana, Suherli, Endang Kasupardi, and Nunu Nurasa. "PENGARUH MODEL PEMBELAJARAN BERBASIS MASALAH MELALUI MEDIA AUDIO VISUAL TERHADAP PENINGKATAN KEMAMPUAN BERPIDATO SISWA KELAS IX SMP NEGERI 1 NUSAHERANG KABUPATEN KUNINGAN." Jurnal Tuturan 3, no. 1 (November 28, 2017): 419. http://dx.doi.org/10.33603/jt.v3i1.776.

Abstract:
The process and results of teaching speech-making to grade IX students of SMP Negeri 1 Nusaherang, Kuningan Regency, had not yet been well directed and had not reached optimal results; the students' speech ability was still low, because the teaching approach used was not relevant to the students' characteristics. The aims of this research were to describe the effectiveness of a problem-based learning model using audio-visual media on the speech ability of grade IX students of SMP Negeri 1 Nusaherang, to describe the influence of this model on their speech ability, and to describe the students' responses to its use. The research used an experimental method with a pretest-posttest control group design involving an experimental group and a control group: problem-based learning was carried out with the experimental group, while demonstration was used with the control group. The measurement was given after the treatment conditions were applied to the students. The results indicate that problem-based learning using audio-visual media is more effective in improving students' speech ability, as shown by the students' activity: the students learned the material more cooperatively, and the effect on speech ability amounted to 0.875² = 0.76 (76%), meaning that the students' speech ability was influenced by implementing problem-based learning through audio-visual media. Most of the students agreed with and gave positive responses toward implementing problem-based learning through audio-visual media. The benefits of this approach are: (1) increased student motivation, (2) increased student creativity, (3) avoidance of boredom in learning, and (4) improved respect for others' opinions.
39

Uhler, Kristin M., Rosalinda Baca, Emily Dudas, and Tammy Fredrickson. "Refining Stimulus Parameters in Assessing Infant Speech Perception Using Visual Reinforcement Infant Speech Discrimination: Sensation Level." Journal of the American Academy of Audiology 26, no. 10 (November 2015): 807–14. http://dx.doi.org/10.3766/jaaa.14093.

Abstract:
Background: Speech perception measures have long been considered an integral piece of the audiological assessment battery. Currently, a prelinguistic, standardized measure of speech perception is missing in the clinical assessment battery for infants and young toddlers. Such a measure would allow systematic assessment of speech perception abilities of infants as well as the potential to investigate the impact early identification of hearing loss and early fitting of amplification have on the auditory pathways. Purpose: To investigate the impact of sensation level (SL) on the ability of infants with normal hearing (NH) to discriminate /a-i/ and /ba-da/ and to determine if performance on the two contrasts are significantly different in predicting the discrimination criterion. Research Design: The design was based on a survival analysis model for event occurrence and a repeated measures logistic model for binary outcomes. The outcome for survival analysis was the minimum SL for criterion and the outcome for the logistic regression model was the presence/absence of achieving the criterion. Criterion achievement was designated when an infant’s proportion correct score was >0.75 on the discrimination performance task. Study Sample: Twenty-two infants with NH sensitivity participated in this study. There were 9 males and 13 females, aged 6–14 mo. Data Collection and Analysis: Testing took place over two to three sessions. The first session consisted of a hearing test, threshold assessment of the two speech sounds (/a/ and /i/), and if time and attention allowed, visual reinforcement infant speech discrimination (VRISD). The second session consisted of VRISD assessment for the two test contrasts (/a-i/ and /ba-da/). The presentation level started at 50 dBA. If the infant was unable to successfully achieve criterion (>0.75) at 50 dBA, the presentation level was increased to 70 dBA followed by 60 dBA. Data examination included an event analysis, which provided the probability of criterion distribution across SL. The second stage of the analysis was a repeated measures logistic regression where SL and contrast were used to predict the likelihood of speech discrimination criterion. Results: Infants were able to reach criterion for the /a-i/ contrast at statistically lower SLs when compared to /ba-da/. There were six infants who never reached criterion for /ba-da/ and one never reached criterion for /a-i/. The conditional probability of not reaching criterion by 70 dB SL was 0% for /a-i/ and 21% for /ba-da/. The predictive logistic regression model showed that children were more likely to discriminate the /a-i/ even when controlling for SL. Conclusions: Nearly all normal-hearing infants can demonstrate discrimination criterion of a vowel contrast at 60 dB SL, while a level of ≥70 dB SL may be needed to allow all infants to demonstrate discrimination criterion of a difficult consonant contrast.
40

Gao, Ying, Yuqin Liu, and Chunyue Zhou. "Production and Interaction between Gesture and Speech: A Review." International Journal of English Linguistics 6, no. 2 (March 29, 2016): 131. http://dx.doi.org/10.5539/ijel.v6n2p131.

Abstract:
Gesture has recently been studied widely in multimodal research, and how gesture interacts with speech in communication is the focus of most of this work. Several hypotheses and models of the production of, and interaction between, gesture and speech are introduced and compared in this paper. We find that it is generally agreed that the speech production mechanism can be explained by Levelt's model, while there is no consensus about gesture production or the interaction between gesture and speech. Most theories argue that gesture stems from visual-spatial images in working memory; some models endorse an interactive relationship, while others assume no interaction between gesture and speech. Further research is needed on both theoretical and applied aspects.
41

Massaro, Dominic W., and Michael M. Cohen. "Perception of Synthesized Audible and Visible Speech." Psychological Science 1, no. 1 (January 1990): 55–63. http://dx.doi.org/10.1111/j.1467-9280.1990.tb00068.x.

Full text
Abstract:
The research reported in this paper uses novel stimuli to study how speech perception is influenced by information presented to ear and eye. Auditory and visual sources of information (syllables) were synthesized and presented in isolation or in factorial combination. A five-step continuum between the syllables /ba/ and /da/ was synthesized along both auditory and visual dimensions, by varying properties of the syllable at its onset. The onsets of the second and third formants were manipulated in the audible speech. For the visible speech, the shape of the lips and the jaw position at the onset of the syllable were manipulated. Subjects’ identification judgments of the test syllables presented on videotape were influenced by both auditory and visual information. The results were used to test between a fuzzy logical model of speech perception (FLMP) and a categorical model of perception (CMP). These tests indicate that evaluation and integration of the two sources of information make available continuous as opposed to just categorical information. In addition, the integration of the two sources appears to be nonadditive, in that the least ambiguous source has the largest impact on the judgment. The two sources of information appear to be evaluated, integrated, and identified as described by the FLMP, an optimal algorithm for combining information from multiple sources. The research provides a theoretical framework for understanding the improvement in speech perception by hearing-impaired listeners when auditory speech is supplemented with other sources of information.
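In its standard two-alternative form (given here as a sketch; the paper should be consulted for the exact parameterization), the FLMP combines the degree of auditory support a_i and visual support v_j for one response alternative multiplicatively and normalizes:

P(\mathrm{/da/} \mid A_i, V_j) \;=\; \frac{a_i\, v_j}{a_i\, v_j + (1-a_i)(1-v_j)}

Because the supports are combined multiplicatively, the less ambiguous source (the one closer to 0 or 1) dominates the judgment, which is the nonadditive integration the abstract describes.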
APA, Harvard, Vancouver, ISO, and other styles
42

J., Esra, and Diyar H. "Audio Visual Arabic Speech Recognition using KNN Model by Testing different Audio Features." International Journal of Computer Applications 180, no. 1 (December 15, 2017): 33–38. http://dx.doi.org/10.5120/ijca2017915901.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Lü, Guo-yun, Dong-mei Jiang, Yan-ning Zhang, Rong-chun Zhao, H. Sahli, Ilse Ravyse, and W. Verhelst. "DBN Based Multi-stream Multi-states Model for Continue Audio-Visual Speech Recognition." Journal of Electronics & Information Technology 30, no. 12 (April 22, 2011): 2906–11. http://dx.doi.org/10.3724/sp.j.1146.2007.00915.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Deena, Salil, Shaobo Hou, and Aphrodite Galata. "Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model." IEEE Transactions on Multimedia 15, no. 8 (December 2013): 1755–68. http://dx.doi.org/10.1109/tmm.2013.2279659.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Gogate, Mandar, Kia Dashtipour, Ahsan Adeel, and Amir Hussain. "CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement." Information Fusion 63 (November 2020): 273–85. http://dx.doi.org/10.1016/j.inffus.2020.04.001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Hazra, Sumon Kumar, Romana Rahman Ema, Syed Md Galib, Shalauddin Kabir, and Nasim Adnan. "Emotion recognition of human speech using deep learning method and MFCC features." Radioelectronic and Computer Systems, no. 4 (November 29, 2022): 161–72. http://dx.doi.org/10.32620/reks.2022.4.13.

Full text
Abstract:
Subject matter: Speech emotion recognition (SER) is an ongoing and interesting research topic. Its purpose is to establish interactions between humans and computers through speech and emotion. To recognize speech emotions, five deep learning models are used in this paper: a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), an Artificial Neural Network (ANN), a Multi-Layer Perceptron (MLP), and a merged CNN-LSTM network. The Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets were used for this system. The models were trained on three merged combinations: TESS+SAVEE, TESS+RAVDESS, and TESS+SAVEE+RAVDESS. These datasets contain numerous audio recordings spoken by both male and female speakers of English. This paper classifies seven emotions (sadness, happiness, anger, fear, disgust, neutral, and surprise), which is challenging for data containing both male and female speakers; most prior work has used male-only or female-only speech, and combined male-female datasets have yielded low accuracy in emotion detection tasks. Features must be extracted with a feature extraction technique to train a deep learning model on audio data; Mel Frequency Cepstral Coefficients (MFCCs) capture the necessary features from the audio data for speech emotion classification. After training the five models on the three datasets, the best accuracy of 84.35% is achieved by the CNN-LSTM with the TESS+SAVEE dataset.
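A minimal sketch (not the authors' code) of the kind of pipeline the abstract describes: MFCC features extracted with librosa feeding a merged CNN-LSTM classifier over seven emotion classes. The frame count, filter sizes, and layer widths are illustrative assumptions.

import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MFCC = 40        # number of MFCC coefficients (illustrative)
MAX_FRAMES = 200   # fixed number of time frames after padding/truncation (illustrative)
NUM_EMOTIONS = 7   # sadness, happiness, anger, fear, disgust, neutral, surprise

def extract_mfcc(path, n_mfcc=N_MFCC, max_frames=MAX_FRAMES):
    # Load the waveform and compute a fixed-size (max_frames, n_mfcc) MFCC matrix.
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]

def build_cnn_lstm():
    # 1-D convolutions over time, then an LSTM, then a softmax over the 7 emotions.
    model = models.Sequential([
        layers.Input(shape=(MAX_FRAMES, N_MFCC)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(128),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model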
APA, Harvard, Vancouver, ISO, and other styles
47

Brancazio, Lawrence, and Carol A. Fowler. "Merging auditory and visual phonetic information: A critical test for feedback?" Behavioral and Brain Sciences 23, no. 3 (June 2000): 327–28. http://dx.doi.org/10.1017/s0140525x00243240.

Full text
Abstract:
The present description of the Merge model addresses only auditory, not audiovisual, speech perception. However, recent findings in the audiovisual domain are relevant to the model. We outline a test that we are conducting of the adequacy of Merge, modified to accept visual information about articulation.
APA, Harvard, Vancouver, ISO, and other styles
48

Vougioukas, Konstantinos, Stavros Petridis, and Maja Pantic. "Realistic Speech-Driven Facial Animation with GANs." International Journal of Computer Vision 128, no. 5 (October 13, 2019): 1398–413. http://dx.doi.org/10.1007/s11263-019-01251-8.

Full text
Abstract:
Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses 3 discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model using an ablation study and we provide insights into the latent representation of the model. The generated videos are evaluated based on sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to generate natural blinks.
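A generic form of the objective such a temporal GAN optimizes (a sketch only; the loss weights, the reconstruction term, and the exact per-discriminator losses are assumptions, not the paper's formulation). The generator G is trained against three discriminators D_1, D_2, D_3, targeting frame detail, audio-visual synchronization, and realistic expressions respectively:

\min_{G}\; \max_{D_1, D_2, D_3}\; \sum_{k=1}^{3} \lambda_k\, \mathcal{L}_{\mathrm{adv}}^{(k)}(G, D_k) \;+\; \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}(G)

Each adversarial term pushes the generated frames toward one of the three properties listed in the abstract, while a reconstruction term keeps the output close to ground-truth video during training.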
APA, Harvard, Vancouver, ISO, and other styles
49

Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. "Multimodal machine translation through visuals and speech." Machine Translation 34, no. 2-3 (August 13, 2020): 97–147. http://dx.doi.org/10.1007/s10590-020-09250-0.

Full text
Abstract:
Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
APA, Harvard, Vancouver, ISO, and other styles
50

Indira, D. N. V. S. L. S., et al. "An Enhanced CNN-2D for Audio-Visual Emotion Recognition (AVER) Using ADAM Optimizer." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 5 (April 11, 2021): 1378–88. http://dx.doi.org/10.17762/turcomat.v12i5.2030.

Full text
Abstract:
Recent developments in audio-visual emotion recognition (AVER) have identified the importance of integrating visual components into the speech recognition process to improve robustness. Visual characteristics have strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. CNNs work very well with images, and an audio file can be converted into an image-like representation, such as a spectrogram, from which hidden frequency information can be extracted. This paper presents a method for emotion recognition using spectrograms and a 2-D CNN (CNN-2D). Spectrograms formed from the speech signals serve as the CNN-2D input. The proposed model, which consists of three kinds of CNN layers (convolution, pooling, and fully connected layers), extracts discriminative characteristics from the spectrogram representations and produces performance estimates for the seven emotions. This article compares the output with an existing SER approach using audio files and a CNN. The accuracy is improved by 6.5% when the CNN-2D is used.
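A minimal sketch (an assumption, not the article's exact architecture) of a 2-D CNN over spectrogram images trained with the Adam optimizer and a seven-class softmax, following the convolution / pooling / fully connected structure the abstract describes; the input size, filter counts, and learning rate are illustrative.

from tensorflow.keras import layers, models, optimizers

NUM_EMOTIONS = 7  # seven emotion classes

def build_cnn2d(input_shape=(128, 128, 1)):
    # Three convolution+pooling blocks, then fully connected layers and a softmax.
    model = models.Sequential([
        layers.Input(shape=input_shape),  # spectrogram treated as a 1-channel image
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model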
APA, Harvard, Vancouver, ISO, and other styles