Journal articles on the topic 'Visual speech recognition'



Consult the top 50 journal articles for your research on the topic 'Visual speech recognition.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Beadles, Robert L. "Audio visual speech recognition." Journal of the Acoustical Society of America 87, no. 5 (May 1990): 2274. http://dx.doi.org/10.1121/1.399137.

2

Dupont, S., and J. Luettin. "Audio-visual speech modeling for continuous speech recognition." IEEE Transactions on Multimedia 2, no. 3 (2000): 141–51. http://dx.doi.org/10.1109/6046.865479.

3

Brahme, Aparna, and Umesh Bhadade. "Effect of Various Visual Speech Units on Language Identification Using Visual Speech Recognition." International Journal of Image and Graphics 20, no. 04 (October 2020): 2050029. http://dx.doi.org/10.1142/s0219467820500291.

Abstract:
In this paper, we describe our work in spoken language identification using Visual Speech Recognition (VSR) and analyze the effect of the visual speech units used to transcribe visual speech on language recognition. We propose a new approach of word recognition followed by a word N-gram language model (WRWLM), which uses high-level syntactic features and a word bigram language model for language discrimination. Also, as opposed to the traditional visemic approach, we propose a holistic approach that uses the signature of a whole word, referred to as a “Visual Word”, as the visual speech unit for transcribing visual speech. The results show a Word Recognition Rate (WRR) of 88% and a Language Recognition Rate (LRR) of 94% in speaker-dependent cases, and 58% WRR and 77% LRR in speaker-independent cases, for an English and Marathi digit classification task. The proposed approach is also evaluated on continuous speech input. The results show that a spoken language identification rate of 50% is possible even though the WRR using visual speech recognition is below 10%, using only 1 s of speech. Also, there is an improvement of about 5% in language discrimination as compared to traditional visemic approaches.
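To make the WRWLM idea above concrete, here is a minimal Python sketch of bigram-based language discrimination over recognized word strings. The vocabulary, counts, and add-one smoothing are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import math

def train_bigram_lm(sentences, vocab_size, alpha=1.0):
    """Train an add-alpha smoothed word-bigram model from lists of words."""
    bigrams = defaultdict(lambda: defaultdict(int))
    unigrams = defaultdict(int)
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev][cur] += 1
            unigrams[prev] += 1
    def logprob(prev, cur):
        return math.log((bigrams[prev][cur] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))
    return logprob

def identify_language(recognized_words, lms):
    """Score a VSR word hypothesis against each language's bigram LM."""
    padded = ["<s>"] + recognized_words + ["</s>"]
    scores = {lang: sum(lm(p, c) for p, c in zip(padded, padded[1:]))
              for lang, lm in lms.items()}
    return max(scores, key=scores.get)

# Hypothetical digit transcriptions for two languages.
english = [["one", "two", "three"], ["three", "two", "one"]]
marathi = [["ek", "don", "teen"], ["teen", "don", "ek"]]
lms = {
    "English": train_bigram_lm(english, vocab_size=10),
    "Marathi": train_bigram_lm(marathi, vocab_size=10),
}
print(identify_language(["two", "three", "one"], lms))  # -> English
```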
4

Elrefaei, Lamiaa A., Tahani Q. Alhassan, and Shefaa S. Omar. "An Arabic Visual Dataset for Visual Speech Recognition." Procedia Computer Science 163 (2019): 400–409. http://dx.doi.org/10.1016/j.procs.2019.12.122.

5

Rosenblum, Lawrence D., Deborah A. Yakel, Naser Baseer, Anjani Panchal, Brynn C. Nodarse, and Ryan P. Niehus. "Visual speech information for face recognition." Perception & Psychophysics 64, no. 2 (February 2002): 220–29. http://dx.doi.org/10.3758/bf03195788.

6

Yu, Dahai, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan. "A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition." IPSJ Transactions on Computer Vision and Applications 2 (2010): 25–38. http://dx.doi.org/10.2197/ipsjtcva.2.25.

7

Salama, Elham S., Reda A. El-Khoribi, and Mahmoud E. Shoman. "Audio-Visual Speech Recognition for People with Speech Disorders." International Journal of Computer Applications 96, no. 2 (June 18, 2014): 51–56. http://dx.doi.org/10.5120/16770-6337.

8

Nakadai, Kazuhiro, and Tomoaki Koiwa. "Psychologically-Inspired Audio-Visual Speech Recognition Using Coarse Speech Recognition and Missing Feature Theory." Journal of Robotics and Mechatronics 29, no. 1 (February 20, 2017): 105–13. http://dx.doi.org/10.20965/jrm.2017.p0105.

Abstract:
[Figure: System architecture of AVSR based on missing feature theory and P-V grouping] Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. For AVSR, the auditory and visual units are the phoneme and viseme, respectively. However, these are often misclassified in the real world because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT) to cope with missing or unreliable audio and visual features for recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that these two approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate weight improves the recognition performance by −5 dB. This is the case even in a noisy environment, in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved the AVSR performance, particularly at a low signal-to-noise ratio. This work is an extension of our publication “Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp. 1751-1756, 2007.”
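The sketch below illustrates the general flavour of missing-feature-theory integration described above: unreliable feature dimensions are masked out of per-frame Gaussian log-likelihoods before the audio and visual streams are combined with a stream weight. The model shapes, mask rule, and weight are assumptions for illustration, not the authors' system.

```python
import numpy as np
from scipy.stats import norm

def masked_gaussian_loglik(frame, mean, var, reliability_mask):
    """Marginalise unreliable feature dimensions (MFT): only dimensions
    flagged reliable contribute to the frame log-likelihood."""
    ll = norm.logpdf(frame, loc=mean, scale=np.sqrt(var))
    return float(np.sum(ll * reliability_mask))

def av_score(audio_frames, audio_masks, visual_frames, visual_masks,
             model, stream_weight=0.7):
    """Weighted audio-visual score for one word model (diagonal Gaussians)."""
    total = 0.0
    for af, am, vf, vm in zip(audio_frames, audio_masks,
                              visual_frames, visual_masks):
        la = masked_gaussian_loglik(af, model["a_mean"], model["a_var"], am)
        lv = masked_gaussian_loglik(vf, model["v_mean"], model["v_var"], vm)
        total += stream_weight * la + (1.0 - stream_weight) * lv
    return total

# Toy example: 5 frames, 13-dim audio and 6-dim visual features.
rng = np.random.default_rng(0)
model = {"a_mean": np.zeros(13), "a_var": np.ones(13),
         "v_mean": np.zeros(6), "v_var": np.ones(6)}
audio = rng.normal(size=(5, 13))
visual = rng.normal(size=(5, 6))
a_mask = (np.abs(audio) < 2.0).astype(float)  # crude reliability estimate
v_mask = np.ones_like(visual)                 # visual stream assumed reliable
print(av_score(audio, a_mask, visual, v_mask, model))
```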
9

Bahal, Akriti. "Advances in Automatic Speech Recognition: From Audio-Only To Audio-Visual Speech Recognition." IOSR Journal of Computer Engineering 5, no. 1 (2012): 31–36. http://dx.doi.org/10.9790/0661-0513136.

10

Seong, Thum Wei, M. Z. Ibrahim, and D. J. Mulvaney. "WADA-W: A Modified WADA SNR Estimator for Audio-Visual Speech Recognition." International Journal of Machine Learning and Computing 9, no. 4 (August 2019): 446–51. http://dx.doi.org/10.18178/ijmlc.2019.9.4.824.

11

Gornostal, Alexandr, and Yaroslaw Dorogyy. "Development of audio-visual speech recognition system." ScienceRise 12, no. 1 (December 30, 2017): 42–47. http://dx.doi.org/10.15587/2313-8416.2017.118212.

12

Rabi, Gihad. "Visual speech recognition by recurrent neural networks." Journal of Electronic Imaging 7, no. 1 (January 1, 1998): 61. http://dx.doi.org/10.1117/1.482627.

13

Noda, Kuniaki, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata. "Audio-visual speech recognition using deep learning." Applied Intelligence 42, no. 4 (December 20, 2014): 722–37. http://dx.doi.org/10.1007/s10489-014-0629-7.

14

Soundarya, B., R. Krishnaraj, and S. Mythili. "Visual Speech Recognition using Convolutional Neural Network." IOP Conference Series: Materials Science and Engineering 1084, no. 1 (March 1, 2021): 012020. http://dx.doi.org/10.1088/1757-899x/1084/1/012020.

15

Mishra, Saumya, Anup Kumar Gupta, and Puneet Gupta. "DARE: Deceiving Audio–Visual speech Recognition model." Knowledge-Based Systems 232 (November 2021): 107503. http://dx.doi.org/10.1016/j.knosys.2021.107503.

16

Grant, Ken W., Brian E. Walden, and Philip F. Seitz. "Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration." Journal of the Acoustical Society of America 103, no. 5 (May 1998): 2677–90. http://dx.doi.org/10.1121/1.422788.

17

Kubanek, M., J. Bobulski, and L. Adrjanowicz. "Characteristics of the use of coupled hidden Markov models for audio-visual polish speech recognition." Bulletin of the Polish Academy of Sciences: Technical Sciences 60, no. 2 (October 1, 2012): 307–16. http://dx.doi.org/10.2478/v10175-012-0041-6.

Abstract:
This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on coupled hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicated that they would also work similarly in continuous audio-visual speech recognition. Visual speech analysis is very difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed. Therefore, audio-visual speech recognition is used only when the audio speech signal is exposed to a considerable level of distortion. The authors' own methods for lip edge detection and visual feature extraction are proposed in this paper. Moreover, a method of fusing speech characteristics from the audio-video signal was proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of audio-visual characteristics. The experimental results were very promising and close to those achieved by leading scientists in the field of audio-visual speech recognition.
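As a rough illustration of the stream-fusion idea (not the authors' CHMM implementation), the following NumPy sketch combines per-state audio and visual log-likelihoods with a fixed stream weight and decodes an isolated command with Viterbi; the toy model topology and scores are assumptions.

```python
import numpy as np

def viterbi(log_trans, log_init, combined_loglik):
    """Standard log-domain Viterbi over a (T, N) emission log-likelihood matrix."""
    T, N = combined_loglik.shape
    delta = log_init + combined_loglik[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans               # (N, N): prev -> cur
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + combined_loglik[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(np.max(delta)), path[::-1]

def decode_av(audio_loglik, visual_loglik, log_trans, log_init, w_audio=0.8):
    """Fuse the two streams at the state level with a fixed weight, then decode."""
    combined = w_audio * audio_loglik + (1.0 - w_audio) * visual_loglik
    return viterbi(log_trans, log_init, combined)

# Toy 3-state left-to-right command model, 10 frames of random stream scores.
rng = np.random.default_rng(1)
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))
score, states = decode_av(rng.normal(size=(10, 3)),
                          rng.normal(size=(10, 3)),
                          log_trans, log_init)
print(score, states)
```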
18

Raghavan, Arun M., Noga Lipschitz, Joseph T. Breen, Ravi N. Samy, and Gavriel D. Kohlberg. "Visual Speech Recognition: Improving Speech Perception in Noise through Artificial Intelligence." Otolaryngology–Head and Neck Surgery 163, no. 4 (May 26, 2020): 771–77. http://dx.doi.org/10.1177/0194599820924331.

Abstract:
Objectives To compare speech perception (SP) in noise for normal-hearing (NH) individuals and individuals with hearing loss (IWHL) and to demonstrate improvements in SP with use of a visual speech recognition program (VSRP). Study Design Single-institution prospective study. Setting Tertiary referral center. Subjects and Methods Eleven NH and 9 IWHL participants were seated in a sound-isolated booth facing a speaker through a window. In non-VSRP conditions, SP was evaluated on 40 Bamford-Kowal-Bench speech-in-noise test (BKB-SIN) sentences presented by the speaker at 50 A-weighted decibels (dBA) with multiperson babble noise presented from 50 to 75 dBA. SP was defined as the percentage of words correctly identified. In VSRP conditions, an infrared camera was used to track 35 points around the speaker’s lips during speech in real time. Lip movement data were translated into speech-text via an in-house developed neural network–based VSRP. SP was evaluated similarly to the non-VSRP condition, on 42 BKB-SIN sentences, with the addition of the VSRP output presented on a screen to the listener. Results In high-noise conditions (70-75 dBA) without VSRP, NH listeners achieved significantly higher speech perception than IWHL listeners (38.7% vs 25.0%, P = .02). NH listeners were significantly more accurate with VSRP than without VSRP (75.5% vs 38.7%, P < .0001), as were IWHL listeners (70.4% vs 25.0%, P < .0001). With VSRP, no significant difference in SP was observed between NH and IWHL listeners (75.5% vs 70.4%, P = .15). Conclusions The VSRP significantly increased speech perception in high-noise conditions for NH and IWHL participants and eliminated the difference in SP accuracy between NH and IWHL listeners.
19

Cooke, Martin, Jon Barker, Stuart Cunningham, and Xu Shao. "An audio-visual corpus for speech perception and automatic speech recognition." Journal of the Acoustical Society of America 120, no. 5 (November 2006): 2421–24. http://dx.doi.org/10.1121/1.2229005.

20

Lalonde, Kaylah, and Rachael Frush Holt. "Preschoolers Benefit From Visually Salient Speech Cues." Journal of Speech, Language, and Hearing Research 58, no. 1 (February 2015): 135–50. http://dx.doi.org/10.1044/2014_jslhr-h-13-0343.

Abstract:
Purpose This study explored visual speech influence in preschoolers using 3 developmentally appropriate tasks that vary in perceptual difficulty and task demands. The authors also examined developmental differences in the ability to use visually salient speech cues and visual phonological knowledge. Method Twelve adults and 27 typically developing 3- and 4-year-old children completed 3 audiovisual (AV) speech integration tasks: matching, discrimination, and recognition. The authors compared AV benefit for visually salient and less visually salient speech discrimination contrasts and assessed the visual saliency of consonant confusions in auditory-only and AV word recognition. Results Four-year-olds and adults demonstrated visual influence on all measures. Three-year-olds demonstrated visual influence on speech discrimination and recognition measures. All groups demonstrated greater AV benefit for the visually salient discrimination contrasts. AV recognition benefit in 4-year-olds and adults depended on the visual saliency of speech sounds. Conclusions Preschoolers can demonstrate AV speech integration. Their AV benefit results from efficient use of visually salient speech cues. Four-year-olds, but not 3-year-olds, used visual phonological knowledge to take advantage of visually salient speech cues, suggesting possible developmental differences in the mechanisms of AV benefit.
21

ROGOZAN, ALEXANDRINA. "DISCRIMINATIVE LEARNING OF VISUAL DATA FOR AUDIOVISUAL SPEECH RECOGNITION." International Journal on Artificial Intelligence Tools 08, no. 01 (March 1999): 43–52. http://dx.doi.org/10.1142/s021821309900004x.

Abstract:
In recent years a number of techniques have been proposed to improve the accuracy and the robustness of automatic speech recognition in noisy environments. Among these, supplementing the acoustic information with visual data, mostly extracted from the speaker's lip shapes, has proved successful. We have already demonstrated the effectiveness of integrating visual data at two different levels during speech decoding according to both direct and separate identification strategies (DI+SI). This paper outlines methods for reinforcing visible speech recognition in the framework of separate identification. First, we define visual-specific units using a self-organizing mapping technique. Second, we complete a stochastic learning of these units with a discriminative neural-network-based technique for speech recognition purposes. Finally, we show on a connected-letter speech recognition task that using these methods improves the performance of the DI+SI-based system under varying noise-level conditions.
22

CAO, JIANGTAO, NAOYUKI KUBOTA, PING LI, and HONGHAI LIU. "THE VISUAL-AUDIO INTEGRATED RECOGNITION METHOD FOR USER AUTHENTICATION SYSTEM OF PARTNER ROBOTS." International Journal of Humanoid Robotics 08, no. 04 (December 2011): 691–705. http://dx.doi.org/10.1142/s0219843611002678.

Abstract:
Several noncontact biometric methods have been used in user authentication systems for partner robots, such as visual recognition methods and speech recognition. However, visual recognition methods are sensitive to light noise, and speech recognition systems are disturbed by the acoustic environment and sound noise. Inspired by the human capability of compensating visual information (looking) with audio information (hearing), a visual-audio integration method is proposed to deal with the disturbance of light noise and to improve recognition accuracy. In combination with PCA-based and 2DPCA-based face recognition, a two-stage speaker recognition algorithm is used to extract useful personal identity information from speech signals. Using the statistical properties of the visual background noise, the visual-audio integration method is applied to draw the final decision. The proposed method is evaluated on the public visual-audio dataset VidTIMIT and on a partner robot authentication system. The results verified that the visual-audio integration method can obtain satisfactory recognition results with strong robustness.
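A hedged sketch of the decision-level fusion idea follows: PCA-projected face features yield per-user visual scores that are fused with speaker-recognition scores, with the visual weight reduced as the estimated visual background noise grows. The weighting rule, score scales, and data are illustrative assumptions rather than the paper's method.

```python
import numpy as np
from sklearn.decomposition import PCA

def face_scores(train_faces, train_ids, probe_face, n_components=20):
    """PCA ('eigenface'-style) projection; negative minimum distance as a score."""
    pca = PCA(n_components=n_components).fit(train_faces)
    proj_train = pca.transform(train_faces)
    proj_probe = pca.transform(probe_face[None, :])[0]
    ids = np.array(train_ids)
    return {uid: -np.linalg.norm(proj_train[ids == uid] - proj_probe,
                                 axis=1).min()
            for uid in set(train_ids)}

def fuse(face_s, speaker_s, visual_noise_level, alpha=0.5):
    """Decision-level fusion; the visual weight shrinks as estimated noise grows."""
    w_v = alpha * np.exp(-visual_noise_level)
    users = set(face_s) & set(speaker_s)
    fused = {u: w_v * face_s[u] + (1.0 - w_v) * speaker_s[u] for u in users}
    return max(fused, key=fused.get)

# Toy data: 3 users, 64-dim face vectors, hypothetical speaker scores.
rng = np.random.default_rng(2)
train_faces = rng.normal(size=(30, 64))
train_ids = [i % 3 for i in range(30)]
probe = train_faces[0] + 0.1 * rng.normal(size=64)   # a noisy view of user 0
speaker_scores = {0: -1.2, 1: -3.5, 2: -2.8}
print(fuse(face_scores(train_faces, train_ids, probe),
           speaker_scores, visual_noise_level=0.3))   # expected: 0
```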
23

Ujiie, Yuta, and Kohske Takahashi. "Weaker McGurk Effect for Rubin’s Vase-Type Speech in People With High Autistic Traits." Multisensory Research 34, no. 6 (April 16, 2021): 663–79. http://dx.doi.org/10.1163/22134808-bja10047.

Abstract:
While visual information from facial speech modulates auditory speech perception, it is less influential on audiovisual speech perception among autistic individuals than among typically developed individuals. In this study, we investigated the relationship between autistic traits (Autism-Spectrum Quotient; AQ) and the influence of visual speech on the recognition of Rubin’s vase-type speech stimuli with degraded facial speech information. Participants were 31 university students (13 males and 18 females; mean age: 19.2, SD: 1.13 years) who reported normal (or corrected-to-normal) hearing and vision. All participants completed three speech recognition tasks (visual, auditory, and audiovisual stimuli) and the AQ–Japanese version. The results showed that accuracies of speech recognition for visual (i.e., lip-reading) and auditory stimuli were not significantly related to participants’ AQ. In contrast, audiovisual speech perception was less susceptible to facial speech perception among individuals with high rather than low autistic traits. The weaker influence of visual information on audiovisual speech perception in autism spectrum disorder (ASD) was robust regardless of the clarity of the visual information, suggesting a difficulty in the process of audiovisual integration rather than in the visual processing of facial speech.
24

LEE, Kyungsun, Minseok KEUM, David K. HAN, and Hanseok KO. "Visual Speech Recognition Using Weighted Dynamic Time Warping." IEICE Transactions on Information and Systems E98.D, no. 7 (2015): 1430–33. http://dx.doi.org/10.1587/transinf.2015edl8002.

25

TAMURA, Satoshi, Hiroshi NINOMIYA, Norihide KITAOKA, Shin OSUGA, Yurie IRIBE, Kazuya TAKEDA, and Satoru HAYAMIZU. "Investigation of DNN-Based Audio-Visual Speech Recognition." IEICE Transactions on Information and Systems E99.D, no. 10 (2016): 2444–51. http://dx.doi.org/10.1587/transinf.2016slp0019.

26

Stork, David G. "Neural network acoustic and visual speech recognition system." Journal of the Acoustical Society of America 102, no. 3 (September 1997): 1282. http://dx.doi.org/10.1121/1.420021.

27

HASHIMOTO, Masahiro, and Masaharu KUMASHIRO. "Intermodal Timing Cues for Audio-Visual Speech Recognition." Journal of UOEH 26, no. 2 (2004): 215–25. http://dx.doi.org/10.7888/juoeh.26.215.

28

Huang, Jing, Gerasimos Potamianos, Jonathan Connell, and Chalapathy Neti. "Audio-visual speech recognition using an infrared headset." Speech Communication 44, no. 1-4 (October 2004): 83–96. http://dx.doi.org/10.1016/j.specom.2004.10.007.

29

Nankaku, Yoshihiko, Keiichi Tokuda, Tadashi Kitamura, and Takao Kobayashi. "Normalized training for HMM-Based visual speech recognition." Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 89, no. 11 (2006): 40–50. http://dx.doi.org/10.1002/ecjc.20281.

30

Tinnemore, Anna R., Sandra Gordon-Salant, and Matthew J. Goupell. "Audiovisual Speech Recognition With a Cochlear Implant and Increased Perceptual and Cognitive Demands." Trends in Hearing 24 (January 2020): 233121652096060. http://dx.doi.org/10.1177/2331216520960601.

Abstract:
Speech recognition in complex environments involves focusing on the most relevant speech signal while ignoring distractions. Difficulties can arise due to the incoming signal’s characteristics (e.g., accented pronunciation, background noise, distortion) or the listener’s characteristics (e.g., hearing loss, advancing age, cognitive abilities). Listeners who use cochlear implants (CIs) must overcome these difficulties while listening to an impoverished version of the signals available to listeners with normal hearing (NH). In the real world, listeners often attempt tasks concurrent with, but unrelated to, speech recognition. This study sought to reveal the effects of visual distraction and performing a simultaneous visual task on audiovisual speech recognition. Two groups, those with CIs and those with NH listening to vocoded speech, were presented videos of unaccented and accented talkers with and without visual distractions, and with a secondary task. It was hypothesized that, compared with those with NH, listeners with CIs would be less influenced by visual distraction or a secondary visual task because their prolonged reliance on visual cues to aid auditory perception improves the ability to suppress irrelevant information. Results showed that visual distractions alone did not significantly decrease speech recognition performance for either group, but adding a secondary task did. Speech recognition was significantly poorer for accented compared with unaccented speech, and this difference was greater for CI listeners. These results suggest that speech recognition performance is likely more dependent on incoming signal characteristics than a difference in adaptive strategies for managing distractions between those who listen with and without a CI.
31

Hazen, T. J. "Visual model structures and synchrony constraints for audio-visual speech recognition." IEEE Transactions on Audio, Speech and Language Processing 14, no. 3 (May 2006): 1082–89. http://dx.doi.org/10.1109/tsa.2005.857572.

32

Untari, Lilik, SF Luthfie Arguby Purnomo, Nur Asiyah, and Muhammad Zainal Muttaqien. "Speaker-Dependent Based Speech Recognition." Register Journal 9, no. 1 (September 23, 2016): 1. http://dx.doi.org/10.18326/rgt.v9i1.1-12.

Abstract:
This is the first of two parts of a qualitative, focused R&D study aimed at designing an application to assist students with visual impairment (VI) in learning English writing and reading skills. The designed application was a speaker-dependent speech recognition system. Alpha and beta testing revealed that MAKTUM, the name of the application, has weaknesses in the selection of Ogden’s Basic English as the linguistic resource for the application and in the recording complexities. On the other hand, MAKTUM displayed strengths in individualized pronunciation and simple, easy-to-operate interfaces.
33

Untari, Lilik, SF Luthfie Arguby Purnomo, Nur Asiyah, and Muhammad Zainal Muttaqien. "Speaker-Dependent Based Speech Recognition." Register Journal 9, no. 1 (September 23, 2016): 1. http://dx.doi.org/10.18326/rgt.v9i1.512.

Abstract:
This is the first of two parts of a qualitative, focused R&D study aimed at designing an application to assist students with visual impairment (VI) in learning English writing and reading skills. The designed application was a speaker-dependent speech recognition system. Alpha and beta testing revealed that MAKTUM, the name of the application, has weaknesses in the selection of Ogden’s Basic English as the linguistic resource for the application and in the recording complexities. On the other hand, MAKTUM displayed strengths in individualized pronunciation and simple, easy-to-operate interfaces.
34

SINGH, PREETY, VIJAY LAXMI, and MANOJ SINGH GAUR. "NEAR-OPTIMAL GEOMETRIC FEATURE SELECTION FOR VISUAL SPEECH RECOGNITION." International Journal of Pattern Recognition and Artificial Intelligence 27, no. 08 (December 2013): 1350026. http://dx.doi.org/10.1142/s0218001413500262.

Abstract:
To improve the accuracy of visual speech recognition systems, selection of visual features is of fundamental importance. Prominent features, which are of maximum relevance for speech classification, need to be selected from a large set of extracted visual attributes. Existing methods apply feature reduction and selection techniques on image pixels constituting region-of-interest (ROI) to reduce data dimensionality. We propose application of feature selection methods on geometrical features to select the most dominant physical features. Two techniques, Minimum Redundancy Maximum Relevance (mRMR) and Correlation-based Feature Selection (CFS), have been applied on the extracted visual features. Experimental results show that recognition accuracy is not compromised when a few selected features from the complete visual feature set are used for classification, thereby reducing processing time and storage overheads considerably. Results are compared with performance of principal components obtained by application of Principal Component Analysis (PCA) on our dataset. Our set of selected features outperforms the PCA transformed data. Results show that the center and corner segments of the mouth are major contributors to visual speech recognition. Teeth pixels are shown to be a prominent visual cue. It is also seen that lip width contributes more towards visual speech recognition accuracy as compared to lip height.
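For readers who want to try the selection step, below is a small greedy mRMR implementation over geometric lip features using scikit-learn's mutual-information estimators. The feature set, class labels, and scoring details are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Greedy minimum-Redundancy-Maximum-Relevance feature selection."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)   # MI(feature; class)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean MI between the candidate and already-selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy data: 200 samples, 8 hypothetical geometric lip features
# (e.g. lip width, lip height, corner angles, ...), 4 viseme-like classes.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int) + 2 * (X[:, 5] > 0).astype(int)
print(mrmr_select(X, y, k=3))
```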
35

Treille, Avril, Coriandre Vilain, Thomas Hueber, Laurent Lamalle, and Marc Sato. "Inside Speech: Multisensory and Modality-specific Processing of Tongue and Lip Speech Actions." Journal of Cognitive Neuroscience 29, no. 3 (March 2017): 448–66. http://dx.doi.org/10.1162/jocn_a_01057.

Abstract:
Action recognition has been found to rely not only on sensory brain areas but also partly on the observer's motor system. However, whether distinct auditory and visual experiences of an action modulate sensorimotor activity remains largely unknown. In the present sparse sampling fMRI study, we determined to which extent sensory and motor representations interact during the perception of tongue and lip speech actions. Tongue and lip speech actions were selected because tongue movements of our interlocutor are accessible via their impact on speech acoustics but not visible because of its position inside the vocal tract, whereas lip movements are both “audible” and visible. Participants were presented with auditory, visual, and audiovisual speech actions, with the visual inputs related to either a sagittal view of the tongue movements or a facial view of the lip movements of a speaker, previously recorded by an ultrasound imaging system and a video camera. Although the neural networks involved in visual visuolingual and visuofacial perception largely overlapped, stronger motor and somatosensory activations were observed during visuolingual perception. In contrast, stronger activity was found in auditory and visual cortices during visuofacial perception. Complementing these findings, activity in the left premotor cortex and in visual brain areas was found to correlate with visual recognition scores observed for visuolingual and visuofacial speech stimuli, respectively, whereas visual activity correlated with RTs for both stimuli. These results suggest that unimodal and multimodal processing of lip and tongue speech actions rely on common sensorimotor brain areas. They also suggest that visual processing of audible but not visible movements induces motor and visual mental simulation of the perceived actions to facilitate recognition and/or to learn the association between auditory and visual signals.
36

Miller, Rachel E., Courtney Strickland, and Daniel Fogerty. "Multimodal recognition of interrupted speech: Benefit from text and visual speech cues." Journal of the Acoustical Society of America 144, no. 3 (September 2018): 1800. http://dx.doi.org/10.1121/1.5067942.

37

S., Manisha, Nafisa H. Saida, Nandita Gopal, and Roshni P. Anand. "Bimodal Emotion Recognition using Machine Learning." International Journal of Engineering and Advanced Technology 10, no. 4 (April 30, 2021): 189–94. http://dx.doi.org/10.35940/ijeat.d2451.0410421.

Abstract:
The predominant communication channel for conveying relevant and high-impact information is the emotion embedded in our communications. In recent years, researchers have tried to exploit these emotions for human-robot interaction (HRI) and human-computer interaction (HCI). Emotion recognition through speech or through facial expression alone is termed single-mode emotion recognition. The accuracy of single-mode emotion recognition is improved by the proposed bimodal method, which combines the speech and face modalities and recognizes emotions using a Convolutional Neural Network (CNN) model. The proposed bimodal emotion recognition system contains three major parts: processing of audio, processing of video, and fusion of the data for detecting the emotion of a person. The fusion of visual information and audio data obtained from two different channels enhances the emotion recognition rate by providing complementary data. The proposed method aims to classify 7 basic emotions (anger, disgust, fear, happy, neutral, sad, surprise) from an input video. We take the audio and an image frame from the video input to predict the final emotion of a person. The dataset used, RAVDESS, is an audio-visual dataset uniquely suited to the study of multi-modal emotion expression and perception; it contains audio-visual, visual-only, and audio-only material, and the audio-visual portion is used for bimodal emotion detection.
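A minimal PyTorch-style sketch of the late-fusion idea is given below: class probabilities from an audio branch and a face branch are combined with a fixed weight to produce the final emotion decision. The tiny CNN branches, input shapes, and fusion weight are assumptions for illustration and not the paper's architecture.

```python
import torch
import torch.nn as nn

EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

class SmallCNN(nn.Module):
    """A tiny CNN branch; real audio/face branches would be much deeper."""
    def __init__(self, in_channels, n_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.classifier = nn.Linear(8 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def fuse_predict(audio_cnn, face_cnn, spectrogram, face_img, w_audio=0.5):
    """Late (decision-level) fusion: weighted average of the two softmax outputs."""
    with torch.no_grad():
        p_audio = torch.softmax(audio_cnn(spectrogram), dim=1)
        p_face = torch.softmax(face_cnn(face_img), dim=1)
    p = w_audio * p_audio + (1.0 - w_audio) * p_face
    return EMOTIONS[int(p.argmax(dim=1))]

# Toy inputs: a 1-channel log-mel spectrogram and a 3-channel face crop.
audio_cnn, face_cnn = SmallCNN(1), SmallCNN(3)
spec = torch.randn(1, 1, 64, 64)
face = torch.randn(1, 3, 64, 64)
print(fuse_predict(audio_cnn, face_cnn, spec, face))
```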
38

Upadhyaya, Prashant, Omar Farooq, M. R. Abidi, and Priyanka Varshney. "Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition." Archives of Acoustics 40, no. 4 (December 1, 2015): 609–19. http://dx.doi.org/10.1515/aoa-2015-0061.

Abstract:
In building speech-recognition-based applications, robustness to different noisy background conditions is an important challenge. In this paper, a bimodal approach is proposed to improve the robustness of a Hindi speech recognition system. The importance of different types of visual features is also studied for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the Two-Dimensional Discrete Cosine Transform (2D-DCT), Principal Component Analysis (PCA), the Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT), and the Two-Dimensional Discrete Wavelet Transform followed by PCA (2D-DWT-PCA). The audio features are extracted using Mel Frequency Cepstral Coefficients (MFCC) followed by static and dynamic features. Overall, 48 features, i.e. 39 audio features and 9 visual features, are used for measuring the performance of the AVASR system. The performance of the AVASR system on noisy speech generated using the NOISEX database is also evaluated at different signal-to-noise ratios (SNR: 30 dB to −10 dB) using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality continuous-speech audio-visual database of Hindi sentences spoken by different subjects.
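The following SciPy sketch illustrates one of the reported visual feature sets: a 2D-DCT of a grayscale mouth ROI, keeping the lowest-frequency coefficients as a 9-dimensional visual vector that is concatenated with a 39-dimensional MFCC-based audio vector. The ROI size, coefficient ordering, and toy inputs are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(roi):
    """2D type-II DCT with orthonormal scaling."""
    return dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")

def visual_features(mouth_roi, n_coeffs=9):
    """Keep the n_coeffs lowest-frequency 2D-DCT coefficients
    (diagonal, zigzag-like order)."""
    coeffs = dct2(mouth_roi.astype(float))
    h, w = coeffs.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in order[:n_coeffs]])

def av_feature_vector(mouth_roi, mfcc_39):
    """Concatenate 39 audio and 9 visual features into a 48-dim vector."""
    return np.concatenate([mfcc_39, visual_features(mouth_roi)])

# Toy example: a 32x48 grayscale mouth ROI and a random 39-dim MFCC vector.
rng = np.random.default_rng(4)
roi = rng.integers(0, 256, size=(32, 48))
mfcc = rng.normal(size=39)
print(av_feature_vector(roi, mfcc).shape)  # (48,)
```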
39

Ivanko, D., and D. Ryumin. "A NOVEL TASK-ORIENTED APPROACH TOWARD AUTOMATED LIP-READING SYSTEM IMPLEMENTATION." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIV-2/W1-2021 (April 15, 2021): 85–89. http://dx.doi.org/10.5194/isprs-archives-xliv-2-w1-2021-85-2021.

Abstract:
Visual information plays a key role in automatic speech recognition (ASR) when audio is corrupted by background noise, or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with no or limited access to the sound of the voice. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach towards practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. To a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or short phrases (e.g., telephone numbers with a strict grammar, or keywords); second, recognizing continuous speech (phrases or sentences). All these stages are disclosed in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems of three different architectures: GMM-CHMM, DNN-HMM, and purely end-to-end. A description of the methodology, tools, step-by-step development, and all necessary parameters is given in detail in the current paper. It is worth noting that such systems were created for Russian speech recognition for the first time.
40

Su, Rongfeng, Xunying Liu, Lan Wang, and Jingzhou Yang. "Cross-Domain Deep Visual Feature Generation for Mandarin Audio–Visual Speech Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 185–97. http://dx.doi.org/10.1109/taslp.2019.2950602.

41

Díaz, Begoña, Helen Blank, and Katharina von Kriegstein. "Task-dependent modulation of the visual sensory thalamus assists visual-speech recognition." NeuroImage 178 (September 2018): 721–34. http://dx.doi.org/10.1016/j.neuroimage.2018.05.032.

42

Drummond, Sakina S., Jess Dancer, Bettie E. Casey, and Pat O'Sullivan. "Visual Recognition Training of Older Adults with Speech Spectrograms." Perceptual and Motor Skills 82, no. 2 (April 1996): 379–82. http://dx.doi.org/10.2466/pms.1996.82.2.379.

Abstract:
Ten adults' performances on visual recognition training with speech spectrograms were examined. All subjects completed the training within eight 1-hr. sessions. Success and retention of training were also evident in the subjects' performances on two posttests.
43

Zhang, Xuejie, Yan Xu, Andrew K. Abel, Leslie S. Smith, Roger Watt, Amir Hussain, and Chengxiang Gao. "Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features." Entropy 22, no. 12 (December 3, 2020): 1367. http://dx.doi.org/10.3390/e22121367.

Abstract:
Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy-to-extract 3D geometric features (produced using Gabor-based image patches) can successfully be used for speech recognition with LSTM-based machine learning. This approach can successfully extract low-dimensionality lip parameters with a minimum of processing. One key difference between using these Gabor-based features and using other features, such as traditional DCT or the currently fashionable CNN features, is that these are human-centric features that can be visualised and analysed by humans. This means that it is easier to explain and visualise the results. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
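To give a feel for Gabor-based lip features of this kind, here is an illustrative fragment that filters a lip region with a small Gabor bank and pools each response into per-frame statistics suitable for an LSTM sequence classifier. The kernel parameters, pooling choices, and data are assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=15, sigma=3.0, theta=0.0, wavelength=6.0):
    """Build a real-valued Gabor kernel (Gaussian envelope times a cosine carrier)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * xr / wavelength))

def lip_frame_features(lip_roi,
                       orientations=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter the lip ROI with a small Gabor bank and pool each response into
    mean/std statistics, giving a low-dimensional per-frame vector."""
    feats = []
    for theta in orientations:
        resp = convolve2d(lip_roi, gabor_kernel(theta=theta), mode="same")
        feats.extend([resp.mean(), resp.std()])
    return np.array(feats)

# Toy sequence: 20 frames of a 32x64 grayscale lip region.
rng = np.random.default_rng(5)
frames = rng.normal(size=(20, 32, 64))
sequence = np.stack([lip_frame_features(f) for f in frames])  # shape (20, 8)
print(sequence.shape)  # per-frame features ready to feed an LSTM classifier
```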
44

Yau, Wai Chee, Dinesh Kant Kumar, and Sridhar Poosapadi Arjunan. "Visual recognition of speech consonants using facial movement features." Integrated Computer-Aided Engineering 14, no. 1 (January 18, 2007): 49–61. http://dx.doi.org/10.3233/ica-2007-14105.

45

Mian Qaisar, Saeed. "Isolated Speech Recognition and Its Transformation in Visual Signs." Journal of Electrical Engineering & Technology 14, no. 2 (January 23, 2019): 955–64. http://dx.doi.org/10.1007/s42835-018-00071-z.

46

Connell, Jonathan H. "Audio-only backoff in audio-visual speech recognition system." Journal of the Acoustical Society of America 125, no. 6 (2009): 4109. http://dx.doi.org/10.1121/1.3155497.

47

Gurban, M., and J. P. Thiran. "Information Theoretic Feature Extraction for Audio-Visual Speech Recognition." IEEE Transactions on Signal Processing 57, no. 12 (December 2009): 4765–76. http://dx.doi.org/10.1109/tsp.2009.2026513.

48

Saenko, K., K. Livescu, J. Glass, and T. Darrell. "Multistream Articulatory Feature-Based Models for Visual Speech Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 31, no. 9 (September 2009): 1700–1707. http://dx.doi.org/10.1109/tpami.2008.303.

49

Lee, Jong-Seok, and Cheol Hoon Park. "Robust Audio-Visual Speech Recognition Based on Late Integration." IEEE Transactions on Multimedia 10, no. 5 (August 2008): 767–79. http://dx.doi.org/10.1109/tmm.2008.922789.

50

Estellers, Virginia, Mihai Gurban, and Jean-Philippe Thiran. "On Dynamic Stream Weighting for Audio-Visual Speech Recognition." IEEE Transactions on Audio, Speech, and Language Processing 20, no. 4 (May 2012): 1145–57. http://dx.doi.org/10.1109/tasl.2011.2172427.
