Journal articles on the topic 'Audio speech recognition'

To see the other types of publications on this topic, follow the link: Audio speech recognition.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Audio speech recognition.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Beadles, Robert L. "Audio visual speech recognition." Journal of the Acoustical Society of America 87, no. 5 (May 1990): 2274. http://dx.doi.org/10.1121/1.399137.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Bahal, Akriti. "Advances in Automatic Speech Recognition: From Audio-Only To Audio-Visual Speech Recognition." IOSR Journal of Computer Engineering 5, no. 1 (2012): 31–36. http://dx.doi.org/10.9790/0661-0513136.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Hwang, Jung-Wook, Jeongkyun Park, Rae-Hong Park, and Hyung-Min Park. "Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition." Applied Acoustics 211 (August 2023): 109478. http://dx.doi.org/10.1016/j.apacoust.2023.109478.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Nakadai, Kazuhiro, and Tomoaki Koiwa. "Psychologically-Inspired Audio-Visual Speech Recognition Using Coarse Speech Recognition and Missing Feature Theory." Journal of Robotics and Mechatronics 29, no. 1 (February 20, 2017): 105–13. http://dx.doi.org/10.20965/jrm.2017.p0105.

Full text
Abstract:
[Figure: System architecture of AVSR based on missing feature theory and P-V grouping] Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. For AVSR, the auditory and visual units are the phoneme and viseme, respectively. However, these are often misclassified in the real world because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT) to cope with missing or unreliable audio and visual features for recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that these two approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate weight improves the recognition performance even at −5 dB SNR, a noisy condition in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved the AVSR performance, particularly at a low signal-to-noise ratio. This work is an extension of our publication: Tomoaki Koiwa et al., “Coarse speech recognition by audio-visual integration based on missing feature theory,” IROS 2007, pp. 1751–1756, 2007.
APA, Harvard, Vancouver, ISO, and other styles
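As a rough illustration of the weighted audio-visual integration described in the entry above, the following Python sketch combines per-class scores from independent audio and visual models using a reliability weight; the weighting function and variable names are illustrative assumptions, not the authors' MFT implementation.

```python
import numpy as np

def fuse_stream_scores(audio_loglik, video_loglik, audio_reliability):
    # Hypothetical MFT-style fusion: down-weight the audio stream when its
    # features are judged unreliable (e.g. at low estimated SNR).
    w = float(np.clip(audio_reliability, 0.0, 1.0))  # 1.0 = fully trust audio
    return w * audio_loglik + (1.0 - w) * video_loglik

# Toy usage: per-class log-likelihoods from separate audio and visual models.
audio_scores = np.array([-12.3, -9.8, -15.1])   # e.g. phoneme-level scores
video_scores = np.array([-10.5, -11.2, -9.9])   # corresponding viseme-level scores
fused = fuse_stream_scores(audio_scores, video_scores, audio_reliability=0.3)
print(int(np.argmax(fused)))                     # index of the winning class
```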
5

Basystiuk, Oleh, and Nataliia Melnykova. "Multimodal Speech Recognition Based on Audio and Text Data." Herald of Khmelnytskyi National University. Technical Sciences 313, no. 5 (October 27, 2022): 22–25. http://dx.doi.org/10.31891/2307-5732-2022-313-5-22-25.

Full text
Abstract:
Systems of machine translation of texts from one language to another simulate the work of a human translator. Their performance depends on the ability to understand the grammar rules of the language. In translation, the basic units are not individual words, but word combinations or phraseological units that express different concepts. Only by using them can more complex ideas be expressed through the translated text. The main feature of machine translation is the different length of input and output. The ability to work with different lengths of input and output is provided by the approach of recurrent neural networks. A recurrent neural network (RNN) is a class of artificial neural network that has connections between nodes. In this case, a connection refers to a connection from a more distant node to a less distant node. The presence of connections allows the RNN to remember and reproduce the entire sequence of reactions to one stimulus. From the point of view of programming, such networks are analogous to cyclic execution, and from the point of view of the system, such networks are equivalent to a state machine. RNNs are commonly used to process word sequences in natural language processing. Usually, a hidden Markov model (HMM) and an N-gram language model are used to process a sequence of words. Deep learning has completely changed the approach to machine translation. Researchers in the deep learning field have created simple solutions based on machine learning that outperform the best expert systems. This paper reviews the main features of machine translation based on recurrent neural networks. The advantages of systems based on RNNs using the sequence-to-sequence model over statistical translation systems are also highlighted. Two machine translation systems based on the sequence-to-sequence model were constructed using the Keras and PyTorch machine learning libraries. Based on the obtained results, the libraries were analyzed and their performance compared.
APA, Harvard, Vancouver, ISO, and other styles
6

Dupont, S., and J. Luettin. "Audio-visual speech modeling for continuous speech recognition." IEEE Transactions on Multimedia 2, no. 3 (2000): 141–51. http://dx.doi.org/10.1109/6046.865479.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Kubanek, M., J. Bobulski, and L. Adrjanowicz. "Characteristics of the use of coupled hidden Markov models for audio-visual polish speech recognition." Bulletin of the Polish Academy of Sciences: Technical Sciences 60, no. 2 (October 1, 2012): 307–16. http://dx.doi.org/10.2478/v10175-012-0041-6.

Full text
Abstract:
This paper focuses on combining audio-visual signals for Polish speech recognition in conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on coupled hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicated that they would also work similarly in continuous audio-visual speech recognition. The problem of visual speech analysis is very difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed. Therefore, the audio-video speech recognition method is used only while the audio speech signal is exposed to a considerable level of distortion. The authors' own methods for lip edge detection and visual feature extraction are proposed in this paper. Moreover, a method of fusing speech characteristics for an audio-video signal was proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, for properly selected CHMM parameters and an adequate codebook size, together with the use of an appropriate fusion of audio-visual characteristics. The experimental results were very promising and close to those achieved by leading scientists in the field of audio-visual speech recognition.
APA, Harvard, Vancouver, ISO, and other styles
8

Kacur, Juraj, Boris Puterka, Jarmila Pavlovicova, and Milos Oravec. "Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications." Sensors 22, no. 16 (August 22, 2022): 6304. http://dx.doi.org/10.3390/s22166304.

Full text
Abstract:
There are many speech and audio processing applications and their number is growing. They may cover a wide range of tasks, each having different requirements on the processed speech or audio signals and, therefore, indirectly, on the audio sensors as well. This article reports on tests and evaluation of the effect of basic physical properties of speech and audio signals on the recognition accuracy of major speech/audio processing applications, i.e., speech recognition, speaker recognition, speech emotion recognition, and audio event recognition. A particular focus is on frequency ranges, time intervals, a precision of representation (quantization), and complexities of models suitable for each class of applications. Using domain-specific datasets, eligible feature extraction methods and complex neural network models, it was possible to test and evaluate the effect of basic speech and audio signal properties on the achieved accuracies for each group of applications. The tests confirmed that the basic parameters do affect the overall performance and, moreover, this effect is domain-dependent. Therefore, accurate knowledge of the extent of these effects can be valuable for system designers when selecting appropriate hardware, sensors, architecture, and software for a particular application, especially in the case of limited resources.
APA, Harvard, Vancouver, ISO, and other styles
9

Dar, Showkat Ahmad. "Emotion Recognition Based On Audio Speech." IOSR Journal of Computer Engineering 11, no. 6 (2013): 46–50. http://dx.doi.org/10.9790/0661-1164650.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Aucouturier, Jean-Julien, and Laurent Daudet. "Pattern recognition of non-speech audio." Pattern Recognition Letters 31, no. 12 (September 2010): 1487–88. http://dx.doi.org/10.1016/j.patrec.2010.05.003.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Chaturvedi, Iti, Tim Noel, and Ranjan Satapathy. "Speech Emotion Recognition Using Audio Matching." Electronics 11, no. 23 (November 29, 2022): 3943. http://dx.doi.org/10.3390/electronics11233943.

Full text
Abstract:
It has become popular for people to share their opinions about products on TikTok and YouTube. Automatic sentiment extraction on a particular product can assist users in making buying decisions. For videos in languages such as Spanish, the tone of voice can be used to determine sentiments, since the translation is often unknown. In this paper, we propose a novel algorithm to classify sentiments in speech in the presence of environmental noise. Traditional models rely on pretrained audio feature extractors for humans that do not generalize well across different accents. In this paper, we leverage the vector space of emotional concepts where words with similar meanings often have the same prefix. For example, words starting with ‘con’ or ‘ab’ signify absence and hence negative sentiments. Augmentations are a popular way to amplify the training data during audio classification. However, some augmentations may result in a loss of accuracy. Hence, we propose a new metric based on eigenvalues to select the best augmentations. We evaluate the proposed approach on emotions in YouTube videos and outperform baselines in the range of 10–20%. Each neuron learns words with similar pronunciations and emotions. We also use the model to determine the presence of birds from audio recordings in the city.
APA, Harvard, Vancouver, ISO, and other styles
12

Gnanamanickam, Jenifa, Yuvaraj Natarajan, and Sri Preethaa K. R. "A Hybrid Speech Enhancement Algorithm for Voice Assistance Application." Sensors 21, no. 21 (October 23, 2021): 7025. http://dx.doi.org/10.3390/s21217025.

Full text
Abstract:
In recent years, speech recognition technology has become a more common notion. Speech quality and intelligibility are critical for the convenience and accuracy of information transmission in speech recognition. The speech processing systems used to converse or store speech are usually designed for an environment without any background noise. However, in a real-world atmosphere, background intervention in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and exhausting the listener. When communication systems’ input or output signals are affected by noise, speech enhancement techniques try to improve their performance. To ensure the correctness of the text produced from speech, it is necessary to reduce the external noises involved in the speech audio. Reducing the external noise in audio is difficult as the speech can be of single, continuous or spontaneous words. In automatic speech recognition, there are various typical speech enhancement algorithms available that have gained considerable attention. However, these enhancement algorithms work well in simple and continuous audio signals only. Thus, in this study, a hybridized speech recognition algorithm to enhance the speech recognition accuracy is proposed. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the Hidden Markov Model and tested with 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with those of various typical speech enhancement algorithms, such as iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction. The proposed cascaded hybrid algorithm was found to achieve a minimum word error rate of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. The cascading of the speech enhancement and speech-to-text conversion architectures results in higher accuracy for enhanced speech recognition. The evaluation results confirm the incorporation of the proposed method with real-time automatic speech recognition medical applications where the complexity of terms involved is high.
APA, Harvard, Vancouver, ISO, and other styles
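The entry above builds on spectral subtraction for speech enhancement. As a point of reference, here is a minimal Python sketch of plain magnitude spectral subtraction with overlap-add resynthesis; it assumes a noise-only segment is available and does not reproduce the paper's non-linear, HMM-optimized variant.

```python
import numpy as np

def spectral_subtraction(noisy, noise_segment, frame_len=512, hop=256, floor=0.002):
    """Subtract an average noise magnitude spectrum from each frame,
    keep the noisy phase, and resynthesise with overlap-add."""
    window = np.hanning(frame_len)
    noise_mag = np.abs(np.fft.rfft(noise_segment[:frame_len] * window))
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)  # spectral floor
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len)
        out[start:start + frame_len] += clean * window                 # overlap-add
    return out
```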
13

Connell, Jonathan H. "Audio-only backoff in audio-visual speech recognition system." Journal of the Acoustical Society of America 125, no. 6 (2009): 4109. http://dx.doi.org/10.1121/1.3155497.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Hazra, Sumon Kumar, Romana Rahman Ema, Syed Md Galib, Shalauddin Kabir, and Nasim Adnan. "Emotion recognition of human speech using deep learning method and MFCC features." Radioelectronic and Computer Systems, no. 4 (November 29, 2022): 161–72. http://dx.doi.org/10.32620/reks.2022.4.13.

Full text
Abstract:
Subject matter: Speech emotion recognition (SER) is an ongoing, interesting research topic. Its purpose is to establish interactions between humans and computers through speech and emotion. To recognize speech emotions, five deep learning models are used in this paper: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Artificial Neural Network, Multi-Layer Perceptron, and a merged CNN-LSTM network. The Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets were used for this system. The models were trained on three merged combinations: TESS+SAVEE, TESS+RAVDESS, and TESS+SAVEE+RAVDESS. These datasets contain numerous audio clips spoken by both male and female speakers of English. This paper classifies seven emotions (sadness, happiness, anger, fear, disgust, neutral, and surprise), which is a challenge when both male and female data are used; most previous work has used male-only or female-only speech, and combined male-female datasets have yielded low accuracy in emotion detection tasks. Features need to be extracted by a feature extraction technique to train a deep learning model on audio data. Mel Frequency Cepstral Coefficients (MFCCs) extract all the necessary features from the audio data for speech emotion classification. After training the five models with the three datasets, the best accuracy of 84.35% is achieved by CNN-LSTM with the TESS+SAVEE dataset.
APA, Harvard, Vancouver, ISO, and other styles
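MFCC extraction, as used in the entry above, is typically a few lines with librosa. The sketch below averages the coefficients over time to obtain a fixed-length vector per clip; the parameter values and the time-averaging step are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=40, sr=22050):
    # Load a clip and summarise its MFCC matrix by the mean over time,
    # giving one fixed-length vector per file for a classifier.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return np.mean(mfcc, axis=1)                            # shape: (n_mfcc,)

# X = np.stack([mfcc_features(p) for p in wav_paths])  # feature matrix for model training
```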
15

Ryumin, Dmitry, Denis Ivanko, and Elena Ryumina. "Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices." Sensors 23, no. 4 (February 17, 2023): 2284. http://dx.doi.org/10.3390/s23042284.

Full text
Abstract:
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices.
APA, Harvard, Vancouver, ISO, and other styles
16

Jeon, Sanghun, and Mun Sang Kim. "Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications." Sensors 22, no. 20 (October 12, 2022): 7738. http://dx.doi.org/10.3390/s22207738.

Full text
Abstract:
Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user–system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.
APA, Harvard, Vancouver, ISO, and other styles
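The entry above concatenates a word vector derived from a speech API with a visual feature vector before classification. A minimal sketch of that kind of feature-level fusion is shown below; the normalisation step and the vector dimensions are assumptions for illustration.

```python
import numpy as np

def fuse_by_concatenation(word_vec, visual_vec):
    # L2-normalise each modality so neither dominates, then concatenate
    # into a single input vector for the downstream classifier.
    a = word_vec / (np.linalg.norm(word_vec) + 1e-9)
    v = visual_vec / (np.linalg.norm(visual_vec) + 1e-9)
    return np.concatenate([a, v])

fused = fuse_by_concatenation(np.random.rand(300), np.random.rand(512))
print(fused.shape)   # (812,) -> fed to the final classification stage
```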
17

Salama, Elham S., Reda A. El-Khoribi, and Mahmoud E. Shoman. "Audio-Visual Speech Recognition for People with Speech Disorders." International Journal of Computer Applications 96, no. 2 (June 18, 2014): 51–56. http://dx.doi.org/10.5120/16770-6337.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Reggiswarashari, Fauzivy, and Sari Widya Sihwi. "Speech emotion recognition using 2D-convolutional neural network." International Journal of Electrical and Computer Engineering (IJECE) 12, no. 6 (December 1, 2022): 6594. http://dx.doi.org/10.11591/ijece.v12i6.pp6594-6601.

Full text
Abstract:
This research proposes a speech emotion recognition model to predict human emotions using a convolutional neural network (CNN) by learning segmented audio of specific emotions. Speech emotion recognition utilizes features extracted from audio waves to learn speech emotion characteristics; one of them is the mel frequency cepstral coefficient (MFCC). The dataset plays a vital role in obtaining valuable results in model learning, hence this research leverages dataset combination. The model learns a combined dataset with audio segmentation and zero padding using a 2D-CNN. Audio segmentation and zero padding equalize the extracted audio features so the characteristics can be learned. The model achieves 83.69% accuracy in predicting seven emotions (neutral, happy, sad, angry, fear, disgust, and surprise) from the combined dataset with segmentation of the audio files.
APA, Harvard, Vancouver, ISO, and other styles
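Zero padding, as mentioned in the entry above, simply brings every clip's feature matrix to the same size so it can be fed to a 2D-CNN. The sketch below pads or truncates MFCC matrices along the time axis; the target length and channel handling are illustrative assumptions.

```python
import numpy as np

def pad_or_truncate(mfcc, target_frames=200):
    # Zero-pad (or truncate) an (n_mfcc, n_frames) matrix along the time axis
    # so every clip yields an equally sized 2-D input for the CNN.
    n_mfcc, n_frames = mfcc.shape
    if n_frames >= target_frames:
        return mfcc[:, :target_frames]
    padded = np.zeros((n_mfcc, target_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded

# batch = np.stack([pad_or_truncate(m) for m in mfcc_list])[..., np.newaxis]  # add channel axis
```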
19

S., Manisha, Nafisa H. Saida, Nandita Gopal, and Roshni P. Anand. "Bimodal Emotion Recognition using Machine Learning." International Journal of Engineering and Advanced Technology 10, no. 4 (April 30, 2021): 189–94. http://dx.doi.org/10.35940/ijeat.d2451.0410421.

Full text
Abstract:
The predominant communication channel to convey relevant and high-impact information is the emotion embedded in our communications. Researchers have tried to exploit these emotions in recent years for human-robot interaction (HRI) and human-computer interaction (HCI). Emotion recognition through speech or through facial expression is termed single-mode emotion recognition. The accuracy of these single-mode emotion recognition approaches is improved by the proposed bimodal method, which combines the modalities of speech and facial expression and recognizes emotions using a Convolutional Neural Network (CNN) model. In this paper, the proposed bimodal emotion recognition system contains three major parts: processing of audio, processing of video, and fusion of data for detecting the emotion of a person. The fusion of visual information and audio data obtained from two different channels enhances the emotion recognition rate by providing complementary data. The proposed method aims to classify 7 basic emotions (anger, disgust, fear, happy, neutral, sad, surprise) from an input video. We take audio and image frames from the video input to predict the final emotion of a person. The dataset used is an audio-visual dataset uniquely suited for the study of multi-modal emotion expression and perception. The dataset used here is the RAVDESS dataset, which contains an audio-visual dataset, a visual dataset and an audio dataset. For bimodal emotion detection, the audio-visual dataset is used.
APA, Harvard, Vancouver, ISO, and other styles
20

CAO, JIANGTAO, NAOYUKI KUBOTA, PING LI, and HONGHAI LIU. "THE VISUAL-AUDIO INTEGRATED RECOGNITION METHOD FOR USER AUTHENTICATION SYSTEM OF PARTNER ROBOTS." International Journal of Humanoid Robotics 08, no. 04 (December 2011): 691–705. http://dx.doi.org/10.1142/s0219843611002678.

Full text
Abstract:
Some noncontact biometric methods have been used for user authentication systems of partner robots, such as visual-based recognition methods and speech recognition. However, visual-based recognition methods are sensitive to light noise, and speech recognition systems are perturbed by the acoustic environment and sound noise. Inspired by the human capability of compensating for visual information (looking) with audio information (hearing), a visual-audio integration method is proposed to deal with the disturbance of light noise and to improve recognition accuracy. Combined with PCA-based and 2DPCA-based face recognition, a two-stage speaker recognition algorithm is used to extract useful personal identity information from speech signals. Using the statistical properties of the visual background noise, the visual-audio integration method is performed to draw the final decision. The proposed method is evaluated on a public visual-audio dataset, VidTIMIT, and a partner robot authentication system. The results verified that the visual-audio integration method can obtain satisfactory recognition results with strong robustness.
APA, Harvard, Vancouver, ISO, and other styles
21

Stewart, Darryl, Rowan Seymour, Adrian Pass, and Ji Ming. "Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions." IEEE Transactions on Cybernetics 44, no. 2 (February 2014): 175–84. http://dx.doi.org/10.1109/tcyb.2013.2250954.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Gornostal, Alexandr, and Yaroslaw Dorogyy. "Development of audio-visual speech recognition system." ScienceRise 12, no. 1 (December 30, 2017): 42–47. http://dx.doi.org/10.15587/2313-8416.2017.118212.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Mishra, Saumya, Anup Kumar Gupta, and Puneet Gupta. "DARE: Deceiving Audio–Visual speech Recognition model." Knowledge-Based Systems 232 (November 2021): 107503. http://dx.doi.org/10.1016/j.knosys.2021.107503.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Hasegawa-Johnson, Mark A., Jui-Ting Huang, Sarah King, and Xi Zhou. "Normalized recognition of speech and audio events." Journal of the Acoustical Society of America 130, no. 4 (October 2011): 2524. http://dx.doi.org/10.1121/1.3655075.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Zick, Gregory L., and Lawrence Yapp. "Speech recognition of MPEG/audio encoded files." Journal of the Acoustical Society of America 112, no. 6 (2002): 2520. http://dx.doi.org/10.1121/1.1536509.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Noda, Kuniaki, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata. "Audio-visual speech recognition using deep learning." Applied Intelligence 42, no. 4 (December 20, 2014): 722–37. http://dx.doi.org/10.1007/s10489-014-0629-7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Upadhyaya, Prashant, Omar Farooq, M. R. Abidi, and Priyanka Varshney. "Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition." Archives of Acoustics 40, no. 4 (December 1, 2015): 609–19. http://dx.doi.org/10.1515/aoa-2015-0061.

Full text
Abstract:
In building speech recognition based applications, robustness to different noisy background conditions is an important challenge. In this paper, a bimodal approach is proposed to improve the robustness of a Hindi speech recognition system. The importance of different types of visual features is also studied for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the Two-Dimensional Discrete Cosine Transform (2D-DCT), Principal Component Analysis (PCA), Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT) and Two-Dimensional Discrete Wavelet Transform followed by PCA (2D-DWT-PCA). The audio features are extracted using Mel Frequency Cepstral Coefficients (MFCC) followed by static and dynamic features. Overall, 48 features, i.e. 39 audio features and 9 visual features, are used for measuring the performance of the AVASR system. The performance of the AVASR system on noisy speech signals generated using the NOISEX database is also evaluated for different signal-to-noise ratios (SNR: 30 dB to −10 dB) using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality continuous audio-visual database of Hindi sentences spoken by different subjects.
APA, Harvard, Vancouver, ISO, and other styles
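The 39 audio features mentioned in the entry above are commonly 13 static MFCCs plus their first- and second-order (delta and delta-delta) dynamic features. The sketch below reproduces that arrangement with librosa on a synthetic tone; treating 13+13+13 as the exact split used by the authors is an assumption.

```python
import numpy as np
import librosa

y = librosa.tone(440, sr=22050, duration=1.0)          # stand-in for a speech signal
mfcc = librosa.feature.mfcc(y=y, sr=22050, n_mfcc=13)  # 13 static coefficients
delta = librosa.feature.delta(mfcc)                    # first-order (velocity) features
delta2 = librosa.feature.delta(mfcc, order=2)          # second-order (acceleration) features
features = np.vstack([mfcc, delta, delta2])            # 39 coefficients per frame
print(features.shape)                                  # (39, n_frames)
```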
28

Salian, Beenaa, Omkar Narvade, Rujuta Tambewagh, and Smita Bharne. "Speech Emotion Recognition using Time Distributed CNN and LSTM." ITM Web of Conferences 40 (2021): 03006. http://dx.doi.org/10.1051/itmconf/20214003006.

Full text
Abstract:
Speech has several distinguishing characteristic features, which have remained a state-of-the-art tool for extracting valuable information from audio samples. Our aim is to develop an emotion recognition system using these speech features that would be able to accurately and efficiently recognize emotions through audio analysis. In this article, we have employed a hybrid neural network comprising four blocks of time-distributed convolutional layers followed by a Long Short-Term Memory layer to achieve this. The audio samples for the speech dataset are collectively assembled from the RAVDESS, TESS and SAVEE audio datasets and are further augmented by injecting noise. Mel spectrograms are computed from the audio samples and are used to train the neural network. We have been able to achieve a testing accuracy of about 89.26%.
APA, Harvard, Vancouver, ISO, and other styles
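Noise injection and mel-spectrogram inputs, as described in the entry above, can be prototyped in a few lines. The snippet below adds white noise at a chosen SNR and computes a log-mel spectrogram with librosa; the SNR value, noise type and spectrogram settings are assumptions for illustration.

```python
import numpy as np
import librosa

def add_noise(y, snr_db=20.0, seed=0):
    # Augment a waveform with white noise at a chosen signal-to-noise ratio.
    rng = np.random.default_rng(seed)
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

y = librosa.tone(440, sr=22050, duration=1.0)                 # stand-in for a speech clip
mel = librosa.feature.melspectrogram(y=add_noise(y), sr=22050)
log_mel = librosa.power_to_db(mel)                            # network input, shape (128, n_frames)
```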
29

Muhammad, Ghulam, and Khalid Alghathbar. "Environment Recognition for Digital Audio Forensics Using MPEG-7 and MEL Cepstral Features." Journal of Electrical Engineering 62, no. 4 (July 1, 2011): 199–205. http://dx.doi.org/10.2478/v10187-011-0032-0.

Full text
Abstract:
Environment recognition from digital audio for forensics applications is a growing area of interest. However, compared to other branches of audio forensics, it is less researched. In particular, less attention has been given to detecting the environment from files where foreground speech is present, which is a typical forensics scenario. In this paper, we perform several experiments focusing on the problems of environment recognition from audio, particularly for forensics applications. Experimental results show that the task is easier when audio files contain only environmental sound than when they contain both foreground speech and background environment. We propose a full set of MPEG-7 audio features combined with mel frequency cepstral coefficients (MFCCs) to improve the accuracy. In the experiments, the proposed approach significantly increases the recognition accuracy of environmental sound even in the presence of a high amount of foreground human speech.
APA, Harvard, Vancouver, ISO, and other styles
30

Wolfe, Jace, and Erin C. Schafer. "Optimizing The Benefit of Sound Processors Coupled to Personal FM Systems." Journal of the American Academy of Audiology 19, no. 08 (September 2008): 585–94. http://dx.doi.org/10.3766/jaaa.19.8.2.

Full text
Abstract:
Background: Use of personal frequency modulated (FM) systems significantly improves speech recognition in noise for users of cochlear implants (CI). There are, however, a number of adjustable parameters of the cochlear implant and FM receiver that may affect performance and benefit, and there is limited evidence to guide audiologists in optimizing these parameters. Purpose: This study examined the effect of two sound processor audio-mixing ratios (30/70 and 50/50) on speech recognition and functional benefit for adults with CIs using the Advanced Bionics Auria® sound processors. Research Design: Fully-repeated repeated measures experimental design. Each subject participated in every speech-recognition condition in the study, and qualitative data was collected with subject questionnaires. Study Sample: Twelve adults using Advanced Bionics Auria sound processors. Participants had greater than 20% correct speech recognition on consonant-nucleus-consonant (CNC) monosyllabic words in quiet and had used their CIs for at least six months. Intervention: Performance was assessed at two audio-mixing ratios (30/70 and 50/50). For the 50/50 mixing ratio, equal emphasis is placed on the signals from the sound processor and the FM system. For the 30/70 mixing ratio, the signal from the microphone of the sound processor is attenuated by 10 dB. Data Collection and Analysis: Speech recognition was assessed at two audio-mixing ratios (30/70 and 50/50) in quiet (35 and 50 dB HL) and in noise (+5 signal-to-noise ratio) with and without the personal FM system. After two weeks of using each audio-mixing ratio, the participants completed subjective questionnaires. Results: Study results suggested that use of a personal FM system resulted in significant improvements in speech recognition in quiet at low-presentation levels, speech recognition in noise, and perceived benefit in noise. Use of the 30/70 mixing ratio resulted in significantly poorer speech recognition for low-level speech that was not directed to the FM transmitter. There was no significant difference in speech recognition in noise or functional benefit between the two audio-mixing ratios. Conclusions: Use of a 50/50 audio-mixing ratio is recommended for optimal performance with an FM system in quiet and noisy listening situations.
APA, Harvard, Vancouver, ISO, and other styles
31

Saitoh, Takeshi. "Research on multi-modal silent speech recognition technology." Impact 2018, no. 3 (June 15, 2018): 47–49. http://dx.doi.org/10.21820/23987073.2018.3.47.

Full text
Abstract:
We are all familiar with audio speech recognition technology for interfacing with smartphones and in-car computers. However, technology that can interpret our speech signals without audio is a far greater challenge for scientists. Audio speech recognition (ASR) can only work in situations where there is little or no background noise and where speech is clearly enunciated. Other technologies that use visual signals to lip-read, or that use lip-reading in conjunction with degraded audio input are under development. However, in the situations where a person cannot speak or where the person's face may not be fully visible, silent speech recognition, which uses muscle movements or brain signals to decode speech, is also under development. Associate Professor Takeshi Saitoh's laboratory at the Kyushu Institute of Technology is at the forefront of visual speech recognition (VSR) and is collaborating with researchers worldwide to develop a range of silent speech recognition technologies. Saitoh, whose small team of researchers and students are being supported by the Japan Society for the Promotion of Science (JSPS), says: 'The aim of our work is to achieve smooth and free communication in real time, without the need for audible speech.' The laboratory's VSR prototype is already performing at a high level. There are many reasons why scientists are working on speech technology that does not rely on audio. Saitoh points out that: 'With an ageing population, more people will suffer from speech or hearing disabilities and would benefit from a means to communicate freely. This would vastly improve their quality of life and create employment opportunities.' Also, intelligent machines, controlled by human-machine interfaces, are expected to become increasingly common in our lives. Non-audio speech recognition technology will be useful for interacting with smartphones, driverless cars, surveillance systems and smart appliances. VSR uses a modified camera, combined with image processing and pattern recognition to convert moving shapes made by the mouth, into meaningful language. Earlier VSR technologies matched the shape of a still mouth with vowel sounds, and others have correlated mouth shapes with a key input. However, these do not provide audio output in real-time, so cannot facilitate a smooth conversation. Also, it is vital that VSR is both easy to use and applicable to a range of situations, such as people bedridden in a supine position, where there is a degree of camera movement or where a face is being viewed in profile rather than full-frontal. Any reliable system should also be user-dependent, such that it will work on any skin colour and any shape of face and in spite of head movement.
APA, Harvard, Vancouver, ISO, and other styles
32

Yang, Wenfeng, Pengyi Li, Wei Yang, Yuxing Liu, Yulong He, Ovanes Petrosian, and Aleksandr Davydenko. "Research on Robust Audio-Visual Speech Recognition Algorithms." Mathematics 11, no. 7 (April 5, 2023): 1733. http://dx.doi.org/10.3390/math11071733.

Full text
Abstract:
Automatic speech recognition (ASR) that relies on audio input suffers from significant degradation in noisy conditions and is particularly vulnerable to speech interference. However, video recordings of speech capture both visual and audio signals, providing a potent source of information for training speech models. Audiovisual speech recognition (AVSR) systems enhance the robustness of ASR by incorporating visual information from lip movements and associated sound production in addition to the auditory input. There are many audiovisual speech recognition models and systems for speech transcription, but most of them have been tested in a single experimental setting and with a limited dataset; however, a good model should be applicable to any scenario. Our main contributions are: (i) Reproducing the three best-performing audiovisual speech recognition models in the current AVSR research area using the most famous audiovisual databases, LRS2 (Lip Reading Sentences 2) and LRS3 (Lip Reading Sentences 3), and comparing and analyzing their performance under various noise conditions. (ii) Based on our experimental and research experience, we analyze the problems currently encountered in the AVSR domain, which are summarized as the feature-extraction problem and the domain-generalization problem. (iii) According to the experimental results, the Moco (momentum contrast) + word2vec (word to vector) model has the best AVSR effect on the LRS datasets regardless of whether there is noise or not. Additionally, the model also produced the best experimental results in the audio-recognition and video-recognition experiments. Our research lays the foundation for further improving the performance of AVSR models.
APA, Harvard, Vancouver, ISO, and other styles
33

Gavali, A. B., Ghugarkar Pooja S., Khatake Supriya R., and Kothawale Rajnandini A. "Visual Speech Recognition Using Lips Movement." Journal of Signal Processing 9, no. 2 (May 29, 2023): 1–7. http://dx.doi.org/10.46610/josp.2023.v09i02.001.

Full text
Abstract:
Visual speech recognition technology seeks to improve recognition in noisy mobile contexts by extracting lip movement from input face images. Earlier bimodal speech recognition algorithms depended on frontal face (lip) images, but because users must speak while holding a device in front of their face, these techniques are difficult for users to utilize. Our suggested method records lip movement using a tiny camera built into a smartphone, making it more practical, simple, and natural. Additionally, this technique effectively prevents a decline in the input signal-to-noise ratio (SNR). For CNN-based recognition, visual properties are extracted via optical-flow analysis and merged with audio data. The previous model, however, does not give its output in audio format; in our proposed model, we can provide the output in audio format. Our proposal aims to record a user speaking into the camera, or the user can upload a video. The system will initially detect only the lip area from the video and divide this lip video into multiple frames. After sequencing the lip frames, feature extraction will be done on the lip frames. The model will be trained to extract these features, and the extracted features from the trained model will then be used to find the sequence of phoneme distributions. The final output will be the word or phrase spoken by the user, displayed on the screen.
APA, Harvard, Vancouver, ISO, and other styles
34

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (February 7, 2023): 1834. http://dx.doi.org/10.3390/s23041834.

Full text
Abstract:
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating data-augmentation techniques to generate more data samples for building the classification models. For the data-augmentation techniques, we utilized a combination of conventional approaches (e.g., flips and rotations), as well as newer approaches, such as generative adversarial networks (GANs). To validate the approaches, we used augmented data from well-known datasets (LRS2—Lip Reading Sentences 2 and LRS3) in the training process and testing was performed using the original data. The study and experimental results indicated that the proposed AVSR model and framework, combined with the augmentation approach, enhanced the performance of the AVSR framework in the wild for noisy datasets. Furthermore, in this study, we discuss the domains of automatic speech recognition (ASR) architectures and audio-visual speech recognition (AVSR) architectures and give a concise summary of the AVSR models that have been proposed.
APA, Harvard, Vancouver, ISO, and other styles
35

Kozma-Spytek, Linda, and Christian Vogler. "Factors Affecting the Accessibility of Voice Telephony for People with Hearing Loss: Audio Encoding, Network Impairments, Video and Environmental Noise." ACM Transactions on Accessible Computing 14, no. 4 (December 31, 2021): 1–35. http://dx.doi.org/10.1145/3479160.

Full text
Abstract:
This paper describes four studies with a total of 114 individuals with hearing loss and 12 hearing controls that investigate the impact of audio quality parameters on voice telecommunications. These studies were first informed by a survey of 439 individuals with hearing loss on their voice telecommunications experiences. While voice telephony was very important, with high usage of wireless mobile phones, respondents reported relatively low satisfaction with their hearing devices’ performance for telephone listening, noting that improved telephone audio quality was a significant need. The studies cover three categories of audio quality parameters: (1) narrowband (NB) versus wideband (WB) audio; (2) encoding audio at varying bit rates, from typical rates used in today's mobile networks to the highest quality supported by these audio codecs; and (3) absence of packet loss to worst-case packet loss in both mobile and VoIP networks. Additionally, NB versus WB audio was tested in auditory-only and audiovisual presentation modes and in quiet and noisy environments. With WB audio in a quiet environment, individuals with hearing loss exhibited better speech recognition, expended less perceived mental effort, and rated speech quality higher than with NB audio. WB audio provided a greater benefit when listening alone than when the visual channel also was available. The noisy environment significantly degraded performance for both presentation modes, but particularly for listening alone. Bit rate affected speech recognition for NB audio, and speech quality ratings for both NB and WB audio. Packet loss affected all of speech recognition, mental effort, and speech quality ratings. WB versus NB audio also affected hearing individuals, especially under packet loss. These results are discussed in terms of the practical steps they suggest for the implementation of telecommunications systems and related technical standards and policy considerations to improve the accessibility of voice telephony for people with hearing loss.
APA, Harvard, Vancouver, ISO, and other styles
36

Auti, Dr Nisha, Atharva Pujari, Anagha Desai, Shreya Patil, Sanika Kshirsagar, and Rutika Rindhe. "Advanced Audio Signal Processing for Speaker Recognition and Sentiment Analysis." International Journal for Research in Applied Science and Engineering Technology 11, no. 5 (May 31, 2023): 1717–24. http://dx.doi.org/10.22214/ijraset.2023.51825.

Full text
Abstract:
Abstract: Automatic Speech Recognition (ASR) technology has revolutionized human-computer interaction by allowing users to communicate with computer interfaces using their voice in a natural way. Speaker recognition is a biometric recognition method that identifies individuals based on their unique speech signal, with potential applications in security, communication, and personalization. Sentiment analysis is a statistical method that analyzes unique acoustic properties of the speaker's voice to identify emotions or sentiments in speech. This allows for automated speech recognition systems to accurately categorize speech as Positive, Neutral, or Negative. While sentiment analysis has been developed for various languages, further research is required for regional languages. This project aims to improve the accuracy of automatic speech recognition systems by implementing advanced audio signal processing and sentiment analysis detection. The proposed system will identify the speaker's voice and analyze the audio signal to detect the context of speech, including the identification of foul language and aggressive speech. The system will be developed for the Marathi Language dataset, with potential for further development in other languages.
APA, Harvard, Vancouver, ISO, and other styles
37

Yin, Bing, Shutong Niu, Haitao Tang, Lei Sun, Jun Du, Zhenhua Ling, and Cong Liu. "An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario." Applied Sciences 13, no. 7 (March 23, 2023): 4100. http://dx.doi.org/10.3390/app13074100.

Full text
Abstract:
Robust speech recognition in real world situations is still an important problem, especially when it is affected by environmental interference factors and conversational multi-speaker interactions. Supplementing audio information with other modalities, such as audio–visual speech recognition (AVSR), is a promising direction for improving speech recognition. The end-to-end (E2E) framework can learn information between multiple modalities well; however, the model is not easy to train, especially when the amount of data is relatively small. In this paper, we focus on building an encoder–decoder-based end-to-end audio–visual speech recognition system for use under realistic scenarios. First, we discuss different pre-training methods which provide various kinds of initialization for the AVSR framework. Second, we explore different model architectures and audio–visual fusion methods. Finally, we evaluate the performance on the corpus from the first Multi-modal Information based Speech Processing (MISP) challenge, which is recorded in a real home television (TV) room. By system fusion, our final system achieves a 23.98% character error rate (CER), which is better than the champion system of the first MISP challenge (CER = 25.07%).
APA, Harvard, Vancouver, ISO, and other styles
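The character error rate (CER) reported in the entry above is the character-level Levenshtein distance between hypothesis and reference, divided by the reference length. A small, standard implementation is sketched below; it is the textbook definition, not the challenge's official scoring code.

```python
def cer(reference: str, hypothesis: str) -> float:
    # Character error rate = edit distance / number of reference characters.
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(round(cer("audio visual speech", "audio visal speech"), 3))  # one deletion -> ~0.053
```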
38

Ong, Kah Liang, Chin Poo Lee, Heng Siong Lim, and Kian Ming Lim. "Speech emotion recognition with light gradient boosting decision trees machine." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 4 (August 1, 2023): 4020. http://dx.doi.org/10.11591/ijece.v13i4.pp4020-4028.

Full text
Abstract:
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the emo-DB dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
APA, Harvard, Vancouver, ISO, and other styles
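Class weights inversely proportional to the sample distribution, as used in the entry above, can be computed directly from the label counts. The sketch below shows one common formulation and how the resulting per-sample weights could be passed to a gradient-boosting classifier; the exact weighting formula used by the authors is an assumption.

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(labels):
    # Weight each class by n_samples / (n_classes * class_count),
    # so minority emotions contribute more to the training loss.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

y = ["angry"] * 50 + ["happy"] * 200 + ["sad"] * 100
weights = inverse_frequency_weights(y)                     # {'angry': 2.33, 'happy': 0.58, 'sad': 1.17}
sample_weight = np.array([weights[label] for label in y])  # e.g. passed to a booster's fit(..., sample_weight=...)
```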
39

A, Prof Swethashree. "Speech Emotion Recognition." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (August 31, 2021): 2637–40. http://dx.doi.org/10.22214/ijraset.2021.37375.

Full text
Abstract:
Speech Emotion Recognition, abbreviated as SER, is the act of trying to identify a person's feelings and affective state from speech. This is possible because the voice often reflects underlying emotion through tone and pitch. Emotion recognition is a fast-growing field of research in recent years. Unlike humans, machines do not have the power to comprehend and express emotions, but human communication with the computer can be improved by using automatic emotion recognition, thereby reducing the need for human intervention. In this project, basic emotions such as calm, happiness, fear, disgust, etc. are analyzed from signs of emotional expression. We use machine learning techniques such as the Multilayer Perceptron Classifier (MLP Classifier), which is used to separate the provided data into distinct classes. Mel-frequency cepstral coefficients (MFCC), chroma and mel features are extracted from speech signals and used to train the MLP classifier. To accomplish this, we use Python libraries such as Librosa, sklearn, PyAudio, NumPy and audio file handling to analyze speech patterns and recognize the emotion. Keywords: speech emotion recognition, mel cepstral coefficient, artificial neural network, multilayer perceptron, MLP classifier, Python.
APA, Harvard, Vancouver, ISO, and other styles
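The entry above names librosa and scikit-learn explicitly, so the feature set it describes (MFCC, chroma and mel, averaged over time) can be sketched as follows; the feature sizes and MLP settings are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(y, sr):
    # Time-averaged MFCC, chroma and mel features, stacked into one vector.
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([mfcc, chroma, mel])   # 40 + 12 + 128 = 180 values

# clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500)
# clf.fit(X_train, y_train)                 # X_train built with extract_features
```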
40

Yu, Wentao, Steffen Zeiler, and Dorothea Kolossa. "Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition." Sensors 22, no. 15 (July 23, 2022): 5501. http://dx.doi.org/10.3390/s22155501.

Full text
Abstract:
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture—the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
APA, Harvard, Vancouver, ISO, and other styles
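The 51% figure in the entry above is a relative word-error-rate reduction. The short example below shows how such a number is computed; the baseline and new WER values are made up purely to illustrate the arithmetic and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    # Relative reduction = (old - new) / old.
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: dropping from 20.0% to 9.8% WER is a 51% relative reduction.
print(f"{relative_wer_reduction(0.200, 0.098):.0%}")
```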
41

Wang, Junyi, Bingyao Li, and Jiahong Zhang. "Use Brain-Like Audio Features to Improve Speech Recognition Performance." Journal of Sensors 2022 (September 19, 2022): 1–12. http://dx.doi.org/10.1155/2022/6742474.

Full text
Abstract:
Speech recognition plays an important role in the field of human-computer interaction through the use of acoustic sensors, but speech recognition is technically difficult, has complex overall logic, relies heavily on neural network algorithms, and has extremely high technical requirements. In speech recognition, feature extraction is the first step, recovering and extracting speech features. Existing methods, such as Mel-frequency cepstral coefficients (MFCC) and spectrograms, lose a large amount of acoustic information and lack biological interpretability. Likewise, existing speech self-supervised representation learning methods based on contrastive prediction need to construct a large number of negative samples during training, and their learning effect depends on large training batches, which requires a large amount of computational resources. Therefore, in this paper, we propose a new feature extraction method, called SHH (spike-H), that resembles the human brain and achieves higher speech recognition rates than previous methods. The features extracted using the proposed model are subsequently fed into the classification model. We propose a novel parallel CRNN model with an attention mechanism that considers both temporal and spatial features. Experimental results show that the proposed CRNN achieves an accuracy of 94.8% on the Aurora dataset. In addition, audio similarity experiments show that SHH can better distinguish audio features. The ablation experiments also show that SHH is applicable to digital speech recognition.
APA, Harvard, Vancouver, ISO, and other styles
42

Seong, Thum Wei, M. Z. Ibrahim, and D. J. Mulvaney. "WADA-W: A Modified WADA SNR Estimator for Audio-Visual Speech Recognition." International Journal of Machine Learning and Computing 9, no. 4 (August 2019): 446–51. http://dx.doi.org/10.18178/ijmlc.2019.9.4.824.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Indira, D. N. V. S. L. S., et al. "An Enhanced CNN-2D for Audio-Visual Emotion Recognition (AVER) Using ADAM Optimizer." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 5 (April 11, 2021): 1378–88. http://dx.doi.org/10.17762/turcomat.v12i5.2030.

Full text
Abstract:
The importance of integrating visual components into the speech recognition process for improving robustness has been identified by recent developments in audio-visual emotion recognition (AVER). Visual characteristics have a strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. CNNs work very well with images, and an audio file can be converted into an image such as a spectrogram, with good frequency resolution, to extract hidden knowledge. This paper provides a method for emotional expression recognition using spectrograms and a CNN-2D. Spectrograms formed from the speech signals are the CNN-2D input. The proposed model, which consists of three types of CNN layers (convolution layers, pooling layers and fully connected layers), extracts discriminative characteristics from the spectrogram representations and estimates performance for the seven emotions. This article compares the output with existing SER using audio files and a CNN. The accuracy is improved by 6.5% when the CNN-2D is used.
APA, Harvard, Vancouver, ISO, and other styles
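Converting an audio file into a spectrogram image for a 2D CNN, as described in the entry above, is straightforward with librosa and matplotlib. The snippet below is a minimal sketch on a synthetic tone; the STFT settings, colormap and file name are illustrative assumptions.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

y = librosa.tone(440, sr=22050, duration=1.0)                         # stand-in for a speech clip
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
plt.imsave("spectrogram.png", spec_db, origin="lower", cmap="magma")  # image input for the CNN-2D
```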
44

Tiwari, Rishin, Saloni Birthare, and Mr Mayank Lovanshi. "Audio to Sign Language Converter." International Journal for Research in Applied Science and Engineering Technology 10, no. 11 (November 30, 2022): 206–11. http://dx.doi.org/10.22214/ijraset.2022.47271.

Full text
Abstract:
People with hearing and speech disabilities have a communication problem with other people. It is hard for such individuals to express themselves, since not everyone is familiar with sign language. The aim of this paper is to design a system that is helpful for people with hearing/speech disabilities and converts voice into Indian Sign Language (ISL). The task of learning a sign language can be cumbersome for people, so this paper proposes a solution to this problem using speech recognition and image processing. Sign languages have developed as a means of easy communication, primarily for deaf and hard-of-hearing people. In this work we propose a real-time system that recognizes voice input through PyAudio, SPHINX and the Google speech recognition API and converts it into text, followed by a sign language rendering of the text, which is displayed on the screen of the machine as a series of images or motion video with the help of various Python libraries.
APA, Harvard, Vancouver, ISO, and other styles
45

Axyonov, A. A., D. V. Ivanko, I. B. Lashkov, D. A. Ryumin, A. M. Kashevnik, and A. A. Karpov. "A methodology of multimodal corpus creation for audio-visual speech recognition in assistive transport systems." Informatization and communication 5 (December 2020): 87–93. http://dx.doi.org/10.34219/2078-8320-2020-11-5-87-93.

Full text
Abstract:
This paper introduces a new methodology for creating a multimodal corpus for audio-visual speech recognition in driver monitoring systems. Multimodal speech recognition makes it possible to rely on audio data when video data are useless (e.g. at night) and on video data in acoustically noisy conditions (e.g. on highways). The article discusses several basic scenarios in which speech recognition in the vehicle environment is required to interact with the driver monitoring system. The methodology defines the main stages of and requirements for building a multimodal corpus, and describes the meta-parameters that the corpus must satisfy. In addition, a software package for recording an audio-visual speech corpus is described.
APA, Harvard, Vancouver, ISO, and other styles
46

Ivanko, Denis, Dmitry Ryumin, and Alexey Karpov. "A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition." Mathematics 11, no. 12 (June 12, 2023): 2665. http://dx.doi.org/10.3390/math11122665.

Full text
Abstract:
This article provides a detailed review of recent advances in audio-visual speech recognition (AVSR) methods that have been developed over the last decade (2013–2023). Despite the recent success of audio speech recognition systems, the problem of audio-visual (AV) speech decoding remains challenging. In comparison to the previous surveys, we mainly focus on the important progress brought with the introduction of deep learning (DL) to the field and skip the description of long-known traditional “hand-crafted” methods. In addition, we also discuss the recent application of DL toward AV speech fusion and recognition. We first discuss the main AV datasets used in the literature for AVSR experiments since we consider it a data-driven machine learning (ML) task. We then consider the methodology used for visual speech recognition (VSR). Subsequently, we also consider recent AV methodology advances. We then separately discuss the evolution of the core AVSR methods, pre-processing and augmentation techniques, and modality fusion strategies. We conclude the article with a discussion on the current state of AVSR and provide our vision for future research.
APA, Harvard, Vancouver, ISO, and other styles
47

Wu, Xuan, Silong Zhou, Mingwei Chen, Yihang Zhao, Yifei Wang, Xianmeng Zhao, Danyang Li, and Haibo Pu. "Combined spectral and speech features for pig speech recognition." PLOS ONE 17, no. 12 (December 1, 2022): e0276778. http://dx.doi.org/10.1371/journal.pone.0276778.

Full text
Abstract:
The sounds a pig makes are among its important signals: they can reflect states such as hunger, pain or emotion, and directly indicate the animal's growth and health status. Existing recognition methods usually start from spectral features, and although classifying different sounds from spectrograms works well, a single-dimensional feature input may not be the best approach to such tasks. Based on this assumption, and in order to assess the condition of pigs more accurately and take timely measures to safeguard their health, this paper proposes a pig sound classification method that exploits both the signal spectrum and the raw audio. Spectrograms visualize the characteristics of a sound over different time periods; here, the spectrogram features and the audio time-domain features complement each other and are passed into a pre-designed parallel network structure. The best-performing network model was then combined with a classifier. The method achieves an accuracy of 93.39% on the pig sound classification task, with an AUC of 0.99163, demonstrating its superiority. This study contributes to computer vision and acoustics by recognizing pig sounds. In addition, a dataset of 4,000 pig sounds in four categories is established to provide a research basis for later scholars.
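An illustrative Keras sketch of the kind of parallel, two-branch structure the abstract outlines: one branch consumes the spectrogram, the other the time-domain waveform, and their features are concatenated before classification into the four sound categories. All layer sizes and input shapes are assumptions rather than the authors' exact design:

import tensorflow as tf

spec_in = tf.keras.Input(shape=(128, 128, 1), name="spectrogram")
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(spec_in)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

wave_in = tf.keras.Input(shape=(16000, 1), name="waveform")    # 1 s of audio at 16 kHz
y = tf.keras.layers.Conv1D(32, 9, strides=4, activation="relu")(wave_in)
y = tf.keras.layers.Conv1D(64, 9, strides=4, activation="relu")(y)
y = tf.keras.layers.GlobalAveragePooling1D()(y)

merged = tf.keras.layers.Concatenate()([x, y])
out = tf.keras.layers.Dense(4, activation="softmax")(merged)   # four sound categories

model = tf.keras.Model(inputs=[spec_in, wave_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])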
APA, Harvard, Vancouver, ISO, and other styles
48

Reddy, P. Deepak. "Multilingual Speech to Text using Deep Learning based on MFCC Features." Machine Learning and Applications: An International Journal 9, no. 02 (June 30, 2022): 21–30. http://dx.doi.org/10.5121/mlaij.2022.9202.

Full text
Abstract:
The methodology presented in this paper addresses the problem of multilingual speech recognition. Current speech recognition and translation methods have very low accuracy on sentences that contain a mixture of two or more languages. The paper proposes a novel approach to this problem and highlights some drawbacks of current recognition and translation methods. The approach deals with recognition of audio queries that contain a mixture of words in two languages, Kannada and English. The novelty of the approach is the use of a next-word prediction model in combination with a deep learning speech recognition model to accurately recognise the input audio query and convert it to text. A second proposed method for multilingual speech recognition and translation is the use of cosine similarity between the audio features of words for fast and accurate recognition. The dataset used for training and testing was generated manually by the authors, as no pre-existing audio and text dataset contained sentences mixing Kannada and English. The deep learning speech recognition model combined with the word prediction model achieves an accuracy of 71% on the in-house multilingual dataset, outperforming other existing translation and recognition solutions on the same test set. Multilingual recognition and translation is an important problem because people tend to speak in a mixture of languages; solving it can lift the barrier of language and help people connect more comfortably with each other.
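A minimal sketch of the cosine-similarity idea mentioned in the abstract: a spoken word is compared against reference words by the cosine similarity of their averaged MFCC vectors. The file names and the frame-averaging strategy are illustrative assumptions, not the authors' exact pipeline:

import librosa
import numpy as np

def word_embedding(path, sr=16000, n_mfcc=13):
    # Average the MFCC frames into one fixed-length vector per word
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = word_embedding("query_word.wav")
references = {"namaskara": word_embedding("ref_namaskara.wav"),
              "hello": word_embedding("ref_hello.wav")}

best = max(references, key=lambda w: cosine_similarity(query, references[w]))
print("Closest reference word:", best)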
APA, Harvard, Vancouver, ISO, and other styles
49

Aiman, Aisha, Yao Shen, Malika Bendechache, Irum Inayat, and Teerath Kumar. "AUDD: Audio Urdu Digits Dataset for Automatic Audio Urdu Digit Recognition." Applied Sciences 11, no. 19 (September 23, 2021): 8842. http://dx.doi.org/10.3390/app11198842.

Full text
Abstract:
The ongoing development of audio datasets for numerous languages has spurred research into designing smart speech recognition systems. A typical speech recognition system can be applied in many emerging applications, such as smartphone dialing, airline reservations, and automatic wheelchairs, among others. Urdu is the national language of Pakistan and is also widely spoken in many other South Asian countries (e.g., India, Afghanistan). We therefore present a comprehensive dataset of spoken Urdu digits from 0 to 9. The dataset contains 25,518 sound samples collected from 740 participants. To test the proposed dataset, we apply several existing classification algorithms as baselines, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), and variants of EfficientNet. Furthermore, we propose a convolutional neural network (CNN) for audio digit classification. The experimental results show that the proposed CNN is efficient and outperforms the baseline algorithms in terms of classification accuracy.
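A minimal scikit-learn sketch of the SVM and MLP baselines listed in the abstract, applied to pre-extracted acoustic feature vectors of the spoken digits; the feature files and their names are assumptions about the setup, not artifacts shipped with AUDD:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# X: (n_samples, n_features) acoustic feature vectors, y: digit labels 0..9
X = np.load("audd_features.npy")      # assumed pre-extracted features
y = np.load("audd_labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=300).fit(X_tr, y_tr)

print("SVM accuracy:", accuracy_score(y_te, svm.predict(X_te)))
print("MLP accuracy:", accuracy_score(y_te, mlp.predict(X_te)))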
APA, Harvard, Vancouver, ISO, and other styles
50

HASHIMOTO, Masahiro, and Masaharu KUMASHIRO. "Intermodal Timing Cues for Audio-Visual Speech Recognition." Journal of UOEH 26, no. 2 (2004): 215–25. http://dx.doi.org/10.7888/juoeh.26.215.

Full text
APA, Harvard, Vancouver, ISO, and other styles