Academic literature on the topic 'Audio speech recognition'

Create an accurate reference in APA, MLA, Chicago, Harvard, and other citation styles.

Consult the lists of relevant articles, books, theses, conference papers, and other scholarly sources on the topic 'Audio speech recognition.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Audio speech recognition"

1. Beadles, Robert L. "Audio visual speech recognition." Journal of the Acoustical Society of America 87, no. 5 (May 1990): 2274. http://dx.doi.org/10.1121/1.399137.

2. Bahal, Akriti. "Advances in Automatic Speech Recognition: From Audio-Only to Audio-Visual Speech Recognition." IOSR Journal of Computer Engineering 5, no. 1 (2012): 31–36. http://dx.doi.org/10.9790/0661-0513136.

3. Hwang, Jung-Wook, Jeongkyun Park, Rae-Hong Park, and Hyung-Min Park. "Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition." Applied Acoustics 211 (August 2023): 109478. http://dx.doi.org/10.1016/j.apacoust.2023.109478.
4. Nakadai, Kazuhiro, and Tomoaki Koiwa. "Psychologically-Inspired Audio-Visual Speech Recognition Using Coarse Speech Recognition and Missing Feature Theory." Journal of Robotics and Mechatronics 29, no. 1 (February 20, 2017): 105–13. http://dx.doi.org/10.20965/jrm.2017.p0105.
Abstract:
[Figure: System architecture of AVSR based on missing feature theory and P-V grouping]
Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. For AVSR, the auditory and visual units are the phoneme and viseme, respectively. However, these are often misclassified in the real world because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT) to cope with missing or unreliable audio and visual features for recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that these two approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate weight improves the recognition performance by −5 dB. This is the case even in a noisy environment, in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved the AVSR performance, particularly at a low signal-to-noise ratio. This work is an extension of the publication "Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp. 1751–1756, 2007."
5. Basystiuk, Oleh, and Nataliia Melnykova. "Multimodal Speech Recognition Based on Audio and Text Data." Herald of Khmelnytskyi National University. Technical Sciences 313, no. 5 (October 27, 2022): 22–25. http://dx.doi.org/10.31891/2307-5732-2022-313-5-22-25.
Abstract:
Systems for machine translation of texts from one language to another simulate the work of a human translator. Their performance depends on the ability to understand the grammar rules of the language. In translation, the basic units are not individual words but word combinations or phraseological units that express different concepts; only by using them can more complex ideas be expressed in the translated text. The main feature of machine translation is that input and output differ in length, and the ability to work with inputs and outputs of different lengths is provided by recurrent neural networks. A recurrent neural network (RNN) is a class of artificial neural network with connections between nodes, where a connection may run from a more distant node to a less distant one. These connections allow the RNN to remember and reproduce an entire sequence of reactions to a single stimulus. From a programming point of view such networks are analogous to cyclic execution, and from a systems point of view they are equivalent to a state machine. RNNs are commonly used to process word sequences in natural language processing, where a hidden Markov model (HMM) and an N-gram language model have traditionally been used. Deep learning has completely changed the approach to machine translation: researchers in the field have created simple solutions based on machine learning that outperform the best expert systems. This paper reviews the main features of machine translation based on recurrent neural networks and highlights the advantages of RNN systems using the sequence-to-sequence model over statistical translation systems. Two machine translation systems based on the sequence-to-sequence model were constructed using the Keras and PyTorch machine learning libraries; based on the obtained results, the libraries were analysed and their performance compared.
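The sequence-to-sequence approach this abstract describes fits in a few lines of PyTorch. The sketch below is illustrative only; the GRU cells, dimensions, vocabulary sizes, and teacher forcing are assumptions for the example, not the paper's exact architecture. An encoder RNN compresses a variable-length source sentence into a fixed state, and a decoder RNN generates the variable-length target from that state.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hid_dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the variable-length source sentence into a fixed state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that state (teacher forcing: the gold
        # target tokens are fed as decoder inputs during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # per-step logits over the target vocab

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 12))  # batch of 2 source sentences
tgt = torch.randint(0, 8000, (2, 9))   # batch of 2 target sentences
logits = model(src, tgt)               # shape: (2, 9, 8000)
```

An equivalent model can be assembled in Keras with its recurrent layers, which is the comparison the paper reports.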
6. Dupont, S., and J. Luettin. "Audio-visual speech modeling for continuous speech recognition." IEEE Transactions on Multimedia 2, no. 3 (2000): 141–51. http://dx.doi.org/10.1109/6046.865479.
7. Kubanek, M., J. Bobulski, and L. Adrjanowicz. "Characteristics of the use of coupled hidden Markov models for audio-visual Polish speech recognition." Bulletin of the Polish Academy of Sciences: Technical Sciences 60, no. 2 (October 1, 2012): 307–16. http://dx.doi.org/10.2478/v10175-012-0041-6.
Abstract:
This paper focuses on combining audio and visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on coupled hidden Markov models (CHMM). The described methods were developed for single isolated commands, but their effectiveness indicates that they would also work in continuous audio-visual speech recognition. Visual speech analysis is difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed; therefore, audio-visual recognition is used only while the audio speech signal is exposed to a considerable level of distortion. The authors propose their own methods for lip edge detection and visual feature extraction, and a method of fusing the audio and visual speech characteristics is proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of audio-visual characteristics. The experimental results were very promising and close to those achieved by leading researchers in the field of audio-visual speech recognition.
8. Kacur, Juraj, Boris Puterka, Jarmila Pavlovicova, and Milos Oravec. "Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications." Sensors 22, no. 16 (August 22, 2022): 6304. http://dx.doi.org/10.3390/s22166304.
Abstract:
There are many speech and audio processing applications and their number is growing. They may cover a wide range of tasks, each having different requirements on the processed speech or audio signals and, therefore, indirectly, on the audio sensors as well. This article reports on tests and evaluation of the effect of basic physical properties of speech and audio signals on the recognition accuracy of major speech/audio processing applications, i.e., speech recognition, speaker recognition, speech emotion recognition, and audio event recognition. A particular focus is on frequency ranges, time intervals, a precision of representation (quantization), and complexities of models suitable for each class of applications. Using domain-specific datasets, eligible feature extraction methods and complex neural network models, it was possible to test and evaluate the effect of basic speech and audio signal properties on the achieved accuracies for each group of applications. The tests confirmed that the basic parameters do affect the overall performance and, moreover, this effect is domain-dependent. Therefore, accurate knowledge of the extent of these effects can be valuable for system designers when selecting appropriate hardware, sensors, architecture, and software for a particular application, especially in the case of limited resources.
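As a rough illustration of two of the signal properties this study varies (bandwidth and quantization precision), the sketch below degrades a waveform along those axes before it would be fed to a recognizer. The helpers are hypothetical, not the authors' test harness, and a production system would use a proper band-limited filter rather than FFT-bin zeroing.

```python
import numpy as np

def requantize(y: np.ndarray, bits: int) -> np.ndarray:
    """Reduce the precision of a float waveform in [-1, 1] to `bits` bits."""
    levels = 2 ** (bits - 1)
    return np.clip(np.round(y * levels), -levels, levels - 1) / levels

def band_limit(y: np.ndarray, sr: int, cutoff_hz: float) -> np.ndarray:
    """Crude low-pass: zero out FFT bins above `cutoff_hz`."""
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(y))

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
y_8bit = requantize(y, bits=8)          # coarser amplitude representation
y_4khz = band_limit(y, sr, 4000.0)      # telephone-like bandwidth
```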
9. Dar, Showkat Ahmad. "Emotion Recognition Based on Audio Speech." IOSR Journal of Computer Engineering 11, no. 6 (2013): 46–50. http://dx.doi.org/10.9790/0661-1164650.
10. Aucouturier, Jean-Julien, and Laurent Daudet. "Pattern recognition of non-speech audio." Pattern Recognition Letters 31, no. 12 (September 2010): 1487–88. http://dx.doi.org/10.1016/j.patrec.2010.05.003.

Dissertations / Theses on the topic "Audio speech recognition"

1. Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda, and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition." Intelligent Media Integration, Nagoya University / COE, 2005. http://hdl.handle.net/2237/10460.
2. Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.
Abstract:
In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image-transformation-based features. A new geometric method for feature extraction is described, as well as the novel use of curvelet-based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring, and jitter. A novel method of integrating the audio and video information streams, called the maximum stream posterior (MSP), is described. This method is tested in both speaker-dependent and speaker-independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams, given no prior knowledge of the noise. The method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). Experiments using the XM2VTS database show that both of these methods can outperform standard methods in terms of recognition accuracy in situations where either stream is corrupted.
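Stream-weighted integration of the kind explored in this thesis (and in several entries below) is easy to sketch. The snippet is a generic illustration of late fusion with a single stream weight, not Seymour's MSP formulation; the class scores and the weight are made-up values for the example.

```python
import numpy as np

def fuse_streams(log_p_audio: np.ndarray, log_p_video: np.ndarray,
                 weight: float) -> np.ndarray:
    """Weighted log-linear fusion of per-class audio and visual scores."""
    return weight * log_p_audio + (1.0 - weight) * log_p_video

# Per-class posteriors from each single-stream classifier (illustrative).
log_p_audio = np.log([0.70, 0.20, 0.10])
log_p_video = np.log([0.30, 0.50, 0.20])

# A weight near 1.0 trusts the audio stream; near 0.0 trusts the video
# stream (e.g. when the audio is known to be noisy).
fused = fuse_streams(log_p_audio, log_p_video, weight=0.6)
print(int(np.argmax(fused)))  # index of the winning class
```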
3. Pachoud, Samuel. "Audio-visual speech and emotion recognition." Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.

4. Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.

5. Kaucic, Robert August. "Lip tracking for audio-visual speech recognition." Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.
6. Lucey, Simon. "Audio-visual speech processing." Thesis, Queensland University of Technology, 2002. https://eprints.qut.edu.au/36172/7/SimonLuceyPhDThesis.pdf.
Abstract:
Speech is inherently bimodal, relying on cues from the acoustic and visual speech modalities for perception. The McGurk effect demonstrates that when humans are presented with conflicting acoustic and visual stimuli, the perceived sound may not exist in either modality. This effect has formed the basis for modelling the complementary nature of acoustic and visual speech by encapsulating them into the relatively new research field of audio-visual speech processing (AVSP). Traditional acoustic-based speech processing systems have attained a high level of performance in recent years, but the performance of these systems is heavily dependent on a match between training and testing conditions. In the presence of mismatched conditions (e.g., acoustic noise) the performance of acoustic speech processing applications can degrade markedly. AVSP aims to increase the robustness and performance of conventional speech processing applications through the integration of the acoustic and visual modalities of speech, in particular for the tasks of isolated-word speech recognition and text-dependent speaker recognition. Two major problems in AVSP are addressed in this thesis, the first of which concerns the extraction of pertinent visual features for effective speech reading and visual speaker recognition. Appropriate representations of the mouth are explored for improved classification performance for speech and speaker recognition. Secondly, there is the question of how to effectively integrate the acoustic and visual speech modalities for robust and improved performance. This question is explored in depth using hidden Markov model (HMM) classifiers. The development and investigation of integration strategies for AVSP required research into a new branch of pattern recognition known as classifier combination theory. A novel framework is presented for optimally combining classifiers so their combined performance is greater than any of those classifiers individually. The benefits of this framework are not restricted to AVSP, as they can be applied to any task where there is a need for combining independent classifiers.
7. Eriksson, Mattias. "Speech recognition availability." Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2651.
Abstract:
This project investigates the importance of availability in the scope of dictation programs. Using speech recognition technology for dictation has not reached the public, and that may very well be a result of poor availability in today's technical solutions. I constructed a persona, Johanna, who personifies the target user, and developed a solution that streams audio to a speech recognition server and sends back the interpreted text; evaluated against the persona, the solution was successful in theory. I then recruited test users who tried out the solution in practice. Half of them claim that their usage has increased, and will continue to increase, thanks to the new level of availability.
8. Rao, Ram Raghavendra. "Audio-visual interaction in multimedia." Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.
9. Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.
Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. The thesis investigates a number of novel techniques for training and decoding SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques for normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
10. Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.

Books on the topic "Audio speech recognition"

1. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. Audio Processing and Speech Recognition. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5.

2. Ogunfunmi, Tokunbo, Roberto Togneri, and Madihally Narasimha, eds. Speech and Audio Processing for Coding, Enhancement and Recognition. New York, NY: Springer New York, 2015. http://dx.doi.org/10.1007/978-1-4939-1456-2.

3. Junqua, Jean-Claude. Robustness in automatic speech recognition: Fundamentals and applications. Boston: Kluwer Academic Publishers, 1996.

4. Harrison, Mark. The use of interactive audio and speech recognition techniques in training. [U.K.]: [s.n.], 1993.
5. AVBPA '97 (1st : 1997 : Crans-Montana, Switzerland). Audio- and video-based biometric person authentication: First International Conference, AVBPA '97, Crans-Montana, Switzerland, March 1997 : proceedings. Berlin: Springer, 1997.
6

1946-, Kittler Josef, and Nixon Mark S, eds. Audio-and video-based biometric person authentication: 4th International Conference, AVBPA 2003, Guildford, UK, June 2003 : proceedings. Berlin: Springer, 2003.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7. International Conference AVBPA (1st : 1997 : Crans-Montana, Switzerland). Audio- and video-based biometric person authentication: First International Conference, AVBPA '97, Crans-Montana, Switzerland, March 12-14, 1997 : proceedings. Berlin: Springer, 1997.

8. IEEE Workshop on Automatic Speech Recognition and Understanding (1997 : Santa Barbara, Calif.). 1997 IEEE Workshop on Automatic Speech Recognition and Understanding: proceedings. Piscataway, NJ: Published under the sponsorship of the IEEE Signal Processing Society, 1997.
9. Minker, Wolfgang. Speech and human-machine dialog. Boston: Kluwer Academic Publishers, 2004.

Book chapters on the topic "Audio speech recognition"

1. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. "Audio Indexing." In Audio Processing and Speech Recognition, 1–11. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5_1.

2. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. "Audio Classification." In Audio Processing and Speech Recognition, 67–93. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5_4.

3. Richter, Michael M., Sheuli Paul, Veton Këpuska, and Marius Silaghi. "Audio Signals and Speech Recognition." In Signal Processing and Machine Learning with Applications, 345–68. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-319-45372-9_18.

4. Luettin, Juergen, and Stéphane Dupont. "Continuous audio-visual speech recognition." In Lecture Notes in Computer Science, 657–73. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998. http://dx.doi.org/10.1007/bfb0054771.

5. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. "Speech Processing and Recognition System." In Audio Processing and Speech Recognition, 13–43. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5_2.

6. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. "Feature Extraction." In Audio Processing and Speech Recognition, 45–66. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5_3.

7. Sen, Soumya, Anjan Dutta, and Nilanjan Dey. "Conclusion." In Audio Processing and Speech Recognition, 95–96. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-13-6098-5_5.

8. Sethu, Vidhyasaharan, Julien Epps, and Eliathamby Ambikairajah. "Speech Based Emotion Recognition." In Speech and Audio Processing for Coding, Enhancement and Recognition, 197–228. New York, NY: Springer New York, 2014. http://dx.doi.org/10.1007/978-1-4939-1456-2_7.

9. Karpov, Alexey, Alexander Ronzhin, Irina Kipyatkova, Andrey Ronzhin, Vasilisa Verkhodanova, Anton Saveliev, and Milos Zelezny. "Bimodal Speech Recognition Fusing Audio-Visual Modalities." In Lecture Notes in Computer Science, 170–79. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-39516-6_16.

10. Kratt, Jan, Florian Metze, Rainer Stiefelhagen, and Alex Waibel. "Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit." In Lecture Notes in Computer Science, 488–95. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004. http://dx.doi.org/10.1007/978-3-540-28649-3_60.

Conference papers on the topic "Audio speech recognition"

1. Ko, Tom, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. "Audio augmentation for speech recognition." In Interspeech 2015. ISCA, 2015. http://dx.doi.org/10.21437/interspeech.2015-711.
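This paper's central technique is speed perturbation of the training audio: each utterance is duplicated at slightly slower and faster playback rates. The sketch below is a minimal NumPy illustration of the idea, not the authors' implementation; a real pipeline would use a band-limited resampler rather than linear interpolation, and the 0.9/1.1 factors are the commonly cited example values.

```python
import numpy as np

def speed_perturb(y: np.ndarray, factor: float) -> np.ndarray:
    """Simulate playback at `factor` x speed by linear-interpolation
    resampling; both the duration and the pitch of the signal change."""
    n_out = int(round(len(y) / factor))
    old_idx = np.arange(len(y), dtype=float)
    new_idx = np.linspace(0.0, len(y) - 1.0, num=n_out)
    return np.interp(new_idx, old_idx, y)

# Typical augmentation: add copies of each utterance at 0.9x and 1.1x speed.
y = np.random.randn(16000)  # stand-in for one second of 16 kHz speech
augmented = [y, speed_perturb(y, 0.9), speed_perturb(y, 1.1)]
```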
2. Li, Xinyu, Venkata Chebiyyam, and Katrin Kirchhoff. "Speech Audio Super-Resolution for Speech Recognition." In Interspeech 2019. ISCA, 2019. http://dx.doi.org/10.21437/interspeech.2019-3043.

3. Palecek, Karel, and Josef Chaloupka. "Audio-visual speech recognition in noisy audio environments." In 2013 36th International Conference on Telecommunications and Signal Processing (TSP). IEEE, 2013. http://dx.doi.org/10.1109/tsp.2013.6613979.

4. Fook, C. Y., M. Hariharan, Sazali Yaacob, and A. H. Adom. "A review: Malay speech recognition and audio visual speech recognition." In 2012 International Conference on Biomedical Engineering (ICoBE). IEEE, 2012. http://dx.doi.org/10.1109/icobe.2012.6179063.
5. Sinha, Arryan, and G. Suseela. "Deep Learning-Based Speech Emotion Recognition." In International Research Conference on IOT, Cloud and Data Science. Switzerland: Trans Tech Publications Ltd, 2023. http://dx.doi.org/10.4028/p-0892re.
Abstract:
Speech emotion recognition (SER), as described in this study, uses neural networks to classify the emotions expressed in speech. It is centred on the idea that voice tone and pitch frequently reflect underlying emotion. An MLP classifier operating on the wave signal is used to classify the elicited emotions, allowing for flexible learning-rate selection. The RAVDESS dataset (Ryerson Audio-Visual Database of Emotional Speech and Song) is used. To extract characteristics from a given audio input, features such as spectral contrast, MFCCs, mel-spectrogram frequencies, and chroma may be employed. To facilitate feature extraction from the audio, the dataset is labelled using decimal encoding. On the input audio samples, precision was found to be 80.28%; additional testing confirmed this result.
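A feature pipeline of the kind this abstract describes is straightforward to sketch with librosa and scikit-learn. The snippet below is a generic illustration under assumed file paths, labels, and MLP settings, not the authors' code; the RAVDESS path and label shown are hypothetical placeholders.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(path: str) -> np.ndarray:
    """Mean-pooled MFCC, chroma, mel-spectrogram and spectral-contrast
    features for one audio file, concatenated into a single vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    return np.concatenate([f.mean(axis=1) for f in (mfcc, chroma, mel, contrast)])

# Hypothetical inputs: paths to RAVDESS clips and their emotion classes
# (in practice the class is parsed from the RAVDESS file-name convention).
files = ["Actor_01/03-01-01-01-01-01-01.wav"]
labels = ["neutral"]

X = np.stack([extract_features(f) for f in files])
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500)
clf.fit(X, labels)
```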
6. Benhaim, Eric, Hichem Sahbi, and Guillaume Vittey. "Continuous visual speech recognition for audio speech enhancement." In ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. http://dx.doi.org/10.1109/icassp.2015.7178370.

7. Reikeras, Helge, Ben Herbst, Johan du Preez, and Herman Engelbrecht. "Audio-Visual Speech Recognition using SciPy." In Python in Science Conference. SciPy, 2010. http://dx.doi.org/10.25080/majora-92bf1922-010.

8. Tan, Hao, Chenwei Liu, Yinyu Lyu, Xiao Zhang, Denghui Zhang, and Zhaoquan Gu. "Audio Steganography with Speech Recognition System." In 2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC). IEEE, 2021. http://dx.doi.org/10.1109/dsc53577.2021.00042.

9. Narisetty, Chaitanya, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, and Shinji Watanabe. "Joint Speech Recognition and Audio Captioning." In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. http://dx.doi.org/10.1109/icassp43922.2022.9746601.

10. Yang, Karren, Dejan Markovic, Steven Krenn, Vasu Agrawal, and Alexander Richard. "Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.00805.

Reports on the topic "Audio speech recognition"

1. Standard Object Systems Inc. Advanced Audio Interface for Phonetic Speech Recognition in a High Noise Environment. Fort Belvoir, VA: Defense Technical Information Center, January 2000. http://dx.doi.org/10.21236/ada373461.
2. Issues in Data Processing and Relevant Population Selection. OSAC Speaker Recognition Subcommittee, November 2022. http://dx.doi.org/10.29325/osac.tg.0006.
Abstract:
In forensic automatic speaker recognition (FASR), forensic examiners typically compare audio recordings of a speaker whose identity is in question with recordings of known speakers to assist investigators and triers of fact in a legal proceeding. The performance of automated speaker recognition (SR) systems used for this purpose depends largely on the characteristics of the speech samples being compared. Examiners must understand the requirements of the specific systems in use as well as the audio characteristics that impact system performance. Mismatch conditions between the known and questioned data samples are of particular importance, but the need for, and impact of, audio pre-processing must also be understood. The data selected for use as a relevant population can also be critical to the performance of the system. This document describes issues that arise in the processing of case data and in the selection of a relevant population for purposes of conducting an examination using a human-supervised automatic speaker recognition approach in a forensic context. The document is intended to comply with the requirements for an Organization of Scientific Area Committees (OSAC) for Forensic Science Technical Guidance Document.