Journal articles on the topic 'Automatic speech recognition – Statistical methods'


Consult the top 50 journal articles for your research on the topic 'Automatic speech recognition – Statistical methods.'


1

Boyer, A., J. Di Martino, P. Divoux, J. P. Haton, J. F. Mari, and K. Smaili. "Statistical methods in multi-speaker automatic speech recognition." Applied Stochastic Models and Data Analysis 6, no. 3 (September 1990): 143–55. http://dx.doi.org/10.1002/asm.3150060302.

2

Kłosowski, Piotr. "A Rule-Based Grapheme-to-Phoneme Conversion System." Applied Sciences 12, no. 5 (March 7, 2022): 2758. http://dx.doi.org/10.3390/app12052758.

Abstract:
This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. It should be noted that the fundamental grapheme-to-phoneme conversion rules have been developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to corresponding strings of phonemes, in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created out of an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.
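
To make the rule-based conversion concrete, here is a minimal Python sketch of greedy longest-match grapheme-to-phoneme rewriting. The rules and phoneme symbols are invented simplifications for illustration; they are not Steffen-Batóg's rule set or the TransFon implementation.

```python
# Minimal sketch of rule-based grapheme-to-phoneme conversion. The rules
# below are illustrative simplifications, not the actual Polish rule set.
RULES = [
    ("sz", "S"),   # digraph -> single phoneme symbol (assumed notation)
    ("cz", "tS"),
    ("rz", "Z"),
    ("ch", "x"),
    ("w",  "v"),
    ("ł",  "w"),
]

def g2p(word: str) -> list[str]:
    """Greedy longest-match rule application, left to right."""
    phonemes, i = [], 0
    while i < len(word):
        for graph, phon in RULES:
            if word.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:                       # no rule matched: identity mapping
            phonemes.append(word[i])
            i += 1
    return phonemes

print(g2p("szczyt"))   # ['S', 'tS', 'y', 't']
```

A real system orders many context-sensitive rules carefully; the loop above only shows the control flow of longest-match rewriting.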
3

Toth, Laszlo, Ildiko Hoffmann, Gabor Gosztolya, Veronika Vincze, Greta Szatloczki, Zoltan Banreti, Magdolna Pakaski, and Janos Kalman. "A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech." Current Alzheimer Research 15, no. 2 (January 3, 2018): 130–38. http://dx.doi.org/10.2174/1567205014666171121114930.

Abstract:
Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production during a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. Spontaneous speech was provoked by asking the patients to recall the content of two short black-and-white films (one recalled immediately, one after a delay), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process, that is, using the ASR-based features in combination with machine learning, was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community.
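
As a rough illustration of how temporal parameters such as pause counts and the hesitation ratio can be derived from a recording, the Python sketch below works from a short-time energy envelope. The frame sizes, silence threshold, and toy signal are illustrative assumptions, not the study's settings (which used Praat and an ASR-based tool).

```python
import numpy as np

def pause_features(signal, sr, frame_ms=25, hop_ms=10, silence_db=-35.0):
    """Crude silent-pause statistics from a short-time energy envelope.
    Frame sizes and the silence threshold are illustrative choices."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    starts = range(0, len(signal) - frame, hop)
    energy_db = 10 * np.log10(
        [np.mean(signal[i:i + frame] ** 2) + 1e-12 for i in starts])
    silent = energy_db < energy_db.max() + silence_db  # relative threshold
    pauses, run = [], 0
    for s in silent:                      # runs of silent frames = pauses
        if s:
            run += 1
        elif run:
            pauses.append(run * hop_ms / 1000)
            run = 0
    if run:
        pauses.append(run * hop_ms / 1000)
    total_s = len(signal) / sr
    return {"n_pauses": len(pauses),
            "mean_pause_s": float(np.mean(pauses)) if pauses else 0.0,
            "hesitation_ratio": sum(pauses) / total_s}

sr = 16000
speech = np.concatenate([np.random.randn(sr), np.zeros(sr // 2),
                         np.random.randn(sr)])  # noise / silence / noise toy signal
print(pause_features(speech, sr))
```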
4

Gellatly, Andrew W., and Thomas A. Dingus. "Speech Recognition and Automotive Applications: Using Speech to Perform in-Vehicle Tasks." Proceedings of the Human Factors and Ergonomics Society Annual Meeting 42, no. 17 (October 1998): 1247–51. http://dx.doi.org/10.1177/154193129804201715.

Abstract:
An experiment was conducted to investigate the effects of automatic speech recognition (ASR) system design, driver input-modality, and driver age on driving performance during in-vehicle task execution and in-vehicle task usability. Results showed that ASR system design (i.e., recognition accuracy and recognition error type) and driver input-modality (i.e., manual or speech) significantly affected certain dependent measures. However, the differences found were small, suggesting that less than ideal ASR system design/performance can be considered for use in automobiles without substantially improving or degrading driving performance. Several of the speech-input conditions tested were statistically similar, as determined by the dependent measures, to current manual-input methods used to perform identical in-vehicle tasks. Further research is warranted to determine how extended exposure to, and use of, ASR systems affects driving performance, in-vehicle task usability, and driver opinion compared with conventional manual-input methods. In addition, the research should investigate whether prolonged exposure to, and use of, ASR systems results in significant improvements compared to the current research findings.
5

Seman, Noraini, and Ahmad Firdaus Norazam. "Hybrid methods of Brandt's generalised likelihood ratio and short-term energy for Malay word speech segmentation." Indonesian Journal of Electrical Engineering and Computer Science 16, no. 1 (October 1, 2019): 283. http://dx.doi.org/10.11591/ijeecs.v16.i1.pp283-291.

Abstract:
Speech segmentation is an important part of speech recognition, synthesis and coding. Statistical approaches detect segmentation points by computing the spectral distortion of the signal without prior knowledge of the acoustic information, and have proved able to give good matches and few omissions, but many insertions. In this study the segmentation is done both manually and automatically using Malay words in traditional Malay poetry. This study proposes a hybrid method combining Brandt's generalized likelihood ratio (GLR) and a short-term energy algorithm. Brandt's algorithm estimates abrupt changes in energy to determine the segmentation points. A total of five Pantun are used in read mode, spoken by one male student in a noise-free room. Experiments are conducted to measure the accuracy, insertions, and omissions of the segmentation points. Experimental results show on average 80% accuracy with a 0.2-second time tolerance for automatic segmentation, with the algorithm having no knowledge of the acoustic characteristics.
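
The short-term energy half of the hybrid method can be sketched in a few lines of Python. The window size and the energy-jump threshold below are assumed values, and Brandt's GLR test, which the paper combines with this cue, is omitted for brevity.

```python
import numpy as np

def energy_boundaries(signal, sr, win_ms=20, jump_db=8.0):
    """Candidate segment boundaries where short-term energy jumps abruptly
    between adjacent windows; the 8 dB jump size is an assumed value."""
    win = int(sr * win_ms / 1000)
    n = len(signal) // win
    e = np.array([np.mean(signal[k*win:(k+1)*win] ** 2) + 1e-12 for k in range(n)])
    e_db = 10 * np.log10(e)
    return [(k + 1) * win / sr                 # boundary times in seconds
            for k, jump in enumerate(np.abs(np.diff(e_db)) > jump_db) if jump]

sr = 16000
sig = np.concatenate([0.01 * np.random.randn(sr), np.random.randn(sr)])
print(energy_boundaries(sig, sr))  # ~[1.0]: the quiet-to-loud transition
```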
6

Cabral, Frederico Soares, Hidekazu Fukai, and Satoshi Tamura. "Feature Extraction Methods Proposed for Speech Recognition Are Effective on Road Condition Monitoring Using Smartphone Inertial Sensors." Sensors 19, no. 16 (August 9, 2019): 3481. http://dx.doi.org/10.3390/s19163481.

Abstract:
The objective of our project is to develop an automatic survey system for road condition monitoring using smartphone devices. One of the main tasks of our project is the classification of paved and unpaved roads. Since recordings will in practice be gathered across various types of vehicle suspension systems and speeds, we use the multiple sensors found in smartphones and state-of-the-art machine learning techniques for signal processing. Although it usually receives little attention, the feature extraction step influences the classification results. Therefore, we have to carefully choose not only the classification method but also the feature extraction method and its parameters. Simple statistics-based features are most commonly used to extract road surface information from acceleration data. In this study, we evaluated mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction coefficients (PLP) as the feature extraction step to improve the accuracy of paved and unpaved road classification. Although both MFCC and PLP were developed in the human speech recognition field, we found that modified MFCC and PLP can improve on the commonly used statistical method.
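
A minimal sketch of the idea, assuming librosa is available: treat the 1-D inertial trace as audio, extract MFCCs, and summarize them per road segment. The sampling rate and all parameter values are illustrative assumptions.

```python
import numpy as np
import librosa

# Treat a 1-D accelerometer trace like an audio signal and extract MFCCs.
# All parameters are illustrative; real features would come from the
# smartphone's inertial recordings.
sr = 100                                  # accelerometer sampling rate (Hz)
accel_z = np.random.randn(60 * sr)        # stand-in for a vertical-axis trace

mfcc = librosa.feature.mfcc(y=accel_z, sr=sr, n_mfcc=13,
                            n_fft=128, hop_length=64, n_mels=20, fmax=sr / 2)
# Summarize frame-level coefficients into one vector per road segment,
# analogous to the simple statistics-based features the paper compares against.
segment_features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(segment_features.shape)             # (26,) -> paved/unpaved classifier input
```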
7

Hai, Yanfei. "Computer-aided teaching mode of oral English intelligent learning based on speech recognition and network assistance." Journal of Intelligent & Fuzzy Systems 39, no. 4 (October 21, 2020): 5749–60. http://dx.doi.org/10.3233/jifs-189052.

Abstract:
The purpose of this paper is to use English-specific syllables and prosodic features in spoken speech data to carry out spoken English recognition, and to explore effective methods for the design and application of English speech detection and automatic recognition systems. The method proposed in this study is a combination of an SVM_FF-based classifier, an SVM_IER-based classifier and a syllable classifier. Compared with methods based on combinations of other phonological characteristics, such as speech rate, intensity, formant and energy statistics, and pronunciation rate, and with syllable-based classifiers trained on specific syllables, a better recognition rate is obtained. In addition, this study conducts simulation experiments on the proposed English recognition and identification method based on specific syllables and prosodic features and analyzes the experimental results. The results show that the recognition performance of the spoken English recognition system constructed in this study is significantly better than that of the traditional model.
8

Markovnikov, Nikita, and Irina Kipyatkova. "Encoder-decoder models for recognition of Russian speech." Information and Control Systems, no. 4 (October 4, 2019): 45–53. http://dx.doi.org/10.31799/1684-8853-2019-4-45-53.

Abstract:
Problem: Classical systems of automatic speech recognition are traditionally built using an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy, but consist of several independent complex parts, which can cause problems when building models. Recently, an end-to-end recognition method has spread, using deep artificial neural networks. This approach makes it easy to implement models using just one neural network. End-to-end models often demonstrate better performance in terms of speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their adjustment and comparison with hybrid base models in terms of recognition accuracy and computational characteristics, such as the speed of learning and decoding. Methods: Creating an encoder-decoder model of speech recognition using an attention mechanism; applying techniques of stabilization and regularization of neural networks; augmentation of data for training; using parts of words as an output of a neural network. Results: An encoder-decoder model was obtained using an attention mechanism for recognizing continuous Russian speech without extracting features or using a language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could not surpass the basic hybrid models, but surpassed the other baseline end-to-end models, both in recognition accuracy and in decoding/learning speed. The word recognition error was 24.17% and the decoding speed was 0.3 of the real time, which is 6% faster than the baseline end-to-end model and 46% faster than the basic hybrid model. We showed that end-to-end models could work without language models for the Russian language, while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language the hybrid type of attention mechanism gives the best result compared to location-based or context-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than the traditional hybrid models. That fact can allow them to be used locally on mobile devices without using calculations on remote servers.
9

Afli, Haithem, Loïc Barrault, and Holger Schwenk. "Building and using multimodal comparable corpora for machine translation." Natural Language Engineering 22, no. 4 (June 15, 2016): 603–25. http://dx.doi.org/10.1017/s1351324916000152.

Abstract:
In recent decades, statistical approaches have significantly advanced the development of machine translation systems. However, the applicability of these methods directly depends on the availability of very large quantities of parallel data. Recent works have demonstrated that a comparable corpus can compensate for the shortage of parallel corpora. In this paper, we propose an alternative to comparable corpora containing text documents as resources for extracting parallel data: a multimodal comparable corpus with audio documents in the source language and text documents in the target language, built from Euronews and TED web sites. The audio is transcribed by an automatic speech recognition system, and translated with a baseline statistical machine translation system. We then use information retrieval in a large text corpus in the target language in order to extract parallel sentences/phrases. We evaluate the quality of the extracted data on an English to French translation task and show significant improvements over a state-of-the-art baseline.
10

Kozlova, A. T. "Temporal Characteristics of Prosody in Imperative Utterances and the Phenomenon of Emphatic Length in the English Language." Bulletin of Kemerovo State University, no. 3 (October 27, 2018): 192–96. http://dx.doi.org/10.21603/2078-8975-2018-3-192-196.

Abstract:
The paper focuses on one of the most effective factors of linguistic manipulation, i.e. imperative utterance. The subject of the study was direct contact appeals, whose structures corresponded to the literary norms of the English language. The research determined and described the temporal component of imperative prosody. The author employed electro-acoustic, mathematical and statistical methods. The phonetic experiment revealed four prosodic structures, as well as their inter-structural and inter-style levels, the degree of temporal fluctuation and the phenomenon of emphatic length, the latter being recognized as the basic temporal feature of imperative prosody. Temporal variation in a phrase and its functional segments in different prosodic structures and in certain extra-linguistic conditions convincingly demonstrates the set of absolute and inter-style markers of this prosodic subsystem. In practice, the results of the present research can be applied in teaching communicatively oriented utterances and in making up the algorithm of automatic speech recognition and synthesis.
11

Ling, Xufeng, Jie Yang, Jingxin Liang, Huaizhong Zhu, and Hui Sun. "A Deep-Learning Based Method for Analysis of Students’ Attention in Offline Class." Electronics 11, no. 17 (August 25, 2022): 2663. http://dx.doi.org/10.3390/electronics11172663.

Abstract:
Students' actual learning engagement in class, which we call learning attention, is a major indicator used to measure learning outcomes. Obtaining and analyzing students' attention accurately in offline classes is important empirical research that can improve teachers' teaching methods. This paper proposes a method to obtain and measure students' attention in class by applying a variety of deep-learning models, and divides a whole class into a series of time durations categorized into four states: lecturing, interaction, practice, and transcription. After video and audio information is captured with Internet of Things (IoT) technology in class, Retinaface and the Vision Transformer (ViT) model are used to detect faces and extract students' head-pose parameters. Automatic speech recognition (ASR) models are used to divide a class into the series of four states. Combining the class-state sequence and each student's head-pose parameters, the learning attention of each student can be accurately calculated. Finally, individual and statistical learning attention analyses are conducted that can help teachers to improve their teaching methods. This method shows potential application value and can be deployed in schools and applied in different smart education programs.
12

Asgari, Meysam, Robert Gale, Katherine Wild, Hiroko Dodge, and Jeffrey Kaye. "Automatic Assessment of Cognitive Tests for Differentiating Mild Cognitive Impairment: A Proof of Concept Study of the Digit Span Task." Current Alzheimer Research 17, no. 7 (November 16, 2020): 658–66. http://dx.doi.org/10.2174/1567205017666201008110854.

Abstract:
Background: Current conventional cognitive assessments are limited in their efficiency and sensitivity, often relying on a single score such as the total correct items. Typically, multiple features of response go uncaptured. Objectives: We aim to explore a new set of automatically derived features from the Digit Span (DS) task that address some of the drawbacks in the conventional scoring and are also useful for distinguishing subjects with Mild Cognitive Impairment (MCI) from those with intact cognition. Methods: Audio-recordings of the DS tests administered to 85 subjects (22 MCI and 63 healthy controls, mean age 90.2 years) were transcribed using an Automatic Speech Recognition (ASR) system. Next, five correctness measures were generated from Levenshtein distance analysis of responses: number correct, incorrect, deleted, inserted, and substituted words compared to the test item. These per-item features were aggregated across all test items for both Forward Digit Span (FDS) and Backward Digit Span (BDS) tasks using summary statistical functions, constructing a global feature vector representing the detailed assessment of each subject’s response. A support vector machine classifier distinguished MCI from cognitively intact participants. Results: Conventional DS scores did not differentiate MCI participants from controls. The automated multi-feature DS-derived metric achieved 73% on AUC-ROC of the SVM classifier, independent of additional clinical features (77% when combined with demographic features of subjects); well above chance, 50%. Conclusion: Our analysis verifies the effectiveness of introduced measures, solely derived from the DS task, in the context of differentiating subjects with MCI from those with intact cognition.
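
The Levenshtein-based correctness measures described here can be sketched with the standard dynamic-programming alignment plus a backtrace; this is an illustrative reimplementation, not the authors' code.

```python
def edit_ops(target, response):
    """Count correct, substituted, deleted, and inserted items between a
    digit-span target and a subject's (ASR-transcribed) response, via the
    standard Levenshtein DP with backtrace."""
    m, n = len(target), len(response)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i-1] == response[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # match / substitution
    ops = {"correct": 0, "substituted": 0, "deleted": 0, "inserted": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (target[i-1] != response[j-1]):
            ops["correct" if target[i-1] == response[j-1] else "substituted"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops["deleted"] += 1
            i -= 1
        else:
            ops["inserted"] += 1
            j -= 1
    return ops

print(edit_ops("582914", "58214"))  # one digit deleted, five correct
```

Aggregating such per-item counts across all test items with summary statistics yields the kind of global feature vector the study feeds to its classifier.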
13

Woo, MinJae, Prabodh Mishra, Ju Lin, Snigdhaswin Kar, Nicholas Deas, Caleb Linduff, Sufeng Niu, et al. "Complete and Resilient Documentation for Operational Medical Environments Leveraging Mobile Hands-free Technology in a Systems Approach: Experimental Study." JMIR mHealth and uHealth 9, no. 10 (October 12, 2021): e32301. http://dx.doi.org/10.2196/32301.

Abstract:
Background Prehospitalization documentation is a challenging task and prone to loss of information, as paramedics operate under disruptive environments requiring their constant attention to the patients. Objective The aim of this study is to develop a mobile platform for hands-free prehospitalization documentation to assist first responders in operational medical environments by aggregating all existing solutions for noise resiliency and domain adaptation. Methods The platform was built to extract meaningful medical information from real-time audio streaming at the point of injury and transmit complete documentation to a field hospital prior to patient arrival. To this end, state-of-the-art automatic speech recognition (ASR) solutions with the following modular improvements were thoroughly explored: noise-resilient ASR, multi-style training, customized lexicon, and speech enhancement. The development of the platform was strictly guided by qualitative research and simulation-based evaluation to address the relevant challenges through progressive improvements at every process step of the end-to-end solution. The primary performance metrics included medical word error rate (WER) in machine-transcribed text output and an F1 score calculated by comparing the autogenerated documentation to manual documentation by physicians. Results A total of 15,139 individual words necessary for completing the documentation were identified from all conversations that occurred during the physician-supervised simulation drills. The baseline model presented a suboptimal performance with a WER of 69.85% and an F1 score of 0.611. The noise-resilient ASR, multi-style training, and customized lexicon improved the overall performance; the finalized platform achieved a medical WER of 33.3% and an F1 score of 0.81 when compared to manual documentation. The speech enhancement degraded performance, with the medical WER increasing from 33.3% to 46.33% and the corresponding F1 score decreasing from 0.81 to 0.78. All changes in performance were statistically significant (P<.001). Conclusions This study presented a fully functional mobile platform for hands-free prehospitalization documentation in operational medical environments and lessons learned from its implementation.
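
The F1 metric described, comparing autogenerated documentation against a physician's manual documentation, can be sketched by treating both documents as multisets of words. This is an illustrative simplification of the paper's evaluation, with invented example text.

```python
from collections import Counter

def doc_f1(auto_words, manual_words):
    """Precision/recall/F1 of autogenerated documentation against a manual
    reference, treating both as multisets of words (an illustrative
    simplification of the paper's metric)."""
    auto, manual = Counter(auto_words), Counter(manual_words)
    overlap = sum((auto & manual).values())
    precision = overlap / max(sum(auto.values()), 1)
    recall = overlap / max(sum(manual.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

auto = "patient has tension pneumothorax left side".split()
manual = "patient presents tension pneumothorax on left side".split()
print(doc_f1(auto, manual))  # overlap of 5 words -> F1 ≈ 0.77
```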
14

Levinson, S. E. "Structural methods in automatic speech recognition." Proceedings of the IEEE 73, no. 11 (1985): 1625–50. http://dx.doi.org/10.1109/proc.1985.13344.

15

Sun, Don X., and Frederick Jelinek. "Statistical Methods for Speech Recognition." Journal of the American Statistical Association 94, no. 446 (June 1999): 650. http://dx.doi.org/10.2307/2670189.

16

Rigazio, Luca. "Discriminative clustering methods for automatic speech recognition." Journal of the Acoustical Society of America 114, no. 4 (2003): 1719. http://dx.doi.org/10.1121/1.1627548.

17

Russell, M. J., R. K. Moore, and M. J. Tomlinson. "Dynamic Programming and Statistical Modelling in Automatic Speech Recognition." Journal of the Operational Research Society 37, no. 1 (January 1986): 21. http://dx.doi.org/10.2307/2582543.

18

Russell, M. J., R. K. Moore, and M. J. Tomlinson. "Dynamic Programming and Statistical Modelling in Automatic Speech Recognition." Journal of the Operational Research Society 37, no. 1 (January 1986): 21–30. http://dx.doi.org/10.1057/jors.1986.4.

19

Bourlard, H., and N. Morgan. "Continuous speech recognition by connectionist statistical methods." IEEE Transactions on Neural Networks 4, no. 6 (1993): 893–909. http://dx.doi.org/10.1109/72.286885.

20

Bojanic, Milana, Vlado Delic, and Milan Secujski. "Relevance of the types and the statistical properties of features in the recognition of basic emotions in speech." Facta universitatis - series: Electronics and Energetics 27, no. 3 (2014): 425–33. http://dx.doi.org/10.2298/fuee1403425b.

Abstract:
Due to the advance of speech technologies and their increasing usage in various applications, automatic recognition of emotions in speech represents one of the emerging fields in human-computer interaction. This paper deals with several topics related to automatic emotional speech recognition, most notably with the improvement of recognition accuracy by lowering the dimensionality of the feature space and evaluation of the relevance of particular feature types. The research is focused on the classification of emotional speech into five basic emotional classes (anger, joy, fear, sadness and neutral speech) using a recorded corpus of emotional speech in Serbian.
21

Kundegorski, Mikolaj, Philip J. B. Jackson, and Bartosz Ziółko. "Two-Microphone Dereverberation for Automatic Speech Recognition of Polish." Archives of Acoustics 39, no. 3 (March 1, 2015): 411–20. http://dx.doi.org/10.2478/aoa-2014-0045.

Abstract:
Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates the novel combination of precedence, binaural and statistical independence cues for enhancing reverberant speech, prior to ASR, under these adverse acoustical conditions when two microphone signals are available. Results of the enhancement are evaluated in terms of relevant signal measures and accuracy for both English and Polish ASR tasks. These show inconsistencies between the signal and recognition measures, although in recognition the proposed method consistently outperforms all other combinations and the spectral-subtraction baseline.
22

Schultz, Benjamin G., Venkata S. Aditya Tarigoppula, Gustavo Noffs, Sandra Rojas, Anneke van der Walt, David B. Grayden, and Adam P. Vogel. "Automatic speech recognition in neurodegenerative disease." International Journal of Speech Technology 24, no. 3 (May 4, 2021): 771–79. http://dx.doi.org/10.1007/s10772-021-09836-w.

Abstract:
Automatic speech recognition (ASR) could potentially improve communication by providing transcriptions of speech in real time. ASR is particularly useful for people with progressive disorders that lead to reduced speech intelligibility or difficulties performing motor tasks. ASR services are usually trained on healthy speech and may not be optimized for impaired speech, creating a barrier for accessing augmented assistance devices. We tested the performance of three state-of-the-art ASR platforms on two groups of people with neurodegenerative disease and healthy controls. We further examined individual differences that may explain errors in ASR services within groups, such as age and sex. Speakers were recorded while reading a standard text. Speech was elicited from individuals with multiple sclerosis, Friedreich’s ataxia, and healthy controls. Recordings were manually transcribed and compared to ASR transcriptions using Amazon Web Services, Google Cloud, and IBM Watson. Accuracy was measured as the proportion of words that were correctly classified. ASR accuracy was higher for controls than clinical groups, and higher for multiple sclerosis compared to Friedreich’s ataxia for all ASR services. Amazon Web Services and Google Cloud yielded higher accuracy than IBM Watson. ASR accuracy decreased with increased disease duration. Age and sex did not significantly affect ASR accuracy. ASR faces challenges for people with neuromuscular disorders. Until improvements are made in recognizing less intelligible speech, the true value of ASR for people requiring augmented assistance devices and alternative communication remains unrealized. We suggest potential methods to improve ASR for those with impaired speech.
23

O’Shaughnessy, Douglas. "Invited paper: Automatic speech recognition: History, methods and challenges." Pattern Recognition 41, no. 10 (October 2008): 2965–79. http://dx.doi.org/10.1016/j.patcog.2008.05.008.

24

O’Shaughnessy, Douglas D., and T. Nagarajan Li. "Better model and decoding methods for automatic speech recognition." Journal of the Acoustical Society of America 119, no. 5 (May 2006): 3441–42. http://dx.doi.org/10.1121/1.4786938.

25

Debnath, Saswati, and Pinki Roy. "Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis." International Journal of Interactive Multimedia and Artificial Intelligence 7, no. 2 (2021): 121. http://dx.doi.org/10.9781/ijimai.2021.09.001.

26

Dashtaki, Parnyan Bahrami. "An Investigation into Methodology and Metrics Employed to Evaluate the (Speech-to-Speech) Way in Translation Systems." Modern Applied Science 11, no. 4 (February 8, 2017): 55. http://dx.doi.org/10.5539/mas.v11n4p55.

Abstract:
Speech-to-speech translation is a challenging problem, due to poor sentence planning typically associated with spontaneous speech, as well as errors caused by automatic speech recognition. Based upon a statistically trained speech translation system, in this study we investigate methodologies and metrics employed to assess speech-to-speech translation systems. The speech translation is performed incrementally based on the generation of partial hypotheses from speech recognition. Speech-input translation can be properly approached as a pattern recognition problem by means of statistical alignment models and stochastic finite-state transducers. Under this general framework, some specific models are presented. One of the features of such models is their capability of automatically learning from training examples. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed. In this research, we explore the methodologies and metrics employed to assess such speech-to-speech translation systems.
27

Stolcke, Andreas, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech." Computational Linguistics 26, no. 3 (September 2000): 339–73. http://dx.doi.org/10.1162/089120100561737.

Abstract:
We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speech-act-like units such as Statement, Question, Backchannel, Agreement, Disagreement, and Apology. Our model detects and predicts dialogue acts based on lexical, collocational, and prosodic cues, as well as on the discourse coherence of the dialogue act sequence. The dialogue model is based on treating the discourse structure of a conversation as a hidden Markov model and the individual dialogue acts as observations emanating from the model states. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram. The statistical dialogue grammar is combined with word n-grams, decision trees, and neural networks modeling the idiosyncratic lexical and prosodic manifestations of each dialogue act. We develop a probabilistic integration of speech recognition with dialogue modeling, to improve both speech recognition and dialogue act classification accuracy. Models are trained and evaluated using a large hand-labeled database of 1,155 conversations from the Switchboard corpus of spontaneous human-to-human telephone speech. We achieved good dialogue act labeling accuracy (65% based on errorful, automatically recognized words and prosody, and 71% based on word transcripts, compared to a chance baseline accuracy of 35% and human accuracy of 84%) and a small reduction in word recognition error.
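
The HMM view described here, with dialogue acts as hidden states, a dialogue-act n-gram as the transition model, and per-act likelihoods of the lexical/prosodic evidence as emissions, can be sketched with a toy Viterbi decoder. All probabilities below are invented for illustration.

```python
import math

# Toy HMM over dialogue acts: transitions from a dialogue-act bigram,
# emissions are per-act likelihoods of the observed evidence at each
# utterance. All numbers are invented.
ACTS = ["Statement", "Question", "Backchannel"]
bigram = {  # P(next act | previous act)
    "Statement":   {"Statement": .5, "Question": .3, "Backchannel": .2},
    "Question":    {"Statement": .7, "Question": .1, "Backchannel": .2},
    "Backchannel": {"Statement": .6, "Question": .3, "Backchannel": .1},
}
emit = [  # P(evidence_t | act), e.g. from word n-grams plus prosody
    {"Statement": .6, "Question": .3, "Backchannel": .1},
    {"Statement": .2, "Question": .7, "Backchannel": .1},
    {"Statement": .5, "Question": .1, "Backchannel": .4},
]

def viterbi(emit, bigram, acts):
    # best[a] = (log prob of best path ending in act a, that path)
    best = {a: (math.log(1 / len(acts)) + math.log(emit[0][a]), [a]) for a in acts}
    for obs in emit[1:]:
        best = {
            a: max(
                (score + math.log(bigram[prev][a]) + math.log(obs[a]), path + [a])
                for prev, (score, path) in best.items()
            )
            for a in acts
        }
    return max(best.values())[1]

print(viterbi(emit, bigram, ACTS))  # ['Statement', 'Question', 'Statement']
```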
28

Steeneken, Herman J. M., and Andrew Varga. "Assessment for automatic speech recognition: I. Comparison of assessment methods." Speech Communication 12, no. 3 (July 1993): 241–46. http://dx.doi.org/10.1016/0167-6393(93)90094-2.

29

Hadiwinoto, P. N., and D. P. Lestari. "Data augmentation on spontaneous Indonesian automatic speech recognition using statistical machine translation." IOP Conference Series: Materials Science and Engineering 803 (May 28, 2020): 012030. http://dx.doi.org/10.1088/1757-899x/803/1/012030.

30

Partila, Pavol, Miroslav Voznak, and Jaromir Tovarek. "Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System." Scientific World Journal 2015 (2015): 1–7. http://dx.doi.org/10.1155/2015/573068.

Abstract:
The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. Classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with respect to its accuracy and efficiency.
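
The experimental pattern, several classifiers crossed with feature selection, can be sketched with scikit-learn. This uses random stand-in data and omits the GMM classifier, so it only illustrates the evaluation loop, not the paper's features or results.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Stand-in data: rows are utterances, columns are prosodic/spectral/voice-
# quality features, labels are emotion classes. Real features would come
# from an acoustic feature extraction toolkit.
X = np.random.randn(200, 40)
y = np.random.randint(0, 4, size=200)

for name, clf in [("k-NN", KNeighborsClassifier(5)),
                  ("ANN", MLPClassifier(max_iter=500))]:
    pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=15), clf)
    acc = cross_val_score(pipe, X, y, cv=5).mean()  # accuracy per combination
    print(f"{name}: {acc:.2f}")
```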
31

Singh, Satyanand. "High level speaker specific features modeling in automatic speaker recognition system." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 2 (April 1, 2020): 1859. http://dx.doi.org/10.11591/ijece.v10i2.pp1859-1867.

Abstract:
Spoken words convey several levels of information. At the primary level, the speech conveys words or spoken messages, but at the secondary level, the speech also reveals information about the speakers. This work is based on high-level speaker-specific features and statistical speaker modeling techniques that express the characteristic sound of the human voice. Hidden Markov model (HMM), Gaussian mixture model (GMM), and Linear Discriminant Analysis (LDA) models are used to build an Automatic Speaker Recognition (ASR) system that is computationally inexpensive and can recognize speakers regardless of what is said. The performance of the ASR system is evaluated from clear speech to a wide range of speech quality using the standard TIMIT speech corpus. The ASR efficiency of the HMM, GMM, and LDA based modeling techniques is 98.8%, 99.1%, and 98.6%, and the Equal Error Rate (EER) is 4.5%, 4.4% and 4.55%, respectively. The EER improvement of the GMM modeling technique based ASR system compared with HMM and LDA is 4.25% and 8.51%, respectively.
32

Skowronski, Mark D., and John G. Harris. "Statistical automatic species identification of microchiroptera from echolocation calls: Lessons learned from human automatic speech recognition." Journal of the Acoustical Society of America 116, no. 4 (October 2004): 2639. http://dx.doi.org/10.1121/1.4808665.

33

Ding, Ing-Jr, and Yen-Ming Hsu. "An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition." Mathematical Problems in Engineering 2014 (2014): 1–8. http://dx.doi.org/10.1155/2014/898729.

Abstract:
In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), which is feature-based template matching and belongs to the category of dynamic programming (DP) techniques. Although DTW is an early ASR technique, it has remained popular in many applications, and it now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model- (HMM-) like method where the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method, transforming feature-based DTW recognition into model-based DTW recognition, behaves like the HMM recognition technique and therefore is able to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system using HMM-like DTW.
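
For reference, the classic DTW kernel that the paper builds on can be sketched as follows; the feature sequences here are random stand-ins for MFCC frames.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two feature sequences
    (frames x dims), the template-matching kernel described above."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.random.randn(40, 13)   # e.g. 13-dim MFCC frames of a stored word
utterance = np.random.randn(46, 13)  # incoming utterance to score
print(dtw_distance(template, utterance))
```

The paper's HMM-like variant replaces this purely template-based matching with a model-based formulation, which is what makes speaker adaptation possible.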
34

Kawahara, Tatsuya. "Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet)." Proceedings of the AAAI Conference on Artificial Intelligence 26, no. 2 (July 22, 2012): 2224–28. http://dx.doi.org/10.1609/aaai.v26i2.18962.

Abstract:
This article describes a new automatic transcription system in the Japanese Parliament which deploys our automatic speech recognition (ASR) technology. To achieve high recognition performance in spontaneous meeting speech, we have investigated an efficient training scheme with minimal supervision which can exploit a huge amount of real data. Specifically, we have proposed a lightly-supervised training scheme based on statistical language model transformation, which fills the gap between faithful transcripts of spoken utterances and final texts for documentation. Once this mapping is trained, we no longer need faithful transcripts for training both acoustic and language models. Instead, we can fully exploit the speech and text data available in Parliament as they are. This scheme also realizes a sustainable ASR system which evolves, i.e. update/re-train the models, only with speech and text generated during the system operation. The ASR system has been deployed in the Japanese Parliament since 2010, and consistently achieved character accuracy of nearly 90%, which is useful for streamlining the transcription process.
35

Jafari, Ayyoob, and Farshad Almasganj. "Using Nonlinear Modeling of Reconstructed Phase Space and Frequency Domain Analysis to Improve Automatic Speech Recognition Performance." International Journal of Bifurcation and Chaos 22, no. 03 (March 2012): 1250053. http://dx.doi.org/10.1142/s0218127412500538.

Abstract:
This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to simultaneously benefit from features obtained from nonlinear modeling applied to the speech reconstructed phase space (RPS) and typical Mel frequency cepstral coefficients (MFCCs), which have a proven role in the speech recognition field. With an appropriate dimension, the reconstructed phase space of a speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. In the first part of this paper, the application of Lyapunov exponents (LE) and fractal dimension, two common chaotic features, to speech recognition is tested, followed by a short discussion of the weaknesses of these features in speech recognition. Next, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to the speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the GMM model parameters and MFCC-based features. A hidden Markov model (HMM) based speech recognition system and the TIMIT speech database are used to evaluate the performance of the proposed feature set in isolated and continuous speech recognition experiments. In the final continuous speech recognition (CSR) experiments using tri-phone models, a 3.7% absolute phoneme recognition accuracy improvement over MFCC features alone was obtained.
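
The reconstructed phase space itself is just a time-delay embedding, which is easy to sketch. The embedding dimension and delay below are arbitrary illustrative choices; in the paper, GMMs are then fitted to these point clouds to obtain features.

```python
import numpy as np

def reconstructed_phase_space(x, dim=3, delay=6):
    """Time-delay embedding of a scalar signal: row t is
    (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

frame = np.sin(np.linspace(0, 20 * np.pi, 400))  # stand-in for a speech frame
rps = reconstructed_phase_space(frame)
print(rps.shape)  # (388, 3) embedded points; a GMM fit to these yields features
```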
36

Rojathai, S., and M. Venkatesulu. "Investigation of ANFIS and FFBNN Recognition Methods Performance in Tamil Speech Word Recognition." International Journal of Software Innovation 2, no. 2 (April 2014): 43–53. http://dx.doi.org/10.4018/ijsi.2014040103.

Abstract:
In speech word recognition systems, feature extraction and recognition play a most significant role. A large number of feature extraction and recognition methods are available in existing speech word recognition systems. The most recent Tamil speech word recognition system has given high speech word recognition performance with PAC-ANFIS compared to earlier Tamil speech word recognition systems. So an investigation of speech word recognition by various recognition methods is needed to prove their performance in speech word recognition. This paper presents the investigation process with the well-known artificial intelligence methods of Feed Forward Back Propagation Neural Network (FFBNN) and Adaptive Neuro Fuzzy Inference System (ANFIS). The Tamil speech word recognition system with PAC-FFBNN performance is analyzed in terms of statistical measures and Word Recognition Rate (WRR) and compared with PAC-ANFIS and other existing Tamil speech word recognition systems.
37

Dua, Mohit, Rajesh Kumar Aggarwal, and Mantosh Biswas. "Optimizing Integrated Features for Hindi Automatic Speech Recognition System." Journal of Intelligent Systems 29, no. 1 (October 1, 2018): 959–76. http://dx.doi.org/10.1515/jisys-2018-0057.

Abstract:
An automatic speech recognition (ASR) system translates spoken words or utterances (isolated, connected, continuous, and spontaneous) into text format. State-of-the-art ASR systems mainly use Mel frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and Gammatone frequency cepstral coefficients (GFCC) for extracting features in the training phase of the ASR system. Initially, the paper proposes a sequential combination of all three feature extraction methods, taking two at a time. Six combinations, MF-PLP, PLP-MFCC, MF-GFCC, GF-MFCC, GF-PLP, and PLP-GFCC, are used, and the accuracy of the proposed system using all these combinations was tested. The results show that the GF-MFCC and MF-GFCC integrations outperform all other proposed integrations. Further, these two feature vector integrations are optimized using three different optimization methods: particle swarm optimization (PSO), PSO with crossover, and PSO with quadratic crossover (Q-PSO). The results demonstrate that the Q-PSO-optimized GF-MFCC integration shows significant improvement over all other optimized combinations.
38

Liu, Chang, Pengyuan Zhang, Ta Li, and Yonghong Yan. "Semantic Features Based N-Best Rescoring Methods for Automatic Speech Recognition." Applied Sciences 9, no. 23 (November 22, 2019): 5053. http://dx.doi.org/10.3390/app9235053.

Abstract:
In this work, we aim to re-rank the n-best hypotheses of an automatic speech recognition system by punishing sentences that contain words semantically different from the context and rewarding sentences where all words are in semantic harmony. To achieve this, we propose a topic similarity score that measures the difference between the topic distribution of words and that of the corresponding sentence. We also propose a word-discourse score that quantifies the likelihood of a word appearing in the sentence via the inner product of the word vector and the discourse vector. Besides, we use the latent semantic marginal and a variation of the log bi-linear model to obtain a sentence coordination score. In addition, we introduce a fallibility weight, which assists the computation of the sentence semantic coordination score by instructing the model to pay more attention to words that appear less often in the hypothesis list, and we show how to use the scores and the fallibility weight in hypothesis rescoring. None of the rescoring methods need extra parameters other than the semantic models. Experiments conducted on the Wall Street Journal corpus show that, by using the proposed word-discourse score on 50-dimension word embeddings, we can achieve 0.29% and 0.51% absolute word error rate (WER) reductions on the two test sets.
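
A hedged sketch of the word-discourse rescoring idea: add to each hypothesis's ASR score the mean inner product between its word vectors and a discourse vector. Here the discourse vector is approximated by the average word vector, and the interpolation weight and vectors are invented; the paper's exact formulation may differ.

```python
import numpy as np

def rescore_nbest(hypotheses, word_vecs, lam=0.5):
    """Re-rank (asr_score, words) pairs with a word-discourse score: the
    mean inner product between each word vector and the sentence's
    discourse vector (approximated as the average word vector)."""
    rescored = []
    for asr_score, words in hypotheses:
        vecs = np.array([word_vecs[w] for w in words if w in word_vecs])
        if len(vecs) == 0:
            rescored.append((asr_score, words))
            continue
        discourse = vecs.mean(axis=0)
        semantic = float(np.mean(vecs @ discourse))
        rescored.append((asr_score + lam * semantic, words))
    return max(rescored)[1]

rng = np.random.default_rng(0)
word_vecs = {w: rng.standard_normal(50) for w in
             "the fed raised interest rates paid no".split()}
nbest = [(-12.3, ["the", "fed", "raised", "interest", "rates"]),
         (-12.1, ["the", "fed", "raised", "interest", "paid"])]
print(rescore_nbest(nbest, word_vecs))
```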
39

Stern, Richard, and Nelson Morgan. "Hearing Is Believing: Biologically Inspired Methods for Robust Automatic Speech Recognition." IEEE Signal Processing Magazine 29, no. 6 (November 2012): 34–43. http://dx.doi.org/10.1109/msp.2012.2207989.

40

Deng, Li, and Don X. Sun. "A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features." Journal of the Acoustical Society of America 95, no. 5 (May 1994): 2702–19. http://dx.doi.org/10.1121/1.409839.

41

Mamyrbayev, Orken, Keylan Alimhan, Dina Oralbekova, Akbayan Bekarystankyzy, and Bagashar Zhumazhanov. "Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level." Eastern-European Journal of Enterprise Technologies 1, no. 9(115) (February 28, 2022): 84–92. http://dx.doi.org/10.15587/1729-4061.2022.252801.

Abstract:
Ensuring the best quality and performance of modern speech technologies is possible today through the widespread use of machine learning methods. The idea of this project is to study and implement an end-to-end system of automatic speech recognition using machine learning methods, as well as to develop new mathematical models and algorithms for solving the problem of automatic speech recognition for agglutinative (Turkic) languages. Many research papers have shown that deep learning methods make it easier to train automatic speech recognition systems that use an end-to-end approach. This method can also train an automatic speech recognition system directly, that is, without manual work with raw signals. Despite the good recognition quality, this model has a drawback: the need for a large amount of data for training. This is a serious problem for low-data languages, especially Turkic languages such as Kazakh and Azerbaijani. To solve this problem, various methods need to be applied: transfer learning for low-resource languages, and multi-task learning for large resources. To increase efficiency and quickly solve the problem associated with a limited resource, transfer learning was used for the end-to-end model. The transfer learning method helped to fit a model trained on the Kazakh dataset to the Azerbaijani dataset; thereby, two language corpora were trained simultaneously. Experiments with the two corpora show that transfer learning can reduce the symbol (phoneme) error rate (PER) by 14.23% compared to baseline models (DNN+HMM, WaveNet, and CNC+LM). Therefore, the realized model with the transfer method can be used to recognize other low-resource languages.
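
A minimal PyTorch-style sketch of the transfer, assuming a toy encoder-plus-output architecture: pretrained encoder weights are copied to the target-language model and the output layer is re-initialized for the new symbol inventory. The model and hyperparameters are hypothetical, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TinyASR(nn.Module):
    """Hypothetical miniature end-to-end model: a recurrent encoder
    followed by a per-frame symbol classifier."""
    def __init__(self, n_feats=80, hidden=256, n_symbols=40):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.output = nn.Linear(hidden, n_symbols)

    def forward(self, x):
        h, _ = self.encoder(x)
        return self.output(h)

source = TinyASR(n_symbols=42)   # pretrained on the high-resource language
target = TinyASR(n_symbols=35)   # new symbol inventory for the target language
target.encoder.load_state_dict(source.encoder.state_dict())  # transfer encoder
for p in target.encoder.parameters():   # optionally freeze the encoder first,
    p.requires_grad = False             # then fine-tune on the target data
optim = torch.optim.Adam(
    [p for p in target.parameters() if p.requires_grad], lr=1e-4)
```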
42

Raval, Deepang, Vyom Pathak, Muktan Patel, and Brijesh Bhatt. "Improving Deep Learning based Automatic Speech Recognition for Gujarati." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 3 (May 31, 2022): 1–18. http://dx.doi.org/10.1145/3483446.

Abstract:
We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.
43

Matveev, Yuri, Anton Matveev, Olga Frolova, Elena Lyakso, and Nersisson Ruban. "Automatic Speech Emotion Recognition of Younger School Age Children." Mathematics 10, no. 14 (July 6, 2022): 2373. http://dx.doi.org/10.3390/math10142373.

Abstract:
This paper introduces the extended description of a database that contains emotional speech in the Russian language of younger school age (8–12-year-old) children and describes the results of validation of the database based on classical machine learning algorithms, such as Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP). The validation is performed using standard procedures and scenarios of the validation similar to other well-known databases of children’s emotional acting speech. Performance evaluation of automatic multiclass recognition on four emotion classes “Neutral (Calm)—Joy—Sadness—Anger” shows the superiority of SVM performance and also MLP performance over the results of perceptual tests. Moreover, the results of automatic recognition on the test dataset which was used in the perceptual test are even better. These results prove that emotions in the database can be reliably recognized both by experts and automatically using classical machine learning algorithms such as SVM and MLP, which can be used as baselines for comparing emotion recognition systems based on more sophisticated modern machine learning methods and deep neural networks. The results also confirm that this database can be a valuable resource for researchers studying affective reactions in speech communication during child-computer interactions in the Russian language and can be used to develop various edutainment, health care, etc. applications.
44

Liao, Lyuchao, Francis Afedzie Kwofie, Zhifeng Chen, Guangjie Han, Yongqiang Wang, Yuyuan Lin, and Dongmei Hu. "A Bidirectional Context Embedding Transformer for Automatic Speech Recognition." Information 13, no. 2 (January 29, 2022): 69. http://dx.doi.org/10.3390/info13020069.

Abstract:
Transformers have become popular in building end-to-end automatic speech recognition (ASR) systems. However, transformer ASR systems are usually trained to give output sequences in the left-to-right order, disregarding the right-to-left context. Currently, the existing transformer-based ASR systems that employ two decoders for bidirectional decoding are complex in terms of computation and optimization. The existing ASR transformer with a single decoder for bidirectional decoding requires extra methods (such as a self-mask) to resolve the problem of information leakage in the attention mechanism. This paper explores different options for the development of a speech transformer that utilizes a single decoder equipped with bidirectional context embedding (BCE) for bidirectional decoding. The decoding direction, which is set up at the input level, enables the model to attend to different directional contexts without extra decoders and also alleviates any information leakage. The effectiveness of this method was verified with a bidirectional beam search method that generates bidirectional output sequences and determines the best hypothesis according to the output score. We achieved a word error rate (WER) of 7.65%/18.97% on the clean/other LibriSpeech test sets, outperforming the left-to-right decoding style in our work by 3.17%/3.47%. The results are also close to, or better than, other state-of-the-art end-to-end models.
45

Garberg, Roger B. "Automatic Speech Recognition Applications: A Study of Methods for Defining Command Vocabularies." Proceedings of the Human Factors and Ergonomics Society Annual Meeting 39, no. 3 (October 1995): 203–7. http://dx.doi.org/10.1177/154193129503900307.

Abstract:
Phoneme-based automatic speech recognition (ASR) technology enables designers to easily create custom command words or phrases that users can employ to request service operations. In this paper, I report results from two experiments concerning important dimensions of these ASR command vocabularies, including command naturalness/appropriateness and command recallability. Ease of recall is a critical dimension for assessing ASR commands used in multi-step applications, since service subscribers may be engaged in several different cognitive activities that divide attention. Yet techniques for measuring command recallability can be difficult to implement owing to the time required for data collection and analysis. Results of these studies indicate that the dimensions of command naturalness and memorability are closely related: under appropriate conditions, the simple procedures associated with measuring command naturalness or appropriateness can predict the retrievability of command expressions.
46

Aggarwal, Rajesh Kumar, and Mayank Dave. "Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I)." International Journal of Speech Technology 14, no. 4 (September 23, 2011): 297–308. http://dx.doi.org/10.1007/s10772-011-9108-2.

47

Haider, Fasih, Pierre Albert, and Saturnino Luz. "User Identity Protection in Automatic Emotion Recognition through Disguised Speech." AI 2, no. 4 (November 25, 2021): 636–49. http://dx.doi.org/10.3390/ai2040038.

Abstract:
Ambient Assisted Living (AAL) technologies are being developed which could assist elderly people to live healthy and active lives. These technologies have been used to monitor people’s daily exercises, consumption of calories and sleep patterns, and to provide coaching interventions to foster positive behaviour. Speech and audio processing can be used to complement such AAL technologies to inform interventions for healthy ageing by analyzing speech data captured in the user’s home. However, collection of data in home settings presents challenges. One of the most pressing challenges concerns how to manage privacy and data protection. To address this issue, we proposed a low cost system for recording disguised speech signals which can protect user identity by using pitch shifting. The disguised speech so recorded can then be used for training machine learning models for affective behaviour monitoring. Affective behaviour could provide an indicator of the onset of mental health issues such as depression and cognitive impairment, and help develop clinical tools for automatically detecting and monitoring disease progression. In this article, acoustic features extracted from the non-disguised and disguised speech are evaluated in an affect recognition task using six different machine learning classification methods. The results of transfer learning from non-disguised to disguised speech are also demonstrated. We have identified sets of acoustic features which are not affected by the pitch shifting algorithm and also evaluated them in affect recognition. We found that, while the non-disguised speech signal gives the best Unweighted Average Recall (UAR) of 80.01%, the disguised speech signal only causes a slight degradation of performance, reaching 76.29%. The transfer learning from non-disguised to disguised speech results in a reduction of UAR (65.13%). However, feature selection improves the UAR (68.32%). This approach forms part of a large project which includes health and wellbeing monitoring and coaching.
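
The privacy mechanism, pitch shifting the recording before any storage or feature extraction, can be sketched with librosa; the synthetic signal and the four-semitone shift are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in for captured home audio

# Disguise the signal by shifting its pitch four semitones up before any
# storage; the shift amount here is an arbitrary illustrative choice.
y_disguised = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
# y_disguised, not y, would then feed the affect-recognition front end.
```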
48

Pipiras, Laurynas, Rytis Maskeliūnas, and Robertas Damaševičius. "Lithuanian Speech Recognition Using Purely Phonetic Deep Learning." Computers 8, no. 4 (October 18, 2019): 76. http://dx.doi.org/10.3390/computers8040076.

Abstract:
Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English. The problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to complicated language structure and scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language, which is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with attention mechanism. The performance of these models is evaluated in isolated speech recognition task (with an accuracy of 0.993) and long phrase recognition task (with an accuracy of 0.992).
49

Proksch, Sven-Oliver, Christopher Wratil, and Jens Wäckerle. "Testing the Validity of Automatic Speech Recognition for Political Text Analysis." Political Analysis 27, no. 3 (February 19, 2019): 339–59. http://dx.doi.org/10.1017/pan.2018.62.

Abstract:
The analysis of political texts from parliamentary speeches, party manifestos, social media, or press releases forms the basis of major and growing fields in political science, not least since advances in "text-as-data" methods have rendered the analysis of large text corpora straightforward. However, a lot of sources of political speech are not regularly transcribed, and their on-demand transcription by humans is prohibitively expensive for research purposes. This class includes political speech in certain legislatures, during political party conferences, as well as television interviews and talk shows. We showcase how scholars can use automatic speech recognition systems to analyze such speech with quantitative text analysis models of the "bag-of-words" variety. To probe results for robustness to transcription error, we present an original "word error rate simulation" (WERSIM) procedure implemented in R. We demonstrate the potential of automatic speech recognition to address open questions in political science with two substantive applications and discuss its limitations and practical challenges.
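
WERSIM itself is published in R; below is a rough Python analogue of its spirit, corrupting a transcript with random substitutions, deletions, and insertions at a target error rate. This is an assumption about the general procedure, not the authors' implementation.

```python
import random

def simulate_wer(words, wer=0.2, vocab=None, seed=0):
    """Corrupt a transcript with random substitutions, deletions, and
    insertions so that roughly `wer` of the words are affected. A rough
    Python analogue of WERSIM's spirit, not the authors' R code."""
    rng = random.Random(seed)
    vocab = vocab or words
    out = []
    for w in words:
        if rng.random() < wer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(vocab))
            elif op == "ins":
                out.extend([w, rng.choice(vocab)])
            # "del": drop the word entirely
        else:
            out.append(w)
    return out

text = "we showcase how scholars can use automatic speech recognition".split()
print(" ".join(simulate_wer(text, wer=0.3)))
```

Re-running a downstream text model on many such corrupted corpora shows how sensitive its estimates are to transcription error.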
50

Şchiopu, Daniela. "Using Statistical Methods in a Speech Recognition System for Romanian Language." IFAC Proceedings Volumes 46, no. 28 (2013): 99–103. http://dx.doi.org/10.3182/20130925-3-cz-3023.00078.
