Acceder

Bibliografías temáticas / Automatic speech recognition / Tesis

Tesis sobre el tema "Automatic speech recognition"

Siga este enlace para ver otros tipos de publicaciones sobre el tema: Automatic speech recognition.

Autor: Grafiati

Publicado: 4 de junio de 2021

Última modificación: 30 de julio de 2024

Crea una cita precisa en los estilos APA, MLA, Chicago, Harvard y otros

Elija tipo de fuente:

Consulte los 50 mejores tesis para su investigación sobre el tema "Automatic speech recognition".

Junto a cada fuente en la lista de referencias hay un botón "Agregar a la bibliografía". Pulsa este botón, y generaremos automáticamente la referencia bibliográfica para la obra elegida en el estilo de cita que necesites: APA, MLA, Harvard, Vancouver, Chicago, etc.

También puede descargar el texto completo de la publicación académica en formato pdf y leer en línea su resumen siempre que esté disponible en los metadatos.

Explore tesis sobre una amplia variedad de disciplinas y organice su bibliografía correctamente.

1

Alcaraz, Meseguer Noelia. "Speech Analysis for Automatic Speech Recognition". Thesis, Norwegian University of Science and Technology, Department of Electronics and Telecommunications, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9092.

Texto completo

Resumen

The classical front end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set of them is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel- frequency scale), and then decorrelated by using a modified discrete cosine transform. Following a focused introduction on speech production, perception and analysis, this paper gives a study of the implementation of a speech generative model; whereby the speech is synthesized and recovered back from its MFCC representations. The work has been developed into two steps: first, the computation of the MFCC vectors from the source speech files by using HTK Software; and second, the implementation of the generative model in itself, which, actually, represents the conversion chain from HTK-generated MFCC vectors to speech reconstruction. In order to know the goodness of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed. For that, spectral models based on Linear Prediction Coding (LPC) analysis have been used. During the implementation of the generative model some results have been obtained in terms of the reconstruction of the spectral representation and the quality of the synthesized speech.

Los estilos APA, Harvard, Vancouver, ISO, etc.

2

Gabriel, Naveen. "Automatic Speech Recognition in Somali". Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.

Texto completo

Resumen

The field of speech recognition during the last decade has left the research stage and found its way into the public market, and today, speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most of the current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model to solve speech recognition. In this thesis, automatic speech recognition using HMM-GMM was trained on Somali data which consisted of voice recording and its transcription. HMM-GMM is a hybrid system in which the framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable is the observed sequence of words. This thesis begins with background about speech recognition. Literature survey covers some of the work that has been done in this field. This thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. Also, log scores were calculated for the top 5 predicted sentences and confidence measures of pre-dicted sentences. The model was trained on 4.5 hrs of voiced data and its corresponding transcription. It was evaluated on 3 mins of testing data. The performance of the trained model on the test set was good, given that the data was devoid of any background noise and lack of variability. The performance of the model is measured using word error rate(WER) and sentence error rate (SER). The performance of the implemented model is also compared with the results of other research work. This thesis also discusses why log and confidence score of the sentence might not be a good way to measure the performance of the resulting model. It also discusses the shortcoming of the HMM-GMM model, how the existing model can be improved, and different alternatives to solve the problem.

Los estilos APA, Harvard, Vancouver, ISO, etc.

3

Al-Shareef, Sarah. "Conversational Arabic Automatic Speech Recognition". Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/10145/.

Texto completo

Resumen

Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are considered generally to be mother-tongues in those regions. CA has limited textual resource because it exists only as a spoken language and without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed that has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure where words can be created from attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues for the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage, and insufficient training data for written CA on language modelling is investigated; second, obtaining better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions result from each direction. Using data-driven classes from decomposed text are shown to reduce out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task.

Los estilos APA, Harvard, Vancouver, ISO, etc.

4

Jalalvand, Shahab. "Automatic Speech Recognition Quality Estimation". Doctoral thesis, Università degli studi di Trento, 2017. https://hdl.handle.net/11572/368743.

Texto completo

Resumen

Evaluation of automatic speech recognition (ASR) systems is difficult and costly, since it requires manual transcriptions. This evaluation is usually done by computing word error rate (WER) that is the most popular metric in ASR community. Such computation is doable only if the manual references are available, whereas in the real-life applications, it is a too rigid condition. A reference-free metric to evaluate the ASR performance is \textit{confidence measure} which is provided by the ASR decoder. However, the confidence measure is not always available, especially in commercial ASR usages. Even if available, this measure is usually biased towards the decoder. From this perspective, the confidence measure is not suitable for comparison purposes, for example between two ASR systems. These issues motivate the necessity of an automatic quality estimation system for ASR outputs. This thesis explores ASR quality estimation (ASR QE) from different perspectives including: feature engineering, learning algorithms and applications. From feature engineering perspective, a wide range of features extractable from input signal and output transcription are studied. These features represent the quality of the recognition from different aspects and they are divided into four groups: signal, textual, hybrid and word-based features. From learning point of view, we address two main approaches: i) QE via regression, suitable for single hypothesis scenario; ii) QE via machine-learned ranking (MLR), suitable for multiple hypotheses scenario. In the former, a regression model is used to predict the WER score of each single hypothesis that is created through a single automatic transcription channel. In the latter, a ranking model is used to predict the order of multiple hypotheses with respect to their quality. Multiple hypotheses are mainly generated by several ASR systems or several recording microphones. From application point of view, we introduce two applications in which ASR QE makes salient improvement in terms of WER: i) QE-informed data selection for acoustic model adaptation; ii) QE-informed system combination. In the former, we exploit single hypothesis ASR QE methods in order to select the best adaptation data for upgrading the acoustic model. In the latter, we exploit multiple hypotheses ASR QE methods to rank and combine the automatic transcriptions in a supervised manner. The experiments are mostly conducted on CHiME-3 English dataset. CHiME-3 consists of Wall Street Journal utterances, recorded by multiple far distant microphones in noisy environments. The results show that QE-informed acoustic model adaptation leads to 1.8\% absolute WER reduction and QE-informed system combination leads to 1.7% absolute WER reduction in CHiME-3 task. The outcomes of this thesis are packed in the frame of an open source toolkit named TranscRater -transcription rating toolkit- (https://github.com/hlt-mt/TranscRater) which has been developed based on the aforementioned studies. TranscRater can be used to extract informative features, train the QE models and predict the quality of the reference-less recognitions in a variety of ASR tasks.

Los estilos APA, Harvard, Vancouver, ISO, etc.

5

Jalalvand, Shahab. "Automatic Speech Recognition Quality Estimation". Doctoral thesis, University of Trento, 2017. http://eprints-phd.biblio.unitn.it/2058/1/PhD_Thesis.pdf.

Texto completo

Resumen

Evaluation of automatic speech recognition (ASR) systems is difficult and costly, since it requires manual transcriptions. This evaluation is usually done by computing word error rate (WER) that is the most popular metric in ASR community. Such computation is doable only if the manual references are available, whereas in the real-life applications, it is a too rigid condition. A reference-free metric to evaluate the ASR performance is \textit{confidence measure} which is provided by the ASR decoder. However, the confidence measure is not always available, especially in commercial ASR usages. Even if available, this measure is usually biased towards the decoder. From this perspective, the confidence measure is not suitable for comparison purposes, for example between two ASR systems. These issues motivate the necessity of an automatic quality estimation system for ASR outputs. This thesis explores ASR quality estimation (ASR QE) from different perspectives including: feature engineering, learning algorithms and applications. From feature engineering perspective, a wide range of features extractable from input signal and output transcription are studied. These features represent the quality of the recognition from different aspects and they are divided into four groups: signal, textual, hybrid and word-based features. From learning point of view, we address two main approaches: i) QE via regression, suitable for single hypothesis scenario; ii) QE via machine-learned ranking (MLR), suitable for multiple hypotheses scenario. In the former, a regression model is used to predict the WER score of each single hypothesis that is created through a single automatic transcription channel. In the latter, a ranking model is used to predict the order of multiple hypotheses with respect to their quality. Multiple hypotheses are mainly generated by several ASR systems or several recording microphones. From application point of view, we introduce two applications in which ASR QE makes salient improvement in terms of WER: i) QE-informed data selection for acoustic model adaptation; ii) QE-informed system combination. In the former, we exploit single hypothesis ASR QE methods in order to select the best adaptation data for upgrading the acoustic model. In the latter, we exploit multiple hypotheses ASR QE methods to rank and combine the automatic transcriptions in a supervised manner. The experiments are mostly conducted on CHiME-3 English dataset. CHiME-3 consists of Wall Street Journal utterances, recorded by multiple far distant microphones in noisy environments. The results show that QE-informed acoustic model adaptation leads to 1.8\% absolute WER reduction and QE-informed system combination leads to 1.7% absolute WER reduction in CHiME-3 task. The outcomes of this thesis are packed in the frame of an open source toolkit named TranscRater -transcription rating toolkit- (https://github.com/hlt-mt/TranscRater) which has been developed based on the aforementioned studies. TranscRater can be used to extract informative features, train the QE models and predict the quality of the reference-less recognitions in a variety of ASR tasks.

Los estilos APA, Harvard, Vancouver, ISO, etc.

6

Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation". The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

7

Seward, Alexander. "Efficient Methods for Automatic Speech Recognition". Doctoral thesis, KTH, Tal, musik och hörsel, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3675.

Texto completo

Resumen

This thesis presents work in the area of automatic speech recognition (ASR). The thesis focuses on methods for increasing the efficiency of speech recognition systems and on techniques for efficient representation of different types of knowledge in the decoding process. In this work, several decoding algorithms and recognition systems have been developed, aimed at various recognition tasks. The thesis presents the KTH large vocabulary speech recognition system. The system was developed for online (live) recognition with large vocabularies and complex language models. The system utilizes weighted transducer theory for efficient representation of different knowledge sources, with the purpose of optimizing the recognition process. A search algorithm for efficient processing of hidden Markov models (HMMs) is presented. The algorithm is an alternative to the classical Viterbi algorithm for fast computation of shortest paths in HMMs. It is part of a larger decoding strategy aimed at reducing the overall computational complexity in ASR. In this approach, all HMM computations are completely decoupled from the rest of the decoding process. This enables the use of larger vocabularies and more complex language models without an increase of HMM-related computations. Ace is another speech recognition system developed within this work. It is a platform aimed at facilitating the development of speech recognizers and new decoding methods. A real-time system for low-latency online speech transcription is also presented. The system was developed within a project with the goal of improving the possibilities for hard-of-hearing people to use conventional telephony by providing speech-synchronized multimodal feedback. This work addresses several additional requirements implied by this special recognition task.
QC 20100811

Los estilos APA, Harvard, Vancouver, ISO, etc.

8

Vipperla, Ravichander. "Automatic Speech Recognition for ageing voices". Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5725.

Texto completo

Resumen

With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and improve the ASR accuracies for older voices. Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults and the decrease in accuracies is higher for males speakers as compared to females. Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER for the two age groups is proven incorrect. Experiments with artificial introduction of glottal source disfluencies in speech from younger adults do not display a significant impact on WERs. Changes in fundamental frequency observed quite often in older voices has a marginal impact on ASR accuracies. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes especially lower vowels getting more affected with ageing. These changes however are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments to analyse the impact of slower speaking rate on ASR accuracies indicate that the insertion errors increase while decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracies based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracies are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models which also outperform the baseline models by a good margin. Finally, we hypothesize that the ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.

Los estilos APA, Harvard, Vancouver, ISO, etc.

9

Guzy, Julius Jonathan. "Automatic speech recognition : a refutation approach". Thesis, De Montfort University, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.254196.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

10

Deterding, David Henry. "Speaker normalisation for automatic speech recognition". Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.359822.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

11

Badr, Ibrahim. "Pronunciation learning for automatic speech recognition". Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66022.

Texto completo

Resumen

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 99-101).
In many ways, the lexicon remains the Achilles heel of modern automatic speech recognizers (ASRs). Unlike stochastic acoustic and language models that learn the values of their parameters from training data, the baseform pronunciations of words in an ASR vocabulary are typically specified manually, and do not change, unless they are edited by an expert. Our work presents a novel generative framework that uses speech data to learn stochastic lexicons, thereby taking a step towards alleviating the need for manual intervention and automnatically learning high-quality baseform pronunciations for words. We test our model on a variety of domains: an isolated-word telephone speech corpus, a weather query corpus and an academic lecture corpus. We show significant improvements of 25%, 15% and 2% over expert-pronunciation lexicons, respectively. We also show that further improvements can be made by combining our pronunciation learning framework with acoustic model training.
by Ibrahim Badr.
S.M.

Los estilos APA, Harvard, Vancouver, ISO, etc.

12

Chen, Chia-Ping. "Noise robustness in automatic speech recognition /". Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/5829.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

13

Uebler, Ulla. "Multilingual speech recognition /". Berlin : Logos Verlag, 2000. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=009117880&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

14

Evans, N. W. D. "Spectral subtraction for speech enhancement and automatic speech recognition". Thesis, Swansea University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.636935.

Texto completo

Resumen

The contributions made in this thesis relate to an extensive investigation of spectral subtraction in the context of speech enhancement and noise robust automatic speech recognition (ASR) and the morphological processing of speech spectrograms. Three sources of error in a spectral subtraction approach are identified and assessed with ASR. The effects of phase, cross-term component and spectral magnitude errors are assessed in a common spectral subtraction framework. ASR results confirm that, except for extreme noise conditions, phase and cross-term component errors are relatively negligible compared to noise estimate errors. A topology classifying approaches to spectral subtraction into power and magnitude, linear and non-linear spectral subtraction is proposed. Each class is assessed and compared under otherwise identical experimental conditions. These experiments are thought to be the first to assess the four combinations under such controlled conditions. ASR results illustrate a lesser sensitivity to noise over-estimation for non-linear approaches. With a view to practical systems, different approaches to noise estimation are investigated. In particular approaches that do not require explicit voice activity detection are assessed and shown to compare favourably to the conventional approach, the latter requiring explicit voice activity detection. Following on from this finding a new computationally efficient approach to noise estimation that does not require explicit voice activity detection is proposed. Investigations into the fundamentals of spectral subtraction highlight the limitation of noise estimates: statistical estimates obtained from a number of analysis frames lead to relatively poor representations of the instantaneous values. To ameliorate this situation, estimates from neighbouring, lateral frequencies are used to complement within bin (from the same frequency) statistical approaches. Improvements are found to be negligible. However, the principle of these lateral estimates lead naturally to the final stage of the work presented in this thesis, that of morphologically filtering speech spectrograms. This form of processing is examined for both synthesised and speech signals and promising ASR performance is reported. In 2000 the Aurora 2 database was introduced by the organisers of a special session at Eurospeech 2001 entitled ‘Noise Robust Recognition’, aimed at providing a standard database and experimental protocols for the assessment of noise robust ASR. This facility, when it became available, was used for the work described in this thesis.

Los estilos APA, Harvard, Vancouver, ISO, etc.

15

Zhang, Xiaozheng. "Automatic speechreading for improved speech recognition and speaker verification". Diss., Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/13067.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

16

Ragni, Anton. "Discriminative models for speech recognition". Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.707926.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

17

Couper, Kenney Fiona. "Automatic determination of sub-word units for automatic speech recognition". Thesis, University of Edinburgh, 2008. http://hdl.handle.net/1842/2788.

Texto completo

Resumen

Current automatic speech recognition (ASR) research is focused on recognition of continuous, spontaneous speech. Spontaneous speech contains a lot of variability in the way words are pronounced, and canonical pronunciations of each word are not true to the variation that is seen in real data. Two of the components of an ASR system are acoustic models and pronunciation models. The variation within spontaneous speech must be accounted for by these components. Phones, or context-dependent phones are typically used as the base subword unit, and one acoustic model is trained for each sub-word unit. Pronunciation modelling largely takes place in a dictionary, which relates words to sequences of phones. Acoustic modelling and pronunciation modelling overlap, and the two are not clearly separable in modelling pronunciation variation. Techniques that find pronunciation variants in the data and then reflect these in the dictionary have not provided expected gains in recognition. An alternative approach to modelling pronunciations in terms of phones is to derive units automatically: using data-driven methods to determine an inventory of sub-word units, their acoustic models, and their relationship to words. This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are 1. automatic and simultaneous generation of a sub-word unit inventory and acoustic model set, using an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion 2. automatic generation of probabilistic dictionaries using joint multigrams The prerequisites of this approach are fewer than in previous work on unit derivation; notably, the timings of word boundaries are not required here. The approach is language independent since it is entirely data-driven and no linguistic information is required. The dictionary generation method outperforms a supervised method using phonetic data. The automatically derived units and dictionary perform reasonably on a small spontaneous speech task, although not yet outperforming phones.

Los estilos APA, Harvard, Vancouver, ISO, etc.

18

Thambiratnam, David P. "Speech recognition in adverse environments". Thesis, Queensland University of Technology, 1999. https://eprints.qut.edu.au/36099/1/36099_Thambiratnam_1999.pdf.

Texto completo

Resumen

This thesis presents a study of techniques used to improve the performance of small vocabulary isolated word speaker dependent automatic speech recognition systems in adverse environments. Such systems are applicable to 'command and control' applications, for example industrial applications where machines are controlled by voice, providing hands-free and eyes-free operation. Adverse environments present the largest obstacle to the deployment of accurate and usable speech recognition systems. This is because they cause discrepancies between training and testing environments. Two solutions to the problem are investigated. The first is the use of secondary modelling of the output probability distribution of the primary classifiers. It is shown that a significant improvement in performance is obtained for a small vocabulary isolated word speaker dependent system, operating in an adverse environment. Results are presented of simulations using the NOISE.'-: database as well as in an actual factory environment using a real-time system. Based on the outcome of this research, a voice operated parcel sorting machine has been installed at the Australia Post Mail Centre at Underwood, Queensland. A pilot study is also undertaken for the use of lip information to enhance speech recognition accuracy in adverse environments. It is shown that the inclusion of other data sources can improve the performance of a speech recognition system.

Los estilos APA, Harvard, Vancouver, ISO, etc.

19

Tabani, Hamid. "Low-power architectures for automatic speech recognition". Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/462249.

Texto completo

Resumen

Automatic Speech Recognition (ASR) is one of the most important applications in the area of cognitive computing. Fast and accurate ASR is emerging as a key application for mobile and wearable devices. These devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the next years which is changing the way of human-machine interaction. Effective speech recognition systems require real-time recognition, which is challenging for mobile devices due to the compute-intensive nature of the problem and the power constraints of such systems and involves a huge effort for CPU architectures to reach it. GPU architectures offer parallelization capabilities which can be exploited to increase the performance of speech recognition systems. However, efficiently utilizing the GPU resources for speech recognition is also challenging, as the software implementations exhibit irregular and unpredictable memory accesses and poor temporal locality. The purpose of this thesis is to study the characteristics of ASR systems running on low-power mobile devices in order to propose different techniques to improve performance and energy consumption. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. We use a refactored implementation of the GMM evaluation code to ameliorate the impact of branches. Then, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. In addition, we compute the Gaussians for multiple frames in parallel, significantly reducing memory bandwidth usage. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61% energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59% energy savings without any loss in the accuracy of the ASR system. Secondly, we propose a register renaming technique that exploits register reuse to reduce the pressure on the register file. Our technique leverages physical register sharing by introducing minor changes in the register map table and the issue queue. We evaluated our renaming technique on top of a modern out-of-order processor. The proposed scheme supports precise exceptions and we show that it results in 9.5% performance improvements for GMM evaluation. Our experimental results show that the proposed register renaming scheme provides 6% speedup on average for the SPEC2006 benchmarks. Alternatively, our renaming scheme achieves the same performance while reducing the number of physical registers by 10.5%. Finally, we propose a hardware accelerator for GMM evaluation that reduces the energy consumption by three orders of magnitude compared to solutions based on CPUs and GPUs. The proposed accelerator implements a lazy evaluation scheme where Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the GMM parameters, which results in 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and 3532x energy reduction when compared with a highly-tuned implementation running on a modern mobile CPU. Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves 5.89x speedup over a highly optimized CUDA implementation, while reducing energy by 241x.
El reconocimiento automático de voz (ASR) es una de las aplicaciones más importantes en el área de la computación cognitiva. ASR rápido y preciso se está convirtiendo en una aplicación clave para dispositivos móviles y portátiles. Estos dispositivos, como los Smartphones, han incorporado el reconocimiento de voz como una de las principales interfaces de usuario. Es probable que esta tendencia hacia las interfaces de usuario basadas en voz continúe en los próximos años, lo que está cambiando la forma de interacción humano-máquina. Los sistemas de reconocimiento de voz efectivos requieren un reconocimiento en tiempo real, que es un desafío para los dispositivos móviles debido a la naturaleza de cálculo intensivo del problema y las limitaciones de potencia de dichos sistemas y supone un gran esfuerzo para las arquitecturas de CPU. Las arquitecturas GPU ofrecen capacidades de paralelización que pueden aprovecharse para aumentar el rendimiento de los sistemas de reconocimiento de voz. Sin embargo, la utilización eficiente de los recursos de la GPU para el reconocimiento de voz también es un desafío, ya que las implementaciones de software presentan accesos de memoria irregulares e impredecibles y una localidad temporal deficiente. El propósito de esta tesis es estudiar las características de los sistemas ASR que se ejecutan en dispositivos móviles de baja potencia para proponer diferentes técnicas para mejorar el rendimiento y el consumo de energía. Proponemos varias optimizaciones a nivel de software impulsadas por el análisis de potencia y rendimiento. A diferencia de las propuestas anteriores que intercambian precisión por el rendimiento al reducir el número de gaussianas evaluadas, mantenemos la precisión y mejoramos el rendimiento mediante el uso efectivo de la microarquitectura subyacente de la CPU. Usamos una implementación refactorizada del código de evaluación de GMM para reducir el impacto de las instrucciones de salto. Explotamos la unidad vectorial disponible en la mayoría de las CPU modernas para impulsar el cálculo de GMM. Además, calculamos las gaussianas para múltiples frames en paralelo, lo que reduce significativamente el uso de ancho de banda de memoria. Nuestros resultados experimentales muestran que las optimizaciones propuestas proporcionan un speedup de 2.68x sobre el decodificador Pocketsphinx en una CPU Intel Skylake de alta gama, mientras que logra un ahorro de energía del 61%. En segundo lugar, proponemos una técnica de renombrado de registros que explota la reutilización de registros físicos para reducir la presión sobre el banco de registros. Nuestra técnica aprovecha el uso compartido de registros físicos mediante la introducción de cambios en la tabla de renombrado de registros y la issue queue. Evaluamos nuestra técnica de renombrado sobre un procesador moderno. El esquema propuesto admite excepciones precisas y da como resultado mejoras de rendimiento del 9.5% para la evaluación GMM. Nuestros resultados experimentales muestran que el esquema de renombrado de registros propuesto proporciona un 6% de aceleración en promedio para SPEC2006. Finalmente, proponemos un acelerador para la evaluación de GMM que reduce el consumo de energía en tres órdenes de magnitud en comparación con soluciones basadas en CPU y GPU. El acelerador propuesto implementa un esquema de evaluación perezosa donde las GMMs se calculan bajo demanda, evitando el 50% de los cálculos. Finalmente, incluye un esquema de memorización que evita el 74.88% de las operaciones de coma flotante. El diseño final proporciona una aceleración de 164x y una reducción de energía de 3532x en comparación con una implementación altamente optimizada que se ejecuta en una CPU móvil moderna. Comparado con una GPU móvil de última generación, el acelerador de GMM logra un speedup de 5.89x sobre una implementación CUDA optimizada, mientras que reduce la energía en 241x.

Los estilos APA, Harvard, Vancouver, ISO, etc.

20

Martínez, del Hoyo Canterla Alfonso. "Design of Detectors for Automatic Speech Recognition". Doctoral thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-16548.

Texto completo

Resumen

This thesis presents methods and results for optimizing subword detectors in continuous speech. Speech detectors are useful within areas like detection-based ASR, pronunciation training, phonetic analysis, word spotting, etc. Firstly, we propose a structure suitable for subword detection. This structure is based on the standard HMM framework, but in each detector the MFCC feature extractor and the models are trained for the specific detection problem. Our experiments in the TIMIT database validate the effectiveness of this structure for detection of phones and articulatory features. Secondly, two discriminative training techniques are proposed for detector training. The first one is a modification of Minimum Classification Error training. The second one, Minimum Detection Error training, is the adaptation of Minimum Phone Error to the detection problem. Both methods are used to train HMMs and filterbanks in the detectors, isolated or jointly. MDE has the advantage that any detection performance criterion can be optimized directly. F-score and class accuracy optimization experiments show that MDE training is superior to the MCE-based method. The optimized filterbanks reflect some acoustical properties of the detection classes. Moreover, some changes are consistent over classes with similar acoustical properties. In addition, MDE-training of filterbanks results in filters significatively different than in the standard filterbank. In fact, some filters extract information from different critical bands. Finally, we propose a detection-based automatic speech recognition system. Detectors are built with the proposed HMM-based detection structure and trained discriminatively. The linguistic merger is based on an MLP/Viterbi decoder.

Los estilos APA, Harvard, Vancouver, ISO, etc.

21

Bengio, Yoshua. "Connectionist models applied to automatic speech recognition". Thesis, McGill University, 1987. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=63920.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

22

Prager, Richard William. "Parallel processing networks for automatic speech recognition". Thesis, University of Cambridge, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.238443.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

23

Austin, Stephen Christopher. "Hidden Markov models for automatic speech recognition". Thesis, University of Cambridge, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.292913.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

24

Frankel, Joe. "Linear dynamic models for automatic speech recognition". Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/1087.

Texto completo

Resumen

The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMM), in which the output distribution associated with each state is modelled by a mixture of diagonal covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal. This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterize entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before, however a smoothed Gauss-Markov form was used which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments. LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features are compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as choice of state dimension, front-end acoustic parametrization and parameter initialization. Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesized segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A* search. Such a scheme allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature-sets derived from the MOCHA and TIMIT corpora. Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.

Los estilos APA, Harvard, Vancouver, ISO, etc.

25

Gu, Y. "Perceptually-based features in automatic speech recognition". Thesis, Swansea University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.637182.

Texto completo

Resumen

Interspeaker variability of speech features is one of most important problems in automatic speech recognition (ASR), and makes speaker-independent systems much more difficult to achieve than speaker-dependent ones. The work described in the Thesis examines two ideas to overcome this problem. The first attempts to extract more reliable speech features by perceptually-based modelling; the second investigates the speaker variability in this speech feature and reduces its effects by a speaker normalisation scheme. The application of human speech perception in automatic speech recognition is discussed in the Thesis. Several perceptually-based feature analysis techniques are compared in terms of recognition performance, and the effects of individual perceptual parameter encompassed in the feature analysis are investigated. The work demonstrates the benefits of perceptual feature analysis (particularly perceptually-based linear predictive approach) compared with the conventional linear predictive analysis technique. The proposal for speaker normalisation is based on a regional-continuous linear matrix transform function on the perceptual feature space, with an automatic feature classification. This approach is applied in an ASR adaptation system. It is shown that the recognition error rate reduces rapidly when using a few words or a single sentence for adaptation. The adaptation performance demonstrates that such an approach could be very promising for a large vocabulary speaker-independent system.

Los estilos APA, Harvard, Vancouver, ISO, etc.

26

Baothman, Fatmah bint Abdul Rahman. "Phonology-based automatic speech recognition for Arabic". Thesis, University of Huddersfield, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.273720.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

27

Holmes, Wendy Jane. "Modelling segmental variability for automatic speech recognition". Thesis, University College London (University of London), 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.267859.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

28

Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition". Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

29

Duchnowski, Paul. "A new structure for automatic speech recognition". Thesis, Massachusetts Institute of Technology, 1993. http://hdl.handle.net/1721.1/17333.

Texto completo

Resumen

Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1993.
Includes bibliographical references (leaves 102-110).
by Paul Duchnowski.
Sc.D.

Los estilos APA, Harvard, Vancouver, ISO, etc.

30

Wang, Stanley Xinlei. "Using graphone models in automatic speech recognition". Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53114.

Texto completo

Resumen

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
Includes bibliographical references (p. 87-90).
This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown words. For these experiments, graphones models are integrated into the SUMMIT speech recognition framework. First, graphones are applied to automatically generate pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words. Next, a graphone hybrid recognizer is built and tested for searching song lyrics by voice, as well as transcribing spoken lectures in a open vocabulary scenario. These experiments demonstrate significant improvement over traditional word-only speech recognizers. Modifications to the flat hybrid model such as reducing the graphone set size are also considered. Finally, a hierarchical hybrid model is built and compared with the flat hybrid model on the lecture transcription task.
by Stanley Xinlei Wang.
M.Eng.

Los estilos APA, Harvard, Vancouver, ISO, etc.

31

Seigel, Matthew Stephen. "Confidence estimation for automatic speech recognition hypotheses". Thesis, University of Cambridge, 2014. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.648633.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

32

Abdelhamied, Kadry A. "Automatic identification and recognition of deaf speech /". The Ohio State University, 1986. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487266691094027.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

33

Cherri, Mona Youssef 1956. "Automatic Speech Recognition Using Finite Inductive Sequences". Thesis, University of North Texas, 1996. https://digital.library.unt.edu/ark:/67531/metadc277749/.

Texto completo

Resumen

This dissertation addresses the general problem of recognition of acoustic signals which may be derived from speech, sonar, or acoustic phenomena. The specific problem of recognizing speech is the main focus of this research. The intention is to design a recognition system for a definite number of discrete words. For this purpose specifically, eight isolated words from the T1MIT database are selected. Four medium length words "greasy," "dark," "wash," and "water" are used. In addition, four short words are considered "she," "had," "in," and "all." The recognition system addresses the following issues: filtering or preprocessing, training, and decision-making. The preprocessing phase uses linear predictive coding of order 12. Following the filtering process, a vector quantization method is used to further reduce the input data and generate a finite inductive sequence of symbols representative of each input signal. The sequences generated by the vector quantization process of the same word are factored, and a single ruling or reference template is generated and stored in a codebook. This system introduces a new modeling technique which relies heavily on the basic concept that all finite sequences are finitely inductive. This technique is used in the training stage. In order to accommodate the variabilities in speech, the training is performed casualty, and a large number of training speakers is used from eight different dialect regions. Hence, a speaker independent recognition system is realized. The matching process compares the incoming speech with each of the templates stored, and a closeness ration is computed. A ratio table is generated anH the matching word that corresponds to the smallest ratio (i.e. indicating that the ruling has removed most of the symbols) is selected. Promising results were obtained for isolated words, and the recognition rates ranged between 50% and 100%.

Los estilos APA, Harvard, Vancouver, ISO, etc.

34

Colton, Larry Don. "Confidence and rejection in automatic speech recognition /". Full text open access at:, 1997. http://content.ohsu.edu/u?/etd,21.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

35

Li, Jinyu. "Soft margin estimation for automatic speech recognition". Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26613.

Texto completo

Resumen

Thesis (Ph.D)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Dr. Chin-Hui Lee; Committee Member: Dr. Anthony Joseph Yezzi; Committee Member: Dr. Biing-Hwang (Fred) Juang; Committee Member: Dr. Mark Clements; Committee Member: Dr. Ming Yuan. Part of the SMARTech Electronic Thesis and Dissertation Collection.

Los estilos APA, Harvard, Vancouver, ISO, etc.

36

Principi, Emanuele y Emanuele Principi. "Pre-processing techniques for automatic speech recognition". Doctoral thesis, Università Politecnica delle Marche, 2009. http://hdl.handle.net/11566/242152.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

37

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids". Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.285064.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

38

LEBART, KATIA. "Speech dereverberation applied to automatic speech recognition and hearing aids". Rennes 1, 1999. http://www.theses.fr/1999REN10033.

Texto completo

Resumen

Cette these concerne la dereverberation de la parole dans les contextes specifiques de l'application aux appareils pour malentendants ou a la reconnaissance automatique de la parole. Les methodes considerees doivent etre fonctionnelles dans des conditions ou les canaux acoustiques pris en compte sont inconnus et variables. Nous proposons donc de discriminer la reverberation du signal direct a l'aide de proprietes de la reverberation qui sont independantes du canal acoustique. La correlation spatiale des signaux, leurs directions de provenance et leurs supports temporels menent a differentes methodes qui sont examinees successivement. Apres un etat de l'art sur les methodes fondees sur la decorrelation spatiale de la reverberation tardive et leurs limites, nous suggerons des ameliorations pour l'un des algorithmes les plus utilises. Nous presentons ensuite un nouvel algorithme spatialement selectif, qui attenue les contributions de la reverberation en fonction de leur direction. Cet algorithme est complementaire du precedent. Tous deux utilisent deux capteurs. Finalement, nous proposons une methode originale qui attenue efficacement l'effet de masquage par recouvrement de la reverberation. Les methodes sont evaluees a l'aide de diverses mesures objectives (facteur de reduction de bruit, gain en rsb, distance cepstrale et scores de reconnaissance automatique de la parole). Des essais de combinaisons des differentes methodes demontrent le benefice potentiel de telles associations.

Los estilos APA, Harvard, Vancouver, ISO, etc.

39

Johnston, Samuel John Charles y Samuel John Charles Johnston. "An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech". Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/625626.

Texto completo

Resumen

Speech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have resulted in many solutions, though none so far have completely solved the problem. Current solutions generally require some form of estimation of the noise, in order to remove it from the signal. The limitation is that noise can be highly unpredictable and highly variable, both in form and loudness. The present report proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. This method utilizes the human skull as a noise-attenuation device by placing the microphone in the ear canal. For further noise dampening, a pair of noise-reduction earmuffs are used over the speakers' ears. A corpus of speech was recorded with a microphone in the ear canal, while also simultaneously recording speech at the mouth. Noise was emitted from a loudspeaker in the background. Following the data collection, the speech recorded at the ear was analyzed. A substantial noise-reduction benefit was found over mouth-recorded speech. However, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants in this experiment were significantly more likely to understand ear-recorded speech over the noisy, mouth-recorded speech. Yet, participants found mouth-recorded speech with no noise the easiest to understand. These recordings were also used with an ASR system. Since the ear-recorded speech is missing much high-frequency information, it did not recognize the ear-recorded speech readily. However, when an acoustic model was trained low-pass filtered speech, performance improved. These experiments demonstrated that humans, and likely an ASR system, with additional training, would be able to more easily recognize ear-recorded speech than speech in noise. Further speech processing and training may be able to improve the signal's intelligibility for both human and automatic speech recognition.

Los estilos APA, Harvard, Vancouver, ISO, etc.

40

Arrowood, Jon A. "Using observation uncertainty for robust speech recognition". Diss., Available online, Georgia Institute of Technology, 2004:, 2003. http://etd.gatech.edu/theses/available/etd-04082004-180005/unrestricted/arrowood%5Fjon%5Fa%5F200312%5Fphd.pdf.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

41

Gillespie, Bradford W. "Strategies for improving audible quality and speech recognition accuracy of reverberant speech /". Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/5930.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

42

Wrede, Britta. "Modelling the effects of speech rate variation for automatic speech recognition". [S.l. : s.n.], 2002. http://deposit.ddb.de/cgi-bin/dokserv?idn=969765304.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

43

Wilkinson, Nicholas. "Modelling asynchrony in the articulation of speech for automatic speech recognition". Thesis, University of Birmingham, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.399032.

Texto completo

Resumen

Current automatic speech recognition systems make the assumption that all the articulators in the vocal tract move in synchrony with one another to produce speech. This thesis describes the development of a more realistic model that allows some asynchrony between the articulators with the aim of improving speech recognition accuracy. Experiments on the TEVHT database demonstrate that higher phone recognition accuracy is obtained by separate modelling of the voiced and voiceless components of speech by splitting the speech spectrum into high and low frequency bands. To model further articulator asynchrony in speech production requires a representation of speech that is closer to the actual production process. Formant frequency parameters are integrated into typical Mel-frequency cepstral coefficient representation and their effect on recognition accuracy observed. The formant frequency estimates can only accurately be made when the formants are visible in the spectrum, so a technique is developed to ignore frequency estimates generated when the formants are not visible. The formant data allows a unique method of vocal tract normalization, which improves recognition accuracy. Finally a classification experiment examines the potential improvement in speech recognition accuracy of modelling asynchrony between the articulators by allowing asynchrony between all the formants.

Los estilos APA, Harvard, Vancouver, ISO, etc.

44

Livescu, Karen 1975. "Analysis and modeling of non-native speech for automatic speech recognition". Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/80204.

Texto completo

Los estilos APA, Harvard, Vancouver, ISO, etc.

45

Kocour, Martin. "Automatic Speech Recognition System Continually Improving Based on Subtitled Speech Data". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2019. http://www.nusl.cz/ntk/nusl-399164.

Texto completo

Resumen

V dnešnej dobe systémy rozpoznávania reči s veľkým slovníkom dosahujú pomerne vysoké presnosti. Za ich výsledkami však často stoja desiatky ba až stovky hodín manuálne oanotovaných trénovacích dát. Takéto dáta sú často bežne nedostupné alebo pre požadovaný jazyk vôbec neexistujú. Možným riešením je použitie bežne dostupných no menej kvalitných audiovizuálnych dát. Táto práca sa zaoberá technikou zpracovania práve takýchto dát a ich použitím pre trénovanie akustických modelov. Ďalej táto práca pojednáva o možnom využití týchto dát pre kontinuálne vylepšovanie modelov, kedže tieto dáta sú prakticky nevyčerpateľné. Pre tieto účely bol v rámci práce navrhnutý nový prístup pre výber dát.

Los estilos APA, Harvard, Vancouver, ISO, etc.

46

Nel, Pieter Willem. "Automatic syllabification of untranscribed speech". Thesis, Stellenbosch : Stellenbosch University, 2005. http://hdl.handle.net/10019.1/50285.

Texto completo

Resumen

Thesis (MScEng)--Stellenbosch University, 2005.
ENGLISH ABSTRACT: The syllable has been proposed as a unit of automatic speech recognition due to its strong links with human speech production and perception. Recently, it has been proved that incorporating information from syllable-length time-scales into automatic speech recognition improves results in large vocabulary recognition tasks. It was also shown to aid in various language recognition tasks and in foreign accent identification. Therefore, the ability to automatically segment speech into syllables is an important research tool. Where most previous studies employed knowledge-based methods, this study presents a purely statistical method for the automatic syllabification of speech. We introduce the concept of hierarchical hidden Markov model structures and show how these can be used to implement a purely acoustical syllable segmenter based, on general sonority theory, combined with some of the phonotactic constraints found in the English language. The accurate reporting of syllabification results is a problem in the existing literature. We present a well-defined dynamic time warping (DTW) distance measure used for reporting syllabification results. We achieve a token error rate of 20.3% with a 42ms average boundary error on a relatively large set of data. This compares well with previous knowledge-based and statistically- based methods.
AFRIKAANSE OPSOMMING: Die syllabe is voorheen voorgestel as 'n basiese eenheid vir automatiese spraakherkenning weens die sterk verwantwskap wat dit het met spraak produksie en persepsie. Onlangs is dit bewys dat die gebruik van informasie van syllabe-lengte tydskale die resultate verbeter in groot woordeskat herkennings take. Dit is ook bewys dat die gebruik van syllabes automatiese taalherkenning en vreemdetaal aksent herkenning vergemaklik. Dit is daarom belangrik om vir navorsingsdoeleindes syllabes automaties te kan segmenteer. Vorige studies het kennisgebaseerde metodes gebruik om hierdie segmentasie te bewerkstellig. Hierdie studie gebruik 'n suiwer statistiese metode vir die automatiese syllabifikasie van spraak. Ons gebruik die konsep van hierargiese verskuilde Markov model strukture en wys hoe dit gebruik kan word om 'n suiwer akoestiese syllabe segmenteerder te implementeer. Die model word gebou deur dit te baseer op die teorie van sonoriteit asook die fonotaktiese beperkinge teenwoordig in die Engelse taal. Die akkurate voorstelling van syllabifikasie resultate is problematies in die bestaande literatuur. Ons definieer volledig 'n DTW (Dynamic Time Warping) afstands funksie waarmee ons ons syllabifikasie resultate weergee. Ons behaal 'n TER (Token Error Rate) van 20.3% met 'n 42ms gemiddelde grens fout op 'n relatiewe groot stel data. Dit vergelyk goed met vorige kennis-gebaseerde en statisties-gebaseerde metodes.

Los estilos APA, Harvard, Vancouver, ISO, etc.

47

Kleinschmidt, Tristan Friedrich. "Robust speech recognition using speech enhancement". Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/31895/1/Tristan_Kleinschmidt_Thesis.pdf.

Texto completo

Resumen

Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In “noise-free” environments, word recognition performance of these systems has been shown to approach 100%, however this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more ro- bust. Single-channel spectral subtraction was originally designed to improve hu- man speech intelligibility and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess en- hancement performance for intelligibility not speech recognition, therefore mak- ing them sub-optimal ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) in- troducing phase spectrum information to enable spectral subtraction in the com- plex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word se- quence known a priori. Whilst this technique is shown to improve the ASR word accuracy performance, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy in a wide range of SNR. Throughout the dissertation, consideration is given to practical implemen- tation in vehicular environments which resulted in two novel contributions – a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Aus- tralian English In-Car Speech Corpus which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.

Los estilos APA, Harvard, Vancouver, ISO, etc.

48

Wolf, Martin. "Channel selection and reverberation-robust automatic speech recognition". Doctoral thesis, Universitat Politècnica de Catalunya, 2013. http://hdl.handle.net/10803/134806.

Texto completo

Resumen

If speech is acquired by a close-talking microphone in a controlled and noise-free environment, current state-of-the-art recognition systems often show an acceptable error rate. The use of close-talking microphones, however, may be too restrictive in many applications. Alternatively, distant-talking microphones, often placed several meters far from the speaker, may be used. Such setup is less intrusive, since the speaker does not have to wear any microphone, but the Automatic Speech Recognition (ASR) performance is strongly affected by noise and reverberation. The thesis is focused on ASR applications in a room environment, where reverberation is the dominant source of distortion, and considers both single- and multi-microphone setups. If speech is recorded in parallel by several microphones arbitrarily located in the room, the degree of distortion may vary from one channel to another. The difference among the signal quality of each recording may be even more evident if those microphones have different characteristics: some are hanging on the walls, others standing on the table, or others build in the personal communication devices of the people present in the room. In a scenario like that, the ASR system may benefit strongly if the signal with the highest quality is used for recognition. To find such signal, what is commonly referred as Channel Selection (CS), several techniques have been proposed, which are discussed in detail in this thesis. In fact, CS aims to rank the signals according to their quality from the ASR perspective. To create such ranking, a measure that either estimates the intrinsic quality of a given signal, or how well it fits the acoustic models of the recognition system is needed. In this thesis we provide an overview of the CS measures presented in the literature so far, and compare them experimentally. Several new techniques are introduced, that surpass the former techniques in terms of recognition accuracy and/or computational efficiency. A combination of different CS measures is also proposed to further increase the recognition accuracy, or to reduce the computational load without any significant performance loss. Besides, we show that CS may be used together with other robust ASR techniques, and that the recognition improvements are cumulative up to some extent. An online real-time version of the channel selection method based on the variance of the speech sub-band envelopes, which was developed in this thesis, was designed and implemented in a smart room environment. When evaluated in experiments with real distant-talking microphone recordings and with moving speakers, a significant recognition performance improvement was observed. Another contribution of this thesis, that does not require multiple microphones, was developed in cooperation with the colleagues from the chair of Multimedia Communications and Signal Processing at the University of Erlangen-Nuremberg, Erlangen, Germany. It deals with the problem of feature extraction within REMOS (REverberation MOdeling for Speech recognition), which is a generic framework for robust distant-talking speech recognition. In this framework, the use of conventional methods to obtain decorrelated feature vector coefficients, like the discrete cosine transform, is constrained by the inner optimization problem of REMOS, which may become unsolvable in a reasonable time. A new feature extraction method based on frequency filtering was proposed to avoid this problem.
Los actuales sistemas de reconocimiento del habla muestran a menudo una tasa de error aceptable si la voz es registrada por micr ofonos próximos a la boca del hablante, en un entorno controlado y libre de ruido. Sin embargo, el uso de estos micr ofonos puede ser demasiado restrictivo en muchas aplicaciones. Alternativamente, se pueden emplear micr ofonos distantes, los cuales a menudo se ubican a varios metros del hablante. Esta con guraci on es menos intrusiva ya que el hablante no tiene que llevar encima ning un micr ofono, pero el rendimiento del reconocimiento autom atico del habla (ASR, del ingl es Automatic Speech Recognition) en dicho caso se ve fuertemente afectado por el ruido y la reverberaci on. Esta tesis se enfoca a aplicaciones ASR en el entorno de una sala, donde la reverberaci on es la causa predominante de distorsi on y se considera tanto el caso de un solo micr ofono como el de m ultiples micr ofonos. Si el habla es grabada en paralelo por varios micr ofonos distribuidos arbitrariamente en la sala, el grado de distorsi on puede variar de un canal a otro. Las diferencias de calidad entre las señales grabadas pueden ser m as acentuadas si dichos micr ofonos muestran diferentes características y colocaciones: unos en las paredes, otros sobre la mesa, u otros integrados en los dispositivos de comunicaci on de las personas presentes en la sala. En dicho escenario el sistema ASR se puede bene ciar enormemente de la utilizaci on de la señal con mayor calidad para el reconocimiento. Para hallar dicha señal se han propuesto diversas t ecnicas, denominadas CS (del ingl es Channel Selection), las cuales se discuten detalladament en esta tesis. De hecho, la selecci on de canal busca ranquear las señales conforme a su calidad desde la perspectiva ASR. Para crear tal ranquin se necesita una medida que tanto estime la calidad intr nseca de una selal, como lo bien que esta se ajusta a los modelos ac usticos del sistema de reconocimiento. En esta tesis proporcionamos un resumen de las medidas CS hasta ahora presentadas en la literatura, compar andolas experimentalmente. Diversas nuevas t ecnicas son presentadas que superan las t ecnicas iniciales en cuanto a exactitud de reconocimiento y/o e ciencia computacional. Tambi en se propone una combinaci on de diferentes medidas CS para incrementar la exactitud de reconocimiento, o para reducir la carga computacional sin ninguna p erdida signi cativa de rendimiento. Adem as mostramos que la CS puede ser empleada junto con otras t ecnicas robustas de ASR, tales como matched condition training o la normalizaci on de la varianza y la media, y que las mejoras de reconocimiento de ambas aproximaciones son hasta cierto punto acumulativas. Una versi on online en tiempo real del m etodo de selecci on de canal basado en la varianza del speech sub-band envelopes, que fue desarrolladas en esta tesis, fue diseñada e implementada en una sala inteligente. Reportamos una mejora signi cativa en el rendimiento del reconocimiento al evaluar experimentalmente grabaciones reales de micr ofonos no pr oximos a la boca con hablantes en movimiento. La otra contribuci on de esta tesis, que no requiere m ultiples micr ofonos, fue desarrollada en colaboraci on con los colegas del departamento de Comunicaciones Multimedia y Procesamiento de Señales de la Universidad de Erlangen-Nuremberg, Erlangen, Alemania. Trata sobre el problema de extracci on de caracter sticas en REMOS (del ingl es REverberation MOdeling for Speech recognition). REMOS es un marco conceptual gen erico para el reconocimiento robusto del habla con micr ofonos lejanos. El uso de los m etodos convencionales para obtener los elementos decorrelados del vector de caracter sticas, como la transformada coseno discreta, est a limitado por el problema de optimizaci on inherente a REMOS, lo que har a que, utilizando las herramientas convencionales, se volviese un problema irresoluble en un tiempo razonable. Para resolver este problema hemos desarrollado un nuevo m etodo de extracci on de caracter sticas basado en fi ltrado frecuencial
Els sistemes actuals de reconeixement de la parla mostren sovint una taxa d'error acceptable si la veu es registrada amb micr ofons pr oxims a la boca del parlant, en un entorn controlat i lliure de soroll. No obstant, l' us d'aquests micr ofons pot ser massa restrictiu en moltes aplicacions. Alternativament, es poden utilitzar micr ofons distants, els quals sovint s on ubicats a diversos metres del parlant. Aquesta con guraci o es menys intrusiva, ja que el parlant no ha de portar a sobre cap micr ofon, per o el rendiment del reconeixement autom atic de la parla (ASR, de l'angl es Automatic Speech Recognition) en aquest cas es veu fortament afectat pel soroll i la reverberaci o. Aquesta tesi s'enfoca a aplicacions ASR en un ambient de sala, on la reverberaci o es la causa predominant de distorsi o i es considera tant el cas d'un sol micr ofon com el de m ultiples micr ofons. Si la parla es gravada en paral lel per diversos micr ofons distribuï ts arbitràriament a la sala, el grau de distorsi o pot variar d'un canal a l'altre. Les difer encies en qualitat entre els senyals enregistrats poden ser m es accentuades si els micr ofons tenen diferents caracter stiques i col locacions: uns a les parets, altres sobre la taula, o b e altres integrats en els aparells de comunicaci o de les persones presents a la sala. En un escenari com aquest, el sistema ASR es pot bene ciar enormement de l'utilitzaci o del senyal de m es qualitat per al reconeixement. Per a trobar aquest senyal s'han proposat diverses t ecniques, anomenades CS (de l'angl es Channel Selection), les quals es discuteixen detalladament en aquesta tesi. De fet, la selecci o de canal busca ordenar els senyals conforme a la seva qualitat des de la perspectiva ASR. Per crear tal r anquing es necessita una mesura que estimi la qualitat intr nseca d'un senyal, o b e una que valori com de b e aquest s'ajusta als models ac ustics del sistema de reconeixement. En aquesta tesi proporcionem un resum de les mesures CS ns ara presentades en la literatura, comparant-les experimentalment. A m es, es presenten diverses noves t ecniques que superen les anteriors en termes d'exactitud de reconeixement i / o e ci encia computacional. Tamb e es proposa una combinaci o de diferents mesures CS amb l'objectiu d'incrementar l'exactitud del reconeixement, o per reduir la c arrega computacional sense cap p erdua signi cativa de rendiment. A m es mostrem que la CS pot ser utilitzada juntament amb altres t ecniques robustes d'ASR, com ara matched condition training o la normalitzaci o de la varian ca i la mitjana, i que les millores de reconeixement de les dues aproximacions s on ns a cert punt acumulatives. Una versi o online en temps real del m etode de selecci o de canal basat en la varian ca de les envolvents sub-banda de la parla, desenvolupada en aquesta tesi, va ser dissenyada i implementada en una sala intel ligent. A l'hora d'avaluar experimentalment gravacions reals de micr ofons no pr oxims a la boca amb parlants en moviment, es va observar una millora signi cativa en el rendiment del reconeixement. L'altra contribuci o d'aquesta tesi, que no requereix m ultiples micr ofons, va ser desenvolupada en col laboraci o amb els col legues del departament de Comunicacions Multimedia i Processament de Senyals de la Universitat de Erlangen-Nuremberg, Erlangen, Alemanya. Tracta sobre el problema d'extracci o de caracter stiques a REMOS (de l'angl es REverberation MOdeling for Speech recognition). REMOS es un marc conceptual gen eric per al reconeixement robust de la parla amb micr ofons llunyans. L' us dels m etodes convencionals per obtenir els elements decorrelats del vector de caracter stiques, com ara la transformada cosinus discreta, est a limitat pel problema d'optimitzaci o inherent a REMOS. Aquest faria que, utilitzant les eines convencionals, es torn es un problema irresoluble en un temps raonable. Per resoldre aquest problema hem desenvolupat un nou m etode d'extracci o de caracter ístiques basat en fi ltrat frecuencial.

Los estilos APA, Harvard, Vancouver, ISO, etc.

49

Zhang, Xiaojia. "Language modeling for automatic speech recognition in telehealth". Diss., Columbia, Mo. : University of Missouri-Columbia, 2005. http://hdl.handle.net/10355/4245.

Texto completo

Resumen

Thesis (M.S.)--University of Missouri-Columbia, 2005.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file viewed on (January 11, 2007) Vita. Includes bibliographical references.

Los estilos APA, Harvard, Vancouver, ISO, etc.

50

Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition". Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.

Texto completo

Resumen

In automatic speech recognition systems (ASRs), training is a critical phase to the system?s success. Communication media, either analog (such as analog landline phones) or digital (VoIP) distort the speaker?s speech signal often in very complex ways: linear distortion occurs in all channels, either in the magnitude or phase spectrum. Non-linear but time-invariant distortion will always appear in all real systems. In digital systems we also have network effects which will produce packet losses and delays and repeated packets. Finally, one cannot really assert what path a signal will take, and so having error or distortion in between is almost a certainty. The channel introduces an acoustical mismatch between the speaker's signal and the trained data in the ASR, which results in poor recognition performance. The approach so far, has been to try to undo the havoc produced by the channels, i.e. compensate for the channel's behavior. In this thesis, we try to characterize the effects of different transmission media and use that as an inexpensive and repeatable way to train ASR systems.

Los estilos APA, Harvard, Vancouver, ISO, etc.

Ofrecemos descuentos en todos los planes premium para autores cuyas obras están incluidas en selecciones literarias temáticas. ¡Contáctenos para obtener un código promocional único!