Dissertations / Theses on the topic 'Cloud speech recognition adaptation'

Consult the top 50 dissertations and theses for your research on the topic 'Cloud speech recognition adaptation.'


1. Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition." Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.

2. Ho, Ka-Lung. "Kernel eigenvoice speaker adaptation." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20HOK.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2003.
Includes bibliographical references (leaves 56-61). Also available in electronic version. Access restricted to campus users.

3. Humphries, J. J. "Accent modelling and adaptation in automatic speech recognition." Thesis, University of Cambridge, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.604784.

Abstract:
Automatic speech recognition technology has advanced considerably and today's systems are sufficiently fast, affordable and robust to be useful in a wide range of applications. As the scope of these products increases, so does the range of people using them. The diversity of speaker accents which this brings poses a serious problem for existing technology. Speech recognition systems are generally trained on a specific accent group, such as Standard British English (RP). This work demonstrates that the performance of such systems deteriorates significantly when the accent of the incoming speech differs from that represented by the recogniser (typically more than a 200% increase in word error rate). It is shown that this is attributable to both acoustic and phonological differences between accents. Pronunciation modelling can help overcome phonological differences, and a new scheme is described which builds upon previous work in this area to give a fully automated method for capturing pronunciation variations. This method requires no linguistic intervention and works with modern large vocabulary continuous speech recognisers. An existing (e.g. British) phone-loop recogniser transcribes speech from the new accent region (e.g. American) and the resulting phonetic transcription is compared to standard (British) pronunciations. The phonological differences are then recorded, along with their phonetic context and confidence scores. A statistical pronunciation model of the new accent can then be produced by clustering these observations using binary decision trees. From these, a set of multiple pronunciations, appropriate to the new accent and with associated probabilities, can be generated and used within the original speech recogniser. This technique has been applied to the speech recognition problem in a range of adaptation and re-training scenarios, using American, British and non-native English speech data, and has been shown to reduce recogniser word error rates by up to 20%. Pronunciation effects captured in an automatically generated model of American English are shown to agree well with linguistic theory, and the similarity of pronunciations synthesised from this model to canonical American pronunciations is shown. A scheme for the integration of pronunciation adaptation with acoustic adaptation (specifically MLLR) is also presented and shown to be effective in producing reductions in recogniser word error rates of as much as 40%. The value of syllable and cross-word information in the accent model was also evaluated.
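
The rule-harvesting pipeline described in this abstract can be made concrete with a small sketch. The fragment below is a simplified illustration rather than the thesis's actual method: it aligns a canonical phone string against an automatic transcription of accented speech and tallies context-dependent substitutions; the decision-tree clustering and confidence-scoring stages are omitted, and all phone strings are invented.

```python
# Align a canonical (e.g. British) phone string against a transcription of
# accented speech, then count substitutions in left-phone context to form
# candidate pronunciation rules. Data here is invented.
from collections import Counter, defaultdict

def align(ref, hyp):
    """Levenshtein alignment returning (ref_phone, hyp_phone) pairs;
    None marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): cost[i][0] = i
    for j in range(m + 1): cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i-1][j-1] + (ref[i-1] != hyp[j-1]),
                             cost[i-1][j] + 1, cost[i][j-1] + 1)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            pairs.append((ref[i-1], hyp[j-1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            pairs.append((ref[i-1], None)); i -= 1          # deletion
        else:
            pairs.append((None, hyp[j-1])); j -= 1          # insertion
    return pairs[::-1]

rules = defaultdict(Counter)
canonical = ["t", "ao", "m", "aa", "t", "ow"]   # hypothetical "tomato"
recognised = ["t", "ah", "m", "ey", "t", "ow"]  # hypothetical US rendering
aligned = align(canonical, recognised)
for k, (r, h) in enumerate(aligned):
    if r is not None and h is not None and r != h:
        left = (aligned[k-1][0] or "#") if k > 0 else "#"
        rules[(left, r)][h] += 1

for (left, r), subs in rules.items():
    total = sum(subs.values())
    for h, c in subs.items():
        print(f"{r} -> {h} / after {left}  p={c/total:.2f}")
```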

4. Cox, S. J. "Techniques for rapid speaker adaptation in speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.267271.

5. Nieuwoudt, Christoph. "Cross-language acoustic adaptation for automatic speech recognition." Thesis, Pretoria: [s.n.], 2000. http://upetd.up.ac.za/thesis/available/etd-01062005-071829.

6. Hewett, Andrew John. "Training and speaker adaptation in template-based speech recognition." Thesis, University of Cambridge, 1989. https://www.repository.cam.ac.uk/handle/1810/250961.

7. Clarkson, P. R. "Adaptation of statistical language models for automatic speech recognition." Thesis, University of Cambridge, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.597745.

Abstract:
Statistical language models encode linguistic information in such a way as to be useful to systems which process human language. Such systems include those for optical character recognition and machine translation. Currently, however, the most common application of language modelling is in automatic speech recognition, and it is this that forms the focus of this thesis. Most current speech recognition systems are dedicated to one specific task (for example, the recognition of broadcast news), and thus use a language model which has been trained on text which is appropriate to that task. If, however, one wants to perform recognition on more general language, then creating an appropriate language model is far from straightforward. A task-specific language model will often perform very badly on language from a different domain, whereas a model trained on text from many diverse styles of language might perform better in general, but will not be especially well suited to any particular domain. Thus the idea of an adaptive language model whose parameters automatically adjust to the current style of language is an appealing one. In this thesis, two adaptive language models are investigated. The first is a mixture-based model. The training text is partitioned according to the style of text, and a separate language model is constructed for each component. Each component is assigned a weighting according to its performance at modelling the observed text, and a final language model is constructed as the weighted sum of the mixture components. The second approach is based on a cache of recent words. Previous work has shown that words that have occurred recently have a higher probability of occurring in the immediate future than would be predicted by a standard trigram language model. This thesis investigates the hypothesis that more recent words should be considered more significant within the cache by implementing a cache in which a word's recurrence probability decays exponentially over time. The problem of how to predict the effect of a particular language model on speech recognition accuracy is also addressed in this thesis. The results presented here, as well as those of other recent research, suggest that perplexity, the most commonly used method of evaluating language models, is not as well correlated with word error rate as was once thought. This thesis investigates the connection between a language model's perplexity and its effect on speech recognition performance, and describes the development of alternative measures of a language model's quality which are better correlated with word error rate. Finally, it is shown how the recognition performance which is achieved using mixture-based language models can be improved by optimising the mixture weights with respect to these new measures.
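
The decaying-cache idea in this abstract is easy to sketch. The fragment below is a minimal illustration, not Clarkson's implementation: a word seen t positions back contributes weight exp(-alpha*t), and the cache estimate is interpolated with a static model. The decay rate alpha, the interpolation weight lam, and the stand-in base_prob are all assumed values.

```python
# Exponentially decaying cache language model, interpolated with a base model.
import math
from collections import deque

def cache_prob(word, history, alpha=0.02):
    """P_cache(word): recency-weighted count, normalised over all positions."""
    hit = sum(math.exp(-alpha * t)
              for t, w in enumerate(reversed(history), 1) if w == word)
    total = sum(math.exp(-alpha * t) for t in range(1, len(history) + 1))
    return hit / total if total > 0 else 0.0

def adaptive_prob(word, history, base_prob, lam=0.1):
    """Interpolate the decaying cache with the static model."""
    return lam * cache_prob(word, history) + (1 - lam) * base_prob(word)

history = deque("the cache boosts recently seen words so cache hits rise".split(),
                maxlen=200)
print(adaptive_prob("cache", history, base_prob=lambda w: 1e-4))
```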

8. Ge, Zhenhao. "Mispronunciation detection for language learning and speech recognition adaptation." Thesis, Purdue University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3613127.

Abstract:

The area of "mispronunciation detection" (or, more specifically, "accent detection") within the speech recognition community is receiving increased attention. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.

A number of Computer Aided Language Learning (CALL) systems with Computer Aided Pronunciation Training (CAPT) techniques have been developed. In this thesis, a new HMM-based text-dependent mispronunciation detection system is introduced using Adaptive Frequency Cepstral Coefficients (AFCCs). It is shown that this system outperforms the conventional HMM method based on Mel Frequency Cepstral Coefficients (MFCCs). In addition, a mispronunciation detection and classification algorithm based on Principal Component Analysis (PCA) is introduced to help language learners identify and correct their pronunciation errors at the word and syllable levels.

To improve speech recognition by adaptation, two projects have been explored. The first improves name recognition by learning acceptable variations in name pronunciations, as one approach to making grammar-based name recognition adaptive. The second is accent detection by examining the shifting of fundamental vowels in accented speech. This approach uses both acoustic and phonetic information to detect accents and is shown to be beneficial with accented English. These applications can be integrated into an automated international calling system to improve recognition of callers' names and speech. The system determines a caller's accent from a short period of speech; once the accent type is detected, it switches from the standard speech recognition engine to an accent-adaptive one for better recognition results.

9. Uebel, Luís Felipe. "Speaker normalisation and adaptation in large vocabulary speech recognition." Thesis, University of Cambridge, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.616207.

10. McInnes, Fergus Robert. "Adaptation of reference patterns in word-based speech recognition." Thesis, University of Edinburgh, 1988. http://hdl.handle.net/1842/12618.

11. Wang, Chien-Jen. "Joint-space adaptation technique for robust continuous speech recognition." Thesis, University of Washington, 1997. http://hdl.handle.net/1773/5860.

12. Wu, Jian (武健). "Discriminative speaker adaptation and environmental robustness in automatic speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31246138.

13. Vipperla, Ravichander. "Automatic Speech Recognition for ageing voices." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5725.

Abstract:
With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and improve the ASR accuracies for older voices. Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults, and that the decrease in accuracy is greater for male speakers than for female speakers. Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER for the two age groups is proven incorrect. Experiments with artificial introduction of glottal source disfluencies in speech from younger adults do not display a significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have a marginal impact on ASR accuracies. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes, especially lower vowels, being more affected by ageing. These changes, however, are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments to analyse the impact of slower speaking rate on ASR accuracies indicate that insertion errors increase while decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracy based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracies are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models which also outperform the baseline models by a good margin. Finally, we hypothesise that ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
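
Jitter and shimmer, the glottal-source measures this abstract refers to, have simple standard definitions that a few lines make explicit. The sketch below uses the common "local" formulation (mean absolute difference between consecutive cycles, divided by the overall mean); the period and amplitude values are invented, not taken from the thesis corpora.

```python
# Local jitter (pitch-period perturbation) and shimmer (amplitude perturbation).
import numpy as np

def relative_perturbation(values):
    """mean |x[i] - x[i-1]| / mean(x), the usual 'local' jitter/shimmer."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

periods_ms = [7.9, 8.2, 7.8, 8.4, 7.7, 8.3]       # hypothetical pitch periods
amplitudes = [0.61, 0.55, 0.64, 0.52, 0.66, 0.5]  # hypothetical cycle peaks

print(f"jitter  = {relative_perturbation(periods_ms):.3%}")
print(f"shimmer = {relative_perturbation(amplitudes):.3%}")
```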

14. Hsiao, Roger Wend Huu. "Kernel eigenspace-based MLLR adaptation." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?COMP%202004%20HSIAO.

15. Lu, Jianhua. "Missing feature decoding and model adaptation for noisy speech recognition." Thesis, Queen's University Belfast, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.527938.

16. Yin, Shou-Chun. "Speaker adaptation in joint factor analysis based text independent speaker verification." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.

Abstract:
This thesis presents methods for supervised and unsupervised speaker adaptation of Gaussian mixture speaker models in text-independent speaker verification. The proposed methods are based on an approach which is able to separate speaker and channel variability, so that progressive updating of speaker models can be performed while minimising the influence of the channel variability associated with the adaptation recordings. This approach relies on a joint factor analysis model of intrinsic speaker variability and session variability, where inter-session variation is assumed to result primarily from the effects of the transmission channel. These adaptation methods have been evaluated under the adaptation paradigm defined in the NIST 2005 speaker recognition evaluation plan, which is based on conversational telephone speech.
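
The joint factor analysis model mentioned here is usually written as a supervector decomposition s = m + Vy + Ux + Dz. The toy sketch below illustrates that decomposition with arbitrary matrices and dimensions; a real system estimates m, V, U and D by EM over large corpora, which is not shown.

```python
# Toy joint factor analysis decomposition: s = m + V@y + U@x + D@z, where V
# spans speaker (eigenvoice) space, U spans channel (eigenchannel) space and
# D captures residual speaker variability. All values here are invented.
import numpy as np

rng = np.random.default_rng(0)
sv_dim, n_spk_factors, n_chan_factors = 12, 3, 2   # tiny, for illustration

m = rng.normal(size=sv_dim)                        # UBM mean supervector
V = rng.normal(size=(sv_dim, n_spk_factors))       # eigenvoice matrix
U = rng.normal(size=(sv_dim, n_chan_factors))      # eigenchannel matrix
D = np.diag(rng.uniform(0.1, 0.3, size=sv_dim))    # diagonal residual term

y = rng.normal(size=n_spk_factors)   # speaker factors (stable across sessions)
z = rng.normal(size=sv_dim)          # residual speaker offsets
for session in range(2):
    x = rng.normal(size=n_chan_factors)            # channel factors per session
    s = m + V @ y + U @ x + D @ z
    print(f"session {session}: supervector norm {np.linalg.norm(s):.2f}")

# Progressive adaptation updates (y, z) from new recordings while the
# session-specific U@x term absorbs channel variability, which is why the
# adapted model is less contaminated by the channel of the adaptation data.
```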

17. Mandal, Arindam. "Transformation sharing strategies for MLLR speaker adaptation." Thesis, University of Washington, 2007. http://hdl.handle.net/1773/6087.

18. Stokes-Rees, Ian James. "A Study of the Automatic Speech Recognition Process and Speaker Adaptation." Thesis, University of Waterloo, 2000. http://hdl.handle.net/10012/840.

Abstract:
This thesis considers the entire automated speech recognition process and presents a standardised approach to LVCSR experimentation with HMMs. It also discusses various approaches to speaker adaptation, such as MLLR and multiscale, and presents experimental results for cross-task speaker adaptation. An analysis of training parameters and of the data required for reasonable estimates of system performance is also included. It is found that Maximum Likelihood Linear Regression (MLLR) supervised adaptation can result in a 6% absolute reduction in word error rate given only one minute of adaptation data, as compared with an unadapted model set trained on a different task. The unadapted system performed at 24% WER and the adapted system at 18% WER. This is achieved with only 4 to 7 adaptation classes per speaker, as generated from a regression tree.
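
The MLLR updates evaluated here apply one shared affine transform per regression class to the Gaussian means, mu' = A @ mu + b. The sketch below only illustrates that application step with made-up transforms; the maximum-likelihood estimation of the transforms from aligned adaptation frames is not reproduced.

```python
# Applying MLLR mean transforms: every Gaussian in a regression class shares
# one affine map mu' = A @ mu + b. Transforms and means here are invented.
import numpy as np

rng = np.random.default_rng(1)
dim, n_gauss = 3, 4
means = rng.normal(size=(n_gauss, dim))           # speaker-independent means
reg_class = np.array([0, 0, 1, 1])                # regression-tree assignment

# Hypothetical per-class transforms (in practice, ML estimates from roughly
# one minute of adaptation data, as in the experiments reported above).
transforms = {0: (np.eye(dim) * 1.1, np.full(dim, 0.2)),
              1: (np.eye(dim) * 0.9, np.full(dim, -0.1))}

adapted = np.stack([transforms[c][0] @ mu + transforms[c][1]
                    for mu, c in zip(means, reg_class)])
print(np.round(adapted - means, 2))               # shift applied to each mean
```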

19. Stokes-Rees, Ian. "A study of the automatic speech recognition process and speaker adaptation." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0018/MQ56683.pdf.

20. Nolazco Flores, Juan Arturo. "Spectral subtraction and model adaptation for robust speech recognition in noise." Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.318436.

21. He, Xiaodong. "Model selection based speaker adaptation and its application to nonnative speech recognition." Thesis, University of Missouri, 2003. http://wwwlib.umi.com/cr/mo/fullcit?p3115555.

22. Haque, Serajul. "Perceptual features for speech recognition." University of Western Australia, School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0187.

Abstract:
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology and research. It refers to the recognition of speech by a machine or by artificial intelligence. However, in spite of focused research in this field for the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variabilities, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced by temporal processing of speech using a zero-crossing algorithm. It is observed that a high frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated as observed in the psychoacoustic input/output (I/O) perception curves. A frequency-dependent asymmetric compression technique, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed. By asymmetric compression, degradation of the spectral contrast of the low frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on the auditory processing in the cochlear nucleus is presented. Synchrony detection, average discharge (mean rate) processing and two-tone suppression are segregated and processed separately at the feature extraction level, according to the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating synchrony detection from synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end. An auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. The performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
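
The spectral subtraction that the perceptual method above extends can be sketched in a few lines. The fragment below implements plain magnitude-squared subtraction with an oversubtraction factor and a spectral floor; the auditory masking-threshold weighting from the dissertation is not included, and alpha, beta and the toy spectra are assumptions.

```python
# Basic power-spectral subtraction with oversubtraction and a spectral floor.
import numpy as np

def spectral_subtract(frames, noise_psd, alpha=2.0, beta=0.01):
    """frames: |STFT|^2 per frame; noise_psd: estimated noise power spectrum."""
    cleaned = frames - alpha * noise_psd          # oversubtract the noise power
    floor = beta * frames                         # floor limits musical-noise
    return np.maximum(cleaned, floor)             # artefacts from negatives

rng = np.random.default_rng(2)
noise_psd = np.full(8, 0.5)
noisy = np.abs(rng.normal(1.0, 0.3, size=(4, 8))) ** 2 + noise_psd
print(np.round(spectral_subtract(noisy, noise_psd), 3))
```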

23. Ahadi-Sarkani, Seyed Mohammad. "Bayesian and predictive techniques for speaker adaptation." Thesis, University of Cambridge, 1996. https://www.repository.cam.ac.uk/handle/1810/273100.

24. Gabriel, Naveen. "Automatic Speech Recognition in Somali." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.

Abstract:
The field of speech recognition has, during the last decade, left the research stage and found its way into the public market; today, speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, an automatic speech recognition system using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system in which the framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. The thesis begins with background on speech recognition, and a literature survey covers some of the work that has been done in this field. It then evaluates how different language models and discounting methods affect the performance of speech recognition systems. Log scores were also calculated for the top 5 predicted sentences, along with confidence measures of the predicted sentences. The model was trained on 4.5 hours of voiced data with its corresponding transcription and evaluated on 3 minutes of test data. The performance of the trained model on the test set was good, given that the data was devoid of background noise and lacked variability. Performance is measured using word error rate (WER) and sentence error rate (SER), and is also compared with the results of other research work. The thesis further discusses why the log score and confidence score of a sentence might not be good measures of the performance of the resulting model, the shortcomings of the HMM-GMM model, how the existing model can be improved, and different alternatives to solve the problem.
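
The word error rate used to evaluate the model is the length-normalised Levenshtein distance between reference and hypothesis word sequences, WER = (S + D + I) / N. A minimal implementation follows; the Somali-style example sentence is invented.

```python
# Word error rate via edit distance between reference and hypothesis words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1): d[i][0] = i
    for j in range(len(hyp) + 1): d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # sub / match
                          d[i-1][j] + 1,                          # deletion
                          d[i][j-1] + 1)                          # insertion
    return d[-1][-1] / len(ref)

# Hypothetical example; sentence error rate (SER) is simply the fraction of
# sentences with at least one error.
print(wer("waa maxay magacaagu", "waa magacaagu"))  # one deletion -> 0.33
```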

25. Ma, Bin (馬斌). "A study on acoustic modeling and adaptation in HMM-based speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B31242145.

26. Ma, Bin. "A study on acoustic modeling and adaptation in HMM-based speech recognition." Hong Kong: University of Hong Kong, 2000. http://sunzi.lib.hku.hk/hkuto/record.jsp?B22142423.

27. Swietojanski, Paweł. "Learning representations for speech recognition using artificial neural networks." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/22835.

Abstract:
Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found that the proposed adaptation techniques have many desirable properties: they are relatively low-dimensional, do not overfit and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, assuming that resources like lexicons or texts to estimate language models are available. First we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC yields an additional 13.6% relative reduction, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of available transcribed acoustic data in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamform-enhanced signals.
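
LHUC, the first transform described above, can be sketched compactly: each hidden unit's output is scaled by a speaker-dependent amplitude a = 2*sigmoid(r), and only r is trained on adaptation data while the speaker-independent weights stay frozen. The layer sizes, data and squared-error objective below are invented for illustration.

```python
# Minimal LHUC sketch: re-scale the hidden units of a frozen layer by
# a = 2*sigmoid(r) and learn only the per-unit parameters r.
import numpy as np

rng = np.random.default_rng(3)
W, b = rng.normal(size=(16, 8)), np.zeros(16)   # frozen speaker-independent layer
r = np.zeros(16)                                # LHUC parameters, one per unit

x = rng.normal(size=8)
h = np.maximum(0.0, W @ x + b)                  # canonical hidden activations
target = 1.3 * h                                # pretend this speaker needs a boost

for _ in range(200):                            # gradient descent on r only
    sig = 1.0 / (1.0 + np.exp(-r))
    err = 2.0 * sig * h - target                # residual of the toy MSE loss
    r -= 0.05 * err * h * 2.0 * sig * (1.0 - sig)   # chain rule through 2*sigmoid

print(np.round(2.0 / (1.0 + np.exp(-r)), 2))    # amplitudes -> ~1.3 on active units
```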

28. Li, Wei. "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." E-thesis via HKUTO, 2007. http://sunzi.lib.hku.hk/hkuto/record/B39634073.

29. Li, Wei (李威). "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2007. http://hub.hku.hk/bib/B39634073.

30. Fraga Da Silva, Thiago. "Reducing development costs of large vocabulary speech recognition systems." Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112232/document.

Abstract:
One of the outstanding challenges in large vocabulary automatic speech recognition (ASR) is the reduction of the development costs required to build a new recognition system or adapt an existing one to a new task, language or dialect. State-of-the-art ASR systems are based on the principles of the statistical learning paradigm, using information provided by two stochastic models, an acoustic model (AM) and a language model (LM). The standard methods used to estimate the parameters of such models are founded on two main assumptions: the training data sets are large enough, and the training data match the target task well. It is well known that a great part of system development costs is due to the construction of corpora that fulfil these requirements. In particular, manually transcribing the audio data is the most expensive and time-consuming endeavour. For some applications, such as the recognition of low-resourced languages or dialects, finding and collecting data is also a hard (and expensive) task. As a means to lower the cost of ASR system development, this thesis proposes and studies methods that aim to alleviate the need for manually transcribing audio data for a given target task. Two axes of research are explored. First, unsupervised training methods are explored in order to build three of the main components of ASR systems: the acoustic model, the multi-layer perceptron (MLP) used to extract acoustic features, and the language model. The unsupervised training methods aim to estimate the model parameters using a large amount of automatically (and inaccurately) transcribed audio data, obtained thanks to an existing recognition system. A novel method for unsupervised AM training that copes well with the automatic audio transcripts is proposed: the use of multiple recognition hypotheses (rather than only the best one) leads to consistent gains in performance over the standard approach. Unsupervised MLP training is proposed as an alternative to build efficient acoustic models in a fully unsupervised way. Compared to cross-lingual MLPs trained in a supervised manner, the unsupervised MLP leads to competitive performance levels even when trained on only about half as much data. Unsupervised LM training approaches are proposed to estimate standard back-off n-gram and neural network language models. It is shown that unsupervised LM training leads to gains in performance additive to those of unsupervised AM training. Second, this thesis proposes the use of model interpolation as a rapid and flexible way to build task-specific acoustic models. In reported experiments, models obtained via interpolation outperform the baseline pooled models and equivalent maximum a posteriori (MAP) adapted models. Interpolation proves to be especially useful for low-resourced dialect ASR. When only a few hours (2 to 3) of acoustic data truly matching the target dialect, or none at all, are available for AM training, model interpolation leads to substantial performance gains compared to the standard training methods.

31. Amdal, Ingunn. "Learning pronunciation variation: A data-driven approach to rule-based lexicon adaptation for automatic speech recognition." Doctoral thesis, Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, 2002. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-1560.

Abstract:

To achieve a robust system the variation seen for different speaking styles must be handled. An investigation of standard automatic speech recognition techniques for different speaking styles showed that lexical modelling using general-purpose variants gave small improvements, but the errors differed compared with using only one canonical pronunciation per word. Modelling the variation using the acoustic models (using context dependency and/or speaker dependent adaptation) gave a significant improvement, but the resulting performance for non-native and spontaneous speech was still far from read speech.

In this dissertation a complete data-driven approach to rule-based lexicon adaptation is presented, where the effect of the acoustic models is incorporated in the rule pruning metric. Reference and alternative transcriptions were aligned by dynamic programming, but with a data-driven method to derive the phone-to-phone substitution costs. The costs were based on the statistical co-occurrence of phones (association strength). Rules for pronunciation variation were derived from this alignment. The rules were pruned using a new metric based on acoustic log likelihood. Well-trained acoustic models are capable of modelling much of the variation seen, and using the acoustic log likelihood to assess the pronunciation rules prevents the lexical modelling from adding variation already accounted for, as shown for direct pronunciation variation modelling.

For the non-native task, data-driven pronunciation modelling by learning pronunciation rules gave a significant performance gain. Acoustic log likelihood rule pruning performed better than rule probability pruning.

For spontaneous dictation the pronunciation variation experiments did not improve the performance. The answer to how to better model the variation for spontaneous speech seems to lie neither in the acoustical nor the lexical modelling. The main differences between read and spontaneous speech are the grammar used and disfluencies like restarts and long pauses. The language model may thus be the best starting point for more research to achieve better performance for this speaking style.

32. Kleynhans, Neil Taylor. "Automatic speech recognition for resource-scarce environments." Thesis, North-West University, 2013. http://hdl.handle.net/10394/9668.

Abstract:
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and due to this resource scarcity the deployment of ASR systems in the developing world is severely inhibited. In this thesis we present research into developing techniques and tools to (1) harvest audio data, (2) rapidly adapt ASR systems and (3) select "useful" training samples in order to assist with resource-scarce ASR system development. We demonstrate an automatic audio harvesting approach which efficiently creates a speech recognition corpus by harvesting an easily available audio resource. We show that by starting with bootstrapped acoustic models, trained with language data obtained from a dialect, and then running through a few iterations of an alignment-filter-retrain phase, it is possible to create an accurate speech recognition corpus. As a demonstration we create a South African English speech recognition corpus by using our approach to harvest an internet website which provides audio and approximate transcriptions. The acoustic models developed from harvested data are evaluated on independent corpora and show that the proposed harvesting approach provides a robust means to create ASR resources. As there are many acoustic model adaptation techniques which can be implemented by an ASR system developer, it becomes a costly endeavour to select the best adaptation technique. We investigate how performance depends on the amount of adaptation data by systematically varying that amount and comparing the performance of various adaptation techniques. We establish a guideline which can be used by an ASR developer to choose the best adaptation technique given a size constraint on the adaptation data, for the scenario where adaptation between narrow- and wide-band corpora must be performed. In addition, we investigate the effectiveness of a novel channel normalisation technique and compare its performance with standard normalisation and adaptation techniques. Lastly, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection, and that the widely used maximum entropy methods generally produce the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships, and found that a hyperbolic relationship (as suggested by simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
Thesis (PhD (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013.
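
The frequency-matched selection strategy that this thesis identifies as most effective can be illustrated with a greedy sketch: repeatedly add the utterance that brings the selected corpus's unit distribution closest to the target frequency profile. The phone inventory, utterances and L1 distance below are assumptions for illustration, not the thesis's exact procedure.

```python
# Greedy frequency-matched data selection over an invented utterance pool.
from collections import Counter

target = Counter({"a": 40, "s": 25, "t": 20, "n": 15})   # desired frequencies
pool = [list("satana"), list("tssst"), list("nanas"), list("atta")]

def distance(counts, target):
    """L1 distance between normalised frequency profiles."""
    total_c, total_t = sum(counts.values()) or 1, sum(target.values())
    units = set(counts) | set(target)
    return sum(abs(counts[u] / total_c - target[u] / total_t) for u in units)

selected, sel_counts = [], Counter()
for _ in range(2):                                # pick a 2-utterance corpus
    best = min(pool, key=lambda u: distance(sel_counts + Counter(u), target))
    pool.remove(best)
    sel_counts += Counter(best)
    selected.append("".join(best))
print(selected, dict(sel_counts))
```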

33. Sooful, Jayren Jugpal. "Automated phoneme mapping for cross-language speech recognition." Diss., Pretoria [s.n.], 2004. http://upetd.up.ac.za/thesis/available/etd-01112005-131128.

34. Fanner, Robert M. "Analysis and implementation of the speaker adaptation techniques: MAP, MLLR, and MLED." Thesis, Stellenbosch: Stellenbosch University, 2002. http://hdl.handle.net/10019.1/52653.

Abstract:
Thesis (MScEng)--University of Stellenbosch, 2002.
The topic of this thesis is speaker adaptation, whereby speaker-independent speech models are adapted to more closely match individual speakers by utilising a small amount of data from the targeted individual. Speaker adaptation methods - specifically, the MAP, MLLR and MLED speaker adaptation methods - are critically evaluated and compared. Two novel extensions of the MLED adaptation method are introduced, derived and evaluated. The first incorporates explicit modelling of the mean speaker model in the speaker space into the MLED framework. The second extends MLED to use basis vectors modelling inter-class variance for classes of speech models, instead of basis vectors modelling inter-speaker variance. An evaluation of the effect of two different types of feature vector - PLP-cepstra and LPCCs - on the performance of speaker adaptation is made, to determine which feature vector is optimal for speaker-independent systems and the adaptation thereof.
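
The eigenvoice machinery behind MLED can be sketched briefly: training speakers' Gaussian-mean supervectors are stacked, principal components are taken, and a new speaker is modelled as the mean voice plus a few eigenvoice coefficients. MLED estimates those coefficients by maximum likelihood; the sketch below, on invented data, simply projects a supervector to show the mechanics, including the explicit mean-speaker model of the first extension.

```python
# Eigenvoice construction by PCA over speaker supervectors, then projection
# of a new speaker onto the top-k eigenvoices. All data is invented.
import numpy as np

rng = np.random.default_rng(4)
n_speakers, sv_dim, k = 20, 30, 3
speakers = rng.normal(size=(n_speakers, sv_dim))       # training supervectors

mean_voice = speakers.mean(axis=0)                     # explicit mean model
centered = speakers - mean_voice
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenvoices = vt[:k]                                   # top-k PCA directions

new_speaker = rng.normal(size=sv_dim)
coeffs = eigenvoices @ (new_speaker - mean_voice)      # only k numbers to estimate
adapted = mean_voice + eigenvoices.T @ coeffs          # adapted supervector
print(np.round(coeffs, 2))
```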

35. Shin, Sung-Hwan. "Objective-driven discriminative training and adaptation based on an MCE criterion for speech recognition and detection." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50255.

Abstract:
Acoustic modeling in state-of-the-art speech recognition systems is commonly based on discriminative criteria. Unlike conventional distribution estimation paradigms such as maximum a posteriori (MAP) and maximum likelihood (ML), the most popular discriminative criteria, such as MCE and MPE, aim at direct minimization of the empirical error rate. As recent ASR applications become diverse, it has been increasingly recognized that realistic applications often require a model that can be optimized for a task-specific goal or a particular scenario beyond the general purposes of the current discriminative criteria. These specific requirements cannot be directly handled by the current discriminative criteria, since their objective is to minimize the overall empirical error rate. In this thesis, we propose novel objective-driven discriminative training and adaptation frameworks, generalized from the minimum classification error (MCE) criterion, for various tasks and scenarios of speech recognition and detection. The proposed frameworks are constructed to formulate new discriminative criteria which satisfy the varied requirements of recent ASR applications: each objective required by an application or a developer is directly embedded into the learning criterion, and the objective-driven discriminative criterion is then used to optimize an acoustic model so as to achieve the required objective. Three task-specific requirements that recent ASR applications often impose in practice are taken into account in developing the objective-driven discriminative criteria. First, the issue of individual error minimization in speech recognition is addressed, and we propose a direct minimization algorithm for each error type of speech recognition. Second, a rapid adaptation scenario is embedded into the formulation of discriminative linear transforms under the MCE criterion, and a regularized MCE criterion is proposed to efficiently improve the generalization capability of the MCE estimate in a rapid adaptation scenario. Finally, the particular operating scenario that requires a system model optimized at a given specific operating point is discussed in the context of conventional receiver operating characteristic (ROC) optimization, and a constrained discriminative training algorithm which can directly optimize a system model for any particular operating need is proposed. For each of the developed algorithms, we provide an analytical solution and an appropriate optimization procedure.
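
The MCE criterion that these frameworks generalise combines a misclassification measure with a smoothing sigmoid. The fragment below is a textbook-style sketch with invented scores, not the thesis's task-specific criteria: d compares the correct class score with a soft maximum over competitors, and the sigmoid of d gives a differentiable stand-in for a 0/1 error that gradient methods can minimise.

```python
# MCE smoothed-error sketch: misclassification measure plus sigmoid loss.
import numpy as np

def mce_loss(scores, correct, eta=2.0, gamma=1.0):
    """scores: discriminant g_j(x) per class; correct: index of true class."""
    g_true = scores[correct]
    others = np.delete(scores, correct)
    g_comp = np.log(np.mean(np.exp(eta * others))) / eta   # soft max of rivals
    d = -g_true + g_comp                                   # misclassification
    return 1.0 / (1.0 + np.exp(-gamma * d))                # smoothed 0/1 error

scores = np.array([2.1, 1.4, -0.3])     # hypothetical class scores
print(mce_loss(scores, correct=0))      # small: classified correctly
print(mce_loss(scores, correct=2))      # near 1: classified incorrectly
```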

36. Gangireddy, Siva Reddy. "Recurrent neural network language models for automatic speech recognition." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28990.

Abstract:
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks when compared to other language models. In this thesis we propose various advances to RNNLMs: improved learning procedures, enhancement of the context, and adaptation. We learn better parameters by a novel pre-training approach and enhance the context using prosody and syntactic features. We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER. Next, we present unsupervised adaptation of RNNLMs. We adapted the RNNLMs to a target domain (topic, genre or television programme) at test time using ASR transcripts from first-pass recognition. We investigated two approaches: in the first, the forward-propagating hidden activations are scaled, following learning hidden unit contributions (LHUC); in the second, we adapt all parameters of the RNNLM. We evaluated the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, and observe small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM. Finally, we present the context enhancement of RNNLMs using prosody and syntactic features. The prosody features were computed from the acoustics of the context words, and the syntactic features from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).

37. Pinet, M. "Accent effects on the recognition of speech in noise: second-language proficiency, accent similarity and adaptation." Thesis, University College London (University of London), 2012. http://discovery.ucl.ac.uk/1370645/.

Abstract:
One of the key factors that determine speech intelligibility under challenging conditions is the difference between the accents of the talker and listener. For example, normal-hearing listeners can be accurate at recognizing a wide range of accents in quiet, but in noise they are much poorer (e.g., 20 percentage points less accurate) if they try to understand native (L1) or non-native (L2) accented speech that does not closely match their own accent. The aim of this PhD research is to provide a more detailed account of this talker-listener interaction in order to establish the underlying factors involved in L1 and L2 speech communication in noise for normal-hearing populations. Study 1 examined the effects of L2 proficiency on the L1-L2 accent interaction in noise, with Study 2 investigating the contribution of acoustic similarity to accent intelligibility. Study 3 examined L1 listeners’ adaptation processes to unfamiliar accents in noise. Finally, Study 4 took a cross-linguistic approach and investigated how language experience and accent similarity affect the talker-listener accent interaction in noise across languages. Overall, the results revealed that several factors contribute strongly to the L1-L2 accent interaction in noise, with the emerging findings contributing to our general understanding of speech in noise perception. For instance, acoustic similarity in the accents of the talkers and the listeners accounted for a great amount of the variance in intelligibility. Linguistic background and L2 experience were also shown to play a major role in the interaction, shaping the listeners’ accent processing patterns in their L1 and L2, as well as general speech-in-noise processes, with bilingual and highly proficient L2 listeners showing facilitation effects for speech processing in both their languages. Finally, the selective tuning processes found for standard accents in English were not replicated for French, indicating that accent processing varies across languages.

38. Martirosian, Olga Meruzhanovna. "Adapting a pronunciation dictionary to Standard South African English for automatic speech recognition." Thesis, North-West University, 2009. http://hdl.handle.net/10394/4902.

Abstract:
The pronunciation dictionary is a key resource required during the development of an automatic speech recognition (ASR) system. In this thesis, we adapt a British English pronunciation dictionary to Standard South African English (SSAE) as a case study in dialect adaptation. Our investigation leads us in three different directions: dictionary verification, phoneme redundancy evaluation and phoneme adaptation. A pronunciation dictionary should be verified for correctness before its use in experiments or applications. However, employing a human to verify a full pronunciation dictionary is a laborious process which cannot always be accommodated. In our dictionary verification research we attempt to reduce the human effort required to verify a pronunciation dictionary by implementing automatic and semi-automatic techniques that find and isolate possible erroneous entries in the dictionary. We identify a number of new techniques that are very efficient at identifying errors and apply them to a public domain British English pronunciation dictionary. Investigating phoneme redundancy involves looking into the possibility that not all phoneme distinctions are required in SSAE, and investigating different methods of analysing these distinctions. The methods investigated include both data-driven and knowledge-based pronunciation suggestions for a pronunciation dictionary used in an ASR system. This investigation provides deeper linguistic insight into the pronunciation of phonemes in SSAE. Finally, we investigate phoneme adaptation by adapting the KIT phoneme between two dialects of English through the implementation of a set of adaptation rules. Adaptation rules are extracted from the literature but also formulated through an investigation of the linguistic phenomena in the data. We achieve a 93% predictive accuracy, significantly higher than the 71% achievable through the implementation of previously identified rules. The adaptation of a British pronunciation dictionary to SSAE represents the final step in developing an SSAE pronunciation dictionary, which is the aim of this thesis. In addition, an ASR system utilising the dictionary is developed, achieving an unconstrained phoneme accuracy of 79.7%.
Thesis (M.Ing. (Computer Engineering))--North-West University, Potchefstroom Campus, 2009.

39. Tomashenko, Natalia. "Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems." Thesis, Le Mans, 2017. http://www.theses.fr/2017LEMA1040/document.

Abstract:
Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common techniques in ASR systems for many decades. Speaker adaptation is very effective for these AMs, and various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved big advances and outperformed GMM-HMM models on various ASR tasks, but speaker adaptation remains very challenging for them. Many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for the efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated, based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques, and complementary to them.
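
The GMM-derived features at the heart of this approach can be sketched directly: each frame is re-encoded as the vector of log-likelihoods of the components of a GMM, so adapting the GMM changes the DNN's input without touching its weights. The GMM below is invented rather than trained, and additive constants are dropped from the log-likelihoods.

```python
# GMM-derived features: per-component log-likelihoods of each frame become
# the DNN input, so GMM adaptation propagates into the DNN front-end.
import numpy as np

rng = np.random.default_rng(5)
n_comp, dim = 8, 13
means = rng.normal(size=(n_comp, dim))
inv_var = np.ones((n_comp, dim))                 # unit diagonal covariances

def gmm_derived(frame, means, inv_var):
    """Per-component log-likelihoods under a diagonal GMM (constants dropped)."""
    diff = frame - means
    return -0.5 * np.sum(diff * diff * inv_var, axis=1)

frames = rng.normal(size=(4, dim))               # hypothetical MFCC frames
features = np.stack([gmm_derived(f, means, inv_var) for f in frames])
print(features.shape)                            # (4, 8): DNN input vectors

# Speaker adaptation then amounts to shifting `means` with a GMM technique
# such as MAP and recomputing the features, leaving the DNN weights untouched.
```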
APA, Harvard, Vancouver, ISO, and other styles
40

Ravindran, Sourabh. "Physiologically Motivated Methods For Audio Pattern Classification." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/14066.

Full text
Abstract:
Human-like performance by machines in tasks of speech and audio processing has remained an elusive goal. In an attempt to bridge the gap in performance between humans and machines, there has been an increased effort to study and model physiological processes. However, the widespread use of biologically inspired features proposed in the past has been hampered mainly by either a lack of robustness across a range of signal-to-noise ratios or formidable computational costs. In physiological systems, sensor processing occurs in several stages, and it is likely that signal features and biological processing techniques evolved together and are complementary or well matched. It is precisely for this reason that modeling the feature extraction processes should go hand in hand with modeling the processes that use these features. This research presents a front-end feature extraction method for audio signals inspired by the human peripheral auditory system. New developments in the field of machine learning are leveraged to build classifiers that maximize the performance gains afforded by these features. The structure of the classification system is similar to what might be expected in physiological processing. Further, the feature extraction and classification algorithms can be efficiently implemented on a low-power cooperative analog-digital signal processing platform. The usefulness of the features is demonstrated for tasks of audio classification, speech versus non-speech discrimination, and speech recognition. The low-power nature of the classification system makes it ideal for applications such as hearing aids, hand-held devices, and surveillance through acoustic scene monitoring.
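As a rough illustration of the kind of physiologically motivated front-end described here, the sketch below computes mel-like filterbank energies followed by cube-root compression, mimicking cochlear frequency analysis and the compressive nonlinearity of the auditory nerve. All parameter values and the specific nonlinearity are assumptions for illustration, not the thesis's exact model.

```python
import numpy as np

def compressive_filterbank_features(signal, sr=16000, n_fft=512,
                                    hop=160, n_bands=26):
    """Illustrative auditory-inspired front-end: short-time power
    spectrum -> triangular (mel-like) filterbank -> cube-root
    compression. Assumes signal is a 1-D float array longer than
    n_fft samples."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    spec = np.array([np.abs(np.fft.rfft(window * signal[i*hop:i*hop+n_fft]))**2
                     for i in range(n_frames)])
    # Mel-spaced triangular filterbank.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, ce, hi = bins[b], bins[b + 1], bins[b + 2]
        fbank[b, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[b, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    # Cube-root compression stands in for the cochlear nonlinearity.
    return (spec @ fbank.T) ** (1.0 / 3.0)
```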
APA, Harvard, Vancouver, ISO, and other styles
41

ITAKURA, Fumitada, Kazuya TAKEDA, Katsunobu ITOU, and Weifeng LI. "Single-Channel Multiple Regression for In-Car Speech Enhancement." Institute of Electronics, Information and Communication Engineers, 2006. http://hdl.handle.net/2237/15051.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Borges, Liselene de Abreu. "Sistemas de adaptação ao locutor utilizando autovozes." Universidade de São Paulo, 2001. http://www.teses.usp.br/teses/disponiveis/3/3142/tde-05052003-104044/.

Full text
Abstract:
This work describes two speaker adaptation techniques for speech recognition systems that use a small amount of adaptation data: Maximum Likelihood Linear Regression (MLLR) and eigenvoices. Both re-estimate the Gaussian means of a continuous-density hidden Markov model (HMM) system. MLLR estimates a set of linear transformations for the Gaussian mean parameters of the system. The eigenvoice technique is based on prior knowledge of inter-speaker variation; to obtain this prior knowledge, which is captured in the eigenvoices, principal component analysis (PCA) is applied. Adaptation tests were performed on an isolated-word, restricted-vocabulary recognition system. With a large amount of adaptation data (more than 70% of the vocabulary words), the eigenvoice technique did not show expressive results compared with MLLR. However, with a small amount of adaptation data (less than 15% of the vocabulary words), the eigenvoice technique outperformed MLLR.
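The following numpy sketch shows the core mechanics of eigenvoice adaptation as described above: PCA over stacked speaker supervectors, then placement of a new speaker in the resulting low-dimensional eigenspace. It is a simplification assumed for illustration; the standard method estimates the eigenspace coordinates by maximum likelihood rather than by the plain projection used here.

```python
import numpy as np

def eigenvoice_adapt(speaker_supervectors, new_speaker_estimate, n_eig=5):
    """Minimal eigenvoice sketch.

    speaker_supervectors : (S, M) stacked Gaussian-mean supervectors
                           of S training speakers
    new_speaker_estimate : (M,) rough supervector estimated from the
                           (small) adaptation set
    returns              : (M,) adapted supervector for the new speaker
    """
    mean_voice = speaker_supervectors.mean(axis=0)
    centered = speaker_supervectors - mean_voice
    # PCA via SVD: rows of vt are the eigenvoices.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvoices = vt[:n_eig]                       # (n_eig, M)
    # Constrain the new speaker to the span of the leading eigenvoices.
    coords = eigenvoices @ (new_speaker_estimate - mean_voice)
    return mean_voice + eigenvoices.T @ coords
```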
APA, Harvard, Vancouver, ISO, and other styles
43

Paliesek, Jakub. "Vliv akustiky prostředí na úspěšnost rozpoznávače řeči." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2021. http://www.nusl.cz/ntk/nusl-445549.

Full text
Abstract:
This diploma thesis deals with the impact of room acoustics on automatic speech recognition (ASR) accuracy. Experiments were evaluated on the LibriSpeech speech corpus and on ReverbDB, a database of impulse responses and noise. The ASR systems were based on the Mini LibriSpeech recipe for Kaldi. First, it was examined how well an ASR system can learn to transcribe speech in selected environments when the same acoustic conditions are used during training and testing. Next, experiments were carried out with modifications of the ASR architecture to achieve better robustness against new conditions, using methods for adaptation to room acoustics: r-vectors and i-vectors. It was shown that the recently proposed r-vector method is beneficial even when real impulse responses are used for data augmentation.
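The data augmentation mentioned here typically amounts to convolving clean speech with a room impulse response and mixing in noise at a chosen SNR. Below is a minimal numpy sketch of that step; the default SNR and peak normalisation are assumptions for illustration, not the thesis's exact recipe.

```python
import numpy as np

def reverberate(speech, rir, noise=None, snr_db=20.0):
    """Augment clean speech with a room impulse response (RIR) and
    optional additive noise at the requested SNR. All inputs are 1-D
    float arrays at the same sampling rate."""
    wet = np.convolve(speech, rir)[:len(speech)]
    if noise is not None:
        noise = np.resize(noise, len(wet))
        # Scale noise to the requested signal-to-noise ratio.
        p_sig = np.mean(wet ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
        wet = wet + scale * noise
    return wet / (np.max(np.abs(wet)) + 1e-12)   # peak-normalise
```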
APA, Harvard, Vancouver, ISO, and other styles
44

Kalantari, Shahram. "Improving spoken term detection using complementary information." Thesis, Queensland University of Technology, 2015. https://eprints.qut.edu.au/90074/1/Shahram_Kalantari_Thesis.pdf.

Full text
Abstract:
This research has made contributions to the area of spoken term detection (STD), defined as the process of finding all occurrences of a specified search term in a large collection of speech segments. The use of visual information, in the form of the speaker's lip movements, in addition to audio, together with the topic of the speech segments and the expected frequency of words in the target speech domain, is proposed. By using this complementary information, improvements in STD performance have been achieved, enabling efficient keyword search in large collections of multimedia documents.
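One simple way to combine such complementary evidence is a weighted log-linear fusion of the individual detection scores, sketched below. The weights, the fusion form, and the score ranges are assumptions for illustration only, not the method evaluated in the thesis.

```python
import numpy as np

def fused_std_score(audio_score, visual_score, topic_prior,
                    w=(0.6, 0.2, 0.2)):
    """Hypothetical fusion for spoken term detection: an acoustic
    detection score, a lip-reading (visual) score, and a topic-based
    prior for the search term, each assumed to lie in (0, 1]."""
    scores = np.array([audio_score, visual_score, topic_prior])
    return float(np.dot(w, np.log(scores + 1e-12)))

# A putative detection is accepted if the fused score clears a tuned
# threshold, e.g.: accept = fused_std_score(0.8, 0.6, 0.3) > threshold
```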
APA, Harvard, Vancouver, ISO, and other styles
45

Švec, Ján. "Adaptace rozpoznávače řeči na datech bez přepisu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234944.

Full text
Abstract:
The goal of this thesis is to design and test techniques for the unsupervised adaptation of speech recognizers on audio data without any textual transcripts. First, a training set is prepared and a baseline speech recognition system is trained. This system is then used to transcribe unseen data. We experiment with an adaptation-data selection process based on measures of speech transcript quality; the system is re-trained on this new set and its accuracy is evaluated. Finally, we experiment with the amount of adaptation data.
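A minimal sketch of the selection step described here: keep only automatically transcribed utterances whose recognizer confidence clears a threshold, and treat them as supervised data for re-training. The dictionary fields and the threshold value are assumptions for illustration.

```python
def select_adaptation_data(hypotheses, threshold=0.9):
    """Confidence-based selection of automatic transcripts for
    unsupervised adaptation.

    hypotheses : list of dicts like
                 {"audio": path, "text": hyp, "confidence": float}
    returns    : list of (audio, text) pairs to re-train on
    """
    selected = [h for h in hypotheses if h["confidence"] >= threshold]
    # The surviving (audio, transcript) pairs are then used as if they
    # were supervised training data for re-training the recognizer.
    return [(h["audio"], h["text"]) for h in selected]
```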
APA, Harvard, Vancouver, ISO, and other styles
46

Shukla, Saurabh. "Development of a Human-AI Teaming Based Mobile Language Learning Solution for Dual Language Learners in Early and Special Educations." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1547127943126526.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Lelong, Amélie. "Convergence phonétique en interaction / Phonetic convergence in interaction." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENT079/document.

Full text
Abstract:
The work presented in this manuscript is based on the study of a phenomenon called phonetic convergence, which postulates that two people in interaction tend to adapt the way they talk to their partner for communicative purposes. We developed a paradigm called "Verbal Dominoes" to collect a large corpus for characterizing this phenomenon, the ultimate goal being to endow an animated conversational agent with this adaptability in order to improve the quality of human-machine interactions. We carried out several studies to investigate the phenomenon between pairs of strangers, long-standing friends, and people from the same family, expecting the amplitude of convergence to be related to the social distance between the two speakers; this result was confirmed. We then studied the impact of knowledge of the linguistic target on adaptation. To characterize phonetic convergence, we developed two methods: the first based on a linear discriminant analysis between the MFCC coefficients of each speaker, and the second using speech recognition techniques. The latter method will allow the phenomenon to be studied under less controlled conditions. Finally, we characterized phonetic convergence with a subjective measurement, using a new perceptual test based on the "online" detection of a speaker switch. The test was performed using signals extracted from the interactions as well as signals obtained with adaptive synthesis based on harmonic-plus-noise (HNM) modeling. We obtained comparable results, demonstrating the quality of our adaptive synthesis.
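The discriminant-analysis idea above can be illustrated with a two-class Fisher separability score between the MFCC frames of the two speakers: if the score drops between the start and the end of an interaction, the speakers have converged. The sketch below is a simplification assumed for illustration, not the thesis's exact procedure.

```python
import numpy as np

def speaker_separability(mfcc_a, mfcc_b):
    """Fisher separability between two speakers' MFCC frames.

    mfcc_a, mfcc_b : (N, D) MFCC frames of each speaker
    returns        : scalar; smaller means the voices are harder to
                     tell apart (i.e. more converged)
    """
    mu_a, mu_b = mfcc_a.mean(axis=0), mfcc_b.mean(axis=0)
    # Pooled within-class scatter (regularised for invertibility).
    sw = np.cov(mfcc_a.T) + np.cov(mfcc_b.T) + 1e-6 * np.eye(len(mu_a))
    w = np.linalg.solve(sw, mu_a - mu_b)        # Fisher direction
    proj_a, proj_b = mfcc_a @ w, mfcc_b @ w
    return abs(proj_a.mean() - proj_b.mean()) / np.sqrt(
        proj_a.var() + proj_b.var() + 1e-12)
```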
APA, Harvard, Vancouver, ISO, and other styles
48

Sam, Sethserey. "Vers une adaptation autonome des modèles acoustiques multilingues pour le traitement automatique de la parole." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00685204.

Full text
Abstract:
Automatic speech recognition technologies are now integrated into many systems. However, the performance of speech recognition systems for non-native speakers still suffers from high error rates, owing to the mismatch between non-native speech and the trained models. Collecting large quantities of non-native speech recordings covering all speaker origins is generally a very difficult and unrealistic task. This thesis focuses on improving multilingual acoustic models for the phonetic transcription of "multilingual meeting" speech. Handling this type of speech poses several challenges: 1) there may be conversation between native and non-native speakers; 2) there is non-native speech not just in one language, but in several languages spoken by speakers of various origins; 3) it is difficult to collect enough data to bootstrap the transcription systems. To address these challenges, we propose a process for adapting multilingual acoustic models that we call "autonomous adaptation". In autonomous adaptation, we study several approaches for adapting multilingual acoustic models in an unsupervised manner (the languages spoken and the speakers' origins are not known in advance) and without using any additional data during the adaptation process. The approaches studied are organized into two modules. The first module, called the "language observer", recovers the linguistic characteristics (the languages spoken and the speakers' origins) of the segments to be decoded. The second module adapts the multilingual acoustic model according to the knowledge provided by the language observer. To evaluate the usefulness of the autonomous adaptation of a multilingual acoustic model, we use test data extracted from multilingual meetings, containing native and non-native speech in three languages: English (EN), French (FR), and Vietnamese (VN). According to the experimental results, autonomous adaptation gives promising results for non-native speech but slightly degrades performance on native speech. To improve the overall performance of the transcription systems on both native and non-native speech, we study several approaches for detecting non-native speech and propose cascading such a detector with our autonomous adaptation process. The results obtained in this way are the best among all the experiments carried out on our multilingual meeting corpus.
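Structurally, the two-module cascade described above can be sketched as follows. All four callables are placeholders introduced for illustration; they do not correspond to a real API.

```python
def transcribe_with_autonomous_adaptation(segments, multilingual_am,
                                          observe, adapt, decode):
    """Sketch of the 'autonomous adaptation' cascade: a language
    observer first estimates the linguistic characteristics of each
    segment, then the multilingual acoustic model is adapted
    accordingly before decoding."""
    results = []
    for seg in segments:
        traits = observe(seg)                 # e.g. {"lang": "FR", ...}
        am = adapt(multilingual_am, traits)   # unsupervised adaptation
        results.append(decode(am, seg))       # phonetic transcription
    return results
```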
APA, Harvard, Vancouver, ISO, and other styles
49

Hromádko, Michal. "Jednoduchý diktovací systém." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-237008.

Full text
Abstract:
This master's thesis deals with the design and development of a simple dictation system. It explains the methods used for speech recognition and describes existing systems. The design of the system focuses primarily on creating a graphical user interface, with a strong emphasis on user friendliness.
APA, Harvard, Vancouver, ISO, and other styles
50

Tran, David, and Jonathan Böcker. "Virtual office assistant on Magic Mirror." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20231.

Full text
Abstract:
Every second, major companies such as Google, Apple, Amazon and Microsoft collect a great amount of data from users. Photos, voice recordings, texts, etc. are stored in the companies' massive server parks. With this amount of data, along with technical advantages such as computing power and superior algorithms, these companies can train their machine learning models to levels that are hard for a local computing environment to reach. Nowadays, the companies allow developers to use their services, and this paper describes the benefits of doing so. The aim of this thesis is to show how cloud-based face recognition and speech recognition can be utilized to create a virtual assistant in a Magic Mirror. This new concept opens new possibilities for human-computer interaction. The use case for the assistant was to aid visitors who come into an office for an appointment with an employee. The prototype of the assistant showed 94% accuracy when identifying faces and fulfilled its task well when the employee's name was internationally known, while having difficulties with others, e.g. Swedish names.
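A minimal sketch of the assistant's control flow, under the assumption that cloud services handle face identification and speech transcription while the mirror matches the requested name against an office directory. Both API clients and their methods are hypothetical placeholders; no real vendor endpoints are implied.

```python
def greet_visitor(camera_frame, microphone_audio,
                  cloud_face_api, cloud_speech_api, directory):
    """Hypothetical Magic Mirror assistant flow: identify the visitor,
    transcribe the spoken request, and look up the requested employee
    in a list of directory names."""
    visitor = cloud_face_api.identify(camera_frame)       # who is here?
    request = cloud_speech_api.transcribe(microphone_audio)
    employee = next((e for e in directory
                     if e.lower() in request.lower()), None)
    if employee is None:
        return "Sorry, I could not find that employee."
    greeting = f"Welcome, {visitor}! " if visitor else "Welcome! "
    return greeting + f"{employee} has been notified of your arrival."
```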
APA, Harvard, Vancouver, ISO, and other styles