Dissertations / Theses on the topic 'Automated Speech Recognition'



Consult the top 50 dissertations / theses for your research on the topic 'Automated Speech Recognition.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Davies, David Richard Llewellyn. "Representing Time in Automated Speech Recognition." The Australian National University, Research School of Information Sciences and Engineering, 2003. http://thesis.anu.edu.au./public/adt-ANU20040602.163031.

Abstract:
This thesis explores the treatment of temporal information in Automated Speech Recognition. It reviews the study of time in speech perception and concludes that while some temporal information in the speech signal is of crucial value in the speech decoding process, not all temporal information is relevant to decoding. We then review the representation of temporal information in the main automated recognition techniques: Hidden Markov Models and Artificial Neural Networks. We find that both techniques have difficulty representing the type of temporal information that is phonetically or phonologically significant in the speech signal. In an attempt to improve this situation, we explore the problem of representing temporal information in the acoustic vectors commonly used to encode the speech acoustic signal in the front-ends of speech recognition systems. We attempt, where possible, to let the signal provide the temporal structure rather than imposing a fixed, clock-based timing framework. We develop a novel acoustic temporal parameter (the Parameter Similarity Length), a measure of temporal stability, which is tested against the time derivatives of acoustic parameters conventionally used in acoustic vectors.
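For context, conventional time-derivative (delta) features are computed as a regression over neighbouring frames; the sketch below computes them alongside a simple run-length measure of temporal stability. The `stability_length` function and its threshold are illustrative assumptions only, not Davies's definition of the Parameter Similarity Length.

```python
import numpy as np

def delta(features, N=2):
    """Conventional delta (time-derivative) coefficients over +/-N frames."""
    T, _ = features.shape
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

def stability_length(features, threshold=0.1):
    """Hypothetical stability measure: for each frame, how many subsequent
    frames stay within a fixed Euclidean distance of it."""
    T = features.shape[0]
    lengths = np.zeros(T, dtype=int)
    for t in range(T):
        k = t + 1
        while k < T and np.linalg.norm(features[k] - features[t]) < threshold:
            k += 1
        lengths[t] = k - t - 1
    return lengths
```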
2

Sooful, Jayren Jugpal. "Automated phoneme mapping for cross-language speech recognition." Diss., Pretoria [s.n.], 2004. http://upetd.up.ac.za/thesis/available/etd-01112005-131128.

3

Layouss, Nizar Gandy Assaf. "A critical examination of deep learning approaches to automated speech recognition." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-153681.

Abstract:
Recently, deep learning techniques have been successfully applied to automatic speech recognition (ASR) tasks. Most current speech recognition systems use Hidden Markov Models (HMMs) to deal with the temporal variability of speech, and Gaussian mixture models (GMMs) are exploited to model the emission probability of the HMM. Deep Neural Networks (DNNs) and Deep Belief Networks (DBNs) have recently been shown to outperform GMMs in modelling the emission probability in HMMs. Deep architectures such as DBNs with many hidden layers are useful for multilevel feature representation, building a distributed representation of a given input at different levels. These networks are first pre-trained as a multi-layer generative model of a window of feature vectors, without making use of any discriminative information, in unsupervised mode. Once the generative pre-training is complete, discriminative fine-tuning is performed to adjust the model parameters to make them better at prediction. Our aim is to study the different levels of representation for speech acoustic features that are produced by the hidden layers of DBNs. To this end, we estimate phoneme recognition error and use classification accuracy evaluated with Support Vector Machines (SVMs) as a measure of separability between the DBN representations of 61 phoneme classes. In addition, we investigate the relation between different subgroups/categories of phonemes at various representation levels using correlation analysis. The tests have been performed on the TIMIT database, and simulations have been developed to run on a graphics processing unit (GPU) cluster at PDC/KTH.
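The separability evaluation described above can be pictured with a short sketch (an illustration, not the authors' code): a linear SVM is trained on the per-frame activations of each hidden layer, and its cross-validated accuracy serves as the separability measure. The arrays `activations_per_layer` and `phone_labels` are assumed to be precomputed from the pre-trained DBN and the TIMIT labels.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def layer_separability(activations_per_layer, phone_labels):
    """Score each DBN layer by how linearly separable its representation
    makes the 61 phoneme classes (higher accuracy = more separable)."""
    scores = []
    for layer_idx, feats in enumerate(activations_per_layer):
        clf = LinearSVC(C=1.0, max_iter=10000)
        acc = cross_val_score(clf, feats, phone_labels, cv=5).mean()
        scores.append((layer_idx, acc))
    return scores
```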
4

Dookhoo, Raul. "Automated Regression Testing Approach to Expansion and Refinement of Speech Recognition Grammars." Master's thesis, University of Central Florida, 2008. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2634.

Abstract:
This thesis describes an approach to automated regression testing for speech recognition grammars. A prototype Audio Regression Tester called ART has been developed using Microsoft's Speech API and C#. ART allows a user to perform any of three tasks: automatically generate a new XML-based grammar file from standardized SQL database entries, record and cross-reference audio files for use by an underlying speech recognition engine, and perform regression tests with the aid of an oracle grammar. ART takes as input a wave sound file containing speech and a newly created XML grammar file. It then simultaneously executes two tests: one with the wave file and the new grammar file, and the other with the wave file and the oracle grammar. The comparison of the two results is used to determine whether the test was successful. This allows rapid, exhaustive evaluation of additions to grammar files to guarantee forward progress as the complexity of the voice domain grows. The data used in this research to derive results were taken from the LifeLike project; however, the capabilities of ART extend beyond LifeLike. The results gathered have shown that using a person's recorded voice for regression testing is as effective as having the person do live testing. A cost-benefit analysis, using two published equations, one for cost and the other for benefit, was also performed to determine whether automated regression testing is really more effective than manual testing. Cost captures the salaries of the engineers who perform regression testing tasks, and benefit captures revenue gains or losses related to changes in product release time. ART had a higher benefit of $21,461.08 when compared to manual regression testing, which had a benefit of $21,393.99. Coupled with its excellent error detection rates, ART has proven to be very efficient and cost-effective in speech grammar creation and refinement.
M.S., Computer Science, School of Electrical Engineering and Computer Science.
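ART itself is written in C# against Microsoft's Speech API; the sketch below only illustrates the oracle-grammar comparison it performs. The `recognize(wav_path, grammar_path)` callable is a hypothetical stand-in for the underlying recognition engine.

```python
def regression_test(wav_path, new_grammar, oracle_grammar, recognize):
    """Run the same utterance through the new and oracle grammars and
    flag a regression when their recognition results disagree."""
    new_result = recognize(wav_path, new_grammar)
    oracle_result = recognize(wav_path, oracle_grammar)
    return {"wav": wav_path, "new": new_result, "oracle": oracle_result,
            "passed": new_result == oracle_result}

def run_suite(wav_files, new_grammar, oracle_grammar, recognize):
    results = [regression_test(w, new_grammar, oracle_grammar, recognize)
               for w in wav_files]
    failures = [r for r in results if not r["passed"]]
    return results, failures
```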
5

Tsuchiya, Shinsuke. "Elicited Imitation and Automated Speech Recognition: Evaluating Differences among Learners of Japanese." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2782.

Abstract:
This study addresses the usefulness of elicited imitation (EI) and automated speech recognition (ASR) as a tool for second language acquisition (SLA) research by evaluating differences among learners of Japanese. The findings indicate that the EI and ASR grading system used in this study was able to differentiate between beginning- and advanced-level learners as well as instructed and self-instructed learners. No significant difference was found between self-instructed learners with and without post-mission instruction. The procedure, reliability and validity of the ASR-based computerized EI are discussed. Results and discussion will provide insights regarding different types of second language (L2) development, the effects of instruction, implications for teaching, as well as limitations of the EI and ASR grading system.
6

Brashear, Helene Margaret. "Improving the efficacy of automated sign language practice tools." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/34703.

Abstract:
The CopyCat project is an interdisciplinary effort to create a set of computer-aided language learning tools for deaf children. The CopyCat games allow children to interact with characters using American Sign Language (ASL). Through Wizard of Oz pilot studies we have developed a set of games, shown their efficacy in improving young deaf children's language and memory skills, and collected a large corpus of signing examples. Our previous implementation of the automatic CopyCat games uses automatic sign language recognition and verification in the infrastructure of a memory repetition and phrase verification task. The goal of my research is to expand the automatic sign language system to transition the CopyCat games to include the flexibility of a dialogue system. I have created a labeling ontology from analysis of the CopyCat signing corpus, and I have used the ontology to describe the contents of the CopyCat data set. This ontology was used to change and improve the automatic sign language recognition system and to add flexibility to language use in the automatic game.
7

Morton, Hazel. "A scenario based approach to speech-enabled computer assisted language learning based on automated speech recognition and virtual reality graphics." Thesis, University of Edinburgh, 2007. http://hdl.handle.net/1842/15438.

Abstract:
By using speech recognition technology, Computer Assisted Language Learning (CALL) programs can provide learners with opportunities to practise speaking in the target language and develop their oral language skills. This research is a contribution to the emerging and innovative area of speech-enabled CALL applications. It describes a CALL application, SPELL (Spoken Electronic Language Learning), which integrates software for speaker-independent continuous speech recognition with embodied virtual agents and virtual worlds to create an immersive environment in which learners can converse in the target language in contextualized scenarios. The design of the program is based on a communicative approach to second language acquisition which posits that learning activities should give learners opportunities to communicate in the target language in meaningful contexts. In applying a communicative approach to the design of a CALL program, the speech recogniser is programmed to allow a variety of responses from the learner and to recognise grammatical and ungrammatical utterances so that the learner can receive relevant and immediate feedback on their utterance. Feedback takes two key forms: reformations, where the system repeats or reformulates the agent's initial speech, and recasts, where the system repeats the learner's utterance, implicitly correcting any errors. This research claims that speech-enabled CALL systems which employ an open-ended approach to the recognition grammars and which adopt a communicative approach are usable, engaging and motivating conversational tools for language learners. In addition, by employing implicit feedback strategies in the design, speech recognition errors can be mitigated such that interactions between learners and embodied virtual agents can proceed while providing learners with valuable target language input during the interactions. These claims are based on a series of three empirical studies conducted with end users of the system.
8

Gargett, Ross. "The Use of Automated Speech Recognition in Electronic Health Records in Rural Health Care Systems." Digital Commons @ East Tennessee State University, 2016. https://dc.etsu.edu/honors/340.

Abstract:
Since the HITECH (Health Information Technology for Economic and Clinical Health) Act was enacted, healthcare providers have been required to achieve "Meaningful Use." CPOE (Clinical Provider Order Entry) is one such requirement. Many providers prefer to dictate their orders rather than type them. Medical vocabulary is laden with its own terminology and department-specific acronyms, and many ASR (Automated Speech Recognition) systems are not trained to interpret this language. The purpose of this thesis research was to investigate the use and effectiveness of ASR in the healthcare industry. Multiple hospitals and multiple clinicians agreed to be followed through their use of an ASR system to enter patient data into the record. As a result of this research, the effectiveness and use of the ASR was examined, and multiple issues with the use and accuracy of the system were uncovered.
9

Zylich, Brian Matthew. "Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos." Digital WPI, 2019. https://digitalcommons.wpi.edu/etd-theses/1289.

Abstract:
We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.
10

Alcaraz, Meseguer Noelia. "Speech Analysis for Automatic Speech Recognition." Thesis, Norwegian University of Science and Technology, Department of Electronics and Telecommunications, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9092.

Abstract:
The classical front-end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set of these is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel-frequency scale), and then decorrelated using a modified discrete cosine transform. Following a focused introduction on speech production, perception and analysis, this thesis studies the implementation of a speech generative model, whereby the speech is synthesized and recovered back from its MFCC representations. The work has been developed in two steps: first, the computation of the MFCC vectors from the source speech files using the HTK software; and second, the implementation of the generative model itself, which represents the conversion chain from HTK-generated MFCC vectors to speech reconstruction. In order to assess the goodness of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed. For that, spectral models based on Linear Prediction Coding (LPC) analysis have been used. During the implementation of the generative model, some results have been obtained in terms of the reconstruction of the spectral representation and the quality of the synthesized speech.
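The analysis/resynthesis loop described above can be approximated with modern tooling; the thesis used HTK, so the librosa-based sketch below is a stand-in under that assumption, and a crude RMS log-spectral distance replaces the LPC-based measure used in the thesis.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # illustrative input file

# Analysis: parametrize the signal into 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Generative model: invert the MFCC representation back to a waveform.
y_hat = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)

# Spectral distance between original and reconstruction.
n = min(len(y), len(y_hat))
S = np.abs(librosa.stft(y[:n])) + 1e-10
S_hat = np.abs(librosa.stft(y_hat[:n])) + 1e-10
dist_db = np.sqrt(np.mean((20 * np.log10(S / S_hat)) ** 2))
print(f"RMS log-spectral distance: {dist_db:.2f} dB")
```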
11

Gabriel, Naveen. "Automatic Speech Recognition in Somali." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.

Abstract:
The field of speech recognition has left the research stage during the last decade and found its way into the public market; today, speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, automatic speech recognition using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system in which the framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. This thesis begins with background on speech recognition, and a literature survey covers some of the work that has been done in this field. The thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. Log scores were also calculated for the top 5 predicted sentences, along with confidence measures of the predicted sentences. The model was trained on 4.5 hours of voiced data and its corresponding transcription, and evaluated on 3 minutes of test data. The performance of the trained model on the test set was good, given that the data was devoid of background noise and lacked variability. Performance is measured using word error rate (WER) and sentence error rate (SER), and is also compared with the results of other research work. The thesis further discusses why the log and confidence scores of a sentence might not be a good way to measure the performance of the resulting model, the shortcomings of the HMM-GMM model, how the existing model can be improved, and different alternatives to solve the problem.
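Word error rate, the headline metric reported here, is the word-level edit distance between the reference and the hypothesis, normalised by the reference length; a minimal implementation (the example strings are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("waa maxay", "waa maxa"))  # one substitution -> 0.5
```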
12

Al-Shareef, Sarah. "Conversational Arabic Automatic Speech Recognition." Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/10145/.

Abstract:
Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are generally considered to be mother tongues in those regions. CA has limited textual resources because it exists only as a spoken language, without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed, which has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure, where words can be created by attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues behind the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage and insufficient training data for written CA on language modelling is investigated; second, better models for the acoustics and pronunciations are obtained by learning to transfer between written and spoken forms. Several original contributions result from each direction. Data-driven classes from decomposed text are shown to reduce the out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task.
13

Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

14

Uebler, Ulla. "Multilingual speech recognition /." Berlin : Logos Verlag, 2000. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=009117880&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

15

Seward, Alexander. "Efficient Methods for Automatic Speech Recognition." Doctoral thesis, KTH, Tal, musik och hörsel, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3675.

Abstract:
This thesis presents work in the area of automatic speech recognition (ASR). The thesis focuses on methods for increasing the efficiency of speech recognition systems and on techniques for efficient representation of different types of knowledge in the decoding process. In this work, several decoding algorithms and recognition systems have been developed, aimed at various recognition tasks. The thesis presents the KTH large vocabulary speech recognition system. The system was developed for online (live) recognition with large vocabularies and complex language models. The system utilizes weighted transducer theory for efficient representation of different knowledge sources, with the purpose of optimizing the recognition process. A search algorithm for efficient processing of hidden Markov models (HMMs) is presented. The algorithm is an alternative to the classical Viterbi algorithm for fast computation of shortest paths in HMMs. It is part of a larger decoding strategy aimed at reducing the overall computational complexity in ASR. In this approach, all HMM computations are completely decoupled from the rest of the decoding process. This enables the use of larger vocabularies and more complex language models without an increase of HMM-related computations. Ace is another speech recognition system developed within this work. It is a platform aimed at facilitating the development of speech recognizers and new decoding methods. A real-time system for low-latency online speech transcription is also presented. The system was developed within a project with the goal of improving the possibilities for hard-of-hearing people to use conventional telephony by providing speech-synchronized multimodal feedback. This work addresses several additional requirements implied by this special recognition task.
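For reference, the classical Viterbi recursion that this work proposes an alternative to computes the most likely HMM state sequence as a shortest-path problem in the log domain; a compact numpy sketch:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state path through an HMM.
    log_A: (S, S) log transition probabilities
    log_B: (T, S) per-frame log emission likelihoods
    log_pi: (S,) log initial-state probabilities
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (S, S): previous state -> next state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):        # trace the backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```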
16

Vipperla, Ravichander. "Automatic Speech Recognition for ageing voices." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5725.

Abstract:
With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and improve the ASR accuracies for older voices. Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults, and the decrease in accuracy is greater for male speakers than for females. Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER for the two age groups is proven incorrect: experiments with artificial introduction of glottal source disfluencies in speech from younger adults do not display a significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have a marginal impact on ASR accuracies. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes, especially lower vowels, being more affected by ageing. These changes, however, are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments to analyse the impact of slower speaking rate on ASR accuracies indicate that insertion errors increase when decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracy based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracy are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models which also outperform the baseline models by a good margin. Finally, we hypothesize that ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
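Jitter and shimmer, the glottal-source measures analysed above, are commonly defined as the mean cycle-to-cycle variation of pitch periods and peak amplitudes respectively; a sketch under the assumption that pitch periods and per-cycle amplitudes have already been extracted by a pitch tracker:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    normalised by the mean period (often reported as a percentage)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same cycle-to-cycle measure applied to per-cycle peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example: pitch periods in seconds (illustrative values).
print(local_jitter([0.0100, 0.0102, 0.0099, 0.0103]))
```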
17

Guzy, Julius Jonathan. "Automatic speech recognition : a refutation approach." Thesis, De Montfort University, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.254196.

18

Deterding, David Henry. "Speaker normalisation for automatic speech recognition." Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.359822.

19

Badr, Ibrahim. "Pronunciation learning for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66022.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.
In many ways, the lexicon remains the Achilles' heel of modern automatic speech recognizers (ASRs). Unlike stochastic acoustic and language models that learn the values of their parameters from training data, the baseform pronunciations of words in an ASR vocabulary are typically specified manually and do not change unless they are edited by an expert. Our work presents a novel generative framework that uses speech data to learn stochastic lexicons, thereby taking a step towards alleviating the need for manual intervention and automatically learning high-quality baseform pronunciations for words. We test our model on a variety of domains: an isolated-word telephone speech corpus, a weather query corpus and an academic lecture corpus. We show significant improvements of 25%, 15% and 2% over expert-pronunciation lexicons, respectively. We also show that further improvements can be made by combining our pronunciation learning framework with acoustic model training.
20

Chen, Chia-Ping. "Noise robustness in automatic speech recognition /." Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/5829.

21

Ragni, Anton. "Discriminative models for speech recognition." Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.707926.

22

Evans, N. W. D. "Spectral subtraction for speech enhancement and automatic speech recognition." Thesis, Swansea University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.636935.

Abstract:
The contributions made in this thesis relate to an extensive investigation of spectral subtraction in the context of speech enhancement and noise-robust automatic speech recognition (ASR), and to the morphological processing of speech spectrograms. Three sources of error in a spectral subtraction approach are identified and assessed with ASR. The effects of phase, cross-term component and spectral magnitude errors are assessed in a common spectral subtraction framework. ASR results confirm that, except for extreme noise conditions, phase and cross-term component errors are relatively negligible compared to noise estimate errors. A taxonomy classifying approaches to spectral subtraction into power and magnitude, linear and non-linear spectral subtraction is proposed. Each class is assessed and compared under otherwise identical experimental conditions; these experiments are thought to be the first to assess the four combinations under such controlled conditions. ASR results illustrate a lesser sensitivity to noise over-estimation for non-linear approaches. With a view to practical systems, different approaches to noise estimation are investigated. In particular, approaches that do not require explicit voice activity detection are assessed and shown to compare favourably to the conventional approach, the latter requiring explicit voice activity detection. Following on from this finding, a new computationally efficient approach to noise estimation that does not require explicit voice activity detection is proposed. Investigations into the fundamentals of spectral subtraction highlight the limitation of noise estimates: statistical estimates obtained from a number of analysis frames lead to relatively poor representations of the instantaneous values. To ameliorate this situation, estimates from neighbouring, lateral frequencies are used to complement within-bin (from the same frequency) statistical approaches. Improvements are found to be negligible. However, the principle of these lateral estimates leads naturally to the final stage of the work presented in this thesis, that of morphologically filtering speech spectrograms. This form of processing is examined for both synthesised and real speech signals, and promising ASR performance is reported. In 2000 the Aurora 2 database was introduced by the organisers of a special session at Eurospeech 2001 entitled 'Noise Robust Recognition', aimed at providing a standard database and experimental protocols for the assessment of noise-robust ASR. This facility, when it became available, was used for the work described in this thesis.
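For reference, a minimal power spectral subtraction sketch (the over-subtraction factor, spectral floor, and first-frames noise estimate below are illustrative choices, standing in for the estimation schemes compared in the thesis):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, alpha=2.0, beta=0.01):
    """Power spectral subtraction: |S|^2 = max(|X|^2 - alpha*|N|^2, beta*|X|^2)."""
    f, t, X = stft(x, fs, nperseg=512)
    power = np.abs(X) ** 2
    # Crude noise estimate: average the first few (assumed speech-free) frames.
    noise_power = power[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_power = np.maximum(power - alpha * noise_power, beta * power)
    # Keep the noisy phase, as is conventional in spectral subtraction.
    S = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    _, y = istft(S, fs, nperseg=512)
    return y
```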
23

Thambiratnam, David P. "Speech recognition in adverse environments." Thesis, Queensland University of Technology, 1999. https://eprints.qut.edu.au/36099/1/36099_Thambiratnam_1999.pdf.

Abstract:
This thesis presents a study of techniques used to improve the performance of small-vocabulary, isolated-word, speaker-dependent automatic speech recognition systems in adverse environments. Such systems are applicable to 'command and control' applications, for example industrial applications where machines are controlled by voice, providing hands-free and eyes-free operation. Adverse environments present the largest obstacle to the deployment of accurate and usable speech recognition systems, because they cause discrepancies between training and testing environments. Two solutions to the problem are investigated. The first is the use of secondary modelling of the output probability distribution of the primary classifiers. It is shown that a significant improvement in performance is obtained for a small-vocabulary isolated-word speaker-dependent system operating in an adverse environment. Results are presented of simulations using the NOISEX database as well as of trials in an actual factory environment using a real-time system. Based on the outcome of this research, a voice-operated parcel sorting machine has been installed at the Australia Post Mail Centre at Underwood, Queensland. A pilot study is also undertaken on the use of lip information to enhance speech recognition accuracy in adverse environments. It is shown that the inclusion of other data sources can improve the performance of a speech recognition system.
24

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids." Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.285064.

25

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids." Rennes 1, 1999. http://www.theses.fr/1999REN10033.

Abstract:
This thesis concerns speech dereverberation in the specific contexts of application to hearing aids and automatic speech recognition. The methods considered must remain functional in conditions where the acoustic channels involved are unknown and variable. We therefore propose to discriminate reverberation from the direct signal using properties of reverberation that are independent of the acoustic channel. The spatial correlation of the signals, their directions of arrival and their temporal supports lead to different methods, which are examined in turn. After a review of the state of the art in methods based on the spatial decorrelation of late reverberation and their limits, we suggest improvements to one of the most widely used algorithms. We then present a new spatially selective algorithm that attenuates reverberation contributions according to their direction. This algorithm is complementary to the previous one; both use two sensors. Finally, we propose an original method that effectively attenuates the overlap-masking effect of reverberation. The methods are evaluated using various objective measures (noise reduction factor, SNR gain, cepstral distance and automatic speech recognition scores). Trials combining the different methods demonstrate the potential benefit of such associations.
26

Couper Kenney, Fiona. "Automatic determination of sub-word units for automatic speech recognition." Thesis, University of Edinburgh, 2008. http://hdl.handle.net/1842/2788.

Abstract:
Current automatic speech recognition (ASR) research is focused on recognition of continuous, spontaneous speech. Spontaneous speech contains a lot of variability in the way words are pronounced, and canonical pronunciations of each word are not true to the variation that is seen in real data. Two of the components of an ASR system are acoustic models and pronunciation models, and the variation within spontaneous speech must be accounted for by these components. Phones, or context-dependent phones, are typically used as the base sub-word unit, and one acoustic model is trained for each sub-word unit. Pronunciation modelling largely takes place in a dictionary, which relates words to sequences of phones. Acoustic modelling and pronunciation modelling overlap, and the two are not clearly separable in modelling pronunciation variation. Techniques that find pronunciation variants in the data and then reflect these in the dictionary have not provided the expected gains in recognition. An alternative approach to modelling pronunciations in terms of phones is to derive units automatically: using data-driven methods to determine an inventory of sub-word units, their acoustic models, and their relationship to words. This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are: (1) automatic and simultaneous generation of a sub-word unit inventory and acoustic model set, using an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion; and (2) automatic generation of probabilistic dictionaries using joint multigrams. The prerequisites of this approach are fewer than in previous work on unit derivation; notably, the timings of word boundaries are not required here. The approach is language independent since it is entirely data-driven and no linguistic information is required. The dictionary generation method outperforms a supervised method using phonetic data. The automatically derived units and dictionary perform reasonably on a small spontaneous speech task, although they do not yet outperform phones.
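The Bayesian Information Criterion used to control the ergodic HMM's complexity trades log likelihood against model size; in the form commonly used for model selection (a generic statement, not necessarily the thesis's exact penalty):

```latex
\mathrm{BIC}(M) \;=\; \log L\!\left(X \mid \hat{\theta}_M\right) \;-\; \frac{k_M}{2}\,\log N
```

where L is the likelihood of the data X under candidate model M with maximum-likelihood parameters, k_M is the number of free parameters of M, and N is the number of observations; among candidate unit inventories, the model with the highest BIC is retained.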
27

Gillespie, Bradford W. "Strategies for improving audible quality and speech recognition accuracy of reverberant speech /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/5930.

28

Johnston, Samuel John Charles. "An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech." Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/625626.

Abstract:
Speech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have resulted in many solutions, though none so far have completely solved the problem. Current solutions generally require some form of estimation of the noise, in order to remove it from the signal. The limitation is that noise can be highly unpredictable and highly variable, both in form and loudness. The present report proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. This method utilizes the human skull as a noise-attenuation device by placing the microphone in the ear canal. For further noise dampening, a pair of noise-reduction earmuffs are used over the speakers' ears. A corpus of speech was recorded with a microphone in the ear canal, while simultaneously recording speech at the mouth. Noise was emitted from a loudspeaker in the background. Following the data collection, the speech recorded at the ear was analyzed. A substantial noise-reduction benefit was found over mouth-recorded speech. However, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants in this experiment were significantly more likely to understand ear-recorded speech than the noisy, mouth-recorded speech, yet they found mouth-recorded speech with no noise the easiest to understand. These recordings were also used with an ASR system. Since ear-recorded speech is missing much high-frequency information, the system did not recognize it readily. However, when an acoustic model was trained on low-pass filtered speech, performance improved. These experiments demonstrated that humans, and likely an ASR system with additional training, would be able to recognize ear-recorded speech more easily than speech in noise. Further speech processing and training may be able to improve the signal's intelligibility for both human and automatic speech recognition.
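The "minor processing" described, amplifying mid-range frequencies, can be pictured as mixing a band-passed copy back into the signal; the band edges and gain below are illustrative assumptions, not the study's values.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def boost_midrange(x, fs, low_hz=1000.0, high_hz=3000.0, gain_db=6.0):
    """Amplify a mid-frequency band by mixing in a band-passed copy."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, x)
    gain = 10 ** (gain_db / 20.0) - 1.0  # extra energy added within the band
    return x + gain * band
```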
29

Kleinschmidt, Tristan Friedrich. "Robust speech recognition using speech enhancement." Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/31895/1/Tristan_Kleinschmidt_Thesis.pdf.

Abstract:
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In "noise-free" environments, word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess enhancement performance for intelligibility, not speech recognition, therefore making them sub-optimal for ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also identified as particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy over a wide range of SNRs. Throughout the dissertation, consideration is given to practical implementation in vehicular environments, which resulted in two novel contributions: a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus, which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
30

Tabani, Hamid. "Low-power architectures for automatic speech recognition." Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/462249.

Abstract:
Automatic Speech Recognition (ASR) is one of the most important applications in the area of cognitive computing. Fast and accurate ASR is emerging as a key application for mobile and wearable devices. These devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction, and this trend towards voice-based user interfaces is likely to continue in the coming years, changing the way humans and machines interact. Effective speech recognition systems require real-time recognition, which is challenging for mobile devices due to the compute-intensive nature of the problem and the power constraints of such systems. GPU architectures offer parallelization capabilities which can be exploited to increase the performance of speech recognition systems. However, efficiently utilizing GPU resources for speech recognition is also challenging, as the software implementations exhibit irregular and unpredictable memory accesses and poor temporal locality. The purpose of this thesis is to study the characteristics of ASR systems running on low-power mobile devices in order to propose different techniques to improve performance and energy consumption. We propose several software-level optimizations driven by power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. We use a refactored implementation of the GMM evaluation code to ameliorate the impact of branches. Then, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. In addition, we compute the Gaussians for multiple frames in parallel, significantly reducing memory bandwidth usage. Our experimental results show that the proposed optimizations provide a 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61% energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59% energy savings without any loss in the accuracy of the ASR system. Secondly, we propose a register renaming technique that exploits register reuse to reduce the pressure on the register file. Our technique leverages physical register sharing by introducing minor changes in the register map table and the issue queue. We evaluated our renaming technique on top of a modern out-of-order processor. The proposed scheme supports precise exceptions, and we show that it results in 9.5% performance improvements for GMM evaluation. Our experimental results show that the proposed register renaming scheme provides a 6% speedup on average for the SPEC2006 benchmarks; alternatively, it achieves the same performance while reducing the number of physical registers by 10.5%. Finally, we propose a hardware accelerator for GMM evaluation that reduces energy consumption by three orders of magnitude compared to solutions based on CPUs and GPUs. The proposed accelerator implements a lazy evaluation scheme where Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the GMM parameters, which results in 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and a 3532x energy reduction when compared with a highly-tuned implementation running on a modern mobile CPU. Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves a 5.89x speedup over a highly optimized CUDA implementation, while reducing energy by 241x.
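The frame-parallel, layout-conscious GMM evaluation idea can be sketched in numpy (a stand-in for the hand-vectorized CPU code, not the thesis's implementation): means and variances live in contiguous (n_gaussians, dim) arrays, and the diagonal-covariance log-likelihood is computed for a whole batch of frames at once.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Diagonal-covariance GMM log-likelihood for a batch of frames.
    frames: (T, D); weights: (G,); means, variances: (G, D) contiguous
    arrays, mirroring a structure-of-arrays layout that favours vectorization."""
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]              # (T, G, D)
    exponent = -0.5 * np.sum(diff * diff / variances, axis=2)  # (T, G)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = np.log(weights) + log_norm + exponent           # (T, G)
    m = log_comp.max(axis=1, keepdims=True)                    # stable log-sum-exp
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()
```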
31

Martínez del Hoyo Canterla, Alfonso. "Design of Detectors for Automatic Speech Recognition." Doctoral thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-16548.

Abstract:
This thesis presents methods and results for optimizing subword detectors in continuous speech. Speech detectors are useful within areas like detection-based ASR, pronunciation training, phonetic analysis, word spotting, etc. Firstly, we propose a structure suitable for subword detection. This structure is based on the standard HMM framework, but in each detector the MFCC feature extractor and the models are trained for the specific detection problem. Our experiments on the TIMIT database validate the effectiveness of this structure for detection of phones and articulatory features. Secondly, two discriminative training techniques are proposed for detector training. The first is a modification of Minimum Classification Error training. The second, Minimum Detection Error training, is the adaptation of Minimum Phone Error to the detection problem. Both methods are used to train the HMMs and filterbanks in the detectors, in isolation or jointly. MDE has the advantage that any detection performance criterion can be optimized directly. F-score and class accuracy optimization experiments show that MDE training is superior to the MCE-based method. The optimized filterbanks reflect some acoustical properties of the detection classes, and some changes are consistent over classes with similar acoustical properties. In addition, MDE training of filterbanks results in filters significantly different from those in the standard filterbank; in fact, some filters extract information from different critical bands. Finally, we propose a detection-based automatic speech recognition system. Detectors are built with the proposed HMM-based detection structure and trained discriminatively. The linguistic merger is based on an MLP/Viterbi decoder.
32

Bengio, Yoshua. "Connectionist models applied to automatic speech recognition." Thesis, McGill University, 1987. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=63920.

33

Prager, Richard William. "Parallel processing networks for automatic speech recognition." Thesis, University of Cambridge, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.238443.

34

Austin, Stephen Christopher. "Hidden Markov models for automatic speech recognition." Thesis, University of Cambridge, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.292913.

35

Frankel, Joe. "Linear dynamic models for automatic speech recognition." Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/1087.

Abstract:
The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMM), in which the output distribution associated with each state is modelled by a mixture of diagonal covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal. This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterise entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before; however, a smoothed Gauss-Markov form was used which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long-range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments. LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as those found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features is compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger-scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as choice of state dimension, front-end acoustic parametrisation and parameter initialisation. Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesised segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A* search. Such a scheme allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature-sets derived from the MOCHA and TIMIT corpora. Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.
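In generic state-space notation (assumed here for illustration, and not necessarily identical to the thesis's own parametrisation), the linear dynamic model described above takes the form

\mathbf{x}_t = F\,\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q) \qquad \text{(state evolution)}
\mathbf{y}_t = H\,\mathbf{x}_t + \mathbf{v}_t, \quad\;\; \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R) \qquad \text{(observation process)}

where the continuous hidden state \mathbf{x}_t evolves auto-regressively within a segment (the Markovian dynamics), and the observation matrix H maps a possibly lower-dimensional state into the feature space, so that correlations between feature dimensions are absorbed by H and the noise covariance R rather than being ignored.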
APA, Harvard, Vancouver, ISO, and other styles
36

Gu, Y. "Perceptually-based features in automatic speech recognition." Thesis, Swansea University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.637182.

Full text
Abstract:
Interspeaker variability of speech features is one of the most important problems in automatic speech recognition (ASR), and makes speaker-independent systems much more difficult to achieve than speaker-dependent ones. The work described in the Thesis examines two ideas to overcome this problem. The first attempts to extract more reliable speech features by perceptually-based modelling; the second investigates the speaker variability in this speech feature and reduces its effects by a speaker normalisation scheme. The application of human speech perception in automatic speech recognition is discussed in the Thesis. Several perceptually-based feature analysis techniques are compared in terms of recognition performance, and the effects of the individual perceptual parameters encompassed in the feature analysis are investigated. The work demonstrates the benefits of perceptual feature analysis (particularly the perceptually-based linear predictive approach) compared with the conventional linear predictive analysis technique. The proposal for speaker normalisation is based on a regional-continuous linear matrix transform function on the perceptual feature space, with automatic feature classification. This approach is applied in an ASR adaptation system. It is shown that the recognition error rate reduces rapidly when using a few words or a single sentence for adaptation. The adaptation performance demonstrates that such an approach could be very promising for a large-vocabulary speaker-independent system.
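The regional transform proposed above can be pictured, in a generic form (the notation is an assumption for illustration, not the thesis's own), as a region-dependent affine map on the perceptual feature space:

\hat{\mathbf{x}} = A_r\,\mathbf{x} + \mathbf{b}_r, \qquad r = c(\mathbf{x}),

where \mathbf{x} is a perceptual feature vector, c(\cdot) is the automatic feature classifier that selects the region r, and (A_r, \mathbf{b}_r) are the transform parameters estimated for that region from the few adaptation words or sentences.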
APA, Harvard, Vancouver, ISO, and other styles
37

Baothman, Fatmah bint Abdul Rahman. "Phonology-based automatic speech recognition for Arabic." Thesis, University of Huddersfield, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.273720.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Holmes, Wendy Jane. "Modelling segmental variability for automatic speech recognition." Thesis, University College London (University of London), 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.267859.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition." Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Duchnowski, Paul. "A new structure for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1993. http://hdl.handle.net/1721.1/17333.

Full text
Abstract:
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1993.
Includes bibliographical references (leaves 102-110).
by Paul Duchnowski.
Sc.D.
APA, Harvard, Vancouver, ISO, and other styles
41

Wang, Stanley Xinlei. "Using graphone models in automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53114.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
Includes bibliographical references (p. 87-90).
This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown words. For these experiments, graphone models are integrated into the SUMMIT speech recognition framework. First, graphones are applied to automatically generate pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words. Next, a graphone hybrid recognizer is built and tested for searching song lyrics by voice, as well as for transcribing spoken lectures in an open-vocabulary scenario. These experiments demonstrate significant improvement over traditional word-only speech recognizers. Modifications to the flat hybrid model, such as reducing the graphone set size, are also considered. Finally, a hierarchical hybrid model is built and compared with the flat hybrid model on the lecture transcription task.
by Stanley Xinlei Wang.
M.Eng.
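For readers unfamiliar with graphones, the sketch below illustrates the underlying data structure: a word represented as a sequence of joint letter-phoneme units, so that hypothesising a graphone sequence for an unknown word yields its spelling and a pronunciation at once. The segmentation and phone symbols are illustrative assumptions; in the thesis the graphone inventory is learned automatically and integrated into SUMMIT rather than hand-specified as here.

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Graphone:
    letters: str  # grapheme fragment
    phones: str   # phone fragment (ARPAbet-style; illustrative)

# One plausible graphone decomposition of the word "station"
station: List[Graphone] = [
    Graphone("s", "s"),
    Graphone("t", "t"),
    Graphone("a", "ey"),
    Graphone("tion", "sh ah n"),
]

spelling = "".join(g.letters for g in station)       # -> "station"
pronunciation = " ".join(g.phones for g in station)  # -> "s t ey sh ah n"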
APA, Harvard, Vancouver, ISO, and other styles
42

Seigel, Matthew Stephen. "Confidence estimation for automatic speech recognition hypotheses." Thesis, University of Cambridge, 2014. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.648633.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Abdelhamied, Kadry A. "Automatic identification and recognition of deaf speech /." The Ohio State University, 1986. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487266691094027.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Cherri, Mona Youssef 1956. "Automatic Speech Recognition Using Finite Inductive Sequences." Thesis, University of North Texas, 1996. https://digital.library.unt.edu/ark:/67531/metadc277749/.

Full text
Abstract:
This dissertation addresses the general problem of recognition of acoustic signals which may be derived from speech, sonar, or acoustic phenomena. The specific problem of recognizing speech is the main focus of this research. The intention is to design a recognition system for a definite number of discrete words. For this purpose specifically, eight isolated words from the TIMIT database are selected. Four medium-length words, "greasy," "dark," "wash," and "water," are used. In addition, four short words are considered: "she," "had," "in," and "all." The recognition system addresses the following issues: filtering or preprocessing, training, and decision-making. The preprocessing phase uses linear predictive coding of order 12. Following the filtering process, a vector quantization method is used to further reduce the input data and generate a finite inductive sequence of symbols representative of each input signal. The sequences generated by the vector quantization process of the same word are factored, and a single ruling or reference template is generated and stored in a codebook. This system introduces a new modeling technique which relies heavily on the basic concept that all finite sequences are finitely inductive. This technique is used in the training stage. In order to accommodate the variability in speech, the training is performed casually, and a large number of training speakers is used from eight different dialect regions. Hence, a speaker-independent recognition system is realized. The matching process compares the incoming speech with each of the templates stored, and a closeness ratio is computed. A ratio table is generated and the matching word that corresponds to the smallest ratio (i.e. indicating that the ruling has removed most of the symbols) is selected. Promising results were obtained for isolated words, and the recognition rates ranged between 50% and 100%.
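The front-end pipeline described above (LPC analysis followed by vector quantization into a symbol sequence) can be sketched as follows; the codebook size, random data, and variable names are illustrative assumptions, not values from the dissertation.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 12))  # 64 code vectors; LPC analysis of order 12
frames = rng.normal(size=(120, 12))   # one utterance: 120 frames of LPC coefficients

# Assign each frame to its nearest code vector (Euclidean distance),
# turning the utterance into a finite sequence of symbols.
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
symbols = dists.argmin(axis=1)
print(symbols[:10])  # the first ten symbols of the sequence

Sequences of this kind, produced from repetitions of the same word, are then factored into a single ruling or reference template against which incoming symbol sequences are matched.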
APA, Harvard, Vancouver, ISO, and other styles
45

Colton, Larry Don. "Confidence and rejection in automatic speech recognition /." Full text open access at:, 1997. http://content.ohsu.edu/u?/etd,21.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Li, Jinyu. "Soft margin estimation for automatic speech recognition." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26613.

Full text
Abstract:
Thesis (Ph.D)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Dr. Chin-Hui Lee; Committee Member: Dr. Anthony Joseph Yezzi; Committee Member: Dr. Biing-Hwang (Fred) Juang; Committee Member: Dr. Mark Clements; Committee Member: Dr. Ming Yuan. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
47

Arrowood, Jon A. "Using observation uncertainty for robust speech recognition." Diss., Available online, Georgia Institute of Technology, 2004:, 2003. http://etd.gatech.edu/theses/available/etd-04082004-180005/unrestricted/arrowood%5Fjon%5Fa%5F200312%5Fphd.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Bell, Peter. "Full covariance modelling for speech recognition." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/4912.

Full text
Abstract:
HMM-based systems for Automatic Speech Recognition typically model the acoustic features using mixtures of multivariate Gaussians. In this thesis, we consider the problem of learning a suitable covariance matrix for each Gaussian. A variety of schemes have been proposed for controlling the number of covariance parameters per Gaussian, and studies have shown that in general, the greater the number of parameters used in the models, the better the recognition performance. We therefore investigate systems with full covariance Gaussians. However, in this case, the obvious choice of parameters – given by the sample covariance matrix – leads to matrices that are poorly-conditioned, and do not generalise well to unseen test data. The problem is particularly acute when the amount of training data is limited. We propose two solutions to this problem: firstly, we impose the requirement that each matrix should take the form of a Gaussian graphical model, and introduce a method for learning the parameters and the model structure simultaneously. Secondly, we explain how an alternative estimator, the shrinkage estimator, is preferable to the standard maximum likelihood estimator, and derive formulae for the optimal shrinkage intensity within the context of a Gaussian mixture model. We show how this relates to the use of a diagonal covariance smoothing prior. We compare the effectiveness of these techniques to standard methods on a phone recognition task where the quantity of training data is artificially constrained. We then investigate the performance of the shrinkage estimator on a large-vocabulary conversational telephone speech recognition task. Discriminative training techniques can be used to compensate for the invalidity of the model correctness assumption underpinning maximum likelihood estimation. On the large-vocabulary task, we use discriminative training of the full covariance models and diagonal priors to yield improved recognition performance.
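The shrinkage estimator referred to in this abstract has, in its generic form,

\hat{\Sigma} = \lambda\,T + (1 - \lambda)\,S, \qquad 0 \le \lambda \le 1,

where S is the (poorly-conditioned) sample covariance matrix, T is a well-conditioned target such as its diagonal \operatorname{diag}(S), and \lambda is the shrinkage intensity; the closed-form optimal intensity derived in the thesis is not reproduced here. With a diagonal target, shrinkage is equivalent to smoothing the full covariance towards a diagonal prior, which is the connection the abstract draws.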
APA, Harvard, Vancouver, ISO, and other styles
49

Wu, Jian (武健). "Discriminative speaker adaptation and environmental robustness in automatic speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31246138.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Wrede, Britta. "Modelling the effects of speech rate variation for automatic speech recognition." [S.l. : s.n.], 2002. http://deposit.ddb.de/cgi-bin/dokserv?idn=969765304.

Full text
APA, Harvard, Vancouver, ISO, and other styles