Dissertations / Theses on the topic 'Speech recognition'

To see the other types of publications on this topic, follow the link: Speech recognition.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Speech recognition.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Chuchilina, L. M., and I. E. Yeskov. "Speech recognition." Thesis, Видавництво СумДУ, 2008. http://essuir.sumdu.edu.ua/handle/123456789/15995.

2

Alcaraz Meseguer, Noelia. "Speech Analysis for Automatic Speech Recognition." Thesis, Norwegian University of Science and Technology, Department of Electronics and Telecommunications, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9092.

Abstract:

The classical front-end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel-frequency scale), and then decorrelated using a modified discrete cosine transform. Following a focused introduction to speech production, perception and analysis, this thesis studies the implementation of a speech generative model, whereby speech is synthesized and recovered from its MFCC representation. The work was developed in two steps: first, the computation of the MFCC vectors from the source speech files using the HTK software; and second, the implementation of the generative model itself, which represents the conversion chain from HTK-generated MFCC vectors back to speech. To assess the quality of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors was computed, using spectral models based on Linear Predictive Coding (LPC) analysis. Results are reported on the reconstruction of the spectral representation and the quality of the synthesized speech.
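For orientation, the MFCC front end described above (power spectrum, mel-scale warping, log compression, DCT decorrelation) can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch rather than the HTK implementation used in the thesis; the sample rate, FFT size and filter count are assumed values.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Compute MFCCs for one pre-windowed speech frame (illustrative)."""
    # 1. Standard power spectrum estimate
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # 2. Mel-scale warping: triangular filters centred on a mel-spaced grid
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 3. Log compression, then 4. decorrelation by a discrete cosine transform
    log_energies = np.log(fbank @ power + 1e-10)
    return dct(log_energies, norm='ortho')[:n_ceps]
```

Inverting this chain, as the generative model in the thesis does, amounts to undoing the DCT and the logarithm and then estimating a spectral envelope for resynthesis.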

3

Kleinschmidt, Tristan Friedrich. "Robust speech recognition using speech enhancement." Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/31895/1/Tristan_Kleinschmidt_Thesis.pdf.

Abstract:
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In "noise-free" environments, word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess enhancement performance for intelligibility, not speech recognition, therefore making them sub-optimal for ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of the clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy over a wide range of SNRs. Throughout the dissertation, consideration is given to practical implementation in vehicular environments, which resulted in two novel contributions: a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
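For background, the conventional single-channel spectral subtraction that this work builds on can be sketched as follows. This is a minimal illustration, not the LIMA or phase-based methods contributed by the thesis; the window length, noise-estimation strategy and spectral floor are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs=16000, noise_frames=10, floor=0.01):
    """Basic magnitude spectral subtraction: estimate the noise spectrum
    from leading frames assumed to be speech-free, then subtract it."""
    f, t, Z = stft(noisy, fs=fs, nperseg=400)
    mag, phase = np.abs(Z), np.angle(Z)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=400)
    return clean
```

Note that the noisy phase is reused unchanged in the reconstruction; the phase-estimation contribution summarised above addresses exactly this simplification.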
4

Eriksson, Mattias. "Speech recognition availability." Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2651.

Abstract:

This project investigates the importance of availability in the scope of dictation programs. Speech recognition technology for dictation has not reached the general public, and that may well be a result of the poor availability of today's technical solutions.

I constructed a persona character, Johanna, who personifies the target user. I also developed a solution that streams audio to a speech recognition server and sends back the interpreted text. Evaluated against Johanna, the solution appeared successful in theory.

I then recruited test users who tried out the solution in practice. Half of them report that their usage has increased, and will continue to increase, thanks to the new level of availability.

5

Uebler, Ulla. "Multilingual speech recognition /." Berlin : Logos Verlag, 2000. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=009117880&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

6

Wang, Yonglian. "Speech Recognition under Stress." Available to subscribers only, 2009. http://proquest.umi.com/pqdweb?did=1968468151&sid=9&Fmt=2&clientId=1509&RQT=309&VName=PQD.

7

Lucas, Adrian Edward. "Acoustic level speech recognition." Thesis, University of Surrey, 1991. http://epubs.surrey.ac.uk/2819/.

Abstract:
A number of techniques have been developed over the last forty years which attempt to solve the problem of recognizing human speech by machine. Although the general problem of unconstrained, speaker-independent connected speech recognition is still not solved, some of the methods have demonstrated varying degrees of success on a number of constrained speech recognition tasks. Human speech communication is considered to take place on a number of levels, from the acoustic signal through to higher linguistic and semantic levels. At the acoustic level, the recognition process can be divided into time-alignment (the removal of global and local timing differences between the unknown input speech and the stored reference templates) and reference template matching. Little attention seems to have been given to the effective use of acoustic-level contextual information to improve the performance of these tasks. In this thesis, a new template matching scheme is developed which addresses this issue and successfully allows the utilization of acoustic-level context. The method, based on Bayesian decision theory, is a dynamic time warping approach which incorporates statistical dependencies in matching errors between frames along the entire length of the reference template. In addition, the method includes a speaker compensation technique operating simultaneously. Implementation is carried out using the highly efficient branch and bound algorithm. Speech model storage requirements are quite small as a result of an elegant feature of the recursive matching criterion. Furthermore, a novel method for inferring the special speech models is introduced. The new method is tested on data drawn from nearly 8000 utterances of the 26 letters of the British English alphabet spoken by 104 speakers, split almost equally between male and female speakers. Experiments show that the new approach is a powerful acoustic-level speech recognizer, achieving up to 34% better recognition performance when compared with a conventional method based on the dynamic programming algorithm.
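The conventional dynamic programming baseline mentioned in the comparison can be sketched as a plain dynamic time warping distance (illustrative only; the Bayesian, context-dependent matching criterion and branch and bound search of the thesis are not reproduced here):

```python
import numpy as np

def dtw_distance(X, Y):
    """Plain DTW between feature sequences X (n x d) and Y (m x d),
    using Euclidean local distances and symmetric path constraints."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            # best of match, insertion and deletion predecessors
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```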
8

Žmolíková, Kateřina. "Far-Field Speech Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255331.

Abstract:
Speech recognition systems nowadays achieve relatively high accuracy. For speech captured by a distant microphone, however, which is corrupted by noise and reverberation, recognition accuracy degrades considerably. This problem can be mitigated by using microphone arrays. This thesis deals with techniques for combining signals from multiple microphones so as to improve the quality of the resulting signal and hence the recognition accuracy. The thesis first summarizes the theory of speech recognition and presents the most widely used algorithms for microphone array processing. It then demonstrates and analyzes the results of two beamforming methods and a multi-channel dereverberation method. Finally, an alternative approach to beamforming using neural networks is explored.
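The simplest of the array-processing methods surveyed in the thesis is the delay-and-sum beamformer; a minimal sketch, assuming the per-channel delays (in samples) are already known:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align multi-microphone signals by known sample delays and
    average them, reinforcing the source while averaging out noise.
    channels: list of 1-D arrays; delays: per-channel integer delays."""
    max_d = max(delays)
    length = min(len(c) for c in channels) - max_d
    aligned = [c[d:d + length] for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

In practice the delays are estimated from the array geometry or from cross-correlation between channels.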
9

Sun, Felix (Felix W.). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-63).
The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
by Felix Sun.
M. Eng.
10

Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda, and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.

11

Itakura, Fumitada, Tetsuya Shinde, Kiyoshi Tatara, Taisuke Ito, Ikuya Yokoo, Shigeki Matsubara, Kazuya Takeda, and Nobuo Kawaguchi. "CIAIR speech corpus for real world speech recognition." The oriental chapter of COCOSDA (The International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques), 2002. http://hdl.handle.net/2237/15462.

12

Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

13

Al-Otaibi, Abdulhadi S. "Arabic speech processing : syllabic segmentation and speech recognition." Thesis, Aston University, 1988. http://publications.aston.ac.uk/8064/.

Abstract:
A detailed description of the Arabic phonetic system is given. The syllabic behaviour of the Arabic language is highlighted. Basic statistical properties of the Arabic language (phoneme and syllable repetition frequencies) are included. A thorough review of the speech processing techniques used in speech analysis, synthesis and recognition applications is presented. The development of a PC-based speech processing system is described; the system has proven to be a useful tool in Arabic speech analysis and recognition applications. A sample spectrographic study of two pairs of similar Arabic sounds was performed. It is shown that no clear acoustical property exists to distinguish between the phonemes /O/ and /f/ except the gradual rise of F1 during formant movements (transitions). The development of an automatic Arabic syllabic segmentation algorithm is described. The performance of the algorithm is tested with monosyllabic and multisyllabic words; an overall accuracy of 92% was achieved. The main parameters affecting the accuracy of the segmentation algorithm are discussed. The syllabic units generated by the Arabic syllabic segmentation algorithm are utilized in the implementation of three major speech applications, namely an automatic Arabic vowel recognition system, an isolated word recognition system and an acoustic-phonetic model for Arabic. Each application is fully described and its performance results are indicated.
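The abstract does not detail the segmentation algorithm itself. As a purely hypothetical illustration of one common basis for syllabic segmentation, candidate syllable nuclei can be located as peaks of a smoothed short-time energy envelope:

```python
import numpy as np
from scipy.signal import find_peaks

def syllable_nuclei(x, sr=16000, frame_ms=20):
    """Locate candidate syllable nuclei as peaks of a smoothed
    short-time energy envelope (illustrative, not the thesis algorithm)."""
    hop = int(sr * frame_ms / 1000)
    frames = [x[i:i + hop] for i in range(0, len(x) - hop, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    energy = np.convolve(energy, np.ones(5) / 5, mode='same')  # smoothing
    peaks, _ = find_peaks(energy, height=0.1 * energy.max(), distance=4)
    return peaks * hop / sr  # nucleus times in seconds
```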
14

Tran, Thao, and Nathalie Tkauc. "Face recognition and speech recognition for access control." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-39776.

Abstract:
This project is a collaboration with the company JayWay in Halmstad. To enter the office today, employees need a tag-key and guests use a doorbell. If someone rings the doorbell, someone on the inside has to open the door manually, which is considered a disturbance during work time. The purpose of the project is to minimize the disturbances in the office. The goal is to develop a system that uses face recognition and speech-to-text to control the lock system for the entrance door. The components used are two Raspberry Pis, a 7-inch LCD touch display, a Raspberry Pi Camera Module V2, an external sound card, a microphone and a speaker. The whole project was written in Python; Amazon Web Services (AWS) provided storage and the face recognition, while speech-to-text was provided by Google. The system is divided into three functions, for employees, guests and deliveries. The employee function has two authentication steps, the face recognition and a randomly generated code that needs to be confirmed, to avoid biometric spoofing. The guest function uses the speech-to-text service to state the name of the employee the guest wants to meet, and that employee is then notified. The delivery function informs the specific persons in the office who are responsible for deliveries by sending a notification. Testing showed that the system always matched the right person when using the face recognition, and indicated what the face recognition threshold can be set to in order to make sure that only authorized people enter the office. The two-step authentication, face recognition plus code, makes the system secure and protects it against spoofing; one downside is that the extra step takes time. The speech-to-text is set to Swedish and works quite well for Swedish-speaking persons. For a multicultural company, however, the speech-to-text service can be hard to use, and it can also be hard for the service to listen and translate if there is a lot of background noise or if several people speak at the same time.
15

Dewey, John K. "Speech recognition of foreign accent." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1994. http://handle.dtic.mil/100.2/ADA282979.

16

Stemmer, Georg. "Modeling variability in speech recognition /." Berlin : Logos-Verl, 2005. http://deposit.ddb.de/cgi-bin/dokserv?id=2659313&prov=M&dok_var=1&dok_ext=htm.

17

Mustafa, M. K. "On-device mobile speech recognition." Thesis, Nottingham Trent University, 2016. http://irep.ntu.ac.uk/id/eprint/28044/.

Abstract:
Despite many years of research, speech recognition remains an active area of research in Artificial Intelligence. Currently, the most common commercial application of this technology on mobile devices uses a wireless client-server approach to meet the computational and memory demands of the speech recognition process. Unfortunately, such an approach is unlikely to remain viable when fully applied over the approximately 7.22 billion mobile phones currently in circulation. In this thesis we present an on-device speech recognition system, which has the potential to completely eliminate the wireless client-server bottleneck. For the Voice Activity Detection (VAD) part of this work, this thesis presents two novel algorithms used to detect speech activity within an audio signal. The first algorithm is based on the Log Linear Predictive Cepstral Coefficients Residual (LLPCCRS) signal. These LLPCCRS feature vectors are classified into voice and non-voice segments using a modified K-means clustering algorithm. This VAD algorithm is shown to provide better performance than a conventional energy frame analysis based approach. The second algorithm is based on the Linear Predictive Cepstral Coefficients (LPC). It uses the frames within the speech signal with the minimum and maximum standard deviation as candidates for a linear cross-correlation against the rest of the frames within the audio signal. The cross-correlated frames are then classified using the same modified K-means clustering algorithm, yielding one cluster of speech frames and another of non-speech frames. This novel application of the linear cross-correlation technique to linear predictive cepstral coefficient feature vectors provides a fast computation method for use on the mobile platform, as shown by the results presented in this thesis. The speech recognition part of this thesis presents two novel neural network approaches to mobile speech recognition. Firstly, a recurrent neural network architecture is developed to accommodate the output of the VAD stage. Specifically, an Echo State Network (ESN) is used for phoneme-level recognition; the drawbacks and advantages of this method are explained further within the thesis. Secondly, a dynamic Multi-Layer Perceptron approach is developed. This builds on the drawbacks of the ESN and provides a dynamic way of handling speech signal length variability within its architecture. This novel dynamic Multi-Layer Perceptron uses both the Linear Predictive Cepstral Coefficients (LPC) and the Mel Frequency Cepstral Coefficients (MFCC) as input features. A speaker-dependent approach is presented using the Center for Spoken Language Understanding (CSLU) database. The results show a very distinct behaviour from conventional speech recognition approaches, because the LPC shows performance figures very close to the MFCC. A speaker-independent system, using the standard TIMIT dataset, is then implemented on the dynamic MLP for further confirmation; in this mode of operation the MFCC outperforms the LPC. Finally, all the results, with emphasis on the computation time of both these novel neural network approaches, are compared directly to a conventional hidden Markov model on the CSLU and TIMIT standard datasets.
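The modified K-means procedure is specific to the thesis, but the overall two-cluster VAD idea can be sketched with standard K-means over per-frame features (scikit-learn assumed; treating the first feature dimension as an energy proxy is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_vad(features):
    """Cluster per-frame feature vectors (e.g. cepstral coefficients)
    into two groups and call the higher-energy cluster 'speech'.
    features: (n_frames, n_dims) array whose first column tracks energy."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
    means = [features[labels == k][:, 0].mean() for k in (0, 1)]
    speech_cluster = int(np.argmax(means))
    return labels == speech_cluster  # boolean speech mask per frame
```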
18

Haque, Serajul. "Perceptual features for speech recognition." University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0187.

Abstract:
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology and research. It is also known as the recognition of speech by a machine or by artificial intelligence. However, in spite of focused research in this field for the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variability, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced through temporal processing of speech using a zero-crossing algorithm. It is observed that a high-frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low-frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated, as observed in the psychoacoustic input/output (I/O) perception curves. A method of frequency-dependent asymmetric compression, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed. By asymmetric compression, degradation of the spectral contrast of the low-frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on the auditory processing in the cochlear nucleus is presented. The processing for synchrony detection, average discharge (mean rate) processing and two-tone suppression are segregated and processed separately at the feature extraction level, according to the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating the synchrony detection from the synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end. An auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. The performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
19

Nilsson, Tobias. "Speech Recognition Software and Vidispine." Thesis, Umeå universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-71428.

Abstract:
To evaluate libraries for continuous speech recognition, a test based on TED-talk videos was created. The speech recognition libraries PocketSphinx, Dragon NaturallySpeaking and the Microsoft Speech API were part of the evaluation. From the words that the libraries recognized, the Word Error Rate (WER) was calculated, and the results show that Microsoft SAPI performed worst with a WER of 60.8%, PocketSphinx second with 59.9%, and Dragon NaturallySpeaking best with 42.6%. These results were all achieved with a Real Time Factor (RTF) of less than 1.0. PocketSphinx was chosen as the best candidate for the intended system on the basis that it is open-source, free, and a better match for the system. By modifying the language model and dictionary to more closely resemble typical TED-talk content, it was also possible to improve the WER for PocketSphinx to 39.5%, however at the cost of an RTF that passed the 1.0 limit, making it less useful for live video.
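The metric used throughout this comparison, Word Error Rate, is the word-level edit distance between hypothesis and reference divided by the number of reference words; a standard self-contained implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    computed as a Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, word_error_rate("the cat sat", "the cat sad") returns 1/3: one substitution over three reference words.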
20

Thompson, J. "Speech variability in speaker recognition." Thesis, Swansea University, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.639230.

Abstract:
This thesis is concerned with investigating the effects of variability on automatic speaker recognition system performance. Both speaker-generated variability and variability of the recording environment are examined. Speaker-generated variability (intra-variation) has received less attention than variability of the recording environment, and is therefore the main focus of this thesis. Of most concern is the intra-variation of data typically found in co-operative speaker recognition tasks, that is, normally spoken speech collected over a period of months. To assess the scale of recognition errors attributed to intra-variation, errors due to noise degradation are considered first. Additive noise can rapidly degrade recognition performance, so for a more realistic assessment a 'state of the art' noise compensation algorithm is also introduced. Comparison between noise degradation and intra-variation shows intra-variation to be a significant source of recognition errors, with intra-variation being the source of most recognition errors at background noise levels of 9 dB SNR or greater. The level of intra-variation and recognition errors is shown to be highly speaker-dependent. Analysis of cepstral variation shows intra-variation to correlate more closely with recognition errors than inter-variation. Recognition experiments and analysis of the glottal pulse shape demonstrate that variation between two recording sessions generally increases as the time gap between the recordings lengthens. Glottal pulse variation is also shown to vary within recording sessions, albeit less than between sessions. Glottal pulse shape variation has been shown by others to vary for highly stressed speech; it is shown here to also vary for normally spoken speech collected under relatively controlled conditions. It is hypothesized that these variations occur, in part, due to the speaker's anxiety during recording, and glottal pulse variation is shown to broadly match the hypothesised anxiety profile. The gradual change of glottal pulse variation demonstrates an underlying reason why incremental speaker adaptation can be used for intra-variation compensation. Experiments show that adaptation can potentially reduce speaker identification error rates from 15% to 2.5%.
21

Leventis, Constantinos P. "Speech recognition application in C.I.C." Thesis, Monterey, California. Naval Postgraduate School, 1991. http://hdl.handle.net/10945/26786.

22

Milner, Benjamin Peter. "Speech recognition in adverse environments." Thesis, University of East Anglia, 1994. https://ueaeprints.uea.ac.uk/2907/.

23

Long, Christopher J. "Wavelet methods in speech recognition." Thesis, Loughborough University, 1999. https://dspace.lboro.ac.uk/2134/14108.

Abstract:
In this thesis, novel wavelet techniques are developed to improve parametrization of speech signals prior to classification. It is shown that non-linear operations carried out in the wavelet domain improve the performance of a speech classifier and consistently outperform classical Fourier methods. This is because of the localised nature of the wavelet, which captures correspondingly well-localised time-frequency features within the speech signal. Furthermore, by taking advantage of the approximation ability of wavelets, efficient representation of the non-stationarity inherent in speech can be achieved in a relatively small number of expansion coefficients. This is an attractive option when faced with the so-called 'Curse of Dimensionality' problem of multivariate classifiers such as Linear Discriminant Analysis (LDA) or Artificial Neural Networks (ANNs). Conventional time-frequency analysis methods such as the Discrete Fourier Transform either miss irregular signal structures and transients due to spectral smearing or require a large number of coefficients to represent such characteristics efficiently. Wavelet theory offers an alternative insight in the representation of these types of signals. As an extension to the standard wavelet transform, adaptive libraries of wavelet and cosine packets are introduced which increase the flexibility of the transform. This approach is observed to be yet more suitable for the highly variable nature of speech signals in that it results in a time-frequency sampled grid that is well adapted to irregularities and transients. They result in a corresponding reduction in the misclassification rate of the recognition system. However, this is necessarily at the expense of added computing time. Finally, a framework based on adaptive time-frequency libraries is developed which invokes the final classifier to choose the nature of the resolution for a given classification problem. The classifier then performs dimensionality reduction on the transformed signal by choosing the top few features based on their discriminant power. This approach is compared and contrasted to an existing discriminant wavelet feature extractor. The overall conclusions of the thesis are that wavelets and their relatives are capable of extracting useful features for speech classification problems. The use of adaptive wavelet transforms provides the flexibility within which powerful feature extractors can be designed for these types of application.
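A minimal example of wavelet-based parametrization in the spirit described above, using PyWavelets; the wavelet family, decomposition depth and log-energy summary are assumptions for illustration:

```python
import numpy as np
import pywt

def wavelet_features(frame, wavelet='db4', level=4):
    """Parametrize a speech frame by the log energy of each wavelet
    decomposition subband (an illustrative feature extractor)."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)  # [cA_n, cD_n..cD_1]
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])
```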
24

Stewart, Darryl William. "Syllable based continuous speech recognition." Thesis, Queen's University Belfast, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.325993.

25

Luettin, Juergen. "Visual speech and speaker recognition." Thesis, University of Sheffield, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.264432.

26

Jafri, Afshan. "Morphology-based Arabic speech recognition." Thesis, University of Essex, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.429298.

27

Santos, Debora Andrea de Oliveira. "Speech Recognition in Noise Environment." Pontifícia Universidade Católica do Rio de Janeiro, 2001. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=1987@1.

Abstract:
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
This work presents a comparative study of three techniques for improving speech recognition rates in adverse environments, namely Cepstral Mean Normalization (CMN), Spectral Subtraction and Maximum Likelihood Linear Regression (MLLR). They are implemented in two ways: separately and in pairs. The tests are carried out on a simple system: recognition of isolated words (digits from zero to nine, and the word half), speaker-dependent mode, continuous hidden Markov models, and speech feature vectors with twelve cepstral coefficients derived from linear predictive analysis. Three types of noise are considered (white, voice babble and factory) at nine different signal-to-noise ratios. Experimental results demonstrate that it is worthwhile to apply the robust recognition techniques separately: for all signal-to-noise conditions, when the recognition accuracy is not improved it matches that obtained when no robustness method is applied. Analyzing comparatively the isolated and simultaneous applications of the techniques, it is verified that the latter is not always more attractive than the former, depending on the pair of techniques. The use of noisy models is also considered; although it presents better results, it is not feasible to implement in practical situations. Among the implemented techniques, MLLR presents results closest to those obtained with noisy models, followed by CMN and, last, by Spectral Subtraction. Although the two latter are beaten by the first in terms of recognition accuracy, their advantages are simplicity and generality. The use of simultaneous techniques reveals that the pair Spectral Subtraction and MLLR has the best performance, because it is superior to the individual use of both methods, which does not happen with other combinations of the techniques.
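Of the three techniques compared, Cepstral Mean Normalization is the simplest to state; a minimal sketch:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.
    A stationary linear channel is additive in the log-cepstral domain,
    so removing the long-term mean cancels it.
    cepstra: (n_frames, n_coeffs) array of feature vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```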
28

Ragni, Anton. "Discriminative models for speech recognition." Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.707926.

29

Melnikoff, Stephen Jonathan. "Speech recognition in programmable logic." Thesis, University of Birmingham, 2003. http://etheses.bham.ac.uk//id/eprint/16/.

Abstract:
Speech recognition is a computationally demanding task, especially the decoding part, which converts pre-processed speech data into words or sub-word units, and which incorporates Viterbi decoding and Gaussian distribution calculations. In this thesis, this part of the recognition process is implemented in programmable logic, specifically, on a field-programmable gate array (FPGA). Relevant background material about speech recognition is presented, along with a critical review of previous hardware implementations. Designs for a decoder suitable for implementation in hardware are then described. These include details of how multiple speech files can be processed in parallel, and an original implementation of an algorithm for summing Gaussian mixture components in the log domain. These designs are then implemented on an FPGA. An assessment is made as to how appropriate it is to use hardware for speech recognition. It is concluded that while certain parts of the recognition algorithm are not well suited to this medium, much of it is, and so an efficient implementation is possible. Also presented is an original analysis of the requirements of speech recognition for hardware and software, which relates the parameters that dictate the complexity of the system to processing speed and bandwidth. The FPGA implementations are compared to equivalent software, written for that purpose. For a contemporary FPGA and processor, the FPGA outperforms the software by an order of magnitude.
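The "summing Gaussian mixture components in the log domain" that the decoder implements in hardware is, in software terms, the numerically stable log-sum-exp; a minimal sketch:

```python
import numpy as np

def log_sum_mixture(log_weighted_densities):
    """Combine mixture components given in the log domain:
    log(sum_i exp(x_i)), stabilised by factoring out the maximum."""
    x = np.asarray(log_weighted_densities)
    m = x.max()
    return m + np.log(np.sum(np.exp(x - m)))
```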
30

Price, Michael R. "Energy-scalable speech recognition circuits." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106090.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 135-141).
As people become more comfortable with speaking to machines, the applications of speech interfaces will diversify and include a wider range of devices, such as wearables, appliances, and robots. Automatic speech recognition (ASR) is a key component of these interfaces that is computationally intensive. This thesis shows how we designed special-purpose integrated circuits to bring local ASR capabilities to electronic devices with a small size and power footprint. This thesis adopts a holistic, system-driven approach to ASR hardware design. We identify external memory bandwidth as the main driver in system power consumption and select algorithms and architectures to minimize it. We evaluate three acoustic modeling approaches, Gaussian mixture models (GMMs), subspace GMMs (SGMMs), and deep neural networks (DNNs), and identify tradeoffs between memory bandwidth and recognition accuracy. DNNs offer the best tradeoffs for our application; we describe a SIMD DNN architecture using parameter quantization and sparse weight matrices to save bandwidth. We also present a hidden Markov model (HMM) search architecture using a weighted finite-state transducer (WFST) representation. Enhancements to the search architecture, including WFST compression and caching, predictive beam width control, and a word lattice, reduce memory bandwidth to 10 MB/s or less, despite having just 414 kB of on-chip SRAM. The resulting system runs in real-time with accuracy comparable to a software recognizer using the same models. We provide infrastructure for deploying recognizers trained with open-source tools (Kaldi) on the hardware platform. We investigate voice activity detection (VAD) as a wake-up mechanism and conclude that an accurate and robust algorithm is necessary to minimize system power, even if it results in larger area and power for the VAD itself. We design fixed-point digital implementations of three VAD algorithms and explore their performance on two synthetic tasks with SNRs from -5 to 30 dB. The best algorithm uses modulation frequency features with an NN classifier, requiring just 8.9 kB of parameters. Throughout this work we emphasize energy scalability, or the ability to save energy when high accuracy or complex models are not required. Our architecture exploits scalability from many sources: model hyperparameters, runtime parameters such as beam width, and voltage/frequency scaling. We demonstrate these concepts with results from five ASR tasks, with vocabularies ranging from 11 words to 145,000 words.
by Michael Price.
Ph. D.
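Of the bandwidth-saving measures listed above, parameter quantization is easy to illustrate with a simple linear int8 scheme (an assumption for illustration; the chip's actual quantization format is not described here):

```python
import numpy as np

def quantize_int8(weights):
    """Linearly quantize a float32 weight matrix to int8, cutting the
    memory bandwidth needed to stream it by 4x at some accuracy cost."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale  # approximate the original weights as q * scale
```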
31

Yoder, Benjamin W. (Benjamin Wesley). "Spontaneous speech recognition using HMMs." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/36108.

Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2003.
Includes bibliographical references (leaf 63).
This thesis describes a speech recognition system that was built to support spontaneous speech understanding. The system is composed of (1) a front-end acoustic analyzer which computes Mel-frequency cepstral coefficients, (2) acoustic models of context-dependent phonemes (triphones), (3) a back-off bigram statistical language model, and (4) a beam search decoder based on the Viterbi algorithm. The context-dependent acoustic models resulted in 67.9% phoneme recognition accuracy on the standard TIMIT speech database. Spontaneous speech was collected using a "Wizard of Oz" simulation of a simple spatial manipulation game. Naive subjects were instructed to manipulate blocks on a computer screen in order to solve a series of geometric puzzles using only spoken commands; a hidden human operator performed actions in response to each spoken command. The speech from thirteen subjects formed the corpus for the speech recognition results reported here. Using a task-specific bigram statistical language model and context-dependent acoustic models, the system achieved a word recognition accuracy of 67.6%. The recognizer operated with a vocabulary of 523 words, and the recognition task had a word perplexity of 36.
by Benjamin W. Yoder.
M.Eng.
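Component (4) above, a beam search decoder based on the Viterbi algorithm, can be sketched compactly over log-domain HMM parameters (an illustrative sketch; the state-space layout and beam width are assumptions):

```python
import numpy as np

def viterbi_beam(log_pi, log_A, log_B, beam=10.0):
    """Viterbi decoding with beam pruning: any state whose score falls
    more than `beam` below the frame-best is dropped from the search.
    log_pi: (S,) initial, log_A: (S,S) transition, log_B: (S,T) emission."""
    S, T = log_B.shape
    delta = log_pi + log_B[:, 0]
    back = np.zeros((S, T), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A              # scores[i, j]: i -> j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + log_B[:, t]
        delta[delta < delta.max() - beam] = -np.inf  # beam pruning
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]  # most likely state sequence
```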
32

Higgins, Irina. "Computational neuroscience of speech recognition." Thesis, University of Oxford, 2015. https://ora.ox.ac.uk/objects/uuid:daa8d096-6534-4174-b63e-cc4161291c90.

Abstract:
Physical variability of speech combined with its perceptual constancy makes speech recognition a challenging task. The human auditory brain, however, is able to perform speech recognition effortlessly. This thesis aims to understand the precise computational mechanisms that allow the auditory brain to do so. In particular, we look for the minimal subset of sub-cortical auditory brain areas that allow the primary auditory cortex to learn 'good representations' of speech-like auditory objects through spike-timing dependent plasticity (STDP) learning mechanisms as described by Bi & Poo (1998). A 'good representation' is defined as that which is informative of the stimulus class regardless of the variability in the raw input, while being less redundant and more compressed than the representations within the auditory nerve, which provides the firing inputs to the rest of the auditory brain hierarchy (Barlow 1961). Neurophysiological studies have provided insights into the architecture and response properties of different areas within the auditory brain hierarchy. We use these insights to guide the development of an unsupervised spiking neural network grounded in the neurophysiology of the auditory brain and equipped with spike-time dependent plasticity (STDP) learning (Bi & Poo 1998). The model was exposed to simple controlled speech-like stimuli (artificially synthesised phonemes and naturally spoken words) to investigate how stable representations that are invariant to the within- and between-speaker differences can emerge in the output area of the model. The output of the model is roughly equivalent to the primary auditory cortex. The aim of the first part of the thesis was to investigate what was the minimal taxonomy necessary for such representations to emerge through the interactions of spiking dynamics of the network neurons, their ability to learn through STDP learning and the statistics of the auditory input stimuli. It was found that sub-cortical pre-processing within the ventral cochlear nucleus and inferior colliculus was necessary to remove jitter inherent to the auditory nerve spike rasters, which would otherwise disrupt STDP learning in the primary auditory cortex. The second half of the thesis investigated the nature of neural encoding used within the primary auditory cortex stage of the model to represent the learnt auditory object categories. It was found that single cell binary encoding (DeWeese & Zador 2003) was sufficient to represent two synthesised vowel classes; however, more complex population encoding using precisely timed spikes within polychronous chains (Izhikevich 2006) represented more complex naturally spoken words in a speaker-invariant manner.
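The pair-based STDP rule of Bi & Poo (1998) that drives learning in the model can be written compactly; the time constant, learning rates and weight bounds below are illustrative assumptions:

```python
import numpy as np

def stdp_update(w, dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes
    the postsynaptic spike (dt = t_post - t_pre > 0), otherwise depress,
    with exponentially decaying influence of the spike-time difference."""
    if dt > 0:
        w += a_plus * np.exp(-dt / tau)
    else:
        w -= a_minus * np.exp(dt / tau)
    return float(np.clip(w, 0.0, 1.0))  # keep the weight bounded
```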
33

Gabriel, Naveen. "Automatic Speech Recognition in Somali." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.

Abstract:
The field of speech recognition has, during the last decade, left the research stage and found its way onto the public market, and today speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, automatic speech recognition using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system in which the framework is composed of an acoustic model and a language model: the acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. The thesis begins with background about speech recognition, and a literature survey covers some of the work that has been done in this field. The thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. Log scores were also calculated for the top 5 predicted sentences, along with confidence measures of the predicted sentences. The model was trained on 4.5 hours of voiced data and its corresponding transcription and evaluated on 3 minutes of testing data. The performance of the trained model on the test set was good, given that the data was devoid of background noise and lacked variability. Performance is measured using word error rate (WER) and sentence error rate (SER), and the implemented model is also compared with the results of other research work. The thesis also discusses why the log and confidence scores of a sentence might not be a good way to measure the performance of the resulting model, as well as the shortcomings of the HMM-GMM model, how the existing model can be improved, and different alternatives to solve the problem.
34

McDermott, Erik. "Discriminative training for speech recognition /." Electronic version of summary, 1997. http://www.wul.waseda.ac.jp/gakui/gaiyo/2460.pdf.

35

Klautau, Aldebaro. "Speech recognition using discriminative classifiers /." Diss., Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 2003. http://wwwlib.umi.com/cr/ucsd/fullcit?p3091208.

36

Al-Shareef, Sarah. "Conversational Arabic Automatic Speech Recognition." Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/10145/.

Abstract:
Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are generally considered to be mother tongues in those regions. CA has limited textual resources because it exists only as a spoken language, without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed, which has limitations in phonetically representing CA. Without phonetic dictionaries, the pronunciation of CA words is ambiguous and can only be obtained through word and/or sentence context. Moreover, CA inherits the complex MSA word structure, where words can be created by attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language-independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues behind the under-performance of CA ASR systems. The work focuses on two directions: first, the impact on language modelling of limited lexical coverage and insufficient training data for written CA is investigated; second, better models for the acoustics and pronunciations are obtained by learning to transfer between written and spoken forms. Several original contributions result from each direction. Data-driven classes derived from decomposed text are shown to reduce the out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task.
37

Thambiratnam, David P. "Speech recognition in adverse environments." Thesis, Queensland University of Technology, 1999. https://eprints.qut.edu.au/36099/1/36099_Thambiratnam_1999.pdf.

Abstract:
This thesis presents a study of techniques used to improve the performance of small-vocabulary, isolated-word, speaker-dependent automatic speech recognition systems in adverse environments. Such systems are applicable to 'command and control' applications, for example industrial applications where machines are controlled by voice, providing hands-free and eyes-free operation. Adverse environments present the largest obstacle to the deployment of accurate and usable speech recognition systems, because they cause discrepancies between training and testing environments. Two solutions to the problem are investigated. The first is the use of secondary modelling of the output probability distribution of the primary classifiers. It is shown that a significant improvement in performance is obtained for a small-vocabulary isolated-word speaker-dependent system operating in an adverse environment. Results are presented of simulations using the NOISEX database as well as of an actual factory environment using a real-time system. Based on the outcome of this research, a voice-operated parcel sorting machine has been installed at the Australia Post Mail Centre at Underwood, Queensland. A pilot study is also undertaken on the use of lip information to enhance speech recognition accuracy in adverse environments. It is shown that the inclusion of other data sources can improve the performance of a speech recognition system.
38

Jalalvand, Shahab. "Automatic Speech Recognition Quality Estimation." Doctoral thesis, Università degli studi di Trento, 2017. https://hdl.handle.net/11572/368743.

Abstract:
Evaluation of automatic speech recognition (ASR) systems is difficult and costly, since it requires manual transcriptions. This evaluation is usually done by computing the word error rate (WER), the most popular metric in the ASR community. Such computation is doable only if the manual references are available, whereas in real-life applications this is too rigid a condition. A reference-free metric to evaluate ASR performance is the confidence measure, which is provided by the ASR decoder. However, the confidence measure is not always available, especially in commercial ASR usage. Even if available, this measure is usually biased towards the decoder; from this perspective, the confidence measure is not suitable for comparison purposes, for example between two ASR systems. These issues motivate the necessity of an automatic quality estimation system for ASR outputs. This thesis explores ASR quality estimation (ASR QE) from different perspectives, including feature engineering, learning algorithms and applications. From the feature engineering perspective, a wide range of features extractable from the input signal and the output transcription are studied. These features represent the quality of the recognition from different aspects and are divided into four groups: signal, textual, hybrid and word-based features. From the learning point of view, we address two main approaches: i) QE via regression, suitable for the single-hypothesis scenario; ii) QE via machine-learned ranking (MLR), suitable for the multiple-hypotheses scenario. In the former, a regression model is used to predict the WER score of each single hypothesis that is created through a single automatic transcription channel. In the latter, a ranking model is used to predict the order of multiple hypotheses with respect to their quality. Multiple hypotheses are mainly generated by several ASR systems or several recording microphones. From the application point of view, we introduce two applications in which ASR QE makes a salient improvement in terms of WER: i) QE-informed data selection for acoustic model adaptation; ii) QE-informed system combination. In the former, we exploit single-hypothesis ASR QE methods in order to select the best adaptation data for upgrading the acoustic model. In the latter, we exploit multiple-hypotheses ASR QE methods to rank and combine the automatic transcriptions in a supervised manner. The experiments are mostly conducted on the CHiME-3 English dataset, which consists of Wall Street Journal utterances recorded by multiple far-distant microphones in noisy environments. The results show that QE-informed acoustic model adaptation leads to 1.8% absolute WER reduction and QE-informed system combination leads to 1.7% absolute WER reduction on the CHiME-3 task. The outcomes of this thesis are packed in the frame of an open-source toolkit named TranscRater, a transcription rating toolkit (https://github.com/hlt-mt/TranscRater), which has been developed based on the aforementioned studies. TranscRater can be used to extract informative features, train QE models and predict the quality of reference-less recognitions in a variety of ASR tasks.
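The single-hypothesis "QE via regression" setting reduces to supervised regression from transcription-level features to WER. A sketch with scikit-learn follows; the feature matrix and regressor choice are placeholders, not TranscRater's actual feature set or models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X: (n_transcriptions, n_features) signal/textual/hybrid/word features
# y: WER of each transcription, computed once against manual references
X_train = np.random.rand(500, 20)   # placeholder feature vectors
y_train = np.random.rand(500)       # placeholder WER labels
model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# At test time, estimate WER for new, reference-less transcriptions
X_test = np.random.rand(10, 20)
predicted_wer = model.predict(X_test)
```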
APA, Harvard, Vancouver, ISO, and other styles
40

Chua, W. W. "Speech recognition predictability of a Cantonese speech intelligibility index." Click to view the E-thesis via HKUTO, 2004. http://sunzi.lib.hku.hk/hkuto/record/B30509737.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Evans, N. W. D. "Spectral subtraction for speech enhancement and automatic speech recognition." Thesis, Swansea University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.636935.

Full text
Abstract:
The contributions made in this thesis relate to an extensive investigation of spectral subtraction in the context of speech enhancement and noise-robust automatic speech recognition (ASR), and to the morphological processing of speech spectrograms. Three sources of error in a spectral subtraction approach are identified and assessed with ASR: the effects of phase, cross-term component and spectral magnitude errors are evaluated in a common spectral subtraction framework. ASR results confirm that, except in extreme noise conditions, phase and cross-term component errors are relatively negligible compared to noise estimate errors. A topology is proposed that classifies approaches to spectral subtraction into power and magnitude, and linear and non-linear variants. Each class is assessed and compared under otherwise identical experimental conditions; these experiments are thought to be the first to assess the four combinations under such controlled conditions. ASR results show that non-linear approaches are less sensitive to noise over-estimation. With a view to practical systems, different approaches to noise estimation are investigated. In particular, approaches that do not require explicit voice activity detection are assessed and shown to compare favourably with the conventional approach, which does require it. Following on from this finding, a new, computationally efficient approach to noise estimation that does not require explicit voice activity detection is proposed. Investigations into the fundamentals of spectral subtraction highlight a limitation of noise estimates: statistical estimates obtained over a number of analysis frames are relatively poor representations of the instantaneous values. To ameliorate this, estimates from neighbouring (lateral) frequencies are used to complement within-bin (same-frequency) statistical approaches; the improvements are found to be negligible. However, the principle of these lateral estimates leads naturally to the final stage of the work presented in this thesis: morphological filtering of speech spectrograms. This form of processing is examined for both synthesised signals and real speech, and promising ASR performance is reported. In 2000 the Aurora 2 database was introduced by the organisers of a special session at Eurospeech 2001 entitled 'Noise Robust Recognition', aimed at providing a standard database and experimental protocols for the assessment of noise-robust ASR. This facility, when it became available, was used for the work described in this thesis.
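For orientation, power spectral subtraction, the baseline on which the four classes build, can be sketched in a few lines: subtract a (possibly over-estimated) noise power spectrum from the noisy power spectrum, floor the result, and resynthesise with the noisy phase. The sketch below assumes a noise PSD estimated from a noise-only segment; the over-subtraction factor alpha and spectral floor beta are illustrative values, not the thesis's.

```python
# Minimal sketch of power spectral subtraction. alpha is the noise
# over-estimation factor, beta the spectral floor; both are illustrative.
import numpy as np
from scipy.signal import stft, istft

def power_spectral_subtraction(noisy, noise_psd, fs, alpha=2.0, beta=0.01,
                               nperseg=512):
    _, _, X = stft(noisy, fs, nperseg=nperseg)
    power = np.abs(X) ** 2
    # Subtract the over-estimated noise power, flooring the result to avoid
    # negative power and limit musical-noise artefacts.
    clean_power = np.maximum(power - alpha * noise_psd[:, None], beta * power)
    # Recombine the enhanced magnitude with the noisy phase (phase errors
    # matter little except in extreme noise, as the thesis reports).
    X_hat = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    _, enhanced = istft(X_hat, fs, nperseg=nperseg)
    return enhanced

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)           # stand-in for clean speech
noisy = clean + 0.3 * np.random.randn(fs)
# Estimate the noise PSD from a noise-only stretch (here: synthetic noise).
_, _, N = stft(0.3 * np.random.randn(fs), fs, nperseg=512)
noise_psd = np.mean(np.abs(N) ** 2, axis=1)
enhanced = power_spectral_subtraction(noisy, noise_psd, fs)
```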
APA, Harvard, Vancouver, ISO, and other styles
42

Chua, W. W. [蔡蕙慧]. "Speech recognition predictability of a Cantonese speech intelligibility index." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30509737.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Jett, Brandi. "The role of coarticulation in speech-on-speech recognition." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1554498179209764.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Isaacs, Dale. "A comparison of the network speech recognition and distributed speech recognition systems and their effect on speech enabling mobile devices." Master's thesis, University of Cape Town, 2010. http://hdl.handle.net/11427/11232.

Full text
Abstract:
Includes bibliographical references (leaves 67-75).
Over the past 10 years there has been an exponential increase in the number of mobile subscribers worldwide: market research shows that the number of mobile subscribers rose to 4.3 billion by the end of Q1 2009. The unprecedented development of the telecommunication industry over the last decade has brought about the need for ubiquitous access to a host of different information resources and services. Today, speech remains the best medium of communication between people, and it is conceivable that speech-enabling mobile devices will allow users who have only a mobile device to access all the information now available on the World Wide Web.
APA, Harvard, Vancouver, ISO, and other styles
45

Schramm, Hauke. "Modeling spontaneous speech variability for large vocabulary continuous speech recognition." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=97968479X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids." Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.285064.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Mwanyoha, Sadiki Pili 1974. "A speech recognition module for speech-to-text language translation." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9862.

Full text
Abstract:
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (leaves 47-48).
by Sadiki Pili Mwanyoha.
S.B. and M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
48

LEBART, KATIA. "Speech dereverberation applied to automatic speech recognition and hearing aids." Rennes 1, 1999. http://www.theses.fr/1999REN10033.

Full text
Abstract:
This thesis concerns the dereverberation of speech in the specific contexts of hearing aids and automatic speech recognition. The methods considered must remain functional in conditions where the acoustic channels involved are unknown and time-varying. We therefore propose to discriminate the reverberation from the direct signal using properties of reverberation that are independent of the acoustic channel. The spatial correlation of the signals, their directions of arrival and their temporal supports lead to different methods, which are examined in turn. After a review of the state of the art on methods based on the spatial decorrelation of late reverberation and of their limitations, we suggest improvements to one of the most widely used algorithms. We then present a new spatially selective algorithm, which attenuates the contributions of reverberation as a function of their direction. This algorithm is complementary to the previous one; both use two sensors. Finally, we propose an original method that effectively attenuates the overlap-masking effect of reverberation. The methods are evaluated using various objective measures (noise reduction factor, SNR gain, cepstral distance and automatic speech recognition scores). Experiments combining the different methods demonstrate the potential benefit of such associations.
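One widely cited statistical formulation from this line of research (a generic reconstruction is sketched below, not necessarily the thesis's exact algorithm) treats late reverberation as noise whose power spectrum, under Polack's exponential-decay room model, is an attenuated, delayed copy of the signal's own power spectrum, which can then be removed by spectral subtraction. The RT60, delay and floor values here are illustrative.

```python
# Sketch of statistical late-reverberation suppression under Polack's
# exponential-decay model: the late-reverb power at time t is modelled as
# the signal power T_d seconds earlier, attenuated by exp(-2*delta*T_d),
# and removed via spectral subtraction. Parameter values are illustrative.
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, rt60=0.5, t_d=0.05, floor=0.05,
                         nperseg=512, noverlap=384):
    hop = (nperseg - noverlap) / fs            # frame hop in seconds
    delta = 3.0 * np.log(10.0) / rt60          # room decay rate from RT60
    n_d = max(1, int(round(t_d / hop)))        # delay expressed in frames
    _, _, X = stft(x, fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(X) ** 2
    # Late-reverb PSD estimate: delayed, attenuated signal PSD.
    late = np.zeros_like(power)
    late[:, n_d:] = np.exp(-2.0 * delta * t_d) * power[:, :-n_d]
    # Spectral-subtraction gain with a floor to limit artefacts.
    gain = np.sqrt(np.maximum(1.0 - late / np.maximum(power, 1e-12), floor))
    _, y = istft(gain * X, fs, nperseg=nperseg, noverlap=noverlap)
    return y

fs = 16000
reverberant = np.random.randn(2 * fs)  # stand-in for a reverberant recording
dereverbed = suppress_late_reverb(reverberant, fs)
```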
APA, Harvard, Vancouver, ISO, and other styles
49

Söderberg, Hampus. "Engaging Speech UI's - How to address a speech recognition interface." Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20591.

Full text
Abstract:
Speech recognition has existed for a long time in various shapes, often used for recognizing commands, performing speech-to-text transcription, or a mix of the two. This thesis investigates how the input affordances for such speech-based interactions should be designed to enable intuitive engagement in a multimodal user interface. At the time of writing, user interface design typically revolves around the established desktop metaphor, where vision is the primary sense. Since speech recognition is based on the sense of hearing, previous work on GUI design cannot be applied directly to a speech interface. Just as traditional GUIs evolved to embrace the desktop metaphor and matured into supporting modern touch-based experiences, speech interaction needs to undergo a similar evolutionary process before designers can understand its inherent characteristics and make informed assumptions about appropriate interaction mechanics. In order to investigate interface addressability and affordance accessibility, a prototype speech interface for a Windows 8 tablet PC was created, extending Windows 8's modern touch-optimized interface with speech interaction. The thesis's outcome is based on a user-centered evaluation of this prototype and consists of additional knowledge about the foundational interaction mechanics of addressing and engaging a speech interface. These mechanics are key aspects to consider when developing full-featured speech recognition interfaces, and the thesis aims to provide a first stepping stone towards understanding how speech interfaces should be designed. The thesis has also investigated related interaction aspects, such as the feedback required and the considerations involved in designing a multimodal user interface that includes both touch and speech input, and it identifies that a speech transcription or dictation interface needs more interaction mechanics than its inherent start and stop to become usable and useful.
APA, Harvard, Vancouver, ISO, and other styles
50

Johnston, Samuel John Charles. "An Approach to Automatic and Human Speech Recognition Using Ear-Recorded Speech." Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/625626.

Full text
Abstract:
Speech in a noisy background presents a challenge for the recognition of that speech both by human listeners and by computers tasked with understanding human speech (automatic speech recognition; ASR). Years of research have produced many solutions, though none has completely solved the problem. Current solutions generally require some form of noise estimation in order to remove the noise from the signal; the limitation is that noise can be highly unpredictable and highly variable, both in form and in loudness. The present work proposes a method of recording a speech signal in a noisy environment that largely prevents noise from reaching the recording microphone. The method uses the human skull as a noise-attenuation device by placing the microphone in the ear canal; for further noise dampening, a pair of noise-reduction earmuffs is worn over the speaker's ears. A corpus of speech was recorded with a microphone in the ear canal while simultaneously recording speech at the mouth, with noise emitted from a loudspeaker in the background. Analysis of the ear-recorded speech showed a substantial noise-reduction benefit over mouth-recorded speech; however, this speech was missing much high-frequency information. With minor processing, mid-range frequencies were amplified, increasing the intelligibility of the speech. A human perception task was conducted using both the ear-recorded and mouth-recorded speech. Participants were significantly more likely to understand ear-recorded speech than the noisy, mouth-recorded speech, although they found noise-free mouth-recorded speech the easiest to understand. The recordings were also used with an ASR system. Since ear-recorded speech is missing much high-frequency information, the system did not recognize it readily; however, when an acoustic model was trained on low-pass filtered speech, performance improved. These experiments demonstrate that humans, and likely an ASR system with additional training, can more easily recognize ear-recorded speech than speech in noise. Further speech processing and training may improve the signal's intelligibility for both human and automatic speech recognition.
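The "minor processing" step, amplifying mid-range frequencies to compensate for the band-limited ear-canal recording, could be approximated by adding a scaled band-passed copy of the signal back onto itself. The sketch below is an assumption-laden illustration: the band edges, gain and filter order are guesses, not values from the dissertation.

```python
# Illustrative mid-frequency boost for band-limited ear-recorded speech:
# add a scaled band-passed copy of the signal back onto itself. The band
# edges (1-3 kHz) and gain are guesses, not the dissertation's values.
import numpy as np
from scipy.signal import butter, sosfilt

def boost_mids(x, fs, low_hz=1000.0, high_hz=3000.0, gain_db=6.0):
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    extra = 10.0 ** (gain_db / 20.0) - 1.0  # in-band energy to add back
    return x + extra * sosfilt(sos, x)

fs = 16000
speech = np.random.randn(fs)  # stand-in for an ear-recorded utterance
equalised = boost_mids(speech, fs)
```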
APA, Harvard, Vancouver, ISO, and other styles