
Dissertations / Theses on the topic 'Automatic speaker recognition'



Consult the top 50 dissertations / theses for your research on the topic 'Automatic speaker recognition.'




1

Deterding, David Henry. "Speaker normalisation for automatic speech recognition." Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.359822.

2

Vogt, Robert Jeffery. "Automatic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 2006. https://eprints.qut.edu.au/36195/1/Robert_Vogt_Thesis.pdf.

Abstract:
Speaker verification is the process of verifying the identity of a person by analysing their speech. There are several important applications for automatic speaker verification (ASV) technology, including suspect identification, tracking terrorists and detecting a person's presence at a remote location in the surveillance domain, as well as person authentication for phone banking and credit card transactions in the private sector. Telephones and telephony networks provide a natural medium for these applications. The aim of this work is to improve the usefulness of ASV technology for practical applications in the presence of adverse conditions. In a telephony environment, background noise, handset mismatch, channel distortions, room acoustics and restrictions on the available testing and training data are common sources of errors for ASV systems. Two research themes were pursued to overcome these adverse conditions: modelling mismatch and modelling uncertainty. To address the performance degradation incurred through mismatched conditions, it was proposed to model this mismatch directly. Feature mapping was evaluated for combating handset mismatch and was extended through the use of a blind clustering algorithm to remove the need for accurate handset labels for the training data. Mismatch modelling was then generalised by explicitly modelling the session conditions as a constrained offset of the speaker model means. This session variability modelling approach enabled the modelling of arbitrary sources of mismatch, including handset type, and halved the error rates in many cases. Methods to model the uncertainty in speaker model estimates and verification scores were developed to address the difficulties of limited training and testing data. The Bayes factor was introduced to account for the uncertainty of the speaker model estimates in testing by applying Bayesian theory to the verification criterion, with improved performance in matched conditions.
Modelling the uncertainty in the verification score itself met with significant success. Estimating a confidence interval for the "true" verification score enabled an order of magnitude reduction in the average quantity of speech required to make a confident verification decision based on a threshold. The confidence measures developed in this work may also have significant applications for forensic speaker verification tasks.
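The early-decision idea in this abstract can be made concrete with a small sketch (a hypothetical simplification assuming independent per-frame scores and a z-based interval; the thesis's actual confidence measures are more elaborate): accumulate frame scores and stop as soon as the confidence interval for the mean score excludes the decision threshold.

```python
import math

def confident_decision(frame_scores, threshold, z=1.96, min_frames=10):
    """Accumulate per-frame verification scores and decide as soon as a
    z-based confidence interval for the mean score excludes the threshold.
    Returns (decision, frames_used); decision is None if never confident."""
    total = total_sq = 0.0
    n = 0
    for n, s in enumerate(frame_scores, start=1):
        total += s
        total_sq += s * s
        if n < min_frames:
            continue
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)
        half = z * math.sqrt(var / n)  # half-width of the interval
        if mean - half > threshold:
            return True, n   # confidently above threshold: accept
        if mean + half < threshold:
            return False, n  # confidently below threshold: reject
    return None, n           # ran out of speech before becoming confident
```

A strong, consistent scorer is decided after only `min_frames` frames, which is the mechanism behind the order-of-magnitude reduction in the amount of speech needed for a confident decision.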
3

Zhang, Xiaozheng. "Automatic speechreading for improved speech recognition and speaker verification." Diss., Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/13067.

4

Ho, Ka-Lung. "Kernel eigenvoice speaker adaptation /." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20HOK.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2003.
Includes bibliographical references (leaves 56-61). Also available in electronic version. Access restricted to campus users.
5

Thiruvaran, Tharmarajah (Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW). "Automatic speaker recognition using phase based features." Awarded by: University of New South Wales, Electrical Engineering & Telecommunications, 2009. http://handle.unsw.edu.au/1959.4/44705.

Abstract:
Despite recent advances, improving the accuracy of automatic speaker recognition systems remains an important and challenging area of research. This thesis investigates two phase-based features, namely the frequency modulation (FM) feature and the group delay feature, in order to improve speaker recognition accuracy. Introducing features complementary to spectral envelope-based features is a promising approach for increasing the information content of the speaker recognition system. Although phase-based features are motivated by psychophysics and speech production considerations, they have rarely been incorporated into speaker recognition front-ends. A theory is developed and reported in this thesis to show that the FM component can be extracted using second-order all-pole modelling, and a technique for extracting FM features using this model is proposed, producing very smooth, slowly varying FM features that are effective for speaker recognition tasks. This approach is shown to significantly improve speaker recognition performance over other existing FM extraction methods. A highly computationally efficient FM estimation technique is then proposed, and its computational efficiency is shown through a comparative study with other methods with respect to the trade-off between computational complexity and performance. In order to further enhance the FM-based front-end specifically for speaker recognition, optimum frequency band allocation is studied in terms of the number of sub-bands and the spacing of centre frequencies, and two new frequency band re-allocations are proposed for FM-based speaker recognition. Two group delay features are also proposed, the log-compressed group delay feature and the sub-band group delay feature, to address problems in group delay caused by the zeros of the z-transform polynomial of a speech signal being close to the unit circle.
It is shown that the combination of group delay and FM complements Mel Frequency Cepstral Coefficients (MFCCs) in speaker recognition tasks. Furthermore, the proposed FM feature is successfully utilised for automatic forensic speaker recognition, implemented within the likelihood ratio framework with two-stage modelling and calibration, and shown to behave in a complementary manner to MFCCs. Notably, the FM-based system provides lower calibration loss than the MFCC-based system, suggesting less ambiguity of FM information than MFCC information in an automatic forensic speaker recognition system. In order to demonstrate the effectiveness of FM features in a large-scale speaker recognition environment, an FM-based speaker recognition subsystem was developed and submitted to the NIST 2008 speaker recognition evaluation as part of the I4U submission. Post-evaluation analysis shows a 19.7% relative improvement over the traditional MFCC-based subsystem when it is augmented by the FM-based subsystem. Consistent improvements in performance are obtained when MFCC is augmented with FM in all sub-categories of NIST 2008, in three development tasks and for the NIST 2001 database, demonstrating the complementary behaviour of MFCC and FM features.
6

Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition." Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.

7

Kamarauskas, Juozas. "Speaker recognition by voice." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090615_093847-20773.

Abstract:
Questions of speaker recognition by voice are investigated in this dissertation. Speaker recognition systems, their evolution, problems of recognition, systems of features, and questions of speaker modelling and matching used in text-independent and text-dependent speaker recognition are also considered. A text-independent speaker recognition system was developed during this work. The Gaussian mixture model approach was used for speaker modelling and pattern matching. An automatic method for voice activity detection was proposed. This method is fast and does not require any additional actions from the user, such as indicating patterns of the speech signal and noise. A system of features was proposed, consisting of parameters of the excitation source (glottal) and parameters of the vocal tract. The fundamental frequency was taken as the excitation source parameter, and four formants with three antiformants were taken as parameters of the vocal tract. In order to equate the dispersions of the formants and antiformants, we propose to use them on the mel-frequency scale. The standard mel-frequency cepstral coefficients (MFCC) were also implemented in the recognition system for comparison of the results. These features form the baseline in speech and speaker recognition. Speaker recognition experiments have shown that our proposed system of features outperformed standard mel-frequency cepstral coefficients. The equal error rate (EER) was 5.17% using the proposed... [to full text]
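The equal error rate quoted here is the operating point at which the false accept and false reject rates coincide; a minimal sketch of how it is read off a set of trial scores (illustrative only, not the dissertation's evaluation code):

```python
def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold over all observed scores and return the
    point where the false accept rate (FAR, impostors scoring above the
    threshold) and false reject rate (FRR, genuine speakers scoring below
    it) are closest, averaging the two as the EER."""
    eer, gap = 1.0, float("inf")
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < gap:
            gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A lower EER means better separation of genuine and impostor trials; perfectly separated score sets give an EER of 0.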
8

Chan, Chit-man. "Speaker-independent recognition of Putonghua finals /." [Hong Kong : University of Hong Kong], 1987. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12363091.

9

Du, Toit Ilze. "Non-acoustic speaker recognition." Thesis, Stellenbosch : University of Stellenbosch, 2004. http://hdl.handle.net/10019.1/16315.

Abstract:
Thesis (MScIng)--University of Stellenbosch, 2004.
ENGLISH ABSTRACT: In this study the phoneme labels derived from a phoneme recogniser are used for phonetic speaker recognition. The time-dependencies among phonemes are modelled by using hidden Markov models (HMMs) for the speaker models. Experiments are done using first-order and second-order HMMs, and various smoothing techniques are examined to address the problem of data scarcity. The use of word labels for lexical speaker recognition is also investigated. Single-word frequencies are counted and the use of various word selections as feature sets is investigated. During April 2004, the University of Stellenbosch, in collaboration with Spescom DataVoice, participated in an international speaker verification competition presented by the National Institute of Standards and Technology (NIST). The University of Stellenbosch submitted phonetic and lexical (non-acoustic) speaker recognition systems and a fused system (the primary system) that fuses the acoustic system of Spescom DataVoice with the non-acoustic systems of the University of Stellenbosch. The results were evaluated by means of a cost model. Based on the cost model, the primary system obtained second and third position in the two categories that were submitted.
10

Chan, Chit-man, and 陳哲民. "Speaker-independent recognition of Putonghua finals." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1987. http://hub.hku.hk/bib/B12363091.

Abstract:
Abstract of thesis entitled "Speaker-Independent Recognition of Putonghua Finals", submitted by CHAN, Chit Man for the degree of Doctor of Philosophy at the University of Hong Kong in December 1987. A detailed study was performed to address the problem of speaker-independent recognition of Putonghua (Mandarin) finals. The study included 35 Putonghua finals, 16 of which have trailing nasals. They were spoken by 51 speakers (38 females, 13 males) in 5 different tones, two times each. The sample was spectrally analyzed by a bank of 18 non-overlapping critical-band filters. Three data reduction techniques, Karhunen-Loeve Transformation (KLT), Discrete Cosine Transformation (DCT) and Stepwise Discriminant Analysis (SDA), were comparatively studied for their feature representation capability. The results indicated that KLT was superior to both DCT and SDA. Furthermore, the theoretic equivalence of DCT to KLT was found to be valid only with 5 or more feature dimensions used in computation. On the other hand, the results also showed that the Mahalanobis and a proposed modified Mahalanobis distance both gave a better measurement of performance than the other distances tested, which included the City Block, Euclidean, Minkowski, and Chebyshev.
In the second part of the study, the Hidden Markov Modelling (HMM) technique was investigated. Three classification methods, Phonemic Labelling (PL), Vector Quantization (VQ) and a proposed Hybrid Symbol (HS) generation, were studied for use with HMM. Whilst PL was found to be simple and efficient, its performance was not as good as VQ. However, the time taken by VQ was excessive, especially in training. The results with the HS method showed that it could successfully merge the speed advantage of PL and the better discriminatory power of VQ. An approximately 80% saving in the quantizer training time could be achieved with only a marginal loss in performance. At the same time, it was also found that allowing skipping of states in a left-to-right model (LRM) could lead to a negative effect on overall recognition. As an indication of performance, the recognition rate of the simulated system was 81.3%, 95.0% and 98.0% with the best 1, 2, and 3 candidates included, respectively, using a 256-level VQ and a 6-state, no-skip LRM on a sample of 8,400 finals from 48 speakers. The specific rates on non-nasal finals reached 96% - 98% using the best candidate alone.
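The distance comparison in the first part of this study can be illustrated with a small sketch (a diagonal-covariance simplification; the thesis's modified Mahalanobis variant is not reproduced here):

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal-covariance assumption:
    each filter-bank dimension is normalised by its class variance, so that
    high-variance channels do not dominate the comparison (one reason such a
    distance can outperform City Block, Euclidean, Minkowski or Chebyshev)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def classify(x, class_stats):
    """Assign x to the class whose (mean, var) pair gives the smallest
    squared Mahalanobis distance."""
    return min(class_stats, key=lambda c: mahalanobis_sq(x, *class_stats[c]))
```

With unit variances the measure reduces to the squared Euclidean distance, which makes the variance-normalisation effect easy to see.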
Electrical and Electronic Engineering. Doctor of Philosophy.
11

Wark, Timothy J. "Multi-modal speech processing for automatic speaker recognition." Thesis, Queensland University of Technology, 2001.

12

Yin, Shou-Chun, 1980-. "Speaker adaptation in joint factor analysis based text independent speaker verification." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.

Abstract:
This thesis presents methods for supervised and unsupervised speaker adaptation of Gaussian mixture speaker models in text-independent speaker verification. The proposed methods are based on an approach which is able to separate speaker and channel variability so that progressive updating of speaker models can be performed while minimizing the influence of the channel variability associated with the adaptation recordings. This approach relies on a joint factor analysis model of intrinsic speaker variability and session variability where inter-session variation is assumed to result primarily from the effects of the transmission channel. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 speaker recognition evaluation plan which is based on conversational telephone speech.
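The joint factor analysis model referred to here is commonly written M = m + Vy + Ux + Dz, where m is a speaker-independent mean supervector, Vy spans speaker variability, Ux spans session (channel) variability, and Dz is a diagonal residual. A minimal sketch with plain lists follows (standard JFA notation, not necessarily the thesis's own):

```python
def jfa_supervector(m, V, y, U, x, d, z):
    """Compose a speaker's GMM mean supervector as M = m + V*y + U*x + d*z.
    V and U are matrices given as lists of rows; d is the diagonal of D.
    During adaptation, the U*x (session) term is the part to be discounted
    so that channel effects do not contaminate the speaker model."""
    def matvec(A, v):
        return [sum(a * b for a, b in zip(row, v)) for row in A]
    vy = matvec(V, y)   # speaker-variability contribution
    ux = matvec(U, x)   # session-variability contribution
    return [mi + vyi + uxi + di * zi
            for mi, vyi, uxi, di, zi in zip(m, vy, ux, d, z)]
```

Separating the y (speaker) and x (session) factors is what allows progressive speaker-model updates while minimising the influence of the channel, as described above.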
13

Tran, Michael. "An approach to a robust speaker recognition system." Diss., This resource online, 1994. http://scholar.lib.vt.edu/theses/available/etd-06062008-164814/.

14

Wu, Jian, and 武健. "Discriminative speaker adaptation and environmental robustness in automatic speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31246138.

15

Slomka, Stefan. "Multiple classifier structures for automatic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 1999.

16

Elenius, Daniel. "Accounting for Individual Speaker Properties in Automatic Speech Recognition." Licentiate thesis, KTH, Speech Communication and Technology, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-12258.

Abstract:

In this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrades classification performance. In this work, a child exemplifies a speaker not represented in training data, and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Following this approach, a model suitable for a target speaker not well represented in the training data is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by operating also on sub-utterance units, such as phonemes, phone realizations, sub-phone realizations and sound frames.

Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children's speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the causes of errors, the body height of the child was shown to be correlated with word error rate.

A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass. The accuracy is similar to that of the standard technique.
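Standard VTLN of the kind discussed here warps the frequency axis by a speaker-specific factor; a common piecewise-linear form is sketched below (the 8 kHz Nyquist and 0.85 cut-off are illustrative assumptions, not values from the thesis):

```python
def piecewise_linear_warp(freq, alpha, f_nyquist=8000.0, f_cut=0.85):
    """Warp a frequency by factor alpha below a cut-off knee, then
    interpolate linearly so the Nyquist frequency maps to itself,
    keeping the warped axis within range."""
    knee = f_cut * f_nyquist
    if freq <= knee:
        return alpha * freq
    # linear segment from (knee, alpha*knee) up to (f_nyquist, f_nyquist)
    slope = (f_nyquist - alpha * knee) / (f_nyquist - knee)
    return alpha * knee + slope * (freq - knee)
```

Applying such warps with a range of alpha values to the training data is one way to "synthetically extend" the vocal tract length distribution, as described above, so that a single multi-warp model can be trained and recognition run in one pass.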

A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique.

The techniques devised and evaluated can also be applied to other speaker characteristic properties, which exhibit a dynamic nature.

An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too few to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database.


17

Akita, Yuya. "Automatic speaker indexing and speech recognition for panel discussions." 京都大学 (Kyoto University), 2005. http://hdl.handle.net/2433/144802.

18

Galler, Michael. "Improving phoneme models for speaker-independent automatic speech recognition." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=56977.

Abstract:
This thesis explores the use of randomized, performance-based search strategies to improve the generalization of an automatic speech recognition system based on hidden Markov models. We apply simulated annealing and random search to several components of the system, including phoneme model topologies, distribution tying, and the clustering of allophonic contexts. By using knowledge of the speech problem to constrain the search appropriately, we obtain reduced numbers of parameters and higher phonemic recognition results. Performance is measured on both our own data set and the DARPA TIMIT database.
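The simulated-annealing search applied here to model topologies follows the usual accept/reject schedule: worse candidates are accepted with probability exp(-delta/T) so the search can escape local optima. A generic sketch (illustrative, not the thesis implementation):

```python
import math
import random

def anneal(initial, neighbour, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Generic simulated annealing: propose a neighbouring candidate, accept
    it if it is better, or with probability exp(-delta/T) if it is worse,
    and cool the temperature geometrically. Returns the best state found."""
    rng = random.Random(seed)
    state = best = initial
    c = best_c = cost(initial)
    t = t0
    for _ in range(steps):
        cand = neighbour(state, rng)
        cc = cost(cand)
        if cc < c or rng.random() < math.exp(-(cc - c) / max(t, 1e-12)):
            state, c = cand, cc
            if c < best_c:
                best, best_c = state, c
        t *= cooling  # cooling schedule
    return best, best_c
```

For the thesis's use case, `state` would encode a phoneme model topology or clustering and `cost` would be a (negated) recognition-performance measure; the toy continuous example below just demonstrates the loop.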
19

Adami, André Gustavo. "Modeling prosodic differences for speaker and language recognition /." Full text open access at:, 2004. http://content.ohsu.edu/u?/etd,19.

20

Mandal, Arindam. "Transformation sharing strategies for MLLR speaker adaptation /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6087.

21

Hazen, Timothy J. (Timothy James) 1969. "The use of speaker correlation information for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/49989.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (p. 171-179).
by Timothy J. Hazen.
Ph.D.
22

Ramirez, Jose Luis. "Effects of clipping distortion on an Automatic Speaker Recognition system." Thesis, University of Colorado at Denver, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10112619.

Abstract:

Clipping distortion is a common problem in audio recording in which an audio signal is recorded at a higher amplitude than the recording system's limitations allow, resulting in a portion of the acoustic event not being recorded. Several government agencies employ Automatic Speaker Recognition (ASR) systems in order to identify the speaker of an acquired recording. This is done automatically, using an unbiased approach, by running a questioned recording through an ASR system and comparing it to a pre-existing database of voice samples for which the speakers are known. A matched speaker is indicated by a high likelihood between the questioned recording and one from the known database. It is possible that while the questioned recording was made the speaker was speaking too loudly into the recording device, a gain setting was set too high, or post-processing was applied, to the point that clipping distortion was introduced into the recording. Clipping distortion results from the amplitude of an audio signal surpassing the maximum sampling value of the recording system; it affects the quantized audio signal by truncating peaks at that maximum value rather than recording the actual amplitude of the input signal. In theory, clipping distortion will negatively affect likelihood ratios between two compared recordings of the same speaker. This thesis tests that hypothesis. Currently there is no research that serves as a guideline on the limitations of using clipped recordings. This thesis investigates to what degree clipped material affects the performance of a Forensic Automatic Speaker Recognition system.
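The truncation described above is simple to reproduce; a minimal sketch of hard clipping and a basic severity measure (illustrative only):

```python
def hard_clip(signal, limit):
    """Truncate samples whose magnitude exceeds the recorder's maximum
    value: peaks are flattened at the limit rather than recorded at their
    true amplitude, exactly the distortion described in the abstract."""
    return [max(-limit, min(limit, s)) for s in signal]

def clipped_fraction(signal, limit):
    """Proportion of samples at or beyond the limit: a crude measure of how
    severely a recording is clipped."""
    return sum(abs(s) >= limit for s in signal) / len(signal)
```

Progressively lowering `limit` on copies of the same recording is one way to generate the controlled degrees of clipping that such an experiment requires.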

23

Mathan, Luc Stefan. "Speaker-independent access to a large lexicon." Thesis, McGill University, 1987. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=63773.

24

Vipperla, Ravichander. "Automatic Speech Recognition for ageing voices." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5725.

Abstract:
With ageing, human voices undergo several changes, typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and a slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and to improve ASR accuracy for older voices. Baseline results on three corpora indicate that word error rates (WER) for older adults are significantly higher than those of younger adults, and the decrease in accuracy is greater for male speakers than for females. Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies, were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER between the two age groups was proven incorrect. Experiments with the artificial introduction of glottal source disfluencies into speech from younger adults did not show a significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have a marginal impact on ASR accuracy. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes, especially lower vowels, becoming more affected with ageing. These changes, however, are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments analysing the impact of slower speaking rate on ASR accuracy indicate that insertion errors increase when decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracy based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age.
Significant improvements in word accuracy are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models, which also outperform the baseline models by a good margin. Finally, we hypothesize that ASR accuracy can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
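The jitter measure mentioned above is typically computed as the mean absolute difference between consecutive pitch periods relative to the mean period; shimmer applies the same computation to per-cycle peak amplitudes. A minimal sketch:

```python
def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, divided by the mean period and expressed in percent. Higher
    values indicate the cycle-to-cycle irregularity associated here with
    older voices. The same formula applied to per-cycle peak amplitudes
    gives shimmer."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_period = sum(periods) / len(periods)
    return 100.0 * (sum(diffs) / len(diffs)) / mean_period
```

A perfectly periodic voice gives 0% jitter; alternating long and short cycles give large values.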
25

Bates, Rebecca Anne. "Speaker dynamics as a source of pronunciation variability for continuous speech recognition models /." Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/5858.

26

McLaren, Mitchell Leigh. "Improving automatic speaker verification using SVM techniques." Thesis, Queensland University of Technology, 2009. https://eprints.qut.edu.au/32063/1/Mitchell_McLaren_Thesis.pdf.

Abstract:
Automatic recognition of people is an active field of research with important forensic and security applications. In these applications, it is not always possible for the subject to be in close proximity to the system. Voice represents a human behavioural trait which can be used to recognise people in such situations. Automatic Speaker Verification (ASV) is the process of verifying a persons identity through the analysis of their speech and enables recognition of a subject at a distance over a telephone channel { wired or wireless. A significant amount of research has focussed on the application of Gaussian mixture model (GMM) techniques to speaker verification systems providing state-of-the-art performance. GMM's are a type of generative classifier trained to model the probability distribution of the features used to represent a speaker. Recently introduced to the field of ASV research is the support vector machine (SVM). An SVM is a discriminative classifier requiring examples from both positive and negative classes to train a speaker model. The SVM is based on margin maximisation whereby a hyperplane attempts to separate classes in a high dimensional space. SVMs applied to the task of speaker verification have shown high potential, particularly when used to complement current GMM-based techniques in hybrid systems. This work aims to improve the performance of ASV systems using novel and innovative SVM-based techniques. Research was divided into three main themes: session variability compensation for SVMs; unsupervised model adaptation; and impostor dataset selection. The first theme investigated the differences between the GMM and SVM domains for the modelling of session variability | an aspect crucial for robust speaker verification. Techniques developed to improve the robustness of GMMbased classification were shown to bring about similar benefits to discriminative SVM classification through their integration in the hybrid GMM mean supervector SVM classifier. 
Further, the domains for the modelling of session variation were contrasted, revealing a number of common factors; however, the SVM domain consistently provided marginally better session variation compensation. Minimal complementary information was found between the techniques, due to the similarities in how they achieved their objectives. The second theme saw the proposal of a novel model for the purpose of session variation compensation in ASV systems. Continuous progressive model adaptation attempts to improve speaker models by retraining them after exploiting all test utterances encountered during normal use of the system. The introduction of the weight-based factor analysis model provided significant performance improvements of over 60% in an unsupervised scenario. SVM-based classification was then integrated into the progressive system, providing further benefits in performance over the GMM counterpart. Analysis demonstrated that SVMs also hold several characteristics beneficial to the task of unsupervised model adaptation, prompting further research in the area. In pursuing the final theme, an innovative background dataset selection technique was developed. This technique selects the most appropriate subset of examples from a large and diverse set of candidate impostor observations for use as the SVM background by exploiting the SVM training process. This selection was performed on a per-observation basis so as to overcome the shortcomings of the traditional heuristic-based approach to dataset selection. Results demonstrate that the approach provides performance improvements over both the use of the complete candidate dataset and the best heuristically selected dataset, whilst being only a fraction of the size. The refined dataset was also shown to generalise well to unseen corpora and to be highly applicable to the selection of impostor cohorts required in alternative techniques for speaker verification.
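The GMM mean supervector construction at the centre of the hybrid classifier described above can be sketched in a few lines. The sketch below is illustrative only: it uses a tiny hand-specified diagonal-covariance UBM, synthetic 2-D features standing in for real cepstral features, and a common relevance factor of r = 16; none of these values are taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-component diagonal-covariance UBM with hand-picked parameters
ubm_w = np.array([0.5, 0.5])
ubm_mu = np.array([[-1.0, 0.0], [1.0, 0.0]])
ubm_var = np.ones((2, 2))

def log_gauss(x, mu, var):
    # (frames, components) matrix of diagonal-Gaussian log-densities
    return -0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                   + (((x[:, None, :] - mu) ** 2) / var).sum(axis=2))

def gmm_loglik(x, w, mu, var):
    lg = log_gauss(x, mu, var) + np.log(w)
    m = lg.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(lg - m).sum(axis=1))   # log-sum-exp

def map_adapt_means(x, w, mu, var, r=16.0):
    """Mean-only MAP adaptation of the UBM towards a speaker's enrolment frames."""
    lg = log_gauss(x, mu, var) + np.log(w)
    post = np.exp(lg - gmm_loglik(x, w, mu, var)[:, None])  # responsibilities
    n = post.sum(axis=0)                                    # soft counts
    ex = (post.T @ x) / np.maximum(n, 1e-9)[:, None]        # per-component data means
    alpha = (n / (n + r))[:, None]                          # adaptation weights
    return alpha * ex + (1 - alpha) * mu

# Enrol a "speaker" whose features sit away from the UBM, then score two trials
enrol = rng.normal([1.5, 1.0], 0.7, size=(300, 2))
spk_mu = map_adapt_means(enrol, ubm_w, ubm_mu, ubm_var)
supervector = spk_mu.ravel()    # stacked adapted means: the SVM input in hybrids

target = rng.normal([1.5, 1.0], 0.7, size=(300, 2))
impostor = rng.normal([-1.0, 0.0], 0.7, size=(300, 2))
llr_target = (gmm_loglik(target, ubm_w, spk_mu, ubm_var).mean()
              - gmm_loglik(target, ubm_w, ubm_mu, ubm_var).mean())
llr_impostor = (gmm_loglik(impostor, ubm_w, spk_mu, ubm_var).mean()
                - gmm_loglik(impostor, ubm_w, ubm_mu, ubm_var).mean())
```

In the hybrid system, supervectors from many utterances would be fed to an SVM with a linear kernel; here we simply check that the adapted model scores the target trial higher than the impostor trial.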
APA, Harvard, Vancouver, ISO, and other styles
27

Stokes-Rees, Ian James. "A Study of the Automatic Speech Recognition Process and Speaker Adaptation." Thesis, University of Waterloo, 2000. http://hdl.handle.net/10012/840.

Full text
Abstract:
This thesis considers the entire automated speech recognition process and presents a standardised approach to LVCSR experimentation with HMMs. It also discusses various approaches to speaker adaptation, such as MLLR and multiscale adaptation, and presents experimental results for cross-task speaker adaptation. An analysis of training parameters and of the data sufficiency needed for reasonable system performance estimates is also included. It is found that supervised Maximum Likelihood Linear Regression (MLLR) adaptation can result in a 6% absolute reduction in word error rate given only one minute of adaptation data, as compared with an unadapted model set trained on a different task. The unadapted system performed at 24% WER and the adapted system at 18% WER. This is achieved with only 4 to 7 adaptation classes per speaker, as generated from a regression tree.
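MLLR, as used above, estimates a shared affine transform of the Gaussian means for each regression class from the adaptation data. A minimal numerical sketch under the simplifying assumption of identity covariances (the full MLLR solution weights the accumulators by the inverse covariances); the statistics below are synthetic, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 10, 3                                  # Gaussians in the class, feature dim
mu = rng.normal(size=(K, d))                  # unadapted model means

# Pretend the "true" speaker effect is an affine shift of the means
A_true = np.eye(d) + 0.1 * rng.normal(size=(d, d))
b_true = rng.normal(size=d)
counts = rng.integers(5, 50, size=K).astype(float)   # soft frame counts
xbar = mu @ A_true.T + b_true                        # per-Gaussian adaptation-data means

# Identity-covariance MLLR: W minimises sum_k n_k ||xbar_k - W xi_k||^2
xi = np.hstack([np.ones((K, 1)), mu])         # extended means xi_k = [1, mu_k]
G = (xi * counts[:, None]).T @ xi             # (d+1, d+1) accumulator
Z = (xi * counts[:, None]).T @ xbar           # (d+1, d) accumulator
W = np.linalg.solve(G, Z).T                   # d x (d+1) transform, columns [b | A]

adapted = xi @ W.T                            # adapted means A mu_k + b
```

With real data, the counts and per-Gaussian means come from a forward-backward pass over the adaptation utterances, and one transform is estimated per node of the regression tree.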
APA, Harvard, Vancouver, ISO, and other styles
28

Stokes-Rees, Ian. "A study of the automatic speech recognition process and speaker adaptation." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0018/MQ56683.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Finan, Robert Andrew. "Towards the use of sub-band processing in automatic speaker recognition." Thesis, University of Abertay Dundee, 1998. http://eprints.soton.ac.uk/256266/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Brummer, Niko. "Measuring, refining and calibrating speaker and language information extracted from speech." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/5139.

Full text
Abstract:
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.
ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications having different cost functions. A recognizer with likelihood output can additionally be employed over a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate the hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) a non-strict scoring rule, directly representing error-rate at a given prior; (ii) the strict logarithmic scoring rule, which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria which, by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizers' likelihoods. 
Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers.
AFRIKAANSE OPSOMMING (translated): We show how to represent, measure, calibrate and optimise the uncertainty in the output of automatic speaker recognition and language recognition systems. This makes the existing technology more accurate, more efficient and more widely applicable.
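Applied to a set of verification trials, the logarithmic scoring rule emphasised above yields the Cllr measure associated with this methodology. A minimal sketch:

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Mean logarithmic-scoring-rule cost, in bits, of a set of LLR scores.

    A calibrated but uninformative system (all LLRs = 0) costs exactly
    1 bit; a good, well-calibrated system scores well below 1.
    """
    t = np.asarray(target_llrs, dtype=float)
    n = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-t)))   # penalise low LLRs on targets
    c_non = np.mean(np.log2(1.0 + np.exp(n)))    # penalise high LLRs on impostors
    return 0.5 * (c_tar + c_non)
```

For example, `cllr([0.0], [0.0])` evaluates to exactly 1 bit, while well-separated scores such as `cllr([5.0], [-5.0])` cost only about 0.01 bits.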
APA, Harvard, Vancouver, ISO, and other styles
31

Chan, Siu Man. "Improved speaker verification with discrimination power weighting /." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202004%20CHANS.

Full text
Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2004.
Includes bibliographical references (leaves 86-93). Also available in electronic version. Access restricted to campus users.
APA, Harvard, Vancouver, ISO, and other styles
32

Garau, Giulia. "Speaker normalisation for large vocabulary multiparty conversational speech recognition." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/3983.

Full text
Abstract:
One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noises etc.) and to the variability of speech across different speakers (i.e. due to different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis, parameterised through a warping factor. In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task showing great variability of the speech acoustics, both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using Maximum Likelihood seem to be context dependent: they appear to be influenced by the current conversational partner and are correlated with the behaviour of formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch. In this thesis we also investigated pitch-adaptive acoustic features, with the goal of further improving the speaker normalisation provided by VTLN. We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level, using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. 
Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all the tasks. Combining at the system level using ROVER resulted in a further significant improvement. Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation which is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this has also been shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.
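The parameterised frequency warping behind VTLN can be illustrated with its common piecewise-linear form: frequencies are scaled by the warping factor up to a cut-off, then mapped linearly so that the Nyquist frequency is preserved. The cut-off of 0.85 times Nyquist below is a typical choice, not the exact configuration used in the thesis:

```python
def vtln_warp(f, alpha, f_nyq=8000.0, cut_ratio=0.85):
    """Piecewise-linear VTLN warp: f -> alpha * f below the cut-off,
    then a linear segment that leaves the Nyquist frequency fixed."""
    f_cut = cut_ratio * f_nyq
    if f <= f_cut:
        return alpha * f
    # connect (f_cut, alpha * f_cut) to (f_nyq, f_nyq)
    slope = (f_nyq - alpha * f_cut) / (f_nyq - f_cut)
    return alpha * f_cut + slope * (f - f_cut)
```

In a recogniser this warp is applied to the mel filterbank centre frequencies, with the warping factor (typically searched over roughly 0.88 to 1.12) chosen per speaker by maximum likelihood.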
APA, Harvard, Vancouver, ISO, and other styles
33

Castellano, Pierre John. "Speaker recognition modelling with artificial neural networks." Thesis, Queensland University of Technology, 1997.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
34

Malyska, Nicolas 1977. "Analysis of nonmodal glottal event patterns with application to automatic speaker recognition." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43804.

Full text
Abstract:
Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2008.
Includes bibliographical references (p. 211-215).
Regions of phonation exhibiting nonmodal characteristics are likely to contain information about speaker identity, language, dialect, and vocal-fold health. As a basis for testing such dependencies, we develop a representation of patterns in the relative timing and height of nonmodal glottal pulses. To extract the timing and height of candidate pulses, we investigate a variety of inverse-filtering schemes including maximum-entropy deconvolution, which minimizes the predictability of a signal, and minimum-entropy deconvolution, which maximizes pulse-likeness. Hybrid formulations of these methods are also considered. We then derive a theoretical framework for understanding frequency- and time-domain properties of a pulse sequence, a process that sheds light on the transformation of nonmodal pulse trains into useful parameters. In the frequency domain, we introduce the first comprehensive mathematical derivation of the effect of deterministic and stochastic source perturbation on the short-time spectrum. We also propose a pitch representation of nonmodality that provides an alternative viewpoint on the frequency content that does not rely on Fourier bases. In developing time-domain properties, we use projected low-dimensional histograms of feature vectors derived from pulse timing and height parameters. For these features, we have found clusters of distinct pulse patterns, reflecting a wide variety of glottal-pulse phenomena including near-modal phonation, shimmer and jitter, diplophonia and triplophonia, and aperiodicity. Using temporal relationships between successive feature vectors, an algorithm by which to separate these different classes of glottal-pulse characteristics has also been developed.
We have used our glottal-pulse-pattern representation to automatically test for one signal dependency: speaker dependence of glottal-pulse sequences. This choice is motivated by differences observed between talkers in our separated feature space. Using an automatic speaker verification experiment, we investigate tradeoffs in speaker dependency for short-time pulse patterns, reflecting local irregularity, as well as long-time patterns related to higher-level cyclic variations. Results, using speakers with a broad array of modal and nonmodal behaviors, indicate a high accuracy in speaker recognition performance, complementary to the use of conventional mel-cepstral features. These results suggest that there is rich structure to the source excitation that provides information about a particular speaker's identity.
by Nicolas Malyska.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
35

Marchetto, Enrico. "Automatic Speaker Recognition and Characterization by means of Robust Vocal Source Features." Doctoral thesis, Università degli studi di Padova, 2011. http://hdl.handle.net/11577/3427390.

Full text
Abstract:
Automatic Speaker Recognition is a wide research field which encompasses many topics: signal processing, human vocal and auditory physiology, statistical modelling, cognitive sciences, and so on. The study of these techniques started about thirty years ago and, since then, the improvement has been dramatic. Nonetheless the field still poses open issues, and many active research centers around the world are working towards more reliable and better performing systems. This thesis documents a Philosophiae Doctor project funded by the privately held company RT - Radio Trevisan Elettronica Industriale S.p.A. The title of the fellowship is "Automatic speaker recognition with applications to security and intelligence". Part of the work was carried out during a six-month visit to the Speech, Music and Hearing Department of the KTH Royal Institute of Technology, Stockholm. Speaker Recognition research develops techniques to automatically associate a given human voice with a previously recorded version of it. Speaker Recognition is usually further defined as Speaker Identification or Speaker Verification; in the former, the identity of a voice has to be found among a (possibly high) number of speaker voices, while in the latter the system is provided with both a voice and a claimed identity, and the association has to be verified as a true/false statement. The recognition system also provides a confidence score for the results it finds. The first Part of the thesis reviews the state of the art of Speaker Recognition research. The main components of a recognition system are described: audio feature extraction, statistical modelling, and performance assessment. Over the years the research community has developed a number of Audio Features, used to describe the information carried by the vocal signal in a compact and deterministic way. 
In every automatic recognition application, including speech or language recognition, feature extraction is the first step, in charge of substantially compressing the size of the input data without losing any important information. The choice of the features best suited to a specific application, and their tuning, are crucial to obtaining satisfactory recognition results; moreover, the definition of innovative features is a lively research direction, because it is generally recognized that existing features are still far from exploiting the whole information load carried by the vocal signal. Some audio features have proved over the years to perform better than others; two of them are described in Part I: Mel-Frequency Cepstral Coefficients and Linear Prediction Coefficients. More refined and experimental features are also introduced, and are explained in Part III. Statistical modelling is introduced, particularly by discussing the structure of Gaussian Mixture Models and their training through the EM algorithm; specific modelling techniques for recognition, such as the Universal Background Model, are described. Scoring is the last phase of a Speaker Recognition process and involves a number of normalizations; it compensates for different recording conditions or modelling issues. Part I continues by presenting a number of audio databases that are commonly used in the literature as benchmarks for comparing results and recognition systems, in particular TIMIT and the NIST Speaker Recognition Evaluation - SRE 2004. A recognition prototype system was built during the PhD project, and it is detailed in Part II. The first Chapter describes the proposed application, relating to intelligence and security. The application fulfils specific requirements of the Authorities when investigations involve phone wiretapping or environmental interception. 
In these cases the Authorities have to listen to a large number of recordings, most of which are not related to the investigation. The application idea is to automatically detect and label speakers, giving the possibility to search for a specific speaker through the collection of recordings. This avoids wasted time, resulting in an economic advantage. Many difficulties arise from phone lines, which are known to degrade the speech signal and reduce recognition performance; the main issues are the narrow audio bandwidth, additive noise and convolutional noise, the last resulting in phase distortion. The second Chapter in Part II describes in detail the developed Speaker Recognition system, and a number of design choices are discussed. During development, the research scope of the system was kept central: much effort was put into obtaining a system with good performance that nonetheless remains easily and deeply modifiable. The assessment of results on different databases posed further challenges, which were solved with a unified interface to the databases. The fundamental components of a speaker recognition system were developed, along with some speed-up improvements. Lastly, the whole software can run on a computing cluster without any reconfiguration, a crucial characteristic for assessing performance on big databases in reasonable time. During the three-year project some works related to Speaker Recognition, although not directly part of it, were also developed. These developments are described in Part II as extensions of the prototype. First, a Voice Activity Detector suitable for noisy recordings is explained. The first step of feature extraction is to find and select, from a given recording, only the segments containing voice; this is not a trivial task when the recording is noisy and a simple "energy threshold" approach fails. 
The developed VAD is based on advanced features, computed from Wavelet Transforms, which are further processed using an adaptive threshold. A second application is Speaker Diarization: it automatically segments an audio recording that contains different speakers. The outputs of the diarization are a segmentation and a speaker label for each segment, answering the question "who speaks when". The third and last collateral work is a Noise Reduction system for voice applications, developed on a hardware DSP. The noise reduction algorithm adaptively detects the noise and reduces it, keeping only the voice; it works in real time using only a small portion of the DSP's computing power. Lastly, Part III discusses innovative audio features, which are the main novel contribution of this thesis. The features are obtained from the glottal flow, so the first Chapter in this Part describes the anatomy of the vocal folds and of the vocal tract. The working principle of the phonation apparatus is described and the importance of vocal-fold physics is pointed out. The glottal flow is an input air flow for the vocal tract, which acts as a filter; an open-source toolkit for inverting the vocal tract filter is introduced, which permits estimation of the glottal flow from speech recordings. Some methods used to characterize the glottal flow numerically are described. In the subsequent Chapter, the novel glottal features are defined. The glottal flow estimates are not always reliable, so a first step detects and deletes unlikely flows. A numerical procedure then groups and sorts the flow estimates, preparing them for statistical modelling. Performance measures are then discussed, comparing the novel features against the standard ones on the reference databases TIMIT and SRE 2004. A Chapter is dedicated to a different research work, related to glottal flow characterization. 
A physical model of the vocal folds is presented, with a number of control rules able to describe the vocal-fold dynamics. The rules permit the translation of a specific pharyngeal muscular set-up into mechanical parameters of the model, which result in a specific glottal flow (obtained after a computer simulation of the model). The so-called Inverse Problem is defined in this way: given a glottal flow, find the muscular set-up which, used to drive a model simulation, produces the same glottal flow as the given one. The inverse problem poses a number of difficulties, such as the non-uniqueness of the inversion and the sensitivity to slight variations in the input flow. An optimization-based control technique was developed and is explained. The final Chapter summarizes the achievements of the thesis. Along with this discussion, a roadmap for future improvements to the features is sketched. In the end, a summary of the articles published in and submitted to conferences and journals is presented.
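The vocal-tract inversion underlying the glottal features rests on the standard source-filter view: fit an all-pole (LPC) model to the speech and inverse-filter to recover a residual approximating the glottal source. A bare-bones sketch of the autocorrelation LPC method, verified on a synthetic all-pole signal; the open-source toolkit referenced in the thesis adds refinements such as closed-phase analysis and lip-radiation compensation that are omitted here:

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC: solves the Yule-Walker normal equations."""
    x = np.asarray(x, dtype=float)
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])   # predictor coefficients a_1..a_p

def inverse_filter(x, a):
    """Residual e[n] = x[n] - sum_k a_k x[n-k]; approximates the source signal."""
    poly = np.concatenate(([1.0], -np.asarray(a)))
    return np.convolve(x, poly)[: len(x)]

# Sanity check: synthesize an AR(2) signal with known coefficients ...
rng = np.random.default_rng(0)
a_true = np.array([0.75, -0.5])
e = rng.normal(size=8000)
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n] + a_true[0] * x[n - 1] + a_true[1] * x[n - 2]

# ... and confirm that LPC analysis approximately recovers them
a_est = lpc(x, 2)
```

Inverse-filtering the signal with the true coefficients recovers the excitation exactly; with estimated coefficients the residual is a close approximation of the source, which is then screened and parameterised as described above.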
APA, Harvard, Vancouver, ISO, and other styles
36

Li, Wei. "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." Click to view the E-thesis via HKUTO, 2007. http://sunzi.lib.hku.hk/hkuto/record/B39634073.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Li, Wei, and 李威. "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2007. http://hub.hku.hk/bib/B39634073.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Baghdasaryan, Areg Gagik. "Automatic Phoneme Recognition with Segmental Hidden Markov Models." Thesis, Virginia Tech, 2010. http://hdl.handle.net/10919/31182.

Full text
Abstract:
A speaker independent continuous speech phoneme recognition and segmentation system is presented. We discuss the training and recognition phases of the phoneme recognition system as well as a detailed description of the integrated elements. The Hidden Markov Model (HMM) based phoneme models are trained using the Baum-Welch re-estimation procedure. Recognition and segmentation of the phonemes in the continuous speech is performed by a Segmental Viterbi Search on a Segmental Ergodic HMM for the phoneme states. We describe in detail the three phases of the phoneme joint recognition and segmentation system. First, the extraction of the Mel-Frequency Cepstral Coefficients (MFCC) and the corresponding Delta and Delta Log Power coefficients is described. Second, we describe the operation of the Baum-Welch re-estimation procedure for the training of the phoneme HMM models, including the K-Means and the Expectation-Maximization (EM) clustering algorithms used for the initialization of the Baum-Welch algorithm. Additionally, we describe the structural framework of - and the recognition procedure for - the ergodic Segmental HMM for the phoneme segmentation and recognition. We include test and simulation results for each of the individual systems integrated into the phoneme recognition system and finally for the phoneme recognition/segmentation system as a whole.
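The recognition step summarised in this abstract rests on a Viterbi search over HMM states. As a sketch only, here is a minimal non-segmental Viterbi decoder over a discrete-observation HMM; the probabilities in the usage below are toy values invented for illustration, not drawn from the thesis:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for a discrete observation sequence.

    obs: list of observation symbols.
    states: list of state names (e.g. phoneme labels).
    log_start[s], log_trans[p][s], log_emit[s][o]: log-probabilities.
    """
    # Initialise each state's best score and path with the first observation.
    best = [{s: (log_start[s] + log_emit[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            # Pick the best predecessor state for s, then emit o from s.
            score, path = max(
                (best[-1][p][0] + log_trans[p][s], best[-1][p][1])
                for p in states
            )
            col[s] = (score + log_emit[s][o], path + [s])
        best.append(col)
    # Return the path of the highest-scoring final state.
    return max(best[-1].values())[1]
```

A segmental variant, as used in the thesis, would additionally score whole segments (phoneme durations) rather than single frames, but the dynamic-programming core is the same.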
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
39

Keyvani, Alireza. "Robustness in ASR : an experimental study of the interrelationship between discriminant feature-space transformation, speaker normalization and environment compensation." Thesis, McGill University, 2007. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=99772.

Full text
Abstract:
This thesis addresses the general problem of maintaining robust automatic speech recognition (ASR) performance under diverse speaker populations, channel conditions, and acoustic environments. To this end, the thesis analyzes the interactions between environment compensation techniques, frequency warping based speaker normalization, and discriminant feature-space transformation (DFT). These interactions were quantified by performing experiments on the connected digit utterances comprising the Aurora 2 database, using continuous density hidden Markov models (HMM) representing individual digits.
Firstly, given that the performance of speaker normalization techniques degrades in the presence of noise, it is shown that reducing the effects of noise through environmental compensation, prior to speaker normalization, leads to substantial improvements in ASR performance. The speaker normalization techniques considered here were vocal tract length normalization (VTLN) and the augmented state-space acoustic decoder (MATE). Secondly, given that discriminant feature-space transformations (DFT) are known to increase class separation, it is shown that performing speaker normalization using VTLN in a discriminant feature-space leads to improvements in the performance of this technique. Classes, in our experiments, corresponded to HMM states. Thirdly, an effort was made to achieve higher class discrimination by normalizing the speech data used to estimate the discriminant feature-space transform. Normalization, in our experiments, corresponded to reducing the variability within each class through the use of environment compensation and speaker normalization. Significant ASR performance improvements were obtained when normalization was performed using environment compensation, while our results were inconclusive for the case where normalization consisted of speaker normalization. Finally, aimed at increasing its noise robustness, a simple modification of MATE is presented. This modification consisted of using, during recognition, knowledge of the distribution of warping factors selected by MATE during training.
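The VTLN technique evaluated in this abstract rescales the frequency axis per speaker by a warping factor. A common piecewise-linear formulation (not necessarily the exact variant used in the thesis; the knee and Nyquist defaults are assumptions) can be sketched as:

```python
def vtln_warp(freq, alpha, f_nyquist=8000.0, f_cut=7000.0):
    """Piecewise-linear VTLN warping of a frequency in Hz.

    The spectrum is scaled by `alpha` up to a knee frequency; a second
    linear piece then maps the remainder so that f_nyquist stays fixed.
    alpha > 1 compresses the spectrum, alpha < 1 stretches it.
    """
    # Place the knee so the first piece never exceeds f_cut.
    knee = f_cut if alpha <= 1.0 else f_cut / alpha
    if freq <= knee:
        return alpha * freq
    # Linear piece from (knee, alpha*knee) to (f_nyquist, f_nyquist).
    slope = (f_nyquist - alpha * knee) / (f_nyquist - knee)
    return alpha * knee + slope * (freq - knee)
```

In practice the warping factor alpha is chosen per speaker, typically by a maximum-likelihood grid search over a small range such as 0.88 to 1.12.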
APA, Harvard, Vancouver, ISO, and other styles
40

Campanelli, Michael R. "Computer classification of stop consonants in a speaker independent continuous speech environment /." Online version of thesis, 1991. http://hdl.handle.net/1850/11051.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Cilliers, Francois Dirk. "Tree-based Gaussian mixture models for speaker verification." Thesis, Link to the online version, 2005. http://hdl.handle.net/10019.1/1639.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Alsharhan, Iman. "Exploiting phonological constraints and automatic identification of speaker classes for Arabic speech recognition." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/exploiting-phonologicalconstraints-and-automaticidentification-of-speakerclasses-for-arabic-speechrecognition(8d443cae-e9e4-4f40-8884-99e2a01df8e9).html.

Full text
Abstract:
The aim of this thesis is to investigate a number of factors that could affect the performance of an Arabic automatic speech understanding (ASU) system. The work described in this thesis belongs to the speech recognition (ASR) phase, but the fact that it is part of an ASU project rather than a stand-alone piece of work on ASR influences the way in which it will be carried out. Our main concern in this work is to determine the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recogniser. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and eventually provide a phonetic transcription that is sensitive to the local context for the ASR system. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than a predefined dictionary that typically does not reflect the actual pronunciation of the word. Besides investigating the boundary effect on pronunciation, the research also seeks to address the problem of Arabic's complex morphology. Two solutions are proposed to tackle this problem, namely, using underspecified phonetic transcription to build the system, and using phonemes instead of words to build the hidden Markov models (HMMs). The research also seeks to investigate several technical settings that might have an effect on the system's performance. These include training on the sub-population to minimise the variation caused by training on the main undifferentiated population, as well as investigating the correlation between training size and performance of the ASR system.
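Grapheme-to-allophone rules of the kind proposed in this abstract are context-sensitive rewrite rules. The toy sketch below is illustrative only and is not the thesis's rule set: it romanises one well-known Arabic context effect (sun-letter assimilation of the definite article) plus an invented word-final rule, with the sun-letter inventory deliberately truncated:

```python
import re

# Illustrative subset of Arabic "sun letters" in romanisation.
SUN_LETTERS = "tdrzsn"

def to_allophones(word):
    """Tiny sketch of context-sensitive grapheme-to-allophone rewriting."""
    # Rule 1: "al-" + sun letter -> the /l/ assimilates,
    # e.g. "al-shams" is pronounced "as-shams".
    m = re.match(r"al-([" + SUN_LETTERS + r"])", word)
    if m:
        word = "a" + m.group(1) + "-" + word[3:]
    # Rule 2 (invented for illustration): mark word-final "t" as a
    # distinct allophone with a trailing dot.
    if word.endswith("t"):
        word = word[:-1] + "t."
    return word
```

A full rule set would run many such ordered rules over the whole utterance, so that each word's pronunciation reflects its actual local context rather than a dictionary entry.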
APA, Harvard, Vancouver, ISO, and other styles
43

CAON, D. R. S. "Automatic Speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing." Universidade Federal do Espírito Santo, 2010. http://repositorio.ufes.br/handle/10/4229.

Full text
Abstract:
Throughout the work, the large-vocabulary continuous speech recognition system Julius is used in conjunction with the Hidden Markov Model Toolkit (HTK). The main features of Julius are described, and the system was also modified. First, the theory of speech signal recognition is presented. Experiments are carried out with adaptation of hidden Markov models and with K-fold cross-validation. Speech recognition results after acoustic adaptation to a specific speaker (and the creation of language models specific to a system-demonstration scenario) showed an 86.39% sentence accuracy rate for the Dutch acoustic models. The same data show a 94.44% semantic sentence accuracy rate.
APA, Harvard, Vancouver, ISO, and other styles
44

Caon, Daniel Régis Sarmento. "Automatic speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing." Universidade Federal do Espírito Santo, 2010. http://repositorio.ufes.br/handle/10/6390.

Full text
Abstract:
This work aims to provide automatic cognitive assistance via a speech interface to elderly people who live alone and are in at-risk situations. Distress expressions and voice commands are part of the target vocabulary for speech recognition. Throughout the work, the large-vocabulary continuous speech recognition system Julius is used in conjunction with the Hidden Markov Model Toolkit (HTK). The main features of Julius are described, including the modification made to it. This modification is part of the contribution of this work, as is the detection of distress expressions (speech situations that suggest an emergency). Four languages were targeted for recognition: French, Dutch, Spanish and English. In this same sequence of languages (determined by data availability and the locations of the system-integration scenarios), theoretical studies and experiments were conducted to meet the needs of each new configuration. This work includes the studies carried out for French and Dutch. Initial experiments (in French) were performed with adaptation of hidden Markov models and analysed by cross-validation. For a new demonstration in Dutch, acoustic and language models were built and the system was integrated with other auxiliary modules (such as a voice activity detector and the dialogue system). Speech recognition results after acoustic adaptation to a specific speaker (and the creation of language models for a specific system-demonstration scenario) showed an 86.39% sentence accuracy rate for the Dutch acoustic models. The same data show a 94.44% semantic sentence accuracy rate.
APA, Harvard, Vancouver, ISO, and other styles
45

Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Gabriel, Naveen. "Automatic Speech Recognition in Somali." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.

Full text
Abstract:
The field of speech recognition has, during the last decade, left the research stage and found its way into the public market; today, speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, automatic speech recognition using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system whose framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. The thesis begins with background on speech recognition, and a literature survey covers some of the work that has been done in this field. The thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. Log scores were also calculated for the top 5 predicted sentences, along with confidence measures of the predicted sentences. The model was trained on 4.5 hours of voiced data and its corresponding transcription, and evaluated on 3 minutes of test data. The performance of the trained model on the test set was good, given that the data was devoid of background noise and lacked variability. Performance is measured using word error rate (WER) and sentence error rate (SER), and is also compared with the results of other research work. The thesis further discusses why the log and confidence scores of a sentence might not be a good way to measure the performance of the resulting model, the shortcomings of the HMM-GMM model, how the existing model can be improved, and alternative approaches to the problem.
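The word error rate reported in this abstract is a word-level Levenshtein (edit) distance normalised by the reference length. A standard implementation sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Computed with Levenshtein distance over word tokens.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason it is usually reported together with sentence error rate.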
APA, Harvard, Vancouver, ISO, and other styles
47

Patino, Villar José María. "Efficient speaker diarization and low-latency speaker spotting." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS003.

Full text
Abstract:
Speaker diarization (SD) involves the detection of speakers within an audio stream and the intervals during which each speaker is active, i.e. the determination of 'who spoke when'. The first part of the work presented in this thesis exploits an approach to speaker modelling involving binary keys (BKs) as a solution to SD. BK modelling is efficient and operates without external training data, as it uses test data alone. The presented contributions include the extraction of BKs based on multi-resolution spectral analysis, the explicit detection of speaker changes using BKs, as well as SD fusion techniques that combine the benefits of both BK and deep learning based solutions. The SD task is closely linked to that of speaker recognition or detection, which involves the comparison of two speech segments and the determination of whether or not they were uttered by the same speaker. Even though many practical applications require their combination, the two tasks are traditionally tackled independently of each other. The second part of this thesis considers an application where SD and speaker recognition solutions are brought together. The new task, coined low latency speaker spotting (LLSS), involves the rapid detection of known speakers within multi-speaker audio streams. It involves the re-thinking of online diarization and the manner by which diarization and detection sub-systems should best be combined
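Binary-key modelling as summarised in this abstract reduces each speech segment to a fixed-length binary vector, so comparing two segments reduces to comparing bit vectors. As a sketch of one typical choice in the BK literature (the exact metric used in the thesis may differ), a Jaccard-style similarity:

```python
def binary_key_similarity(key_a, key_b):
    """Similarity between two binary keys (equal-length sequences of 0/1).

    Ratio of positions set in both keys to positions set in either key
    (Jaccard similarity); 1.0 means identical support, 0.0 means disjoint.
    """
    both = sum(1 for a, b in zip(key_a, key_b) if a and b)
    either = sum(1 for a, b in zip(key_a, key_b) if a or b)
    return both / either if either else 0.0
```

Because such comparisons are cheap bit operations, clustering segments by key similarity is what makes BK diarization fast enough to run without external training data.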
APA, Harvard, Vancouver, ISO, and other styles
48

He, Xiaodong. "Model selection based speaker adaptation and its application to nonnative speech recognition /." free to MU campus, to others for purchase, 2003. http://wwwlib.umi.com/cr/mo/fullcit?p3115555.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Alamri, Safi S. "Text-independent, automatic speaker recognition system evaluation with males speaking both Arabic and English." Thesis, University of Colorado at Denver, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1605087.

Full text
Abstract:

Automatic speaker recognition is an important key to speaker identification in media forensics, and as cultures increasingly mix, the number of bilingual speakers around the world is growing. The purpose of this thesis is to compare text-independent samples of one person speaking two different languages, Arabic and English, against a single-language reference population. The hope is to begin a design that may be useful in further developing software that can perform accurate text-independent ASR for bilingual speakers speaking either language against a single-language reference population. This thesis took an Arabic model sample and compared it against samples that were both Arabic and English, using an Arabic reference population, all collected from videos downloaded from the Internet. All of the samples were text-independent and enhanced for optimal performance. The data was run through a biometric software package called BATVOX 4.1, which utilizes MFCC and GMM methods for speaker recognition and identification. Testing through BATVOX 4.1 produced likelihood ratios for each sample, which were evaluated for similarities and differences, trends, and problems that occurred.
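The likelihood ratios produced by systems like the one in this abstract compare how well the evidence fits a target-speaker model versus a background (reference-population) model. As a drastically simplified sketch, not BATVOX's actual GMM scoring, here is a log-likelihood-ratio score using univariate Gaussians; the means and variances in the usage are invented:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of x under a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_likelihood_ratio(features, target, background):
    """Average log-likelihood ratio of features under two models.

    target / background: (mean, variance) tuples. Positive scores favour
    the hypothesis that the target speaker produced the features.
    """
    score = sum(gaussian_loglik(x, *target) - gaussian_loglik(x, *background)
                for x in features)
    return score / len(features)
```

A real system replaces each Gaussian with a full GMM over MFCC vectors, but the decision logic, comparing the score against a threshold, is the same.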

APA, Harvard, Vancouver, ISO, and other styles
