Dissertations / Theses on the topic 'Automatic speaker recognition'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Automatic speaker recognition.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Deterding, David Henry. "Speaker normalisation for automatic speech recognition." Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.359822.
Vogt, Robert Jeffery. "Automatic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 2006. https://eprints.qut.edu.au/36195/1/Robert_Vogt_Thesis.pdf.
Zhang, Xiaozheng. "Automatic speechreading for improved speech recognition and speaker verification." Diss., Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/13067.
Ho, Ka-Lung. "Kernel eigenvoice speaker adaptation." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20HOK.
Includes bibliographical references (leaves 56-61). Also available in electronic version. Access restricted to campus users.
Thiruvaran, Tharmarajah. "Automatic speaker recognition using phase based features." Awarded by: University of New South Wales, Electrical Engineering & Telecommunications, 2009. http://handle.unsw.edu.au/1959.4/44705.
Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition." Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.
Kamarauskas, Juozas. "Speaker recognition by voice." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090615_093847-20773.
Full textDisertacijoje nagrinėjami kalbančiojo atpažinimo pagal balsą klausimai. Aptartos kalbančiojo atpažinimo sistemos, jų raida, atpažinimo problemos, požymių sistemos įvairovė bei kalbančiojo modeliavimo ir požymių palyginimo metodai, naudojami nuo ištarto teksto nepriklausomame bei priklausomame kalbančiojo atpažinime. Darbo metu sukurta nuo ištarto teksto nepriklausanti kalbančiojo atpažinimo sistema. Kalbėtojų modelių kūrimui ir požymių palyginimui buvo panaudoti Gauso mišinių modeliai. Pasiūlytas automatinis vokalizuotų garsų išrinkimo (segmentavimo) metodas. Šis metodas yra greitai veikiantis ir nereikalaujantis iš vartotojo jokių papildomų veiksmų, tokių kaip kalbos signalo ir triukšmo pavyzdžių nurodymas. Pasiūlyta požymių vektorių sistema, susidedanti iš žadinimo signalo bei balso trakto parametrų. Kaip žadinimo signalo parametras, panaudotas žadinimo signalo pagrindinis dažnis, kaip balso trakto parametrai, panaudotos keturios formantės bei trys antiformantės. Siekiant suvienodinti žemesnių bei aukštesnių formančių ir antiformančių dispersijas, jas pasiūlėme skaičiuoti melų skalėje. Rezultatų palyginimui sistemoje buvo realizuoti standartiniai požymiai, naudojami kalbos bei asmens atpažinime – melų skalės kepstro koeficientai (MSKK). Atlikti kalbančiojo atpažinimo eksperimentai parodė, kad panaudojus pasiūlytą požymių sistemą buvo gauti geresni atpažinimo rezultatai, nei panaudojus standartinius požymius (MSKK). Gautas lygių klaidų lygis, panaudojant pasiūlytą požymių... [toliau žr. visą tekstą]
Chan, Chit-man. "Speaker-independent recognition of Putonghua finals /." [Hong Kong : University of Hong Kong], 1987. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12363091.
Du Toit, Ilze. "Non-acoustic speaker recognition." Thesis, Stellenbosch: University of Stellenbosch, 2004. http://hdl.handle.net/10019.1/16315.
Full textENGLISH ABSTRACT: In this study the phoneme labels derived from a phoneme recogniser are used for phonetic speaker recognition. The time-dependencies among phonemes are modelled by using hidden Markov models (HMMs) for the speaker models. Experiments are done using firstorder and second-order HMMs and various smoothing techniques are examined to address the problem of data scarcity. The use of word labels for lexical speaker recognition is also investigated. Single word frequencies are counted and the use of various word selections as feature sets are investigated. During April 2004, the University of Stellenbosch, in collaboration with Spescom DataVoice, participated in an international speaker verification competition presented by the National Institute of Standards and Technology (NIST). The University of Stellenbosch submitted phonetic and lexical (non-acoustic) speaker recognition systems and a fused system (the primary system) that fuses the acoustic system of Spescom DataVoice with the non-acoustic systems of the University of Stellenbosch. The results were evaluated by means of a cost model. Based on the cost model, the primary system obtained second and third position in the two categories that were submitted.
Chan, Chit-man, and 陳哲民. "Speaker-independent recognition of Putonghua finals." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1987. http://hub.hku.hk/bib/B12363091.
Wark, Timothy J. "Multi-modal speech processing for automatic speaker recognition." Thesis, Queensland University of Technology, 2001.
Yin, Shou-Chun, 1980. "Speaker adaptation in joint factor analysis based text independent speaker verification." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.
Tran, Michael. "An approach to a robust speaker recognition system." Diss., This resource online, 1994. http://scholar.lib.vt.edu/theses/available/etd-06062008-164814/.
Wu, Jian, and 武健. "Discriminative speaker adaptation and environmental robustness in automatic speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31246138.
Slomka, Stefan. "Multiple classifier structures for automatic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 1999.
Elenius, Daniel. "Accounting for Individual Speaker Properties in Automatic Speech Recognition." Licentiate thesis, KTH, Speech Communication and Technology, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-12258.
Full textIn this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrade classification per- formance. In this work, a child exemplifies a speaker not represented in training data and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Follow- ing this approach, a model suitable for a target speaker, not well represented in the training data, is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by operating also on sub-utterance units, such as phonemes, phone-realizations, sub-phone realizations and sound frames.
Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children's speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the cause of errors, the body height of the child was shown to be correlated with word error rate.
A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass. The accuracy is similar to that of the standard technique.
A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique.
The techniques devised and evaluated can also be applied to other speaker characteristic properties, which exhibit a dynamic nature.
An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system, instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too few to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database.
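For readers unfamiliar with the vocal tract length normalisation (VTLN) discussed in the abstract above, the sketch below shows one common piecewise-linear frequency-warping formulation; the warp factor and cut-off ratio are illustrative assumptions and are not taken from the thesis.

    def warp_frequency(f_hz: float, alpha: float, f_nyquist: float = 8000.0,
                       cut_ratio: float = 0.85) -> float:
        # Piecewise-linear VTLN warp: scale frequencies by alpha up to a cut-off,
        # then interpolate so that the Nyquist frequency maps onto itself.
        f_cut = cut_ratio * f_nyquist
        if f_hz <= f_cut:
            return alpha * f_hz
        slope = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
        return alpha * f_cut + slope * (f_hz - f_cut)

    # A warp factor above 1.0 roughly mimics a shorter vocal tract (e.g. a child).
    for f in (500.0, 2000.0, 7000.0):
        print(f, "->", round(warp_frequency(f, alpha=1.1), 1))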
Akita, Yuya. "Automatic speaker indexing and speech recognition for panel discussions." 京都大学 (Kyoto University), 2005. http://hdl.handle.net/2433/144802.
Galler, Michael. "Improving phoneme models for speaker-independent automatic speech recognition." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=56977.
Adami, André Gustavo. "Modeling prosodic differences for speaker and language recognition." Full text open access, 2004. http://content.ohsu.edu/u?/etd,19.
Mandal, Arindam. "Transformation sharing strategies for MLLR speaker adaptation." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6087.
Full textHazen, Timothy J. (Timothy James) 1969. "The use of speaker correlation information for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/49989.
Includes bibliographical references (p. 171-179).
Ramirez, Jose Luis. "Effects of clipping distortion on an Automatic Speaker Recognition system." Thesis, University of Colorado at Denver, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10112619.
Clipping distortion is a common problem in the audio recording world, in which an audio signal is recorded at a higher amplitude than the recording system's limits, resulting in a portion of the acoustic event not being recorded. Several government agencies employ Automatic Speaker Recognition (ASR) systems in order to identify the speaker of an acquired recording. This is done automatically, using an unbiased approach, by running a questioned recording through an ASR system and comparing it to a pre-existing database of voice samples for which the speakers are known. A matched speaker is indicated by a high likelihood score between the questioned recording and one of the recordings in the known database. It is possible that, during the process of making the questioned recording, the speaker was speaking too loudly into the recording device, a gain setting was set too high, or post-processing was applied, to the point that clipping distortion is introduced into the recording. Clipping distortion results from the amplitude of an audio signal surpassing the maximum sampling value of the recording system; it affects the quantized audio signal by truncating peaks at the maximum value rather than at the actual amplitude of the input signal. In theory, clipping distortion will negatively affect the likelihood ratios between two compared recordings of the same speaker, and this thesis tests that hypothesis. Currently there is no research that serves as a guideline on the limitations of using clipped recordings. This thesis investigates to what degree clipped material affects the performance of a Forensic Automatic Speaker Recognition system.
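The truncation mechanism described in the abstract above can be illustrated with a short Python sketch; the sine-wave signal, sampling rate and clipping limit are arbitrary assumptions chosen only for demonstration.

    import numpy as np

    def hard_clip(signal: np.ndarray, limit: float) -> np.ndarray:
        # Samples whose magnitude exceeds the system's maximum value are truncated
        # to that value, flattening the waveform peaks.
        return np.clip(signal, -limit, limit)

    t = np.linspace(0.0, 0.01, 160, endpoint=False)       # 10 ms at 16 kHz
    clean = 1.5 * np.sin(2.0 * np.pi * 440.0 * t)          # peaks exceed full scale
    clipped = hard_clip(clean, limit=1.0)
    share_clipped = np.mean(np.abs(clipped) >= 1.0)
    print(f"{share_clipped:.0%} of samples sit at the clipping ceiling")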
Mathan, Luc Stefan. "Speaker-independent access to a large lexicon." Thesis, McGill University, 1987. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=63773.
Full textVipperla, Ravichander. "Automatic Speech Recognition for ageing voices." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5725.
Full textBates, Rebecca Anne. "Speaker dynamics as a source of pronunciation variability for continuous speech recognition models /." Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/5858.
Full textMcLaren, Mitchell Leigh. "Improving automatic speaker verification using SVM techniques." Thesis, Queensland University of Technology, 2009. https://eprints.qut.edu.au/32063/1/Mitchell_McLaren_Thesis.pdf.
Full textStokes-Rees, Ian James. "A Study of the Automatic Speech Recognition Process and Speaker Adaptation." Thesis, University of Waterloo, 2000. http://hdl.handle.net/10012/840.
Full textStokes-Rees, Ian. "A study of the automatic speech recognition process and speaker adaptation." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0018/MQ56683.pdf.
Full textFinan, Robert Andrew. "Towards the use of sub-band processing in automatic speaker recognition." Thesis, University of Abertay Dundee, 1998. http://eprints.soton.ac.uk/256266/.
Full textBrummer, Niko. "Measuring, refining and calibrating speaker and language information extracted from speech." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/5139.
ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications having different cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) a non-strict scoring rule, directly representing error-rate at a given prior; (ii) the strict logarithmic scoring rule, which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria which, by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizers' likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers.
We show how to represent, measure, calibrate and optimise the uncertainty in the output of automatic speaker recognition and language recognition systems. This makes the existing technology more accurate, more efficient and more generally applicable.
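The strict logarithmic scoring rule referred to in the abstract above is often summarised in this literature as the log-likelihood-ratio cost (Cllr); a minimal sketch of that criterion follows, assuming natural-log likelihood ratios and using made-up trial scores.

    import math

    def cllr(target_llrs, nontarget_llrs):
        # Average logarithmic score (in bits) of calibrated log-likelihood ratios,
        # with target and non-target trials weighted equally.
        c_tar = sum(math.log2(1.0 + math.exp(-llr)) for llr in target_llrs) / len(target_llrs)
        c_non = sum(math.log2(1.0 + math.exp(llr)) for llr in nontarget_llrs) / len(nontarget_llrs)
        return 0.5 * (c_tar + c_non)

    # Hypothetical trial scores: a well-calibrated system yields a value well below 1 bit.
    print(round(cllr(target_llrs=[2.3, 1.1, 3.0], nontarget_llrs=[-1.8, -2.5, -0.4]), 3))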
Chan, Siu Man. "Improved speaker verification with discrimination power weighting /." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202004%20CHANS.
Full textIncludes bibliographical references (leaves 86-93). Also available in electronic version. Access restricted to campus users.
Garau, Giulia. "Speaker normalisation for large vocabulary multiparty conversational speech recognition." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/3983.
Full textCastellano, Pierre John. "Speaker recognition modelling with artificial neural networks." Thesis, Queensland University of Technology, 1997.
Find full textMalyska, Nicolas 1977. "Analysis of nonmodal glottal event patterns with application to automatic speaker recognition." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43804.
Full textIncludes bibliographical references (p. 211-215).
Regions of phonation exhibiting nonmodal characteristics are likely to contain information about speaker identity, language, dialect, and vocal-fold health. As a basis for testing such dependencies, we develop a representation of patterns in the relative timing and height of nonmodal glottal pulses. To extract the timing and height of candidate pulses, we investigate a variety of inverse-filtering schemes, including maximum-entropy deconvolution, which minimizes the predictability of a signal, and minimum-entropy deconvolution, which maximizes pulse-likeness. Hybrid formulations of these methods are also considered. We then derive a theoretical framework for understanding frequency- and time-domain properties of a pulse sequence, a process that sheds light on the transformation of nonmodal pulse trains into useful parameters. In the frequency domain, we introduce the first comprehensive mathematical derivation of the effect of deterministic and stochastic source perturbation on the short-time spectrum. We also propose a pitch representation of nonmodality that provides an alternative viewpoint on the frequency content that does not rely on Fourier bases. In developing time-domain properties, we use projected low-dimensional histograms of feature vectors derived from pulse timing and height parameters. For these features, we have found clusters of distinct pulse patterns, reflecting a wide variety of glottal-pulse phenomena including near-modal phonation, shimmer and jitter, diplophonia and triplophonia, and aperiodicity. Using temporal relationships between successive feature vectors, an algorithm by which to separate these different classes of glottal-pulse characteristics has also been developed.
We have used our glottal-pulse-pattern representation to automatically test for one signal dependency: speaker dependence of glottal-pulse sequences. This choice is motivated by differences observed between talkers in our separated feature space. Using an automatic speaker verification experiment, we investigate tradeoffs in speaker dependency for short-time pulse patterns, reflecting local irregularity, as well as long-time patterns related to higher-level cyclic variations. Results, using speakers with a broad array of modal and nonmodal behaviors, indicate high accuracy in speaker recognition performance, complementary to the use of conventional mel-cepstral features. These results suggest that there is rich structure to the source excitation that provides information about a particular speaker's identity.
Marchetto, Enrico. "Automatic Speaker Recognition and Characterization by means of Robust Vocal Source Features." Doctoral thesis, Università degli studi di Padova, 2011. http://hdl.handle.net/11577/3427390.
Automatic Speaker Recognition is a broad research field that encompasses many topics: signal processing, the physiology of the voice and of the auditory system, statistical modelling tools, the study of language, and more. The study of these techniques began about thirty years ago, and great progress has been made since then. Nevertheless, the field continues to raise open questions, and research groups around the world keep working towards more reliable and better-performing recognition systems. This thesis documents a Philosophiae Doctor project funded by the private company RT - Radio Trevisan Elettronica Industriale S.p.A. The title of the grant is "Automatic speaker recognition with applications to security and intelligence". Part of the work took place during a six-month visit to the Speech, Music and Hearing Department of KTH - Royal Institute of Technology, Stockholm. Speaker Recognition research develops technologies for automatically matching a given human voice to a previously recorded version of it. Speaker Recognition is usually better defined in terms of Speaker Verification or Speaker Identification. Identification consists of retrieving the identity of a voice from among a (possibly large) number of voices modelled by the system; in Verification, given a voice and an identity, the system is asked to verify the association between the two. Recognition systems also produce a score that expresses the reliability of the answer they provide. The first part of the thesis reviews the state of the art in Speaker Recognition. The main components of a recognition prototype are described: extraction of audio features, statistical modelling, and performance evaluation. Over time, the research community has developed a variety of acoustic features: techniques for describing the speech signal numerically in a compact and deterministic way. In any recognition application, including speech or language recognition, feature extraction is the first step: its purpose is to drastically reduce the size of the input data without losing any significant information. The choice of the features best suited to a specific application, and their tuning, are crucial for good recognition results; moreover, the definition of new features remains an active research field, because the scientific community believes that existing features are still far from exploiting all the information carried by the speech signal. Some features have established themselves over time thanks to their superior performance: Mel-Frequency Cepstral Coefficients and Linear Prediction Coefficients; these features are described in Part I. Statistical modelling is also introduced, explaining the structure of Gaussian Mixture Models and the corresponding training algorithm (Expectation-Maximization). Specific modelling techniques, such as the Universal Background Model, complete the description of the statistical tools used for recognition.
Scoring is, finally, the stage in which the recognition system produces its results; it includes several normalisation procedures that compensate, for example, for modelling problems or for the different acoustic conditions under which the audio data were recorded. Part I then presents some audio databases commonly used in the literature as benchmarks for comparing the performance of recognition systems; in particular, TIMIT and the NIST Speaker Recognition Evaluation (SRE) 2004 are presented. These databases are suitable for evaluating performance on telephone audio, which is of interest for this thesis; this topic is discussed further in Part II. During the PhD project a prototype recognition system was designed and implemented, and it is discussed in Part II. The first chapter describes the proposed recognition application: Speaker Recognition technology applied to telephone lines, with reference to security and intelligence. The application answers a specific need of the authorities when investigations involve telephone interception. In such cases the authorities must listen to large amounts of telephone data, most of which turns out to be useless for the investigation. The application idea consists of automatically identifying and labelling the speakers present in the intercepted calls, thereby allowing the search for a specific speaker within a collection of recordings. This could reduce wasted time, with corresponding economic benefits. Audio from telephone lines poses difficulties for automatic recognition, because it significantly degrades the signal and therefore worsens performance. Several problems of telephone audio are generally recognised: reduced bandwidth, additive noise and convolutive noise; the latter causes phase distortion, which alters the waveform of the signal. The second chapter of Part II describes the developed Speaker Recognition system in detail and discusses the various design choices. The fundamental components of a recognition system were developed, with some improvements to contain the computational load. During development, the research purpose of the software was considered paramount: much effort was devoted to obtaining a system with good performance that nevertheless remained easy to modify, even in depth. The need (and opportunity) to assess the performance of the prototype imposed further development requirements, which were met by adopting an interface common to the various databases. Finally, all the modules of the developed software can be run on a computing cluster (a high-performance machine for parallel computation); this characteristic of the prototype was crucial for allowing a thorough evaluation of the software within a reasonable time. During the work carried out for the doctoral project, studies related to Speaker Recognition, though not directly tied to it, were also conducted. These developments are described in Part II as extensions of the prototype. First, a Voice Activity Detector suitable for use in the presence of noise is presented.
This component is particularly important as the first step of feature extraction: only the audio segments that actually contain speech should be selected and retained. In situations with significant background noise, simple energy-threshold approaches fail. The detector that was implemented is based on advanced features obtained through wavelet transforms and further processed by adaptive thresholding. A second application is a prototype for Speaker Diarization, i.e. the automatic labelling of audio recordings containing several speakers. The result of the procedure is a segmentation of the audio and a series of labels, one for each segment; the system provides an answer of the type "who speaks when". The third and last study alongside Speaker Recognition is the development of a Noise Reduction system on a dedicated DSP hardware platform. The reduction algorithm detects the noise adaptively and reduces it, trying to retain only the speech signal; processing takes place in real time, while using only a very limited part of the DSP's computing resources. Part III of the thesis, finally, introduces novel audio features, which constitute the main original contribution of the thesis. These features are obtained from the glottal flow, so the first chapter of that part discusses the anatomy of the vocal tract and of the vocal folds. The working principle of phonation and the importance of vocal-fold physics are described. The glottal flow is an input to the vocal tract, which acts as a filter. An open-source software tool for vocal-tract inversion is described: it allows the glottal flow to be estimated from plain voice recordings. Some of the methods used to characterise the glottal flow numerically are then presented. The next chapter presents the definition of the new glottal features. Glottal flow estimates are not always reliable, so the first step of the feature extraction identifies and discards the flows judged untrustworthy. A numerical procedure then groups and orders the flow estimates, preparing them for statistical modelling. The glottal features, applied to Speaker Recognition on the TIMIT and NIST SRE 2004 databases, are compared with the standard features. The final chapter of Part III is devoted to a different piece of research, still related to the characterisation of the glottal flow. A physical model of the vocal folds, controlled by a set of numerical rules, is presented, capable of describing the dynamics of the folds themselves. The rules make it possible to translate a specific setting of the glottal muscles into the mechanical parameters of the model, which lead to a precise glottal flow (obtained after a computer simulation of the model). The so-called inverse problem is defined as follows: given a glottal flow, find a setting of the glottal muscles that, used to drive the physical model, allows the re-synthesis of a glottal signal as similar as possible to the given one. The inverse problem entails a number of difficulties, such as the non-uniqueness of the inversion and the sensitivity to even small variations of the input flow.
A control-optimisation technique was developed and is described. The concluding chapter of the thesis summarises the results obtained. Alongside this discussion, a plan for the further development of the proposed features is presented. Finally, the resulting publications are listed.
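As background for the GMM-UBM framework summarised in the first part of the abstract above, here is a minimal, hypothetical sketch of verification scoring with Gaussian mixture models using scikit-learn; the feature matrices are synthetic stand-ins for MFCC frames, and a real system would typically derive the speaker model from the UBM by MAP adaptation rather than train it independently.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    ubm_feats = rng.normal(size=(2000, 13))          # stand-in for pooled background data
    speaker_feats = ubm_feats[:400] + 0.5            # stand-in for one speaker's enrolment data
    test_feats = speaker_feats[:100] + rng.normal(scale=0.1, size=(100, 13))

    # Universal background model and speaker model (diagonal-covariance GMMs).
    ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(ubm_feats)
    spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(speaker_feats)

    # Verification score: average per-frame log-likelihood ratio of speaker model vs. UBM.
    llr = spk.score(test_feats) - ubm.score(test_feats)
    print(f"average log-likelihood ratio: {llr:.2f}")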
Li, Wei. "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." Click to view the E-thesis via HKUTO, 2007. http://sunzi.lib.hku.hk/hkuto/record/B39634073.
Full textLi, Wei, and 李威. "A study of an active approach to speaker and task adaptation based on automatic analysis of vocabulary confusability." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2007. http://hub.hku.hk/bib/B39634073.
Full textBaghdasaryan, Areg Gagik. "Automatic Phoneme Recognition with Segmental Hidden Markov Models." Thesis, Virginia Tech, 2010. http://hdl.handle.net/10919/31182.
Keyvani, Alireza. "Robustness in ASR : an experimental study of the interrelationship between discriminant feature-space transformation, speaker normalization and environment compensation." Thesis, McGill University, 2007. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=99772.
Firstly, given that the performance of speaker normalization techniques degrades in the presence of noise, it is shown that reducing the effects of noise through environmental compensation, prior to speaker normalization, leads to substantial improvements in ASR performance. The speaker normalization techniques considered here were vocal tract length normalization (VTLN) and the augmented state-space acoustic decoder (MATE). Secondly, given that discriminant feature-space transformations (DFT) are known to increase class separation, it is shown that performing speaker normalization using VTLN in a discriminant feature-space leads to improvements in the performance of this technique. Classes, in our experiments, corresponded to HMM states. Thirdly, an effort was made to achieve higher class discrimination by normalizing the speech data used to estimate the discriminant feature-space transform. Normalization, in our experiments, corresponded to reducing the variability within each class through the use of environment compensation and speaker normalization. Significant ASR performance improvements were obtained when normalization was performed using environment compensation, while our results were inconclusive for the case where normalization consisted of speaker normalization. Finally, aimed at increasing its noise robustness, a simple modification of MATE is presented. This modification consisted of using, during recognition, knowledge of the distribution of warping factors selected by MATE during training.
Campanelli, Michael R. "Computer classification of stop consonants in a speaker independent continuous speech environment /." Online version of thesis, 1991. http://hdl.handle.net/1850/11051.
Full textCilliers, Francois Dirk. "Tree-based Gaussian mixture models for speaker verification." Thesis, Link to the online version, 2005. http://hdl.handle.net/10019.1/1639.
Full textAlsharhan, Iman. "Exploiting phonological constraints and automatic identification of speaker classes for Arabic speech recognition." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/exploiting-phonologicalconstraints-and-automaticidentification-of-speakerclasses-for-arabic-speechrecognition(8d443cae-e9e4-4f40-8884-99e2a01df8e9).html.
Caon, D. R. S. "Automatic speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing." Universidade Federal do Espírito Santo, 2010. http://repositorio.ufes.br/handle/10/4229.
Throughout the work, the large-vocabulary continuous speech recognition system Julius is used together with the Hidden Markov Model Toolkit (HTK). The main features of the Julius system are described, and the system itself was modified. First, the theory of speech signal recognition is presented. Experiments are carried out with adaptation of hidden Markov models and with the K-fold cross-validation technique. Speech recognition results after acoustic adaptation to a specific speaker (and after the creation of language models specific to a system demonstration scenario) showed an 86.39% sentence accuracy rate for the Dutch acoustic models. The same data show a 94.44% semantic sentence accuracy rate.
Caon, Daniel Régis Sarmento. "Automatic speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing." Universidade Federal do Espírito Santo, 2010. http://repositorio.ufes.br/handle/10/6390.
This work aims to provide automatic cognitive assistance via a speech interface to elderly people who live alone and are at risk. Distress expressions and voice commands are part of the target vocabulary for speech recognition. Throughout the work, the large vocabulary continuous speech recognition system Julius is used in conjunction with the Hidden Markov Model Toolkit (HTK). The main features of the Julius system are described, including the modifications made to it. These modifications are part of the contribution of this work, including the detection of distress expressions (speech situations that suggest an emergency). Four different languages were targeted for recognition: French, Dutch, Spanish and English. In this same sequence of languages (determined by data availability and the locations of the system-integration scenarios), theoretical studies and experiments were conducted to address the needs of each new configuration. This work includes studies of the French and Dutch languages. Initial experiments (in French) were made with adaptation of hidden Markov models and were analyzed by cross-validation. In order to perform a new demonstration in Dutch, acoustic and language models were built and the system was integrated with other auxiliary modules (such as a voice activity detector and the dialogue system). Results of speech recognition after acoustic adaptation to a specific speaker (and the creation of language models for a specific scenario to demonstrate the system) showed an 86.39% sentence accuracy rate for the Dutch acoustic models. The same data show a 94.44% semantic sentence accuracy rate.
Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.
Gabriel, Naveen. "Automatic Speech Recognition in Somali." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166216.
Patino Villar, José María. "Efficient speaker diarization and low-latency speaker spotting." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS003.
Speaker diarization (SD) involves the detection of the speakers within an audio stream and of the intervals during which each speaker is active, i.e. the determination of 'who spoke when'. The first part of the work presented in this thesis exploits an approach to speaker modelling involving binary keys (BKs) as a solution to SD. BK modelling is efficient and requires no external training data, since it operates using the test data alone. The presented contributions include the extraction of BKs based on multi-resolution spectral analysis, the explicit detection of speaker changes using BKs, as well as SD fusion techniques that combine the benefits of both BK and deep learning based solutions. The SD task is closely linked to that of speaker recognition or detection, which involves the comparison of two speech segments and the determination of whether or not they were uttered by the same speaker. Even though many practical applications require their combination, the two tasks are traditionally tackled independently of each other. The second part of this thesis considers an application where SD and speaker recognition solutions are brought together. The new task, coined low-latency speaker spotting (LLSS), involves the rapid detection of known speakers within multi-speaker audio streams. It involves the rethinking of online diarization and of the manner in which diarization and detection sub-systems are best combined.
He, Xiaodong. "Model selection based speaker adaptation and its application to nonnative speech recognition /." free to MU campus, to others for purchase, 2003. http://wwwlib.umi.com/cr/mo/fullcit?p3115555.
Full textIshizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.
Full textAlamri, Safi S. "Text-independent, automatic speaker recognition system evaluation with males speaking both Arabic and English." Thesis, University of Colorado at Denver, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1605087.
Automatic speaker recognition is an important key to speaker identification in media forensics, and as cultures increasingly mix, there is an increase in bilingual speakers all around the world. The purpose of this thesis is to compare text-independent samples of one person speaking two different languages, Arabic and English, against a single-language reference population. The hope is to begin a design that may be useful in further developing software that can perform accurate text-independent ASR for bilingual speakers speaking either language against a single-language reference population. This thesis took an Arabic model sample and compared it against samples that were both Arabic and English, using an Arabic reference population, all collected from videos downloaded from the Internet. All of the samples were text-independent and enhanced for optimal performance. The data were run through biometric software called BATVOX 4.1, which utilizes the MFCC and GMM methods of speaker recognition and identification. The result of testing through BATVOX 4.1 was a likelihood ratio for each sample; these were evaluated for similarities and differences, trends, and problems that had occurred.