Dissertations / Theses on the topic 'Speaker recognition'

Consult the top 50 dissertations / theses for your research on the topic 'Speaker recognition.'

1

Chatzaras, Anargyros, and Georgios Savvidis. "Seamless speaker recognition." Thesis, KTH, Radio Systems Laboratory (RS Lab), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-159021.

Abstract:
In a technologically advanced society, the average person manages dozens of accounts for e-mail, social networks, e-banking, and other electronic services. As the number of these accounts increases, the need for automatic user identification becomes more essential. Biometrics have long been used to identify people and are the most common (if not the only) method to achieve this task. Over the past few years, smartphones have become frequently used gadgets. These devices have built-in microphones and are commonly used by a single user or a small set of users, such as a couple or a family. This thesis uses a smartphone's microphone to capture a user's speech and identify him or her. Existing speaker recognition systems typically prompt the user to provide long voice samples in order to produce accurate results. This makes for a poor user experience and discourages users who do not have the patience to go through such a process. The main idea behind the speaker recognition approach presented in this thesis is to provide a seamless user experience where the recording of the user's voice takes place in the background. An Android application was developed which silently collects voice samples and performs speaker recognition without requiring extensive user interaction. Two variants of the proposed tool have been developed and are described in depth in this thesis. The open source framework Recognito is used to perform the speaker recognition task. The analysis of Recognito showed that it is not capable of achieving high accuracy, especially when the voice samples contain background noise. Finally, the comparison between the two architectures showed that they do not differ significantly in terms of performance.
2

VASILAKAKIS, VASILEIOS. "Forensic speaker recognition: speaker and height estimation techniques." Doctoral thesis, Politecnico di Torino, 2014. http://hdl.handle.net/11583/2551370.

Abstract:
In this work, we analyse some techniques used to perform speaker verification, explaining the steps from feature extraction to the mathematical models used for speaker characterisation and discriminative modelling. The main contributions of the author are a modification of the i-vector generation process, making it either faster or less memory-demanding; a novel way to perform speaker verification by use of the Pairwise Support Vector Machine; and a new way to perform speaker characterisation by means of deep belief networks. Apart from these contributions, additional work in automatic speech-based height estimation is presented, including a baseline model which is then improved by the use of a mixture-of-experts neural network.
3

Kamarauskas, Juozas. "Speaker recognition by voice." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090615_093847-20773.

Abstract:
Questions of speaker recognition by voice are investigated in this dissertation. Speaker recognition systems, their evolution, problems of recognition, feature systems, and the speaker modelling and matching methods used in text-independent and text-dependent speaker recognition are also considered. A text-independent speaker recognition system was developed during this work. The Gaussian mixture model approach was used for speaker modelling and pattern matching. An automatic method for voice activity detection was proposed. This method is fast and does not require any additional actions from the user, such as indicating examples of the speech signal and noise. A feature system was proposed that consists of excitation source (glottal) parameters and vocal tract parameters. The fundamental frequency was taken as the excitation source parameter, and four formants together with three antiformants were taken as the vocal tract parameters. In order to equate the dispersions of the formants and antiformants, we propose to compute them on the mel-frequency scale. The standard mel-frequency cepstral coefficients (MFCC), the baseline features in speech and speaker recognition, were also implemented in the recognition system for comparison of the results. The speaker recognition experiments showed that the proposed feature system outperformed standard mel-frequency cepstral coefficients. The equal error rate (EER) was 5.17% using the proposed... [to full text]
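
Several entries in this list, including the one above, build on the standard recipe of one Gaussian mixture model (GMM) per speaker scored on cepstral frames, so a minimal closed-set identification sketch may help; it assumes feature matrices are already extracted, and all names are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=16):
    """Fit one diagonal-covariance GMM per speaker on that speaker's frames.
    features_by_speaker: dict of speaker id -> (n_frames, n_dims) array."""
    models = {}
    for speaker, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        models[speaker] = gmm.fit(feats)
    return models

def identify(models, test_feats):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```

GaussianMixture.score returns the mean per-frame log-likelihood, so utterances of different lengths remain comparable.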
4

Du, Toit Ilze. "Non-acoustic speaker recognition." Thesis, Stellenbosch : University of Stellenbosch, 2004. http://hdl.handle.net/10019.1/16315.

Abstract:
Thesis (MScIng)--University of Stellenbosch, 2004.
In this study the phoneme labels derived from a phoneme recogniser are used for phonetic speaker recognition. The time-dependencies among phonemes are modelled by using hidden Markov models (HMMs) for the speaker models. Experiments are done using first-order and second-order HMMs, and various smoothing techniques are examined to address the problem of data scarcity. The use of word labels for lexical speaker recognition is also investigated. Single-word frequencies are counted, and the use of various word selections as feature sets is investigated. During April 2004, the University of Stellenbosch, in collaboration with Spescom DataVoice, participated in an international speaker verification competition presented by the National Institute of Standards and Technology (NIST). The University of Stellenbosch submitted phonetic and lexical (non-acoustic) speaker recognition systems and a fused system (the primary system) that fuses the acoustic system of Spescom DataVoice with the non-acoustic systems of the University of Stellenbosch. The results were evaluated by means of a cost model. Based on the cost model, the primary system obtained second and third position in the two categories that were submitted.
5

Hong, Z. (Zimeng). "Speaker gender recognition system." Master's thesis, University of Oulu, 2017. http://jultika.oulu.fi/Record/nbnfioulu-201706082645.

Abstract:
Automatic gender recognition through speech is one of the fundamental mechanisms in human-machine interaction. Typical application areas of this technology range from gender-targeted advertising to gender-specific IoT (Internet of Things) applications. It can also be used to narrow down the scope of investigations in crime scenarios. There are many possible methods of recognizing the gender of a speaker. In machine learning applications, the first step is to acquire and convert the natural human voice into a form of machine-understandable signal. Useful voice features can then be extracted, labelled with gender information, and used to train a machine. After that, new input voice can be captured and processed, and the machine is able to extract the features by pattern modelling. In this thesis, a real-time speaker gender recognition system was designed within the Matlab environment. This system can automatically identify the gender of a speaker by voice. The implementation utilized voice processing and feature extraction techniques to deal with input speech coming from a microphone or a recorded speech file. The response features are extracted and classified, and a machine learning classification method (the Naïve Bayes classifier) is used to distinguish the gender features. The recognition result with gender information is then displayed. The evaluation of the speaker gender recognition system was done in an experiment with 40 participants (half male and half female) in a fairly small room. The experiment recorded 400 speech samples by speakers from 16 countries in 17 languages. These 400 speech samples were tested by the gender recognition system and showed considerably good performance, with only 29 recognition errors (92.75% accuracy). In comparison, most previous speaker gender recognition systems obtained accuracies of no more than 90%, and only one obtained 100% accuracy with very limited testers. We can then conclude that the performance of the speaker gender recognition system designed in this thesis is reliable.
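
The classification stage named above (a Naïve Bayes classifier over extracted voice features) reduces to a few lines; this sketch uses Python/scikit-learn rather than the thesis's Matlab, and the feature layout is an assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# X: one row of acoustic features per utterance (e.g. pitch and spectral
# statistics); y: gender labels, 0 = male, 1 = female (layout illustrative).
def train_gender_classifier(X, y):
    clf = GaussianNB()          # fits one Gaussian per feature and class
    clf.fit(X, y)
    return clf

def predict_gender(clf, feature_vector):
    label = clf.predict(np.asarray(feature_vector).reshape(1, -1))[0]
    return "female" if label == 1 else "male"
```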
6

Al-Kilani, Menia. "Voice-signature-based Speaker Recognition." University of the Western Cape, 2017. http://hdl.handle.net/11394/5888.

Abstract:
Magister Scientiae - MSc (Computer Science)
Personal identification and the protection of data are important issues because of the ubiquity of computing, and these have thus become interesting areas of research in the field of computer science. Previously people have used a variety of ways to identify an individual and protect themselves, their property and their information, mostly by means of locks, passwords, smartcards and biometrics. Verifying individuals by using their physical or behavioural features is more secure than using other data such as passwords or smartcards, because everyone has unique features which distinguish him or her from others, and the biometrics of a person are difficult to imitate or steal. Biometric technologies represent a significant component of a comprehensive digital identity solution and play an important role in security. The technologies that support identification and authentication of individuals are based on either their physiological or their behavioural characteristics. Live data, in this instance the human voice, is the topic of this research. The aim is to recognize a person's voice and to identify the user by verifying that his/her voice matches a record of his/her voice-signature in a system database. To address the main research question, "What is the best way to identify a person by his/her voice signature?", design science research was employed; this methodology is used to develop an artefact for solving a problem. Initially a pilot study was conducted using visual representations of voice signatures, to check whether it is possible to identify speakers without using feature extraction or matching methods. Subsequently, experiments were conducted with 6300 data sets derived from the Texas Instruments and Massachusetts Institute of Technology audio database. Two methods of feature extraction were considered, mel-frequency cepstrum coefficients and linear prediction cepstral coefficients, and the Support Vector Machines method was used for classification. The methods were compared in terms of their effectiveness, and it was found that the system using mel-frequency cepstrum coefficients for feature extraction gave marginally better results for speaker recognition.
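
For the MFCC-plus-SVM pipeline this abstract describes, a minimal sketch looks as follows, assuming one pre-computed feature vector per utterance (names and hyperparameters are illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_utterances, n_features) MFCC or LPCC statistics; y: speaker labels.
def train_svm_speaker_classifier(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(X, y)
    return clf

# usage: speaker = clf.predict(test_vector.reshape(1, -1))[0]
```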
7

Oglesby, J. "Neural models for speaker recognition." Thesis, Swansea University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.638359.

Abstract:
In recent years a resurgence of interest in neural modelling has taken place. This thesis examines one such class of models applied to the task of speaker recognition, with direct comparisons made to a contemporary approach based on vector quantisation (VQ). Speaker recognition systems in general, including feature representations and distance measures, are reviewed. The VQ approach, used for comparisons throughout the experimental work, is described in detail. Currently popular neural architectures are also reviewed and associated gradient-based training procedures examined. The performance of a VQ speaker identification system is determined experimentally for a range of popular speech features, using codebooks of varying sizes. Perceptually based cepstral features are found to outperform both standard LPC and filterbank representations. New approaches to speaker recognition based on multilayer perceptrons (MLP) and a variant using radial basis functions (RBF) are proposed and examined. To facilitate the research in terms of computational requirements, a novel parallel training algorithm is proposed, which dynamically schedules the computational load amongst the available processors. This is shown to give close to linear speed-up on typical training tasks for up to fifty transputers. A transputer-based processing module with appropriate speech capture and synthesis facilities is also developed. For the identification task the MLP approach is found to give approximately the same performance as equivalent-sized VQ codebooks. The MLP approach is slightly better for smaller models; however, for larger models the VQ approach gives marginally superior results. MLP and RBF models are investigated for speaker verification. Both techniques significantly outperform the VQ approach, giving 29.5% (MLP) and 21.5% (RBF) true-talker rejections for a fixed 2% impostor acceptance rate, compared to 34.5% for the VQ approach. These figures relate to single-digit test utterances. Extending the duration of the test utterance is found to significantly improve performance across all techniques. The best overall performance is obtained from RBF models: five-digit utterances achieve around 2.5% true-talker rejections for a fixed 2% impostor acceptance rate.
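
The VQ baseline used for comparison throughout the thesis quantises each speaker's training frames into a codebook and scores a test utterance by its average quantisation distortion; a minimal sketch, with k-means standing in for codebook training and all names illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_by_speaker, codebook_size=64):
    """One VQ codebook (a set of centroid codewords) per enrolled speaker."""
    return {spk: KMeans(n_clusters=codebook_size, n_init=4).fit(f).cluster_centers_
            for spk, f in features_by_speaker.items()}

def vq_distortion(codebook, feats):
    """Mean distance from each frame to its nearest codeword."""
    d = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(codebooks, feats):
    return min(codebooks, key=lambda spk: vq_distortion(codebooks[spk], feats))
```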
8

Thompson, J. "Speech variability in speaker recognition." Thesis, Swansea University, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.639230.

Abstract:
This thesis is concerned with investigating the effects of variability on automatic speaker recognition performance. Both speaker-generated variability and variability of the recording environment are examined. Speaker-generated variability (intra-variation) has received less attention than variability of the recording environment, and is therefore the main focus of this thesis. Of most concern is the intra-variation of data typically found in co-operative speaker recognition tasks, that is, normally spoken speech collected over a period of months. To assess the scale of recognition errors attributed to intra-variation, errors due to noise degradation are considered first. Additive noise can rapidly degrade recognition performance, so for a more realistic assessment a 'state of the art' noise compensation algorithm is also introduced. Comparisons between noise degradation and intra-variation show intra-variation to be a significant source of recognition errors, with intra-variation being the source of most recognition errors at background noise levels of 9 dB SNR or greater. The level of intra-variation and recognition errors is shown to be highly speaker dependent. Analysis of cepstral variation shows intra-variation to correlate more closely with recognition errors than inter-variation. Recognition experiments and analysis of the glottal pulse shape demonstrate that variation between two recording sessions generally increases as the time gap between the recording of the sessions lengthens. Glottal pulse variation is also shown to vary within recording sessions, albeit with less variation than between sessions. Glottal pulse shape variation has been shown by others to vary for highly stressed speech; it is shown here to also vary for normally spoken speech collected under relatively controlled conditions. It is hypothesised that these variations occur, in part, due to the speaker's anxiety during recording, and glottal pulse variation is shown to broadly match the hypothesised anxiety profile. The gradual change of glottal pulse variation demonstrates an underlying reason why incremental speaker adaptation can be used for intra-variation compensation. Experiments show that adaptation can potentially reduce speaker identification error rates from 15% to 2.5%.
9

Mukherjee, Rishiraj. "Speaker Recognition Using Shifted MFCC." Scholar Commons, 2012. http://scholarcommons.usf.edu/etd/4136.

Abstract:
Speaker recognition is the art of recognizing a speaker from a given database using speech as the only input. In this thesis we discuss a novel approach to detecting speakers, introducing the concept of shifted MFCC to improve on the performance of previous work, which at best has shown a decent accuracy of about 95%. We also discuss adding different parameters which contributed to improving the efficiency of speaker recognition, and we test our algorithm on both text-dependent and text-independent speech data. Our technique was evaluated on the TIDIGIT database. In order to further increase the speaker recognition rate at lower FARs, we combined accent information with pitch and higher-order formants. Possible application areas for this work include access control and entry systems, as well as the many smartphones, laptops and operating systems that now offer voice-based access. Furthermore, in homeland security applications, speaker accent will play a critical role in the evaluation of biometric systems, since users will be international in nature; incorporating accent information into the speaker recognition/verification system is thus a key component that our study focused on. The accent incorporation method and the shifted MFCC technique discussed in this work can also be applied to any other speaker recognition system.
10

Mwangi, Elijah. "Speaker independent isolated word recognition." Thesis, Loughborough University, 1987. https://dspace.lboro.ac.uk/2134/15425.

Abstract:
The work presented in this thesis concerns the recognition of isolated words using a pattern matching approach. In such a system, an unknown speech utterance, which is to be identified, is transformed into a pattern of characteristic features. These features are then compared with a set of pre-stored reference patterns that were generated from the vocabulary words. The unknown word is identified as that vocabulary word for which the reference pattern gives the best match. One of the major difficulties in the pattern comparison process is that speech patterns, obtained from the same word, exhibit non-linear temporal fluctuations and thus a high degree of redundancy. The initial part of this thesis considers various dynamic time warping techniques used for normalizing the temporal differences between speech patterns. Redundancy removal methods are also considered, and their effect on the recognition accuracy is assessed. Although the use of dynamic time warping algorithms provides considerable improvement in the accuracy of isolated word recognition schemes, the performance is ultimately limited by their poor ability to discriminate between acoustically similar words. Methods for enhancing the identification rate among acoustically similar words, by using common pattern features for similar-sounding regions, are investigated. Pattern-matching-based, speaker-independent systems can only operate with a high recognition rate by using multiple reference patterns for each of the words included in the vocabulary. These patterns are obtained from the utterances of a group of speakers. The use of multiple reference patterns not only leads to a large increase in the memory requirements of the recognizer, but also an increase in the computational load. A recognition system is proposed in this thesis which overcomes these difficulties by (i) employing vector quantization techniques to reduce the storage of reference patterns, and (ii) eliminating the need for dynamic time warping, which reduces the computational complexity of the system. Finally, a method of identifying the acoustic structure of an utterance in terms of voiced, unvoiced, and silence segments by using fuzzy set theory is proposed. The acoustic structure is then employed to enhance the recognition accuracy of a conventional isolated word recognizer.
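
Dynamic time warping, the alignment step this abstract revolves around, finds a minimum-cost monotonic alignment between two feature sequences; a minimal sketch of the classic O(nm) recurrence, with illustrative names:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def recognise(templates, utterance):
    """Identify the unknown word by the nearest reference template."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))
```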
11

Luettin, Juergen. "Visual speech and speaker recognition." Thesis, University of Sheffield, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.264432.

12

Alkilani, Menia Mohamed. "Voice signature based Speaker Recognition." University of the Western Cape, 2017. http://hdl.handle.net/11394/6196.

Abstract:
Magister Scientiae - MSc (Computer Science)
Personal identification and the protection of data are important issues because of the ubiquity of computing, and these have thus become interesting areas of research in the field of computer science. Previously people have used a variety of ways to identify an individual and protect themselves, their property and their information.
13

Mošner, Ladislav. "Microphone Arrays for Speaker Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363803.

Abstract:
This master's thesis deals with distant speaker recognition. For data captured by a distant microphone, the accuracy of standard recognition drops considerably, so two approaches to improving the results are proposed. The first is the use of a microphone array (a deliberately arranged set of microphones) capable of steering a virtual "beam" towards the speaker's position. The second is adaptation of system components (the PLDA scoring and the i-vector extractor). Using simulation of room conditions, training and test data were synthesised from the standard NIST 2010 dataset. Both techniques, and their combination, are shown to lead to considerable improvements. The joint estimation of speaker identity and position was also investigated: while results for a simulated outdoor environment (without reverberation) are promising, the indoor (reverberant) results are mixed and require further investigation. Finally, a limited amount of real data, obtained by replaying and re-recording utterances in an actual room, could be evaluated with the system; results for male recordings match the simulations, while results for female recordings are inconclusive and require further analysis.
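
The microphone-array side of this work rests on beamforming, i.e. steering a virtual beam by delaying and summing the channels. A minimal far-field delay-and-sum sketch (fractional delays applied in the frequency domain; the geometry and sign conventions are simplified assumptions, not the thesis's implementation):

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """signals: (n_mics, n_samples); mic_positions: (n_mics, 3) in metres;
    direction: unit vector towards the speaker; fs: sample rate in Hz."""
    n_mics, n_samples = signals.shape
    delays = mic_positions @ direction / c      # per-mic arrival offsets, s
    delays -= delays.min()                      # make every delay causal
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(signals[m])
        spectrum *= np.exp(-2j * np.pi * freqs * delays[m])  # fractional delay
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics
```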
14

CUMANI, SANDRO. "Speaker and Language Recognition Techniques." Doctoral thesis, Politecnico di Torino, 2012. http://hdl.handle.net/11583/2496928.

Abstract:
In this work we give an overview of different state-of-the-art speaker and language recognition systems. We analyse some techniques to extract and model features from the acoustic signal and to model the speech content by means of phonetic decoding. We then present state-of-the-art generative systems based on latent variable models and discriminative techniques based on Support Vector Machines. We also present the author's contributions to the field, which cover the different topics presented in this work. First, we propose an improvement to neural network training for speech decoding based on the use of a General Purpose Graphics Processing Unit computational framework. We also propose adaptations of latent variable models developed for speaker recognition to the field of language identification. A novel technique which enhances the generation of low-dimensional utterance representations for speaker verification is also presented. Finally, we give a detailed analysis of different training algorithms for SVM-based speaker verification, and we propose a novel discriminative framework for speaker verification, the Pairwise SVM approach, which allows for fast scoring of test utterances and achieves very good recognition performance.
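
Once utterances are reduced to low-dimensional representations such as i-vectors, the simplest verification back-end is cosine scoring; a minimal sketch (production systems typically add length normalisation, LDA/WCCN or PLDA on top, and the threshold here is purely illustrative):

```python
import numpy as np

def cosine_score(w_enrol, w_test):
    """Cosine similarity between two utterance vectors; higher = same speaker."""
    w1 = w_enrol / np.linalg.norm(w_enrol)
    w2 = w_test / np.linalg.norm(w_test)
    return float(w1 @ w2)

def verify(w_enrol, w_test, threshold=0.3):   # threshold tuned on dev data
    return cosine_score(w_enrol, w_test) >= threshold
```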
15

Ho, Ka-Lung. "Kernel eigenvoice speaker adaptation." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20HOK.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2003.
Includes bibliographical references (leaves 56-61). Also available in electronic version. Access restricted to campus users.
16

Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.

Abstract:
In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image-transformation-based features. A new geometric method for feature extraction is described, as well as the novel use of curvelet-based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring and jitter. A novel method of integrating the audio and video information streams, called the maximum stream posterior (MSP), is described. This method is tested in both speaker-dependent and speaker-independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams given no prior knowledge of the noise. This method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). Experiments using the XM2VTS database show that both of these methods can outperform standard methods in terms of recognition accuracy in situations where either stream is corrupted.
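
Stream integration of the kind described above usually starts from a weighted combination of per-stream log-likelihoods, with the weight driven by an estimate of stream reliability (the MSP/MWSP methods derive it from the stream posteriors themselves); a sketch of just the combination step, with illustrative names:

```python
def fused_log_likelihood(ll_audio, ll_video, w_audio):
    """Linear log-likelihood fusion; w_audio in [0, 1] is the stream weight."""
    return w_audio * ll_audio + (1.0 - w_audio) * ll_video

def decide(ll_audio_by_class, ll_video_by_class, w_audio):
    """Pick the class (word or speaker) with the best fused score."""
    return max(ll_audio_by_class,
               key=lambda c: fused_log_likelihood(ll_audio_by_class[c],
                                                  ll_video_by_class[c],
                                                  w_audio))
```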
17

Neville, Katrina Lee. "Channel Compensation for Speaker Recognition Systems." RMIT University. Electrical and Computer Engineering, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080514.093453.

Abstract:
This thesis attempts to address the problem of how best to remedy different types of channel distortion on speech when that speech is to be used in automatic speaker recognition and verification systems. Automatic speaker recognition is when a person's voice is analysed by a machine and the person's identity is worked out by comparing speech features to a known set of speech features. Automatic speaker verification is when a person claims an identity and the machine determines whether that claimed identity is correct or whether that person is an impostor. Channel distortion occurs whenever information is sent electronically through any type of channel, whether that channel is a basic wired telephone channel or a wireless channel. The types of distortion that can corrupt the information include time-variant or time-invariant filtering of the information or the addition of 'thermal noise'; both can cause varying degrees of error in the information being received and analysed. The experiments presented in this thesis investigate the effects of channel distortion on average speaker recognition rates and test the effectiveness of various channel compensation algorithms designed to mitigate those effects. The speaker recognition system was represented by a basic recognition algorithm consisting of speech analysis, extraction of feature vectors in the form of mel-cepstral coefficients, and a classification stage based on the minimum distance rule. Two types of channel distortion were investigated: convolutional (lowpass filtering) effects and the addition of white Gaussian noise. Three different methods of channel compensation were tested: Cepstral Mean Subtraction (CMS), RelAtive SpecTrAl (RASTA) processing, and the Constant Modulus Algorithm (CMA). The results showed that, for both CMS and RASTA processing, filtering at low cutoff frequencies (3 or 4 kHz) produced improvements in the average speaker recognition rates compared to speech with no compensation. The levels of improvement due to RASTA processing were higher than those achieved with the CMS method. Neither the CMS nor the RASTA method was able to improve the accuracy of the speaker recognition system for cutoff frequencies of 5 kHz, 6 kHz or 7 kHz. In the case of noisy speech, all methods analysed were able to compensate for high SNRs of 40 dB and 30 dB, and only RASTA processing was able to compensate and improve the average recognition rate for speech corrupted with a high level of noise (SNRs of 20 dB and 10 dB).
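
Cepstral mean subtraction, the first compensation method listed, exploits the fact that a stationary convolutional channel becomes an additive constant in the cepstral domain, so subtracting the per-utterance mean cancels it; a minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (n_frames, n_coeffs) matrix of cepstral feature vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```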
18

Domínguez, Sánchez Carlos. "Speaker Recognition in a handheld computer." Thesis, KTH, Kommunikationssystem, CoS, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-99123.

Abstract:
Handheld computers are widely used, be it a mobile phone, personal digital assistant (PDA), or a media player. Although these devices are personal, often a small set of persons can use a given device, for example a group of friends or a family. The most natural way for most humans to communicate is through speech, so a natural way for these devices to know who is using them is to listen to the user's speech, i.e., to recognize the speaker based upon their speech. This project exploits the microphone built into most of these devices and asks whether it is possible to develop an effective speaker recognition system which can operate within the limited resources of these devices (as compared to a desktop PC). The goal of this speaker recognition is to distinguish between the small set of people that could share a handheld device and those outside of this small set. The criteria are therefore that the device should work for any of the members of this small set and not work for anyone outside it, and furthermore that, within this small set, the device should recognize which specific person is using it. An application for a Windows Mobile PDA has been developed using C++. This application and its underlying theoretical concepts, as well as parts of the code and the results obtained (in terms of accuracy and performance), are presented in this thesis. The experiments conducted within this research indicate that it is feasible to recognize whether a user, based upon their speech, is within a small group, and furthermore to identify which member of the group the user is. This has great potential for automatically configuring devices within a home or office environment for a specific user. Potentially, all a user needs to do is speak within hearing range of the device to identify themselves to it; the device in turn can configure itself for this user.
19

Chan, Chit-man (陳哲民). "Speaker-independent recognition of Putonghua finals." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1987. http://hub.hku.hk/bib/B12363091.

Abstract:
A detailed study was performed to address the problem of speaker-independent recognition of Putonghua (Mandarin) finals. The study included 35 Putonghua finals, 16 of which have trailing nasals. They were spoken by 51 speakers (38 females, 13 males) in 5 different tones, two times each. The sample was spectrally analysed by a bank of 18 non-overlapping critical-band filters. Three data reduction techniques, Karhunen-Loeve Transformation (KLT), Discrete Cosine Transformation (DCT) and Stepwise Discriminant Analysis (SDA), were comparatively studied for their feature representation capability. The results indicated that KLT was superior to both DCT and SDA. Furthermore, the theoretical equivalence of DCT to KLT was found to be valid only with 5 or more feature dimensions used in computation. The results also showed that the Mahalanobis and a proposed modified Mahalanobis distance both gave a better measurement of performance than the other distances tested, which included the City Block, Euclidean, Minkowski and Chebyshev distances. In the second part of the study, the Hidden Markov Modelling (HMM) technique was investigated. Three classification methods, Phonemic Labelling (PL), Vector Quantization (VQ) and a proposed Hybrid Symbol (HS) generation, were studied for use with HMM. Whilst PL was found to be simple and efficient, its performance was not as good as VQ; however, the time taken by VQ was excessive, especially in training. The results with the HS method showed that it could successfully merge the speed advantage of PL and the better discriminatory power of VQ: an approximately 80% saving in quantizer training time could be achieved with only a marginal loss in performance. At the same time, it was also found that allowing skipping of states in a left-to-right model (LRM) could have a negative effect on overall recognition. As an indication of performance, the recognition rate of the simulated system was 81.3%, 95.0% and 98.0% with the best 1, 2 and 3 candidates included, respectively, using a 256-level VQ and a 6-state, no-skip LRM on a sample of 8,400 finals from 48 speakers. The rates on non-nasal finals reached 96%-98% using the best candidate alone.
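
Among the distance measures compared in this study, the (modified) Mahalanobis distance performed best; the plain version weights each dimension by the inverse covariance, as in this illustrative minimum-distance classifier sketch:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance between feature vector x and a class centre."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify(x, class_stats):
    """class_stats: dict of label -> (mean, inverse covariance matrix)."""
    return min(class_stats, key=lambda c: mahalanobis(x, *class_stats[c]))
```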
20

Deterding, David Henry. "Speaker normalisation for automatic speech recognition." Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.359822.

21

Park, Alex S. (Alex Seungryong) 1979. "ASR dependent techniques for speaker recognition." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/87287.

22

Chan, Chit-man. "Speaker-independent recognition of Putonghua finals." [Hong Kong : University of Hong Kong], 1987. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12363091.

23

Vogt, Robert Jeffery. "Automatic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 2006. https://eprints.qut.edu.au/36195/1/Robert_Vogt_Thesis.pdf.

Abstract:
Speaker verification is the process of verifying the identity of a person by analysing their speech. There are several important applications for automatic speaker verification (ASV) technology including suspect identification, tracking terrorists and detecting a person’s presence at a remote location in the surveillance domain, as well as person authentication for phone banking and credit card transactions in the private sector. Telephones and telephony networks provide a natural medium for these applications. The aim of this work is to improve the usefulness of ASV technology for practical applications in the presence of adverse conditions. In a telephony environment, background noise, handset mismatch, channel distortions, room acoustics and restrictions on the available testing and training data are common sources of errors for ASV systems. Two research themes were pursued to overcome these adverse conditions: Modelling mismatch and modelling uncertainty. To directly address the performance degradation incurred through mismatched conditions it was proposed to directly model this mismatch. Feature mapping was evaluated for combating handset mismatch and was extended through the use of a blind clustering algorithm to remove the need for accurate handset labels for the training data. Mismatch modelling was then generalised by explicitly modelling the session conditions as a constrained offset of the speaker model means. This session variability modelling approach enabled the modelling of arbitrary sources of mismatch, including handset type, and halved the error rates in many cases. Methods to model the uncertainty in speaker model estimates and verification scores were developed to address the difficulties of limited training and testing data. The Bayes factor was introduced to account for the uncertainty of the speaker model estimates in testing by applying Bayesian theory to the verification criterion, with improved performance in matched conditions. Modelling the uncertainty in the verification score itself met with significant success. Estimating a confidence interval for the "true" verification score enabled an order of magnitude reduction in the average quantity of speech required to make a confident verification decision based on a threshold. The confidence measures developed in this work may also have significant applications for forensic speaker verification tasks.
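
The confidence-interval idea in the closing paragraph can be illustrated with a simple sequential decision over accumulating frame scores: decide as soon as the interval clears the threshold, otherwise ask for more speech. This sketch uses a plain t-interval, which is an assumption of the illustration, not the estimator derived in the thesis:

```python
import numpy as np
from scipy import stats

def confident_decision(frame_scores, threshold, confidence=0.95):
    scores = np.asarray(frame_scores, dtype=float)
    n = len(scores)
    if n < 2:
        return "need more speech"
    mean = scores.mean()
    half = (stats.t.ppf(0.5 + confidence / 2, df=n - 1)
            * scores.std(ddof=1) / np.sqrt(n))
    if mean - half > threshold:
        return "accept"
    if mean + half < threshold:
        return "reject"
    return "need more speech"   # interval still straddles the threshold
```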
24

Al-Ali, Ahmed Kamil Hasan. "Forensic speaker recognition under adverse conditions." Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/130783/1/Ahmed%20Kamil%20Hasan_Al-Ali_Thesis.pdf.

Abstract:
The performance of forensic speaker recognition systems degrades significantly in the presence of environmental noise and reverberant conditions. This research developed new techniques to improve forensic speaker recognition performance under these conditions using fusion feature extraction techniques and speech enhancement based on the independent component analysis algorithm. A range of forensic speaker recognition applications will benefit from the research outcomes including criminal investigations and law enforcement agencies.
25

Yin, Shou-Chun, 1980. "Speaker adaptation in joint factor analysis based text independent speaker verification." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.

Abstract:
This thesis presents methods for supervised and unsupervised speaker adaptation of Gaussian mixture speaker models in text-independent speaker verification. The proposed methods are based on an approach which is able to separate speaker and channel variability so that progressive updating of speaker models can be performed while minimizing the influence of the channel variability associated with the adaptation recordings. This approach relies on a joint factor analysis model of intrinsic speaker variability and session variability where inter-session variation is assumed to result primarily from the effects of the transmission channel. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 speaker recognition evaluation plan which is based on conversational telephone speech.
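
For reference, the joint factor analysis model of intrinsic speaker and session variability referred to here is commonly written as the following supervector decomposition (standard notation from the JFA literature, not quoted from the thesis):

```latex
\[
  M \;=\; m \;+\; V y \;+\; U x \;+\; D z
\]
% m : speaker- and session-independent (UBM) mean supervector
% V : eigenvoice matrix,   y : speaker factors
% U : eigenchannel matrix, x : session/channel factors
% D : diagonal residual,   z : speaker-specific residual factors
```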
26

Thiruvaran, Tharmarajah. "Automatic speaker recognition using phase based features." Awarded by: University of New South Wales. Electrical Engineering & Telecommunications, 2009. http://handle.unsw.edu.au/1959.4/44705.

Abstract:
Despite recent advances, improving the accuracy of automatic speaker recognition systems remains an important and challenging area of research. This thesis investigates two phase-based features, namely the frequency modulation (FM) feature and the group delay feature, in order to improve speaker recognition accuracy. Introducing features complementary to spectral-envelope-based features is a promising approach for increasing the information content of the speaker recognition system. Although phase-based features are motivated by psychophysics and speech production considerations, they have rarely been incorporated into speaker recognition front-ends. A theory is developed and reported in this thesis to show that the FM component can be extracted using second-order all-pole modelling, and a technique for extracting FM features using this model is proposed, producing very smooth, slowly varying FM features that are effective for speaker recognition tasks. This approach is shown to significantly improve speaker recognition performance over other existing FM extraction methods. A highly computationally efficient FM estimation technique is then proposed, and its computational efficiency is shown through a comparative study with other methods with respect to the trade-off between computational complexity and performance. In order to further enhance the FM-based front-end specifically for speaker recognition, optimum frequency band allocation is studied in terms of the number of sub-bands and spacing of centre frequencies, and two new frequency band re-allocations are proposed for FM-based speaker recognition. Two group delay features are also proposed, a log-compressed group delay feature and a sub-band group delay feature, to address problems in group delay caused by the zeros of the z-transform polynomial of a speech signal being close to the unit circle. It is shown that the combination of group delay and FM complements Mel Frequency Cepstral Coefficients (MFCCs) in speaker recognition tasks. Furthermore, the proposed FM feature is successfully utilised for automatic forensic speaker recognition, implemented within the likelihood ratio framework with two-stage modelling and calibration, and shown to behave in a complementary manner to MFCCs. Notably, the FM-based system provides better calibration loss than the MFCC-based system, suggesting less ambiguity of FM information than MFCC information in an automatic forensic speaker recognition system. In order to demonstrate the effectiveness of FM features in a large-scale speaker recognition environment, an FM-based speaker recognition subsystem was developed and submitted to the NIST 2008 speaker recognition evaluation as part of the I4U submission. Post-evaluation analysis shows a 19.7% relative improvement over the traditional MFCC-based subsystem when it is augmented by the FM-based subsystem. Consistent improvements in performance are obtained when MFCC is augmented with FM in all sub-categories of NIST 2008, in three development tasks and for the NIST 2001 database, demonstrating the complementary behaviour of MFCC and FM features.
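
Group delay, the second phase-based feature investigated, is the negative frequency derivative of the Fourier phase; per frame it can be computed without phase unwrapping via the standard n·x[n] identity. A minimal sketch (the log-compression and sub-band variants proposed in the thesis are not reproduced here):

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-8):
    """Group delay of one windowed speech frame, per rfft bin.
    Uses tau(w) = Re{Y(w)/X(w)} with y[n] = n*x[n], so no unwrapping is
    needed; eps guards the near-zero |X| spikes the thesis sets out to fix."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```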
27

Katz, Marcel. "Discriminative classifiers for speaker recognition." Saarbrücken: Südwestdeutscher Verlag für Hochschulschriften, 2009. http://www.vdm-verlag.de.

28

Elvira, Jose M. "Neural networks for speech and speaker recognition." Thesis, Staffordshire University, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.262314.

29

McAuley, J. "Subband correlation and robust speech/speaker recognition." Thesis, Queen's University Belfast, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.426761.

30

Chan, Carlos Chun Ming. "Speaker model adaptation in automatic speech recognition." Thesis, Robert Gordon University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.339307.

31

Irvine, David Alexander. "A comparison of some speaker recognition techniques." Thesis, University of Ulster, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.385661.

32

Iliadi, Konstantina. "Bio-inspired voice recognition for speaker identification." Thesis, University of Southampton, 2016. https://eprints.soton.ac.uk/413949/.

Abstract:
Speaker identification (SID) aims to identify the underlying speaker(s) given a speech utterance. In a speaker identification system, the first component is the front-end, or feature extractor. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components is strongly determined by it. Existing approaches have used several feature extraction methods adopted directly from the speech recognition task. However, the nature of these two tasks is contradictory, given that speaker variability is one of the major error sources in speech recognition, whereas in speaker recognition it is the information that we wish to extract. In this thesis, the possible benefits of adapting a biologically inspired model of human auditory processing as part of the front-end of a SID system are examined. This auditory model, named the Auditory Image Model (AIM), generates the stabilized auditory image (SAI). Features are extracted from the SAI by breaking it into boxes of different scales. Vector quantization (VQ) is used to create the speaker database with the speakers' reference templates that are used for pattern matching against the features of the target speakers to be identified. These features are also compared to the Mel-frequency cepstral coefficients (MFCCs), the most evident example of a feature set that is extensively used in speaker recognition but was originally developed for speech recognition purposes. Additionally, another important parameter in SID systems is the dimensionality of the features. This study addresses the issue by identifying the most speaker-specific features and trying to further improve the system configuration for obtaining a representation of the auditory features with lower dimensionality. Furthermore, after evaluating the system performance in quiet conditions, another primary topic of speaker recognition is investigated: SID systems can perform well under matched training and test conditions, but their performance degrades significantly because of the mismatch caused by background noise in real-world environments, so achieving robustness becomes an important research problem. In the second experimental part of this thesis, the developed version of the system is assessed on speaker data sets of different sizes, with clean speech used for the training phase and speech in the presence of babble noise used for testing. The results suggest that the extracted auditory feature vectors lead to much better performance, i.e. higher SID accuracy, than the MFCC-based recognition system, especially at low SNRs. Lastly, the system performance is inspected with regard to parameters related to the training and test speech data, such as the duration of the spoken material; the system is found to produce satisfying identification scores for relatively short training and test speech segments.
33

Fér, Radek. "Speaker Recognition Based on Long Temporal Context." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236121.

Abstract:
This thesis deals with the extraction of features for speaker recognition from longer temporal contexts. After reviewing current techniques for extracting such features, we propose and describe a new method operating at the time scale of phonemes and building on the well-known i-vector technique. Considerable effort was devoted to finding a suitable representation of temporal features that could make speaker recognition systems more robust, particularly the modelling of prosody. Our approach does not explicitly model any specific temporal speech parameters; instead, it uses the co-occurrence of speech frames as the source of temporal features. The technique is tested and analysed on the NIST SRE 2008 speech database. Unfortunately, the results show that it does not bring the expected improvement for speaker recognition; this finding is discussed and analysed at the end of the thesis.
34

Castellano, Pierre John. "Speaker recognition modelling with artificial neural networks." Thesis, Queensland University of Technology, 1997.

35

Ho, Ching-Hsiang. "Speaker modelling for voice conversion." Thesis, Brunel University, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.365076.

36

Fredrickson, Steven Eric. "Neural networks for speaker identification." Thesis, University of Oxford, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.294364.

37

Nosratighods, Mohaddeseh. "Robust speaker verification system." Publisher: University of New South Wales. Electrical Engineering & Telecommunications, 2008. http://handle.unsw.edu.au/1959.4/42796.

Full text
Abstract:
Identity verification or biometric recognition systems play an important role in our daily lives. Applications include Automatic Teller Machines (ATMs), banking and share information retrieval, and personal verification for credit cards. Among biometric techniques, authentication of a speaker by his/her voice is of great importance, since it employs a non-invasive approach and is the only available modality in many applications. However, the performance of Automatic Speaker Verification (ASV) systems degrades significantly under adverse conditions which cause recordings from the same speaker to be different. The objective of this research is to investigate and develop robust techniques for performing automatic speaker recognition over various channel conditions, such as telephony and recorded microphone speech. This research is shown to improve the robustness of ASV systems in three main areas: feature extraction, speaker modelling and score normalization. At the feature level, a new set of dynamic features, termed Delta Cepstral Energy (DCE), is proposed instead of traditional delta cepstra; it not only greatly reduces the dimensionality of the feature vector compared with delta and delta-delta cepstra, but is also shown to provide the same performance for matched testing and training conditions on TIMIT and a subset of the NIST 2002 dataset. The concept of speaker entropy, which conveys the information contained in a speaker's speech based on the extracted features, facilitates comparative evaluation of the proposed methods. In addition, Frequency Modulation features are combined in a complementary manner with the Mel Frequency Cepstral Coefficients (MFCCs) to improve the performance of the ASV system under channel variability of various types. The proposed fused system shows a relative reduction of up to 23% in Equal Error Rate (EER) over the MFCC-based system when evaluated on the NIST 2008 dataset. Currently, the main challenge in speaker modelling is channel variability across different sessions. A recent approach to channel compensation, based on Support Vector Machines (SVMs), is Nuisance Attribute Projection (NAP). The proposed multi-component approach to NAP attempts to compensate for the main sources of inter-session variation through an additional optimization criterion, to allow more accurate estimates of the most dominant channel artefacts and to improve system performance under mismatched training and test conditions. Another major issue in speaker recognition is that the variability of score distributions due to incompletely modelled regions of the feature space can produce segments of the test speech that are poorly matched to the claimed speaker model. A segment selection technique in score normalization is proposed that relies only on discriminative and reliable segments of the test utterance to verify the speaker. This approach is particularly useful in noisy conditions, where speech activity detection is not reliable at the feature level. Another source of score variability comes from the fact that not all phonemes are equally discriminative. To address this, a new score re-weighting technique is applied to likelihood values based on the discriminative level of each Gaussian component, i.e. each particular region of the feature space. It is found that a limited number of Gaussian mixture components, herein termed discriminative components, are responsible for the overall performance, and that inclusion of the other, non-discriminative components may only degrade system performance.
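For context on the dynamic features this abstract builds on, here is a minimal sketch of the conventional delta-cepstra regression that the proposed DCE features are positioned against; the exact DCE definition is specific to the thesis and is not reproduced here. The window size `N` and the random MFCC matrix are illustrative assumptions.

```python
# Standard regression formula for delta features over a +/- N frame window:
# d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)
import numpy as np

def delta(cepstra: np.ndarray, N: int = 2) -> np.ndarray:
    """cepstra: (frames, coeffs). Returns delta features of the same shape."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:len(cepstra) + N + n] -
                    padded[N - n:len(cepstra) + N - n])
               for n in range(1, N + 1)) / denom

mfcc = np.random.default_rng(1).normal(size=(100, 13))  # stand-in cepstra
print(delta(mfcc).shape)  # (100, 13)
```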
APA, Harvard, Vancouver, ISO, and other styles
38

Wildermoth, Brett Richard. "Text-Independent Speaker Recognition Using Source Based Features." Griffith University. School of Microelectronic Engineering, 2001. http://www4.gu.edu.au:8080/adt-root/public/adt-QGU20040831.115646.

Full text
Abstract:
The speech signal is primarily meant to carry information about the linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify/verify) a person. This thesis deals with the speaker identification task; i.e., finding the identity of a person, from a group of persons already enrolled during the training phase, using his/her speech. Listeners use many audible cues in identifying speakers. These cues range from high-level cues such as the semantics and linguistics of the speech, to low-level cues relating to the speaker's vocal tract and voice source characteristics. Generally, the vocal tract characteristics are modeled in modern-day speaker identification systems by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by pitch and voicing information. Pitch provides very important and useful information for identifying speakers, yet in current speaker recognition systems it is rarely used, since it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize this pitch and voicing information for speaker identification. Using a text-independent speaker identification system, the thesis illustrates the reasonable performance of the cepstral coefficients, achieving an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, speech is not always periodic; only about half of the frames are voiced, so pitch cannot be estimated for the unvoiced frames, and the problem becomes how to account for pitch information for those frames during the recognition phase. Second, pitch estimation methods are not very reliable: they classify some frames as unvoiced when they are really voiced, and they make pitch estimation errors (such as doubling or halving the pitch value, depending on the method). In order to use pitch information for speaker recognition, these problems must be overcome. We need a method which does not use the pitch value directly as a feature and which works for voiced as well as unvoiced frames in a reliable manner. We propose a method which uses the autocorrelation function of the given frame to derive pitch-related features, called maximum autocorrelation value (MACV) features. These features can be extracted for voiced as well as unvoiced frames and do not suffer from doubling or halving pitch estimation errors. Using these MACV features along with the cepstral features, speaker identification performance is improved by 45%.
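A minimal sketch of the MACV idea described above: take the normalised autocorrelation of each frame and keep its peak value within lag sub-bands covering plausible pitch periods. The peak height, not the lag, is the feature, so it is defined for voiced and unvoiced frames alike. The sample rate, pitch range and number of sub-bands below are assumptions, not the thesis's exact configuration.

```python
import numpy as np

def macv(frame: np.ndarray, fs: int = 8000, f_lo: int = 70, f_hi: int = 400,
         n_bands: int = 5) -> np.ndarray:
    """Maximum autocorrelation values over pitch-range lag sub-bands."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                    # normalise so ac[0] == 1
    lags = np.linspace(fs // f_hi, fs // f_lo, n_bands + 1, dtype=int)
    return np.array([ac[lo:hi].max()             # peak per lag sub-band
                     for lo, hi in zip(lags[:-1], lags[1:])])

t = np.arange(240) / 8000.0
print(macv(np.sin(2 * np.pi * 120 * t)))                 # voiced: high peaks
print(macv(np.random.default_rng(2).normal(size=240)))   # noise: low peaks
```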
APA, Harvard, Vancouver, ISO, and other styles
39

Wildermoth, Brett Richard. "Text-Independent Speaker Recognition Using Source Based Features." Thesis, Griffith University, 2001. http://hdl.handle.net/10072/366289.

Full text
Abstract:
The speech signal is primarily meant to carry information about the linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify/verify) a person. This thesis deals with the speaker identification task; i.e., finding the identity of a person, from a group of persons already enrolled during the training phase, using his/her speech. Listeners use many audible cues in identifying speakers. These cues range from high-level cues such as the semantics and linguistics of the speech, to low-level cues relating to the speaker's vocal tract and voice source characteristics. Generally, the vocal tract characteristics are modeled in modern-day speaker identification systems by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by pitch and voicing information. Pitch provides very important and useful information for identifying speakers, yet in current speaker recognition systems it is rarely used, since it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize this pitch and voicing information for speaker identification. Using a text-independent speaker identification system, the thesis illustrates the reasonable performance of the cepstral coefficients, achieving an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, speech is not always periodic; only about half of the frames are voiced, so pitch cannot be estimated for the unvoiced frames, and the problem becomes how to account for pitch information for those frames during the recognition phase. Second, pitch estimation methods are not very reliable: they classify some frames as unvoiced when they are really voiced, and they make pitch estimation errors (such as doubling or halving the pitch value, depending on the method). In order to use pitch information for speaker recognition, these problems must be overcome. We need a method which does not use the pitch value directly as a feature and which works for voiced as well as unvoiced frames in a reliable manner. We propose a method which uses the autocorrelation function of the given frame to derive pitch-related features, called maximum autocorrelation value (MACV) features. These features can be extracted for voiced as well as unvoiced frames and do not suffer from doubling or halving pitch estimation errors. Using these MACV features along with the cepstral features, speaker identification performance is improved by 45%.
Thesis (Masters)
Master of Philosophy (MPhil)
School of Microelectronic Engineering
Faculty of Engineering and Information Technology
Full Text
APA, Harvard, Vancouver, ISO, and other styles
40

Baker, Brendan J. "Speaker verification incorporating high-level linguistic features." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17665/1/Brendan_Baker_Thesis.pdf.

Full text
Abstract:
Speaker verification is the process of verifying or disputing the claimed identity of a speaker based on a recorded sample of their speech. Automatic speaker verification technology can be applied to a variety of person authentication and identification applications including forensics, surveillance, national security measures for combating terrorism, credit card and transaction verification, automation and indexing of speakers in audio data, voice-based signatures, and over-the-phone security access. The ubiquitous nature of modern telephony systems allows for the easy acquisition and delivery of speech signals for processing by an automated speaker recognition system. Traditionally, approaches to automatic speaker verification have involved holistic modelling of low-level acoustic-based features in order to characterise physiological aspects of a speaker such as the length and shape of the vocal tract. Although the use of these low-level features has proved highly successful, there are numerous other sources of speaker-specific information in the speech signal that have largely been ignored. In spontaneous and conversational speech, perceptually higher levels of information such as the linguistic content, pronunciation idiosyncrasies, idiolectal word usage, speaking rates and prosody can also provide useful cues as to the identity of a speaker. The main aim of this work is to explore the incorporation of higher levels of information into the verification process. Specifically, linguistic constructs such as words, syllables and phones are examined for their usefulness as features for text-independent speaker verification. Two main approaches to incorporating these linguistic features are explored. Firstly, the direct modelling of linguistic feature sequences is examined. Stochastic language models are used to model word and phonetic sequences obtained from automatically generated transcripts. Experimentation indicates that significant speaker-characterising information is indeed contained in both word- and phone-level transcripts. It is shown, however, that model estimation issues arise when limited speech is available for training. This speaker model estimation problem is addressed by employing an adaptive model training strategy that significantly improves performance and extends the usefulness of both lexical and phonetic techniques to short training length situations. An alternative approach to incorporating linguistic information is also examined. Rather than modelling the high-level features independently of acoustic information, linguistic information is instead used to constrain and aid acoustic-based speaker verification techniques. It is hypothesised that a "text-constrained" approach provides direct benefits by facilitating more detailed modelling, as well as providing useful insight into which articulatory events provide the most useful speaker-characterising information. A novel framework for text-constrained speaker verification is developed. This technique is presented as a generalised framework capable of using different feature sets and modelling paradigms, and is based upon the use of a newly defined pseudo-syllabic segmentation unit. A detailed exploration of the speaker-characterising power of both broad phonetic and syllabic events is performed and used to optimise the system configuration. An evaluation of the proposed text-constrained framework using cepstral features demonstrates the benefits of such an approach over holistic approaches, particularly in extended training length scenarios. Finally, a complete evaluation of the developed techniques on the NIST 2005 speaker recognition evaluation database is presented. The benefit of including high-level linguistic information is demonstrated when a fusion of both high- and low-level techniques is performed.
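A toy sketch of the direct lexical/phonetic modelling idea above: train a bigram model on a speaker's phone transcript and verify by comparing the test log-likelihood against a background model. The transcripts below are invented stand-ins for the automatically obtained ones mentioned in the abstract, and add-alpha smoothing is an illustrative choice.

```python
from collections import Counter
import math

def bigram_counts(phones):
    return Counter(zip(phones, phones[1:]))

def log_likelihood(phones, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-likelihood of a phone sequence."""
    total = sum(counts.values())
    return sum(math.log((counts[bg] + alpha) / (total + alpha * vocab_size**2))
               for bg in zip(phones, phones[1:]))

train_spk = "ah b ah k ah b ah".split()   # claimed speaker's transcript
train_bkg = "s t ih k s t ih".split()     # background transcript
test = "ah b ah b ah".split()
vocab = len(set(train_spk + train_bkg + test))
score = (log_likelihood(test, bigram_counts(train_spk), vocab) -
         log_likelihood(test, bigram_counts(train_bkg), vocab))
print(f"LLR = {score:.2f}  (positive favours the claimed speaker)")
```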
APA, Harvard, Vancouver, ISO, and other styles
41

Baker, Brendan J. "Speaker verification incorporating high-level linguistic features." Queensland University of Technology, 2008. http://eprints.qut.edu.au/17665/.

Full text
Abstract:
Speaker verification is the process of verifying or disputing the claimed identity of a speaker based on a recorded sample of their speech. Automatic speaker verification technology can be applied to a variety of person authentication and identification applications including forensics, surveillance, national security measures for combating terrorism, credit card and transaction verification, automation and indexing of speakers in audio data, voice-based signatures, and over-the-phone security access. The ubiquitous nature of modern telephony systems allows for the easy acquisition and delivery of speech signals for processing by an automated speaker recognition system. Traditionally, approaches to automatic speaker verification have involved holistic modelling of low-level acoustic-based features in order to characterise physiological aspects of a speaker such as the length and shape of the vocal tract. Although the use of these low-level features has proved highly successful, there are numerous other sources of speaker-specific information in the speech signal that have largely been ignored. In spontaneous and conversational speech, perceptually higher levels of information such as the linguistic content, pronunciation idiosyncrasies, idiolectal word usage, speaking rates and prosody can also provide useful cues as to the identity of a speaker. The main aim of this work is to explore the incorporation of higher levels of information into the verification process. Specifically, linguistic constructs such as words, syllables and phones are examined for their usefulness as features for text-independent speaker verification. Two main approaches to incorporating these linguistic features are explored. Firstly, the direct modelling of linguistic feature sequences is examined. Stochastic language models are used to model word and phonetic sequences obtained from automatically generated transcripts. Experimentation indicates that significant speaker-characterising information is indeed contained in both word- and phone-level transcripts. It is shown, however, that model estimation issues arise when limited speech is available for training. This speaker model estimation problem is addressed by employing an adaptive model training strategy that significantly improves performance and extends the usefulness of both lexical and phonetic techniques to short training length situations. An alternative approach to incorporating linguistic information is also examined. Rather than modelling the high-level features independently of acoustic information, linguistic information is instead used to constrain and aid acoustic-based speaker verification techniques. It is hypothesised that a "text-constrained" approach provides direct benefits by facilitating more detailed modelling, as well as providing useful insight into which articulatory events provide the most useful speaker-characterising information. A novel framework for text-constrained speaker verification is developed. This technique is presented as a generalised framework capable of using different feature sets and modelling paradigms, and is based upon the use of a newly defined pseudo-syllabic segmentation unit. A detailed exploration of the speaker-characterising power of both broad phonetic and syllabic events is performed and used to optimise the system configuration. An evaluation of the proposed text-constrained framework using cepstral features demonstrates the benefits of such an approach over holistic approaches, particularly in extended training length scenarios. Finally, a complete evaluation of the developed techniques on the NIST 2005 speaker recognition evaluation database is presented. The benefit of including high-level linguistic information is demonstrated when a fusion of both high- and low-level techniques is performed.
APA, Harvard, Vancouver, ISO, and other styles
42

Tran, Michael. "An approach to a robust speaker recognition system." Diss., Virginia Polytechnic Institute and State University, 1994. http://scholar.lib.vt.edu/theses/available/etd-06062008-164814/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Farrús Cabeceran, Mireia. "Fusing prosodic and acoustic information for speaker recognition." Doctoral thesis, Universitat Politècnica de Catalunya, 2008. http://hdl.handle.net/10803/31779.

Full text
Abstract:
Automatic speaker recognition is the use of a machine to identify an individual from a spoken sentence. Recently, this technology has undergone increasing use in applications such as access control, transaction authentication, law enforcement, forensics, and system customisation, among others. One of the central questions addressed by this field is what it is in the speech signal that conveys speaker identity. Traditionally, automatic speaker recognition systems have relied mostly on short-term features related to the spectrum of the voice. However, human speaker recognition also relies on other sources of information; therefore, there is reason to believe that these sources can play an important role in the automatic speaker recognition task as well, adding complementary knowledge to traditional spectrum-based recognition systems and thus improving their accuracy. The main objective of this thesis is to add prosodic information to a traditional spectral system in order to improve its performance. To this end, several characteristics related to human speech prosody – which is conveyed through intonation, rhythm and stress – are selected and combined with the existing spectral features. Furthermore, this thesis also focuses on the use of additional acoustic features – namely jitter and shimmer – to improve the performance of the proposed spectral-prosodic verification system. Both features are related to the shape and dimension of the vocal tract, and they have been largely used to detect voice pathologies. Since almost all the above-mentioned applications can be used in a multimodal environment, this thesis also aims to combine the voice features used in the speaker recognition system with another biometric identifier – the face – in order to improve global performance. To this end, several normalisation and fusion techniques are used, and the final fusion results are improved by applying different fusion strategies based on sequences of several steps. Furthermore, multimodal fusion is also improved by applying histogram equalisation to the unimodal score distributions as a normalisation technique. On the other hand, it is well known that humans are able to identify others by voice even when those voices are disguised. The question arises as to how vulnerable automatic speaker recognition systems are to different voice disguises, such as human imitation or artificial voice conversion, which are potential threats to security systems that rely on automatic speaker recognition. The last part of this thesis presents an analysis of the robustness of such systems against human voice imitations and synthetic converted voices, and of the influence of foreign accents and dialects – as a sort of imitation – on auditory speaker recognition.
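A small sketch of the score-level fusion discussed above: normalise each subsystem's scores (z-norm here, as one of several normalisation options) and combine them with a weighted sum. The weights and score values are illustrative, not the thesis's tuned configuration.

```python
import numpy as np

def znorm(scores: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance score normalisation."""
    return (scores - scores.mean()) / (scores.std() + 1e-12)

def fuse(score_sets: list, weights: list) -> np.ndarray:
    """Weighted sum of normalised subsystem scores, one per trial."""
    return sum(w * znorm(s) for w, s in zip(weights, score_sets))

spectral = np.array([2.1, -0.5, 1.8, -1.2])       # e.g. spectral subsystem
prosodic = np.array([0.3, -0.1, 0.4, -0.2])       # e.g. intonation/rhythm
jitter_shimmer = np.array([0.9, 0.2, 1.1, -0.7])  # e.g. voice-quality features
fused = fuse([spectral, prosodic, jitter_shimmer], weights=[0.6, 0.2, 0.2])
print(fused)  # one fused score per trial; threshold to accept/reject
```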
APA, Harvard, Vancouver, ISO, and other styles
44

Khan, Umair. "Self-supervised deep learning approaches to speaker recognition." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/671496.

Full text
Abstract:
In speaker recognition, i-vectors have been the state-of-the-art unsupervised technique over the last few years, whereas x-vectors are becoming the state-of-the-art supervised technique. Recent advances in Deep Learning (DL) approaches to speaker recognition have improved performance but are constrained by the need for labels for the background data. In practice, labeled background data is not easily accessible, especially when large training data is required. In i-vector based speaker recognition, cosine and Probabilistic Linear Discriminant Analysis (PLDA) are the two basic scoring techniques. Cosine scoring is unsupervised, whereas PLDA parameters are typically trained using speaker-labeled background data. This creates a large performance gap between the two scoring techniques. The question is: how to fill this performance gap without using speaker labels for the background data? In this thesis, the above-mentioned problem is addressed using DL approaches without using, or while limiting the use of, labeled background data. Three DL-based proposals are made. In the first proposal, a Restricted Boltzmann Machine (RBM) vector representation of speech is proposed for the tasks of speaker clustering and tracking in TV broadcast shows. This representation is referred to as the RBM vector. The experiments on the AGORA database show that in speaker clustering the RBM vectors gain a relative improvement of 12% in terms of Equal Impurity (EI). For the speaker tracking task, RBM vectors are used only in the speaker identification part, where the relative improvements in terms of Equal Error Rate (EER) are 11% and 7% using cosine and PLDA scoring, respectively. In the second proposal, DL approaches are used to increase the discriminative power of i-vectors in speaker verification, employing autoencoders in several ways. Firstly, an autoencoder is used as a pre-training step for a Deep Neural Network (DNN) on a large amount of unlabeled background data; a DNN classifier is then trained using relatively small labeled data. Secondly, an autoencoder is trained to transform i-vectors into a new representation with increased discriminative power, where training is carried out based on nearest-neighbor i-vectors chosen in an unsupervised manner. The evaluation was performed on the VoxCeleb-1 database. The results show that the first system gains a relative improvement of 21% in terms of EER over i-vector/PLDA, whereas the second system gains a relative improvement of 42%; if the background data is also used in the testing part, a relative improvement of 53% is gained. In the third proposal, a self-supervised end-to-end speaker verification system is trained. The idea is to utilize impostor samples along with nearest-neighbor samples to make client/impostor pairs in an unsupervised manner. The architecture is based on a Convolutional Neural Network (CNN) encoder, trained as a siamese network with two branch networks. Another network with three branches is trained using a triplet loss in order to extract unsupervised speaker embeddings. The experimental results show that both the end-to-end system and the speaker embeddings, despite being unsupervised, show performance comparable to the supervised baseline; moreover, their score combination can further improve performance. The proposed approaches for speaker verification have their respective pros and cons. The best result was obtained using the nearest-neighbor autoencoder, with the disadvantage of relying on background i-vectors during testing. In contrast, the autoencoder pre-training for the DNN is not bound by this factor, but it is a semi-supervised approach. The third proposal is free from both of these constraints and performs reasonably well: it is a self-supervised approach and does not require the background i-vectors in the testing phase.
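A compact sketch in the flavour of the second proposal: train an autoencoder on unlabeled "i-vectors" and use the bottleneck activations as a transformed representation. Real i-vectors are replaced by random data, the nearest-neighbour target selection described in the thesis is simplified to plain reconstruction, and the layer sizes are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
ivectors = rng.normal(size=(1000, 100))         # stand-in background i-vectors

# Fitting input -> input with a narrow hidden layer makes this an autoencoder.
ae = MLPRegressor(hidden_layer_sizes=(32,), activation="tanh",
                  max_iter=500, random_state=0)
ae.fit(ivectors, ivectors)

# Bottleneck activations (first-layer outputs) become the new representation.
hidden = np.tanh(ivectors @ ae.coefs_[0] + ae.intercepts_[0])
print(hidden.shape)                             # (1000, 32)
```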
APA, Harvard, Vancouver, ISO, and other styles
45

Uzuner, Halil. "Robust text-independent speaker recognition over telecommunications systems." Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/843391/.

Full text
Abstract:
Biometric recognition methods, using human features such as voice, face or fingerprints, are increasingly popular for user authentication. Voice is unique in that it is a non-intrusive biometric which can be transmitted over existing telecommunication networks, thereby allowing remote authentication. Current speaker recognition systems can provide high recognition rates on clean speech signals. However, their performance has been shown to degrade in real-life applications such as telephone banking, where speech compression and background noise can affect the speech signal. In this work, three important advancements are introduced to improve speaker recognition performance where it is affected by coder mismatch, by the aliasing distortion caused by Line Spectral Frequency (LSF) parameter extraction, and by background noise. The first advancement focuses on investigating speaker recognition performance in a multi-coder environment using a Speech Coder Detection (SCD) system, which minimises the mismatch between training and testing data and improves recognition performance. Having reduced the speaker recognition error rates for the multi-coder environment, further investigation of the GSM-EFR speech coder is performed to deal with a particular problem related to the LSF parameter extraction method. It has previously been shown that the classic technique for extracting LSF parameters in speech coders is prone to aliasing distortion; low-pass filtering of up-sampled LSF vectors has been shown to alleviate this problem, thereby improving speech quality. In this thesis, as a second advancement, the Non-Aliased LSF (NA-LSF) extraction method is introduced in order to reduce the unwanted effects of the GSM-EFR coder on speaker recognition performance. Another important factor that affects the performance of speaker recognition systems is the presence of background noise, which might severely reduce the quality of the coded speech or the performance of the speaker recognition system. The third advancement was achieved by using a noise canceller to improve speaker recognition performance in mismatched environments with varying background noise conditions. A speaker recognition system with a Minimum Mean Square Error - Log Spectral Amplitudes (MMSE-LSA) noise canceller used as a pre-processor is proposed and investigated to determine the efficiency of noise cancellation on speaker recognition performance, using speech corrupted by different background noise conditions. The effects of noise cancellation on speaker recognition performance using coded noisy speech have also been investigated. Keywords: identification, verification, recognition, Gaussian mixture models, speech coding, noise cancellation.
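The thesis uses an MMSE-LSA noise canceller as a pre-processor; as a much simpler stand-in for the same pre-processing idea, this sketch applies basic spectral subtraction: estimate the noise magnitude spectrum from leading frames, assumed to contain noise only, and subtract it before the feature extraction stages. The frame sizes and spectral floor are illustrative assumptions.

```python
import numpy as np

def spectral_subtract(frames: np.ndarray, n_noise_frames: int = 10,
                      floor: float = 0.01) -> np.ndarray:
    """frames: (n_frames, frame_len) windowed time-domain frames."""
    spectra = np.fft.rfft(frames, axis=1)
    # Noise estimate: mean magnitude over the leading (assumed speech-free) frames.
    noise_mag = np.abs(spectra[:n_noise_frames]).mean(axis=0)
    # Subtract, but keep a small spectral floor to avoid negative magnitudes.
    mag = np.maximum(np.abs(spectra) - noise_mag, floor * np.abs(spectra))
    return mag  # cleaned magnitude spectra, ready for filterbank/MFCC stages

noisy = np.random.default_rng(7).normal(size=(50, 256))
print(spectral_subtract(noisy).shape)  # (50, 129)
```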
APA, Harvard, Vancouver, ISO, and other styles
46

Eriksson, Erik J. "That voice sounds familiar : factors in speaker recognition." Doctoral thesis, Umeå : Department of Philosophy and Linguistics, Umeå University, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1106.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Falk, Jennie, and Gabriella Hultström. "Support Vector Machines for Optimizing Speaker Recognition Problems." Thesis, KTH, Optimeringslära och systemteori, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-103821.

Full text
Abstract:
Classification of data has many applications, amongst others within the field of speaker recognition. Speaker recognition is the part of speech processing concerned with the task of automatically identifying or verifying speakers using different characteristics of their voices. The main focus in speaker recognition is to find methods that separate data, in order to differentiate between different speakers. In this thesis, such a method is obtained by building a support vector machine, which has proved to be a very good tool for separating all kinds of data. The first version of the support vector machine is used to separate linearly separable data using linear hyperplanes, and it is then modified to separate linearly non-separable data by allowing some data points to be misclassified. Finally, the support vector machine is improved further, through a generalization to higher-dimensional data and through the use of different kernels and thus higher-order hyperplanes. The developed support vector machine is in the end used on a set of speaker recognition data. The separation of two speakers is not very satisfying, most likely due to the very limited set of data. However, the results are very good when the support vector machine is used on other, more complete, sets of data.
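A minimal sketch of the classifier built up in this thesis: a soft-margin SVM with a non-linear kernel separating feature vectors from two speakers. The data here is synthetic; in the thesis the points would be speech feature vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # speaker A frames
               rng.normal(2.5, 1.0, size=(100, 2))])   # speaker B frames
y = np.array([0] * 100 + [1] * 100)

# RBF kernel gives a higher-order separating surface; C controls how many
# points may be misclassified (the soft margin described in the abstract).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the separating surface
```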
APA, Harvard, Vancouver, ISO, and other styles
48

Farnes, Karen. "Development of a Speaker Recognition Solution in Vidispine." Thesis, Umeå universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-74180.

Full text
Abstract:
A video database contains an enormous amount of information. In order to search through the database, metadata can be attached to each video. One such type of metadata is labels identifying the speakers and where they are speaking. With the help of speaker recognition, this type of metadata can be assigned to each video automatically. In this thesis, a speaker recognition plug-in for Vidispine, an API-based media asset management platform, is presented. The plug-in was developed with the help of the LIUM SpkDiarization toolkit for speaker diarization and the ALIZE/LIA RAL toolkit for speaker identification. The choice of the GMM-UBM method that ALIZE/LIA RAL offers was made through an in-depth theoretical study of different identification methods, presented in its own chapter. The goal was for the plug-in to achieve an identification rate of 85%; unfortunately, the result was as low as 63%. Among the issues the plug-in faces, its low performance on female speakers was shown to be crucial.
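A rough sketch of the GMM-UBM scheme mentioned above: a universal background model is trained on pooled speech, a speaker model is derived from it, and a test utterance is scored by the log-likelihood ratio. ALIZE/LIA RAL performs proper MAP adaptation; re-estimating from the UBM's means, as done here with scikit-learn, is a simplification, and all data is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
pooled = rng.normal(size=(2000, 12))             # stand-in for many speakers
speaker = rng.normal(0.5, 1.0, size=(300, 12))   # enrolment data

ubm = GaussianMixture(n_components=8, random_state=0).fit(pooled)
spk = GaussianMixture(n_components=8, means_init=ubm.means_,
                      random_state=0).fit(speaker)  # crude "adaptation"

test = rng.normal(0.5, 1.0, size=(100, 12))
llr = spk.score(test) - ubm.score(test)          # avg log-likelihood ratio
print(f"LLR = {llr:.3f} (higher supports the claimed speaker)")
```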
APA, Harvard, Vancouver, ISO, and other styles
49

Openshaw, J. P. "The effects of additive noise in speaker recognition." Thesis, Swansea University, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.638372.

Full text
Abstract:
This thesis is concerned with text-independent speaker recognition and how performance is affected when additive noise contaminates the speech. Initially, benchmark recognition results are obtained for MFCC, PLP and their Δ derivative features. The drastic effects of noise are clear: recognition errors increase from 3.4% to 60.5% if the test speech has an SNR of 15 dB, a level not uncommon in a background office environment. Various attempts at compensating for the adverse effects of noise are investigated, such as explicit modelling, whereby the noise conditions expected in the testing phase are included in the model. Performance improves across a range of noise levels with this technique, although a priori knowledge of the noise level is required when creating the models. A different technique, in essence a form of filtering, is used to map features extracted from noisy speech to match those extracted from clean speech. Both linear and non-linear transformation functions are investigated, with an artificial neural network, given its ability to model arbitrary functions, achieving the best performance; a reliance on a priori knowledge of the noise level is still required when generating the transformation function. The technique of noise masking is found to give the features considerable insensitivity to additive noise. This is a simple technique with little computational overhead; however, the optimum mask level is found to depend on the level of the additive noise, again implying a priori knowledge of that level. Finally, a new feature is demonstrated in this thesis: the time-relative cepstral series, T-ReCS. The T-ReCS feature uses an estimate of the spectral change of the speech signal, which filters out any stationary spectral component.
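An illustrative sketch of the noise-masking idea evaluated in the thesis: filterbank energies are floored at a mask level so that low-energy regions, where additive noise dominates, no longer influence the features. The mask level below is arbitrary; as the abstract notes, the optimum depends on the actual noise level.

```python
import numpy as np

def mask_energies(log_energies: np.ndarray, mask_db: float) -> np.ndarray:
    """Floor per-band log energies at mask_db below each frame's maximum."""
    floor = log_energies.max(axis=1, keepdims=True) - mask_db
    return np.maximum(log_energies, floor)

# Four frames of 20-band log filterbank energies (dB), stand-in values.
frames = np.random.default_rng(6).uniform(-60, 0, size=(4, 20))
masked = mask_energies(frames, mask_db=30.0)
print(masked.min(), ">=", frames.max(axis=1).min() - 30.0)
```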
APA, Harvard, Vancouver, ISO, and other styles
50

Cox, S. J. "Techniques for rapid speaker adaptation in speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.267271.

Full text
APA, Harvard, Vancouver, ISO, and other styles