Dissertations / Theses on the topic 'Vocal feature'

To see the other types of publications on this topic, follow the link: Vocal feature.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 24 dissertations / theses for your research on the topic 'Vocal feature.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Moore, Elliot II. "Evaluating objective feature statistics of speech as indicators of vocal affect and depression." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/5346.

2

Moore, Elliot. "Evaluating objective feature statistics of speech as indicators of vocal affect and depression." Diss., Georgia Institute of Technology, 2003. Available online: http://etd.gatech.edu/theses/available/etd-04062004-164738/unrestricted/moore%5Felliot%5F200312%5Fphd.pdf.

3

Carvalho, Raphael Torres Santos. "Transformada Wavelet na detecção de patologias da laringe." Universidade Federal do Ceará, 2012. http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=8908.

Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The number of non-invasive diagnostic methods has increased due to the need for simple, quick and painless tests. Owing to the growth of technology that provides the means for signal extraction and processing, new analytical methods have been developed to help understand the complexity of voice signals. This dissertation presents a new idea for characterizing healthy and pathological voice signals based on a mathematical tool widely known in the literature, the Wavelet Transform (WT). The speech data used in this work consist of 60 voice samples divided into four classes: one from healthy individuals and three from people with vocal fold nodules, Reinke's edema and neurological dysphonia. All samples were recorded using the sustained vowel /a/ of Brazilian Portuguese. The results obtained by all the pattern classifiers studied indicate that the proposed approach using the WT is a suitable technique for discriminating between healthy and pathological voices, since it performs similarly to or even better than the classical technique in terms of recognition rates.
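As a rough companion to the abstract above, the Python sketch below computes normalised wavelet sub-band energies for a voice frame, one common way to turn a Wavelet Transform into classifier features. The wavelet family ('db4'), the decomposition depth, and the synthetic signal are illustrative assumptions, not the dissertation's actual settings.

```python
# Hedged sketch: wavelet-energy features for healthy vs. pathological voice,
# in the spirit of the approach described above (not the author's exact pipeline).
import numpy as np
import pywt

def wavelet_energy_features(frame, wavelet="db4", level=5):
    """Relative energy of each DWT sub-band of a voice frame."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return energies / energies.sum()  # normalise so the features sum to 1

# Example on a synthetic sustained /a/-like signal (placeholder for real data)
fs = 16000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(wavelet_energy_features(voice))
```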
4

Wildermoth, Brett Richard. "Text-Independent Speaker Recognition Using Source Based Features." Griffith University. School of Microelectronic Engineering, 2001. http://www4.gu.edu.au:8080/adt-root/public/adt-QGU20040831.115646.

Abstract:
The speech signal is primarily meant to carry information about the linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify/verify) a person. This thesis deals with the speaker identification task; i.e., to find the identity of a person using his/her speech from a group of persons already enrolled during the training phase. Listeners use many audible cues in identifying speakers. These cues range from high-level cues, such as the semantics and linguistics of the speech, to low-level cues relating to the speaker's vocal tract and voice source characteristics. Generally, the vocal tract characteristics are modeled in modern-day speaker identification systems by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by using both pitch and voicing information. Pitch provides very important and useful information for identifying speakers. In current speaker recognition systems, it is very rarely used, as it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize this pitch and voicing information for speaker identification. This thesis illustrates, through the use of a text-independent speaker identification system, the reasonable performance of the cepstral coefficients, achieving an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, the speech is not always periodic; only about half of the frames are voiced. Thus, pitch cannot be estimated for half of the frames (i.e., for unvoiced frames), and the problem is how to account for pitch information for the unvoiced frames during the recognition phase. Second, pitch estimation methods are not very reliable: they classify some frames as unvoiced when they are really voiced, and they make pitch estimation errors (such as doubling or halving of the pitch value, depending on the method). In order to use pitch information for speaker recognition, we have to overcome these problems. We need a method which does not use the pitch value directly as a feature and which works for voiced as well as unvoiced frames in a reliable manner. We propose here a method which uses the autocorrelation function of the given frame to derive pitch-related features. We call these the maximum autocorrelation value (MACV) features. These features can be extracted for voiced as well as unvoiced frames and do not suffer from the pitch doubling or halving type of pitch estimation errors. Using these MACV features along with the cepstral features, the speaker identification performance is improved by 45%.
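The MACV idea described above is compact enough to sketch. The hedged Python fragment below takes the peak of the normalised autocorrelation in a few lag bands as features; the band count and the 50-400 Hz lag range are assumed values for illustration, not the thesis's exact configuration.

```python
# Hedged sketch of maximum autocorrelation value (MACV) features: the lag axis is
# split into bands and the peak normalised autocorrelation in each band is kept.
import numpy as np

def macv_features(frame, fs=8000, n_bands=5, fmin=50.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                # normalise so lag 0 equals 1
    edges = np.linspace(int(fs / fmax), int(fs / fmin), n_bands + 1).astype(int)
    return np.array([ac[lo:hi].max() for lo, hi in zip(edges[:-1], edges[1:])])

frame = np.random.randn(240)                 # stand-in for a 30 ms speech frame
print(macv_features(frame))                  # defined for voiced and unvoiced frames
```

Because only the height of the autocorrelation peak is kept, not its lag position, the features stay defined for unvoiced frames and are unaffected by pitch doubling or halving.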
5

Wildermoth, Brett Richard. "Text-Independent Speaker Recognition Using Source Based Features." Thesis, Griffith University, 2001. http://hdl.handle.net/10072/366289.

Abstract:
The speech signal is primarily meant to carry information about the linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify/verify) a person. This thesis deals with the speaker identification task; i.e., to find the identity of a person using his/her speech from a group of persons already enrolled during the training phase. Listeners use many audible cues in identifying speakers. These cues range from high-level cues, such as the semantics and linguistics of the speech, to low-level cues relating to the speaker's vocal tract and voice source characteristics. Generally, the vocal tract characteristics are modeled in modern-day speaker identification systems by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by using both pitch and voicing information. Pitch provides very important and useful information for identifying speakers. In current speaker recognition systems, it is very rarely used, as it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize this pitch and voicing information for speaker identification. This thesis illustrates, through the use of a text-independent speaker identification system, the reasonable performance of the cepstral coefficients, achieving an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, the speech is not always periodic; only about half of the frames are voiced. Thus, pitch cannot be estimated for half of the frames (i.e., for unvoiced frames), and the problem is how to account for pitch information for the unvoiced frames during the recognition phase. Second, pitch estimation methods are not very reliable: they classify some frames as unvoiced when they are really voiced, and they make pitch estimation errors (such as doubling or halving of the pitch value, depending on the method). In order to use pitch information for speaker recognition, we have to overcome these problems. We need a method which does not use the pitch value directly as a feature and which works for voiced as well as unvoiced frames in a reliable manner. We propose here a method which uses the autocorrelation function of the given frame to derive pitch-related features. We call these the maximum autocorrelation value (MACV) features. These features can be extracted for voiced as well as unvoiced frames and do not suffer from the pitch doubling or halving type of pitch estimation errors. Using these MACV features along with the cepstral features, the speaker identification performance is improved by 45%.
Thesis (Masters), Master of Philosophy (MPhil), School of Microelectronic Engineering, Faculty of Engineering and Information Technology.
6

Horwitz-Martin, Rachelle (Rachelle Laura). "Vocal modulation features in the prediction of major depressive disorder severity." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/93072.

Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.
"September 2014." Cataloged from PDF version of thesis.
Includes bibliographical references (pages 113-115).
This thesis develops a model of vocal modulations up to 50 Hz in sustained vowels as a basis for biomarkers of neurological disease, particularly Major Depressive Disorder (MDD). Two model components contribute to amplitude modulation (AM): AM from respiratory muscles, and AM from interaction between formants and frequency modulation in the fundamental frequency harmonics. Based on the modulation model, we test three methods to extract the envelope of the third formant, from which features are extracted, using sustained vowels from the 2013 Audio/Visual Emotion Challenge. Using a Gaussian-Mixture-Model-based predictor, we evaluate the performance of each feature in predicting subjects' Beck MDD severity scores by the root mean square error (RMSE), mean absolute error (MAE), and Spearman correlation between the actual and predicted Beck scores. Our lowest MAE and RMSE values are 8.46 and 10.32, respectively (Spearman correlation = 0.487, p < 0.001), relative to the mean MAE of 10.05 and mean RMSE of 11.86.
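The evaluation metrics named above (RMSE, MAE, and Spearman correlation between actual and predicted Beck scores) are easy to reproduce. A minimal sketch follows, using made-up scores rather than the thesis's data:

```python
# Hedged sketch of the three reported metrics; the score values are invented.
import numpy as np
from scipy.stats import spearmanr

actual = np.array([12, 25, 7, 30, 18, 9], dtype=float)     # hypothetical Beck scores
predicted = np.array([15, 21, 10, 26, 20, 8], dtype=float)

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
mae = np.mean(np.abs(actual - predicted))
rho, p = spearmanr(actual, predicted)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  Spearman rho={rho:.3f} (p={p:.3f})")
```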
7

García, María Susana Avila. "Automatic tracking of 3D vocal tract features during speech production using MRI." Thesis, University of Southampton, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.437111.

8

Marchetto, Enrico. "Automatic Speaker Recognition and Characterization by means of Robust Vocal Source Features." Doctoral thesis, Università degli studi di Padova, 2011. http://hdl.handle.net/11577/3427390.

Abstract:
Automatic Speaker Recognition is a wide research field, which encompasses many topics: signal processing, human vocal and auditory physiology, statistical modelling, cognitive sciences, and so on. The study of these techniques started about thirty years ago and, since then, the improvement has been dramatic. Nonetheless, the field still poses open issues, and many active research centers around the world are working towards more reliable and better-performing systems. This thesis documents a Philosophiae Doctor project funded by the privately held company RT - Radio Trevisan Elettronica Industriale S.p.A. The title of the fellowship is "Automatic speaker recognition with applications to security and intelligence". Part of the work was carried out during a six-month visit to the Speech, Music and Hearing Department of the KTH Royal Institute of Technology, Stockholm. Speaker Recognition research develops techniques to automatically associate a given human voice with a previously recorded version of it. Speaker Recognition is usually further divided into Speaker Identification and Speaker Verification; in the former, the identity of a voice has to be found among a (possibly high) number of speaker voices, while in the latter the system is provided with both a voice and a claimed identity, and the association has to be verified as a true/false statement. The recognition system also provides a confidence score for the found results. The first Part of the thesis reviews the state of the art of Speaker Recognition research. The main components of a recognition system are described: audio feature extraction, statistical modelling, and performance assessment. Over the years the research community has developed a number of audio features, used to describe the information carried by the vocal signal in a compact and deterministic way. In every automatic recognition application, including speech or language recognition, feature extraction is the first step, in charge of substantially compressing the size of the input data without losing any important information. The choice of the features best fitted to a specific application, and their tuning, are crucial to obtaining satisfactory recognition results; moreover, the definition of innovative features is a lively research direction, because it is generally recognized that existing features are still far from exploiting the whole information load carried by the vocal signal. Some audio features have proved over the years to perform better than others; two of them are described in Part I: Mel-Frequency Cepstral Coefficients and Linear Prediction Coefficients. More refined and experimental features are also introduced, and are explained in Part III. Statistical modelling is introduced, particularly by discussing the structure of Gaussian Mixture Models and their training through the EM algorithm; specific modelling techniques for recognition, such as the Universal Background Model, are described. Scoring is the last phase of a Speaker Recognition process and involves a number of normalizations; it compensates for different recording conditions or modelling issues. Part I continues by presenting a number of audio databases that are commonly used in the literature as benchmarks to compare results or recognition systems, in particular TIMIT and the NIST Speaker Recognition Evaluation (SRE) 2004. A recognition prototype system was built during the PhD project, and is detailed in Part II.
The first Chapter describes the proposed application, relating to intelligence and security. The application fulfils specific requirements of the Authorities when investigations involve phone wiretapping or environmental interception. In these cases the Authorities have to listen to a large number of recordings, most of which are not related to the investigation. The application idea is to automatically detect and label speakers, giving the possibility to search for a specific speaker across the recording collection. This can avoid wasted time, resulting in an economic advantage. Many difficulties arise from phone lines, which are known to degrade the speech signal and reduce recognition performance; the main issues are the narrow audio bandwidth, additive noise, and convolutional noise, the last resulting in phase distortion. The second Chapter of Part II describes the developed Speaker Recognition system in detail, and a number of design choices are discussed. During development, the research scope of the system was crucial: a lot of effort was put into obtaining a system with good performance that remains easily and deeply modifiable. The assessment of results on different databases posed further challenges, which were solved with a unified interface to the databases. The fundamental components of a speaker recognition system were developed, along with some speed-up improvements. Lastly, the whole software can run on a cluster computer without any reconfiguration, a crucial characteristic for assessing performance on big databases in reasonable time. During the three-year project, some work related to Speaker Recognition, although not directly part of it, was also carried out; these developments are described in Part II as extensions of the prototype. First, a Voice Activity Detector suitable for noisy recordings is explained. The first step of feature extraction is to find and select, from a given recording, only the segments containing voice; this is not a trivial task when the recording is noisy and a simple "energy threshold" approach fails. The developed VAD is based on advanced features, computed from Wavelet Transforms, which are further processed using an adaptive threshold. A second application is Speaker Diarization: it automatically segments an audio recording containing different speakers. The outputs of diarization are a segmentation and a speaker label for each segment, answering the question "who speaks when". The third and last collateral work is a Noise Reduction system for voice applications, developed on a hardware DSP. The noise reduction algorithm adaptively detects the noise and reduces it, keeping only the voice; it works in real time using only a small portion of the DSP's computing power. Lastly, Part III discusses innovative audio features, which are the main novel contribution of this thesis. The features are obtained from the glottal flow, so the first Chapter of this Part describes the anatomy of the vocal folds and of the vocal tract. The working principle of the phonation apparatus is described, and the importance of vocal fold physics is pointed out. The glottal flow is an input air flow for the vocal tract, which acts as a filter; an open-source toolkit for inverting the vocal tract filter is introduced, permitting estimation of the glottal flow from speech recordings.
A description of some methods used to characterize the glottal flow numerically is given. In the subsequent Chapter, a definition of the novel glottal features is presented. The glottal flow estimates are not always reliable, so a first step detects and deletes unlikely flows. A numerical procedure then groups and sorts the flow estimates, preparing them for statistical modelling. Performance measures are then discussed, comparing the novel features against the standard ones on the reference databases TIMIT and SRE 2004. A Chapter is dedicated to a different research work, related to glottal flow characterization. A physical model of the vocal folds is presented, with a number of control rules able to describe the vocal fold dynamics. The rules permit translating a specific pharyngeal muscular set-up into mechanical parameters of the model, which results in a specific glottal flow (obtained after a computer simulation of the model). The so-called Inverse Problem is defined this way: given a glottal flow, find the muscular set-up which, used to drive a model simulation, obtains the same glottal flow as the given one. The inverse problem carries a number of difficulties, such as the non-uniqueness of the inversion and the sensitivity to slight variations in the input flow. An optimization-based control technique developed for it is explained. The final Chapter summarizes the achievements of the thesis. Along with this discussion, a roadmap for future improvements to the features is sketched. Finally, a summary of the articles published in, and submitted to, conferences and journals is presented.
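A minimal sketch of the GMM-UBM scheme reviewed above may help: a Universal Background Model is trained on pooled features, its means are MAP-adapted to one speaker, and a test segment is scored as a log-likelihood ratio. The component count, relevance factor, feature dimension, and random stand-in data are assumptions for illustration, not the thesis's settings.

```python
# Hedged sketch of GMM-UBM verification with means-only MAP adaptation
# (Reynolds-style, relevance factor r=16), using scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 12))        # stand-in for pooled MFCC frames
speaker = rng.normal(loc=0.3, size=(300, 12))   # stand-in for one speaker's frames
test = rng.normal(loc=0.3, size=(200, 12))

ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# MAP-adapt the UBM means towards the speaker's data.
post = ubm.predict_proba(speaker)               # (frames, components) responsibilities
n_k = post.sum(axis=0)                          # soft counts per component
f_k = post.T @ speaker                          # first-order statistics
alpha = (n_k / (n_k + 16.0))[:, None]
adapted_means = alpha * (f_k / (n_k[:, None] + 1e-10)) + (1 - alpha) * ubm.means_

spk = GaussianMixture(n_components=8, covariance_type="diag")
spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
spk.means_ = adapted_means
spk.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)

llr = spk.score(test) - ubm.score(test)         # average log-likelihood ratio
print(f"verification score (LLR): {llr:.3f}")
```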
9

Almeida, Náthalee Cavalcanti de. "Sistema inteligente para diagnóstico de patologias na laringe utilizando máquinas de vetor de suporte." Universidade Federal do Rio Grande do Norte, 2010. http://repositorio.ufrn.br:8080/jspui/handle/123456789/15149.

Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The human voice is an important communication tool, and any disorder of the voice can have profound implications for the social and professional life of an individual. Digital signal processing techniques have been used for the acoustic analysis of vocal disorders caused by pathologies in the larynx, due to their simplicity and noninvasive nature. This work deals with the acoustic analysis of voice signals affected by pathologies in the larynx, specifically edema and nodules on the vocal folds. The purpose of this work is to develop a voice classification system to help pre-diagnose pathologies in the larynx, as well as to monitor pharmacological and post-surgical treatments. Linear Prediction Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCC) and coefficients obtained through the Wavelet Packet Transform (WPT) are applied to extract relevant characteristics of the voice signal. The Support Vector Machine (SVM) is used for the classification task; it builds optimal hyperplanes that maximize the margin of separation between the classes involved. The generated hyperplane is determined by the support vectors, which are subsets of points of these classes. On the database used in this work, the results showed good performance, with a hit rate of 98.46% for the classification of normal versus pathological voices in general, and 98.75% for the classification between the two pathologies, edema and nodules.
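As an illustration of the classification stage described above, the hedged sketch below trains an RBF-kernel SVM to separate normal from pathological feature vectors; the synthetic features and hyperparameters are placeholders, not the dissertation's pipeline or data.

```python
# Hedged sketch: an RBF-kernel SVM separating normal vs. pathological voice
# feature vectors, mirroring the classification stage described above.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X_normal = rng.normal(loc=0.0, size=(40, 13))    # stand-in MFCC/LPC/WPT features
X_path = rng.normal(loc=0.8, size=(40, 13))
X = np.vstack([X_normal, X_path])
y = np.array([0] * 40 + [1] * 40)                # 0 = normal, 1 = pathological

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated hit rate: {scores.mean():.2%}")
```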
10

Lovett, Victoria Anne. "Voice Features of Sjogren's Syndrome: Examination of Relative Fundamental Frequency (RFF) During Connected Speech." BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/5749.

Abstract:
The purpose of this study was to examine the effectiveness of relative fundamental frequency (RFF) in quantifying voice disorder severity and possible change with treatment in individuals with Primary Sjögren's Syndrome (SS). Participants completed twice-daily audio recordings during an ABAB within-subjects experimental study investigating the effects of nebulized saline on voice production in this population. Voice samples of the Rainbow Passage from seven of the eight individuals with Primary SS involved in a larger investigation met inclusion criteria for analysis, for a total of 555 tokens. The results indicated that RFF values for this sample were similar to previously reported RFF values for individuals with voice disorders. RFF values improved with nebulized saline treatment but did not fall within the normal range for typical speakers. These findings were similar to other populations of voice disorders who experienced improvement, but not complete normalization, of RFF with treatment. Patient-based factors, such as age and diagnosis as well as measurement and methodological factors, might affect RFF values. The results from this study indicate that RFF is a potentially useful measure in quantifying voice production and disorder severity in individuals with Primary SS.
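For readers unfamiliar with the measure, relative fundamental frequency expresses the F0 of vocal cycles adjacent to a voiceless consonant in semitones relative to a steady reference cycle. A minimal sketch, with hypothetical cycle values rather than measured data:

```python
# Hedged sketch of the RFF idea: cycle-by-cycle F0 around a voiceless consonant,
# in semitones relative to a steady reference cycle. Values are illustrative.
import numpy as np

def rff_semitones(cycle_f0, reference_f0):
    """RFF of each vocal cycle, in semitones re: the steady reference cycle."""
    return 12.0 * np.log2(np.asarray(cycle_f0, dtype=float) / reference_f0)

offset_cycles = [201, 199, 196, 190, 182]   # hypothetical F0 (Hz) approaching offset
print(rff_semitones(offset_cycles, reference_f0=201.0))
```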
11

Fux, Thibaut. "Vers un système indiquant la distance d'un locuteur par transformation de sa voix." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENT120/document.

Abstract:
This thesis focuses on transforming a speaker's voice with the aim of indicating the speaker's distance: a spoken-to-whispered voice transformation to indicate a close distance, and a spoken-to-shouted voice transformation for a rather far distance. We first perform an in-depth analysis to determine the most relevant features in whispered voices and especially in shouted voices (much harder). The main contribution of this part is to show the relevance of prosodic parameters in the perception of vocal effort in a shouted voice. We then propose descriptors to better characterize the prosodic contours. For the actual transformation, we propose several new transformation rules which crucially control the quality of the transformed voice. The results showed very good quality for transformed whispered voices, as well as for transformed shouted voices with relatively simple linguistic structures (CVC, CVCV, etc.)
12

Lang, Anja. "Histomorphometrical analysis of the fibrous components of the porcine vocal folds – Stratigraphical features and their relevance for models in phoniatry." Hannover: Bibliothek der Tierärztlichen Hochschule Hannover, 2014. http://d-nb.info/1054387656/34.

13

Boyer, Stanislas. "Contribution de l'analyse du signal vocal à la détection de l'état de somnolence et du niveau de charge mentale." Thesis, Toulouse 3, 2016. http://www.theses.fr/2016TOU30075/document.

Abstract:
Operational requirements of aircraft pilots may cause drowsiness and inadequate mental load levels (i.e., too low or too high) during flights. Sleep debts and circadian disruptions linked to various factors (e.g., long working periods, irregular work schedules, etc.) require pilots to challenge their biological limits. Moreover, pilots' mental workload exhibits strong fluctuations during flights: high during critical phases (i.e., takeoff and landing), it becomes very low during cruising phases. When the mental load becomes too high or, conversely, too low, performance decreases and flight errors may manifest. Implementing methods to detect drowsiness and mental load levels in near real time is a major challenge for monitoring and controlling flight activity. The aim of this thesis is therefore to determine whether the human voice can serve to detect, on the one hand, the drowsiness and, on the other hand, the mental load level of an individual. In a first study, the voices of participants were recorded during a reading task before and after a night of total sleep deprivation (TSD). Drowsiness variations linked to TSD were assessed using self-evaluative and electrophysiological measures (ElectroEncephaloGraphy [EEG] and Evoked Potentials [EPs]). Results showed significant variations after the TSD in many acoustic features related to: (a) the amplitude of the glottal pulses (amplitude modulation frequency), (b) the shape of the acoustic wave (Euclidean length of the signal and its associated features) and (c) the spectrum of the vowel signal (harmonic-to-noise ratio, second formant frequency, skewness, spectral center of gravity, energy differences, spectral tilt and Mel-frequency cepstral coefficients). Most spectral features showed different sensitivity to sleep deprivation depending on the vowel type. Significant correlations were found between several acoustic features and several objective indicators (EEG and EPs) of drowsiness. In a second study, voices were recorded during a word-list recall task. The difficulty of the task was manipulated by varying the number of words in each list (i.e., between one and seven, corresponding to seven mental load conditions). The evoked pupillary response - known to be a useful proxy of mental load - was recorded simultaneously with speech to attest variations in mental load level during the experimental task. Results showed that classical features (fundamental frequency and its standard deviation, shimmer, number of periods and harmonic-to-noise ratio) and original features (amplitude modulation frequency and short-term variation in the Euclidean length of the signal) were particularly sensitive to variations in mental load. Variations in these acoustic features were correlated with those of pupil size. The results suggest that the acoustic features of the human voice identified in these experiments could represent relevant indicators for detecting the drowsiness and mental load level of an individual. The findings open up many research and application perspectives in the field of transport safety, particularly in the aeronautical sector.
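Two of the spectral features listed above, spectral center of gravity and skewness, can be computed by treating the magnitude spectrum as a probability distribution. A minimal sketch, with an illustrative test tone standing in for recorded speech:

```python
# Hedged sketch of spectral centre of gravity and spectral skewness,
# computed from the magnitude spectrum of a windowed frame.
import numpy as np

def spectral_moments(frame, fs):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    p = spec / spec.sum()                       # treat spectrum as a distribution
    cog = np.sum(freqs * p)                     # spectral centre of gravity (Hz)
    sd = np.sqrt(np.sum(((freqs - cog) ** 2) * p))
    skew = np.sum(((freqs - cog) ** 3) * p) / sd ** 3
    return cog, skew

fs = 16000
t = np.arange(512) / fs
cog, skew = spectral_moments(np.sin(2 * np.pi * 220 * t), fs)
print(f"centre of gravity={cog:.1f} Hz, skewness={skew:.2f}")
```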
14

Kahn, Juliette. "Parole de locuteur : performance et confiance en identification biométrique vocale." PhD thesis, Université d'Avignon, 2011. http://tel.archives-ouvertes.fr/tel-00995071.

Abstract:
This thesis explores the biometric use of speech, whose applications are numerous (security, smart environments, forensics, territorial surveillance, and authentication of electronic transactions). Speech is subject to many constraints depending on the speaker's origins (geographic, social and cultural) but also on his or her performative goals. The speaker can be considered one factor of speech variation among others. In this work, we present elements of an answer to the following two questions: Are all speech excerpts from the same speaker equivalent for recognizing him or her? And how are the different sources of variation that directly or indirectly convey the speaker's specificity structured? We first build a protocol to evaluate the human ability to discriminate a speaker from a speech excerpt, using data from the NIST-HASR 2010 campaign. The task thus posed is difficult for our listeners, whether naive or more experienced. In this context, we show that neither the (near-)unanimity of the listeners nor the self-assessment of their judgments guarantees confidence in the correctness of the submitted answer. We then quantify the influence of the choice of speech excerpt on the performance of automatic systems. We used two databases, NIST and BREF, and two speaker recognition systems, ALIZE/SpkDet (LIA) and Idento (SRI). The speaker recognition systems, whether based on a UBM-GMM approach or on an i-vector approach, show large performance gaps, measured with a variation rate around the mean EER, Vr (for NIST, VrIdento = 1.41 and VrALIZE/SpkDet = 1.47; for BREF, Vr = 3.11), depending on the training file chosen for each speaker. These very large performance variations show the sensitivity of automatic systems to the choice of speech excerpts, a sensitivity that is important to measure and to reduce in order to make speaker recognition systems more reliable. To explain the importance of the choice of speech excerpts, we look for the most relevant cues for distinguishing the speakers in our corpora by measuring the effect of the Speaker factor on the variance of the cues (eta squared, h2). F0 is strongly dependent on the Speaker factor, independently of the vowel. Some phonemes are more speaker-discriminative: nasal consonants, fricatives, nasal vowels, and mid-close to open oral vowels. This work is a first step towards a more precise study of what the speaker is, both for human perception and for automatic systems. While we have shown that there is indeed a cepstral difference that leads to more or less effective models, it remains to understand how to link the speaker to speech production. Finally, following this work, we wish to explore in more detail the influence of language on speaker recognition. Indeed, even if our results indicate that in American English and in French the same categories of phonemes carry the most speaker information, this point remains to be confirmed and evaluated for other languages.
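The eta-squared (h2) measure used above is the share of a feature's variance explained by the Speaker factor, in the one-way ANOVA sense. A minimal sketch with hypothetical F0 values:

```python
# Hedged sketch of eta-squared: between-speaker sum of squares over total
# sum of squares for one acoustic cue. The data below are invented.
import numpy as np

def eta_squared(values, speakers):
    values, speakers = np.asarray(values, dtype=float), np.asarray(speakers)
    grand = values.mean()
    ss_total = np.sum((values - grand) ** 2)
    ss_between = sum(
        len(values[speakers == s]) * (values[speakers == s].mean() - grand) ** 2
        for s in np.unique(speakers)
    )
    return ss_between / ss_total

f0 = [110, 112, 108, 180, 178, 182, 140, 141, 139]   # hypothetical F0 per utterance
spk = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
print(f"eta^2(F0 | Speaker) = {eta_squared(f0, spk):.3f}")
```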
15

Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition." Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.

Abstract:
In automatic speech recognition systems (ASRs), training is a critical phase for the system's success. Communication media, either analog (such as analog landline phones) or digital (VoIP), distort the speaker's speech signal, often in very complex ways: linear distortion occurs in all channels, in either the magnitude or the phase spectrum. Non-linear but time-invariant distortion will always appear in all real systems. In digital systems we also have network effects, which produce packet losses, delays, and repeated packets. Finally, one cannot really assert what path a signal will take, so error or distortion along the way is almost a certainty. The channel introduces an acoustical mismatch between the speaker's signal and the trained data in the ASR, which results in poor recognition performance. The approach so far has been to try to undo the havoc produced by the channels, i.e., to compensate for the channel's behavior. In this thesis, we try to characterize the effects of different transmission media and use that as an inexpensive and repeatable way to train ASR systems.
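A hedged sketch of the thesis's core idea, simulating a channel and degrading clean speech with it so that training data match the transmission medium, rather than compensating afterwards; the impulse response and noise level below are invented placeholders.

```python
# Hedged sketch: pass a clean signal through a simulated linear channel
# (impulse response + additive noise) to create channel-matched training data.
import numpy as np

rng = np.random.default_rng(2)
fs = 8000
clean = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)   # stand-in for clean speech

channel_ir = np.array([1.0, 0.6, 0.25, 0.1])           # hypothetical channel impulse response
degraded = np.convolve(clean, channel_ir, mode="same")
degraded += 0.01 * rng.normal(size=degraded.size)      # additive channel noise

snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((degraded - clean) ** 2))
print(f"simulated channel SNR: {snr:.1f} dB")
```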
16

Moura, Giselle Borges de. "Vocalização de suínos em grupo sob diferentes condições térmicas." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/11/11131/tde-26042013-094034/.

Abstract:
Quantifying and qualifying the well-being of farm animals is still a challenge. Any assessment of well-being must analyze, mainly, the absence of strong negative feelings, such as pain, and the presence of positive feelings, such as pleasure. The main objective was to quantify vocalization in groups of pigs under different thermal conditions. The specific objectives were to assess the existence of vocal patterns of communication between housed groups of pigs and to extract the acoustic characteristics of the sound spectrum of the vocalizations related to the different microclimate conditions. The trial was carried out in a controlled-environment experimental unit for pigs at the University of Illinois (USA). Four groups of six pigs were used in the data collection. Dataloggers were installed to record environmental variables (T, °C and RH, %). These environmental variables were used to calculate two thermal comfort indices: enthalpy and THI. Cardioid microphones were installed at the geometric center of each pen housing the pigs to record vocalizations continuously. The microphones were connected to an amplifier, which was connected to a DVR card installed in a computer to record audio and video. Goldwave® software was used to separate the files of the pig vocalization database and filter out background noise. The sounds were then analyzed using Sound Analysis Pro 2011, and the acoustic characteristics were extracted. Amplitude (dB), pitch (Hz), mean frequency (Hz), peak frequency (Hz) and entropy were used to characterize the sound spectrum of the vocalizations of the groups of piglets under the different thermal conditions. A randomized block design was used, composed of two treatments and three repetitions per week, executed over two weeks. Data were sampled to analyze the behavior of the vocalization databank in relation to the applied treatments, and submitted to an analysis of variance using proc GLM in SAS. The treatments (comfort and heat-stress conditions) presented significant differences by Tukey's test (p<0.05) for amplitude (dB), pitch and entropy. The analysis of variance showed differences in wave shape for each thermal condition in the different periods of the day. Quantifying the vocalization of swine in groups under different thermal conditions is thus possible by extracting acoustic characteristics from the sound samples. The extracted sound spectrum indicated possible changes in piglet behavior under the different thermal conditions within the periods of the day. However, the pattern-recognition stage still needs a larger and more consistent database for recognizing the spectrum in each thermal condition, whether by image analysis or by extraction of acoustic characteristics. Among the analyzed acoustic characteristics, the amplitude (dB), pitch (Hz) and entropy of the vocalizations of groups of swine were significant in expressing the condition of the animals under different thermal conditions.
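One of the features named above, entropy, is commonly computed as a normalised spectral entropy, low for tonal calls and high for broadband ones. A minimal sketch under that assumption:

```python
# Hedged sketch of normalised spectral entropy for a vocalization frame:
# near 0 for a pure tone, near 1 for broadband noise.
import numpy as np

def spectral_entropy(frame, fs=44100):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    p = spec / spec.sum()
    return -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))

rng = np.random.default_rng(3)
t = np.arange(1024) / 44100
print(spectral_entropy(np.sin(2 * np.pi * 1000 * t)))   # low: tonal squeal
print(spectral_entropy(rng.normal(size=1024)))          # high: broadband grunt
```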
APA, Harvard, Vancouver, ISO, and other styles
17

Steinholtz, Tim. "Skip connection in a MLP network for Parkinson’s classification." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-303130.

Full text
Abstract:
In this thesis, two architectures of a multi-layer perceptron (MLP) network were implemented: an ordinary MLP, and an MLP augmented with DenseNet-inspired skip connections. The models were evaluated on the task of classifying whether subjects had been diagnosed with Parkinson's disease, based on vocal features. They were trained on an openly available dataset for Parkinson's classification and evaluated both on a hold-out set from that dataset and on two datasets recorded in a different sound-recording environment from the training data. The thesis addressed two questions: how insensitive models for Parkinson's classification are to the sound-recording environment, and whether the proposed skip connections in an MLP can improve performance and generalization capacity. The results show that the sound environment affects accuracy, but suggest that, with more time, this could be overcome, allowing good accuracy when models are exposed to data from a sound environment other than that of the training data. On the question of whether the skip connections improve accuracy and generalization, no broad conclusions could be drawn from the data used: the models generally performed best as shallow networks, whereas it is in deeper networks that the skip connections are argued to help. Nevertheless, when evaluated on data from a different sound-recording environment than the training data, the skip-connection models performed best in two of the three tests.
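A minimal PyTorch sketch of the DenseNet-inspired skip connections the thesis describes: each hidden layer receives the concatenation of the input and all earlier layer outputs. The input dimensionality, layer sizes and depth are assumptions for illustration, not the thesis's configuration.

    import torch
    import torch.nn as nn

    class DenseSkipMLP(nn.Module):
        def __init__(self, in_dim=22, hidden=64, depth=3):
            super().__init__()
            self.layers = nn.ModuleList()
            dim = in_dim
            for _ in range(depth):
                self.layers.append(nn.Linear(dim, hidden))
                dim += hidden              # next layer sees all earlier outputs
            self.head = nn.Linear(dim, 1)  # single logit: Parkinson's vs healthy

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(torch.relu(layer(torch.cat(feats, dim=-1))))
            return self.head(torch.cat(feats, dim=-1))

    model = DenseSkipMLP()                   # assumed 22-dim vocal feature vectors
    print(model(torch.randn(8, 22)).shape)   # torch.Size([8, 1])

Concatenative skips give later layers direct access to the raw vocal features and earlier representations, which is the property the thesis investigates for generalization across recording environments.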
APA, Harvard, Vancouver, ISO, and other styles
18

Regnier, Lise. "Localization, Characterization and Recognition of Singing Voices." Phd thesis, Université Pierre et Marie Curie - Paris VI, 2012. http://tel.archives-ouvertes.fr/tel-00687475.

Full text
Abstract:
This dissertation is concerned with the problem of describing the singing voice within the audio signal of a song. The work is motivated by the fact that the lead vocal is the element that attracts the attention of most listeners; for this reason it is common for music listeners to organize and browse music collections using information related to the singing voice, such as the singer's name. Our research concentrates on three major problems of music information retrieval: the localization of the source to be described (i.e. the recognition of the elements corresponding to the singing voice in the signal of a mixture of instruments), the search for pertinent features to describe the singing voice, and finally the development of pattern recognition methods based on these features to identify the singer. For this purpose we propose a set of novel features computed on the temporal variations of the fundamental frequency of the sung melody. These features, which aim to describe the vibrato and the portamento, are obtained with the aid of a dedicated model. In practice, they are computed on the time-varying frequency of partials obtained using the sinusoidal model. In a first experiment we show that partials corresponding to the singing voice can be accurately differentiated from partials produced by other instruments using decisions based on the parameters of the vibrato and the portamento. Once the partials emitted by the singer are identified, the segments of the song containing singing can be localized directly. To improve the recognition of the partials emitted by the singer we propose to group partials that are harmonically related. Partials are clustered according to their degree of similarity, computed using a set of CASA cues including their temporal frequency variations (i.e. the vibrato and the portamento). The clusters of harmonically related partials corresponding to the singing voice are identified using the vocal vibrato and portamento parameters. Groups of vocal partials can then be re-synthesized to isolate the voice, and the result of the partial grouping can also be used to transcribe the sung melody. We then go further with these features and study whether the vibrato and portamento characteristics can be considered part of a singer's signature. Previous work on singer identification describes audio signals using features extracted from the short-term amplitude spectrum; such features aim to characterize the timbre of the sound, which, in the case of singing, is related to the vocal tract of the singer. The features we develop in this document capture long-term information related to the intonation of the singer, which is relevant to the singer's style and technique. We propose a method to combine these two complementary descriptions of the singing voice to increase the recognition rate of singer identification. In addition, we evaluate the robustness of each type of feature against a set of variations. We show that the singing voice is a highly variable instrument, so that obtaining a representative model of a singer's voice requires building models from a large set of examples covering the singer's full tessitura. Finally, we show that features extracted directly from the partials are more robust to the presence of an instrumental accompaniment than features derived from the amplitude spectrum.
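The following numpy sketch illustrates the kind of feature the dissertation builds on: decomposing a partial's frequency trajectory into a slow trend (portamento-like) and a periodic residual (vibrato), then reading off vibrato rate and extent. The moving-average decomposition and the 100 frames-per-second trajectory rate are assumptions for illustration, not Regnier's dedicated model.

    import numpy as np

    def vibrato_params(f0, fs_traj=100.0, win=0.25):
        """f0: partial-frequency trajectory (Hz) sampled at fs_traj frames/s."""
        n = len(f0)
        k = max(1, int(win * fs_traj))
        trend = np.convolve(f0, np.ones(k) / k, mode="same")  # portamento-like trend
        resid = f0 - trend                                    # vibrato component
        spec = np.abs(np.fft.rfft(resid * np.hanning(n)))
        freqs = np.fft.rfftfreq(n, d=1.0 / fs_traj)
        rate = freqs[np.argmax(spec[1:]) + 1]   # dominant modulation rate (Hz)
        extent = 2.0 * np.std(resid)            # rough modulation spread (Hz)
        return rate, extent

    # Synthetic sung note: 220 Hz carrier with a 5.5 Hz, +/-6 Hz vibrato.
    t = np.arange(0, 2, 1.0 / 100)
    print(vibrato_params(220 + 6 * np.sin(2 * np.pi * 5.5 * t)))  # rate ~= 5.5 Hz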
APA, Harvard, Vancouver, ISO, and other styles
19

"Use of vocal source features in speaker segmentation." 2006. http://library.cuhk.edu.hk/record=b5892857.

Full text
Abstract:
Chan Wai Nang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2006.
Includes bibliographical references (leaves 77-82).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
1.1 --- Speaker recognition --- p.1
1.2 --- State of the art of speaker recognition techniques --- p.2
1.3 --- Motivations --- p.5
1.4 --- Thesis outline --- p.6
Chapter 2 --- Acoustic Features --- p.8
2.1 --- Speech production --- p.8
2.1.1 --- Physiology of speech production --- p.8
2.1.2 --- Source-filter model --- p.11
2.2 --- Vocal tract and vocal source related acoustic features --- p.14
2.3 --- Linear predictive analysis of speech --- p.15
2.4 --- Features for speaker recognition --- p.16
2.4.1 --- Vocal tract related features --- p.17
2.4.2 --- Vocal source related features --- p.19
2.5 --- Wavelet octave coefficients of residues (WOCOR) --- p.20
Chapter 3 --- Statistical approaches to speaker recognition --- p.24
3.1 --- Statistical modeling --- p.24
3.1.1 --- Classification and modeling --- p.24
3.1.2 --- Parametric vs non-parametric --- p.25
3.1.3 --- Gaussian mixture model (GMM) --- p.25
3.1.4 --- Model estimation --- p.27
3.2 --- Classification --- p.28
3.2.1 --- Multi-class classification for speaker identification --- p.28
3.2.2 --- Two-speaker recognition --- p.29
3.2.3 --- Model selection by statistical model --- p.30
3.2.4 --- Performance evaluation metric --- p.31
Chapter 4 --- Content dependency study of WOCOR and MFCC --- p.32
4.1 --- Database: CU2C --- p.32
4.2 --- Methods and procedures --- p.33
4.3 --- Experimental results --- p.35
4.4 --- Discussion --- p.36
4.5 --- Detailed analysis --- p.39
Summary --- p.41
Chapter 5 --- Speaker Segmentation --- p.43
5.1 --- Feature extraction --- p.43
5.2 --- Statistical methods for segmentation and clustering --- p.44
5.2.1 --- Segmentation by spectral difference --- p.44
5.2.2 --- Segmentation by Bayesian information criterion (BIC) --- p.47
5.2.3 --- Segment clustering by BIC --- p.49
5.3 --- Baseline system --- p.50
5.3.1 --- Algorithm --- p.50
5.3.2 --- Speech database --- p.52
5.3.3 --- Performance metric --- p.53
5.3.4 --- Results --- p.58
Summary --- p.60
Chapter 6 --- Application of vocal source features in speaker segmentation --- p.61
6.1 --- Discrimination power of WOCOR against MFCC --- p.61
6.1.1 --- Experimental set-up --- p.62
6.1.2 --- Results --- p.63
6.2 --- Speaker segmentation using vocal source features --- p.67
6.2.1 --- The construction of new proposed system --- p.67
Summary --- p.72
Chapter 7 --- Conclusions --- p.74
Reference --- p.77
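For readers of the outline above, the criterion of Section 5.2.2 can be sketched as follows: a speaker-change boundary is hypothesized at frame t if modelling the two sides with separate Gaussians yields a higher penalized likelihood than one Gaussian over the whole window. Single full-covariance Gaussians and the penalty weight lambda = 1.0 are standard assumptions, not details taken from this thesis.

    import numpy as np

    def delta_bic(X, t, lam=1.0):
        """X: (N, d) feature frames; t: candidate boundary index."""
        N, d = X.shape
        logdet = lambda C: np.linalg.slogdet(C + 1e-6 * np.eye(d))[1]
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
        return (0.5 * N * logdet(np.cov(X.T))
                - 0.5 * t * logdet(np.cov(X[:t].T))
                - 0.5 * (N - t) * logdet(np.cov(X[t:].T))
                - penalty)   # > 0 suggests a speaker change at frame t

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
    print(delta_bic(X, 200) > 0)   # True: the two halves differ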
APA, Harvard, Vancouver, ISO, and other styles
20

"Robust speaker recognition using both vocal source and vocal tract features estimated from noisy input utterances." 2007. http://library.cuhk.edu.hk/record=b5893317.

Full text
Abstract:
Wang, Ning.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2007.
Includes bibliographical references (leaves 106-115).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
1.1 --- Introduction to Speech and Speaker Recognition --- p.1
1.2 --- Difficulties and Challenges of Speaker Authentication --- p.6
1.3 --- Objectives and Thesis Outline --- p.7
Chapter 2 --- Speaker Recognition System --- p.10
2.1 --- Baseline Speaker Recognition System Overview --- p.10
2.1.1 --- Feature Extraction --- p.12
2.1.2 --- Pattern Generation and Classification --- p.24
2.2 --- Performance Evaluation Metric for Different Speaker Recognition Tasks --- p.30
2.3 --- Robustness of Speaker Recognition System --- p.30
2.3.1 --- Speech Corpus: CU2C --- p.30
2.3.2 --- Noise Database: NOISEX-92 --- p.34
2.3.3 --- Mismatched Training and Testing Conditions --- p.35
2.4 --- Summary --- p.37
Chapter 3 --- Speaker Recognition System using both Vocal Tract and Vocal Source Features --- p.38
3.1 --- Speech Production Mechanism --- p.39
3.1.1 --- Speech Production: An Overview --- p.39
3.1.2 --- Acoustic Properties of Human Speech --- p.40
3.2 --- Source-filter Model and Linear Predictive Analysis --- p.44
3.2.1 --- Source-filter Speech Model --- p.44
3.2.2 --- Linear Predictive Analysis for Speech Signal --- p.46
3.3 --- Vocal Tract Features --- p.51
3.4 --- Vocal Source Features --- p.52
3.4.1 --- Source Related Features: An Overview --- p.52
3.4.2 --- Source Related Features: Technical Viewpoints --- p.54
3.5 --- Effects of Noises on Speech Properties --- p.55
3.6 --- Summary --- p.61
Chapter 4 --- Estimation of Robust Acoustic Features for Speaker Discrimination --- p.62
4.1 --- Robust Speech Techniques --- p.63
4.1.1 --- Noise Resilience --- p.64
4.1.2 --- Speech Enhancement --- p.64
4.2 --- Spectral Subtractive-Type Preprocessing --- p.65
4.2.1 --- Noise Estimation --- p.66
4.2.2 --- Spectral Subtraction Algorithm --- p.66
4.3 --- LP Analysis of Noisy Speech --- p.67
4.3.1 --- LP Inverse Filtering: Whitening Process --- p.68
4.3.2 --- Magnitude Response of All-pole Filter in Noisy Condition --- p.70
4.3.3 --- Noise Spectral Reshaping --- p.72
4.4 --- Distinctive Vocal Tract and Vocal Source Feature Extraction --- p.73
4.4.1 --- Vocal Tract Feature Extraction --- p.73
4.4.2 --- Source Feature Generation Procedure --- p.75
4.4.3 --- Subband-specific Parameterization Method --- p.79
4.5 --- Summary --- p.87
Chapter 5 --- Speaker Recognition Tasks & Performance Evaluation --- p.88
5.1 --- Speaker Recognition Experimental Setup --- p.89
5.1.1 --- Task Description --- p.89
5.1.2 --- Baseline Experiments --- p.90
5.1.3 --- Identification and Verification Results --- p.91
5.2 --- Speaker Recognition using Source-tract Features --- p.92
5.2.1 --- Source Feature Selection --- p.92
5.2.2 --- Source-tract Feature Fusion --- p.94
5.2.3 --- Identification and Verification Results --- p.95
5.3 --- Performance Analysis --- p.98
Chapter 6 --- Conclusion --- p.102
6.1 --- Discussion and Conclusion --- p.102
6.2 --- Suggestion of Future Work --- p.104
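The spectral-subtractive preprocessing outlined in Section 4.2 can be sketched as below: estimate the noise magnitude spectrum from leading frames assumed to be speech-free, oversubtract it from each frame, and floor the result before resynthesis. The oversubtraction factor and spectral floor are common defaults, not values from the thesis.

    import numpy as np

    def spectral_subtract(frames, n_noise=10, alpha=2.0, beta=0.01):
        """frames: (T, n_fft) windowed time frames of the noisy signal."""
        spec = np.fft.rfft(frames, axis=1)
        mag, phase = np.abs(spec), np.angle(spec)
        noise_mag = mag[:n_noise].mean(axis=0)       # noise spectrum estimate
        clean = np.maximum(mag - alpha * noise_mag,  # oversubtraction
                           beta * mag)               # spectral floor
        return np.fft.irfft(clean * np.exp(1j * phase), axis=1)

    frames = np.random.randn(100, 512)               # stand-in framed signal
    print(spectral_subtract(frames).shape)           # (100, 512)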
APA, Harvard, Vancouver, ISO, and other styles
21

"Exploitation of phase and vocal excitation modulation features for robust speaker recognition." Thesis, 2011. http://library.cuhk.edu.hk/record=b6075192.

Full text
Abstract:
Mel-frequency cepstral coefficients (MFCCs) are widely adopted in speech recognition as well as in speaker recognition applications. They are extracted primarily to characterize the spectral envelope of a quasi-stationary speech segment, and it has been shown that cepstral features are closely related to the linguistic content of speech. Besides the magnitude-based cepstral features, other resources in speech, e.g. the phase and the excitation source, are believed to carry useful properties for speaker discrimination. Moreover, in real situations, large variations exist between the development and application scenarios of a speaker recognition system, including channel mismatch, recording-apparatus mismatch, environmental variation, or even changes in the emotional or health state of the speakers. As a consequence, magnitude-based features alone are insufficient to provide satisfactory and robust speaker recognition accuracy. The exploitation of features complementary to the MFCCs may therefore offer one way, from a feature-based perspective, to alleviate this deficiency.
Speaker recognition (SR) refers to the process of automatically determining or verifying the identity of a person based on his or her voice characteristics. In practical applications, a voice can be used as one of the modalities in a multimodal biometric system, or be the sole medium for identity authentication. The general area of speaker recognition encompasses two fundamental tasks: speaker identification and speaker verification.
Wang, Ning.
Adviser: Pak-Chung Ching.
Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: .
Thesis (Ph.D.)--Chinese University of Hong Kong, 2011.
Includes bibliographical references (leaves 177-193).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
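A minimal librosa sketch of the magnitude-based baseline that the abstract argues is insufficient on its own: MFCCs computed from the short-time magnitude spectrum, with a placeholder marking where complementary features would be fused. The file name is hypothetical, and the thesis's actual phase and excitation modulation features are not reproduced here.

    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical utterance
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, T) envelope features
    # phase_exc = extract_phase_excitation(y, sr)       # thesis-specific, omitted
    # fused = np.vstack([mfcc, phase_exc])              # feature-level fusion
    print(mfcc.shape)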
APA, Harvard, Vancouver, ISO, and other styles
22

Wei, Ciou, and 邱薇. "Detecting Emotional Responses of the Customer Service Staff and Exploring General-critical-vocal Features." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/gvu5be.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Industrial Engineering and Management
107
As consumer consciousness rises, an inappropriate dialogue tone from customer service staff directly affects customer satisfaction. Developing an instant emotion discrimination system that flags improper emotional responses of service personnel would therefore help supervisors provide timely assistance and improve service quality. In addition, many past studies have used different corpora for voice emotion recognition, but the important voice features may vary with the corpus or its language, and so far no research has identified a single best combination of voice features (Anagnostopoulos et al., 2015). This research is therefore divided into two stages. The first stage uses customer service voice data to develop a voice emotion recognition method. The second stage uses three corpora in three languages for voice emotion recognition and classification, and attempts to identify the critical features through several selection schemes. The first stage comprised six steps: (1) labelling the dialogue data as good, moderate or bad; (2) preprocessing the dialogue data (removing silent parts, reducing noise, and keeping only the voice of the customer service staff) and visualizing the original sound files with the t-SNE model; (3) extracting 384 voice features with OpenSMILE; (4) selecting features by principal component analysis, Fisher's criterion, analysis of variance (ANOVA) and Random Forest; (5) building recognition models with OneClassSVM, SVM, Random Forest, ANN and CNN, and comparing the effectiveness of the feature selection methods and the modelling performance; and (6) verifying the data tags. The second stage comprised five steps: (1) screening the corpora of the three languages and visualizing the original sound files with the t-SNE model; (2) extracting 384 voice features with OpenSMILE; (3) selecting the critical voice features by ANOVA from different feature combinations; (4) building SVM models with different combinations of training and testing data; and (5) comparing the classification results of each critical-feature combination. The first-stage results showed that using Fisher's criterion, ANOVA and Random Forest for feature selection not only effectively increases the accuracy of model prediction but, more importantly, helps to reveal the important relationships between emotion and voice features; these results help detect an inappropriate dialogue tone of customer service staff and hence improve service quality. The second-stage results showed that selecting features over all corpora, considering both emotion and language, improved the accuracy on the German corpus from 81.62% to 85.12%, and also confirmed the difficulty of classifying corpora of different languages. While most previous voice emotion recognition research has pursued the best accuracy on a single corpus, this study argues that recognizing multiple corpora at the same time is the biggest challenge in today's voice emotion recognition research and the trend for future work.
The purpose of most studies is for their methods to be applicable to daily life in the future, and methods that can recognize multiple voice emotion datasets can be applied practically in daily life today. Keywords: voice emotion recognition, analysis of variance, voice feature selection, corpora of different languages
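A small scikit-learn sketch of the ANOVA-based selection step described in the abstract: rank the 384 OpenSMILE descriptors with a one-way F-test and train an SVM on the top-ranked ones. The arrays, the three-class labels and the choice of 40 features are placeholders, not the study's data or settings.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    X = np.random.randn(300, 384)      # stand-in for OpenSMILE feature vectors
    y = np.random.randint(0, 3, 300)   # stand-in labels for three emotions

    clf = make_pipeline(SelectKBest(f_classif, k=40), SVC(kernel="rbf"))
    clf.fit(X, y)
    print(clf.score(X, y))             # training accuracy on the toy data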
APA, Harvard, Vancouver, ISO, and other styles
23

Zita, Aleš. "Automatic analysis of videokymographic images by means of higher-level features." Master's thesis, 2013. http://www.nusl.cz/ntk/nusl-324574.

Full text
Abstract:
Human voice diagnosis is a complicated problem, even nowadays. The reasons are the poor accessibility of the vocal organ itself and the high frequencies of vocal fold vibrations. One of the clinically available imaging methods addressing these problems is videokymography, a technology for capturing vocal fold vibrations with a special line-scan CCD camera; the individual lines, stacked on top of each other, form the videokymographic recording. Videokymographic images are suitable for automatic extraction of characteristics, thereby helping to reduce the laryngologist's workload. For this purpose, a set of such methods is being developed in the Department of Image Processing at the Institute of Information Theory and Automation of the Academy of Sciences of the Czech Republic. Determining the position and shape of the ventricular bands is one of the important but difficult tasks. The aim of this thesis is to propose a new method for automatic detection of the ventricular band in videokymographic recordings using digital image processing techniques.
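One very rough classical baseline for the band-localization task described above, as a numpy sketch: project image intensity onto rows and take the dark horizontal band as the candidate region. The synthetic image and the mean-minus-one-standard-deviation threshold are assumptions; the thesis's actual method is more elaborate.

    import numpy as np

    img = np.full((200, 640), 200.0)   # bright synthetic kymographic image
    img[60:80, :] = 40.0               # dark horizontal band (the target)

    profile = img.mean(axis=1)         # row-wise mean intensity
    band = np.where(profile < profile.mean() - profile.std())[0]
    print(band.min(), band.max())      # 60 79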
APA, Harvard, Vancouver, ISO, and other styles
24

Саковська, Антоніна Андріївна, and Antonina Andriivna Sakovska. "Формування стилю бельканто у майбутнього співака." Master's thesis, 2021. http://repository.sspu.edu.ua/handle/123456789/11785.

Full text
Abstract:
The master's research is devoted to an important topic in vocal pedagogy. The analysis of the theoretical foundations of the formation of the bel canto style reveals the characteristic features of bel canto, drawing on vocal treatises of the 18th and the first half of the 19th centuries, as well as the essence and specific features of bel canto singing, its physiological and acoustic properties, and the register structure of the singing voice. The methodological foundations for forming bel canto skills are presented through determining the influence of the structure and development of a melody on the formation of bel canto singing skills, and through pedagogical conditions and methodological recommendations for forming bel canto singing skills in the future singer.
APA, Harvard, Vancouver, ISO, and other styles