Dissertations / Theses on the topic 'Acoustic speech features'

Consult the top 48 dissertations / theses for your research on the topic 'Acoustic speech features.'

1

Leung, Ka Yee. "Combining acoustic features and articulatory features for speech recognition /." View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20LEUNGK.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2002.
Includes bibliographical references (leaves 92-96). Also available in electronic version. Access restricted to campus users.
2

Juneja, Amit. "Speech recognition based on phonetic features and acoustic landmarks." College Park, Md. : University of Maryland, 2004. http://hdl.handle.net/1903/2148.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2004.
Thesis research directed by: Electrical Engineering. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
3

Tyson, Na'im R. "Exploration of Acoustic Features for Automatic Vowel Discrimination in Spontaneous Speech." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1339695879.

4

Sun, Rui. "The evaluation of the stability of acoustic features in affective conveyance across multiple emotional databases." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/49041.

Abstract:
The objective of the research presented in this thesis was to systematically investigate the computational structure for cross-database emotion recognition. The research consisted of evaluating the stability of acoustic features, particularly the glottal and Teager Energy based features, and investigating three normalization methods and two data fusion techniques. One of the challenges of cross-database training and testing is accounting for the potential variation in the types of emotions expressed as well as the recording conditions. In an attempt to alleviate the impact of these types of variations, three normalization methods on the acoustic data were studied. Motivated by the lack of an emotional database large and diverse enough to train the classifier, using multiple databases for training posed another challenge: data fusion. This thesis proposed two data fusion techniques, pre-classification SDS and post-classification ROVER, to study the issue. Using the glottal, TEO and TECC features, whose stability in distinguishing emotions has been demonstrated on multiple databases, the systematic computational structure proposed in this thesis improved the performance of cross-database binary emotion recognition by up to 23% for neutral vs. emotional and 10% for positive vs. negative.
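The abstract names three normalization methods without detailing them. A common baseline for reducing corpus-dependent recording variation is per-corpus z-normalization of the acoustic features; the sketch below is an illustration under that assumption, not the thesis's exact procedure.

import numpy as np

def zscore_normalize_per_corpus(features_by_corpus):
    """Z-normalize each corpus independently so that features recorded
    under different conditions share a common scale.
    features_by_corpus: dict mapping corpus name -> (n_frames, n_feats)."""
    normalized = {}
    for name, X in features_by_corpus.items():
        mu = X.mean(axis=0)
        sigma = X.std(axis=0) + 1e-8  # guard against constant features
        normalized[name] = (X - mu) / sigma
    return normalized

# Example: two toy corpora with different scales end up comparable.
rng = np.random.default_rng(0)
corpora = {"emoA": rng.normal(5.0, 2.0, (100, 3)),
           "emoB": rng.normal(-1.0, 0.5, (80, 3))}
norm = zscore_normalize_per_corpus(corpora)
print(norm["emoA"].mean(axis=0).round(3), norm["emoB"].std(axis=0).round(3))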
5

Torres, Juan Félix. "Estimation of glottal source features from the spectral envelope of the acoustic speech signal." Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/34736.

Abstract:
Speech communication encompasses diverse types of information, including phonetics, affective state, voice quality, and speaker identity. From a speech production standpoint, the acoustic speech signal can be mainly divided into glottal source and vocal tract components, which play distinct roles in rendering the various types of information it contains. Most deployed speech analysis systems, however, do not explicitly represent these two components as distinct entities, as their joint estimation from the acoustic speech signal becomes an ill-defined blind deconvolution problem. Nevertheless, because of the desire to understand glottal behavior and how it relates to perceived voice quality, there has been continued interest in explicitly estimating the glottal component of the speech signal. To this end, several inverse filtering (IF) algorithms have been proposed, but they are unreliable in practice because of the blind formulation of the separation problem. In an effort to develop a method that can bypass the challenging IF process, this thesis proposes a new glottal source information extraction method that relies on supervised machine learning to transform smoothed spectral representations of speech, which are already used in some of the most widely deployed and successful speech analysis applications, into a set of glottal source features. A transformation method based on Gaussian mixture regression (GMR) is presented and compared to current IF methods in terms of feature similarity, reliability, and speaker discrimination capability on a large speech corpus, and potential representations of the spectral envelope of speech are investigated for their ability to represent glottal source variation in a predictable manner. The proposed system was found to produce glottal source features that reasonably matched their IF counterparts in many cases, while being less susceptible to spurious errors. The development of the proposed method entailed a study into the aspects of glottal source information that are already contained within the spectral features commonly used in speech analysis, yielding an objective assessment regarding the expected advantages of explicitly using glottal information extracted from the speech signal via currently available IF methods, versus the alternative of relying on the glottal source information that is implicitly contained in spectral envelope representations.
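For readers unfamiliar with Gaussian mixture regression, the transformation at the heart of this approach can be sketched in a few lines: fit a joint GMM over stacked input (spectral) and output (glottal) features, then predict the conditional mean of the outputs given an input. This is a generic GMR sketch, not the thesis's trained system.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

class GMR:
    """Gaussian mixture regression: fit a joint GMM on [X, Y], then
    predict E[Y | x] as a posterior-weighted sum of per-component
    linear regressions."""

    def __init__(self, n_components=4):
        self.gmm = GaussianMixture(n_components=n_components,
                                   covariance_type="full", random_state=0)

    def fit(self, X, Y):
        self.dx = X.shape[1]
        self.gmm.fit(np.hstack([X, Y]))
        return self

    def predict(self, X):
        dx, mus = self.dx, self.gmm.means_
        covs, ws = self.gmm.covariances_, self.gmm.weights_
        K, dy = len(ws), mus.shape[1] - dx
        out = np.zeros((len(X), dy))
        for n, x in enumerate(X):
            # component responsibilities given the input part only
            lik = np.array([ws[k] * multivariate_normal.pdf(
                x, mus[k, :dx], covs[k][:dx, :dx]) for k in range(K)])
            r = lik / lik.sum()
            for k in range(K):
                gain = covs[k][dx:, :dx] @ np.linalg.inv(covs[k][:dx, :dx])
                out[n] += r[k] * (mus[k, dx:] + gain @ (x - mus[k, :dx]))
        return out

# X: smoothed spectral-envelope features; Y: glottal source features.
# model = GMR(8).fit(X_train, Y_train); Y_hat = model.predict(X_test)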
6

Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.

7

Diekema, Emily D. "Acoustic Measurements of Clear Speech Cue Fade in Adults with Idiopathic Parkinson Disease." Bowling Green State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1460063159.

8

Tran, Thi-Anh-Xuan. "Acoustic gesture modeling. Application to a Vietnamese speech recognition system." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAT023/document.

Abstract:
Speech plays a vital role in human communication. Selection of relevant acoustic speech features is key to the design of any system using speech processing. For some 40 years, speech was typically considered as a sequence of quasi-stable portions of signal (vowels) separated by transitions (consonants). Despite a wealth of studies that clearly document the importance of coarticulation, and reveal that articulatory and acoustic targets are not context-independent, the view that each vowel has an acoustic target that can be specified in a context-independent manner remains widespread. This point of view entails strong limitations. It is well known that formant frequencies are acoustic characteristics that bear a clear relationship with speech production, and that can distinguish among vowels. Therefore, vowels are generally described with static articulatory configurations represented by targets in the acoustic space, typically by formant frequencies in the F1-F2 and F2-F3 planes. Plosive consonants can be described in terms of places of articulation, represented by loci or locus equations in an acoustic plane. But formant frequency trajectories in fluent speech rarely display a steady state for each vowel. They vary with the speaker, the consonantal environment (coarticulation) and the speaking rate (relating to a continuum between hypo- and hyper-articulation). In view of the inherent limitations of static approaches, the approach adopted here consists in studying both vowels and consonants from a dynamic point of view.
First, we studied the effects of the impulse response at the beginning, at the end and during transitions of the signal, both in the speech signal and at the perception level. Variations of the phases of the components were then examined. Results show that the effects of these parameters can be observed in spectrograms. Crucially, the amplitudes of the spectral components distinguished under the approach advocated here are sufficient for perceptual discrimination. Following this result, all subsequent speech analysis focuses on the amplitude domain, deliberately leaving aside phase information. Next we extend the work to vowel-consonant-vowel perception from a dynamic point of view. These perceptual results, together with those obtained earlier by Carré (2009a), show that vowel-to-vowel and vowel-consonant-vowel stimuli can be characterized and separated by the direction and rate of the transitions in the formant plane, even when absolute frequency values are outside the vowel triangle (i.e. the vowel acoustic space in absolute values). Because of the limitations of formant measurement, the dynamic approach needs new tools based on parameters that can replace formant frequency estimation. Spectral Subband Centroid Frequency (SSCF) features were studied; comparison with vowel formant frequencies shows that SSCFs can replace formant frequencies and act as "pseudo-formants" even during consonant production. On this basis, SSCF is used as a tool to compute dynamic characteristics, and we propose a new way to model dynamic speech features, which we call SSCF Angles. Our analysis of SSCF Angles was performed on transitions of vowel-to-vowel (V1V2) sequences in both Vietnamese and French, where SSCF Angles proved to be reliable and robust parameters.
For each language, the analysis results show that: (i) SSCF Angles can distinguish V1V2 transitions; (ii) V1V2 and V2V1 have symmetrical properties in the acoustic domain based on SSCF Angles; (iii) SSCF Angles for male and female speakers are fairly similar for the same V1V2 transition; and (iv) they are more or less invariant to speech rate (normal and fast). Finally, these dynamic acoustic speech features were used in a Vietnamese automatic speech recognition system, with several interesting results.
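As a rough, self-contained illustration of the SSCF idea described above (band edges, frame step and the angle normalization below are assumptions, not the thesis's settings):

import numpy as np
import librosa

def sscf(y, sr, band_edges=(200, 800, 1600, 2800, 4400), n_fft=512, hop=160):
    """Spectral Subband Centroid Frequencies: one power-weighted centroid
    per subband and frame, usable as "pseudo-formants"."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    edges = (0,) + band_edges
    cents = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = (freqs >= lo) & (freqs < hi)
        num = (freqs[band, None] * S[band]).sum(axis=0)
        den = S[band].sum(axis=0) + 1e-10
        cents.append(num / den)
    return np.array(cents)            # shape (n_bands, n_frames)

def sscf_angles(C, frame_step_ms=10.0):
    """Direction of each SSCF trajectory between consecutive frames, as an
    angle in the (time, frequency) plane. How frequency change is scaled
    against the time step is a normalization choice that strongly affects
    the angles; the raw Hz-per-ms convention here is only illustrative."""
    dF = np.diff(C, axis=1)           # Hz change per frame
    return np.degrees(np.arctan2(dF, frame_step_ms))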
9

Wang, Yuxuan. "Supervised Speech Separation Using Deep Neural Networks." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1426366690.

10

Chen, Jitong. "On Generalization of Supervised Speech Separation." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492038295603502.

11

Wang, Xiaoyun. "Phoneme set design for second language speech recognition." Thesis, Doshisha University, 2017. https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB13044980/?lang=0.

Abstract:
This dissertation focuses on the problem caused by confusable mispronunciations in order to improve the recognition performance of second-language speech. Second-language speech is treated as an information source whose distribution of acoustic features differs from that of native speakers, and a reduced phoneme set suited to representing it is derived by a novel method considering integrated acoustic and linguistic features: the optimal phoneme set is determined by a criterion that combines the articulatory similarity (place and manner of articulation) between the target second language and the speaker's mother tongue with the degradation of word discriminability caused by the resulting homophones. The customized phoneme set is created with a phonetic decision tree (PDT)-based top-down sequential splitting method that utilizes the phonological knowledge of L1 and L2. The dissertation verifies the efficacy of the proposed method for the English of Japanese learners under various conditions, and shows that a recognizer built with the proposed method alleviates the problems caused by confusable mispronunciations by second-language speakers.
Doctor of Philosophy in Engineering, Doshisha University.
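The exact PDT-splitting criterion is not reproduced in the abstract. The simplified sketch below conveys the trade-off it describes, greedily merging acoustically confusable phones while monitoring the homophones such merges create; all data structures and the stopping threshold are hypothetical placeholders, not the dissertation's method.

import numpy as np

def reduce_phoneme_set(conf, phones, lexicon, max_homophone_rate=0.02):
    """Greedily merge the most acoustically confused phone pair until
    merging would create too many homophones. `conf` is a phone confusion
    matrix (rows/cols ordered as `phones`); `lexicon` maps words to phone
    sequences. Returns a phone -> merged-class mapping."""
    idx = {p: i for i, p in enumerate(phones)}
    merged = {p: p for p in phones}

    def homophone_rate(mapping):
        prons = {}
        for word, seq in lexicon.items():
            prons.setdefault(tuple(mapping[p] for p in seq), []).append(word)
        clashes = sum(len(ws) - 1 for ws in prons.values() if len(ws) > 1)
        return clashes / max(len(lexicon), 1)

    while len(set(merged.values())) > 2:
        classes = sorted(set(merged.values()))
        pairs = [(conf[idx[a], idx[b]] + conf[idx[b], idx[a]], a, b)
                 for a in classes for b in classes if a < b]
        score, a, b = max(pairs)
        if score == 0:
            break
        trial = {p: (a if c == b else c) for p, c in merged.items()}
        if homophone_rate(trial) > max_homophone_rate:
            break  # merging would hurt word discriminability; stop here
        merged = trial
    return merged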
12

Bezůšek, Marek. "Objektivizace Testu 3F - dysartrický profil pomocí akustické analýzy." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2021. http://www.nusl.cz/ntk/nusl-442568.

Abstract:
Test 3F is used to diagnose the extent of motor speech disorder (dysarthria) in Czech speakers. The evaluation of dysarthric speech is distorted by subjective assessment, and the motivation behind this thesis is that few automatic, objective analysis tools exist for evaluating the phonation, articulation, prosody and respiration of disordered speech. The aim of this diploma thesis is to identify, implement and test acoustic features of speech that could be used to objectify and automate the evaluation. These features should be easily interpretable by the clinician. It is assumed that the evaluation could be more precise because of the detailed analysis that acoustic features provide. The performance of these features was tested on a database of 151 Czech speakers, consisting of 51 healthy speakers and 100 patients. Statistical analysis and machine learning methods were used to identify the correlation between features and subjective assessment. 27 of the 30 speech tasks of Test 3F were identified as suitable for automatic evaluation. Within the scope of this thesis only 10 tasks of Test 3F were tested, because only a limited part of the database could be preprocessed. The result of the statistical analysis is the 14 features that were most useful for test evaluation. The most significant features are MET (respiration), relF0SD (intonation), and relSEOVR (voice intensity, prosody). The lowest prediction error of the machine-learning regression models was 7.14%. The conclusion is that the evaluation of most Test 3F tasks can be automated. The analysis of the 10 tasks shows that the most significant factors in dysarthria evaluation are limited expiration, monotone voice, and low variability of speech intensity.
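Two of the features singled out above lend themselves to a compact illustration. The sketch below, assuming librosa and illustrative parameter choices that are not taken from the thesis, computes a relF0SD-style intonation measure and a crude intensity-variability stand-in:

import numpy as np
import librosa

def rel_f0_sd(wav_path):
    """relF0SD-style measure: standard deviation of voiced F0 divided by
    its mean. The exact Test 3F definition may differ."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[voiced]                   # keep voiced frames only
    return float(np.std(f0) / np.mean(f0))

def intensity_sd_db(wav_path):
    """Variability of short-time energy in dB, a crude stand-in for the
    voice-intensity prosody measure (relSEOVR) named in the abstract."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]
    return float(np.std(20 * np.log10(rms + 1e-10)))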
13

Tomashenko, Natalia. "Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems." Thesis, Le Mans, 2017. http://www.theses.fr/2017LEMA1040/document.

Abstract:
Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common techniques in ASR systems for many decades, speaker adaptation is very effective for these AMs, and various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved big advances and outperformed GMM-HMM models on various ASR tasks, but speaker adaptation is still very challenging for these AMs: many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated, based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques, and complementary to them.
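A minimal sketch of the GMM-derived feature idea, assuming scikit-learn and a diagonal-covariance GMM standing in for the thesis's acoustic models: frame log-posteriors under a trained GMM become auxiliary inputs that a DNN can consume, for example concatenated with the original MFCCs. Classical GMM adaptation (e.g. MAP) can then update the GMM before these features are recomputed for a new speaker.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_derived_features(train_frames, frames, n_components=64):
    """Train a GMM on background frames, then return per-frame component
    log-posteriors as a GMM-derived feature stream for a DNN."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(train_frames)
    post = gmm.predict_proba(frames)   # (T, n_components) posteriors
    return np.log(post + 1e-10), gmm

# Usage idea: dnn_input = np.hstack([mfcc_frames, log_post])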
14

Anderson, Jill M. "Lateralization Effects of Brainstem Responses and Middle Latency Responses to a Complex Tone and Speech Syllable." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1313687765.

15

Zolnay, András. "Acoustic feature combination for speech recognition." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=982202156.

16

Selmini, Antonio Marcos. "Sistema baseado em regras para o refinamento da segmentação automatica de fala." [s.n.], 2008. http://repositorio.unicamp.br/jspui/handle/REPOSIP/260756.

Abstract:
Advisor: Fabio Violaro
Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação
The demand for reliable automatic speech segmentation is increasing and requires additional research to support the development of systems that use speech for man-machine interfaces. In this context, this work reports the development and evaluation of a system for automatic speech segmentation using the Viterbi algorithm and a refinement of segmentation boundaries based on acoustic-phonetic features. Phonetic sub-units (context-dependent phones) are modeled with HMMs (Hidden Markov Models). Each boundary estimated by the Viterbi algorithm is refined using class-dependent acoustic features, as the identity of the phones on the left and right side of the considered boundary is known. The proposed system was evaluated using two speaker-dependent Brazilian Portuguese speech databases (one male and one female speaker), and a speaker-independent English database (TIMIT). The evaluation was carried out by comparing automatic against manual segmentation. After the refinement process, an improvement of 29% in the percentage of segmentation errors below 20 ms was achieved for the male speaker-dependent Brazilian Portuguese database.
Doctorate in Electrical Engineering (Telecommunications and Telematics).
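The class-dependent refinement rules themselves are not given in the abstract. A generic stand-in is to re-locate each Viterbi boundary at the point of maximal spectral change within a small search window, as sketched below; the window size and the spectral-flux cue are illustrative assumptions, not the thesis's phone-class rules.

import numpy as np
import librosa

def refine_boundary(y, sr, t0, search_ms=20.0, n_fft=256, hop=40):
    """Move an initial boundary estimate (t0, seconds) to the frame of
    maximal spectral flux within +/- search_ms. Returns the refined time."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    flux = np.r_[0.0, np.sum(np.diff(S, axis=1) ** 2, axis=0)]
    f0 = int(t0 * sr / hop)
    half = int(search_ms / 1000 * sr / hop)
    lo, hi = max(f0 - half, 0), min(f0 + half + 1, len(flux))
    return (lo + int(np.argmax(flux[lo:hi]))) * hop / sr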
17

DiCicco, Thomas M. Jr (Thomas Minotti). "Optimization of acoustic feature extraction from dysarthric speech." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/57781.

Abstract:
Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, February 2010.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 171-180).
Dysarthria is a motor speech disorder characterized by weak or uncoordinated movements of the speech musculature. While unfamiliar listeners struggle to understand speakers with severe dysarthria, familiar listeners are often able to comprehend with high accuracy. This observation implies that although the speech produced by an individual with dysarthria may appear distorted and unintelligible to the untrained listener, there must be a set of consistent acoustic cues that the familiar communication partner is able to interpret. While dysarthric speech has been characterized both acoustically and perceptually, most accounts tend to compare dysarthric productions to those of healthy controls rather than identify the set of reliable and consistently controlled segmental cues. This work aimed to elucidate possible recognition strategies used by familiar listeners by optimizing a model of human speech recognition, Stevens' Lexical Access from Features (LAFF) framework, for ten individual speakers with dysarthria (SWDs). The LAFF model is rooted in distinctive feature theory, with acoustic landmarks indicating changes in the manner of articulation. The acoustic correlates manifested around landmarks provide the identity of articulator-free (manner) and articulator-bound (place) features. SWDs created weaker consonantal landmarks, likely due to an inability to form complete closures in the vocal tract and to fully release consonantal constrictions. Identification of speaker-optimized acoustic correlate sets improved discrimination of each speaker's productions, evidenced by increased sensitivity and specificity. While there was overlap between the types of correlates identified for healthy and dysarthric speakers, using the optimal sets of correlates identified for SWDs adversely impaired discrimination of healthy speech. These results suggest that the combinations of correlates suggested for SWDs were specific to the individual and different from the segmental cues used by healthy individuals. Application of the LAFF model to dysarthric speech has potential clinical utility as a diagnostic tool, highlighting the fine-grain components of speech production that require intervention and quantifying the degree of impairment.
by Thomas M. DiCicco, Jr.
Ph.D.
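Landmark-based analysis of this kind can be approximated with a simple energy rate-of-change detector. The sketch below flags abrupt high-band energy changes as candidate consonantal landmarks; the band, threshold and frame settings are illustrative assumptions, far simpler than the LAFF implementation.

import numpy as np
import librosa
from scipy.signal import find_peaks

def consonantal_landmarks(y, sr, band=(800, 8000), n_fft=512, hop=160,
                          delta_db=9.0):
    """Return times (seconds) of abrupt rises/falls in high-band energy,
    loosely following Stevens-style landmark detection."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    sel = (freqs >= band[0]) & (freqs < band[1])
    e_db = 10 * np.log10(S[sel].sum(axis=0) + 1e-10)
    roc = np.abs(np.diff(e_db))            # frame-to-frame dB change
    peaks, _ = find_peaks(roc, height=delta_db)
    return peaks * hop / sr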
18

Gajic, Bojana. "Feature Extraction for Automatic Speech Recognition in Noisy Acoustic Environments." Doctoral thesis, Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, 2002. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-441.

Abstract:

This thesis presents a study of alternative speech feature extraction methods aimed at increasing robustness of automatic speech recognition (ASR) against additive background noise.

Spectral peak positions of speech signals remain practically unchanged in presence of additive background noise. Thus, it was expected that emphasizing spectral peak positions in speech feature extraction would result in improved noise robustness of ASR systems. If frequency subbands are properly chosen, dominant subband frequencies can serve as reasonable estimates of spectral peak positions. Thus, different methods for incorporating dominant subband frequencies into speech feature vectors were investigated in this study.

To begin with, two earlier proposed feature extraction methods that utilize dominant subband frequency information were examined. The first one uses zero-crossing statistics of the subband signals to estimate dominant subband frequencies, while the second one uses subband spectral centroids. The methods were compared with the standard MFCC feature extraction method on two different recognition tasks in various background conditions. The first method was shown to improve ASR performance on both recognition tasks at sufficiently high noise levels. The improvement was, however, smaller on the more complex recognition task. The second method, on the other hand, led to some reduction in ASR performance in all testing conditions.

Next, a new method for incorporating subband spectral centroids into speech feature vectors was proposed, and was shown to be considerably more robust than the standard MFCC method on both ASR tasks. The main difference between the proposed method and the zero-crossing based method is in the way they utilize dominant subband frequency information. It was shown that the performance improvement due to the use of dominant subband frequency information was considerably larger for the proposed method than for the ZCPA method, especially on the more complex recognition task. Finally, the computational complexity of the proposed method is two orders of magnitude lower than that of the zero-crossing based method, and of the same order of magnitude as the standard MFCC method.
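The zero-crossing route to dominant subband frequencies mentioned above is simple to illustrate: within each bandpass-filtered subband, the zero-crossing rate approximates twice the dominant frequency. The sketch below uses illustrative band edges (it assumes the sampling rate exceeds twice the top edge) and is not the ZCPA front end itself.

import numpy as np
from scipy.signal import butter, sosfilt

def dominant_subband_freqs(y, sr, edges=(300, 800, 1500, 2500, 4000)):
    """Estimate the dominant frequency of each subband from its
    zero-crossing count (f ~ ZCR / 2). Returns one value per band, Hz."""
    bands = list(zip((100,) + edges[:-1], edges))
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        b = sosfilt(sos, y)
        crossings = np.sum(np.signbit(b[:-1]) != np.signbit(b[1:]))
        feats.append(0.5 * crossings * sr / len(b))
    return np.array(feats)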

19

Darch, Jonathan J. A. "Robust acoustic speech feature prediction from Mel frequency cepstral coefficients." Thesis, University of East Anglia, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445206.

20

Hamlet, Sean Michael. "COMPARING ACOUSTIC GLOTTAL FEATURE EXTRACTION METHODS WITH SIMULTANEOUSLY RECORDED HIGH-SPEED VIDEO FEATURES FOR CLINICALLY OBTAINED DATA." UKnowledge, 2012. http://uknowledge.uky.edu/ece_etds/12.

Abstract:
Accurate methods for glottal feature extraction include the use of high-speed video imaging (HSVI). There have been previous attempts to extract these features from the acoustic recording; however, none of those methods compared their results with an objective method such as HSVI. This thesis tests these acoustic methods against a large, diverse population of 46 subjects. Two previously studied acoustic methods, as well as one introduced in this thesis, were compared against two video methods, area and displacement, for open quotient (OQ) estimation. The area comparison proved somewhat ambiguous and challenging due to thresholding effects. The displacement comparison, which is based on glottal edge tracking, proved to be a more robust comparison method than the area. The first acoustic method's OQ estimate had a relatively small average error of 8.90%, and the second method had a relatively large average error of -59.05% compared to the displacement OQ. The newly proposed method had a relatively small error of -13.75% when compared to the displacement OQ. Although the acoustic methods showed relatively high error in places, they may be utilized to augment the features collected by HSVI for a more accurate glottal feature estimation.
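Open quotient itself has a compact operational definition. Under the kind of thresholding convention that drives much of the area-vs-displacement disagreement discussed above (the 10% threshold here is an illustrative choice), a per-cycle estimate looks like:

import numpy as np

def open_quotient(glottal_cycle, open_threshold=0.1):
    """Open quotient of one glottal cycle: fraction of the period during
    which the (area or displacement) waveform exceeds a small threshold
    relative to its peak."""
    w = np.asarray(glottal_cycle, dtype=float)
    w = w - w.min()
    open_phase = w > open_threshold * w.max()
    return open_phase.sum() / len(w)

# Toy cycle: half-sine "open" phase followed by a closed phase.
cycle = np.r_[np.sin(np.linspace(0, np.pi, 60)), np.zeros(40)]
print(round(open_quotient(cycle), 2))   # 0.56 for this toy cycle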
21

Pandit, Medha. "Voice and lip based speaker verification." Thesis, University of Surrey, 2000. http://epubs.surrey.ac.uk/915/.

22

TAKEDA, Kazuya, Norihide KITAOKA, and Makoto SAKAI. "Acoustic Feature Transformation Combining Average and Maximum Classification Error Minimization Criteria." Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14970.

23

TAKEDA, Kazuya, Norihide KITAOKA, and Makoto SAKAI. "Acoustic Feature Transformation Based on Discriminant Analysis Preserving Local Structure for Speech Recognition." Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14969.

24

TAKEDA, Kazuya, Seiichi NAKAGAWA, Yuya HATTORI, Norihide KITAOKA, and Makoto SAKAI. "Evaluation of Combinational Use of Discriminant Analysis-Based Acoustic Feature Transformation and Discriminative Training." Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14968.

25

Bagchi, Deblin. "Transfer learning approaches for feature denoising and low-resource speech recognition." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1577641434371497.

26

Temko, Andriy. "Acoustic event detection and classification." Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.

Abstract:
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources.
First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, a choice motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem and tested on the gathered database. With it, a higher classification rate than the GMM-based technique is obtained, arriving at a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems in search of a higher classification rate. The developed classification systems are also tested by participating in several international evaluations from 2004 to 2006, and the results are reported.
The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007, using two kinds of databases: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. The developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not based on Hidden Markov Models, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since it is an especially important particular case of acoustic event detection. An enhanced SVM training approach was developed for this task, mainly to cope with the problem of dataset reduction. The resulting SVM-based system was tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than the GMM-based system that ranked among the best systems in the RT06 evaluation.
Finally, it is worth mentioning a few side outcomes of this thesis work. As it was carried out in the framework of the CHIL EU project, the author was responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and especially in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in UPC's smart-room, where they work in real time for purposes of testing and demonstration.
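The sliding-window detection architecture described above reduces to a few lines. The sketch below assumes a previously fitted scikit-learn classifier (e.g. sklearn.svm.SVC) and per-frame features; the window sizes and the mean/std window summary are illustrative choices, not the thesis's configuration.

import numpy as np
from scipy.signal import medfilt

def detect_events(frame_feats, clf, win=50, step=25, smooth=5):
    """Sliding-window acoustic event detection: summarize each window by
    per-dimension mean and std, classify it with `clf`, and median-filter
    the resulting label track as post-processing. Returns window-center
    frame indices and smoothed labels."""
    centers, labels = [], []
    for start in range(0, len(frame_feats) - win + 1, step):
        chunk = frame_feats[start:start + win]
        stats = np.r_[chunk.mean(axis=0), chunk.std(axis=0)]
        labels.append(int(clf.predict(stats[None, :])[0]))
        centers.append(start + win // 2)
    return np.array(centers), medfilt(np.array(labels), kernel_size=smooth)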
27

Kleynhans, Neil Taylor. "Automatic speech recognition for resource-scarce environments / N.T. Kleynhans." Thesis, North-West University, 2013. http://hdl.handle.net/10394/9668.

Abstract:
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and due to this resource scarcity the deployment of ASR systems in the developing world is severely inhibited. In this thesis we present research into developing techniques and tools to (1) harvest audio data, (2) rapidly adapt ASR systems and (3) select "useful" training samples in order to assist with resource-scarce ASR system development. We demonstrate an automatic audio harvesting approach which efficiently creates a speech recognition corpus by harvesting an easily available audio resource. We show that by starting with bootstrapped acoustic models, trained with language data obtained from a dialect, and then running through a few iterations of an alignment-filter-retrain phase, it is possible to create an accurate speech recognition corpus. As a demonstration we create a South African English speech recognition corpus by using our approach to harvest an internet website which provides audio and approximate transcriptions. The acoustic models developed from harvested data are evaluated on independent corpora and show that the proposed harvesting approach provides a robust means to create ASR resources. As there are many acoustic model adaptation techniques which can be implemented by an ASR system developer, it becomes a costly endeavour to select the best one. We investigate the dependence of various adaptation techniques on the amount of adaptation data by systematically varying that amount and comparing performance. We establish a guideline which can be used by an ASR developer to choose the best adaptation technique given a size constraint on the adaptation data, for the scenario where adaptation between narrow- and wide-band corpora must be performed. In addition, we investigate the effectiveness of a novel channel normalisation technique and compare its performance with standard normalisation and adaptation techniques. Lastly, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection, and that the widely used maximum entropy methods generally produced the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships, and found that a hyperbolic relationship (as suggested by simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
Thesis (PhD (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013.
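The frequency-matched selection strategy found most effective above can be sketched as a greedy matcher. The phone-count representation and the L1 match criterion below are illustrative assumptions, not the thesis's exact formulation.

import numpy as np

def frequency_matched_selection(utt_phone_counts, target_dist, n_utts):
    """Greedily pick utterances whose phone counts move the selected
    corpus closest (L1 distance) to a target phone distribution.
    utt_phone_counts: (n_total_utts, n_phones) counts per utterance;
    target_dist: (n_phones,) distribution summing to 1."""
    counts = np.asarray(utt_phone_counts, dtype=float)
    total = np.zeros(counts.shape[1])
    chosen, remaining = [], list(range(len(counts)))
    for _ in range(n_utts):
        best, best_err = None, np.inf
        for i in remaining:
            cand = total + counts[i]
            err = np.abs(cand / cand.sum() - target_dist).sum()
            if err < best_err:
                best, best_err = i, err
        chosen.append(best)
        remaining.remove(best)
        total += counts[best]
    return chosen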
28

Udaya, Kumar Magesh Kumar. "Classification of Parkinson’s Disease using MultiPass Lvq,Logistic Model Tree,K-Star for Audio Data set : Classification of Parkinson Disease using Audio Dataset." Thesis, Högskolan Dalarna, Datateknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:du-5596.

Abstract:
Parkinson's disease (PD) is a degenerative illness whose cardinal symptoms include rigidity, tremor, and slowness of movement. In addition to its widely recognized effects, PD can have a profound effect on speech and voice. The speech symptoms most commonly demonstrated by patients with PD are reduced vocal loudness, monopitch, disruptions of voice quality, and an abnormally fast rate of speech. This cluster of speech symptoms is often termed hypokinetic dysarthria. The disease can be difficult to diagnose accurately, especially in its early stages; for this reason, automatic techniques based on artificial intelligence should increase diagnostic accuracy and help doctors make better decisions. The aim of the thesis work is to predict PD based on audio files collected from various patients. The audio files are preprocessed in order to obtain the features. The preprocessed data contains 23 attributes and 195 instances. On average there are six voice recordings per person; by using a data compression technique such as the Discrete Cosine Transform (DCT), the number of instances can be minimized. After data compression, attribute selection is done using several WEKA built-in methods such as ChiSquared, GainRatio, and InfoGain. After identifying the important attributes, we evaluate attributes one by one using stepwise regression. Based on the selected attributes, we proceed in WEKA using a cost-sensitive classifier with various algorithms such as MultiPass LVQ, Logistic Model Tree (LMT), and K-Star. The classification results show on average 80% accuracy; using these features, approximately 95% classification of PD is achieved. This shows that, using the audio dataset, PD can be predicted with a high level of accuracy.
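The WEKA workflow described above (attribute ranking, then a classifier) has a close scikit-learn analogue. Since MultiPass LVQ, LMT and K-Star have no direct sklearn equivalents, logistic regression stands in below, so this sketch illustrates the pipeline shape rather than reproducing the thesis results.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def pd_classifier_score(X, y, k=10):
    """Chi-squared attribute selection followed by a classifier,
    evaluated with 5-fold cross-validation."""
    pipe = make_pipeline(MinMaxScaler(),            # chi2 needs non-negative X
                         SelectKBest(chi2, k=k),
                         LogisticRegression(max_iter=1000))
    return cross_val_score(pipe, X, y, cv=5).mean()

# X: (195, 22) acoustic measures per recording, y: 0/1 PD labels, matching
# the 23-attribute, 195-instance description above (one attribute is the label).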
29

Musti, Utpala. "Synthèse acoustico-visuelle de la parole par sélection d'unités bimodales." Thesis, Université de Lorraine, 2013. http://www.theses.fr/2013LORR0003.

Abstract:
This work deals with audio-visual speech synthesis. In the vast literature available in this direction, many approaches deal with it by dividing it into two synthesis problems: one is acoustic speech synthesis, and the other is the generation of the corresponding facial animation. But this does not guarantee perfectly synchronous and coherent audio-visual speech. To overcome this drawback implicitly, we proposed a different approach to acoustic-visual speech synthesis based on the selection of naturally synchronous bimodal units. The synthesis is based on the classical unit-selection paradigm. The main idea behind this synthesis technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and the database preparation for our system. We present an overview of our system and detail the various aspects of bimodal unit selection that need to be optimized for good synthesis. The main focus of this work is to synthesize the speech dynamics well, rather than a comprehensive talking head. We describe the visual target features that we designed, and we subsequently present an algorithm for target feature weighting. This algorithm performs target feature weighting and redundant feature elimination iteratively, based on the comparison of target-cost-based rankings and a distance calculated from the acoustic and visual speech signals of units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesizing the speech dynamics reasonably well.
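For readers new to the unit-selection paradigm mentioned above, the following minimal dynamic-programming sketch selects one bimodal unit per target slot by minimizing weighted target plus concatenation costs. The feature definitions and weights are placeholders, not the thesis's trained values.

import numpy as np

def select_units(targets, candidates, wt=1.0, wc=0.5):
    """Unit selection by dynamic programming.
    targets: (T, D) target feature vectors; candidates: list of T arrays,
    each (N_t, D). Returns one candidate index per target slot."""
    T = len(targets)
    costs = [wt * np.linalg.norm(c - targets[t], axis=1)
             for t, c in enumerate(candidates)]
    back = []
    for t in range(1, T):
        # concatenation cost between every pair of consecutive candidates
        jump = wc * np.linalg.norm(candidates[t][None, :, :] -
                                   candidates[t - 1][:, None, :], axis=2)
        tot = costs[t - 1][:, None] + jump
        back.append(tot.argmin(axis=0))
        costs[t] += tot.min(axis=0)
    path = [int(costs[-1].argmin())]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]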
30

Spa, Carvajal Carlos. "Time-domain numerical methods in room acoustics simulations." Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7565.

Abstract:
Room acoustics is the science concerned with studying the behavior of sound waves in enclosed rooms. The acoustic information of any room, the so-called impulse response, is expressed in terms of the acoustic field as a function of space and time. In general terms, it is nearly impossible to find analytical impulse responses of real rooms. Therefore, in recent years, the use of computers for solving this type of problem has emerged as a proper alternative for calculating impulse responses.
In this Thesis we focus on the analysis of wave-based methods in the time domain. More concretely, we study in detail the main formulations of finite-difference methods, which have been used in many room acoustics applications, and the recently proposed Fourier pseudo-spectral methods. Both methods are based on discrete formulations of the analytical equations that describe sound phenomena in enclosed rooms.
This work contributes to the main aspects in the computation of impulse responses: wave propagation, source generation, and locally-reacting boundary conditions.
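The finite-difference time-domain family studied here is easy to demonstrate in one dimension. The sketch below steps the scalar wave equation with a standard second-order scheme; the grid, Courant number, source and boundary treatment are illustrative choices, far simpler than the 3-D locally-reacting boundaries treated in the thesis.

import numpy as np

c, dx = 343.0, 0.01                  # speed of sound (m/s), grid step (m)
cfl = 1.0                            # Courant number (<= 1 for stability)
dt = cfl * dx / c
n_pts, n_steps = 200, 400
p_prev, p, p_next = (np.zeros(n_pts) for _ in range(3))

for n in range(n_steps):
    p[100] += np.exp(-0.5 * ((n - 30) / 8.0) ** 2)   # Gaussian pulse source
    lap = p[:-2] - 2 * p[1:-1] + p[2:]               # discrete Laplacian
    p_next[1:-1] = 2 * p[1:-1] - p_prev[1:-1] + (cfl ** 2) * lap
    p_next[0] = p_next[-1] = 0.0                     # pressure-release ends
    p_prev, p, p_next = p, p_next, p_prev

print(float(np.abs(p).max()))        # field stays bounded for cfl <= 1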
31

Hacine-Gharbi, Abdenour. "Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole." Phd thesis, Université d'Orléans, 2012. http://tel.archives-ouvertes.fr/tel-00843652.

Abstract:
The objective of this thesis is to propose solutions and performance improvements for certain problems in the selection of relevant acoustic features for speech recognition. Our first contribution is a new method for selecting relevant features, based on an exact expansion of the redundancy between a feature and the features previously selected by a sequential forward search algorithm. The problem of estimating higher-order probability densities is solved by truncating the theoretical expansion of this redundancy at tractable orders. In addition, we proposed a stopping criterion that fixes the number of selected features as a function of the mutual information approximated at iteration j of the search algorithm. However, estimating mutual information is difficult, since its definition depends on the probability densities of the variables (features), for which the type of distribution is unknown and whose estimates are computed from a finite set of samples. One approach to estimating these distributions is based on the histogram method, which requires a good choice of the number of bins (histogram cells). We therefore also proposed a new formula for computing the number of bins that minimizes the bias of the entropy and mutual information estimator. This new estimator was validated on simulated data and on speech data. In particular, it was applied to the selection of the most relevant static and dynamic MFCC features for a connected-word recognition task on the Aurora2 database.
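The histogram-based estimator at issue is compact enough to sketch. Note that the thesis derives its own bin-count formula to minimize estimator bias, which is not reproduced here; a common rule such as Sturges' is shown purely as a stand-in.

import numpy as np

def mutual_information_hist(x, y, n_bins):
    """Plug-in mutual information estimate from a 2-D histogram (nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=n_bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def sturges_bins(n_samples):
    """Sturges' rule, a generic stand-in for the thesis's bin-count formula."""
    return int(np.ceil(1 + np.log2(n_samples)))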
32

Heinrich, Lisa Marie. "Acoustic-phonetic features in the speech of deaf women." 1995. http://catalog.hathitrust.org/api/volumes/oclc/34556039.html.

Abstract:
Thesis (M.S.)--University of Wisconsin--Madison, 1995.
Typescript. Includes bibliographical references (leaves 66-69).
33

Chien, To-Chang, and 錢鐸樟. "Integration of Acoustic and Linguistic Features for Maximum Entropy Speech Recognition." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/24325293971312481529.

Abstract:
Master's thesis. National Cheng Kung University, Department of Computer Science and Information Engineering, 2005 (ROC year 93).
In a traditional speech recognition system, we assume that the acoustic and linguistic information sources are independent. Parameters of the acoustic hidden Markov model (HMM) and the linguistic n-gram model are estimated individually and then combined to build a plug-in maximum a posteriori (MAP) classification rule. However, the acoustic model and language model are correlated in essence, so we should relax the independence assumption in order to improve speech recognition performance. In this study, we propose an integrated approach based on the maximum entropy (ME) principle, where acoustic and linguistic features are optimally combined in a unified framework. Using this approach, the associations between acoustic and linguistic features are explored and merged in the integrated models. On the issue of discriminative training, we also establish the relationship between ME and discriminative maximum mutual information (MMI) models. In addition, this integrated ME model is general, so semantic topics and long-distance association patterns can be further combined. In the experiments, we apply the proposed ME model to broadcast news transcription using the MATBN database. In preliminary experimental results, we obtain improvement over a conventional speech recognition system based on the plug-in MAP classification rule.
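The plug-in MAP baseline criticized above is the familiar weighted log-linear combination of acoustic and language-model scores. The minimal rescoring sketch below shows that baseline (weights and scores are illustrative); the thesis's maximum entropy model generalizes it with jointly trained acoustic-linguistic feature weights.

def rescore(hypotheses, lam_ac=1.0, lam_lm=8.0):
    """Pick the hypothesis maximizing a log-linear combination of
    acoustic and language-model log-probabilities."""
    scored = [(lam_ac * h["log_p_acoustic"] + lam_lm * h["log_p_lm"],
               h["words"]) for h in hypotheses]
    return max(scored)[1]

nbest = [{"words": "recognize speech", "log_p_acoustic": -120.3, "log_p_lm": -4.1},
         {"words": "wreck a nice beach", "log_p_acoustic": -118.9, "log_p_lm": -7.6}]
print(rescore(nbest))   # the LM weight favours the first hypothesis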
APA, Harvard, Vancouver, ISO, and other styles
34

Tu, Tsung-Wei (涂宗瑋). "Speech Information Retrieval Using Support Vector Machines with Context and Acoustic Features." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/39598366759652069568.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Chen, Jia-Yu (陳佳妤). "Minimum Phone Error Training of Acoustic Models and Features for Large Vocabulary Mandarin Speech Recognition." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/63448829355525193378.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Electrical Engineering
94 (ROC academic year)
Traditional speech recognition uses maximum likelihood estimation to train the parameters of HMMs. This method maximizes the posterior probability of the correct transcript, but it cannot effectively separate confusable models. Discriminative training considers the correct transcript and the recognized hypotheses at the same time, trying to separate confusable models in a high-dimensional space. Based on minimum phone error (MPE) and feature-space minimum phone error (fMPE), this thesis presents the background knowledge, basic theory, and experimental results of discriminative training. The thesis has four parts. The first part covers the basic theory, including risk estimation and auxiliary functions. Risk estimation starts from minimum Bayesian risk and introduces widely explored model training methods, including maximum likelihood estimation, maximum mutual information estimation, overall risk criterion estimation, and minimum phone error; these objective functions can be regarded as extensions of the Bayesian risk. In addition, the thesis reviews strong-sense and weak-sense auxiliary functions and the smoothing function. Strong-sense and weak-sense auxiliary functions can be used to find the optimal solution, and when a weak-sense auxiliary function is used, adding a smoothing function improves the convergence speed. The second part describes the experimental setup, including the NTNU broadcast news corpus, the lexicon, and the language model. The recognizer uses a left-to-right, frame-synchronous tree-copy search to implement LVCSR. Maximum likelihood training on mel-frequency cepstral coefficients, and on features processed by heteroscedastic linear discriminant analysis, serves as the baseline. The third part is minimum phone error, which uses phone error directly as the objective function. From the update equations we can see that the newly trained model parameters move closer to correctly recognized features (belonging to numerator lattices) and away from wrongly recognized features (belonging to denominator lattices). The I-smoothing technique introduces the model's prior to improve the estimation. The thesis also explains the approximation of phone error: how lattices are used to approximate all recognition hypotheses and how forward-backward algorithms calculate the average accuracy. Experimental results show that this method reduces the character error rate by 3% on the corpus. The fourth part is feature-space minimum phone error, which projects features into a high-dimensional space and generates an offset vector that is added to the original feature to increase discrimination. The transform matrix is trained under the minimum phone error criterion by gradient descent, using either direct or indirect differentials; the indirect differential reflects the model change on the features, so feature training and model training can be performed iteratively. Offset feature-space minimum phone error differs in the high-dimensional feature used; it saves about 1/4 of the computation while achieving similar improvement. This thesis proposes dimension-weighted offset feature-space minimum phone error, which assigns different weights to different dimensions. Experimental results show that these methods achieve a 3% character error rate reduction, with the dimension-weighted variant giving larger improvements and more robust training.
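A minimal sketch of the fMPE offset idea described above, assuming spherical unit-variance Gaussians for the posterior computation; training of the projection matrix M by MPE gradient descent is omitted, and all names are illustrative:

```python
import numpy as np

def fmpe_offset(x, centers, M, scale=1.0):
    """Return x plus the fMPE offset M @ gamma, where gamma is the vector of
    Gaussian posteriors of x (spherical, unit-variance Gaussians assumed).

    x: (d,) original feature; centers: (n, d) Gaussian means; M: (d, n)."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    g = np.exp(-0.5 * scale * (d2 - d2.min()))  # shift for numerical stability
    gamma = g / g.sum()                          # high-dimensional posterior vector
    return x + M @ gamma                         # offset added to original feature

# Toy usage: 39-dim feature, 512 Gaussians; an untrained (zero) M leaves
# the feature unchanged, while MPE training would learn a discriminative M.
rng = np.random.default_rng(0)
x = rng.standard_normal(39)
centers = rng.standard_normal((512, 39))
M = np.zeros((39, 512))
print(np.allclose(fmpe_offset(x, centers, M), x))  # True
```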
APA, Harvard, Vancouver, ISO, and other styles
36

Chung, Cheng-Tao (鍾承道). "Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/p9p96r.

Full text
Abstract:
Doctoral dissertation
National Taiwan University
Graduate Institute of Electrical Engineering
105 (ROC academic year)
In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This increases the importance of unsupervised learning scenarios in which annotated data are not required; a typical application is Query-by-Example Spoken Term Detection (QbE-STD). With supervised learning being the dominant paradigm in automatic speech recognition (ASR), such scenarios remain relatively unexplored. In this thesis, we present the Hierarchical Paradigm and the Multi-granularity Paradigm for unsupervised discovery of structured acoustic tokens directly from speech corpora. The Hierarchical Paradigm attempts to jointly learn two levels of representation, correlated with phonemes and words respectively. The Multi-granularity Paradigm makes no assumption about which token set to select and seeks to capture all available information with multiple token sets of different model granularities. Furthermore, unsupervised speech features can be extracted from the multi-granular acoustic tokens with a framework we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MAT-DNN). We unify the two paradigms in a single theoretical framework and perform query-by-example spoken term detection experiments on the token sets and frame-level features. The theories and principles on acoustic tokens and frame-level features proposed in this thesis are supported by competitive results against strong baselines on standard corpora using well-defined metrics.
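As a crude illustration of the multi-granularity idea (the thesis itself learns HMM-based token models, not k-means codebooks), one can vector-quantize frame-level features with several codebook sizes and keep one token sequence per granularity:

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_granularity_tokens(frames, granularities=(64, 256, 1024)):
    """Toy proxy for multi-granularity token discovery: cluster frame-level
    features with several codebook sizes, returning one frame-level token
    sequence per granularity (coarse to fine)."""
    token_sets = {}
    for m in granularities:
        km = KMeans(n_clusters=m, n_init=1, random_state=0).fit(frames)
        token_sets[m] = km.labels_  # token IDs, one per frame
    return token_sets

# frames: (T, 39) MFCCs from an unannotated corpus. Each granularity yields a
# different token inventory; the actual paradigm additionally varies temporal
# granularity (number of HMM states per token), which this sketch omits.
```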
APA, Harvard, Vancouver, ISO, and other styles
37

Wang, Shang-Yu (王上瑜). "A Study of Applying Noise-Robust Features in Reduced Frame-Rate Acoustic Models for Speech Recognition." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/63485710426800421992.

Full text
Abstract:
Master's thesis
National Chi Nan University
Department of Electrical Engineering
104 (ROC academic year)
Speech recognition on mobile devices has become increasingly popular in our lives, but it has to meet the twin requirements of high recognition accuracy and low transmission load. One of the most challenging tasks in improving recognition accuracy for real-world applications is alleviating the effect of noise, and one prominent way to reduce the transmission load is to make the speech features as compact as possible. In this study, we evaluate the effectiveness of integrating noise-robust speech feature representations with a reduced frame-rate acoustic model architecture. The noise-robustness algorithms used to improve the features include cepstral mean subtraction (CMS), cepstral mean and variance normalization (MVN), histogram equalization (HEQ), cepstral gain normalization (CGN), MVN plus auto-regressive moving-average filtering (MVA), and modulation spectrum power-law expansion (MSPLE). The hidden Markov model (HMM) structure adapted for reduced frame-rate (RFR) speech features, developed by Professor Lee-min Lee, is exploited in our evaluation task. Experiments conducted on the Aurora-2 digit database show that, in the clean noise-free situation, the adapted HMM with RFR features provides recognition accuracy comparable to the non-adapted HMM with full frame-rate (FFR) features, while in noisy situations the noise-robustness algorithms work well in the RFR HMM scenario and improve recognition performance even when the RFR down-sampling ratio is as low as 1/4.
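A minimal sketch of two of the listed normalizations plus RFR down-sampling, assuming per-utterance statistics on an MFCC matrix of shape (frames, dimensions):

```python
import numpy as np

def cms(feats):
    """Cepstral mean subtraction: remove the per-utterance channel bias."""
    return feats - feats.mean(axis=0)

def mvn(feats):
    """Cepstral mean and variance normalization."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def reduce_frame_rate(feats, ratio=2):
    """Keep every `ratio`-th frame, e.g. ratio=4 for the 1/4 RFR setting."""
    return feats[::ratio]

# Typical order: normalize for noise robustness, then down-sample to cut the
# transmission load before decoding with the RFR-adapted HMM:
# compact = reduce_frame_rate(mvn(mfcc_matrix), ratio=4)
```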
APA, Harvard, Vancouver, ISO, and other styles
38

Kim, Yunjung. "Patterns of speech abnormality in a large dysarthria database : interactions between severity, acoustic features, and dysarthria type /." 2007. http://www.library.wisc.edu/databases/connect/dissertations.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Albalkhi, Rahaf. "Articulation modelling of vowels in dysarthric and non-dysarthric speech." Thesis, 2020. http://hdl.handle.net/1828/11771.

Full text
Abstract:
People with motor function disorders that cause dysarthric speech find it difficult to use state-of-the-art automatic speech recognition (ASR) systems. These systems are developed from non-dysarthric speech models, which explains their poor performance for individuals with dysarthria, so a solution is needed to compensate. This thesis examines the possibility of quantizing vowels of dysarthric and non-dysarthric speech into codewords, regardless of inter-speaker variability, in a way that can be implemented on machines with limited processing capability. I show that it is possible to model all the vowels and vowel-like sounds that a North American speaker can produce if the frequencies of the first and second formants are used to encode these sounds. The proposed solution is aligned with the use of neural networks and hidden Markov models to build the acoustic model in conventional ASR systems. A secondary finding of this study is the feasibility of reducing the set of ten most common vowels in North American English to only eight vowels.
Graduate
2021-05-11
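A sketch of the codeword idea from this abstract: quantize the (F1, F2) plane into a small grid so every vowel-like sound maps to one codeword. The formant ranges and the 8x8 grid below are assumptions for illustration, not the thesis's actual codebook:

```python
import numpy as np

# Hypothetical F1/F2 ranges (Hz) roughly covering North American vowels.
F1_EDGES = np.linspace(200, 1000, 9)   # 8 F1 cells
F2_EDGES = np.linspace(600, 3000, 9)   # 8 F2 cells

def vowel_codeword(f1, f2):
    """Quantize an (F1, F2) pair into one of 64 codewords.

    Encoding by the first two formants abstracts away much inter-speaker
    variability; the cheap table lookup suits low-power devices."""
    i = int(np.clip(np.searchsorted(F1_EDGES, f1) - 1, 0, 7))
    j = int(np.clip(np.searchsorted(F2_EDGES, f2) - 1, 0, 7))
    return i * 8 + j

# /i/ as in "heed", using Peterson & Barney average formants for adult males.
print(vowel_codeword(270, 2290))
```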
APA, Harvard, Vancouver, ISO, and other styles
40

Lee, Yi-Hsuan (李依萱). "Relationship of Aspiration/Unaspiration Features for Stops, Affricates and Speech Intelligibility in Esophageal and Pneumatic Device Speakers: An Acoustic and Perceptual Study." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/83321695086212692263.

Full text
Abstract:
Master's thesis
National Taipei University of Nursing and Health Sciences
Graduate Institute of Speech and Hearing Disorders
103 (ROC academic year)
The purpose of this study was to investigate and compare acoustic and auditory-perception parameters of esophageal speakers, pneumatic device speakers, and normal laryngeal speakers. The acoustic parameters were voice onset time (VOT) and noise duration; the auditory-perception parameters were stop intelligibility and affricate intelligibility. Speech samples were recorded from 16 esophageal speakers, 18 pneumatic device speakers, and 19 normal laryngeal speakers. The Kruskal-Wallis test was used to analyze differences in all acoustic and auditory-perception parameters among the three groups, and Spearman's rank correlation coefficient was used to analyze the relationship between acoustic and auditory-perception parameters. The acoustic measurements revealed significant differences among the three groups. The VOT for unaspirated stops of both esophageal and pneumatic device speakers was significantly higher than that of normal laryngeal speakers. The VOT for aspirated stops of normal speakers was significantly higher than that of pneumatic device speakers. The noise duration for unaspirated affricates of esophageal speakers was significantly higher than that of pneumatic device speakers. The noise duration for aspirated affricates of normal speakers was significantly higher than that of pneumatic device speakers. No significant difference was found between the alaryngeal groups in the relationship between acoustic and auditory-perception parameters. These findings can serve as references for clinical speech-language pathologists conducting speech assessment and rehabilitation for esophageal and pneumatic device speakers.
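The statistical tests named in this abstract are straightforward to reproduce; the sketch below runs a Kruskal-Wallis test across three groups and a Spearman rank correlation, on made-up VOT and intelligibility values:

```python
from scipy.stats import kruskal, spearmanr

# Hypothetical VOT measurements (ms) for unaspirated stops, one list per group.
vot_esophageal = [45, 52, 38, 60, 47]
vot_pneumatic  = [41, 55, 49, 58, 44]
vot_laryngeal  = [18, 22, 25, 20, 16]

# Kruskal-Wallis: do the three speaker groups differ in VOT?
H, p = kruskal(vot_esophageal, vot_pneumatic, vot_laryngeal)
print(f"Kruskal-Wallis H={H:.2f}, p={p:.4f}")

# Spearman rank correlation between an acoustic measure and a perceptual one
# (hypothetical per-speaker intelligibility scores, same ordering as above).
vot_all = vot_esophageal + vot_pneumatic + vot_laryngeal
intelligibility = [70, 65, 72, 58, 69, 74, 66, 63, 60, 71, 92, 90, 88, 91, 95]
rho, p = spearmanr(vot_all, intelligibility)
print(f"Spearman rho={rho:.2f}, p={p:.4f}")
```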
APA, Harvard, Vancouver, ISO, and other styles
41

"Multi-resolution analysis based acoustic features for speech recognition =: 基於多尺度分析的聲學特徵在語音識別中的應用." 1999. http://library.cuhk.edu.hk/record=b5890004.

Full text
Abstract:
Chan Chun Ping.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1999.
Includes bibliographical references (leaves 134-137).
Text in English; abstracts in English and Chinese.
Chapter 1: Introduction
  1.1 Automatic Speech Recognition
  1.2 Review of Speech Recognition Techniques
  1.3 Review of Signal Representation
  1.4 Review of Wavelet Transform
  1.5 Objective of Thesis
  1.6 Thesis Outline
Chapter 2: Baseline Speech Recognition System
  2.1 Introduction
  2.2 Feature Extraction
  2.3 Hidden Markov Model for Speech Recognition
    2.3.1 The Principle of Using HMM in Speech Recognition
    2.3.2 Elements of an HMM
    2.3.3 Parameter Estimation and Recognition Algorithm
    2.3.4 Summary of HMM-based Speech Recognition
  2.4 TIMIT Continuous Speech Corpus
  2.5 Baseline Speech Recognition Experiments
  2.6 Summary
Chapter 3: Multi-Resolution Based Acoustic Features
  3.1 Introduction
  3.2 Discrete Wavelet Transform
  3.3 Periodic Discrete Wavelet Transform
  3.4 Multi-Resolution Analysis on STFT Spectrum
  3.5 Principal Component Analysis
    3.5.1 Related Work
    3.5.2 Theoretical Background of PCA
    3.5.3 Examples of Basis Vectors Found by PCA
  3.6 Experiments for Multi-Resolution Based Features
    3.6.1 Experiments with Clean Speech
    3.6.2 Experiments with Noisy Speech
  3.7 Summary
Chapter 4: Wavelet Packet Based Acoustic Features
  4.1 Introduction
  4.2 Wavelet Packet Filter-Bank
  4.3 Dimensionality Reduction
  4.4 Filter-Bank Parameters
    4.4.1 Mel-Scale Wavelet Packet Filter-Bank
    4.4.2 Effect of Down-Sampling
    4.4.3 Mel-Scale Wavelet Packet Tree
    4.4.4 Wavelet Filters
  4.5 Experiments Using Wavelet Packet Based Acoustic Features
  4.6 Broad Phonetic Class Analysis
  4.7 Discussion
  4.8 Summary
Chapter 5: De-Noising by Wavelet Transform
  5.1 Introduction
  5.2 De-Noising Capability of Wavelet Transform
  5.3 Wavelet Transform Based Wiener Filtering
    5.3.1 Sub-Band Position for Wiener Filtering
    5.3.2 Estimation of Short-Time Speech and Noise Power
  5.4 De-Noising Embedded in Wavelet Packet Filter-Bank
  5.5 Experiments Using Wavelet Built-in De-Noising Properties
  5.6 Discussion
    5.6.1 Broad Phonetic Class Analysis
    5.6.2 Distortion Measure
  5.7 Summary
Chapter 6: Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Appendix 1: Jacobi's Method
Appendix 2: Broad Phonetic Class
APA, Harvard, Vancouver, ISO, and other styles
42

Chen, Chia-Ping (陳佳蘋). "Improved Speech Information Retrieval by Acoustic Feature Similarity." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/89018651792401837962.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Sakai, Makoto (坂井誠). "Acoustic Feature Transformation Based on Generalized Criteria for Speech Recognition." Thesis, 2010. http://hdl.handle.net/2237/14293.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Zolnay, András. "Acoustic feature combination for speech recognition / submitted by András Zolnay." 2006. http://d-nb.info/982202156/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Chu, Chung Ling (朱忠玲). "Acoustic Modeling and Feature Normalization for Large Vocabulary Continuous Mandarin Speech Recognition." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/74973355073730968412.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Khan, W., Ping Jiang, and David R. W. Holton. "Word spotting in continuous speech using wavelet transform." 2014. http://hdl.handle.net/10454/10713.

Full text
Abstract:
Word spotting in continuous speech is considered a challenging problem due to the dynamic nature of speech. The literature contains a variety of techniques for isolated word recognition and spotting, most of them based on pattern recognition and similarity measures. This paper amalgamates several techniques, including the wavelet transform, feature extraction, and Euclidean distance. Based on acoustic features, the proposed system can identify and localize a target (test) word in continuous speech of any length. The wavelet transform is used for time-frequency representation and filtration of the speech signal. Only high-intensity frequency components are passed to the feature extraction and matching process, resulting in robust performance in terms of both matching and computational cost.
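A minimal sketch of the pipeline this abstract describes, assuming PyWavelets for the transform and sub-band energies as the feature vector; the frame and hop sizes are arbitrary illustrative values, not the paper's settings:

```python
import numpy as np
import pywt

def dwt_features(signal, wavelet="db4", level=3):
    """Energies of the wavelet sub-bands as a compact feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

def spot_word(speech, template, frame=1600, hop=400):
    """Slide over continuous speech and report the window whose features are
    closest (in Euclidean distance) to the template's features."""
    tmpl = dwt_features(template)
    best_pos, best_dist = -1, np.inf
    for start in range(0, len(speech) - frame + 1, hop):
        feat = dwt_features(speech[start:start + frame])
        dist = np.linalg.norm(feat - tmpl)
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos, best_dist

# Usage: speech and template are 1-D sample arrays at the same sampling rate;
# the returned sample offset localizes the best-matching occurrence.
```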
APA, Harvard, Vancouver, ISO, and other styles
47

Molau, Sirko. "Normalization in the acoustic feature space for improved speech recognition / submitted by Sirko Molau." 2003. http://d-nb.info/96913603X/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Tsai, Cheng-Yu (蔡政昱). "Mutual Reinforcement for Acoustic Tokens and Multi-level Acoustic Tokenizing Deep Neural Network for Unsupervised Speech Feature Extraction and Spoken Term Discovery." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/88386789472006613910.

Full text
APA, Harvard, Vancouver, ISO, and other styles