Dissertations on the topic "Acoustic speech features"
Create a reference in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 48 dissertations for your research on the topic "Acoustic speech features."
Next to every source in the list of references there is an "Add to bibliography" button. Use it, and the bibliographic reference to the chosen work will be formatted automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication in PDF format and read its online abstract whenever the relevant parameters are available in the metadata.
Browse dissertations on a wide variety of disciplines and compile your bibliography correctly.
Leung, Ka Yee. „Combining acoustic features and articulatory features for speech recognition /“. View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20LEUNGK.
Includes bibliographical references (leaves 92-96). Also available in electronic version. Access restricted to campus users.
Juneja, Amit. „Speech recognition based on phonetic features and acoustic landmarks“. College Park, Md. : University of Maryland, 2004. http://hdl.handle.net/1903/2148.
Thesis research directed by: Electrical Engineering. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
Tyson, Na'im R. „Exploration of Acoustic Features for Automatic Vowel Discrimination in Spontaneous Speech“. The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1339695879.
Sun, Rui. „The evaluation of the stability of acoustic features in affective conveyance across multiple emotional databases“. Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/49041.
Torres, Juan Félix. „Estimation of glottal source features from the spectral envelope of the acoustic speech signal“. Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/34736.
Ishizuka, Kentaro. „Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments“. 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.
Diekema, Emily D. „Acoustic Measurements of Clear Speech Cue Fade in Adults with Idiopathic Parkinson Disease“. Bowling Green State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1460063159.
Tran, Thi-Anh-Xuan. „Acoustic gesture modeling. Application to a Vietnamese speech recognition system“. Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAT023/document.
Speech plays a vital role in human communication. Selection of relevant acoustic speech features is key in the design of any system using speech processing. For some 40 years, speech was typically considered as a sequence of quasi-stable portions of signal (vowels) separated by transitions (consonants). Despite a wealth of studies that clearly document the importance of coarticulation, and reveal that articulatory and acoustic targets are not context-independent, the view that each vowel has an acoustic target that can be specified in a context-independent manner remains widespread. This point of view entails strong limitations. It is well known that formant frequencies are acoustic characteristics that bear a clear relationship with speech production, and that can distinguish among vowels. Therefore, vowels are generally described with static articulatory configurations represented by targets in the acoustic space, typically by formant frequencies in the F1-F2 and F2-F3 planes. Plosive consonants can be described in terms of places of articulation, represented by locus or locus equations in an acoustic plane. But formant frequency trajectories in fluent speech rarely display a steady state for each vowel. They vary with speaker, consonantal environment (coarticulation) and speaking rate (relating to the continuum between hypo- and hyper-articulation). In view of the inherent limitations of static approaches, the approach adopted here consists in studying both vowels and consonants from a dynamic point of view. First, we studied the effects of the impulse response at the beginning, at the end and during transitions of the signal, both in the speech signal and at the perception level. Variations of the phases of the components were then examined. Results show that the effects of these parameters can be observed in spectrograms. Crucially, the amplitudes of the spectral components distinguished under the approach advocated here are sufficient for perceptual discrimination. From this result, for all speech analysis, we focus only on the amplitude domain, deliberately leaving aside phase information. Next we extend the work to vowel-consonant-vowel perception from a dynamic point of view. These perceptual results, together with those obtained earlier by Carré (2009a), show that vowel-to-vowel and vowel-consonant-vowel stimuli can be characterized and separated by the direction and rate of the transitions in the formant plane, even when absolute frequency values are outside the vowel triangle (i.e. the vowel acoustic space in absolute values). Due to the limitations of formant measurements, the dynamic approach needs new tools, based on parameters that can replace formant frequency estimation. Spectral Subband Centroid Frequency (SSCF) features were studied. Comparison with vowel formant frequencies shows that SSCFs can replace formant frequencies and act as "pseudo-formants" even during consonant production. On this basis, SSCF is used as a tool to compute dynamic characteristics. We propose a new way to model dynamic speech features, which we call SSCF Angles. Our analysis of SSCF Angles was performed on transitions of vowel-to-vowel (V1V2) sequences in both Vietnamese and French. SSCF Angles appear to be reliable and robust parameters.
For each language, the analysis results show that: (i) SSCF Angles can distinguish V1V2 transitions; (ii) V1V2 and V2V1 have symmetrical properties in the acoustic domain based on SSCF Angles; (iii) SSCF Angles for male and female speakers are fairly similar for the same V1V2 transition; and (iv) they are more or less invariant to speech rate (normal and fast). Finally, these dynamic acoustic speech features are used in a Vietnamese automatic speech recognition system, with several interesting results obtained.
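A minimal sketch of how subband centroid frequencies and a transition angle of the kind described above could be computed from windowed speech frames; the subband edges, frame setup and the particular angle definition are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def sscf(frames, sr, band_edges_hz):
    """Spectral Subband Centroid Frequencies, one value per subband and frame.

    frames: (n_frames, frame_len) array of windowed speech samples.
    band_edges_hz: list of (low, high) subband edges (illustrative values)."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # power spectrum per frame
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    cents = []
    for lo, hi in band_edges_hz:
        band = (freqs >= lo) & (freqs < hi)
        p, f = spec[:, band], freqs[band]
        cents.append((p * f).sum(axis=1) / (p.sum(axis=1) + 1e-10))
    return np.stack(cents, axis=1)                           # (n_frames, n_bands)

def transition_angle(sscf_v1, sscf_v2, dt):
    """One possible 'angle' of the SSCF trajectory between two vowel targets:
    arctangent of the frequency change over the transition duration, per subband.
    In practice both axes would be normalised before taking the angle."""
    return np.degrees(np.arctan2(sscf_v2 - sscf_v1, dt))

# Hypothetical usage on windowed frames of a V1V2 transition (16 kHz speech):
# bands = [(200, 1000), (800, 2500), (2000, 3500)]   # illustrative subband edges
# c = sscf(frames, 16000, bands)
# angles = transition_angle(c[0], c[-1], dt=0.10)    # angle per subband over 100 ms
```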
Wang, Yuxuan. „Supervised Speech Separation Using Deep Neural Networks“. The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1426366690.
Chen, Jitong. „On Generalization of Supervised Speech Separation“. The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492038295603502.
暁芸, 王., and Xiaoyun Wang. „Phoneme set design for second language speech recognition“. Thesis, https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB13044980/?lang=0, 2017. https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB13044980/?lang=0.
This dissertation focuses on the problem caused by confused mispronunciations, in order to improve the recognition performance of second-language speech. A novel method considering integrated acoustic and linguistic features is proposed to derive a reduced phoneme set for L2 speech recognition. The customized phoneme set is created with a phonetic decision tree (PDT)-based top-down sequential splitting method that utilizes the phonological knowledge between L1 and L2. The dissertation verifies the efficacy of the proposed method for Japanese English and shows that a speech recognizer built with the proposed method is able to alleviate the problem caused by confused mispronunciations of second-language speakers.
博士(工学)
Doctor of Philosophy in Engineering
同志社大学
Doshisha University
Bezůšek, Marek. „Objektivizace Testu 3F - dysartrický profil pomocí akustické analýzy“. Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2021. http://www.nusl.cz/ntk/nusl-442568.
Tomashenko, Natalia. „Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems“. Thesis, Le Mans, 2017. http://www.theses.fr/2017LEMA1040/document.
Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common techniques in ASR systems for many decades. Speaker adaptation is very effective for these AMs, and various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved big advances and outperformed GMM-HMM models on various ASR tasks. However, speaker adaptation is still very challenging for these AMs. Many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated. The idea of this approach is based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms, developed for GMMs, to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques and complementary to them.
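A minimal sketch of the idea of GMM-derived features used as DNN input, as described above: each frame is represented by its per-component log-likelihoods under a GMM (which can itself be speaker-adapted with standard GMM techniques) and a context window is stacked; the diagonal-covariance GMM, the log-likelihood representation and the context width are illustrative assumptions.

```python
import numpy as np

def gmm_log_likelihoods(x, means, variances, weights):
    """Per-component log-likelihoods of frames x under a diagonal-covariance GMM.

    x: (n_frames, dim) acoustic features (e.g. MFCCs).
    means, variances: (n_comp, dim); weights: (n_comp,).
    Returns (n_frames, n_comp), used as the 'GMM-derived' representation."""
    const = -0.5 * (np.log(2 * np.pi) * means.shape[1] + np.log(variances).sum(axis=1))
    diff = x[:, None, :] - means[None, :, :]
    exponent = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    return np.log(weights)[None, :] + const[None, :] + exponent

def gmm_derived_features(x, gmm, context=5):
    """Stack log-likelihood vectors over a +/- context window (assumed setup).

    gmm: tuple (means, variances, weights), e.g. a speaker-adapted model."""
    ll = gmm_log_likelihoods(x, *gmm)
    padded = np.pad(ll, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(x)] for i in range(2 * context + 1)])
```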
Anderson, Jill M. „Lateralization Effects of Brainstem Responses and Middle Latency Responses to a Complex Tone and Speech Syllable“. University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1313687765.
Zolnay, András. „Acoustic feature combination for speech recognition“. [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=982202156.
Selmini, Antonio Marcos. „Sistema baseado em regras para o refinamento da segmentação automatica de fala“. [s.n.], 2008. http://repositorio.unicamp.br/jspui/handle/REPOSIP/260756.
Der volle Inhalt der QuelleTese (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação
Abstract: The demand for reliable automatic speech segmentation is increasing, requiring additional research to support the development of systems that use speech for man-machine interfaces. In this context, this work reports the development and evaluation of a system for automatic speech segmentation using the Viterbi algorithm and a refinement of segmentation boundaries based on acoustic-phonetic features. Phonetic sub-units (context-dependent phones) are modeled with hidden Markov models (HMMs). Each boundary estimated by the Viterbi algorithm is refined using class-dependent acoustic features, as the identity of the phones on the left and right side of the considered boundary is known. The proposed system was evaluated using two speaker-dependent Brazilian Portuguese speech databases (one male and one female speaker), and a speaker-independent English database (TIMIT). The evaluation was carried out by comparing automatic against manual segmentation. After the refinement process, an improvement of 29% in the percentage of boundaries with segmentation error below 20 ms was achieved for the male speaker-dependent Brazilian Portuguese speech database.
Doctorate
Telecommunications and Telematics
Doctor of Electrical Engineering
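A minimal sketch of the kind of boundary refinement described in the Selmini abstract above: a Viterbi-estimated boundary is moved to the point of maximum spectral change inside a small search window; the window size and the generic Euclidean spectral-change cue are illustrative assumptions standing in for the thesis's class-dependent acoustic-phonetic rules.

```python
import numpy as np

def refine_boundary(features, viterbi_frame, search=5):
    """Shift a Viterbi-estimated boundary to the frame of maximum spectral change.

    features: (n_frames, dim) frame-level acoustic features.
    viterbi_frame: boundary frame index proposed by forced alignment.
    search: half-width (in frames) of the refinement window (assumed value)."""
    lo = max(viterbi_frame - search, 1)
    hi = min(viterbi_frame + search, len(features) - 1)
    # Spectral change between consecutive frames inside the search window.
    change = np.linalg.norm(np.diff(features[lo - 1:hi + 1], axis=0), axis=1)
    return lo + int(np.argmax(change))
```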
DiCicco, Thomas M. Jr (Thomas Minotti). „Optimization of acoustic feature extraction from dysarthric speech“. Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/57781.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 171-180).
Dysarthria is a motor speech disorder characterized by weak or uncoordinated movements of the speech musculature. While unfamiliar listeners struggle to understand speakers with severe dysarthria, familiar listeners are often able to comprehend with high accuracy. This observation implies that although the speech produced by an individual with dysarthria may appear distorted and unintelligible to the untrained listener, there must be a set of consistent acoustic cues that the familiar communication partner is able to interpret. While dysarthric speech has been characterized both acoustically and perceptually, most accounts tend to compare dysarthric productions to those of healthy controls rather than identify the set of reliable and consistently controlled segmental cues. This work aimed to elucidate possible recognition strategies used by familiar listeners by optimizing a model of human speech recognition, Stevens' Lexical Access from Features (LAFF) framework, for ten individual speakers with dysarthria (SWDs). The LAFF model is rooted in distinctive feature theory, with acoustic landmarks indicating changes in the manner of articulation. The acoustic correlates manifested around landmarks provide the identity of articulator-free (manner) and articulator-bound (place) features.
SWDs created weaker consonantal landmarks, likely due to an inability to form complete closures in the vocal tract and to fully release consonantal constrictions. Identification of speaker-optimized acoustic correlate sets improved discrimination of each speaker's productions, evidenced by increased sensitivity and specificity. While there was overlap between the types of correlates identified for healthy and dysarthric speakers, using the optimal sets of correlates identified for SWDs adversely impaired discrimination of healthy speech. These results suggest that the combinations of correlates suggested for SWDs were specific to the individual and different from the segmental cues used by healthy individuals. Application of the LAFF model to dysarthric speech has potential clinical utility as a diagnostic tool, highlighting the fine-grain components of speech production that require intervention and quantifying the degree of impairment.
by Thomas M. DiCicco, Jr.
Ph.D.
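A minimal sketch of consonantal landmark detection in the spirit of the landmark-based LAFF framework discussed above: abrupt rises and falls of band energy are taken as candidate closure and release landmarks; the frequency band, frame step and rate-of-rise threshold are illustrative assumptions, not the model's actual detectors.

```python
import numpy as np

def band_energy_db(frames, sr, lo_hz, hi_hz):
    """Log energy (dB) in one frequency band for each windowed frame."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    band = (freqs >= lo_hz) & (freqs < hi_hz)
    return 10.0 * np.log10(spec[:, band].sum(axis=1) + 1e-10)

def consonantal_landmarks(frames, sr, rise_db=9.0):
    """Frames where mid-band energy changes abruptly (candidate closures/releases).

    rise_db is an assumed rate-of-rise threshold measured over a span of
    5 frames (about 50 ms at a 10 ms frame step)."""
    e = band_energy_db(frames, sr, 800, 3000)
    delta = e[5:] - e[:-5]
    return np.where(np.abs(delta) > rise_db)[0] + 5
```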
Gajic, Bojana. „Feature Extraction for Automatic Speech Recognition in Noisy Acoustic Environments“. Doctoral thesis, Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, 2002. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-441.
This thesis presents a study of alternative speech feature extraction methods aimed at increasing robustness of automatic speech recognition (ASR) against additive background noise.
Spectral peak positions of speech signals remain practically unchanged in the presence of additive background noise. Thus, it was expected that emphasizing spectral peak positions in speech feature extraction would result in improved noise robustness of ASR systems. If frequency subbands are properly chosen, dominant subband frequencies can serve as reasonable estimates of spectral peak positions. Therefore, different methods for incorporating dominant subband frequencies into speech feature vectors were investigated in this study.
To begin with, two earlier proposed feature extraction methods that utilize dominant subband frequency information were examined. The first one uses zero-crossing statistics of the subband signals to estimate dominant subband frequencies, while the second one uses subband spectral centroids. The methods were compared with the standard MFCC feature extraction method on two different recognition tasks in various background conditions. The first method was shown to improve ASR performance on both recognition tasks at sufficiently high noise levels. The improvement was, however, smaller on the more complex recognition task. The second method, on the other hand, led to some reduction in ASR performance in all testing conditions.
Next, a new method for incorporating subband spectral centroids into speech feature vectors was proposed, and was shown to be considerably more robust than the standard MFCC method on both ASR tasks. The main difference between the proposed method and the zero-crossing based method is in the way they utilize dominant subband frequency information. It was shown that the performance improvement due to the use of dominant subband frequency information was considerably larger for the proposed method than for the zero-crossing-based (ZCPA) method, especially on the more complex recognition task. Finally, the computational complexity of the proposed method is two orders of magnitude lower than that of the zero-crossing based method, and of the same order of magnitude as the standard MFCC method.
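A minimal sketch of estimating a dominant subband frequency from zero-crossing statistics and appending it to a cepstral feature vector, which is the general idea behind the methods compared above; the estimator and the simple concatenation are illustrative assumptions rather than the thesis's exact ZCPA or centroid-based formulations.

```python
import numpy as np

def dominant_freq_from_zero_crossings(subband_signal, sr):
    """Dominant-frequency estimate of one subband from its zero-crossing rate.

    For a roughly sinusoidal subband signal, the zero-crossing rate is about
    twice the dominant frequency, so f is approximately ZCR * sr / 2."""
    signs = np.signbit(subband_signal)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return 0.5 * crossings * sr / len(subband_signal)

def augment_with_dominant_freqs(mfcc_frame, subband_signals, sr):
    """Append one dominant-frequency estimate per subband to an MFCC vector,
    one possible way of 'incorporating' this information into the features."""
    extra = [dominant_freq_from_zero_crossings(s, sr) for s in subband_signals]
    return np.concatenate([mfcc_frame, extra])
```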
Darch, Jonathan J. A. „Robust acoustic speech feature prediction from Mel frequency cepstral coefficients“. Thesis, University of East Anglia, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445206.
Hamlet, Sean Michael. „COMPARING ACOUSTIC GLOTTAL FEATURE EXTRACTION METHODS WITH SIMULTANEOUSLY RECORDED HIGH-SPEED VIDEO FEATURES FOR CLINICALLY OBTAINED DATA“. UKnowledge, 2012. http://uknowledge.uky.edu/ece_etds/12.
Pandit, Medha. „Voice and lip based speaker verification“. Thesis, University of Surrey, 2000. http://epubs.surrey.ac.uk/915/.
TAKEDA, Kazuya, Norihide KITAOKA and Makoto SAKAI. „Acoustic Feature Transformation Combining Average and Maximum Classification Error Minimization Criteria“. Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14970.
TAKEDA, Kazuya, Norihide KITAOKA and Makoto SAKAI. „Acoustic Feature Transformation Based on Discriminant Analysis Preserving Local Structure for Speech Recognition“. Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14969.
TAKEDA, Kazuya, Seiichi NAKAGAWA, Yuya HATTORI, Norihide KITAOKA and Makoto SAKAI. „Evaluation of Combinational Use of Discriminant Analysis-Based Acoustic Feature Transformation and Discriminative Training“. Institute of Electronics, Information and Communication Engineers, 2010. http://hdl.handle.net/2237/14968.
Bagchi, Deblin. „Transfer learning approaches for feature denoising and low-resource speech recognition“. The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1577641434371497.
Temko, Andriy. „Acoustic event detection and classification“. Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.
Der volle Inhalt der Quellesortides de diversos sistemes de classificació. Els sistemes de classificació d'events acústics
desenvolupats s'han testejat també mitjançant la participació en unes quantes avaluacions d'àmbit
internacional, entre els anys 2004 i 2006. La segona principal contribució d'aquest treball de tesi consisteix en el desenvolupament de sistemes de detecció d'events acústics. El problema de la detecció és més complex, ja que inclou tant la classificació dels sons com la determinació dels intervals temporals on tenen lloc. Es desenvolupen dues versions del sistema i es proven amb els conjunts de dades de les dues campanyes d'avaluació internacional CLEAR que van tenir lloc els anys 2006 i 2007, fent-se servir dos tipus de bases de dades: dues bases d'events acústics aïllats, i una base d'enregistraments de seminaris interactius, les quals contenen un nombre relativament elevat d'ocurrències dels events acústics especificats. Els sistemes desenvolupats, que consisteixen en l'ús de classificadors basats en SVM que operen dins
d'una finestra lliscant més un post-processament, van ser els únics presentats a les avaluacions
esmentades que no es basaven en models de Markov ocults (Hidden Markov Models) i cada un d'ells
va obtenir resultats competitius en la corresponent avaluació. La detecció d'activitat oral és un altre dels objectius d'aquest treball de tesi, pel fet de ser un cas particular de detecció d'events acústics especialment important. Es desenvolupa una tècnica de millora de l'entrenament dels SVM per fer front a la necessitat de reducció de l'enorme conjunt de dades existents. El sistema resultant, basat en SVM, és testejat amb uns quants conjunts de dades de l'avaluació NIST RT (Rich Transcription), on mostra puntuacions millors que les del sistema basat en GMM, malgrat que aquest darrer va quedar entre els primers en l'avaluació NIST RT de 2006.
Per acabar, val la pena esmentar alguns resultats col·laterals d'aquest treball de tesi. Com que s'ha dut a terme en l'entorn del projecte europeu CHIL, l'autor ha estat responsable de l'organització de les avaluacions internacionals de classificació i detecció d'events acústics abans esmentades, liderant l'especificació de les classes d'events, les bases de dades, els protocols d'avaluació i, especialment, proposant i implementant les diverses mètriques utilitzades. A més a més, els sistemes de detecció
s'han implementat en la sala intel·ligent de la UPC, on funcionen en temps real a efectes de test i demostració.
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity.
Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, whose use is motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than with the GMM-based technique is obtained, arriving at a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results
are reported. The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since, in fact, it is an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes from this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in UPC's smart room and work in real time for purposes of testing and demonstration.
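A minimal sketch of detection by classifying a sliding window of frames and post-processing the window-level decisions into segments, which matches the overall shape of the systems described above; the window length, hop, minimum run length and the pluggable classifier object are illustrative assumptions, not the actual CLEAR submissions.

```python
import numpy as np

def detect_events(features, classifier, win=100, hop=50, min_run=3):
    """Classify a sliding window of frames, then post-process into event segments.

    features: (n_frames, dim); classifier: any object with a .predict() method
    returning one label per window vector (e.g. a trained SVM).
    Returns a list of (start_frame, end_frame, label) segments."""
    starts = range(0, len(features) - win + 1, hop)
    windows = np.stack([features[s:s + win].reshape(-1) for s in starts])
    labels = classifier.predict(windows)
    # Post-processing: merge consecutive identical labels, drop very short runs.
    segments, run_start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[run_start]:
            if i - run_start >= min_run:
                segments.append((run_start * hop, (i - 1) * hop + win, labels[run_start]))
            run_start = i
    return segments
```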
Kleynhans, Neil Taylor. „Automatic speech recognition for resource-scarce environments / N.T. Kleynhans“. Thesis, North-West University, 2013. http://hdl.handle.net/10394/9668.
Thesis (PhD (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013.
Udaya, Kumar Magesh Kumar. „Classification of Parkinson’s Disease using MultiPass Lvq, Logistic Model Tree, K-Star for Audio Data set : Classification of Parkinson Disease using Audio Dataset“. Thesis, Högskolan Dalarna, Datateknik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:du-5596.
Musti, Utpala. „Synthèse acoustico-visuelle de la parole par sélection d'unités bimodales“. Thesis, Université de Lorraine, 2013. http://www.theses.fr/2013LORR0003.
This work deals with audio-visual speech synthesis. In the vast literature available in this direction, many of the approaches deal with it by dividing it into two synthesis problems: one is acoustic speech synthesis, the other the generation of the corresponding facial animation. But this does not guarantee perfectly synchronous and coherent audio-visual speech. To overcome this drawback implicitly, we proposed a different approach to acoustic-visual speech synthesis by the selection of naturally synchronous bimodal units. The synthesis is based on the classical unit selection paradigm. The main idea behind this synthesis technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and database preparation for our system. We present an overview of our system and detail the various aspects of bimodal unit selection that need to be optimized for good synthesis. The main focus of this work is to synthesize the speech dynamics well rather than a comprehensive talking head. We describe the visual target features that we designed. We subsequently present an algorithm for target feature weighting. This algorithm performs target feature weighting and redundant feature elimination iteratively, based on the comparison of a target-cost-based ranking and a distance calculated from the acoustic and visual speech signals of units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesizing the speech dynamics reasonably well.
Spa, Carvajal Carlos. „Time-domain numerical methods in room acoustics simulations“. Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7565.
Der volle Inhalt der QuelleEn aquesta Tesi hem centrat el nostre anàlisis en els mètodes basats en el comportament ondulatori dins del domini temporal. Més concretament, estudiem en detall les formulacions més importants del mètode de Diferències Finites, el qual s'utilitza en moltes aplicacions d'acústica de sales, i el recentment proposat mètode PseudoEspectral de Fourier. Ambdós mètodes es basen en la formulació discreta de les equacions analítiques que descriuen els fenòmens acústics en espais tancats.
Aquesta obra contribueix en els aspectes més importants en el càlcul numèric de respostes impulsionals: la propagació del so, la generació de fonts i les condicions de contorn de reactància local.
Room acoustics is the science concerned with the study of the behavior of sound waves in enclosed rooms. The acoustic information of any room, the so-called impulse response, is expressed in terms of the acoustic field as a function of space and time. In general terms, it is nearly impossible to find analytical impulse responses of real rooms. Therefore, in recent years, the use of computers for solving this type of problem has emerged as a suitable alternative for calculating impulse responses.
In this Thesis we focus on the analysis of wave-based methods in the time domain. More concretely, we study in detail the main formulations of Finite-Difference methods, which have been used in many room acoustics applications, and the recently proposed Fourier Pseudo-Spectral methods. Both methods are based on discrete formulations of the analytical equations that describe the sound phenomena in enclosed rooms.
This work contributes to the main aspects in the computation of impulse responses: the wave propagation, the source generation and the locally-reacting boundary conditions.
Hacine-Gharbi, Abdenour. „Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole“. Phd thesis, Université d'Orléans, 2012. http://tel.archives-ouvertes.fr/tel-00843652.
Heinrich, Lisa Marie. „Acoustic-phonetic features in the speech of deaf women“. 1995. http://catalog.hathitrust.org/api/volumes/oclc/34556039.html.
Typescript. Description based on print version record. Includes bibliographical references (leaves 66-69).
Chien, To-Chang, and 錢鐸樟. „Integration of Acoustic and Linguistic Features for Maximum Entropy Speech Recognition“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/24325293971312481529.
Der volle Inhalt der Quelle國立成功大學
資訊工程學系碩博士班
93
In a traditional speech recognition system, we assume that the acoustic and linguistic information sources are independent. Parameters of the acoustic hidden Markov model (HMM) and the linguistic n-gram model are estimated individually and then combined to build a plug-in maximum a posteriori (MAP) classification rule. However, the acoustic model and the language model are correlated in essence. We should relax the independence assumption so as to improve speech recognition performance. In this study, we propose an integrated approach based on the maximum entropy (ME) principle, where acoustic and linguistic features are optimally combined in a unified framework. Using this approach, the associations between acoustic and linguistic features are explored and merged in the integrated models. On the issue of discriminative training, we also establish the relationship between ME and discriminative maximum mutual information (MMI) models. In addition, this ME integrated model is general, so that semantic topics and long-distance association patterns can be further combined. In the experiments, we apply the proposed ME model to broadcast news transcription using the MATBN database. In preliminary experimental results, we obtain improvements over a conventional speech recognition system based on the plug-in MAP classification rule.
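A minimal sketch of a log-linear, maximum-entropy-style combination of acoustic, linguistic and joint feature functions, which conveys the flavor of integrating the two knowledge sources in one model; the feature functions, weights and decoding loop are illustrative assumptions, not the thesis's actual feature set or training procedure.

```python
import numpy as np

def me_score(acoustic_logprob, lm_logprob, joint_features, lambdas):
    """Log-linear score of a hypothesis combining acoustic, linguistic and
    joint acoustic-linguistic feature functions.

    joint_features: values of extra feature functions capturing associations
    between the two knowledge sources (assumed precomputed).
    lambdas: one weight per feature function, trained e.g. by GIS or L-BFGS."""
    feats = np.concatenate(([acoustic_logprob, lm_logprob], joint_features))
    return float(np.dot(lambdas, feats))

def decode(hypotheses, lambdas):
    """Pick the hypothesis with the highest log-linear score
    (the analogue of the plug-in MAP decision rule)."""
    return max(hypotheses, key=lambda h: me_score(h["am"], h["lm"], h["joint"], lambdas))
```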
Tu, Tsung-Wei, and 涂宗瑋. „Speech Information Retrieval Using Support Vector Machines with Context and Acoustic Features“. Thesis, 2012. http://ndltd.ncl.edu.tw/handle/39598366759652069568.
Chen, Jia-Yu, and 陳佳妤. „Minimum Phone Error Training of Acoustic Models and Features for Large Vocabulary Mandarin Speech Recognition“. Thesis, 2006. http://ndltd.ncl.edu.tw/handle/63448829355525193378.
Der volle Inhalt der Quelle國立臺灣大學
電機工程學研究所
94
Traditional speech recognition uses maximum likelihood estimation to train the parameters of HMMs. Such a method gives the correct transcript the largest posterior probability; however, it cannot separate confusable models effectively. Discriminative training takes the correct transcript and the recognized result into consideration at the same time, trying to separate confusable models in a high-dimensional space. Based on minimum phone error (MPE) and feature-space minimum phone error (fMPE), the thesis introduces the background knowledge, basic theory and experimental results of discriminative training. The thesis has four parts. The first part is the basic theory, including risk estimation and auxiliary functions. Risk estimation starts from minimum Bayesian risk, introducing widely explored model training methods, including maximum likelihood estimation, maximum mutual information estimation, overall risk criterion estimation, and minimum phone error; these objective functions can be regarded as extensions of the Bayesian risk. In addition, the thesis reviews strong-sense and weak-sense auxiliary functions and smoothing functions. Strong-sense and weak-sense auxiliary functions can be used to find the optimal solution; when using a weak-sense auxiliary function, adding a smoothing function can improve the convergence speed. The second part is the experimental architecture, including the NTNU broadcast news corpus, the lexicon and the language model. The recognizer uses left-to-right, frame-synchronous tree-copy search to implement LVCSR. The thesis uses maximum likelihood training results with mel-frequency cepstral coefficients, and with features processed by heteroscedastic linear discriminant analysis, as baselines. The third part is minimum phone error. The method uses minimum phone error directly as the objective function. From the update equation we can see that the newly trained model parameters move closer to correctly recognized features (belonging to numerator lattices) and away from wrongly recognized features (belonging to denominator lattices). The I-smoothing technique introduces the model's prior to improve the estimation. In addition, the thesis introduces the approximation of phone error: how a lattice is used to approximate all recognized results, and how forward-backward algorithms are used to calculate the average accuracy. The experimental results show that this method can reduce the character error rate by 3% on the corpus. The fourth part is feature-space minimum phone error. The method projects features into a high-dimensional space and generates an offset vector that is added to the original feature, leading to better discrimination. The transform matrix is trained with the minimum phone error criterion, using gradient descent for the update; both direct and indirect differentials are used. The indirect differential reflects the model change on the features, so that feature training and model training can be done iteratively. Offset feature-space minimum phone error differs in the high-dimensional feature used; the method can save 1/4 of the computation and achieve similar improvement. The thesis also proposes dimension-weighted offset feature-space minimum phone error, which treats different dimensions with different weights. Experimental results show that these methods yield a 3% character error rate reduction; dimension-weighted offset feature-space minimum phone error gives larger improvements and is more robust in training.
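A minimal sketch of an fMPE-style offset computation as outlined above: a high-dimensional vector of Gaussian posteriors is computed per frame, projected by a matrix and added to the original feature; the posterior model and dimensions are illustrative assumptions, and the projection matrix is assumed to have been trained already with the MPE criterion.

```python
import numpy as np

def gaussian_posteriors(x, means, variances, weights):
    """Posterior probability of each Gaussian in a pool, per frame (the
    high-dimensional representation used by fMPE-style transforms)."""
    const = -0.5 * (np.log(2 * np.pi) * means.shape[1] + np.log(variances).sum(axis=1))
    diff = x[:, None, :] - means[None, :, :]
    loglik = np.log(weights) + const - 0.5 * (diff ** 2 / variances).sum(axis=2)
    loglik -= loglik.max(axis=1, keepdims=True)
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)

def fmpe_features(x, gauss_pool, M):
    """y_t = x_t + M h_t, where h_t is the posterior vector of frame t.

    gauss_pool: (means, variances, weights) of the Gaussian pool.
    M: (feature_dim, n_gaussians) projection matrix, assumed MPE-trained."""
    h = gaussian_posteriors(x, *gauss_pool)          # (n_frames, n_gaussians)
    return x + h @ M.T
```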
Chung, Cheng-Tao, und 鍾承道. „Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/p9p96r.
Der volle Inhalt der Quelle國立臺灣大學
電機工程學研究所
105
In the era of big data, huge quantities of raw speech data is easy to obtain, but annotated speech data remain hard to acquire. This leads to the increased importance of unsupervised learning scenarios where annotated data is not required, a typical application for which is the Query-by-Example Spoken Term Detection (QbE-STD). With the dominant paradigm of automatic speech recognition (ASR) technologies being supervised learning, such a scenario is still a relatively less explored area. In this thesis, we present the Hierarchical Paradigm and the Multi-granularity Paradigm for unsupervised discovery of structured acoustic tokens directly from speech corpora. The Hierarchical Paradigm attempts to jointly learn two level of representations that are correlated to phonemes and words. The Multi-granularity Paradigm makes no assumptions on which set of tokens to select, and seeks to capture all available information with multiple sets of tokens with different model granularities. Furthermore, unsupervised speech features can be extracted using the Multi-granular acoustic tokens with a framework which we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MAT-DNN). We unified the two paradigms in a single theoretical framework and performed query-by-example spoken term detection experiments on the token sets and frame-level features. The theories and principles on acoustic tokens and frame-level features proposed in this thesis are supported by competitive results against strong baselines on standard corpora using well-defined metrics.
WANG, SHANG-YU, und 王上瑜. „A Study of Applying Noise-Robust Features in Reduced Frame-Rate Acoustic Models for Speech Recognition“. Thesis, 2016. http://ndltd.ncl.edu.tw/handle/63485710426800421992.
Der volle Inhalt der Quelle國立暨南國際大學
電機工程學系
104
Speech recognition in mobile devices has been increasingly popular in our life, while it has to deal with the requirements of high recognition accuracy and low transmission load. One of the most challenging tasks for improving the recognition accuracy for real-world applications is to alleviate the noise effect, and one prominent way to reducing the transmission load is to make the speech features as compact as possible. In this study, we evaluate and explore the effectiveness of integrating the noise-robust speech feature representation with the reduced frame-rate acoustic model architecture. The used noise-robustness algorithms for improving features include cepstral mean subtraction (CMS), ceptral mean and variance normalization (MVN), histogram equalization (HEQ), cepstral gain normalization (CGN), MVN plus auto-regressive moving average filtering (MVA) and modulation spectrum power-law expansion (MSPLE). On the other hand, the adapted hidden Markov model (HMM) structure for reduced frame-rate (RFR) speech features, developed by Professor Lee-min Lee, is exploited in our evaluation task. The experiments conducted on the Aurora-2 digit database shows that: in the clean noise-free situation, the adapted HMM with the RFR features can provide comparable recognition accuracy relative to the non-adapted HMM with full frame-rate (FFR) features, while in the noisy situations, the noise-robustness algorithms work well in the RFR HMM scenarios and are capable of improving the recognition performance even when the RFR down-sampling ratio is as low as 1/4.
Kim, Yunjung. „Patterns of speech abnormality in a large dysarthria database : interactions between severity, acoustic features, and dysarthria type /“. 2007. http://www.library.wisc.edu/databases/connect/dissertations.html.
Albalkhi, Rahaf. „Articulation modelling of vowels in dysarthric and non-dysarthric speech“. Thesis, 2020. http://hdl.handle.net/1828/11771.
Lee, Yi-Hsuan, and 李依萱. „Relationship of Aspiration/ Unaspiration Features for Stops, Affricates and Speech Intelligibility In Esophageal and Pneumatic Device Speakers: An Acoustic and Perceptual Study“. Thesis, 2014. http://ndltd.ncl.edu.tw/handle/83321695086212692263.
Der volle Inhalt der Quelle國立臺北護理健康大學
聽語障礙科學研究所
103
The purpose of this study was to investigate and compare the acoustic and auditory perception parameters of esophageal speakers, pneumatic device speakers and normal laryngeal speakers. Acoustic parameters including voice onset time (VOT) and noise duration, auditory perception parameters including stops intelligibility and affricates intelligibility have been studied. Speech samples were recorded from 16 esophageal speakers, 18 pneumatic device speakers and 19 normal laryngeal speakers. Kruskal-Wallis test was used to analyze the differences of all acoustic and auditory perception parameters between 3 groups of participants, Spearman rank correlation coefficient was used to analyze the relationship between acoustic parameters and auditory perception parameters. Results of acoustic measurements revealed that acoustic parameters had significant differences between 3 groups. The VOT for unaspirated stops of both esophageal and pneumatic device speakers was significant higher than normal laryngeal speakers. The VOT for aspirated stops of normal speakers was significant higher than pneumatic device speakers. The noise duration for unaspirated affricates of esophageal speakers was significant higher than pneumatic device speakers. The noise duration for aspirated affricates of normal speakers was significant higher than pneumatic device speakers. No significant difference was found between the alaryngeal groups in relationship between acoustic and auditory perception parameters. The finding could provide references for clinical speech-language pathologists to execute speech assessment and rehabilitation for esophageal and pneumatic device speakers.
„Multi-resolution analysis based acoustic features for speech recognition =: 基於多尺度分析的聲學特徵在語音識別中的應用“. 1999. http://library.cuhk.edu.hk/record=b5890004.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1999.
Includes bibliographical references (leaves 134-137).
Text in English; abstracts in English and Chinese.
Chan Chun Ping.
Table of contents:
Chapter 1, Introduction: Automatic Speech Recognition; Review of Speech Recognition Techniques; Review of Signal Representation; Review of Wavelet Transform; Objective of Thesis; Thesis Outline; References.
Chapter 2, Baseline Speech Recognition System: Introduction; Feature Extraction; Hidden Markov Model for Speech Recognition (The Principle of Using HMM in Speech Recognition; Elements of an HMM; Parameter Estimation and Recognition Algorithm; Summary of HMM-based Speech Recognition); TIMIT Continuous Speech Corpus; Baseline Speech Recognition Experiments; Summary; References.
Chapter 3, Multi-Resolution Based Acoustic Features: Introduction; Discrete Wavelet Transform; Periodic Discrete Wavelet Transform; Multi-Resolution Analysis on STFT Spectrum; Principal Component Analysis (Related Work; Theoretical Background of PCA; Examples of Basis Vectors Found by PCA); Experiments for Multi-Resolution Based Features (Experiments with Clean Speech; Experiments with Noisy Speech); Summary; References.
Chapter 4, Wavelet Packet Based Acoustic Features: Introduction; Wavelet Packet Filter-Bank; Dimensionality Reduction; Filter-Bank Parameters (Mel-Scale Wavelet Packet Filter-Bank; Effect of Down-Sampling; Mel-Scale Wavelet Packet Tree; Wavelet Filters); Experiments Using Wavelet Packet Based Acoustic Features; Broad Phonetic Class Analysis; Discussion; Summary; References.
Chapter 5, De-Noising by Wavelet Transform: Introduction; De-Noising Capability of Wavelet Transform; Wavelet Transform Based Wiener Filtering (Sub-Band Position for Wiener Filtering; Estimation of Short-Time Speech and Noise Power); De-Noising Embedded in Wavelet Packet Filter-Bank; Experiments Using Wavelet Built-in De-Noising Properties; Discussion (Broad Phonetic Class Analysis; Distortion Measure); Summary; References.
Chapter 6, Conclusions and Future Work: Conclusions; Future Work; References.
Appendix 1: Jacobi's Method. Appendix 2: Broad Phonetic Class.
Chen, Chia-Ping, and 陳佳蘋. „Improved Speech Information Retrieval by Acoustic Feature Similarity“. Thesis, 2011. http://ndltd.ncl.edu.tw/handle/89018651792401837962.
SAKAI, Makoto, and 誠. 坂井. „Acoustic Feature Transformation Based on Generalized Criteria for Speech Recognition“. Thesis, 2010. http://hdl.handle.net/2237/14293.
Zolnay, András [Verfasser]. „Acoustic feature combination for speech recognition / vorgelegt von András Zolnay“. 2006. http://d-nb.info/982202156/34.
Chu, Chung Ling, and 朱忠玲. „Acoustic Modeling and Feature Normalization for Large Vocabulary Continuous Mandarin Speech Recognition“. Thesis, 2007. http://ndltd.ncl.edu.tw/handle/74973355073730968412.
Khan, W., Ping Jiang and David R. W. Holton. „Word spotting in continuous speech using wavelet transform“. 2014. http://hdl.handle.net/10454/10713.
Word spotting in continuous speech is considered a challenging issue due to the dynamic nature of speech. The literature contains a variety of novel techniques for isolated word recognition and spotting. Most of these techniques are based on pattern recognition and similarity measures. This paper amalgamates different techniques, including the wavelet transform, feature extraction and Euclidean distance. Based on the acoustic features, the proposed system is capable of identifying and localizing a target (test) word in continuous speech of any length. The wavelet transform is used for the time-frequency representation and filtering of the speech signal. Only high-intensity frequency components are passed to the feature extraction and matching process, resulting in robust performance in terms of matching as well as computational cost.
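A minimal sketch of spotting a target word by sliding its feature template over the feature sequence of a longer utterance and scoring each offset with an average Euclidean distance; the fixed-length matching and the detection threshold are illustrative assumptions, not the paper's exact matching scheme.

```python
import numpy as np

def spot_word(utterance_feats, template_feats, threshold):
    """Return frame offsets where the template matches the utterance.

    utterance_feats: (N, dim), template_feats: (M, dim) acoustic features
    (e.g. wavelet-based). The average per-frame Euclidean distance at each
    offset is compared against an assumed threshold."""
    n, m = len(utterance_feats), len(template_feats)
    hits = []
    for start in range(n - m + 1):
        segment = utterance_feats[start:start + m]
        dist = np.linalg.norm(segment - template_feats, axis=1).mean()
        if dist < threshold:
            hits.append((start, dist))
    return hits
```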
Molau, Sirko [Verfasser]. „Normalization in the acoustic feature space for improved speech recognition / vorgelegt von Sirko Molau“. 2003. http://d-nb.info/96913603X/34.
Tsai, Cheng-Yu, and 蔡政昱. „Mutual Reinforcement for Acoustic Tokens and Multi-level Acoustic Tokenizing Deep Neural Network for Unsupervised Speech Feature Extraction and Spoken Term Discovery“. Thesis, 2015. http://ndltd.ncl.edu.tw/handle/88386789472006613910.