Theses on the topic "Perceptual features for speech recognition"
Consult the top 50 theses for your research on the topic "Perceptual features for speech recognition".
Explore theses on a wide variety of disciplines and organize your bibliography correctly.
Haque, Serajul. "Perceptual features for speech recognition". University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0187.
Gu, Y. "Perceptually-based features in automatic speech recognition". Thesis, Swansea University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.637182.
Chu, Kam Keung. "Feature extraction based on perceptual non-uniform spectral compression for noisy speech recognition". Access full-text, abstract and table of contents, 2005. http://libweb.cityu.edu.hk/cgi-bin/ezdb/thesis.pl?mphil-ee-b19887516a.pdf.
"Submitted to Department of Electronic Engineering in partial fulfillment of the requirements for the degree of Master of Philosophy" Includes bibliographical references (leaves 143-147)
Koniaris, Christos. "Perceptually motivated speech recognition and mispronunciation detection". Doctoral thesis, KTH, Tal-kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-102321.
Funding: European Union FP6-034362 research project ACORNS; Computer-Animated language Teachers (CALATea).
Koniaris, Christos. "A study on selecting and optimizing perceptually relevant features for automatic speech recognition". Licentiate thesis, Stockholm : Kungliga Tekniska högskolan, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-11470.
Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition". Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.
Atassi, Hicham. "Rozpoznání emočního stavu z hrané a spontánní řeči" [Recognition of emotional state from acted and spontaneous speech]. Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2014. http://www.nusl.cz/ntk/nusl-233665.
Temko, Andriy. "Acoustic event detection and classification". Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.
Texto completosortides de diversos sistemes de classificació. Els sistemes de classificació d'events acústics
desenvolupats s'han testejat també mitjançant la participació en unes quantes avaluacions d'àmbit
internacional, entre els anys 2004 i 2006. La segona principal contribució d'aquest treball de tesi consisteix en el desenvolupament de sistemes de detecció d'events acústics. El problema de la detecció és més complex, ja que inclou tant la classificació dels sons com la determinació dels intervals temporals on tenen lloc. Es desenvolupen dues versions del sistema i es proven amb els conjunts de dades de les dues campanyes d'avaluació internacional CLEAR que van tenir lloc els anys 2006 i 2007, fent-se servir dos tipus de bases de dades: dues bases d'events acústics aïllats, i una base d'enregistraments de seminaris interactius, les quals contenen un nombre relativament elevat d'ocurrències dels events acústics especificats. Els sistemes desenvolupats, que consisteixen en l'ús de classificadors basats en SVM que operen dins
d'una finestra lliscant més un post-processament, van ser els únics presentats a les avaluacions
esmentades que no es basaven en models de Markov ocults (Hidden Markov Models) i cada un d'ells
va obtenir resultats competitius en la corresponent avaluació. La detecció d'activitat oral és un altre dels objectius d'aquest treball de tesi, pel fet de ser un cas particular de detecció d'events acústics especialment important. Es desenvolupa una tècnica de millora de l'entrenament dels SVM per fer front a la necessitat de reducció de l'enorme conjunt de dades existents. El sistema resultant, basat en SVM, és testejat amb uns quants conjunts de dades de l'avaluació NIST RT (Rich Transcription), on mostra puntuacions millors que les del sistema basat en GMM, malgrat que aquest darrer va quedar entre els primers en l'avaluació NIST RT de 2006.
Per acabar, val la pena esmentar alguns resultats col·laterals d'aquest treball de tesi. Com que s'ha dut a terme en l'entorn del projecte europeu CHIL, l'autor ha estat responsable de l'organització de les avaluacions internacionals de classificació i detecció d'events acústics abans esmentades, liderant l'especificació de les classes d'events, les bases de dades, els protocols d'avaluació i, especialment, proposant i implementant les diverses mètriques utilitzades. A més a més, els sistemes de detecció
s'han implementat en la sala intel·ligent de la UPC, on funcionen en temps real a efectes de test i demostració.
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity.
Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, which are motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than the GMM-based technique is obtained, arriving at a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed acoustic event classification systems are also tested by participating in several international evaluations from 2004 to 2006, and the results are reported.
The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex, since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since it is an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes from this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room and work in real time for purposes of testing and demonstration.
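The detection scheme described in this abstract (classification within a sliding window followed by post-processing) can be sketched as follows. This is only an illustrative toy: the window length, the mean-energy threshold standing in for the SVM, and the median-vote smoothing are assumptions of the sketch, not the thesis's actual configuration.

```python
import numpy as np

def sliding_windows(signal, win, hop):
    """Slice a 1-D feature stream into overlapping analysis windows."""
    starts = list(range(0, len(signal) - win + 1, hop))
    return np.stack([signal[s:s + win] for s in starts]), starts

def median_smooth(labels, k=3):
    """Post-process per-window decisions with a sliding median vote."""
    pad = k // 2
    padded = np.pad(labels, pad, mode="edge")
    return np.array([np.median(padded[i:i + k])
                     for i in range(len(labels))]).astype(int)

# Toy feature stream: an "acoustic event" is any high-energy region.
stream = np.concatenate([np.zeros(20), np.ones(30), np.zeros(20)])
wins, starts = sliding_windows(stream, win=10, hop=5)

# Stand-in for the SVM: threshold the mean energy of each window.
raw = (wins.mean(axis=1) > 0.5).astype(int)
smooth = median_smooth(raw)

# Convert per-window decisions back to time intervals.
events = [(starts[i], starts[i] + 10) for i in np.flatnonzero(smooth)]
```

Replacing the threshold with a trained classifier and merging the overlapping positive windows would give the usual detection output of (label, start, end) triples.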
Lileikytė, Rasa. "Quality estimation of speech recognition features". Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2012. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2012~D_20120302_090132-92071.
The accuracy of speech recognition systems depends on the features that describe the speech signals and on the properties of the classifiers that use those features. Traditionally, assessing recognition accuracy requires running recognition-accuracy computations for every chosen feature set and every classifier type. The amount of such work can be reduced by first evaluating the quality of the candidate features, so research on quality estimation of speech signal features was carried out. A method for evaluating the quality of speech recognition features, based on the use of three metrics, was investigated. It is shown that features selected in this way describe the quality of recognition systems in Euclidean space and reduce the amount of work required to evaluate recognition system quality. It is shown that the complexity of the feature quality evaluation algorithm is O(2R log2 R), whereas the complexity of the recognition-quality evaluation algorithm for a system using a dynamic time warping classifier is O(R^2), where R is the number of speech reference vectors. Experimental results confirmed the validity of the proposed method for evaluating the quality of speech recognition features.
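The O(R^2) count in this abstract refers to matching every one of R reference templates against every other with dynamic time warping. A minimal sketch of the classic DTW recurrence and the quadratic pairwise evaluation, with toy one-dimensional templates as an illustrative assumption:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

# Evaluating a DTW-based recogniser over R templates requires O(R^2)
# comparisons: each template is matched against every other one.
templates = [np.array([0., 1., 2.]),
             np.array([0., 1., 1., 2.]),
             np.array([3., 4., 5.])]
pairwise = [[dtw_distance(x, y) for y in templates] for x in templates]
```

The second template is a time-warped copy of the first, so their DTW distance is zero while the third remains far from both, which is exactly the behaviour a feature-quality metric would probe.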
Matthews, Iain. "Features for audio-visual speech recognition". Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.
Droppo, J. G. "Time-frequency features for speech recognition". Thesis, Connect to this title online; UW restricted, 2000. http://hdl.handle.net/1773/5965.
Ore, Brian M. "Multilingual Articulatory Features for Speech Recognition". Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1176169264.
Leung, Ka Yee. "Combining acoustic features and articulatory features for speech recognition". View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20LEUNGK.
Includes bibliographical references (leaves 92-96). Also available in electronic version. Access restricted to campus users.
Iliev, Alexander Iliev. "Emotion Recognition Using Glottal and Prosodic Features". Scholarly Repository, 2009. http://scholarlyrepository.miami.edu/oa_dissertations/515.
Texto completoMossmyr, Simon. "Noisy recognition of perceptual mid-level features in music". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-294229.
Noisy student training is a semi-supervised self-training method that achieved notable accuracy on the ImageNet image recognition benchmark. It applies data augmentation and model noise while fitting the model to a large amount of artificially labelled training data together with the ordinary training data. In this thesis we use noisy student training to train a VGG-style convolutional network on a dataset of music pieces annotated with perceptual mid-level features. To this end, we first experiment with data augmentation and find that pitch shifting, time stretching and time shifting (applied directly to the spectrograms of the music pieces) can increase the model's tolerance to variations in the data. We also experiment with stochastic depth, a method that deactivates entire layers of a neural network during training, and find that it too can increase the model's tolerance. This is a novel use of stochastic depth, since to our knowledge the method has so far only been applied to ResNet variants. Finally, we apply noisy student training with the aforementioned methods and observe a considerable reduction in the model's error, although its overall performance can be questioned.
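The spectrogram-domain augmentations mentioned in this abstract can be sketched directly as array shifts. The toy spectrogram shape and shift amounts below are illustrative assumptions; pitch shifting is approximated as a shift along the frequency axis with zero padding, and time shifting as a circular shift along the time axis:

```python
import numpy as np

def pitch_shift(spec, bins):
    """Shift a (freq, time) spectrogram along the frequency axis,
    zero-padding the bins that slide out of range."""
    out = np.zeros_like(spec)
    if bins >= 0:
        out[bins:, :] = spec[:spec.shape[0] - bins, :]
    else:
        out[:bins, :] = spec[-bins:, :]
    return out

def time_shift(spec, frames):
    """Circularly shift a spectrogram along the time axis."""
    return np.roll(spec, frames, axis=1)

spec = np.arange(12, dtype=float).reshape(3, 4)  # toy 3-bin x 4-frame spectrogram
up = pitch_shift(spec, 1)        # one bin upward
shifted = time_shift(spec, 2)    # two frames later
```

Time stretching would additionally resample the time axis; all three transforms leave the annotation unchanged, which is what makes them usable as label-preserving noise in self-training.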
Saenko, Ekaterina 1976. "Articulatory features for robust visual speech recognition". Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28736.
Includes bibliographical references (p. 99-105).
This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling.
by Ekaterina Saenko.
S.M.
Väyrynen, E. (Eero). "Emotion recognition from speech using prosodic features". Doctoral thesis, Oulun yliopisto, 2014. http://urn.fi/urn:isbn:9789526204048.
Abstract: Emotion recognition is a key area of affective computing. It aims to uncover the emotional messages contained in human communication, e.g. through visual, auditory and/or physiological cues. Speech is the most important mode of human communication and thus plays a primary role in the correct semantic and emotional interpretation of messages. Emotional information in speech is conveyed largely as continuous paralinguistic messaging, of which prosody is the most important component. The current inability of human-computer interaction to interpret affect and emotion still limits the functionality of technological devices and their user experience. In this thesis, prosodic and acoustic features of speech were used to recognise the emotional content of spoken Finnish. Emotion recognition methods based on the prosodic features of long speech samples were developed. For recognising the emotional content of short utterances, a method based on information fusion was developed, combining prosody with voice-source quality features. In addition, a technological framework was developed for visualising emotional speech by means of prosodic features. The study achieved automatic emotion recognition performance comparable to human ability for a small set of basic emotions (neutral, sad, happy and angry). Emotion recognition performance for spoken Finnish was found to be comparable to that reported for Western European languages. For short utterances, better performance was achieved with the fusion method. A trainable non-linear manifold modelling technique developed for visualising emotional speech was able to produce a visual structure of the data resembling the dimensional model of emotion.
The low-dimensional mapping was further shown to preserve the topological structures of both the emotional classes and the emotional intensity of the research data. In this thesis, pattern-recognition-based technology was developed for recognising emotional speech from both long and short speech samples. The technological framework developed for visualising and classifying emotional data can also be used to represent speech data according to other semantic structures.
Rankin, D. "Extraction of features from speech spectra". Thesis, Queen's University Belfast, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.373541.
Domont, Xavier. "Hierarchical spectro-temporal features for robust speech recognition". Münster Verl.-Haus Monsenstein und Vannerdat, 2009. http://d-nb.info/1001282655/04.
Lal, Partha. "Cross-lingual automatic speech recognition using tandem features". Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5773.
Harte, Naomi Antonia. "Segmental phonetic features and models for speech recognition". Thesis, Queen's University Belfast, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.287466.
Schuy, Lars. "Speech features and their significance in speaker recognition". Thesis, University of Sussex, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.288845.
Necioğlu, Burhan F. "Objectively measured descriptors for perceptual characterization of speakers". Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/15035.
Savvides, Vasos E. "Perceptual models in speech quality assessment and coding". Thesis, Loughborough University, 1988. https://dspace.lboro.ac.uk/2134/36273.
Juneja, Amit. "Speech recognition based on phonetic features and acoustic landmarks". College Park, Md. : University of Maryland, 2004. http://hdl.handle.net/1903/2148.
Thesis research directed by: Electrical Engineering. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
Alencar, Vladimir Fabregas Surigue de. "Efficient Features and Interpolation Domains in Distributed Speech Recognition". Pontifícia Universidade Católica do Rio de Janeiro, 2005. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=6201@1.
The huge growth of the Internet and cellular mobile communication systems has stimulated great interest in applications of speech processing in these networks. An important problem in this field is speech recognition in a server system, based on the acoustic parameters calculated and quantized in the user terminal (Distributed Speech Recognition). Since these parameters are not the best-suited features for the remote recognition system, it is important to examine different transformations of them that allow better recogniser performance. This dissertation is concerned with the extraction of efficient recognition features from the coder parameters used in cellular mobile networks and IP networks. In addition, since the rate at which parameters must be supplied to the speech recogniser is usually higher than the rate at which the codec generates them, it is important to analyze the effect of parameter interpolation on the performance of the recognition system, as well as the best domain in which this interpolation should be carried out. These are the other topics presented in this dissertation.
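The rate mismatch described here can be sketched with simple linear interpolation of each coefficient track. The rates (one codec vector per 20 ms, one recogniser frame per 10 ms) and the two-coefficient toy vectors are illustrative assumptions; the dissertation's actual point is comparing the domains (e.g. the parameter representations) in which such interpolation works best:

```python
import numpy as np

codec_rate, asr_rate = 50, 100   # Hz: one vector per 20 ms vs one per 10 ms
params = np.array([[0.0, 10.0],
                   [2.0, 14.0],
                   [4.0, 18.0]])  # 3 codec frames x 2 coefficients

t_codec = np.arange(len(params)) / codec_rate
t_asr = np.arange(2 * len(params) - 1) / asr_rate

# Interpolate each coefficient track independently onto the ASR time grid.
upsampled = np.stack([np.interp(t_asr, t_codec, params[:, c])
                      for c in range(params.shape[1])], axis=1)
```

Running the same interpolation in different parameter domains (and converting afterwards) is what makes the choice of interpolation domain an experimental question.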
Meng, Helen M. "The use of distinctive features for automatic speech recognition". Thesis, Massachusetts Institute of Technology, 1991. http://hdl.handle.net/1721.1/13279.
Schutte, Kenneth Thomas 1979. "Parts-based models and local features for automatic speech recognition". Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53301.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 101-108).
While automatic speech recognition (ASR) systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. This thesis revisits the basic acoustic modeling assumptions common to most ASR systems and argues that improvements to the underlying model of speech are required to address these shortcomings. A number of problems with the standard method of hidden Markov models (HMMs) and features derived from fixed, frame-based spectra (e.g. MFCCs) are discussed. Based on these problems, a set of desirable properties of an improved acoustic model are proposed, and we present a "parts-based" framework as an alternative. The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles. We discuss the proposed model's relationship to HMMs and segment-based recognizers, and describe how they can be viewed as special cases of the PBM. Two variations of PBMs are described in detail. The first represents each phonetic unit with a set of time-frequency (T-F) "patches" which act as filters over a spectrogram. The model structure encodes the patches' relative T-F positions. The second variation, referred to as a "speech schematic" model, more directly encodes the information in a spectrogram by using simple edge detectors and focusing more on modeling the constraints between parts.
We demonstrate the proposed models on various isolated recognition tasks and show the benefits over baseline systems, particularly in noisy conditions and when only limited training data is available. We discuss efficient implementation of the models and describe how they can be combined to build larger recognition systems. It is argued that the flexible templates used in parts-based modeling may provide a better generative model of speech than typical HMMs.
by Kenneth Thomas Schutte.
Ph.D.
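The first parts-based variant in this abstract uses time-frequency "patches" that act as filters over a spectrogram. A minimal sketch of a single patch response map follows; the toy spectrogram, the all-ones patch, and the plain dot-product response are assumptions of the sketch, not Schutte's actual detectors:

```python
import numpy as np

def patch_response(spec, patch):
    """Slide a time-frequency patch over a (freq, time) spectrogram and
    return the response map (dot product at each placement)."""
    pf, pt = patch.shape
    nf, nt = spec.shape
    out = np.zeros((nf - pf + 1, nt - pt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + pf, j:j + pt] * patch)
    return out

spec = np.zeros((6, 8))
spec[2:4, 3:5] = 1.0            # a localized burst of energy
patch = np.ones((2, 2))         # a detector tuned to such a burst
resp = patch_response(spec, patch)
best = np.unravel_index(np.argmax(resp), resp.shape)  # strongest placement
```

In the full model, many such patch responses and the constraints on their relative time-frequency positions are combined in a graphical model rather than read off individually.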
Tang, Min Ph D. Massachusetts Institute of Technology. "Large vocabulary continuous speech recognition using linguistic features and constraints". Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33203.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (leaves 111-123).
Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and organization of their meanings (semantics, pragmatics, etc). The other governs how speech signals are related to words, a process often termed "lexical access". This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework and its application to continuous speech from an open vocabulary. The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints. In the original Huttenlocher-Zue model, the authors had proposed six linguistic features motivated by the manner of pronunciation. The first pass classifies speech signals into a sequence of linguistic features, and only words that match this sequence - the cohort - are activated. The second pass performs a detailed acoustic phonetic analysis within the cohort to decide the identity of the word. This model differs from the lexical access model nowadays commonly employed in speech recognizers, where detailed acoustic phonetic analysis is performed directly and lexical items are retrieved in one pass. The thesis first studies the implementation issues of the Huttenlocher-Zue model. A number of extensions to the original proposal are made to take advantage of the existing facilities of a probabilistic, graph-based recognition framework and, more importantly, to model the broad linguistic features in a data-driven approach. First, we analyze speech signals along the two diagonal dimensions of manner and place of articulation, rather than the manner dimension alone.
Secondly, we adopt a set of feature-based landmarks optimized for data-driven modeling as the basic recognition units, and Gaussian mixture models are trained for these units. We explore information fusion techniques to integrate constraints from both the manner and place dimensions, as well as examining how to integrate constraints from the feature-based first pass with the second pass of detailed acoustic phonetic analysis. Our experiments on a large-vocabulary isolated word recognition task show that, while constraints from each individual feature dimension provide only limited help in this lexical access model, the utilization of both dimensions and information fusion techniques leads to significant performance gain over a one-pass phonetic system. The thesis then proposes to generalize the original Huttenlocher-Zue model, which limits itself to only isolated word tasks, to handle continuous speech. With continuous speech, the search space for both stages is infinite if all possible word sequences are allowed. We generalize the original cohort idea from the Huttenlocher-Zue proposal and use the bag of words of the N-best list of the first pass as cohorts for continuous speech. This approach transfers the constraints of broad linguistic features into a much reduced search space for the second stage. The thesis also studies how to recover from errors made by the first pass, which is not discussed in the original Huttenlocher-Zue proposal. In continuous speech recognition, a way of recovering from errors made in the first pass is vital to the performance of the overall system. We find empirical evidence that such errors tend to occur around function words, possibly due to the lack of prominence, in meaning and hence in linguistic features, of such words. This thesis proposes an error-recovery mechanism based on empirical analysis on a development set for the two-pass lexical access model.
Our experiments on a medium-sized, telephone-quality continuous speech recognition task achieve higher accuracy than a state-of-the-art one-pass baseline system. The thesis applies the generalized two-pass lexical access model to the challenge of recognizing continuous speech from an open vocabulary. Telephony information query systems often need to deal with a large list of words that are not observed in the training data, for example the city names in a weather information query system. The large portion of vocabulary unseen in the training data - the open vocabulary - poses a serious data-sparseness problem to both acoustic and language modeling. A two-pass lexical access model provides a solution by activating a small cohort within the open vocabulary in the first pass, thus significantly reducing the data-sparseness problem. Also, the broad linguistic constraints in the first pass generalize better to unseen data compared to finer, context-dependent acoustic phonetic models. This thesis also studies a data-driven analysis of acoustic similarities among open vocabulary items. The results are used for recovering possible errors in the first pass. This approach demonstrates an advantage over a two-pass approach based on specific semantic constraints. In summary, this thesis implements the original Huttenlocher-Zue two-pass lexical access model in a modern probabilistic speech recognition framework. This thesis also extends the original model to recognize continuous speech from an open vocabulary, with our two-stage model achieving a better performance than the baseline system. In the future, sub-lexical linguistic hierarchy constraints, such as syllables, can be introduced into this two-pass model to further improve the lexical access performance.
by Min Tang.
Ph.D.
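The two-pass idea in this abstract (a broad-class first pass prunes the lexicon to a cohort before any detailed analysis) can be sketched with a toy lexicon. The manner classes, phone inventory and words below are all hypothetical illustrations (the spelling doubles as the phone string), not Tang's actual feature set or dictionary:

```python
# Hypothetical broad "manner" classes for a toy phone inventory.
MANNER = {"s": "fric", "f": "fric", "t": "stop", "d": "stop",
          "a": "vowel", "i": "vowel", "n": "nasal", "m": "nasal"}

# Toy lexicon: word -> phone string (here, simply its spelling).
LEXICON = {"sad": "sad", "fit": "fit", "man": "man", "sit": "sit"}

def broad_classes(phones):
    """First-pass representation: map each phone to its manner class."""
    return tuple(MANNER[p] for p in phones)

def cohort(observed_classes):
    """Activate only the lexical items whose broad-class sequence
    matches the first-pass output; the second pass searches only these."""
    return sorted(w for w, phones in LEXICON.items()
                  if broad_classes(phones) == observed_classes)

# First-pass output for some utterance: fricative, vowel, stop.
hyp = cohort(("fric", "vowel", "stop"))
```

The point of the scheme is visible even at this scale: a three-symbol broad-class sequence discards "man" outright, leaving the detailed (and expensive) second pass to distinguish only among "fit", "sad" and "sit".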
Civile, Ciro. "The face inversion effect and perceptual learning : features and configurations". Thesis, University of Exeter, 2013. http://hdl.handle.net/10871/13564.
Weatherholtz, Kodi. "Perceptual learning of systemic cross-category vowel variation". The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1429782580.
Jeon, Woojay. "Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System". Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/14061.
Bruijn, Christina Geertruida de. "Voice quality after dictation to speech recognition software : a perceptual and acoustic study". Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.440907.
Javadi, Ailar. "Bio-inspired noise robust auditory features". Thesis, Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/44801.
Berg, Brian LaRoy. "Investigating Speaker Features From Very Short Speech Records". Diss., Virginia Tech, 2001. http://hdl.handle.net/10919/28691.
Ph. D.
Clark, Tracy M. "A Study of Features and Processes Towards Real-time Speech Word Recognition". Thesis, University of Canterbury. Electrical and Electronic Engineering, 1993. http://hdl.handle.net/10092/7561.
Full text
Peso, Pablo. "Spatial features of reverberant speech : estimation and application to recognition and diarization". Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/45664.
Full text
Sidorova, Julia. "Optimization techniques for speech emotion recognition". Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7575.
Texto completoThe first contribution of this thesis is a speech emotion recognition system called the ESEDA capable of recognizing emotions in di®erent languages. The second contribution is the classifier TGI+. First objects are modeled by means of a syntactic method and then, with a statistical method the mappings of samples are classified, not their feature vectors. The TGI+ outperforms the state of the art top performer on a benchmark data set of acted emotions. The third contribution is high-level features, which are distances from a feature vector to the tree automata accepting class i, for all i in the set of class labels. The set of low-level features and the set of high-level features are concatenated and the resulting set is submitted to the feature selection procedure. Then the classification step is done in the usual way. Testing on a benchmark dataset of authentic emotions showed that this classification strategy outperforms the state of the art top performer.
Lareau, Jonathan. "Application of shifted delta cepstral features for GMM language identification /". Electronic version of thesis, 2006. https://ritdml.rit.edu/dspace/handle/1850/2686.
Full text
Gangireddy, Siva Reddy. "Recurrent neural network language models for automatic speech recognition". Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28990.
Full text
Wong, Jimmy Pui Fung. "The use of prosodic features in Chinese speech recognition and spoken language processing /". View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202003%20WONG.
Full text
Includes bibliographical references (leaves 97-101). Also available in electronic version. Access restricted to campus users.
Sun, Rui. "The evaluation of the stability of acoustic features in affective conveyance across multiple emotional databases". Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/49041.
Full text
LUDWICZAK, LEIGH ANN. "CHILDRENS' FIRST FIVE WORDS: AN ANALYSIS OF PERCEPTUAL FEATURES, GRAMMATICAL CATEGORIES, AND COMMUNICATIVE INTENTIONS". University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin990647609.
Full text
SIQUEIRA, JAN KRUEGER. "CONTINUOUS SPEECH RECOGNITION WITH MFCC, SSCH AND PNCC FEATURES, WAVELET DENOISING AND NEURAL NETWORKS". PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2011. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=19143@1.
Full text
One of the biggest challenges on the continuous speech recognition field is to develop systems that are robust to additive noise. To do so, this work analyses and tests three techniques. The first one extracts features from the voice signal using the MFCC, SSCH and PNCC methods. The second one removes noise from the voice signal through wavelet denoising. The third one is an original one, called feature denoising, that seeks to improve the extracted features using a set of neural networks. Although some of these techniques are already known in the literature, the combination of them brings many interesting and new results. In fact, it is noticed that the best performance comes from the union of PNCC and feature denoising.
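The wavelet-denoising step mentioned in this abstract can be illustrated with a one-level Haar transform and soft thresholding. This is a generic stand-in, not the thesis's actual implementation (which wavelet, depth, and threshold rule are unspecified here): transform, shrink the detail coefficients where most additive noise lives, and invert.

```python
def haar_denoise(signal, threshold):
    """One-level Haar wavelet denoising with soft thresholding."""
    assert len(signal) % 2 == 0, "one-level Haar needs an even-length signal"
    # Forward transform: pairwise averages (approximation) and differences (detail).
    approx = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]

    def soft(d):  # soft thresholding: zero small values, shrink large ones
        return 0.0 if abs(d) <= threshold else (abs(d) - threshold) * (1 if d > 0 else -1)

    detail = [soft(d) for d in detail]
    # Inverse transform: rebuild each sample pair from approximation and detail.
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out
```

With threshold 0 the transform reconstructs the input exactly; with a positive threshold, small high-frequency wiggles are flattened while the broad shape survives.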
Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments". 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.
Full text
Chan, Oscar. "Prosodic features for a maximum entropy language model". University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0244.
Full text
Juzwin, Kathryn Rossetto. "The effects of perceptual interference and noninterference on facial recognition based on outer and inner facial features". Virtual Press, 1986. http://liblink.bsu.edu/uhtbin/catkey/447843.
Full text
Christensen, Carl V. "Fluency Features and Elicited Imitation as Oral Proficiency Measurement". BYU ScholarsArchive, 2012. https://scholarsarchive.byu.edu/etd/3114.
Full text
Wang, Yihan. "Automatic Speech Recognition Model for Swedish using Kaldi". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285538.
Full text
Several solutions for automatic transcription exist on the market, but a large share of them do not support Swedish because of its relatively small number of speakers. In this project, automatic transcription for Swedish was built with hidden Markov models and Gaussian mixture models using Kaldi, to enable ICABanken to classify calls to its customer service. A range of model variations with different phoneme-combination methods, feature computations, and data-processing methods was explored. Word error rate and real-time factor were chosen as evaluation criteria to compare the accuracy and speed of the models. For large-vocabulary continuous transcription, triphone models perform far better than monophone models. Applying transformations improves both accuracy and speed; the combination of linear discriminant analysis, maximum likelihood linear transform, and speaker adaptive training gives the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients yield better accuracy, while perceptual linear prediction tends to increase speed.
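The two evaluation criteria named in this abstract, word error rate and real-time factor, are standard and can be computed from scratch. A minimal sketch (a real system would use a toolkit's scoring tools, e.g. Kaldi's):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    dp = list(range(len(h) + 1))  # dp[j] = distance between r[:0] and h[:j]
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i    # prev holds the old dp[i-1][j-1]
        for j in range(1, len(h) + 1):
            prev, dp[j] = dp[j], min(prev + (r[i - 1] != h[j - 1]),  # substitution
                                     dp[j - 1] + 1,                  # insertion
                                     dp[j] + 1)                      # deletion
    return dp[len(h)] / max(len(r), 1)

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds
```

For example, one substituted word in a three-word reference gives a WER of 1/3, and decoding 60 s of audio in 30 s gives an RTF of 0.5.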
Pietrzyk, Mariusz W. "Spatial frequency analysis of the perceptual features involved in pulmonary nodule detection and recognition from posterior-anterior chest radiographs". Thesis, Lancaster University, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556697.
Full text