Dissertations / Theses on the topic 'Perceptual features for speech recognition'

Consult the top 50 dissertations / theses for your research on the topic 'Perceptual features for speech recognition.'

1

Haque, Serajul. "Perceptual features for speech recognition." University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0187.

Abstract:
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology and research. It is also known as the recognition of speech by a machine or by artificial intelligence. However, in spite of focused research in this field for the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variabilities, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced by temporal processing of speech using a zero-crossing algorithm. It is observed that a high-frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low-frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated as observed in the psychoacoustic input/output (I/O) perception curves. A frequency-dependent asymmetric compression technique, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed. By asymmetric compression, degradation of the spectral contrast of the low-frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on the auditory processing in the cochlear nucleus is presented. The processing for synchrony detection, average discharge (mean rate) processing and two-tone suppression is segregated and performed separately at the feature extraction level, according to the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating the synchrony detection from the synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end. An auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. The performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
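As a rough illustration of the spectral subtraction idea discussed in this abstract, the sketch below implements plain magnitude-domain spectral subtraction over STFT frames. The thesis's psychoacoustically derived masking threshold is stood in for by a simple over-subtraction factor and spectral floor; all names and parameter values are illustrative, not taken from the thesis.

```python
import numpy as np

def spectral_subtract(stft, noise_psd, alpha=2.0, beta=0.01):
    """Plain magnitude-domain spectral subtraction over STFT frames.

    stft      : complex spectrogram, shape (num_frames, num_bins)
    noise_psd : noise power estimate per bin, shape (num_bins,)
    alpha     : over-subtraction factor
    beta      : spectral floor; a crude stand-in for a masking threshold
    """
    power = np.abs(stft) ** 2
    clean = power - alpha * noise_psd        # subtract the noise estimate
    clean = np.maximum(clean, beta * power)  # floor to avoid negative power
    return np.sqrt(clean) * np.exp(1j * np.angle(stft))  # reuse noisy phase

# noise_psd can be estimated from leading noise-only frames, e.g.:
# noise_psd = (np.abs(stft[:10]) ** 2).mean(axis=0)
```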
2

Gu, Y. "Perceptually-based features in automatic speech recognition." Thesis, Swansea University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.637182.

Abstract:
Interspeaker variability of speech features is one of the most important problems in automatic speech recognition (ASR), and makes speaker-independent systems much more difficult to achieve than speaker-dependent ones. The work described in the thesis examines two ideas to overcome this problem. The first attempts to extract more reliable speech features by perceptually-based modelling; the second investigates the speaker variability in this speech feature and reduces its effects by a speaker normalisation scheme. The application of human speech perception to automatic speech recognition is discussed in the thesis. Several perceptually-based feature analysis techniques are compared in terms of recognition performance, and the effects of the individual perceptual parameters encompassed in the feature analysis are investigated. The work demonstrates the benefits of perceptual feature analysis (particularly the perceptually-based linear predictive approach) compared with the conventional linear predictive analysis technique. The proposal for speaker normalisation is based on a regional-continuous linear matrix transform function on the perceptual feature space, with an automatic feature classification. This approach is applied in an ASR adaptation system. It is shown that the recognition error rate falls rapidly when only a few words or a single sentence are used for adaptation. The adaptation performance demonstrates that such an approach could be very promising for a large-vocabulary speaker-independent system.
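The "regional-continuous linear matrix transform" itself is specific to the thesis, but its basic building block, an affine map estimated from a small amount of time-aligned adaptation data, can be sketched as follows. The function names and the use of plain least squares are assumptions for illustration.

```python
import numpy as np

def estimate_transform(X_new, X_ref):
    """Least-squares affine map from a new speaker's features onto a
    reference feature space; X_new and X_ref are time-aligned frames
    (e.g. via DTW), each of shape (n_frames, dim)."""
    X = np.hstack([X_new, np.ones((len(X_new), 1))])  # affine bias term
    A, *_ = np.linalg.lstsq(X, X_ref, rcond=None)
    return A

def normalise(X_new, A):
    """Apply the adaptation transform to unseen frames."""
    X = np.hstack([X_new, np.ones((len(X_new), 1))])
    return X @ A
```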
3

Chu, Kam Keung. "Feature extraction based on perceptual non-uniform spectral compression for noisy speech recognition /." access full-text access abstract and table of contents, 2005. http://libweb.cityu.edu.hk/cgi-bin/ezdb/thesis.pl?mphil-ee-b19887516a.pdf.

Abstract:
Thesis (M.Phil.)--City University of Hong Kong, 2005.
"Submitted to Department of Electronic Engineering in partial fulfillment of the requirements for the degree of Master of Philosophy" Includes bibliographical references (leaves 143-147)
4

Koniaris, Christos. "Perceptually motivated speech recognition and mispronunciation detection." Doctoral thesis, KTH, Tal-kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-102321.

Abstract:
This doctoral thesis is the result of a research effort performed in two fields of speech technology, i.e., speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. Hence, it is possible to use recent findings from psychoacoustic perception together with mathematical and computational tools to model the auditory sensitivities to small speech signal changes. The performance of an automatic speech recognition system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly for noisy environments. The goal is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select or optimize features. Experiments with a practical speech recognizer confirm the validity of the principle. An approach to improving mel frequency cepstrum coefficients (MFCCs) through offline optimization is also shown. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions. The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, relatively easily, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for various phonemes, do not significantly change the perception of the language's own sounds. Several methods are introduced, based on similarity measures between the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output, to identify the problematic phonemes for a given speaker. The methods are tested for groups of speakers from different languages and evaluated according to a theoretical linguistic study showing that they can capture many of the problematic phonemes that speakers from each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods.
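A minimal sketch of the geometric principle described here, under the assumption that "similarity of geometries" is scored by correlating pairwise Euclidean distances; the papers' actual criterion may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def geometry_similarity(features, auditory_out):
    """Correlate the pairwise Euclidean geometry of a candidate feature
    space with that of an auditory-model output; rows of both arrays
    must describe the same speech frames."""
    r, _ = pearsonr(pdist(features), pdist(auditory_out))
    return r  # higher r = feature geometry closer to perceptual geometry
```

Candidate feature sets or transforms can then be compared by this score alone, avoiding a full recognition run per candidate.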

European Union FP6-034362 research project ACORNS
Computer-Animated language Teachers (CALATea)
5

Koniaris, Christos. "A study on selecting and optimizing perceptually relevant features for automatic speech recognition." Licentiate thesis, Stockholm : Kungliga Tekniska högskolan, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-11470.

6

Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition." Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.

Abstract:
In automatic speech recognition systems (ASRs), training is a critical phase to the system's success. Communication media, either analog (such as analog landline phones) or digital (VoIP), distort the speaker's speech signal, often in very complex ways: linear distortion occurs in all channels, either in the magnitude or phase spectrum. Non-linear but time-invariant distortion will always appear in all real systems. In digital systems we also have network effects, which produce packet losses, delays and repeated packets. Finally, one cannot really assert what path a signal will take, and so having error or distortion in between is almost a certainty. The channel introduces an acoustical mismatch between the speaker's signal and the trained data in the ASR, which results in poor recognition performance. The approach so far has been to try to undo the havoc produced by the channels, i.e. compensate for the channel's behavior. In this thesis, we try to characterize the effects of different transmission media and use that as an inexpensive and repeatable way to train ASR systems.
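A toy illustration of this idea: synthesize channel-distorted training data by passing clean speech through a linear filter, a mild time-invariant non-linearity and additive noise. The filter coefficients and noise level below are arbitrary placeholders, not values from the thesis.

```python
import numpy as np
from scipy.signal import lfilter

def simulate_channel(speech, rng=None):
    """Pass clean speech through a crude telephone-like channel:
    linear distortion, a mild time-invariant non-linearity, and
    additive noise, so distorted training data can be synthesized."""
    if rng is None:
        rng = np.random.default_rng(0)
    linear = lfilter([1.0, -0.7], [1.0, -0.2], speech)  # linear distortion
    nonlin = np.tanh(1.5 * linear) / 1.5                # memoryless non-linearity
    return nonlin + rng.normal(0.0, 0.01, size=speech.shape)
```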
7

Atassi, Hicham. "Rozpoznání emočního stavu z hrané a spontánní řeči." Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2014. http://www.nusl.cz/ntk/nusl-233665.

Abstract:
This doctoral thesis deals with the recognition of the emotional state of speakers from the speech signal. The work is divided into two main parts: the first part describes the proposed methods for recognising emotional states from acted databases. Within this part, recognition results obtained using two different databases with different languages are presented. The main contributions of this part are a detailed analysis of a wide range of features extracted from the speech signal, the design of new classification architectures such as "emotion pairing", and a new method for mapping discrete emotional states into a two-dimensional space. The second part deals with the recognition of emotional states from a database of spontaneous speech obtained from recordings of calls in real call centres. The findings from the analysis and design of recognition methods for acted speech were used to design a new system for recognising seven spontaneous emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The thesis further examines the influence of the speaker's emotional state on the accuracy of gender recognition, and the design of a system for the automatic detection of successful calls in call centres based on an analysis of the dialogue parameters between the call participants.
8

Temko, Andriy. "Acoustic event detection and classification." Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.

Abstract:
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, a choice motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than the GMM-based technique is obtained, arriving at a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results are reported. The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since it is an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes of this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room and work in real time for purposes of testing and demonstration.
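A compact sketch of the sliding-window detection scheme described above, assuming a pre-trained scikit-learn SVM over window-pooled features and a simple majority filter as post-processing; the thesis's actual features and post-processing are richer than this.

```python
import numpy as np
from sklearn.svm import SVC

def detect_events(frame_feats, clf, win=50, hop=25):
    """Classify sliding windows of frame features, then smooth the label
    sequence with a 3-point majority filter as post-processing."""
    labels = []
    for start in range(0, len(frame_feats) - win + 1, hop):
        pooled = frame_feats[start:start + win].mean(axis=0)  # pool frames
        labels.append(clf.predict(pooled[None, :])[0])
    labels = np.asarray(labels)
    smoothed = labels.copy()
    for i in range(1, len(labels) - 1):
        vals, counts = np.unique(labels[i - 1:i + 2], return_counts=True)
        smoothed[i] = vals[counts.argmax()]
    return smoothed

# clf would be trained beforehand on pooled, labelled windows, e.g.:
# clf = SVC(kernel='rbf').fit(train_windows, train_labels)
```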
9

Lileikytė, Rasa. "Quality estimation of speech recognition features." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2012. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2012~D_20120302_090132-92071.

Abstract:
The accuracy of a speech recognition system depends on the characteristics of the employed speech recognition features and of the classifier. When the accuracy of a speech recognition system is evaluated in the ordinary way, the recognition error has to be calculated for each type of explored feature system and each type of classifier. The amount of such calculations can be reduced if the quality of the explored feature system is estimated first. Accordingly, research was carried out on the quality estimation of speech recognition features. The proposed method for quality estimation of speech recognition features is based on the use of three metrics. It was demonstrated that the proposed method describes the quality of speech recognition features in Euclidean space and reduces the calculations needed for quality estimation of speech recognition systems. It was shown that the algorithm complexity of the method for quality estimation of speech recognition features is O(2R log2 R), while the algorithm complexity of a dynamic time warping recognition system is O(R^2), where R is the number of vectors in the speech pattern references. The results of experimental research confirmed the correctness of the proposed method for quality estimation of speech recognition features.
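For reference, the O(R^2) dynamic time warping computation mentioned above looks like this in a minimal form:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two feature sequences; for sequences
    of length ~R this is the O(R^2) computation referred to above."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```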
10

Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.

11

Droppo, J. G. "Time-frequency features for speech recognition /." Thesis, Connect to this title online; UW restricted, 2000. http://hdl.handle.net/1773/5965.

12

Ore, Brian M. "Multilingual Articulatory Features for Speech Recognition." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1176169264.

13

Leung, Ka Yee. "Combining acoustic features and articulatory features for speech recognition /." View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20LEUNGK.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2002.
Includes bibliographical references (leaves 92-96). Also available in electronic version. Access restricted to campus users.
14

Iliev, Alexander Iliev. "Emotion Recognition Using Glottal and Prosodic Features." Scholarly Repository, 2009. http://scholarlyrepository.miami.edu/oa_dissertations/515.

Abstract:
Emotion conveys the psychological state of a person. It is expressed by a variety of physiological changes, such as changes in blood pressure, heart rate and degree of sweating, and can be manifested in shaking, changes in skin coloration, facial expression, and the acoustics of speech. This research focuses on the recognition of emotion conveyed in speech. There were three main objectives of this study. One was to examine the role played by the glottal source signal in the expression of emotional speech. The second was to investigate whether it can provide improved robustness in real-world situations and in noisy environments. This was achieved through testing in clean and various noisy conditions. Finally, the performance of glottal features was compared to diverse existing and newly introduced emotional feature domains. A novel glottal symmetry feature is proposed and automatically extracted from speech. The effectiveness of several inverse filtering methods in extracting the glottal signal from speech has been examined. Other than the glottal symmetry, two additional feature classes were tested for emotion recognition domains: the Tones and Break Indices (ToBI) of American English intonation, and Mel Frequency Cepstral Coefficients (MFCCs) of the glottal signal. Three corpora were specifically designed for the task. The first two investigated the four emotions Happy, Angry, Sad, and Neutral, and the third added Fear and Surprise in a six-emotion recognition task. This work shows that the glottal signal carries valuable emotional information and that using it for emotion recognition has many advantages over other conventional methods. For clean speech, in a four-emotion recognition task, classical prosodic features achieved 89.67% recognition, ToBI combined with classical features reached 84.75%, while glottal symmetry alone achieved 98.74%. For the six-emotion task these three methods achieved 79.62%, 90.39% and 85.37% recognition rates, respectively. Using the glottal signal also provided greater classifier robustness under noisy conditions and distortion caused by low-pass filtering. Specifically, for additive white Gaussian noise at SNR = 10 dB in the six-emotion task, the classical features and the classical features with ToBI both failed to provide successful results; speech MFCCs achieved a recognition rate of 41.43% and glottal symmetry reached 59.29%. This work has shown that the glottal signal, and the glottal symmetry in particular, provides high class separation for both the four- and six-emotion cases, surpassing the performance of all other features included in this investigation in noisy speech conditions and in most clean signal conditions.
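A heavily simplified sketch of the pipeline implied above: LPC inverse filtering to approximate the glottal excitation, followed by a crude asymmetry statistic. The thesis's glottal symmetry feature is defined on the opening/closing phases of individual glottal cycles; the third-moment proxy below is only an illustrative stand-in.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def glottal_residual(frame, order=12):
    """Approximate the glottal excitation of a windowed voiced frame by
    LPC inverse filtering (one of several possible inverse-filtering
    schemes)."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # LPC coeffs
    pred = np.convolve(frame, np.r_[0.0, a], mode='full')[:len(frame)]
    return frame - pred  # prediction residual ~ glottal excitation

def asymmetry_proxy(residual):
    """Third-moment asymmetry of the residual; only a crude stand-in for
    a per-cycle opening/closing-phase symmetry measure."""
    z = (residual - residual.mean()) / (residual.std() + 1e-12)
    return float(np.mean(z ** 3))
```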
15

Mossmyr, Simon. "Noisy recognition of perceptual mid-level features in music." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-294229.

Abstract:
Self-training with noisy student is a consistency-based semi-supervised self-training method that achieved state-of-the-art accuracy on ImageNet image classification upon its release. It makes use of data noise and model noise when fitting a model to both labelled data and a large amount of artificially labelled data. In this work, we use self-training with noisy student to fit a VGG-style deep CNN model to a dataset of music piece excerpts labelled with perceptual mid-level features and compare its performance with the benchmark. To achieve this, we experiment with some common data warping augmentations and find that pitch shifting, time stretching, and time translation applied on the excerpt spectrograms can improve the model's invariance. We also apply stochastic depth, a method which randomly drops entire layers of a model during training, to the VGG-style model and find that it too can increase model invariance. This is a novel application since stochastic depth has not been used outside the ResNet architecture to our knowledge. Finally, we apply self-training with noisy student with the aforementioned methods as sources of noise and find that it reduces the mean squared error on the testing subset by an impressive amount, although the overall performance of the model can still be questioned.
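The three spectrogram augmentations found helpful here can be sketched in a few lines; the circular rolls and nearest-neighbour resampling are simplifying assumptions (a real implementation would pad rather than wrap, and fix the output width).

```python
import numpy as np

def augment(spec, rng, max_pitch=8, max_shift=20, stretch=(0.9, 1.1)):
    """Pitch shift ~ roll along frequency, time translation ~ roll along
    time, time stretch ~ nearest-neighbour resampling of the time axis,
    all applied directly to a (freq, time) spectrogram."""
    s = np.roll(spec, rng.integers(-max_pitch, max_pitch + 1), axis=0)
    s = np.roll(s, rng.integers(-max_shift, max_shift + 1), axis=1)
    factor = rng.uniform(*stretch)
    idx = np.clip((np.arange(int(s.shape[1] * factor)) / factor).astype(int),
                  0, s.shape[1] - 1)
    return s[:, idx]  # width changes with the stretch; crop/pad as needed

# rng = np.random.default_rng(0); noisy_view = augment(spectrogram, rng)
```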
16

Saenko, Ekaterina 1976. "Articulatory features for robust visual speech recognition." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28736.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 99-105).
This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling.
by Ekaterina Saenko.
S.M.
17

Väyrynen, E. (Eero). "Emotion recognition from speech using prosodic features." Doctoral thesis, Oulun yliopisto, 2014. http://urn.fi/urn:isbn:9789526204048.

Abstract:
Emotion recognition, a key step of affective computing, is the process of decoding an embedded emotional message from human communication signals, e.g. visual, audio, and/or other physiological cues. It is well known that speech is the main channel for human communication and is thus vital in the signalling of emotion and semantic cues for the correct interpretation of contexts. In the verbal channel, the emotional content is largely conveyed as constant paralinguistic information signals, of which prosody is the most important component. The lack of evaluation of affect and emotional states in human-machine interaction is, however, currently limiting the potential behaviour and user experience of technological devices. In this thesis, speech prosody and related acoustic features of speech are used for the recognition of emotion from spoken Finnish. More specifically, methods for emotion recognition from speech relying on long-term global prosodic parameters are developed. An information fusion method is developed for short-segment emotion recognition using local prosodic features and vocal source features. A framework for emotional speech data visualisation is presented for prosodic features. Emotion recognition in Finnish comparable to the human reference is demonstrated using a small set of basic emotional categories (neutral, sad, happy, and angry). The recognition rate for Finnish was found to be comparable with those reported for Western language groups. Increased emotion recognition accuracy is shown for short-segment emotion recognition using fusion techniques. Visualisation of emotional data congruent with the dimensional models of emotion is demonstrated utilising supervised nonlinear manifold modelling techniques. The low-dimensional visualisation of emotion is shown to retain the topological structure of the emotional categories, as well as the emotional intensity of speech samples. The thesis provides pattern recognition methods and technology for the recognition of emotion from speech using long speech samples, as well as short stressed words. The framework for the visualisation and classification of emotional speech data developed here can also be used to represent speech data from other semantic viewpoints, by using alternative semantic labellings where available.
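By way of illustration, long-term global prosodic parameters of the kind used here for whole-utterance recognition reduce to simple statistics over pitch and energy tracks; the exact feature set of the thesis is not reproduced.

```python
import numpy as np

def global_prosodic_features(f0, energy):
    """Long-term global prosodic statistics for one utterance.
    f0: per-frame pitch in Hz, 0 where unvoiced; energy: per-frame RMS."""
    voiced = f0[f0 > 0]  # assumes at least one voiced frame
    return {
        'f0_mean': voiced.mean(), 'f0_std': voiced.std(),
        'f0_range': voiced.max() - voiced.min(),
        'energy_mean': energy.mean(), 'energy_std': energy.std(),
        'voiced_ratio': len(voiced) / len(f0),  # crude tempo/voicing cue
    }
```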
18

Rankin, D. "Extraction of features from speech spectra." Thesis, Queen's University Belfast, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.373541.

19

Domont, Xavier. "Hierarchical spectro-temporal features for robust speech recognition." Münster Verl.-Haus Monsenstein und Vannerdat, 2009. http://d-nb.info/1001282655/04.

20

Lal, Partha. "Cross-lingual automatic speech recognition using tandem features." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5773.

Abstract:
Automatic speech recognition requires many hours of transcribed speech recordings in order for an acoustic model to be effectively trained. However, recording speech corpora is time-consuming and expensive, so such quantities of data exist only for a handful of languages — there are many languages for which little or no data exist. Given that there are acoustic similarities between different languages, it may be fruitful to use data from a well-supported source language for the task of training a recogniser in a target language with little training data. Since most languages do not share a common phonetic inventory, we propose an indirect way of transferring information from a source language model to a target language model. Tandem features, in which class posteriors from a separate classifier are decorrelated and appended to conventional acoustic features, are used to do that. They have the advantage that the language used to train the classifier, typically a Multilayer Perceptron (MLP), need not be the same as the target language being recognised. Consistent with prior work, positive results are achieved for monolingual systems in a number of different languages. Furthermore, improvements are also shown for the cross-lingual case, in which the tandem features were generated using a classifier not trained for the target language. We examine factors which may predict the relative improvements brought about by tandem features for a given source and target pair. We examine some cross-corpus normalization issues that naturally arise in multilingual speech recognition and validate our solution in terms of recognition accuracy and a mutual information measure. The tandem classifier in the work up to this point in the thesis has been a phoneme classifier. Articulatory features (AFs), represented here as a multi-stream, discrete, multivalued labelling of speech, can be used as an alternative task. The motivation for this is that, since AFs are a set of physically grounded categories that are not language-specific, they may be more suitable for cross-lingual transfer. Then, using either phoneme or AF classification as our MLP task, we look at training the MLP using data from more than one language — again we hypothesise that AF tandem will result in greater improvements in accuracy. We also examine performance where only limited amounts of target language data are available, and see how our various tandem systems perform under those conditions.
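A minimal sketch of tandem feature construction as described above, assuming a scikit-learn MLP as the (possibly cross-lingual) phoneme classifier and PCA for decorrelation; layer sizes and dimensions are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def fit_tandem(source_feats, source_phones, n_keep=25):
    """Train the tandem front-end: an MLP phoneme classifier (possibly on
    source-language data) and a PCA decorrelator for its log-posteriors."""
    mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
    mlp.fit(source_feats, source_phones)
    logpost = np.log(mlp.predict_proba(source_feats) + 1e-10)
    return mlp, PCA(n_components=n_keep).fit(logpost)

def tandem_features(acoustic, mlp, pca):
    """Append decorrelated log-posteriors to conventional features."""
    logpost = np.log(mlp.predict_proba(acoustic) + 1e-10)
    return np.hstack([acoustic, pca.transform(logpost)])
```

The appended posterior stream can then feed a standard acoustic model for the target language, which is what makes the transfer indirect.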
21

Harte, Naomi Antonia. "Segmental phonetic features and models for speech recognition." Thesis, Queen's University Belfast, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.287466.

22

Schuy, Lars. "Speech features and their significance in speaker recognition." Thesis, University of Sussex, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.288845.

Abstract:
This thesis addresses the significance of speech features within the task of speaker recognition. Motivated by the perception of simple attributes like 'loud', 'smooth' and 'fast', more than 70 new speech features are developed. A set of basic speech features like pitch, loudness and speech speed are combined with these new features in a feature set, one set per utterance. A neural network classifier is used to evaluate the significance of these features by creating a speaker recognition system and analysing the behaviour of successfully trained single-speaker networks. An in-depth analysis of network weights allows a rating of significance and feature contribution. A subjective listening experiment validates and confirms the results of the neural network analysis. The work starts with an extended sentence analysis; ten sentences are uttered by 630 speakers. The extraction of 100 speech features is outlined and a 100-element feature vector for each utterance is derived. Some features themselves and the methods of analysing them have been used elsewhere, for example pitch, sound pressure level, spectral envelope, loudness, speech speed and glottal-to-noise excitation. However, more than 70 of the 100 features are derivatives of these basic features and have not been described and used before in speaker recognition research, especially not within a rating of feature significance. These derivatives include histograms, 3rd and 4th moments, function approximation, as well as other statistical analyses applied to the basic features. The first approach to assessing the significance of features and their possible use in a recognition system is based on a probability analysis. The analysis rests on the assumption that, within a speaker's ten utterances, single feature values have a small deviation and cluster around that speaker's mean value. The presented features indeed cluster into groups and show significant differences between speakers, thus enabling a clear separation of voices when applied to a small database of < 20 speakers. The recognition and assessment of individual feature contribution becomes impossible when the database is extended to 200 speakers. To ensure continuous validation of feature contribution it is necessary to consider a different type of classifier. These limitations are overcome with the introduction of neural network classifiers. A separate network is assigned to each speaker, resulting in the creation of 630 networks. All networks are of standard feed-forward backpropagation type and have a 100-input, 20-hidden-node, one-output architecture. The 6300 available feature vectors are split into a training, validation and test set in the ratio of 5-3-2. The networks are initially trained with the same 100-feature input database. Successful training was achieved within 30 to 100 epochs per network. The speaker related to the network with the highest output is declared as the speaker represented by the input. The achieved recognition rate for 630 speakers is approximately 49%. A subsequent exclusion of features with minor significance raises the recognition rate to 57%. The analysis of the network weight behaviour reveals two major points. First, a definite ranking order of significance exists between the 100 features. Many of the newly introduced derivatives of pitch, brightness, spectral voice patterns and speech speed contribute intensely to recognition, whereas feature groups related to glottal-to-noise excitation ratio and sound pressure level play a less important role. The significance of features is rated by the training, testing and validation behaviour of the networks under data sets with reduced information content, the post-trained weight distribution and the standard deviation of the weight distribution within networks. The findings match the results of a subjective listening experiment. As a second major result, the analysis shows that there are large differences between speakers in the significance of features, i.e. not all speakers use the same feature set to the same extent. The speaker-related networks exhibit key features by which they are uniquely identifiable, and these key features vary from speaker to speaker. Some features like pitch are used by all networks; other features like sound pressure level and glottal-to-noise excitation ratio are used by only a few distinct classifiers. Again, the findings correspond with the results of a subjective listening experiment. This thesis presents more than 70 new features which have never been used before in speaker recognition. A quantitative ranking order of 100 speech features is introduced. Such a ranking order has not been documented elsewhere and is comparatively new to the area of speaker recognition. This ranking order is further extended to describe the amount to which a classifier uses or omits single features, solely depending on the characteristics of the voice sample. Such a separation has not yet been documented and is a novel contribution. The close correspondence between the subjective listening experiment and the findings of the network classifiers shows that it is plausible to model the behaviour of human speech recognition with an artificial neural network. Again, such a validation is original in the area of speaker recognition.
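The weight-based significance rating lends itself to a simple sketch: aggregate the magnitudes of post-training input-to-hidden weights across the per-speaker networks and rank features by their average usage. The exact statistics used in the thesis (e.g. weight standard deviations and behaviour under reduced-information data sets) are richer than this.

```python
import numpy as np

def feature_ranking(weight_mats):
    """Rank input features by post-training weight magnitude across the
    per-speaker networks. weight_mats: one input-to-hidden matrix per
    network, each of shape (n_hidden, n_features)."""
    usage = np.mean([np.abs(W).sum(axis=0) for W in weight_mats], axis=0)
    return np.argsort(usage)[::-1]  # feature indices, most significant first
```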
23

Necioğlu, Burhan F. "Objectively measured descriptors for perceptual characterization of speakers." Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/15035.

24

Savvides, Vasos E. "Perceptual models in speech quality assessment and coding." Thesis, Loughborough University, 1988. https://dspace.lboro.ac.uk/2134/36273.

Abstract:
The ever-increasing demand for good communications/toll quality speech has created a renewed interest into the perceptual impact of rate compression. Two general areas are investigated in this work, namely speech quality assessment and speech coding. In the field of speech quality assessment, a model is developed which simulates the processing stages of the peripheral auditory system. At the output of the model a "running" auditory spectrum is obtained. This represents the auditory (spectral) equivalent of any acoustic sound such as speech. Auditory spectra from coded speech segments serve as inputs to a second model. This model simulates the information centre in the brain which performs the speech quality assessment.
25

Juneja, Amit. "Speech recognition based on phonetic features and acoustic landmarks." College Park, Md. : University of Maryland, 2004. http://hdl.handle.net/1903/2148.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2004.
Thesis research directed by: Electrical Engineering. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
26

ALENCAR, VLADIMIR FABREGAS SURIGUE DE. "EFFICIENT FEATURES AND INTERPOLATION DOMAINS IN DISTRIBUTED SPEECH RECOGNITION." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2005. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=6201@1.

Abstract:
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
The huge growth of the Internet and cellular mobile communication systems has stimulated great interest in the applications of speech processing in these networks. An important problem in this field is speech recognition performed on a server, based on the acoustic parameters calculated and quantized in the user terminal (Distributed Speech Recognition). Since these parameters are not the most suitable ones for the remote recognition system, it is important to examine different transformations of these parameters, in order to allow a better performance of the recogniser. This dissertation is concerned with the extraction of efficient recognition features from the coder parameters used in cellular mobile networks and IP networks. In addition, since the rate at which parameters must be supplied to the speech recogniser is usually higher than the rate at which the codec generates them, it is important to analyze the effect of interpolating the parameters on the performance of the recognition system. Moreover, it is paramount to establish the best domain over which this interpolation must be carried out. These are other topics presented in this dissertation.
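The interpolation question can be made concrete with a small sketch that linearly upsamples codec parameter tracks from the codec frame rate to the recogniser frame rate; which parameter domain the columns live in (e.g. LSF versus cepstral) is exactly the choice the dissertation investigates, and this sketch is agnostic to it.

```python
import numpy as np

def upsample_params(params, in_rate, out_rate):
    """Linearly interpolate codec parameter tracks (one column per
    coefficient) from the codec frame rate to the recogniser rate."""
    t_in = np.arange(len(params)) / in_rate
    n_out = int(len(params) * out_rate / in_rate)
    t_out = np.arange(n_out) / out_rate
    return np.column_stack([np.interp(t_out, t_in, params[:, k])
                            for k in range(params.shape[1])])
```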
27

Meng, Helen M. "The use of distinctive features for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1991. http://hdl.handle.net/1721.1/13279.

28

Schutte, Kenneth Thomas 1979. "Parts-based models and local features for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53301.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 101-108).
While automatic speech recognition (ASR) systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. This thesis revisits the basic acoustic modeling assumptions common to most ASR systems and argues that improvements to the underlying model of speech are required to address these shortcomings. A number of problems with the standard method of hidden Markov models (HMMs) and features derived from fixed, frame-based spectra (e.g. MFCCs) are discussed. Based on these problems, a set of desirable properties of an improved acoustic model are proposed, and we present a "parts-based" framework as an alternative. The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles. We discuss the proposed model's relationship to HMMs and segment-based recognizers, and describe how they can be viewed as special cases of the PBM. Two variations of PBMs are described in detail. The first represents each phonetic unit with a set of time-frequency (T-F) "patches" which act as filters over a spectrogram. The model structure encodes the patches' relative T-F positions. The second variation, referred to as a "speech schematic" model, more directly encodes the information in a spectrogram by using simple edge detectors and focusing more on modeling the constraints between parts.
We demonstrate the proposed models on various isolated recognition tasks and show the benefits over baseline systems, particularly in noisy conditions and when only limited training data is available. We discuss efficient implementation of the models and describe how they can be combined to build larger recognition systems. It is argued that the flexible templates used in parts-based modeling may provide a better generative model of speech than typical HMMs.
by Kenneth Thomas Schutte.
Ph.D.
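A minimal sketch of the "patches as filters" idea from this abstract: slide each time-frequency patch over a spectrogram and record its best response and position. The graphical-model constraints between parts, which make the template deformable, are omitted here; names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def patch_scores(spectrogram, patches):
    """Slide each time-frequency patch over a (freq, time) spectrogram
    and record its best response and position; a deformable-template
    match would then score the relative positions of the parts."""
    scores = []
    for patch in patches:
        resp = correlate2d(spectrogram, patch, mode='valid')
        ij = np.unravel_index(resp.argmax(), resp.shape)
        scores.append((resp[ij], ij))  # (strength, (freq, time) offset)
    return scores
```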
29

Tang, Min Ph D. Massachusetts Institute of Technology. "Large vocabulary continuous speech recognition using linguistic features and constraints." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33203.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (leaves 111-123).
Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and organization of their meanings (semantics, pragmatics, etc.). The other governs how speech signals are related to words, a process often termed "lexical access". This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework and its application to continuous speech from an open vocabulary. The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints. In the original Huttenlocher-Zue model, the authors had proposed six linguistic features motivated by the manner of pronunciation. The first pass classifies speech signals into a sequence of linguistic features, and only words that match this sequence - the cohort - are activated. The second pass performs a detailed acoustic phonetic analysis within the cohort to decide the identity of the word. This model differs from the lexical access model nowadays commonly employed in speech recognizers where detailed acoustic phonetic analysis is performed directly and lexical items are retrieved in one pass. The thesis first studies the implementation issues of the Huttenlocher-Zue model. A number of extensions to the original proposal are made to take advantage of the existing facilities of a probabilistic, graph-based recognition framework and, more importantly, to model the broad linguistic features in a data-driven approach. First, we analyze speech signals along the two diagonal dimensions of manner and place of articulation, rather than the manner dimension alone. Secondly, we adopt a set of feature-based landmarks optimized for data-driven modeling as the basic recognition units, and Gaussian mixture models are trained for these units. We explore information fusion techniques to integrate constraints from both the manner and place dimensions, as well as examining how to integrate constraints from the feature-based first pass with the second pass of detailed acoustic phonetic analysis. Our experiments on a large-vocabulary isolated word recognition task show that, while constraints from each individual feature dimension provide only limited help in this lexical access model, the utilization of both dimensions and information fusion techniques leads to significant performance gain over a one-pass phonetic system. The thesis then proposes to generalize the original Huttenlocher-Zue model, which limits itself to only isolated word tasks, to handle continuous speech. With continuous speech, the search space for both stages is infinite if all possible word sequences are allowed. We generalize the original cohort idea from the Huttenlocher-Zue proposal and use the bag of words of the N-best list of the first pass as cohorts for continuous speech. This approach transfers the constraints of broad linguistic features into a much reduced search space for the second stage. The thesis also studies how to recover from errors made by the first pass, which is not discussed in the original Huttenlocher-Zue proposal.
In continuous speech recognition, a way of recovering from errors made in the first pass is vital to the performance of the overall system. We find empirical evidence that such errors tend to occur around function words, possibly due to the lack of prominence, in meaning and hence in linguistic features, of such words. This thesis proposes an error-recovery mechanism for the two-pass lexical access model based on an empirical analysis of a development set. Our experiments on a medium-sized, telephone-quality continuous speech recognition task achieve higher accuracy than a state-of-the-art one-pass baseline system. The thesis applies the generalized two-pass lexical access model to the challenge of recognizing continuous speech from an open vocabulary. Telephony information query systems often need to deal with a large list of words that are not observed in the training data, for example the city names in a weather information query system. The large portion of the vocabulary unseen in the training data - the open vocabulary - poses a serious data-sparseness problem to both acoustic and language modeling. A two-pass lexical access model provides a solution by activating a small cohort within the open vocabulary in the first pass, thus significantly reducing the data-sparseness problem. Also, the broad linguistic constraints in the first pass generalize better to unseen data than finer, context-dependent acoustic phonetic models. This thesis also studies a data-driven analysis of acoustic similarities among open vocabulary items; the results are used to recover possible errors in the first pass. This approach demonstrates an advantage over a two-pass approach based on specific semantic constraints. In summary, this thesis implements the original Huttenlocher-Zue two-pass lexical access model in a modern probabilistic speech recognition framework, and extends the original model to recognize continuous speech from an open vocabulary, with the two-stage model achieving better performance than the baseline system. In the future, sub-lexical linguistic hierarchy constraints, such as syllables, could be introduced into this two-pass model to further improve lexical access performance.
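For illustration, the generalized cohort idea lends itself to a short sketch. The following is a minimal, hypothetical Python rendering (not the author's implementation) of deriving a bag-of-words cohort from a first-pass N-best list and restricting the second pass to it; the scoring function here is a dummy stand-in:

    # Sketch: derive a bag-of-words cohort from a first-pass N-best list
    # and restrict second-pass scoring to it. Data and scorer are
    # hypothetical, for illustration only.

    def cohort_from_nbest(nbest_hypotheses):
        """Union of all words appearing in the first-pass N-best list."""
        cohort = set()
        for hypothesis in nbest_hypotheses:
            cohort.update(hypothesis.split())
        return cohort

    def second_pass(candidate_words, cohort, acoustic_score):
        """Keep only cohort words, then rank by a detailed acoustic score."""
        survivors = [w for w in candidate_words if w in cohort]
        return max(survivors, key=acoustic_score) if survivors else None

    nbest = ["show me flights to boston", "show me flights to austin"]
    cohort = cohort_from_nbest(nbest)   # far smaller than the open vocabulary
    best = second_pass(["boston", "austin", "houston"], cohort, len)  # dummy scorer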
by Min Tang.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
30

Civile, Ciro. "The face inversion effect and perceptual learning : features and configurations." Thesis, University of Exeter, 2013. http://hdl.handle.net/10871/13564.

Full text
Abstract:
This thesis explores the causes of the face inversion effect, which is a substantial decrement in performance in recognising facial stimuli when they are presented upside down (Yin, 1969). I provide results from both behavioural and electrophysiological (EEG) experiments to aid in the analysis of this effect. Over the course of six chapters I summarise my work during the four years of my PhD, and propose an explanation of the face inversion effect that is based on the general mechanisms for learning that we share with other animals. In Chapter 1 I describe and discuss some of the main theories of face inversion. Chapter 2 used behavioural and EEG techniques to test one of the most popular explanations of the face inversion effect, proposed by Diamond and Carey (1986): that it is the disruption of the expertise needed to exploit configural information that leads to the inversion effect. The experiments reported in Chapter 2 were published in the Proceedings of the 34th annual conference of the Cognitive Science Society. In Chapter 3 I explore other potential causes of the inversion effect, confirming that not only configural information but also single-feature orientation information plays an important part in the inversion effect. All the experiments included in Chapter 3 are part of a paper accepted for publication in the Quarterly Journal of Experimental Psychology. Chapter 4 attempts to answer the question of whether configural information is really necessary to obtain an inversion effect. All the experiments presented in Chapter 4 are part of a manuscript in preparation for submission to the Quarterly Journal of Experimental Psychology. Chapter 5 includes some of the most innovative experiments from my PhD work; in particular, it offers behavioural and electrophysiological evidence showing that it is possible to apply an associative approach to face inversion. Chapter 5 is a key component of this thesis because, on the one hand, it explains the face inversion effect using general mechanisms of perceptual learning (the MKM model); on the other hand, it shows that something extra seems to be needed to explain face recognition entirely. All the experiments included in Chapter 5 were reported in a paper submitted to the Journal of Experimental Psychology: Animal Behavior Processes. Finally, in Chapter 6 I summarise the implications of this work for explanations of the face inversion effect and some of the general processes involved in face perception.
APA, Harvard, Vancouver, ISO, and other styles
31

Weatherholtz, Kodi. "Perceptual learning of systemic cross-category vowel variation." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1429782580.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Jeon, Woojay. "Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/14061.

Full text
Abstract:
It is well known that machines perform far worse than humans in recognizing speech and audio, especially in noisy environments. One method of addressing this issue of robustness is to study physiological models of the human auditory system and to adopt some of their characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights and ideas on how to interpret the model, and reveal some of the advantages of its dimension expansion that may potentially be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes, and as a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that perform decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work in exploiting high-dimensional data with category- (or class-) dependent features for improved classification or detection.
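The two-stage decision rule described here can be sketched compactly. The following is a hypothetical Python illustration (placeholder diagonal-Gaussian scorers and invented category/class names, not the thesis's models): stage 1 picks a broad category from generic features, and stage 2 discriminates within that category using the features specialised for it.

    import numpy as np

    # Illustrative two-stage, category-dependent classifier. All models
    # here are placeholder (mean, variance) pairs for Gaussian scoring.

    def log_gaussian(x, mean, var):
        return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

    def classify(x_generic, x_special, category_models, class_models):
        # Stage 1: broad category from generic features.
        cat = max(category_models,
                  key=lambda c: log_gaussian(x_generic, *category_models[c]))
        # Stage 2: fine class using that category's specialised features.
        cls = max(class_models[cat],
                  key=lambda k: log_gaussian(x_special[cat],
                                             *class_models[cat][k]))
        return cat, cls

    cats = {"sonorant": (0.0, 1.0), "obstruent": (1.0, 1.0)}
    classes = {"sonorant": {"vowel": (0.0, 1.0), "nasal": (1.0, 1.0)},
               "obstruent": {"stop": (0.0, 1.0), "fricative": (1.0, 1.0)}}
    x_special = {"sonorant": np.zeros(5), "obstruent": np.zeros(5)}
    print(classify(np.zeros(3), x_special, cats, classes))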
APA, Harvard, Vancouver, ISO, and other styles
33

Bruijn, Christina Geertruida de. "Voice quality after dictation to speech recognition software : a perceptual and acoustic study." Thesis, University of Sheffield, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.440907.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Javadi, Ailar. "Bio-inspired noise robust auditory features." Thesis, Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/44801.

Full text
Abstract:
The purpose of this work is to investigate a series of biologically inspired modifications to state-of-the-art Mel-frequency cepstral coefficients (MFCCs) that may improve automatic speech recognition results. We provide recommendations to improve speech recognition results depending on the signal-to-noise ratio levels of input signals. This work has been motivated by noise-robust auditory features (NRAF). In the feature extraction technique, after a signal is filtered using bandpass filters, a spatial derivative step is used to sharpen the results, followed by an envelope detector (rectification and smoothing) and down-sampling for each filter bank before being compressed. A DCT is then applied to the results of all filter banks to produce features. The Hidden Markov Model Toolkit (HTK) is used as the recognition back-end to perform speech recognition given the features we have extracted. In this work, we investigate the role of filter types, window size, spatial derivative, rectification types, smoothing, down-sampling and compression, and compare the final results to state-of-the-art Mel-frequency cepstral coefficients (MFCCs). A series of conclusions and insights are provided for each step of the process. The goal of this work has not been to outperform MFCCs; however, we have shown that by changing the compression type from log compression to 0.07 root compression we are able to outperform MFCCs for all noisy conditions.
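The pipeline named in this abstract (filter bank, rectification, smoothing, down-sampling, compression, DCT) can be sketched as follows. This is a minimal illustration with invented filter parameters, not the thesis's exact configuration; the 0.07 root compression replaces the usual logarithm:

    import numpy as np
    from scipy.signal import butter, lfilter
    from scipy.fftpack import dct

    # Sketch: bandpass filter bank -> rectification -> smoothing ->
    # down-sampling -> root compression -> DCT across channels.
    # Band edges, filter order and rates are illustrative choices.

    def bank_features(signal, fs, bands, root=0.07):
        channels = []
        for lo, hi in bands:
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            x = lfilter(b, a, signal)
            env = np.abs(x)                              # rectification
            env = lfilter(np.ones(32) / 32, [1.0], env)  # smoothing
            env = env[::80]                              # down-sampling
            channels.append(np.mean(env) ** root)        # 0.07 root compression
        return dct(np.array(channels), norm="ortho")     # decorrelate channels

    fs = 16000
    sig = np.random.randn(fs)  # stand-in for one second of speech
    bands = [(300, 600), (600, 1200), (1200, 2400), (2400, 4800)]
    feats = bank_features(sig, fs, bands)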
APA, Harvard, Vancouver, ISO, and other styles
35

Berg, Brian LaRoy. "Investigating Speaker Features From Very Short Speech Records." Diss., Virginia Tech, 2001. http://hdl.handle.net/10919/28691.

Full text
Abstract:
A procedure is presented that is capable of extracting various speaker features, and is of particular value for analyzing records containing single words and shorter segments of speech. By taking advantage of the fast convergence properties of adaptive filtering, the approach is capable of modeling the nonstationarities due to both the vocal tract and vocal cord dynamics. Specifically, the procedure extracts the vocal tract estimate from within the closed glottis interval and uses it to obtain a time-domain glottal signal. This procedure is quite simple, requires minimal manual intervention (in cases of inadequate pitch detection), and is distinctive in that it derives both the vocal tract and glottal signal estimates directly from the time-varying filter coefficients rather than from the prediction error signal. Using this procedure, several glottal signals are derived from human and synthesized speech and are analyzed to demonstrate the glottal waveform modeling performance and the kinds of glottal characteristics obtained therewith. Finally, the procedure is evaluated using automatic speaker identity verification.
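To make the coefficient-tracking idea concrete, here is a minimal normalised-LMS sketch in Python (arbitrary order and step size, synthetic input). It illustrates reading estimates from the time-varying filter coefficients rather than from the prediction error; it does not reproduce the thesis's closed-glottis procedure:

    import numpy as np

    # Normalised LMS adaptive predictor: the coefficient vector w is
    # updated each sample, and its trajectory (not the error signal)
    # is what a coefficient-based analysis would inspect.

    def nlms_track(x, order=12, mu=0.5, eps=1e-8):
        w = np.zeros(order)
        history = []
        for n in range(order, len(x)):
            u = x[n - order:n][::-1]           # most recent samples first
            e = x[n] - w @ u                   # one-step prediction error
            w = w + mu * e * u / (u @ u + eps)
            history.append(w.copy())           # time-varying coefficients
        return np.array(history)

    coeffs = nlms_track(np.random.randn(2000))  # stand-in for speech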
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
36

Clark, Tracy M. "A Study of Features and Processes Towards Real-time Speech Word Recognition." Thesis, University of Canterbury. Electrical and Electronic Engineering, 1993. http://hdl.handle.net/10092/7561.

Full text
Abstract:
Word recognition techniques are reviewed. An exhaustive comparative study of many of the factors that affect recognition accuracy is presented. Experiments centred on four major areas of word recognition are described: pre-processing techniques, recognition features, recognition algorithms and distance measures. Recognition accuracy, in the context of each of these four areas, is investigated using the digit vocabulary spoken by 10 New Zealand (6 male and 4 female) and 38 American (20 male and 18 female) speakers. Pre-processing techniques examined are the type of window, the length of the data frame, data frame overlap, and pre-emphasis. Acoustic features tested include temporal features such as energy and zero-crossing rate, as well as frequency-based acoustic representations such as linear prediction coefficients, cepstral coefficients, dynamic (transitional) cepstral coefficients, and perceptual linear prediction coefficients. Three types of distance measure are also reported on: the Euclidean, the weighted Euclidean, and the projection. Two methods of training, random template selection and clustering, are investigated. Accuracy improvement by combining different features is also examined. The implementation of a real-time word recognition system, designed on the basis of the comparative study and experiments, is described. The system is based on a TMS320C30 and takes around 0.03 seconds per recognition. The real-time system achieves speaker-dependent accuracies greater than 95% and speaker-independent accuracies greater than 70% for the digit vocabulary. An examination is also made of two methods of continuous recognition using sub-word representations. Both methods take advantage of isolated word recognition techniques such as dynamic programming. A segmentation method and a non-segmentation method were investigated. The accuracy of the segmentation recognition method is found to depend linearly on the accuracy of the segmenter. With a segmentation error of 22%, an average recognition accuracy of 90.7% was obtained for 10 vowels and 2 consonants. For the non-segmentation recognition method, an average accuracy of 75% was obtained. Although the segmentation method produced higher accuracies than the non-segmentation method, it is argued that the removal of segmentation is an advantage that greatly simplifies the recognition strategy.
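Since the study leans on dynamic programming matchers, a minimal dynamic time warping sketch with the Euclidean frame distance (one of the three measures examined) may help the reader. This is an illustration, not the thesis's code:

    import numpy as np

    # Template-matching word recognition via dynamic time warping (DTW)
    # with a Euclidean frame distance. Feature frames are rows.

    def dtw(template, utterance):
        T, U = len(template), len(utterance)
        D = np.full((T + 1, U + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, U + 1):
                cost = np.linalg.norm(template[i - 1] - utterance[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[T, U]

    def recognise(utterance, templates):   # templates: {word: frames}
        return min(templates, key=lambda w: dtw(templates[w], utterance))

    templates = {"one": np.random.randn(20, 12), "two": np.random.randn(25, 12)}
    word = recognise(np.random.randn(22, 12), templates)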
APA, Harvard, Vancouver, ISO, and other styles
37

Peso, Pablo. "Spatial features of reverberant speech : estimation and application to recognition and diarization." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/45664.

Full text
Abstract:
Distant-talking scenarios, such as hands-free calling or teleconference meetings, are essential for natural and comfortable human-machine interaction, and they are being used increasingly in multiple contexts. The acquired speech signal in such scenarios is reverberant and affected by additive noise. This signal distortion degrades the performance of speech recognition and diarization systems, creating troublesome human-machine interactions. This thesis proposes a method to non-intrusively estimate room acoustic parameters, paying special attention to a room acoustic parameter highly correlated with speech recognition degradation: the clarity index. In addition, a method to provide information regarding the estimation accuracy is proposed. An analysis of phoneme recognition performance for multiple reverberant environments is presented, from which a confusability metric for each phoneme is derived. This confusability metric is then employed to improve reverberant speech recognition performance. Additionally, room acoustic parameters can be used in speech recognition to provide robustness against reverberation; a method to exploit clarity index estimates in order to perform reverberant speech recognition is introduced. Finally, room acoustic parameters can also be used to diarize reverberant speech. A room acoustic parameter is proposed as an additional source of information for single-channel diarization in reverberant environments. In multi-channel environments, the time delay of arrival is a feature commonly used to diarize the input speech; however, the computation of this feature is affected by reverberation. A method is presented to model the time delay of arrival in a robust manner so that speaker diarization is performed more accurately.
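For reference, the clarity index has a standard intrusive definition computed from the room impulse response: the ratio of early to late energy in decibels. The thesis's contribution is estimating it non-intrusively, without access to the impulse response; the sketch below shows only the standard definition, with synthetic data:

    import numpy as np

    # Clarity index (e.g. C50, early_ms = 50) from a room impulse
    # response: 10*log10(early energy / late energy), where "early"
    # means within early_ms after the direct-path arrival.

    def clarity_index(rir, fs, early_ms=50.0):
        onset = int(np.argmax(np.abs(rir)))          # direct-path arrival
        split = onset + int(early_ms * 1e-3 * fs)
        early = np.sum(rir[onset:split] ** 2)
        late = np.sum(rir[split:] ** 2)
        return 10.0 * np.log10(early / late)

    fs = 16000
    rir = np.exp(-np.arange(fs) / 3000.0) * np.random.randn(fs)  # toy RIR
    print(clarity_index(rir, fs))  # C50 in dB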
APA, Harvard, Vancouver, ISO, and other styles
38

Sidorova, Julia. "Optimization techniques for speech emotion recognition." Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7575.

Full text
Abstract:
There are three innovative aspects. First, a novel algorithm for computing the emotional content of an utterance, with a hybrid design that employs statistical learning and syntactic information. Second, an extension for feature selection that allows the weights to be adapted, thereby increasing the flexibility of the system. Third, a proposal for incorporating high-level features into the system; these features, combined with the low-level features, improve the system's performance.
The first contribution of this thesis is a speech emotion recognition system called ESEDA, capable of recognizing emotions in different languages. The second contribution is the classifier TGI+: first, objects are modeled by means of a syntactic method, and then a statistical method classifies the mappings of the samples rather than their feature vectors. TGI+ outperforms the state-of-the-art top performer on a benchmark data set of acted emotions. The third contribution is high-level features, which are distances from a feature vector to the tree automata accepting class i, for all i in the set of class labels. The set of low-level features and the set of high-level features are concatenated and the resulting set is submitted to the feature selection procedure; the classification step is then done in the usual way. Testing on a benchmark dataset of authentic emotions showed that this classification strategy outperforms the state-of-the-art top performer.
APA, Harvard, Vancouver, ISO, and other styles
39

Lareau, Jonathan. "Application of shifted delta cepstral features for GMM language identification /." Electronic version of thesis, 2006. https://ritdml.rit.edu/dspace/handle/1850/2686.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Gangireddy, Siva Reddy. "Recurrent neural network language models for automatic speech recognition." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28990.

Full text
Abstract:
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks when compared to other language models. In this thesis we propose various advances to RNNLMs: improved learning procedures, context enhancement, and adaptation. We learn better parameters through a novel pre-training approach and enhance the context using prosodic and syntactic features. We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data, and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER. Next, we present unsupervised adaptation of RNNLMs. We adapted the RNNLMs to a target domain (topic, genre or television programme (show)) at test time using ASR transcripts from first-pass recognition. We investigated two approaches: in the first, the forward-propagating hidden activations are scaled - learning hidden unit contributions (LHUC); in the second, we adapt all parameters of the RNNLM. We evaluated the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, and observe small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM. Finally, we present the context enhancement of RNNLMs using prosodic and syntactic features. The prosodic features were computed from the acoustics of the context words, and the syntactic features from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).
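The LHUC approach mentioned above admits a very small sketch: each hidden unit's output is rescaled by a learned per-unit amplitude while the original network weights stay frozen. A minimal, hypothetical numpy illustration (the layer shapes and tanh nonlinearity are arbitrary choices for the sketch):

    import numpy as np

    # Learning hidden unit contributions (LHUC): per-unit amplitudes
    # 2*sigmoid(a) in (0, 2) rescale frozen hidden activations; only
    # the vector a is adapted, e.g. on first-pass ASR transcripts.

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lhuc_forward(x, W, b, a):
        h = np.tanh(W @ x + b)          # frozen hidden layer
        return 2.0 * sigmoid(a) * h     # adapted per-unit scaling

    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((256, 64)), np.zeros(256)
    a = np.zeros(256)                   # a = 0 leaves activations unscaled
    y = lhuc_forward(rng.standard_normal(64), W, b, a)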
APA, Harvard, Vancouver, ISO, and other styles
41

Wong, Jimmy Pui Fung. "The use of prosodic features in Chinese speech recognition and spoken language processing /." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202003%20WONG.

Full text
Abstract:
Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2003.
Includes bibliographical references (leaves 97-101). Also available in electronic version. Access restricted to campus users.
APA, Harvard, Vancouver, ISO, and other styles
42

Sun, Rui. "The evaluation of the stability of acoustic features in affective conveyance across multiple emotional databases." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/49041.

Full text
Abstract:
The objective of the research presented in this thesis was to systematically investigate a computational structure for cross-database emotion recognition. The research consisted of evaluating the stability of acoustic features, particularly the glottal and Teager-Energy-based features, and investigating three normalization methods and two data fusion techniques. One of the challenges of cross-database training and testing is accounting for the potential variation in the types of emotions expressed as well as in the recording conditions. In an attempt to alleviate the impact of these variations, three normalization methods on the acoustic data were studied. The lack of an emotional database large and diverse enough to train the classifier motivated training on multiple databases, which posed another challenge: data fusion. This thesis proposed two data fusion techniques, pre-classification SDS and post-classification ROVER, to study the issue. Using the glottal, TEO and TECC features, whose stability in distinguishing emotions has been highlighted on multiple databases, the systematic computational structure proposed in this thesis could improve the performance of cross-database binary emotion recognition by up to 23% for neutral vs. emotional and 10% for positive vs. negative.
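Two of the ingredients named here are simple enough to sketch: per-database feature normalization (z-scoring is one plausible choice; the abstract does not enumerate the three methods studied) and post-classification fusion by voting in the spirit of ROVER. A hypothetical Python illustration:

    import numpy as np
    from collections import Counter

    # Per-database z-score normalization and simple majority voting
    # across classifiers, both illustrative stand-ins.

    def z_normalise(features_by_db):
        out = {}
        for db, X in features_by_db.items():     # X: (utterances, dims)
            out[db] = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
        return out

    def vote(labels_per_classifier):
        return Counter(labels_per_classifier).most_common(1)[0][0]

    print(vote(["positive", "negative", "positive"]))  # -> "positive"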
APA, Harvard, Vancouver, ISO, and other styles
43

LUDWICZAK, LEIGH ANN. "CHILDRENS' FIRST FIVE WORDS: AN ANALYSIS OF PERCEPTUAL FEATURES, GRAMMATICAL CATEGORIES, AND COMMUNICATIVE INTENTIONS." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin990647609.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

SIQUEIRA, JAN KRUEGER. "CONTINUOUS SPEECH RECOGNITION WITH MFCC, SSCH AND PNCC FEATURES, WAVELET DENOISING AND NEURAL NETWORKS." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2011. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=19143@1.

Full text
Abstract:
CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO
One of the biggest challenges in continuous speech recognition is developing systems that are robust to additive noise. To that end, this work analyses and tests three techniques. The first is the extraction of features from the speech signal using the MFCC, SSCH and PNCC methods. The second is the removal of noise from the speech signal via wavelet denoising. The third is an original proposal named feature denoising, which seeks to improve the extracted features using a set of neural networks. Although some of these techniques are already known in the literature, their combination yielded several interesting and novel results; notably, the best performance comes from the union of PNCC and feature denoising.
One of the biggest challenges in the continuous speech recognition field is to develop systems that are robust to additive noise. To do so, this work analyses and tests three techniques. The first extracts features from the voice signal using the MFCC, SSCH and PNCC methods. The second removes noise from the voice signal through wavelet denoising. The third is an original proposal, called feature denoising, that seeks to improve the extracted features using a set of neural networks. Although some of these techniques are already known in the literature, their combination yields many interesting and new results. In fact, the best performance comes from the union of PNCC and feature denoising.
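The feature denoising idea, as described, amounts to learning a mapping from noisy features to clean ones. A minimal sketch with synthetic data follows; everything below (data, network size) is illustrative, whereas the thesis works with PNCC-style features and its own network configuration:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # "Feature denoising" sketch: a small MLP learns to map features
    # extracted from noisy speech to the corresponding clean features.

    rng = np.random.default_rng(0)
    clean = rng.standard_normal((500, 13))        # stand-in clean features
    noisy = clean + 0.3 * rng.standard_normal(clean.shape)

    denoiser = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                            random_state=0)
    denoiser.fit(noisy, clean)
    restored = denoiser.predict(noisy)            # feed to the recogniser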
APA, Harvard, Vancouver, ISO, and other styles
45

Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Chan, Oscar. "Prosodic features for a maximum entropy language model." University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0244.

Full text
Abstract:
A statistical language model attempts to characterise the patterns present in a natural language as a probability distribution defined over word sequences. Typically, such models are trained using word co-occurrence statistics from a large sample of text. In some language modelling applications, such as automatic speech recognition (ASR), the availability of acoustic data provides an additional source of knowledge. This contains, amongst other things, the melodic and rhythmic aspects of speech referred to as prosody. Although prosody has been found to be an important factor in human speech recognition, its use in ASR has been limited. The goal of this research is to investigate how prosodic information can be employed to improve the language modelling component of a continuous speech recognition system. Because prosodic features are largely suprasegmental, operating over units larger than the phonetic segment, the language model is an appropriate place to incorporate such information. The prosodic features and standard language model features are combined under the maximum entropy framework, which provides an elegant solution to modelling information obtained from multiple, differing knowledge sources. We derive features for the model based on perceptually transcribed Tones and Break Indices (ToBI) labels, and analyse their contribution to the word recognition task. While ToBI has a solid foundation in linguistic theory, the need for human transcribers conflicts with the statistical model's requirement for a large quantity of training data. We therefore also examine the applicability of features which can be automatically extracted from the speech signal. We develop representations of an utterance's prosodic context using fundamental frequency, energy and duration features, which can be directly incorporated into the model without the need for manual labelling. Dimensionality reduction techniques are also explored with the aim of reducing the computational costs associated with training a maximum entropy model. Experiments on a prosodically transcribed corpus show that small but statistically significant reductions to perplexity and word error rates can be obtained by using both manually transcribed and automatically extracted features.
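The maximum entropy framework combines heterogeneous features log-linearly, so an n-gram indicator and a prosodic indicator (say, a ToBI break index) enter the model in exactly the same way. A toy Python sketch with invented feature functions and weights:

    import numpy as np

    # Maximum entropy model: P(w | context) is a normalised exponential
    # of a weighted sum of feature functions f_i(context, w).

    def maxent_probs(candidates, context, feature_fns, lambdas):
        scores = np.array([
            sum(l * f(context, w) for f, l in zip(feature_fns, lambdas))
            for w in candidates
        ])
        e = np.exp(scores - scores.max())   # stable softmax
        return e / e.sum()

    # Toy features: a bigram indicator and a prosodic break indicator.
    bigram = lambda ctx, w: 1.0 if (ctx["prev"], w) == ("going", "to") else 0.0
    break4 = lambda ctx, w: 1.0 if ctx["break_index"] == 4 and w == "</s>" else 0.0

    p = maxent_probs(["to", "</s>"],
                     {"prev": "going", "break_index": 1},
                     [bigram, break4], lambdas=[1.2, 0.8])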
APA, Harvard, Vancouver, ISO, and other styles
47

Juzwin, Kathryn Rossetto. "The effects of perceptual interference and noninterference on facial recognition based on outer and inner facial features." Virtual Press, 1986. http://liblink.bsu.edu/uhtbin/catkey/447843.

Full text
Abstract:
This study investigated the effects of interference from a center stimulus on the recognition of faces presented in each visual half-field using tachistoscopic presentation. Based on prior studies, it was hypothesized that faces would be recognized more accurately based on outline features when presented to the left visual field (right hemisphere) and on inner features for the right visual field (left hemisphere). It was also hypothesized that digits presented at center fixation would interfere most with the recognition of the inner details of faces presented to the right hemisphere, since recognizing both faces and digits requires high-frequency spectral analysis (Sergent, 1982b). Each stimulus was composed of either a number or a blank at center fixation and a face placed either to the left or right of fixation. The results indicated no performance differences due to the visual field of presentation. Recognition was most accurate when no center stimulus was present, and recognition of outer details was more accurate than recognition of inner details. Subjects tended to use top-to-bottom processing for faces in both visual fields.
APA, Harvard, Vancouver, ISO, and other styles
48

Christensen, Carl V. "Fluency Features and Elicited Imitation as Oral Proficiency Measurement." BYU ScholarsArchive, 2012. https://scholarsarchive.byu.edu/etd/3114.

Full text
Abstract:
The objective and automatic grading of oral language tests has been the subject of significant research in recent years. Several obstacles lie in the way of achieving this goal. Recent work has suggested that a testing technique called elicited imitation (EI) can be used to accurately approximate global oral proficiency. This testing methodology, however, does not incorporate some fundamental aspects of language, such as fluency. Other work has suggested another testing technique, simulated speech (SS), as a supplement to EI that can provide automated fluency metrics. In this work, I investigate a combination of fluency features extracted from SS testing and EI test scores to more accurately predict oral language proficiency. I also investigate the role of EI as an oral language test, and the optimal method of extracting fluency features from SS sound files. Results demonstrate the ability of EI and SS to more effectively predict hand-scored SS test item scores. I finally discuss the implications of this work for future automated oral testing scenarios.
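One way to read the combination described here is as a regression from EI scores plus fluency metrics to a hand-assigned proficiency score. A hypothetical sketch with synthetic data; the particular fluency features (speech rate, pause ratio) are assumptions for illustration, not the thesis's exact inventory:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Combine an EI item score with fluency metrics from simulated
    # speech to predict a hand-scored proficiency value.

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(0, 4, 200),      # EI score
        rng.uniform(1, 5, 200),      # speech rate (syllables/second)
        rng.uniform(0, 0.6, 200),    # pause ratio
    ])
    y = 0.8 * X[:, 0] + 0.5 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 0.2, 200)

    model = LinearRegression().fit(X, y)
    predicted = model.predict(X)     # approximates the hand scores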
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Yihan. "Automatic Speech Recognition Model for Swedish using Kaldi." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285538.

Full text
Abstract:
With the development of the intelligent era, speech recognition has become a hot topic. Although many automatic speech recognition (ASR) tools have been put on the market, a considerable number of them do not support Swedish because of its small number of speakers. In this project, a Swedish ASR model based on Hidden Markov Models and Gaussian Mixture Models is established using Kaldi, with the aim of helping ICA Banken classify after-sales voice calls. A variety of model configurations have been explored, with different phoneme combination methods and different eigenvalue extraction and processing methods. Word Error Rate and Real Time Factor are selected as evaluation criteria to compare the recognition accuracy and speed of the models. As far as large-vocabulary continuous speech recognition is concerned, triphone models perform much better than monophone models. Adding feature transformations further improves both accuracy and speed. The combination of linear discriminant analysis, maximum likelihood linear transform and speaker adaptive training obtains the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients are more conducive to higher accuracy, while perceptual linear prediction tends to improve the overall speed.
Several solutions for automatic transcription exist on the market, but a large share of them do not support Swedish because of its relatively small number of speakers. In this project, automatic transcription for Swedish was created with Hidden Markov models and Gaussian mixture models using Kaldi, in order to enable ICA Banken to classify calls to its customer service. A range of model variants with different phoneme combination methods, eigenvalue computation and data processing methods were explored. Word error rate and real time factor were chosen as evaluation criteria to compare the precision and speed of the models. For large-vocabulary continuous transcription, triphone models give much better performance than monophone models. With the help of transformations, both precision and speed improve. The combination of linear discriminant analysis, maximum likelihood linear transform and speaker adaptive training gives the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients contribute to better precision, while perceptual linear prediction tends to increase speed.
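The two evaluation criteria used in this project are easy to pin down precisely; a minimal sketch follows (illustrative Python, not the project's Kaldi scoring scripts):

    import numpy as np

    # Word error rate via Levenshtein alignment over words, and real
    # time factor as decoding time divided by audio duration.

    def wer(reference, hypothesis):
        r, h = reference.split(), hypothesis.split()
        D = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
        D[:, 0] = np.arange(len(r) + 1)
        D[0, :] = np.arange(len(h) + 1)
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = D[i - 1, j - 1] + (r[i - 1] != h[j - 1])
                D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
        return D[len(r), len(h)] / len(r)

    def rtf(decode_seconds, audio_seconds):
        return decode_seconds / audio_seconds

    print(wer("hej jag heter anna", "hej jag heter hanna"))  # 0.25
    print(rtf(decode_seconds=12.0, audio_seconds=60.0))      # 0.2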
APA, Harvard, Vancouver, ISO, and other styles
50

Pietrzyk, Mariusz W. "Spatial frequency analysis of the perceptual features involved in pulmonary nodule detection and recognition from posterior-anterior chest radiographs." Thesis, Lancaster University, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556697.

Full text
Abstract:
RATIONALE AND OBJECTIVES: Radiological error due to the incorrect interpretation of medical images still occurs in current practice, and continues to be reported in both laboratory and clinical experimental conditions. In general radiological practice, error rates range from 3-5%; however, that rate reaches up to 30% for the detection of some early pulmonary cancers. Computer-Aided Detection (CAD) algorithms have been proposed to support human observers in verifying their choices. Although CAD systems might help in certain situations, their general implementation in clinical practice is still controversial. Perceptual studies involving psychophysical approaches to the error problem may give some insight into the gap between advances in image processing and the nature of radiological expertise. Moreover, some neuroscientific evidence underlines the importance of the processing of the spatial frequency properties of visual stimuli carried out by the Human Visual System (HVS). This has provided the inspiration for spatial frequency analysis of certain Regions of Interest (ROI) selected by human observers in medical image interpretation in a number of studies. Such studies have been conducted in mammography, focusing on the relationship between the physical properties, the type of radiological outcome and the dwell time. The spatial frequency features in mammograms are very specific, however, and this leads to the question of whether the results for mammography can be generalised to other medical images. RESEARCH AIMS: This study aims to investigate the perceptual criteria used in decision-making processes in pulmonary lung nodule detection from Posterior-Anterior (PA) Chest Radiographs (CxR). Moreover, the development of radiological expertise has been taken into account by comparing the results obtained from subjects with different levels of experience in the field. MATERIALS AND METHODOLOGY: Ten participant observers were selected from each of the following groups: radiologists and reporting radiographers (experts), radiography students (two levels of novices), and those without any relevant experience (naive). Subjects participated in an eye-tracking experiment during lung nodule detection from a set of PA radiographs with a 50% prevalence of pathology. Twenty radiological cases were included in the data bank, of which ten contained one to five nodules. The assessment of performance for each individual was calculated based on the Jack-knife Alternative Free-response Receiver Operating Characteristic Figure of Merit (JAFROC FOM). Eye-tracking data was used to divide images into areas of foveal visual attention distribution, from the most dwelled-upon to totally ignored Regions of Interest (ROI). These selected sites were analysed in terms of spatial frequency properties using 2D Stationary Wavelet Packet Transform (SWPT) frames with Daubechies functions up to three levels of decomposition. The logarithm of the energy carried by each wavelet coefficient represents the amount of visual information coded by the spatial frequency range ωi = f(ωx/ωy) in a particular orientation θi = g(tan(ωx/ωy)), and is called a Spatial Frequency Band (SFBj). A reduction procedure was applied to eliminate redundancy in the information coded by the set of SFBs; thus, the 84 bands obtained from the third level of decomposition were reduced to twenty-nine bands.
The degree of dissimilarity in the spatial frequency domain between selected regions was explored by statistical analysis of the wavelet representations at the sites of subjects' responses. The locality of selected sites was limited by the foveal Field of View (FOV). The dissimilarities between wavelet representations were measured according to the number of non-redundant SFBj within which significant differences (p<0.05) were found according to an analysis of variance (ANOVA) with a post hoc test. The statistical analysis embraced subject-related factors (expertise level, JAFROC FOM, dwell time) and image-based features (nodule detectability, conspicuity, localization and spatial frequency description). These factors were considered as independent variables in the visual attention distribution and decision-site studies. RESULTS: The correctness of second or higher order responses was highly correlated with the category of the first decision outcome made on the case; that correlation shows that the probability of an accurate end-point decision is related to the first decision. Experts are more accurate in dedicating visual attention to the more relevant areas containing pulmonary nodules. Significant differences were found in the spatial frequency domain between nodule-containing regions which were fixated and those which were left without focal attention. The JAFROC FOM calculation based on overall performance characterizes the more experienced subjects as being more accurate in decision-making and less variable in FOM value within a group. Moreover, high accuracy of subject performance was correlated with the allocation of visual attention to normal regions which are more similar to the nodule-containing sites in terms of spatial frequency features. The experts' ability to distinguish the most attractive True-Negative (TN) regions from True-Positive (TP) ones while avoiding False-Positives (FP) was shown to be accompanied by differences at the spatial frequency level. A high correlation between the correctness of the first overt decision made on a case and the quality of performance was found to be significant (r=0.75). The category of the first response affects the perceptual criteria applied to form the final decision outcome. CONCLUSIONS: The main contribution to knowledge of this work is that for the first time the SFA was conducted on a radiological task other than mammography. The work lends significant weight to the argument that spatial frequency channels coded through a wavelet paradigm are a characterising feature of visual perception, and that this phenomenon is generalisable to areas of radiology other than breast imaging, mammograms being quite distinctive in terms of image-based features compared to images obtained from other medical imaging modalities. This work also contributes an extension to previous studies on non-expert groups through an investigation into trends in the development of radiological experience. There is some agreement with the conclusions presented by others who suggest that experts may use specific neural connections - a set of spatial frequency channels tuned to specific object detection - during visual search in a radiological task.
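For a concrete sense of the log-energy band description used here, the following sketch computes the log energies of 2D wavelet packet sub-bands with a Daubechies wavelet. Note that it uses the decimated wavelet packet transform available in the pywt library as a stand-in; the thesis itself uses the stationary (undecimated) variant to depth three:

    import numpy as np
    import pywt

    # Log-energy per spatial frequency band of an ROI from a 2D wavelet
    # packet decomposition (decimated, as a stand-in for the stationary
    # transform). One value per sub-band at the chosen depth.

    def sfb_log_energies(roi, wavelet="db3", level=3):
        wp = pywt.WaveletPacket2D(data=roi, wavelet=wavelet, maxlevel=level)
        return {node.path: float(np.log(np.sum(node.data ** 2) + 1e-12))
                for node in wp.get_level(level)}

    roi = np.random.rand(64, 64)        # stand-in for a fixated region
    bands = sfb_log_energies(roi)       # e.g. 64 bands at depth three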
APA, Harvard, Vancouver, ISO, and other styles