Dissertations / Theses on the topic 'Automatic speech recognition – Statistical methods'

Consult the top 36 dissertations / theses for your research on the topic 'Automatic speech recognition – Statistical methods.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Wu, Jian, and 武健. "Discriminative speaker adaptation and environmental robustness in automatic speech recognition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31246138.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

黃伯光 and Pak-kwong Wong. "Statistical language models for Chinese recognition: speech and character." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1998. http://hub.hku.hk/bib/B31239456.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Chan, Oscar. "Prosodic features for a maximum entropy language model." University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0244.

Full text
Abstract:
A statistical language model attempts to characterise the patterns present in a natural language as a probability distribution defined over word sequences. Typically, such models are trained using word co-occurrence statistics from a large sample of text. In some language modelling applications, such as automatic speech recognition (ASR), the availability of acoustic data provides an additional source of knowledge. This contains, amongst other things, the melodic and rhythmic aspects of speech referred to as prosody. Although prosody has been found to be an important factor in human speech recognition, its use in ASR has been limited. The goal of this research is to investigate how prosodic information can be employed to improve the language modelling component of a continuous speech recognition system. Because prosodic features are largely suprasegmental, operating over units larger than the phonetic segment, the language model is an appropriate place to incorporate such information. The prosodic features and standard language model features are combined under the maximum entropy framework, which provides an elegant solution to modelling information obtained from multiple, differing knowledge sources. We derive features for the model based on perceptually transcribed Tones and Break Indices (ToBI) labels, and analyse their contribution to the word recognition task. While ToBI has a solid foundation in linguistic theory, the need for human transcribers conflicts with the statistical model's requirement for a large quantity of training data. We therefore also examine the applicability of features which can be automatically extracted from the speech signal. We develop representations of an utterance's prosodic context using fundamental frequency, energy and duration features, which can be directly incorporated into the model without the need for manual labelling. Dimensionality reduction techniques are also explored with the aim of reducing the computational costs associated with training a maximum entropy model. Experiments on a prosodically transcribed corpus show that small but statistically significant reductions to perplexity and word error rates can be obtained by using both manually transcribed and automatically extracted features.
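For reference, the maximum entropy framework mentioned above combines the N-gram and prosodic feature functions in a single exponential model; in its standard form (the symbols below are the textbook ones, not necessarily this thesis's notation):

    P(w \mid h) = \frac{1}{Z_\lambda(h)} \exp\Big( \sum_i \lambda_i f_i(w, h) \Big), \qquad
    Z_\lambda(h) = \sum_{w'} \exp\Big( \sum_i \lambda_i f_i(w', h) \Big)

where h is the (word and prosodic) context, each f_i is a feature function, and the weights \lambda_i are trained to maximise the likelihood of the training data.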
APA, Harvard, Vancouver, ISO, and other styles
4

Fu, Qiang. "A generalization of the minimum classification error (MCE) training method for speech recognition and detection." Diss., Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/22705.

Full text
Abstract:
The model training algorithm is a critical component in statistical pattern recognition approaches based on Bayes decision theory. Conventional applications of Bayes decision theory usually assume uniform error cost, resulting in the ubiquitous use of the maximum a posteriori (MAP) decision policy and of distribution estimation as the standard practice in designing a statistical pattern recognition system. The minimum classification error (MCE) training method was proposed to overcome some substantial limitations of conventional distribution estimation methods. In this thesis, three aspects of the MCE method are generalized. First, an optimal classifier/recognizer design framework is constructed, aiming at minimizing non-uniform error cost. A generalized training criterion named weighted MCE is proposed for pattern and speech recognition tasks with non-uniform error cost. Second, the MCE method for speech recognition tasks requires appropriate management of multiple recognition hypotheses for each data segment. A modified version of the MCE method with a new approach to selecting and organizing recognition hypotheses is proposed for continuous phoneme recognition. Third, the minimum verification error (MVE) method for detection-based automatic speech recognition (ASR) is studied. The MVE method can be viewed as a special version of the MCE method which aims at minimizing detection/verification errors. We present experiments on pattern recognition and speech recognition tasks to demonstrate the effectiveness of these generalizations.
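As background, the standard MCE criterion that these generalizations start from (following Juang and Katagiri; notation here is the textbook one) smooths the zero-one classification error through a misclassification measure and a sigmoid loss:

    d_k(x) = -g_k(x; \Lambda) + \log \Big[ \frac{1}{M-1} \sum_{j \neq k} e^{\eta\, g_j(x; \Lambda)} \Big]^{1/\eta}, \qquad
    \ell(d_k(x)) = \frac{1}{1 + e^{-\gamma d_k(x)}}

where g_j is the discriminant function of class j and \Lambda the model parameters; the weighted-MCE generalization described above can be viewed as scaling this loss by a non-uniform, error-dependent cost.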
APA, Harvard, Vancouver, ISO, and other styles
5

Seward, Alexander. "Efficient Methods for Automatic Speech Recognition." Doctoral thesis, KTH, Tal, musik och hörsel, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3675.

Full text
Abstract:
This thesis presents work in the area of automatic speech recognition (ASR). The thesis focuses on methods for increasing the efficiency of speech recognition systems and on techniques for efficient representation of different types of knowledge in the decoding process. In this work, several decoding algorithms and recognition systems have been developed, aimed at various recognition tasks. The thesis presents the KTH large vocabulary speech recognition system. The system was developed for online (live) recognition with large vocabularies and complex language models. The system utilizes weighted transducer theory for efficient representation of different knowledge sources, with the purpose of optimizing the recognition process. A search algorithm for efficient processing of hidden Markov models (HMMs) is presented. The algorithm is an alternative to the classical Viterbi algorithm for fast computation of shortest paths in HMMs. It is part of a larger decoding strategy aimed at reducing the overall computational complexity in ASR. In this approach, all HMM computations are completely decoupled from the rest of the decoding process. This enables the use of larger vocabularies and more complex language models without an increase of HMM-related computations. Ace is another speech recognition system developed within this work. It is a platform aimed at facilitating the development of speech recognizers and new decoding methods. A real-time system for low-latency online speech transcription is also presented. The system was developed within a project with the goal of improving the possibilities for hard-of-hearing people to use conventional telephony by providing speech-synchronized multimodal feedback. This work addresses several additional requirements implied by this special recognition task.
QC 20100811
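As context for the HMM search algorithm above, the classical Viterbi recursion to which it offers an alternative can be sketched in a few lines of Python (log-domain, toy matrices; this is the textbook baseline, not the thesis's decoder):

    import numpy as np

    def viterbi(log_init, log_trans, log_emit):
        """Best (max log-probability) state path through an HMM.
        log_init: (S,) initial state log-probabilities
        log_trans: (S, S) transition log-probabilities
        log_emit: (T, S) per-frame emission log-likelihoods
        """
        T, S = log_emit.shape
        delta = log_init + log_emit[0]            # best score ending in each state
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans   # scores[i, j]: come from i, go to j
            backptr[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):             # trace the best path backwards
            path.append(int(backptr[t, path[-1]]))
        return list(reversed(path))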
APA, Harvard, Vancouver, ISO, and other styles
6

Clarkson, P. R. "Adaptation of statistical language models for automatic speech recognition." Thesis, University of Cambridge, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.597745.

Full text
Abstract:
Statistical language models encode linguistic information in such a way as to be useful to systems which process human language. Such systems include those for optical character recognition and machine translation. Currently, however, the most common application of language modelling is in automatic speech recognition, and it is this that forms the focus of this thesis. Most current speech recognition systems are dedicated to one specific task (for example, the recognition of broadcast news), and thus use a language model which has been trained on text which is appropriate to that task. If, however, one wants to perform recognition on more general language, then creating an appropriate language model is far from straightforward. A task-specific language model will often perform very badly on language from a different domain, whereas a model trained on text from many diverse styles of language might perform better in general, but will not be especially well suited to any particular domain. Thus the idea of an adaptive language model whose parameters automatically adjust to the current style of language is an appealing one. In this thesis, two adaptive language models are investigated. The first is a mixture-based model. The training text is partitioned according to the style of text, and a separate language model is constructed for each component. Each component is assigned a weighting according to its performance at modelling the observed text, and a final language model is constructed as the weighted sum of each of the mixture components. The second approach is based on a cache of recent words. Previous work has shown that words that have occurred recently have a higher probability of occurring in the immediate future than would be predicted by a standard trigram language model. This thesis investigates the hypothesis that more recent words should be considered more significant within the cache by implementing a cache in which a word's recurrence probability decays exponentially over time. The problem of how to predict the effect of a particular language model on speech recognition accuracy is also addressed in this thesis. The results presented here, as well as those of other recent research, suggest that perplexity, the most commonly used method of evaluating language models, is not as well correlated with word error rate as was once thought. This thesis investigates the connection between a language model's perplexity and its effect on speech recognition performance, and will describe the development of alternative measures of a language model's quality which are better correlated with word error rate. Finally, it is shown how the recognition performance which is achieved using mixture-based language models can be improved by optimising the mixture weights with respect to these new measures.
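A minimal sketch of the exponentially decaying cache investigated here, interpolated with a standard trigram estimate (the parameter names and values are illustrative, not the thesis's):

    import math

    def cache_probability(word, history, decay=0.005):
        """P_cache(word): recent occurrences weigh more, decaying with distance."""
        last = len(history) - 1
        weight = lambda i: math.exp(-decay * (last - i))   # newer position, larger weight
        total = sum(weight(i) for i in range(len(history)))
        hits = sum(weight(i) for i, w in enumerate(history) if w == word)
        return hits / total if total > 0 else 0.0

    def combined_probability(p_trigram, word, history, lam=0.9):
        """Interpolate the static trigram model with the decaying cache."""
        return lam * p_trigram + (1 - lam) * cache_probability(word, history)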
APA, Harvard, Vancouver, ISO, and other styles
7

Wei, Yi. "Statistical methods on automatic aircraft recognition in aerial images." Thesis, University of Strathclyde, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.248947.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Wong, Pak-kwong. "Statistical language models for Chinese recognition : speech and character /." Hong Kong : University of Hong Kong, 1998. http://sunzi.lib.hku.hk/hkuto/record.jsp?B20158725.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

McGreevy, Michael. "Statistical language modelling for large vocabulary speech recognition." Thesis, Queensland University of Technology, 2006. https://eprints.qut.edu.au/16444/1/Michael_McGreevy_Thesis.pdf.

Full text
Abstract:
The move towards larger vocabulary Automatic Speech Recognition (ASR) systems places greater demands on language models. In a large vocabulary system, acoustic confusion is greater, thus there is more reliance placed on the language model for disambiguation. In addition to this, ASR systems are increasingly being deployed in situations where the speaker is not conscious of their interaction with the system, such as in recorded meetings and surveillance scenarios. This results in more natural speech, which contains many false starts and disfluencies. In this thesis we investigate a novel approach to the modelling of speech corrections. We propose a syntactic model of speech corrections, and seek to determine if this model can improve on the performance of standard language modelling approaches when applied to conversational speech. We investigate a number of related variations to our basic approach and compare these approaches against the class-based N-gram. We also investigate the modelling of styles of speech. Specifically, we investigate whether the incorporation of prior knowledge about sentence types can improve the performance of language models. We propose a sentence mixture model based on word-class N-grams, in which the sentence mixture models and the word-class membership probabilities are jointly trained. We compare this approach with word-based sentence mixture models.
APA, Harvard, Vancouver, ISO, and other styles
10

McGreevy, Michael. "Statistical language modelling for large vocabulary speech recognition." Queensland University of Technology, 2006. http://eprints.qut.edu.au/16444/.

Full text
Abstract:
The move towards larger vocabulary Automatic Speech Recognition (ASR) systems places greater demands on language models. In a large vocabulary system, acoustic confusion is greater, thus there is more reliance placed on the language model for disambiguation. In addition to this, ASR systems are increasingly being deployed in situations where the speaker is not conscious of their interaction with the system, such as in recorded meetings and surveillance scenarios. This results in more natural speech, which contains many false starts and disfluencies. In this thesis we investigate a novel approach to the modelling of speech corrections. We propose a syntactic model of speech corrections, and seek to determine if this model can improve on the performance of standard language modelling approaches when applied to conversational speech. We investigate a number of related variations to our basic approach and compare these approaches against the class-based N-gram. We also investigate the modelling of styles of speech. Specifically, we investigate whether the incorporation of prior knowledge about sentence types can improve the performance of language models. We propose a sentence mixture model based on word-class N-grams, in which the sentence mixture models and the word-class membership probabilities are jointly trained. We compare this approach with word-based sentence mixture models.
APA, Harvard, Vancouver, ISO, and other styles
11

Doulaty, Bashkand Mortaza. "Methods for addressing data diversity in automatic speech recognition." Thesis, University of Sheffield, 2017. http://etheses.whiterose.ac.uk/17096/.

Full text
Abstract:
The performance of speech recognition systems is known to degrade in mismatched conditions, where the acoustic environment and the speaker population significantly differ between the training and target test data. Performance degradation due to the mismatch is widely reported in the literature, particularly for diverse datasets. This thesis approaches the mismatch problem in diverse datasets with various strategies including data refinement, variability modelling and speech recognition model adaptation. These strategies are realised in six novel contributions. The first contribution is a data subset selection technique using a likelihood ratio, derived from a target test set, that quantifies mismatch. The second contribution is a multi-style training method using data augmentation. The existing training data is augmented using a distribution of variabilities learnt from a target dataset, resulting in a matched set. The third contribution is a new approach for genre identification in diverse media data with the aim of reducing the mismatch in an adaptation framework. The fourth contribution is a novel method which performs an unsupervised domain discovery using latent Dirichlet allocation. Since the latent domains have a high correlation with some subjective meta-data tags, such as genre labels of media data, features derived from the latent domains are successfully applied to the genre and broadcast show identification tasks. The fifth contribution extends the latent modelling technique for acoustic model adaptation, where latent-domain specific models are adapted from a base model. As the sixth contribution, an alternative adaptation approach is proposed where subspace adaptation of deep neural network acoustic models is performed using the proposed latent-domain aware training procedure. All of the proposed techniques for mismatch reduction are verified using diverse datasets. Using the data selection, data augmentation and latent-domain model adaptation methods, the mismatch between training and testing conditions of diverse ASR systems is reduced, resulting in more robust speech recognition systems.
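As a rough illustration of the unsupervised latent domain discovery contribution, latent Dirichlet allocation can be run over per-show word counts; the data and settings below are placeholders, not the thesis's configuration:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    transcripts = ["transcript of show one ...", "transcript of show two ..."]  # placeholder
    counts = CountVectorizer(max_features=5000).fit_transform(transcripts)

    lda = LatentDirichletAllocation(n_components=8, random_state=0)  # 8 latent domains, assumed
    domain_posteriors = lda.fit_transform(counts)   # per-show mixture over latent domains
    # These posteriors can serve as features for genre/show identification,
    # or to assign data to latent-domain specific acoustic models.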
APA, Harvard, Vancouver, ISO, and other styles
12

Gayvert, Robert T. "A statistical approach to formant tracking /." Online version of thesis, 1988. http://hdl.handle.net/1850/10499.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Salvi, Giampiero. "Mining Speech Sounds : Machine Learning Methods for Automatic Speech Recognition and Analysis." Doctoral thesis, Stockholm : KTH School of Computer Science and Comunication, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4111.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Whittaker, Edward William Daniel. "Statistical language modelling for automatic speech recognition of Russian and English." Thesis, University of Cambridge, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.621936.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Singh-Miller, Natasha 1981. "Neighborhood analysis methods in acoustic modeling for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/62450.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 121-134).
This thesis investigates the problem of using nearest-neighbor based non-parametric methods for performing multi-class class-conditional probability estimation. The methods developed are applied to the problem of acoustic modeling for speech recognition. Neighborhood components analysis (NCA) (Goldberger et al. [2005]) serves as the departure point for this study. NCA is a non-parametric method that can be seen as providing two things: (1) low-dimensional linear projections of the feature space that allow nearest-neighbor algorithms to perform well, and (2) nearest-neighbor based class-conditional probability estimates. First, NCA is used to perform dimensionality reduction on acoustic vectors, a commonly addressed problem in speech recognition. NCA is shown to perform competitively with another commonly employed dimensionality reduction technique in speech known as heteroscedastic linear discriminant analysis (HLDA) (Kumar [1997]). Second, a nearest neighbor-based model related to NCA is created to provide a class-conditional estimate that is sensitive to the possible underlying relationship between the acoustic-phonetic labels. An embedding of the labels is learned that can be used to estimate the similarity or confusability between labels. This embedding is related to the concept of error-correcting output codes (ECOC) and therefore the proposed model is referred to as NCA-ECOC. The estimates provided by this method along with nearest neighbor information are shown to provide improvements in speech recognition performance (2.5% relative reduction in word error rate). Third, a model for calculating class-conditional probability estimates is proposed that generalizes GMM, NCA, and kernel density approaches. This model, called locally-adaptive neighborhood components analysis, LA-NCA, learns different low-dimensional projections for different parts of the space. The model exploits the fact that in different parts of the space different directions may be important for discrimination between the classes. This model is computationally intensive and prone to over-fitting, so methods for sub-selecting neighbors used for providing the class-conditional estimates are explored. The estimates provided by LA-NCA are shown to give significant gains in speech recognition performance (7-8% relative reduction in word error rate) as well as phonetic classification.
by Natasha Singh-Miller.
Ph.D.
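For reference, the NCA objective underlying the models above (as defined in Goldberger et al. [2005]) learns a linear projection A by maximising the expected number of points whose stochastic nearest neighbours share their label:

    p_{ij} = \frac{\exp(-\lVert A x_i - A x_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert A x_i - A x_k \rVert^2)}, \quad p_{ii} = 0, \qquad
    f(A) = \sum_i \sum_{j : y_j = y_i} p_{ij}

Class-conditional probability estimates of the kind used in this thesis then follow by summing the neighbour mass p_{ij} that falls on each label.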
APA, Harvard, Vancouver, ISO, and other styles
16

Delmege, James W. "CLASS : a study of methods for coarse phonetic classification /." Online version of thesis, 1988. http://hdl.handle.net/1850/10449.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Dambreville, Samuel. "Statistical and geometric methods for shape-driven segmentation and tracking." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/22707.

Full text
Abstract:
Thesis (Ph. D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2008.
Committee Chair: Allen Tannenbaum; Committee Member: Anthony Yezzi; Committee Member: Marc Niethammer; Committee Member: Patricio Vela; Committee Member: Yucel Altunbasak.
APA, Harvard, Vancouver, ISO, and other styles
18

Ravindran, Sourabh. "Physiologically Motivated Methods For Audio Pattern Classification." Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/14066.

Full text
Abstract:
Human-like performance by machines in tasks of speech and audio processing has remained an elusive goal. In an attempt to bridge the gap in performance between humans and machines there has been an increased effort to study and model physiological processes. However, the widespread use of biologically inspired features proposed in the past has been hampered mainly by either the lack of robustness across a range of signal-to-noise ratios or the formidable computational costs. In physiological systems, sensor processing occurs in several stages. It is likely the case that signal features and biological processing techniques evolved together and are complementary or well matched. It is precisely for this reason that modeling the feature extraction processes should go hand in hand with modeling of the processes that use these features. This research presents a front-end feature extraction method for audio signals inspired by the human peripheral auditory system. New developments in the field of machine learning are leveraged to build classifiers to maximize the performance gains afforded by these features. The structure of the classification system is similar to what might be expected in physiological processing. Further, the feature extraction and classification algorithms can be efficiently implemented using the low-power cooperative analog-digital signal processing platform. The usefulness of the features is demonstrated for tasks of audio classification, speech versus non-speech discrimination, and speech recognition. The low-power nature of the classification system makes it ideal for use in applications such as hearing aids, hand-held devices, and surveillance through acoustic scene monitoring.
APA, Harvard, Vancouver, ISO, and other styles
19

Melin, Håkan. "Automatic speaker verification on site and by telephone: methods, applications and assessment." Doctoral thesis, KTH, Tal, musik och hörsel, TMH, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4242.

Full text
Abstract:
Speaker verification is the biometric task of authenticating a claimed identity by means of analyzing a spoken sample of the claimant's voice. The present thesis deals with various topics related to automatic speaker verification (ASV) in the context of its commercial applications, characterized by co-operative users, user-friendly interfaces, and requirements for small amounts of enrollment and test data. A text-dependent system based on hidden Markov models (HMM) was developed and used to conduct experiments, including a comparison between visual and aural strategies for prompting claimants for randomized digit strings. It was found that aural prompts lead to more errors in spoken responses and that visually prompted utterances performed marginally better in ASV, given that enrollment data were visually prompted. High-resolution flooring techniques were proposed for variance estimation in the HMMs, but results showed no improvement over the standard method of using target-independent variances copied from a background model. These experiments were performed on Gandalf, a Swedish speaker verification telephone corpus with 86 client speakers. A complete on-site application (PER), a physical access control system securing a gate in a reverberant stairway, was implemented based on a combination of the HMM and a Gaussian mixture model based system. Users were authenticated by saying their proper name and a visually prompted, random sequence of digits after having enrolled by speaking ten utterances of the same type. An evaluation was conducted with 54 out of 56 clients who succeeded in enrolling. Semi-dedicated impostor attempts were also collected. An equal error rate (EER) of 2.4% was found for this system based on a single attempt per session and after retraining the system on PER-specific development data. On parallel telephone data collected using a telephone version of PER, 3.5% EER was found with landline and around 5% with mobile telephones. Impostor attempts in this case were same-handset attempts. Results also indicate that the distributions of false reject and false accept rates over target speakers are well described by beta distributions. A state-of-the-art commercial system was also tested on PER data with similar performance as the baseline research system.
QC 20100910
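The equal error rate figures above correspond to the threshold at which the false reject and false accept rates coincide; a small illustrative sketch of estimating it from score arrays (not the thesis's evaluation code):

    import numpy as np

    def equal_error_rate(target_scores, impostor_scores):
        """Approximate EER by sweeping the decision threshold over all scores."""
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        best_gap, eer = 2.0, None
        for t in thresholds:
            frr = np.mean(target_scores < t)     # targets falsely rejected
            far = np.mean(impostor_scores >= t)  # impostors falsely accepted
            if abs(frr - far) < best_gap:
                best_gap, eer = abs(frr - far), (frr + far) / 2
        return eer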
APA, Harvard, Vancouver, ISO, and other styles
20

Yaman, Sibel. "A multi-objective programming perspective to statistical learning problems." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26470.

Full text
Abstract:
Thesis (Ph.D)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Chin-Hui Lee; Committee Member: Anthony Yezzi; Committee Member: Evans Harrell; Committee Member: Fred Juang; Committee Member: James H. McClellan. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
21

Berry, Jeffrey James. "Machine Learning Methods for Articulatory Data." Diss., The University of Arizona, 2012. http://hdl.handle.net/10150/223348.

Full text
Abstract:
Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data useful to speech scientists, and for speech technology applications. This dissertation presents feature extraction methods for ultrasound images of the tongue and for data collected with an Electro-Magnetic Articulograph (EMA). The usefulness of these features is tested in several phoneme classification tasks. Feature extraction methods for ultrasound tongue images presented here consist of automatically tracing the tongue surface contour using a modified Deep Belief Network (DBN) (Hinton et al. 2006), and methods inspired by research in face recognition which use the entire image. The tongue tracing method consists of training a DBN as an autoencoder on concatenated images and traces, and then retraining the first two layers to accept only the image at runtime. This 'translational' DBN (tDBN) method is shown to produce traces comparable to those made by human experts. An iterative bootstrapping procedure is presented for using the tDBN to assist a human expert in labeling a new data set. Tongue contour traces are compared with the Eigentongues method of (Hueber et al. 2007), and a Gabor Jet representation in a 6-class phoneme classification task using Support Vector Classifiers (SVC), with Gabor Jets performing the best. These SVC methods are compared to a tDBN classifier, which extracts features from raw images and classifies them with accuracy only slightly lower than the Gabor Jet SVC method. For EMA data, supervised binary SVC feature detectors are trained for each feature in three versions of Distinctive Feature Theory (DFT): Preliminaries (Jakobson et al. 1954), The Sound Pattern of English (Chomsky and Halle 1968), and Unified Feature Theory (Clements and Hume 1995). Each of these feature sets, together with a fourth unsupervised feature set learned using Independent Components Analysis (ICA), are compared on their usefulness in a 46-class phoneme recognition task. Phoneme recognition is performed using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001), which takes advantage of the temporal nature of speech, by looking at observations adjacent in time. Results of the phoneme recognition task show that Unified Feature Theory performs slightly better than the other versions of DFT. Surprisingly, ICA actually performs worse than running the CRF on raw EMA data.
APA, Harvard, Vancouver, ISO, and other styles
22

Khodai-Joopari, Mehrdad Information Technology &amp Electrical Engineering Australian Defence Force Academy UNSW. "Forensic speaker analysis and identification by computer : a Bayesian approach anchored in the cepstral domain." Awarded by:University of New South Wales - Australian Defence Force Academy. School of Information Technology and Electrical Engineering, 2007. http://handle.unsw.edu.au/1959.4/38715.

Full text
Abstract:
This thesis advances understanding of the forensic value of the automatic speech parameters by addressing the following question: what is the potentiality of the speech cepstrum as a forensic-acoustic parameter? Despite many advances in automatic speech and speaker recognition, robust and unconstrained progress in technical forensic speaker identification has been partly impeded by our incomplete understanding of the interaction and relation between forensic phonetics and the techniques employed in state-of-the-art automatic speech and speaker recognition. The posed question underlies the recurrent and longstanding issue of acoustic parameterisation in the area of forensic phonetics, where 1) speaker identification often must be carried out under less than optimal conditions, and 2) views differ on the usefulness and trustworthiness of the formant frequency measurements. To this end, a new formulation for the forensic evaluation of speech data was derived which is effectively a spectral likelihood ratio with enhanced sensitivity to the local peaks of the formant structure of the speech spectrum of vowel sounds, while retaining the characteristics of the Bayesian framework. This new hybrid formula was used together with a novel approach, which is founded on a statistically-based matched-pairs technique to account for various levels of variation inherent in speech recordings, thereby providing a spectrally meaningful measure of variations between two speech spectra and hence the true worth of speech samples as forensic evidence. The experimental results are obtained based on a forensically-realistic database of a relatively large population of 297 native speakers of Japanese. In sum, the research conducted in this thesis is a major step forward in advancing the forensic-phonetic field which broadens the objective basis of the forensic speaker identification. Beyond advancing knowledge in the field, the semi data-independent nature of the new formula ultimately has great implications in technical forensic speaker identification. It also provides us with a valuable biometric tool with both academic and commercial potential in crime investigation in a field which is already suffering from the lack of adequate data.
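The Bayesian framework referred to above weighs the evidence E (here, cepstral measurements from the questioned and known recordings) as a likelihood ratio between the same-speaker and different-speaker hypotheses,

    LR = \frac{p(E \mid H_\text{ss})}{p(E \mid H_\text{ds})},

and the thesis's hybrid formula is a spectral variant of this ratio with enhanced sensitivity to the local formant peaks of vowel spectra.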
APA, Harvard, Vancouver, ISO, and other styles
23

Yoshino, Koichiro. "Spoken Dialogue System for Information Navigation based on Statistical Learning of Semantic and Dialogue Structure." 京都大学 (Kyoto University), 2014. http://hdl.handle.net/2433/192214.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Zamora, Martínez Francisco Julián. "Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática." Doctoral thesis, Universitat Politècnica de València, 2012. http://hdl.handle.net/10251/18066.

Full text
Abstract:
Natural language processing is an application area of artificial intelligence and, in particular, of pattern recognition, which studies, among other things, how to incorporate syntactic information (a language model) about how the words of a given language fit together, so that recognition/translation systems can decide which hypothesis makes the best 'common sense'. It is a very broad area, and this work focuses solely on the part concerned with language modelling and its application to several tasks: sequence recognition with hidden Markov models and statistical machine translation. Specifically, this thesis centres on so-called connectionist language models, that is, language models based on neural networks. The good results of these models in several areas of natural language processing motivated this study. Owing to certain computational problems that afflict connectionist language models, the systems reported in the literature are built in two completely decoupled stages. In the first stage, a set of feasible hypotheses is found with a standard language model, under the assumption that this set is representative of the search space in which the best hypothesis lies. In the second stage, the connectionist language model is applied to this set and the highest-scoring hypothesis is extracted. This procedure is called 'rescoring'. This scenario motivates the main objectives of this thesis: to propose a technique that drastically reduces this computational cost while degrading the quality of the found solution as little as possible; to study the effect of integrating connectionist language models into the search process of the proposed tasks; and to propose modifications of the original model that improve its quality.
Zamora Martínez, FJ. (2012). Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18066
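A minimal sketch of the two-stage rescoring procedure described in this abstract; the function and parameter names (connectionist_lm, alpha) are invented for illustration:

    def rescore(nbest, connectionist_lm, alpha=0.7):
        """Second stage: re-rank first-pass hypotheses with a connectionist LM.
        nbest: list of (hypothesis, first_pass_score) pairs from the standard model.
        """
        def combined(hypothesis, first_pass_score):
            # Interpolate the first-pass score with the neural LM log-probability.
            return alpha * first_pass_score + (1 - alpha) * connectionist_lm.log_prob(hypothesis)
        return max(nbest, key=lambda pair: combined(*pair))[0]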
APA, Harvard, Vancouver, ISO, and other styles
25

Le, Hai Son. "Continuous space models with neural networks in natural language processing." Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00776704.

Full text
Abstract:
The purpose of language models is in general to capture and to model regularities of language, thereby capturing morphological, syntactical and distributional properties of word sequences in a given language. They play an important role in many successful applications of Natural Language Processing, such as Automatic Speech Recognition, Machine Translation and Information Extraction. The most successful approaches to date are based on the n-gram assumption and the adjustment of statistics from the training data by applying smoothing and back-off techniques, notably the Kneser-Ney technique, introduced twenty years ago. In this way, language models predict a word based on its n-1 previous words. In spite of their prevalence, conventional n-gram based language models still suffer from several limitations that could be intuitively overcome by consulting human expert knowledge. One critical limitation is that, ignoring all linguistic properties, they treat each word as one discrete symbol with no relation with the others. Another point is that, even with a huge amount of data, the data sparsity issue always has an important impact, so the optimal value of n in the n-gram assumption is often 4 or 5, which is insufficient in practice. This kind of model is constructed based on the count of n-grams in training data. Therefore, the pertinence of these models is conditioned only on the characteristics of the training text (its quantity, its representation of the content in terms of theme, date). Recently, one of the most successful attempts that tries to directly learn word similarities is to use distributed word representations in language modeling, where words with semantic and syntactic similarities are expected to be represented as neighbors in a continuous space. These representations and the associated objective function (the likelihood of the training data) are jointly learned using a multi-layer neural network architecture. In this way, word similarities are learned automatically. This approach has shown significant and consistent improvements when applied to automatic speech recognition and statistical machine translation tasks. A major difficulty with the continuous space neural network based approach remains the computational burden, which does not scale well to the massive corpora that are nowadays available. For this reason, the first contribution of this dissertation is the definition of a neural architecture based on a tree representation of the output vocabulary, namely Structured OUtput Layer (SOUL), which makes such models well suited for large scale frameworks. The SOUL model combines the neural network approach with the class-based approach. It achieves significant improvements on both state-of-the-art large scale automatic speech recognition and statistical machine translation tasks. The second contribution is to provide several insightful analyses on their performances, their pros and cons, their induced word space representation. Finally, the third contribution is the successful adoption of the continuous space neural network into a machine translation framework. New translation models are proposed and reported to achieve significant improvements over state-of-the-art baseline systems.
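A toy forward pass of the kind of n-gram neural language model discussed above (numpy, invented dimensions; the SOUL contribution additionally replaces the flat output layer with a class tree):

    import numpy as np

    V, d, h, n = 10000, 128, 256, 4             # vocabulary, embedding, hidden, order
    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.01, (V, d))             # continuous word representations
    W1 = rng.normal(0, 0.01, ((n - 1) * d, h))  # context-to-hidden weights
    W2 = rng.normal(0, 0.01, (h, V))            # hidden-to-output weights

    def next_word_distribution(context_ids):
        """P(w | previous n-1 words): embed, concatenate, nonlinearity, softmax."""
        x = E[context_ids].reshape(-1)          # concatenated context embeddings
        hidden = np.tanh(x @ W1)
        logits = hidden @ W2
        logits -= logits.max()                  # numerical stability for softmax
        p = np.exp(logits)
        return p / p.sum()

    probs = next_word_distribution([12, 7, 431])  # a 3-word context for order n = 4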
APA, Harvard, Vancouver, ISO, and other styles
26

Bezůšek, Marek. "Objektivizace Testu 3F - dysartrický profil pomocí akustické analýzy." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2021. http://www.nusl.cz/ntk/nusl-442568.

Full text
Abstract:
Test 3F is used to diagnose the extent of the motor speech disorder dysarthria in Czech speakers. The evaluation of dysarthric speech is distorted by subjective assessment, and there are few automatic, objective analysis tools for evaluating the phonation, articulation, prosody and respiration of disordered speech. The aim of this diploma thesis is to identify, implement and test acoustic features of speech that could be used to objectify and automate the evaluation. These features should be easily interpretable by the clinician, and the detailed analysis they provide is expected to make the evaluation more precise. The performance of these features was tested on a database of 151 Czech speakers consisting of 51 healthy speakers and 100 patients. Statistical analysis and machine learning methods were used to identify the correlation between features and subjective assessment. 27 of the 30 speech tasks of Test 3F were identified as suitable for automatic evaluation; within the scope of this thesis only 10 tasks were tested, because only a limited part of the database could be preprocessed. The statistical analysis yielded 14 features that were most useful for the test evaluation, the most significant being MET (respiration), relF0SD (intonation) and relSEOVR (voice intensity – prosody). The lowest prediction error of the machine learning regression models was 7.14 %. The conclusion is that the evaluation of most of the tasks of Test 3F can be automated, and the analysis of the 10 tasks shows that the most significant factors in dysarthria evaluation are limited expiration, a monotone voice and low variability of speech intensity.
APA, Harvard, Vancouver, ISO, and other styles
27

Sodré, Bruno Ribeiro. "Reconhecimento de padrões aplicados à identificação de patologias de laringe." Universidade Tecnológica Federal do Paraná, 2016. http://repositorio.utfpr.edu.br/jspui/handle/1/2013.

Full text
Abstract:
Diseases that affect the larynx have increased considerably in recent years due to unhealthy habits such as smoking and alcohol and to growing vocal abuse, perhaps on account of increased noise pollution, especially in large urban centres. Currently, the examinations performed by per-oral endoscopy to identify laryngeal pathologies are videolaryngoscopy and videostroboscopy, both invasive and often uncomfortable for the patient. Seeking to improve the comfort of patients who must undergo these procedures, this study aims to recognise acoustic patterns that can be applied to the identification of laryngeal pathologies, as a step towards a new non-invasive larynx assessment method. Two different configurations of neural networks were used. The first was generated from the 524,287 possible combinations of the 19 acoustic measurements available in this work, to classify voices as normal or from a diseased larynx, and achieved a maximum accuracy of 99.5% (mean 96.99±2.08%) using configurations with 11 and with 12 of the 19 acoustic measurements. Using 3 and 6 rotated measurements (obtained from the principal components analysis method), the accuracies were 93.98±0.24% and 94.07±0.29%, respectively; with 6 rotated measurements from a previous standardization of the 19 acoustic measurements, the accuracy was 97.88±1.53%. The second network, which classified 23 different voice types (22 pathologies plus normal voice), showed the best accuracy in identifying hyperfunctioned larynxes and normal voices, with 58.23±18.98% and 52.15±18.31%, respectively; the worst accuracy was obtained for vocal fatigue, with 0.57±1.99%. Excluding normal voices from the analysis, hyperfunctioned voices remained the most easily identifiable (with an accuracy of 57.3±19.55%), followed by anterior-posterior constriction (with 18.14±11.45%), and vocal fatigue remained the most difficult condition to identify (with 0.7±2.14%). Re-sampling the neural networks' input vectors yielded accuracies of 25.88±10.15%, 21.47±7.58%, and 18.44±6.57% for networks with 20, 30, and 40 hidden-layer neurons, respectively. For comparison, classification using a support vector machine produced an accuracy of 67±6.2%. Thus, it was shown that the acoustic measurements need to be improved to achieve better classification results among the studied laryngeal pathologies. Even so, it was found that it is possible to discriminate normal from dysphonic speakers.
APA, Harvard, Vancouver, ISO, and other styles
28

Goecke, Roland. "A stereo vision lip tracking algorithm and subsequent statistical analyses of the audio-video correlation in Australian English." Phd thesis, 2004. http://hdl.handle.net/1885/149999.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Ghane, Parisa. "Silent speech recognition in EEG-based brain computer interface." Thesis, 2015. http://hdl.handle.net/1805/9886.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
A Brain Computer Interface (BCI) is a hardware and software system that establishes direct communication between the human brain and the environment. In a BCI system, brain messages pass through wires and external computers instead of the normal pathway of nerves and muscles. The general workflow in all BCIs is to measure brain activity, process it, and then convert it into an output readable by a computer. The measurement of electrical activity in different parts of the brain is called electroencephalography (EEG). There are many sensor technologies, with different numbers of electrodes, for recording brain activity along the scalp. Each electrode captures a weighted sum of the activities of all neurons in the area around it. To establish a BCI system, one needs to place a set of electrodes on the scalp and a tool to send the signals to a computer, and then to train a system that can find the important information, extract it from the raw signal, and use it to recognize the user's intention. Finally, a control signal is generated according to the application. This thesis describes the step-by-step training and testing of a BCI system that can be used by a person who has lost speaking skills through an accident or surgery but still has healthy brain tissue. The goal is to establish an algorithm which recognizes different vowels from EEG signals. It uses a bandpass filter to remove noise and artifacts from the signals, the periodogram for feature extraction, and a Support Vector Machine (SVM) for classification.
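The processing chain this abstract describes (bandpass filtering, periodogram features, SVM classification) can be sketched roughly as follows; the sampling rate, band edges and the trials/labels arrays are placeholders:

    import numpy as np
    from scipy.signal import butter, filtfilt, periodogram
    from sklearn.svm import SVC

    fs = 256                                        # EEG sampling rate (Hz), assumed
    b, a = butter(4, [1.0, 40.0], btype="band", fs=fs)

    def features(trial):
        """trial: (channels, samples) segment -> concatenated per-channel spectra."""
        clean = filtfilt(b, a, trial, axis=1)       # suppress drift and line noise
        _, pxx = periodogram(clean, fs=fs, axis=1)  # power spectral density estimate
        return pxx.reshape(-1)

    X = np.stack([features(t) for t in trials])     # 'trials', 'labels' assumed given
    classifier = SVC(kernel="rbf").fit(X, labels)   # vowel classes from EEG features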
APA, Harvard, Vancouver, ISO, and other styles
30

May, Avner. "Kernel Approximation Methods for Speech Recognition." Thesis, 2018. https://doi.org/10.7916/D8D80P9T.

Full text
Abstract:
Over the past five years or so, deep learning methods have dramatically improved state-of-the-art performance in a variety of domains, including speech recognition, computer vision, and natural language processing. Importantly, however, they suffer from a number of drawbacks: 1. Training these models is a non-convex optimization problem, and thus it is difficult to guarantee that a trained model minimizes the desired loss function. 2. These models are difficult to interpret. In particular, it is difficult to explain, for a given model, why the computations it performs make accurate predictions. In contrast, kernel methods are straightforward to interpret, and training them is a convex optimization problem. Unfortunately, solving these optimization problems exactly is typically prohibitively expensive, though one can use approximation methods to circumvent this problem. In this thesis, we explore to what extent kernel approximation methods can compete with deep learning, in the context of large-scale prediction tasks. Our contributions are as follows: 1. We perform the most extensive set of experiments to date using kernel approximation methods in the context of large-scale speech recognition tasks, and compare performance with deep neural networks. 2. We propose a feature selection algorithm which significantly improves the performance of the kernel models, making their performance competitive with fully-connected feedforward neural networks. 3. We perform an in-depth comparison between two leading kernel approximation strategies — random Fourier features [Rahimi and Recht, 2007] and the Nyström method [Williams and Seeger, 2001] — showing that although the Nyström method is better at approximating the kernel, it performs worse than random Fourier features when used for learning. We believe this work opens the door for future research to continue to push the boundary of what is possible with kernel methods. This research direction will also shed light on the question of when, if ever, deep models are needed for attaining strong performance.
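Of the two approximation strategies compared, random Fourier features are the easier to sketch: for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the map below satisfies z(x) . z(y) ≈ k(x, y) in expectation (Rahimi and Recht, 2007); the dimensions are illustrative:

    import numpy as np

    def random_fourier_features(X, n_features, sigma, seed=0):
        """Map X of shape (n, d) to (n, n_features) approximating the RBF kernel."""
        rng = np.random.default_rng(seed)
        W = rng.normal(0.0, 1.0 / sigma, (X.shape[1], n_features))  # spectral samples
        b = rng.uniform(0.0, 2.0 * np.pi, n_features)               # random phases
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)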
APA, Harvard, Vancouver, ISO, and other styles
31

Chandrasekaran, Aravind. "Efficient methods for rapid UBM training (RUT) for robust speaker verification /." 2008. http://proquest.umi.com/pqdweb?did=1650508671&sid=2&Fmt=2&clientId=10361&RQT=309&VName=PQD.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Wang, Qi. "Nonlinear noise compensation in feature domain for speech recognition with numerical methods /." 2004. http://wwwlib.umi.com/cr/yorku/fullcit?pMQ99403.

Full text
Abstract:
Thesis (M.Sc.)--York University, 2004. Graduate Programme in Computer Science.
Typescript. Includes bibliographical references (leaves 60-65). Also available on the Internet via web browser at: http://wwwlib.umi.com/cr/yorku/fullcit?pMQ99403
APA, Harvard, Vancouver, ISO, and other styles
33

"The statistical evaluation of minutiae-based automatic fingerprint verification systems." Thesis, 2006. http://library.cuhk.edu.hk/record=b6074180.

Full text
Abstract:
Basic technologies for fingerprint feature extraction and matching have been improved to such a stage that they can be embedded into commercial Automatic Fingerprint Verification Systems (AFVSs). However, the reliability of AFVSs has kept attracting concerns from society, since AFVSs do fail occasionally due to difficulties like problematic fingers, changing environments, and malicious attacks. Furthermore, the absence of a solid theoretical foundation for evaluating AFVSs prevents these failures from being predicted and evaluated. Under the traditional empirical AFVS evaluation framework, repeated verification experiments, which can be very time consuming, have to be performed to test whether an update to an AFVS can really lead to an upgrade in its performance. Also, empirical verification results are often unable to provide deeper understanding of AFVSs. To solve these problems, we propose a novel statistical evaluation model for minutiae-based AFVSs based on the understanding of fingerprint minutiae patterns. This model can predict the verification performance metrics as well as their confidence intervals. The analytical power of our evaluation model, which makes it superior to empirical evaluation methods, can assist system developers to upgrade their AFVSs purposefully. Also, our model can facilitate the theoretical analysis of the advantages and disadvantages of various fingerprint verification techniques. We verify our claims through different and extensive experiments.
Chen, Jiansheng.
"November 2006."
Adviser: Yiu-Sang Moon.
Source: Dissertation Abstracts International, Volume: 68-08, Section: B, page: 5343.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2006.
Includes bibliographical references (p. 110-122).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstracts in English and Chinese.
School code: 1307.
APA, Harvard, Vancouver, ISO, and other styles
34

"Robust methods for Chinese spoken document retrieval." 2003. http://library.cuhk.edu.hk/record=b5896122.

Full text
Abstract:
Hui Pui Yu.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 158-169).
Abstracts in English and Chinese.
Abstract --- p.2
Acknowledgements --- p.6
Chapter 1 --- Introduction --- p.23
Chapter 1.1 --- Spoken Document Retrieval --- p.24
Chapter 1.2 --- The Chinese Language and Chinese Spoken Documents --- p.28
Chapter 1.3 --- Motivation --- p.33
Chapter 1.3.1 --- Assisting the User in Query Formation --- p.34
Chapter 1.4 --- Goals --- p.34
Chapter 1.5 --- Thesis Organization --- p.35
Chapter 2 --- Multimedia Repository --- p.37
Chapter 2.1 --- The Cantonese Corpus --- p.37
Chapter 2.1.1 --- The RealMedia´ёØCollection --- p.39
Chapter 2.1.2 --- The MPEG-1 Collection --- p.40
Chapter 2.2 --- The Multimedia Markup Language --- p.42
Chapter 2.3 --- Chapter Summary --- p.44
Chapter 3 --- Monolingual Retrieval Task --- p.45
Chapter 3.1 --- Properties of Cantonese Video Archive --- p.45
Chapter 3.2 --- Automatic Speech Transcription --- p.46
Chapter 3.2.1 --- Transcription of Cantonese Spoken Documents --- p.47
Chapter 3.2.2 --- Indexing Units --- p.48
Chapter 3.3 --- Known-Item Retrieval Task --- p.49
Chapter 3.3.1 --- Evaluation ´ؤ Average Inverse Rank --- p.50
Chapter 3.4 --- Retrieval Model --- p.51
Chapter 3.5 --- Experimental Results --- p.52
Chapter 3.6 --- Chapter Summary --- p.53
Chapter 4 --- The Use of Audio and Video Information for Monolingual Spoken Document Retrieval --- p.55
Chapter 4.1 --- Video-based Segmentation --- p.56
Chapter 4.1.1 --- Metric Computation --- p.57
Chapter 4.1.2 --- Shot Boundary Detection --- p.58
Chapter 4.1.3 --- Shot Transition Detection --- p.67
Chapter 4.2 --- Audio-based Segmentation --- p.69
Chapter 4.2.1 --- Gaussian Mixture Models --- p.69
Chapter 4.2.2 --- Transition Detection --- p.70
Chapter 4.3 --- Performance Evaluation --- p.72
Chapter 4.3.1 --- Automatic Story Segmentation --- p.72
Chapter 4.3.2 --- Video-based Segmentation Algorithm --- p.73
Chapter 4.3.3 --- Audio-based Segmentation Algorithm --- p.74
Chapter 4.4 --- Fusion of Video- and Audio-based Segmentation --- p.75
Chapter 4.5 --- Retrieval Performance --- p.76
Chapter 4.6 --- Chapter Summary --- p.78
Chapter 5 --- Document Expansion for Monolingual Spoken Document Retrieval --- p.79
Chapter 5.1 --- Document Expansion using Selected Field Speech Segments --- p.81
Chapter 5.1.1 --- Annotations from MmML --- p.81
Chapter 5.1.2 --- Selection of Cantonese Field Speech --- p.83
Chapter 5.1.3 --- Re-weighting Different Retrieval Units --- p.84
Chapter 5.1.4 --- Retrieval Performance with Document Expansion using Selected Field Speech --- p.84
Chapter 5.2 --- Document Expansion using N-best Recognition Hypotheses --- p.87
Chapter 5.2.1 --- Re-weighting Different Retrieval Units --- p.90
Chapter 5.2.2 --- Retrieval Performance with Document Expansion using N-best Recognition Hypotheses --- p.90
Chapter 5.3 --- Document Expansion using Selected Field Speech and N-best Recognition Hypotheses --- p.92
Chapter 5.3.1 --- Re-weighting Different Retrieval Units --- p.92
Chapter 5.3.2 --- Retrieval Performance with Different Indexed Units --- p.93
Chapter 5.4 --- Chapter Summary --- p.94
Chapter 6 --- Query Expansion for Cross-language Spoken Document Retrieval --- p.97
Chapter 6.1 --- The TDT-2 Corpus --- p.99
Chapter 6.1.1 --- English Textual Queries --- p.100
Chapter 6.1.2 --- Mandarin Spoken Documents --- p.101
Chapter 6.2 --- Query Processing --- p.101
Chapter 6.2.1 --- Query Weighting --- p.101
Chapter 6.2.2 --- Bigram Formation --- p.102
Chapter 6.3 --- Cross-language Retrieval Task --- p.103
Chapter 6.3.1 --- Indexing Units --- p.104
Chapter 6.3.2 --- Retrieval Model --- p.104
Chapter 6.3.3 --- Performance Measure --- p.105
Chapter 6.4 --- Relevance Feedback --- p.106
Chapter 6.4.1 --- Pseudo-Relevance Feedback --- p.107
Chapter 6.5 --- Retrieval Performance --- p.107
Chapter 6.6 --- Chapter Summary --- p.109
Chapter 7 --- Conclusions and Future Work --- p.111
Chapter 7.1 --- Future Work --- p.114
Chapter A --- XML Schema for Multimedia Markup Language --- p.117
Chapter B --- Example of Multimedia Markup Language --- p.128
Chapter C --- Significance Tests --- p.135
Chapter C.1 --- Selection of Cantonese Field Speech Segments --- p.135
Chapter C.2 --- Fusion of Video- and Audio-based Segmentation --- p.137
Chapter C.3 --- Document Expansion with Reporter Speech --- p.137
Chapter C.4 --- Document Expansion with N-best Recognition Hypotheses --- p.140
Chapter C.5 --- Document Expansion with Reporter Speech and N-best Recognition Hypotheses --- p.140
Chapter C.6 --- Query Expansion with Pseudo Relevance Feedback --- p.142
Chapter D --- Topic Descriptions of TDT-2 Corpus --- p.145
Chapter E --- Speech Recognition Output from Dragon in CLSDR Task --- p.148
Chapter F --- Parameters Estimation --- p.152
Chapter F.1 --- Estimating the Number of Relevant Documents, Nr --- p.152
Chapter F.2 --- Estimating the Number of Terms Added from Relevant Documents, Nrt, to Original Query --- p.153
Chapter F.3 --- Estimating the Number of Non-relevant Documents, Nn, from the Bottom-scoring Retrieval List --- p.153
Chapter F.4 --- Estimating the Number of Terms Selected from Non-relevant Documents (Nnt) to be Removed from Original Query --- p.154
Chapter G --- Abbreviations --- p.155
Bibliography --- p.158
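The outline above evaluates the known-item retrieval task with the Average Inverse Rank (Chapter 3.3.1). The following is a minimal sketch of how that measure is conventionally computed, assuming each query has exactly one known relevant document whose rank contributes its reciprocal (or zero when the document is not retrieved at all); function and variable names are illustrative, not taken from the thesis:

```python
def average_inverse_rank(rankings, known_items):
    """Average Inverse Rank (AIR) for known-item retrieval.

    rankings: query id -> ranked list of document ids (best first)
    known_items: query id -> the single relevant document id
    A query whose known item is never retrieved contributes 0.
    """
    total = 0.0
    for qid, ranked_docs in rankings.items():
        target = known_items[qid]
        if target in ranked_docs:
            total += 1.0 / (ranked_docs.index(target) + 1)  # ranks start at 1
    return total / len(rankings)

# The known item for q1 is ranked second, for q2 first: AIR = (1/2 + 1) / 2.
print(average_inverse_rank(
    {"q1": ["d7", "d3", "d9"], "q2": ["d5", "d1"]},
    {"q1": "d3", "q2": "d5"},
))  # 0.75
```

Under this single-target assumption, AIR coincides with the mean reciprocal rank.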
APA, Harvard, Vancouver, ISO, and other styles
35

Τσιλφίδης, Αλέξανδρος. "Signal processing methods for enhancing speech and music signals in reverberant environments." Thesis, 2011. http://nemertes.lis.upatras.gr/jspui/handle/10889/4710.

Full text
Abstract:
This thesis presents novel signal processing algorithms for speech and music dereverberation. The proposed algorithms focus on blind single-channel suppression of late reverberation; however, binaural and semi-blind methods are also introduced. Late reverberation is a particularly harmful distortion, since it not only decreases the perceived quality of reverberant signals but also degrades the performance of Automatic Speech Recognition (ASR) systems and other speech and music processing algorithms. Hence, the proposed dereverberation methods can either be used as standalone enhancement techniques or implemented as preprocessing schemes prior to ASR or other applied systems. The main method proposed here is a blind dereverberation technique based on perceptual reverberation modeling. This technique employs a computational auditory masking model to locate the signal regions where late reverberation is audible, i.e. where it is unmasked by the clean signal components. Following a selective signal processing approach, only such signal regions are further processed, through sub-band gain filtering. The technique has been evaluated for both speech and music signals over a wide range of reverberation conditions. In all cases it was found to minimize processing artifacts and to produce perceptually better clean-signal estimates than any other tested technique. Moreover, extensive ASR tests have shown that it significantly improves recognition performance, especially in highly reverberant environments.
The thesis consists of nine chapters, two appendices, and the associated bibliography; it is written in English and also includes a summary in Greek. In this thesis, digital signal processing methods are developed for removing reverberation from speech and music signals. The proposed algorithms cover a wide range of applications, focusing first on blind dereverberation of single-channel signals; for more specific usage scenarios, binaural algorithms are also proposed, as well as techniques that presuppose an acoustic measurement. The algorithms concentrate on removing late reverberation, which is particularly harmful to the quality of speech and music signals and reduces speech intelligibility. Because it also significantly distorts signal statistics, it substantially degrades the performance of automatic speech recognition systems and of other speech and music processing algorithms. The proposed algorithms can therefore either be used as standalone techniques for improving the quality of audio signals or be incorporated as preprocessing stages in other applications. The main dereverberation method proposed in the thesis is based on perceptual modeling and uses a modern psychoacoustic model. Based on this model, an estimate is made of the signal regions where reverberation is audible, that is, where it is not masked by the louder reverberation-free signal. This estimate leads to selective signal processing in which dereverberation is carried out in those regions only, through novel hybrid gain functions based on objective and subjective distortion measures. Extensive objective and subjective experiments show that the proposed technique yields perceptually optimal anechoic estimates regardless of room size.
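To make the selective-processing idea concrete, here is a rough single-channel sketch, assuming a Polack-style exponential model of late-reverberant energy and a fixed audibility threshold standing in for the thesis's computational auditory masking model; rt60, delay_s, and threshold_db are illustrative parameters, and this is not the author's algorithm:

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, rt60=1.0, delay_s=0.05, threshold_db=-10.0):
    """Attenuate time-frequency bins where estimated late reverberation
    is 'audible'; all other bins are left untouched (selective processing)."""
    f, t, X = stft(x, fs, nperseg=512)           # hop = nperseg // 2 = 256
    power = np.abs(X) ** 2
    hop_s = 256.0 / fs
    shift = max(1, int(round(delay_s / hop_s)))  # frames until the 'late' part starts
    decay = 10.0 ** (-6.0 * hop_s / rt60)        # per-frame power decay: 60 dB per rt60
    # Late-reverb power estimate: delayed, exponentially decayed signal power.
    late = np.zeros_like(power)
    late[:, shift:] = power[:, :-shift] * decay ** shift
    # Fixed-threshold stand-in for the masking model: late reverb counts as
    # audible when its power is within threshold_db of the observed power.
    ratio = late / (power + 1e-12)
    audible = ratio > 10.0 ** (threshold_db / 10.0)
    # Wiener-like gain, applied only in the audible regions.
    gain = np.where(audible, np.clip(1.0 - ratio, 0.1, 1.0), 1.0)
    _, y = istft(X * gain, fs, nperseg=512)
    return y
```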
APA, Harvard, Vancouver, ISO, and other styles
36

Medeiros, Henrique Rodrigues Barbosa de. "Automatic detection of disfluencies in a corpus of university lectures." Master's thesis, 2014. http://hdl.handle.net/10071/8683.

Full text
Abstract:
This dissertation focuses on the identification of disfluent sequences and of their distinct structural regions. The reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese containing about 32 hours of speech and about 7.7% disfluencies. The set of features automatically extracted from the force-aligned corpus proved discriminative of the regions involved in the production of a disfluency. The best results concern the detection of the interregnum, followed by the detection of the interruption point. Several machine learning methods have been applied, but experiments show that Classification and Regression Trees usually outperform the other methods. The set of most informative features for cross-region identification encompasses word duration ratios, word confidence scores, silence ratios, and pitch and energy slopes. Features such as the number of phones and syllables per word proved more useful for identifying the interregnum, whereas energy slopes were best suited to identifying the interruption point. We have also conducted initial experiments on automatically detecting filled pauses, the most frequent disfluency type. For now, only force-aligned transcripts were used, since the ASR system is not well adapted to this domain. This study is a step towards the automatic detection of filled pauses for European Portuguese using prosodic features. Future work will extend this study to fully automatic transcripts and tackle other domains, also exploring extended sets of linguistic features.
This thesis addresses the identification of disfluent sequences and their respective structural regions. The experiments described here are based on segmentation and prosodic information, computed from a corpus of university lectures in European Portuguese containing about 32 hours of speech and about 7.7% disfluencies. The feature set used proved discriminative in identifying the regions contained in the production of disfluencies. The best results concern the detection of the interregnum, followed by the detection of the interruption point. Several machine learning methods were tested, with Classification and Regression Trees generally obtaining the best results. The most informative features for identifying and distinguishing disfluent regions comprise word duration ratios, the confidence level of the current word, ratios involving silences, and pitch and energy slopes. Features such as the number of phones and syllables per word proved more useful for identifying the interregnum, whereas pitch and energy were the most suitable for identifying the interruption point. Experiments were also carried out on the detection of filled pauses. For these experiments, only material from forced alignment has been used so far, since the automatic recognition system is not well adapted to this domain. This study represents a further step towards the automatic detection of filled pauses for European Portuguese using prosodic resources. Future work will extend this study to automatic transcripts and also address other domains, exploring more extensive sets of linguistic features.
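Since the abstract reports that Classification and Regression Trees worked best on prosodic features, the following sketch shows what such an experiment can look like using scikit-learn's CART implementation; the feature columns mirror the families named above (duration ratios, word confidence, silence ratios, pitch and energy slopes), but the data here are random placeholders rather than real prosodic measurements:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Placeholder per-word feature matrix; columns stand for
# [duration_ratio, word_confidence, silence_ratio, f0_slope, energy_slope].
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)   # 1 = word belongs to a disfluent region

# CART (scikit-learn's DecisionTreeClassifier); on real data disfluencies
# are rare (about 7.7% of the corpus), hence the balanced class weighting.
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
print("mean F1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```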
APA, Harvard, Vancouver, ISO, and other styles