Theses on the topic "Speech"


Consult the top 50 theses for your research on the topic "Speech".

Next to each source in the list of references there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication in PDF format and read its abstract online whenever it is available in the metadata.

Explore theses on a wide variety of disciplines and organise your bibliography correctly.

1

Sun, Felix (Felix W.). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition". Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Abstract
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-63).
The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
by Felix Sun.
M. Eng.
2

Alcaraz Meseguer, Noelia. "Speech Analysis for Automatic Speech Recognition". Thesis, Norwegian University of Science and Technology, Department of Electronics and Telecommunications, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9092.

Abstract

The classical front end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set of them is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel-frequency scale), and then decorrelated using a modified discrete cosine transform. Following a focused introduction on speech production, perception and analysis, this thesis presents a study of the implementation of a speech generative model, whereby speech is synthesized and recovered from its MFCC representation. The work was developed in two steps: first, the computation of the MFCC vectors from the source speech files using the HTK software; and second, the implementation of the generative model itself, which represents the conversion chain from HTK-generated MFCC vectors back to speech. To assess the fidelity of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors was computed, using spectral models based on Linear Predictive Coding (LPC) analysis. During the implementation of the generative model, results were obtained concerning the reconstruction of the spectral representation and the quality of the synthesized speech.
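
The MFCC front end described in this abstract can be sketched directly in NumPy/SciPy. Below is a minimal illustrative implementation (not the HTK one used in the thesis); the frame, FFT, filterbank and coefficient counts are typical assumed defaults:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC front end: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    # Overlapping Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Standard power spectrum estimate per frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced linearly on the mel-warped frequency axis
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, decorrelated with a DCT
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

coeffs = mfcc(np.random.randn(16000))  # 1 s of noise at 16 kHz -> (n_frames, 13)
```

Reconstructing speech from these coefficients, as the thesis does, requires inverting each of these lossy stages, which is why the spectral-distance evaluation described above is needed.
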

3

Kleinschmidt, Tristan Friedrich. "Robust speech recognition using speech enhancement". Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/31895/1/Tristan_Kleinschmidt_Thesis.pdf.

Abstract
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In "noise-free" environments, word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess enhancement performance for intelligibility, not speech recognition, therefore making them sub-optimal for ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain. Likelihood-maximisation uses gradient-descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy over a wide range of SNRs. Throughout the dissertation, consideration is given to practical implementation in vehicular environments, which resulted in two novel contributions: a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus, which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
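
For context, the subtractive-type enhancement this thesis couples to ASR builds on textbook single-channel spectral subtraction. A minimal sketch of that baseline (not the LIMA or phase-estimation methods proposed in the thesis; the leading-silence noise estimate and spectral floor are illustrative choices):

```python
import numpy as np

def spectral_subtraction(noisy, n_fft=512, hop=128, noise_frames=10, floor=0.02):
    """Baseline magnitude spectral subtraction with a leading-silence noise estimate."""
    win = np.hanning(n_fft)
    n_hops = 1 + (len(noisy) - n_fft) // hop
    frames = np.stack([noisy[i*hop:i*hop+n_fft] * win for i in range(n_hops)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude from the first few (assumed speech-free) frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract and clamp to a spectral floor to limit musical noise
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    # Resynthesise with the noisy phase, then overlap-add
    frames_out = np.fft.irfft(clean_mag * np.exp(1j * phase), n_fft, axis=1)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(n_hops):
        out[i*hop:i*hop+n_fft] += frames_out[i] * win
        norm[i*hop:i*hop+n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

Note the baseline keeps the noisy phase untouched; the thesis's second contribution is precisely to bring phase information back into this step.
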
4

Blank, Sarah Catrin. "Speech comprehension, speech production and recovery of propositional speech following aphasic stroke". Thesis, Imperial College London, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.407772.

5

Price, Moneca C. "Interactions between speech coders and disordered speech". Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp01/MQ28640.pdf.

6

Chong, Fong Loong. "Objective speech quality measurement for Chinese speech". Thesis, University of Canterbury. Computer Science and Software Engineering, 2005. http://hdl.handle.net/10092/9607.

Abstract
In the search for the optimisation of transmission speed and storage, speech information is often coded, or transmitted with a reduced bandwidth. As a result, quality and/or intelligibility are sometimes degraded. Speech quality is normally defined as the degree of goodness in the perception of speech, while speech intelligibility is how well or clearly one can understand what is being said. In order to assess the level of acceptability of degraded speech, various subjective methods have been developed to test codecs or sound processing systems. Although good results have been demonstrated with these, they are time-consuming and expensive due to the necessary involvement of teams of professional or naive subjects [56]. To reduce cost, computerised objective systems were created with the hope of replacing human subjects [90][43]. While reasonable standards have been reported by several of these systems, they have not yet reached the accuracy of well-constructed subjective tests [92][84]. Therefore, their evaluation and improvement are constantly being researched for further breakthroughs. To date, objective speech quality measurement systems (OSQMs) have been developed mostly in Europe or the United States, and their effectiveness has only been tested for English and several European and Asian languages, but not Chinese (Mandarin) [38][70][32].
7

Stedmon, Alexander Winstan. "Putting speech in, taking speech out: human factors in the use of speech interfaces". Thesis, University of Nottingham, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.420342.

8

Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition". INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.

9

Tang, Lihong. "Nonsensical speech: speech acts in postsocialist Chinese culture". Thesis, Connect to this title online; UW restricted, 2008. http://hdl.handle.net/1773/6662.

10

Itakura, Fumitada, Tetsuya Shinde, Kiyoshi Tatara, Taisuke Ito, Ikuya Yokoo, Shigeki Matsubara, Kazuya Takeda and Nobuo Kawaguchi. "CIAIR speech corpus for real world speech recognition". The oriental chapter of COCOSDA (The International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques), 2002. http://hdl.handle.net/2237/15462.

11

Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation". The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

12

Limbu, Sireesh Haang. "Direct Speech to Speech Translation Using Machine Learning". Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-439141.

Abstract
Nowadays, most speech-to-speech translation applications and services use a three-step process. The first step is speech-to-text conversion using speech recognition. This is followed by text-to-text language translation, and finally the text is synthesized into speech. As the availability of data and computing power improved, each of these individual steps advanced over time. Although the progress was significant, there was always error associated with the first step in terms of factors such as tone recognition, accent, etc. The error propagated further, and quite often deteriorated, as it went down the translation steps. This gave rise to ongoing research in direct speech-to-speech translation that does not rely on intermediate text. This project is inspired by Google's 'Translatotron: An End-to-End Speech-to-Speech translation model'. In line with the 'Translatotron' model, this thesis makes use of a simpler Sequence-to-Sequence (STS) encoder-decoder LSTM network using spectrograms as input to examine the possibility of direct language translation in audio form. Although the final results have inconsistencies and are not as efficient as the traditional speech-to-speech translation techniques which heavily rely on text translations, they serve as a promising platform for further research.
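
The spectrogram-in, spectrogram-out encoder-decoder the abstract describes can be pictured with a toy PyTorch model. This is a hedged sketch of the general architecture, not the thesis configuration; layer sizes, the 80-mel features and the teacher-forcing training step are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Speech2SpeechSeq2Seq(nn.Module):
    """Toy sequence-to-sequence model: source spectrogram frames in, target frames out."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)  # hidden state -> output frame

    def forward(self, src, tgt_in):
        # Encode the source spectrogram; its final state conditions the decoder
        _, state = self.encoder(src)
        out, _ = self.decoder(tgt_in, state)
        return self.proj(out)

model = Speech2SpeechSeq2Seq()
src = torch.randn(4, 120, 80)   # batch of source-language spectrograms (B, T, n_mels)
tgt = torch.randn(4, 100, 80)   # corresponding target-language spectrograms
pred = model(src, tgt[:, :-1])  # teacher forcing: decoder sees the target shifted by one frame
loss = nn.functional.mse_loss(pred, tgt[:, 1:])
loss.backward()
```

A vocoder is still needed to turn the predicted spectrogram frames back into a waveform, which is one reason results in this setting lag the cascaded text-based pipeline.
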
13

Hu, Ke. "Speech Segregation in Background Noise and Competing Speech". The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1339018952.

14

Al-Otaibi, Abdulhadi S. "Arabic speech processing: syllabic segmentation and speech recognition". Thesis, Aston University, 1988. http://publications.aston.ac.uk/8064/.

Abstract
A detailed description of the Arabic phonetic system is given. The syllabic behaviour of the Arabic language is highlighted. Basic statistical properties of the Arabic language (phoneme and syllable frequency of repetition) are included. A thorough review of the speech processing techniques used in speech analysis, synthesis and recognition applications is presented. The development of a PC-based speech processing system is described. The system has proven to be a useful tool in Arabic speech analysis and recognition applications. A sample spectrographic study of two pairs of similar Arabic sounds was performed. It is shown that no clear acoustical property exists for distinguishing between the phonemes /O/ and /f/ except the gradual rise of F1 during formant movements (transitions). The development of an automatic Arabic syllabic segmentation algorithm is described. The performance of the algorithm was tested with monosyllabic and multisyllabic words, and an overall accuracy of 92% was achieved. The main parameters affecting the accuracy of the segmentation algorithm are discussed. The syllabic units generated by the Arabic syllabic segmentation algorithm are utilised in the implementation of three major speech applications, namely an automatic Arabic vowel recognition system, an isolated word recognition system and an acoustic-phonetic model for Arabic. Each application is fully described and its performance results are indicated.
15

Smith, Peter Wilfred Hesling. "Speech act theory, discourse structure and indirect speech". Thesis, University of Leeds, 1991. http://etheses.whiterose.ac.uk/734/.

Abstract
Speech Act Theory is concerned with the ways in which language can be used. It originated with Austin, but was developed by Searle. The theories of Austin and Searle are described and several problem areas are identified. If it is to be a viable theory of language usage, speech act theory must be able to integrate with a theory of discourse structure, because if speech acts are identifiable as units of language, then it must be possible to include them in a model of discourse. The second chapter examines discourse structure, comparing two rival theories: the discourse analysis approach and the conversational analysis approach. Discourse analysis is broadly sympathetic to speech act theory, whereas conversational analysis is not. The claims of conversational analysis are examined and are found to be wanting in several respects. Speech Act Theory is then discussed with a particular emphasis on the problem of relating speech acts to each other within a larger unit of discourse. It is noted that Austin, by including the expositive class of speech acts, allows for the possibility of relations between speech acts, whereas Searle's description of speech acts effectively rules out any relations between speech acts. The third chapter develops speech acts in terms of a schematic model consisting of cognitive states, a presumed effect of the speech act and an action. The cognitive states are represented using modal and deontic operators on the proposition within epistemic logic. This idea of describing a speech act in terms of cognitive states is developed in Chapter Four, where speech acts are related using a communicated cognitive state to pair two speech acts together into a primary and secondary speech act. It is noted that the idea of a primary and secondary speech act is present within the discourse analysis model of discourse (in the form of the initiation-response cycle of exchanges) and also in the conversational analysis approach to discourse (in the form of the adjacency pair). The conclusion from this is that the two approaches are perhaps not so incompatible as might first appear. Chapter Five deals with grammatical sentence types and their possible use in communicating cognitive states. It also examines modal auxiliary verbs and their possible relationship to the modal and deontic operators used in the cognitive state model. In Chapter Six, theories of indirect speech acts are described. An explanation of indirect speech acts is developed using pragmatic maxims and cognitive states to explain why certain indirect forms are chosen. This leads to a theory of linguistic politeness and a use model of speech acts.
16

Tran, Viet Anh. "Silent communication: whispered speech-to-clear speech conversion". Grenoble INPG, 2010. http://www.theses.fr/2010INPG0006.

Abstract
Silent or murmured speech is defined as the articulated production of sounds, with very little vibration of the vocal cords in the case of whispering and no vibration in the case of murmuring, produced by the movements and interactions of the speech organs such as the tongue, the velum, the lips, etc., with the aim of avoiding being heard by others. Silent or murmured speech is generally used for private and confidential communication, or may be used by people with a laryngeal handicap who cannot speak normally. However, it is difficult to use silent (murmured) speech directly for face-to-face or mobile-phone communication, because the linguistic content and the paralinguistic information of the spoken message are strongly degraded when the speaker murmurs or whispers. A recent line of research is therefore the conversion of silent (or murmured) speech into clear voice, in order to obtain a more intelligible and more natural voice. With such a conversion, potential applications such as "silent telephony" or robust assistive systems for laryngeal handicaps would become feasible. The work in this thesis therefore focuses on this line of research.
In recent years, advances in wireless communication technology have led to the widespread use of cellular phones. Because of noisy environmental conditions and competing surrounding conversations, users tend to speak loudly. As a consequence, private policies and public legislation tend to restrain the use of cellular phones in public places. Silent speech, which can only be heard by a limited set of listeners close to the speaker, is an attractive solution to this problem if it can effectively be used for quiet and private communication. The motivation of this research was to investigate ways of improving the naturalness and the intelligibility of synthetic speech obtained from the conversion of silent or whispered speech. A Non-Audible Murmur (NAM) condenser microphone, together with signal-based Gaussian Mixture Model (GMM) mapping, was chosen because promising results had already been obtained with this sensor and this approach, and because the size of the NAM sensor is well adapted to mobile communication technology. Several improvements to the speech conversion obtained with this sensor were considered. A first set of improvements concerns characteristics of the voiced source. One of the features missing in whispered or silent speech with respect to loud or modal speech is F0, which is crucial in conveying linguistic (question vs. statement, syntactic grouping, etc.) as well as paralinguistic (attitudes, emotions) information. The proposed estimation of voicing and F0 for converted speech by separate predictors improves both predictions. The naturalness of the converted speech was then further improved by extending the context window of the input features from phoneme size to syllable size and by using Linear Discriminant Analysis (LDA) instead of Principal Component Analysis (PCA) for the dimension reduction of the input feature vector. The positive influence of this new approach on the quality of the output converted speech was confirmed by perceptual tests. Another approach investigated in this thesis consisted in integrating visual information as a complement to the acoustic information in both the input and output data. Lip movements, which significantly contribute to the intelligibility of visual speech in face-to-face human interaction, were explored using an accurate lip motion capture system based on the 3D positions of coloured beads glued on the speaker's face. The visual parameters are represented by five components related to the rotation of the jaw, lip rounding, upper and lower lip vertical movements, and movements of the throat associated with the underlying movements of the larynx and hyoid bone. Including these visual features in the input data significantly improved the quality of the output converted speech, in terms of F0 and spectral features. In addition, the audio output was replaced by an audio-visual output. Subjective perceptual tests confirmed that including the visual modality in the input data, the output data, or both improves the intelligibility of the whispered speech conversion. Finally, a technique using a phonetic pivot was investigated, combining Hidden Markov Model (HMM)-based speech recognition with HMM-based speech synthesis to convert whispered speech to audible speech, in order to compare the performance of the two state-of-the-art approaches. Audiovisual features were used in the input data and audiovisual speech was produced as output.
The objective performance of the HMM-based system was inferior to that of the direct signal-to-signal system based on a GMM. A few interpretations of this result are proposed, together with future lines of research.
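
The signal-based GMM mapping named in this abstract is usually a joint-density model with conversion by conditional expectation. A schematic sketch of that standard technique (not the thesis's exact system); the feature dimension, component count and random stand-in data are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4):
    """Fit a joint GMM on stacked, time-aligned (source, target) feature frames."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(np.hstack([X, Y]))

def convert(gmm, X, d):
    """Map source frames to target frames via the conditional expectation E[y|x]."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]
    # Responsibilities p(m|x) under the marginal model p(x)
    log_r = np.stack([np.log(gmm.weights_[m])
                      + multivariate_normal.logpdf(X, mu_x[m], S_xx[m])
                      for m in range(gmm.n_components)], axis=1)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # Mixture of per-component linear regressions, weighted by responsibility
    Y_hat = np.zeros((len(X), mu_y.shape[1]))
    for m in range(gmm.n_components):
        A = S_yx[m] @ np.linalg.inv(S_xx[m])
        Y_hat += r[:, [m]] * (mu_y[m] + (X - mu_x[m]) @ A.T)
    return Y_hat

# Stand-in data: 500 aligned frames of 20-dimensional spectral features
d = 20
X, Y = np.random.randn(500, d), np.random.randn(500, d)
gmm = fit_joint_gmm(X, Y)
converted = convert(gmm, X[:10], d)
```

The soft, responsibility-weighted regression is what distinguishes this direct signal-to-signal approach from the HMM-based phonetic-pivot system the thesis compares it against.
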
17

Chuchilina, L. M. and I. E. Yeskov. "Speech recognition". Thesis, Видавництво СумДУ, 2008. http://essuir.sumdu.edu.ua/handle/123456789/15995.

18

Windchy, Eli. "Keynote Speech". Digital Commons @ East Tennessee State University, 2018. https://dc.etsu.edu/dcseug/2018/schedule/9.

19

Chua, W. W. "Speech recognition predictability of a Cantonese speech intelligibility index". Click to view the E-thesis via HKUTO, 2004. http://sunzi.lib.hku.hk/hkuto/record/B30509737.

20

Overton, Katherine. "Perceptual Differences in Natural Speech and Personalized Synthetic Speech". Scholar Commons, 2017. http://scholarcommons.usf.edu/etd/6921.

Abstract
The purpose of this study was to determine what perceptual differences existed between a natural recorded human voice and a synthetic voice that was created to sound like the same voice. This process was meant to mimic the differences between a voice that would be used for Message Banking and a voice that would be created by the ModelTalker system. Forty speech pathology graduate students (mean age = 23 years) rated voices on clarity, naturalness, pleasantness, and overall similarity. Analysis of data showed that the natural human voice was consistently rated as more natural, clear, and pleasant. In addition, participants generally rated the two voices as very different. This demonstrates that, at least in terms of perception, using the method of Message Banking results in a voice that is overall perceived more positively than the voice created using ModelTalker.
21

Mailend, Marja-Liisa. "Speech Motor Planning in Apraxia of Speech and Aphasia". Diss., The University of Arizona, 2017. http://hdl.handle.net/10150/625882.

Abstract
Apraxia of speech (AOS) is a motor speech disorder that poses significant obstacles to a person's ability to communicate and take part in everyday life. Agreement exists between current theories of AOS that the impairment affects the speech motor planning stage, where linguistic representations are transformed into speech movements, but they disagree on the specific nature of the breakdown at this processing level. A more detailed understanding of this impairment is essential for developing targeted, effective treatment approaches and for identifying the appropriate candidates for these treatments. The study of AOS is complicated by the fact that this disorder rarely occurs in isolation but is commonly accompanied by various degrees of aphasia (a language impairment) and/or dysarthria (a neuromuscular impairment of speech motor control). In addition, the behavioral similarities of AOS and its closest clinical neighbor, aphasia with phonemic paraphasias, undermine the usefulness of traditional methods, such as perceptual error analysis, in the study of both disorders. The purpose of this dissertation was to test three competing hypotheses about the specific nature of the speech motor planning impairment in AOS in a systematic sequence of three reaction time experiments. This research was formulated in the context of a well-established theoretical framework of speech production and it combines psycholinguistic reaction time paradigms with a cognitive neuropsychological approach. The results of the three experiments provide evidence that one component of the speech motor planning impairment in AOS involves difficulty with selecting the intended motor program for articulation. Furthermore, this difficulty appears to be intensified by simultaneously activated alternative speech motor programs that compete with the target program for selection. These findings may prove useful as a theoretically-motivated basis for improving diagnostic tools and treatment protocols for people with AOS and aphasia, thus enhancing clinical decision-making. Such translational and clinical research aimed at developing sensitive and specific diagnostic tools and improving treatment approaches is the ultimate long-term objective of this research program.
22

Mak, Cheuk-yan Charin. "Effects of speech and noise on Cantonese speech intelligibility". Click to view the E-thesis via HKUTO, 2006. http://sunzi.lib.hku.hk/hkuto/record/B37989790.

23

Evans, N. W. D. "Spectral subtraction for speech enhancement and automatic speech recognition". Thesis, Swansea University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.636935.

Abstract
The contributions made in this thesis relate to an extensive investigation of spectral subtraction in the context of speech enhancement and noise robust automatic speech recognition (ASR), and to the morphological processing of speech spectrograms. Three sources of error in a spectral subtraction approach are identified and assessed with ASR. The effects of phase, cross-term component and spectral magnitude errors are assessed in a common spectral subtraction framework. ASR results confirm that, except for extreme noise conditions, phase and cross-term component errors are relatively negligible compared to noise estimate errors. A taxonomy classifying approaches to spectral subtraction into power and magnitude, linear and non-linear spectral subtraction is proposed. Each class is assessed and compared under otherwise identical experimental conditions; these experiments are thought to be the first to assess the four combinations under such controlled conditions. ASR results illustrate a lesser sensitivity to noise over-estimation for non-linear approaches. With a view to practical systems, different approaches to noise estimation are investigated. In particular, approaches that do not require explicit voice activity detection are assessed and shown to compare favourably to the conventional approach, the latter requiring explicit voice activity detection. Following on from this finding, a new computationally efficient approach to noise estimation that does not require explicit voice activity detection is proposed. Investigations into the fundamentals of spectral subtraction highlight the limitation of noise estimates: statistical estimates obtained from a number of analysis frames lead to relatively poor representations of the instantaneous values. To ameliorate this situation, estimates from neighbouring, lateral frequencies are used to complement within-bin (from the same frequency) statistical approaches. Improvements are found to be negligible. However, the principle of these lateral estimates leads naturally to the final stage of the work presented in this thesis, that of morphologically filtering speech spectrograms. This form of processing is examined for both synthesised signals and speech, and promising ASR performance is reported. In 2000 the Aurora 2 database was introduced by the organisers of a special session at Eurospeech 2001 entitled 'Noise Robust Recognition', aimed at providing a standard database and experimental protocols for the assessment of noise robust ASR. This facility, when it became available, was used for the work described in this thesis.
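
The four classes in that taxonomy can be captured in a single parametric rule. A minimal sketch under illustrative parameter choices (not the thesis's exact formulation): the exponent selects magnitude (b=1) or power (b=2) domain subtraction, while an over-subtraction factor and spectral floor give the non-linear variants.

```python
import numpy as np

def generalised_subtraction(mag_noisy, mag_noise, b=2.0, alpha=1.0, beta=0.01):
    """Generalised spectral subtraction on one magnitude spectrum.

    b=1: magnitude-domain subtraction; b=2: power-domain subtraction.
    alpha=1 with beta=0 is plain linear subtraction; alpha > 1 together
    with a beta spectral floor yields the non-linear over-subtraction variants.
    """
    diff = mag_noisy ** b - alpha * mag_noise ** b
    floor = beta * mag_noise ** b
    return np.maximum(diff, floor) ** (1.0 / b)
```

The non-linear variants (alpha > 1 with a floor) are the ones the ASR results above found less sensitive to noise over-estimation.
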
24

Chua, W. W. and 蔡蕙慧. "Speech recognition predictability of a Cantonese speech intelligibility index". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30509737.

25

Mak, Cheuk-yan Charin and 麥芍欣. "Effects of speech and noise on Cantonese speech intelligibility". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B37989790.

26

Le, Cornu Thomas. "Reconstruction of intelligible audio speech from visual speech information". Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/67012/.

Abstract
The aim of the work conducted in this thesis is to reconstruct audio speech signals using information which can be extracted solely from a visual stream of a speaker's face, with application for surveillance scenarios and silent speech interfaces. Visual speech is limited to that which can be seen of the mouth, lips, teeth, and tongue, where the visual articulators convey considerably less information than in the audio domain, leading to the task being difficult. Accordingly, the emphasis is on the reconstruction of intelligible speech, with less regard given to quality. A speech production model is used to reconstruct audio speech, where methods are presented in this work for generating or estimating the necessary parameters for the model. Three approaches are explored for producing spectral-envelope estimates from visual features as this parameter provides the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping, and then two further approaches are explored using vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature and model-level. Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches. Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations, and then subjective listening tests are conducted to determine word-level accuracy, giving real intelligibility scores, of reconstructed speech. The best performing visual-to-audio domain mapping approach, using a clustering-and-classification framework with feature-level temporal encoding, is able to achieve audio-only intelligibility scores of 77 %, and audiovisual intelligibility scores of 84 %, on the GRID dataset. Furthermore, the methods are applied to a larger and more continuous dataset, with less favourable results, but with the belief that extensions to the work presented will yield a further increase in intelligibility.
27

Jett, Brandi. "The role of coarticulation in speech-on-speech recognition". Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1554498179209764.

28

Bi, Ning. "Speech conversion and its application to alaryngeal speech enhancement". Diss., The University of Arizona, 1995. http://hdl.handle.net/10150/187290.

Abstract
In this investigation, a vector quantization (VQ)-based speech conversion algorithm and a linear multivariate regression (LMR)-based speech conversion algorithm were modified, and the modified algorithms were applied to the enhancement of alaryngeal speech. The modifications were aimed at reducing the spectral distortion (bandwidth increase) in the VQ-based system and the spectral discontinuity in the LMR-based system. The spectral distortion in the VQ-based algorithm was compensated by formant enhancement using the chirp z-transform and cepstral weighting. The spectral discontinuity in the LMR-based system was minimized by the use of overlapped subsets during the construction of the conversion mapping function. These modified algorithms were evaluated using simulated data and speech samples. Results of the evaluations indicated that the modified algorithms reduced conversion distortions. The modified algorithms were also used for the enhancement of alaryngeal speech. Results of perceptual evaluation indicated that listeners generally preferred to listen to the enhanced speech samples.
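
The classic VQ-based conversion this work modifies can be sketched as a learned codebook pairing. A hedged sketch of the general technique, assuming time-aligned source/target training features (not the modified algorithm evaluated in the thesis):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_vq_mapping(X_src, Y_tgt, n_codes=64):
    """Learn a codebook mapping: each source cluster is paired with the
    mean of the target frames aligned to it."""
    vq = KMeans(n_clusters=n_codes, n_init=10).fit(X_src)
    tgt_codebook = np.stack([Y_tgt[vq.labels_ == k].mean(axis=0)
                             for k in range(n_codes)])
    return vq, tgt_codebook

def convert(vq, tgt_codebook, X):
    """Quantise each input frame and emit its paired target codevector."""
    return tgt_codebook[vq.predict(X)]
```

Hard quantisation of this kind is one source of the conversion distortions that the modified algorithms in this thesis aim to reduce.
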
29

Gordon, Jane S. "Use of synthetic speech in tests of speech discrimination". PDXScholar, 1985. https://pdxscholar.library.pdx.edu/open_access_etds/3443.

Abstract
The purpose of this study was to develop two tape-recorded synthetic speech discrimination test tapes and assess their intelligibility, in order to determine whether synthetic speech was intelligible and whether it would prove useful in speech discrimination testing. Four scramblings of the second NU-6 monosyllable word list were generated by the ECHO l C speech synthesizer using two methods of generating synthetic speech, called TEXTALKER and SPEAKEASY. These stimuli were presented in one ear to forty normal-hearing adult subjects, 36 females and 4 males, at 60 dB HL under headphones. Each subject listened to two different scramblings of the 50-word monosyllable list, one scrambling generated by TEXTALKER and the other generated by SPEAKEASY. The order in which the TEXTALKER and SPEAKEASY modes of presentation occurred, as well as which ear to test per subject, was randomly determined.
30

Mukherjee, Sankar. "Sensorimotor processes in speech listening and speech-based interaction". Doctoral thesis, Università degli studi di Genova, 2019. http://hdl.handle.net/11567/941827.

Abstract
The thesis deals with two extreme ends of speech perception in cognitive neuroscience. On one end, it deals with the brain responses of a single isolated person to an acoustic stimulus and missing articulatory cues; on the other end, it explores the neural mechanisms that emerge while speech is embedded in a true conversational interaction. Studying these two extremities requires relatively different methodological approaches. In fact, the first approach has seen the consolidation of a wide variety of experimental designs and analytical methods, whereas the investigation of speech brain processes during a conversation is still in its infancy, and several technical and methodological challenges still need to be solved. In the present thesis, I first present an EEG study using a classical attentive speech listening task, analyzed using recent methodological advances that explicitly look at the neural entrainment to the oscillatory properties of speech. Then, I report on the work I did to design a robust speech-based interactive task, and to extract acoustic and articulatory indexes of interaction as well as the neural EEG correlates of its word-level dynamics. All in all, this work suggests that motor processes play a critical role both in attentive speech listening and in guiding mutual speech accommodation. In fact, the motor system on the one hand reconstructs information that is missing in the sensory domain, and on the other hand drives our implicit tendency to adapt our speech production to the conversational partner and the interactive dynamics.
31

Kong, Jessica Lynn. "The Effect Of Mean Fundamental Frequency Normalization Of Masker Speech For A Speech-In-Speech Recognition Task". Case Western Reserve University School of Graduate Studies / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=case1588949121900459.

32

Schramm, Hauke. "Modeling spontaneous speech variability for large vocabulary continuous speech recognition". [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=97968479X.

33

Lidstone, Jane Stephanie May. "Private speech and inner speech in typical and atypical development". Thesis, Durham University, 2010. http://etheses.dur.ac.uk/526/.

Abstract
Children often talk themselves through their activities: They produce private speech to regulate their thought and behaviour, which is internalised to form inner speech, or silent verbal thought. Private speech and inner speech can together be referred to as self-directed speech (SDS). SDS is thought to be an important aspect of human cognition. The first chapter of the present thesis explores the theoretical background of research on SDS, and brings the reader up-to-date with current debates in this research area. Chapter 2 consists of empirical work that used the observation of private speech in combination with the dual task paradigm to assess the extent to which the executive function of planning is reliant on SDS in typically developing 7- to 11-year-olds. Chapters 3 and 4 describe studies investigating the SDS of two groups of atypically developing children who show risk factors for SDS impairment—those with autism and those with specific language impairment. The research reported in Chapter 5 tests an important tenet of neoVygotskian theory—that the development of SDS development is domain-general—by looking at cross-task correlations between measures of private speech production in typically developing children. Other psychometric properties of private speech production (longitudinal stability and cross-context consistency) were also investigated. Chapter 6, the General Discussion, first summarises the main body of the thesis, and then goes on to discuss next steps for this research area, in terms of the methods used to study SDS, the issue of domain-general development, and the investigation of SDS in developmental disorders.
34

Howard, John Graham. "Temporal aspects of auditory-visual speech and non-speech perception". Thesis, University of Reading, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.553127.

Abstract
This thesis concentrates on the temporal aspects of the auditory-visual integratory perceptual experience described above. It is organized in two parts, a literature review, followed by an experimentation section. After a brief introduction (Chapter One), Chapter Two begins by considering the evolution of the earliest biological structures to exploit information in the acoustic and optic environments. The second part of the chapter proposes that the auditory-visual integratory experience might be a by-product of the earliest emergence of spoken language. Chapter Three focuses on human auditory and visual neural structures. It traces the auditory and visual systems of the modem human brain through the complex neuroanatomical forms that construct their pathways, through to where they finally integrate into the high-level multi-sensory association areas. Chapter Four identifies two distinct investigative schools that have each reported on the auditory-visual integratory experience. We consider their different experimental methodologies and a number of architectural and information processing models that have sought to emulate human sensory, cognitive and perceptual processing, and ask how far they can accommodate a bi-sensory integratory processing. Chapter Five draws upon empirical data to support the importance of the temporal dimension of sensory forms in information processing, especially bimodal processing. It considers the implications of different modalities processing differently discontinuous afferent information within different time-frames. It concludes with a discussion of a number of models of biological clocks that have been proposed as essential temporal regulators of human sensory experience. In Part Two, the experiments are presented. Chapter Six provides the general methodology, and in the following Chapters a series of four experiments is reported upon. The experiments follow a logical sequence, each being built upon information either revealed or confirmed in results previously reported. Experiments One, Three, and Four required a radical reinterpretation of the 'fast-detection' paradigm developed for use in signal detection theory. This enables the work of two discrete investigative schools in auditory-visual processing to be brought together. The use of this modified paradigm within an appropriately designed methodology produces experimental results that speak directly to both the 'speech versus non-speech' debate and also to gender studies.
35

Simm, William Alexander. "Dysarthric speech measures for use in evidence-based speech therapy". Thesis, Lancaster University, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.531724.

36

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids". Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.285064.

37

Alghamdi, Najwa. "Visual speech enhancement and its application in speech perception training". Thesis, University of Sheffield, 2017. http://etheses.whiterose.ac.uk/19667/.

Abstract
This thesis investigates methods for visual speech enhancement to support auditory and audiovisual speech perception. Normal-hearing non-native listeners receiving cochlear implant (CI) simulated speech are used as ‘proxy’ listeners for CI users, a proposed user group who could benefit from such enhancement methods in speech perception training. Both CI users and non-native listeners share similarities with regards to audiovisual speech perception, including increased sensitivity to visual speech cues. Two enhancement methods are proposed: (i) an appearance based method, which modifies the appearance of a talker’s lips using colour and luminance blending to apply a ‘lipstick effect’ to increase the saliency of mouth shapes; and (ii) a kinematics based method, which amplifies the kinematics of the talker’s mouth to create the effect of more pronounced speech (an ‘exaggeration effect’). The application that is used to test the enhancements is speech perception training, or audiovisual training, which can be used to improve listening skills. An audiovisual training framework is presented which structures the evaluation of the effectiveness of these methods. It is used in two studies. The first study, which evaluates the effectiveness of the lipstick effect, found a significant improvement in audiovisual and auditory perception. The second study, which evaluates the effectiveness of the exaggeration effect, found improvement in the audiovisual perception of a number of phoneme classes; no evidence was found of improvements in the subsequent auditory perception, as audiovisual recalibration to visually exaggerated speech may have impeded learning when used in the audiovisual training. The thesis also investigates an example of kinematics based enhancement which is observed in Lombard speech, by studying the behaviour of visual Lombard phonemes in different contexts. Due to the lack of suitable datasets for this analysis, the thesis presents a novel audiovisual Lombard speech dataset recorded under high SNR, which offers two, fixed head-pose, synchronised views of each talker in the dataset.
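
The appearance-based enhancement can be pictured as a per-pixel blend inside a lip region. This is a toy sketch under assumed inputs (the thesis's actual colour/luminance processing and lip localisation are more involved than a flat tint):

```python
import numpy as np

def lipstick_effect(frame, lip_mask, tint=(180, 30, 60), alpha=0.5):
    """Blend a saturated tint into the lip region to raise its visual saliency.

    frame:    (H, W, 3) uint8 RGB video frame
    lip_mask: (H, W) boolean mask of lip pixels (assumed given,
              e.g. from a facial landmark tracker)
    """
    out = frame.astype(np.float32)
    blended = (1 - alpha) * out + alpha * np.array(tint, dtype=np.float32)
    out[lip_mask] = blended[lip_mask]
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applied frame by frame, a blend of this kind makes mouth shapes easier to follow, which is the saliency effect the training studies above evaluate.
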
38

Mwanyoha, Sadiki Pili 1974. "A speech recognition module for speech-to-text language translation". Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9862.

Abstract
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (leaves 47-48).
by Sadiki Pili Mwanyoha.
S.B. and M.Eng.
39

Moers-Prinz, Donata [Verfasser]. "Fast Speech in Unit Selection Speech Synthesis / Donata Moers-Prinz". Bielefeld : Universitätsbibliothek Bielefeld, 2020. http://d-nb.info/1219215201/34.

40

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids". Rennes 1, 1999. http://www.theses.fr/1999REN10033.

Abstract
This thesis deals with speech dereverberation in the specific contexts of hearing aids and automatic speech recognition. The methods considered must work in conditions where the acoustic channels involved are unknown and variable. We therefore propose to discriminate the reverberation from the direct signal using properties of the reverberation that are independent of the acoustic channel. The spatial correlation of the signals, their directions of arrival and their temporal supports lead to different methods, which are examined in turn. After a review of the state of the art of methods based on the spatial decorrelation of late reverberation, and of their limits, we suggest improvements to one of the most widely used algorithms. We then present a new spatially selective algorithm, which attenuates the contributions of the reverberation according to their direction. This algorithm is complementary to the previous one; both use two sensors. Finally, we propose an original method that effectively attenuates the overlap-masking effect of reverberation. The methods are evaluated using various objective measures (noise reduction factor, SNR gain, cepstral distance and automatic speech recognition scores). Trials combining the different methods demonstrate the potential benefit of such combinations.
41

Söderberg, Hampus. "Engaging Speech UI's - How to address a speech recognition interface". Thesis, Malmö högskola, Fakulteten för teknik och samhälle (TS), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20591.

Abstract
Speech recognition has existed for a long time in various shapes, often used for recognizing commands, performing speech-to-text transcription, or a mix of the two. This thesis investigates how the input affordances for such speech-based interactions should be designed to enable intuitive engagement in a multimodal user interface. At the time of writing, current efforts in user interface design typically revolve around the established desktop metaphor, where vision is the primary sense. Since speech recognition is based on the sense of hearing, previous work related to GUI design cannot be applied directly to a speech interface. Similar to how traditional GUIs have evolved to embrace the desktop metaphor and matured into supporting modern touch-based experiences, speech interaction needs to undergo a similar evolutionary process before designers can begin to understand its inherent characteristics and make informed assumptions about appropriate interaction mechanics. In order to investigate interface addressability and affordance accessibility, a prototype speech interface for a Windows 8 tablet PC was created. The prototype extended Windows 8's modern touch-optimized interface with speech interaction. The thesis's outcome is based on a user-centered evaluation of the aforementioned prototype. The outcome consists of additional knowledge about foundational interaction mechanics for addressing and engaging a speech interface. These mechanics are important to consider when developing full-featured speech recognition interfaces. This thesis aims to provide a first stepping stone towards understanding how speech interfaces should be designed. Additionally, the thesis has investigated related interaction aspects, such as required feedback and considerations when designing a multimodal user interface that includes touch and speech input methods. It has also been identified that a speech transcription or dictation interface needs more interaction mechanics than its inherent start and stop to become usable and useful.
42

Shuster, Linda Irene. "Speech perception and speech production : between and within modal adaptation /". The Ohio State University, 1986. http://rave.ohiolink.edu/etdc/view?acc_num=osu148726754698296.

43

Kim, Hyo-Jong. "Stephen's speech: missiological implications of Stephen's speech in Luke-Acts". Online full text .pdf document, available to Fuller patrons only, 1999. http://www.tren.com.

44

Vescovi, Federico <1993>. "Understanding Speech Acts: Towards the Automated Detection of Speech Acts". Master's Degree Thesis, Università Ca' Foscari Venezia, 2019. http://hdl.handle.net/10579/15644.

Abstract
This work is an attempt to analyse language in terms of the actions we perform through speaking. Our work revolves around speech act theory (Austin, 1962; Searle, 1969), which constitutes the theoretical background of our study. Speech act theory is a theory of language use that investigates the actions, or acts, that we perform by producing utterances in conversation; some examples of speech acts are: requesting, questioning, promising, threatening and apologising. Assuming that every utterance involves the performance of (at least) one speech act (Searle & Vanderveken, 1985), our goal is to determine into which (and how many) types of speech acts we can efficiently classify linguistic utterances, where each type or class of speech acts includes all the speech acts that have the same purpose in conversation (Searle, 1976). We analyse both the linguistic form of utterances and the context in which they are used. Our analysis leads to the two following key observations: 1) natural language elements can be used as indicators of speech act types; and 2) using such elements for sentence classification is as tempting as it is misleading, since there are many ways to perform a speech act without using a corresponding natural language indicator.
45

Eriksson, Mattias. "Speech recognition availability". Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2651.

Abstract

This project investigates the importance of availability in the scope of dictation programs. Speech recognition technology for dictation has not reached the general public, and that may very well be a result of poor availability in today's technical solutions.

I have constructed a persona character, Johanna, who personalizes the target user. I have also developed a solution that streams audio into a speech recognition server and sends back interpreted text. Johanna affirmed that the solution was successful in theory.

I then recruited test users who tried out the solution in practice. Half of them do indeed claim that their usage has increased, and will continue to increase, thanks to the new level of availability.

46

Øygarden, Jon. "Norwegian Speech Audiometry". Doctoral thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for språk- og kommunikasjonsstudier, 2009. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-5409.

Abstract
A new set of speech audiometry materials for Norwegian - called "HiST taleaudiometri" - has been developed by the author of this thesis ("HiST" being short for the Norwegian name of Sør-Trøndelag University College and "taleaudiometri" being Norwegian for speech audiometry). The speech audiometry set consists of five-word sentences, three-word utterances, monosyllabic words, monosyllabic words for testing children, and numerals. The process of developing the speech audiometry set is presented in this thesis. The five-word sentences are of the form name-verb-numeral-adjective-noun. Hagerman developed this sentence type for Swedish speech audiometry in the 1980s, but for Norwegian the sentences were developed using a new diphone-splitting method. For each word category ten alternatives exist, making it possible to generate a number of lists with the same phonemic content but with different sentences. A noise was developed from the speech material, intended for use together with the speech for speech recognition threshold in noise measurements. The material is very suitable for performing repeated measurements on the same person, which is often a requisite for hearing aid evaluation or psychoacoustical testing. The three-word utterances are of the form numeral-adjective-noun; the words are identical to the last three words used in the five-word sentences. The three-word utterances are intended for speech recognition threshold measurement, and the noise developed for the five-word sentences can be used together with them for speech recognition threshold in noise measurements. Monosyllabic word lists were developed mainly for the purpose of measuring the maximum speech recognition score or the performance-intensity function. The recorded lists earmarked for testing children were developed by Rikshospitalet University Hospital in Oslo. The numerals used in the "HiST taleaudiometri" set are the numerals that were recorded by Sverre Quist-Hanssen for his speech audiometry, organized in groups of three (digit triplets).
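
Because each of the five word slots has ten alternatives, lists with the same phonemic content but different sentences can be generated by simple sampling. A sketch with made-up English placeholder words (the actual HiST material is Norwegian, with ten recorded words per slot):

```python
import random

# One illustrative alternative list per slot: name-verb-numeral-adjective-noun.
# Placeholder English words; the real HiST lists contain ten Norwegian words per slot.
slots = [
    ["Peter", "Anna", "Thomas"],     # names
    ["bought", "sees", "owns"],      # verbs
    ["two", "seven", "nine"],        # numerals
    ["red", "small", "old"],         # adjectives
    ["boxes", "flowers", "rings"],   # nouns
]

def sentence():
    """Draw one word per slot to form a five-word test sentence."""
    return " ".join(random.choice(words) for words in slots)

print([sentence() for _ in range(3)])
```

Fixed slot structure with interchangeable words is what makes the material suitable for repeated measurements on the same listener: every generated list is phonemically matched, yet no sentence can be memorised.
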
47

Nilsson, Mattias. "Entropy and Speech". Doctoral thesis, Stockholm : Sound and Image Processing Laboratory, School of Electrical Engineering, Royal Institute of Technology, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3990.

48

Janardhanan, Deepa. "Wideband speech enhancement". Aachen: Shaker, 2008. http://d-nb.info/989298310/04.

49

Donovan, R. E. "Trainable speech synthesis". Thesis, University of Cambridge, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.598598.

Abstract
This thesis is concerned with the synthesis of speech using trainable systems. The research it describes was conducted with two principal aims: to build a hidden Markov model (HMM) based speech synthesis system which could synthesise very high quality speech; and to ensure that all the parameters used by the system were obtained through training. The motivation behind the first of these aims was to determine if the HMM techniques which have been applied so successfully in recent years to the problem of automatic speech recognition could achieve a similar level of success in the field of speech synthesis. The motivation behind the second aim was to construct a system that would be very flexible with respect to changing voices, or even languages. A synthesis system was developed which used the clustered states of a set of decision-tree state-clustered HMMs as its synthesis units. The synthesis parameters for each clustered state were obtained completely automatically through training on a one hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, was generated as a sequence of these clustered states. Initially, each clustered state was associated with a single linear prediction (LP) vector, and LP synthesis used to generate the sequence of vectors corresponding to the state sequence required. Numerous shortcomings were identified in this system, and these were addressed through improvements to its transcription, clustering, and segmentation capabilities. The LP synthesis scheme was replaced by a TD-PSOLA scheme which synthesised speech by concatenating waveform segments selected to represent each clustered state.
50

Oliver, Richard George. "Malocclusion and speech". Thesis, Cardiff University, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.390247.
