Doctoral dissertations on the topic "Speech synthesis"


Create an accurate citation in APA, MLA, Chicago, Harvard, and many other styles


Check the 50 best scholarly doctoral dissertations on the topic "Speech synthesis".

The "Add to bibliography" button is available next to each work in the bibliography. Use it, and we will automatically create the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication in ".pdf" format and read the online abstract of the work if the relevant parameters are provided in its metadata.

Browse doctoral dissertations from many different fields and compile the appropriate bibliographies.

1

Donovan, R. E. "Trainable speech synthesis". Thesis, University of Cambridge, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.598598.

Full text of the source
Abstract:
This thesis is concerned with the synthesis of speech using trainable systems. The research it describes was conducted with two principal aims: to build a hidden Markov model (HMM) based speech synthesis system which could synthesise very high quality speech; and to ensure that all the parameters used by the system were obtained through training. The motivation behind the first of these aims was to determine if the HMM techniques which have been applied so successfully in recent years to the problem of automatic speech recognition could achieve a similar level of success in the field of speech synthesis. The motivation behind the second aim was to construct a system that would be very flexible with respect to changing voices, or even languages. A synthesis system was developed which used the clustered states of a set of decision-tree state-clustered HMMs as its synthesis units. The synthesis parameters for each clustered state were obtained completely automatically through training on a one hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, was generated as a sequence of these clustered states. Initially, each clustered state was associated with a single linear prediction (LP) vector, and LP synthesis used to generate the sequence of vectors corresponding to the state sequence required. Numerous shortcomings were identified in this system, and these were addressed through improvements to its transcription, clustering, and segmentation capabilities. The LP synthesis scheme was replaced by a TD-PSOLA scheme which synthesised speech by concatenating waveform segments selected to represent each clustered state.
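
The clustered-state LP synthesis described above lends itself to a short sketch: each unit contributes one set of linear prediction coefficients, and an all-pole filter shapes a simple excitation. This is a minimal illustration in Python, not Donovan's implementation; the pulse-train excitation, frame length and coefficient data are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesize(lp_frames, frame_len=160, pitch_period=80):
    """lp_frames: iterable of LP coefficient arrays [a1, ..., ap] (placeholder data)."""
    out = []
    for a in lp_frames:
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0  # crude voiced pulse-train excitation
        # All-pole synthesis filter 1 / (1 + a1*z^-1 + ... + ap*z^-p);
        # filter state is not carried across frames, a simplification.
        out.append(lfilter([1.0], np.concatenate(([1.0], a)), excitation))
    return np.concatenate(out)
```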
APA, Harvard, Vancouver, ISO, and other styles.
2

Greenwood, Andrew Richard. "Articulatory speech synthesis". Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386773.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
3

Tsukanova, Anastasiia. "Articulatory speech synthesis". Electronic Thesis or Diss., Université de Lorraine, 2019. http://www.theses.fr/2019LORR0166.

Full text of the source
Abstract:
The thesis is set in the domain of articulatory speech synthesis and consists of three major parts: the first two are dedicated to the development of two articulatory speech synthesizers and the third addresses how we can relate them to each other. The first approach results from a rule-based approach to articulatory speech synthesis that aimed to have a comprehensive control over the articulators (the jaw, the tongue, the lips, the velum, the larynx and the epiglottis). This approach used a dataset of static mid-sagittal magnetic resonance imaging (MRI) captures showing blocked articulation of French vowels and a set of consonant-vowel syllables; that dataset was encoded with a PCA-based vocal tract model. Then the system comprised several components: using the recorded articulatory configurations to drive a rule-based articulatory speech synthesizer as a source of target positions to attain (which is the main contribution of this first part); adjusting the obtained vocal tract shapes from the phonetic perspective; running an acoustic simulation unit to obtain the sound. The results of this synthesis were evaluated visually, acoustically and perceptually, and the problems encountered were broken down by their origin: the dataset, its modeling, the algorithm for managing the vocal tract shapes, their translation to the area functions, and the acoustic simulation. We concluded that, among our test examples, the articulatory strategies for vowels and stops are most correct, followed by those of nasals and fricatives. The second explored approach started off a baseline deep feed-forward neural network-based speech synthesizer trained with the standard recipe of Merlin on the audio recorded during real-time MRI (RT-MRI) acquisitions: denoised (and yet containing a considerable amount of noise of the MRI machine) speech in French and force-aligned state labels encoding phonetic and linguistic information. This synthesizer was augmented with eight parameters representing articulatory information---the lips opening and protrusion, the distance between the tongue and the velum, the velum and the pharyngeal wall and the tongue and the pharyngeal wall---that were automatically extracted from the captures and aligned with the audio signal and the linguistic specification. The jointly synthesized speech and articulatory sequences were evaluated objectively with dynamic time warping (DTW) distance, mean mel-cepstrum distortion (MCD), BAP (band aperiodicity prediction error), and three measures for F0: RMSE (root mean square error), CORR (correlation coefficient) and V/UV (frame-level voiced/unvoiced error). The consistency of articulatory parameters with the phonetic label was analyzed as well. I concluded that the generated articulatory parameter sequences matched the original ones acceptably closely, despite struggling more at attaining a contact between the articulators, and that the addition of articulatory parameters did not hinder the original acoustic model. The two approaches above are linked through the use of two different kinds of MRI speech data. This motivated a search for such coarticulation-aware targets as those that we had in the static case to be present or absent in the real-time data. 
To compare static and real-time MRI captures, the measures of structural similarity, Earth mover's distance, and SIFT were utilized; having analyzed these measures for validity and consistency, I qualitatively and quantitatively studied their temporal behavior, interpreted it and analyzed the identified similarities. I concluded that SIFT and structural similarity did capture some articulatory information and that their behavior, overall, validated the static MRI dataset. [...]
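
Of the objective measures listed above, the dynamic time warping (DTW) distance is the easiest to make concrete: it aligns a generated parameter sequence with a reference sequence before accumulating frame-level distances. A minimal sketch, not code from the thesis; the Euclidean local cost is an assumption.

```python
import numpy as np

def dtw_distance(ref, gen):
    """ref, gen: 2-D arrays (frames x parameters), e.g. articulatory tracks."""
    n, m = len(ref), len(gen)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - gen[j - 1])  # local frame distance
            # Best of the three allowed predecessors: match, insertion, deletion.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]
```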
APA, Harvard, Vancouver, ISO, and other styles.
4

Sun, Felix (Felix W.). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition". Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Full text of the source
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-63).
The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
by Felix Sun.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles.
5

Morton, K. "Speech production and synthesis". Thesis, University of Essex, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.377930.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
6

Jin, Yi-Xuan. "A HIGH SPEED DIGITAL IMPLEMENTATION OF LPC SPEECH SYNTHESIZER USING THE TMS320". Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/275309.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
7

Wong, Chun-ho Eddy. "Reliability of rating synthesized hypernasal speech signals in connected speech and vowels". Click to view the E-thesis via HKU Scholars Hub, 2007. http://lookup.lib.hku.hk/lookup/bib/B4200617X.

Full text of the source
Abstract:
Thesis (B.Sc)--University of Hong Kong, 2007.
"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, June 30, 2007." Includes bibliographical references (p. 28-30). Also available in print.
APA, Harvard, Vancouver, ISO, and other styles.
8

Peng, Antai. "Speech expression modeling and synthesis". Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13560.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
9

Brierton, Richard A. "Variable frame-rate speech synthesis". Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.357363.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
10

Klompje, Gideon. "A parametric monophone speech synthesis system". Thesis, Link to online version, 2006. http://hdl.handle.net/10019/561.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
11

Moers-Prinz, Donata [Verfasser]. "Fast Speech in Unit Selection Speech Synthesis / Donata Moers-Prinz". Bielefeld : Universitätsbibliothek Bielefeld, 2020. http://d-nb.info/1219215201/34.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
12

Andersson, Johan Sebastian. "Synthesis and evaluation of conversational characteristics in speech synthesis". Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8891.

Full text of the source
Abstract:
Conventional synthetic voices can synthesise neutral read aloud speech well. But, to make synthetic speech more suitable for a wider range of applications, the voices need to express more than just the word identity. We need to develop voices that can partake in a conversation and express, e.g. agreement, disagreement, hesitation, in a natural and believable manner. In speech synthesis there are currently two dominating frameworks: unit selection and HMM-based speech synthesis. Both frameworks utilise recordings of human speech to build synthetic voices. Despite the fact that the content of the recordings determines the segmental and prosodic phenomena that can be synthesised, surprisingly little research has been made on utilising the corpus to extend the limited behaviour of conventional synthetic voices. In this thesis we will show how natural sounding conversational characteristics can be added to both unit selection and HMM-based synthetic voices, by adding speech from a spontaneous conversation to the voices. We recorded a spontaneous conversation, and by manually transcribing and selecting utterances we obtained approximately two thousand utterances from it. These conversational utterances were rich in conversational speech phenomena, but they lacked the general coverage that allows unit selection and HMM-based synthesis techniques to synthesise high quality speech. Therefore we investigated a number of blending approaches in the synthetic voices, where the conversational utterances were augmented with conventional read aloud speech. The synthetic voices that contained conversational speech were contrasted with conventional voices without conversational speech. The perceptual evaluations showed that the conversational voices were generally perceived by listeners as having a more conversational style than the conventional voices. This conversational style was largely due to the conversational voices’ ability to synthesise utterances that contained conversational speech phenomena in a more natural manner than the conventional voices. Additionally, we conducted an experiment that showed that natural sounding conversational characteristics in synthetic speech can convey pragmatic information, in our case an impression of certainty or uncertainty, about a topic to a listener. The conclusion drawn is that the limited behaviour of conventional synthetic voices can be enriched by utilising conversational speech in both unit selection and HMM-based speech synthesis.
APA, Harvard, Vancouver, ISO, and other styles.
13

Liu, Zhu Lin. "Speech synthesis via adaptive Fourier decomposition". Thesis, University of Macau, 2011. http://umaclib3.umac.mo/record=b2493215.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
14

Jauk, Igor. "Unsupervised learning for expressive speech synthesis". Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/460814.

Full text of the source
Abstract:
Nowadays, especially with the upswing of neural networks, speech synthesis is almost totally data-driven. The goal of this thesis is to provide methods for automatic and unsupervised learning from data for expressive speech synthesis. In comparison to "ordinary" synthesis systems, it is more difficult to find reliable expressive training data, despite the huge availability of sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which causes many acoustically substantial variations. The consequences are that, first, it is very difficult to define labels which reliably identify expressive speech in all its nuances. The typical definition of 6 basic emotions, or the like, is a simplification which will have inexcusable consequences when dealing with data outside the lab. Second, even if a label set is defined, apart from the enormous manual effort, it is difficult to obtain sufficient training data for models that respect all the nuances and variations. The goal of this thesis is to study automatic training methods for expressive speech synthesis that avoid labeling, and to develop applications from these proposals. The focus lies on the acoustic and the semantic domains. In the acoustic domain, the goal is to find suitable acoustic features to represent expressive speech, especially in the multi-speaker domain, moving closer to real-life uncontrolled data. For this, the perspective shifts away from traditional, mainly prosody-based, features towards features obtained through factor analysis, trying to identify the principal components of expressiveness, namely using i-vectors. Results show that a combination of traditional and i-vector-based features performs better in unsupervised clustering of expressive speech than traditional features alone, and even better than large state-of-the-art feature sets in the multi-speaker domain. Once the feature set is defined, it is used for unsupervised clustering of an audiobook, where a voice is trained from each cluster. The method is then evaluated in an audiobook-editing application, where users can use the synthetic voices to create their own dialogues. The obtained results validate the proposal. In this editing application, users choose synthetic voices and assign them to sentences according to the speaking characters and the expressiveness. By involving the semantic domain, this assignment can be achieved automatically, at least partly. Words and sentences are represented numerically in trainable semantic vector spaces, called embeddings, and these can be used to predict the expressiveness to some extent. This method not only permits fully automatic reading of longer text passages, taking the local context into account, but can also be used as a semantic search engine for training data. Both applications are evaluated in a perceptual test showing the potential of the proposed method. Finally, accounting for the new tendencies in the speech synthesis world, deep neural network based expressive speech synthesis is designed and tested. Emotionally motivated semantic representations of text, sentiment embeddings, trained on the positivity and negativity of movie reviews, are used as an additional input to the system. The neural network now learns not only from segmental and contextual information, but also from the sentiment embeddings, affecting especially prosody.
The system is evaluated in two perceptual experiments which show preferences for the inclusion of sentiment embeddings as an additional input.
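
The unsupervised clustering stage can be sketched as follows: traditional prosodic features are concatenated with i-vectors, and the utterances of an audiobook are grouped without labels. The use of k-means, the scaling step and the number of clusters are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_utterances(prosodic_feats, i_vectors, n_styles=8):
    """Both inputs: (n_utterances x dim) arrays from some upstream extractor."""
    combined = np.hstack([prosodic_feats, i_vectors])
    combined = StandardScaler().fit_transform(combined)  # equalise feature scales
    # Each returned label indexes an expressive cluster; a voice could then
    # be trained on the utterances of each cluster.
    return KMeans(n_clusters=n_styles, n_init=10).fit_predict(combined)
```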
APA, Harvard, Vancouver, ISO, and other styles.
15

Macon, Michael W. "Speech synthesis based on sinusoidal modeling". Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13904.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
16

Wang, Min 1961. "Formant-based synthesis of Chinese speech". Thesis, McGill University, 1986. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=66001.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
17

Rahim, Mazin. "Neural networks in articulatory speech synthesis". Thesis, University of Liverpool, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.317191.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
18

Fekkai, Souhila. "Fractal based speech recognition and synthesis". Thesis, De Montfort University, 2002. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.269246.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
19

Wright, Richard Douglas. "An investigation of speech synthesis parameters". Thesis, University of Southampton, 1988. https://eprints.soton.ac.uk/52279/.

Full text of the source
Abstract:
The model of speech production generally used in speech synthesis is that of a source modified by a digital filter. The major difference between a number of models is the form of the digital filter. The purpose of this research is to compare the properties of these filters when used for speech synthesis. Six models were investigated: (1) series resonance; (2) direct form; (3) reflection coefficients; (4) area function; (5) parallel resonance; and (6) a simple articulatory model. Types (2,3,4) are three varieties of linear predictive coding (LPC) parameters. There are five parts to the investigation: (1) an historical survey of models for speech synthesis and their problems; (2) a formal description of the models and their analytical relationships; (3) an objective assessment of the behaviour of the models during interpolation; (4) measurement of intelligibility (using a FAAF test); and (5) measurement of naturalness. Principal results are: synthesizer types (1) to (4) are all-pole models, formally equivalent in the steady state. But when the parameters of any of the models are interpolated, consequences for motion of vocal tract resonances (formants) differ. These differences exceed the discrimination limen for formant frequency, and make a small but statistically significant difference to intelligibility, but not to naturalness. Simple linear interpolation was found to be as good as cosine or piecewise-linear interpolation. Complete lack of interpolation reduced intelligibility by 30%. Finally, the synthesis studied achieved as few place-of-articulation errors as did LPC speech, indicating that intelligibility was limited not by parameter and transition type, but by other factors such as the excitation signal, phoneme target values, and durations.
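
The interpolation schemes compared above are simple to state. A minimal illustration, with hypothetical formant values, of linear versus cosine interpolation between two parameter targets:

```python
import numpy as np

def interpolate(start, end, n_steps, kind="linear"):
    t = np.linspace(0.0, 1.0, n_steps)
    if kind == "cosine":
        t = 0.5 * (1.0 - np.cos(np.pi * t))  # eases in and out of the targets
    return start + t * (end - start)

# e.g. moving F1 from 700 Hz to 300 Hz over ten frames:
track = interpolate(700.0, 300.0, 10, kind="cosine")
```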
APA, Harvard, Vancouver, ISO, and other styles.
20

Campbell, Wilhelm. "Multi-level speech timing control". Thesis, University of Sussex, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.283832.

Full text of the source
Abstract:
This thesis describes a model of speech timing, predicting at the syllable level, with sensitivity to rhythmic factors at the foot level, that predicts segmental durations by a process of accommodation into the higher-level timing framework. The model is based on analyses of two large databases of British English speech; one illustrating the range of prosodic variation in the language, the other illustrating segmental duration characteristics in various phonetic environments. Designed for a speech synthesis application, the model also has relevance to linguistic and phonetic theory, and shows that phonological specification of prosodic variation is independent of the phonetic realisation of segmental duration. It also shows, using normalisation of phone-specific timing characteristics, that lengthening of segments within the syllable is of three kinds: prominence-related, applying more to onset segments; boundary-related, applying more to coda segments; and rhythm/rate-related, being more uniform across all component segments. In this model, durations are first predicted at the level of the syllable from consideration of the number of component segments, the nature of the rhyme, and the three types of lengthening. The segmental durations are then constrained to sum to this value by determining an appropriate uniform quantile of their individual distributions. Segmental distributions define the range of likely durations each might show under a given set of conditions; their parameters are predicted from broad-class features of place and manner of articulation, factored for position in the syllable, clustering, stress, and finality. Two parameters determine the segmental duration pdfs, assuming a Gamma distribution, and one parameter determines the quantile within that pdf to predict the duration of any segment in a given prosodic context. In experimental tests, each level produced durations that closely fitted the data of four speakers of British English, and showed performance rates higher than a comparable model predicting exclusively at the level of the segment.
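
The accommodation step described above can be sketched directly: given a predicted syllable duration and a Gamma duration distribution per segment, solve for the single quantile whose per-segment durations sum to that target. The Gamma parameters here are placeholders, not values from the thesis.

```python
from scipy.optimize import brentq
from scipy.stats import gamma

def segment_durations(syllable_dur, shapes, scales):
    """shapes, scales: per-segment Gamma parameters (placeholder values)."""
    def total(q):
        return sum(gamma.ppf(q, a, scale=s) for a, s in zip(shapes, scales))
    # Find the uniform quantile q whose segment durations sum to the target
    # (assumes the target lies within the attainable range of the pdfs).
    q = brentq(lambda q: total(q) - syllable_dur, 1e-6, 1.0 - 1e-6)
    return [gamma.ppf(q, a, scale=s) for a, s in zip(shapes, scales)]
```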
APA, Harvard, Vancouver, ISO, and other styles.
21

Moakes, Paul Alan. "On-line adaptive nonlinear modelling of speech". Thesis, University of Sheffield, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364189.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
22

Shannon, Sean Matthew. "Probabilistic acoustic modelling for parametric speech synthesis". Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708415.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
23

Cummings, Kathleen E. "Analysis, synthesis, and recognition of stressed speech". Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/15673.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
24

Varga, A. P. "Multipulse excited linear predictive analysis in speech coding and constructive speech synthesis". Thesis, University of Cambridge, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.372909.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
25

Mazel, David S. "Sinusoidal modeling of speech". Thesis, Georgia Institute of Technology, 1986. http://hdl.handle.net/1853/13873.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
26

Merritt, Thomas. "Overcoming the limitations of statistical parametric speech synthesis". Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/22071.

Full text of the source
Abstract:
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses for the causes of reduced synthesis quality, and subsequent required improvements, for HMM speech synthesis in literature. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no knowledge of whether a real underlying issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute. Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis.
APA, Harvard, Vancouver, ISO, and other styles.
27

Engwall, Olov. "Tongue Talking : Studies in Intraoral Speech Synthesis". Doctoral thesis, KTH, Tal, musik och hörsel, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3380.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
28

Sakai, Shinsuke. "A Probabilistic Approach to Concatenative Speech Synthesis". 京都大学 (Kyoto University), 2012. http://hdl.handle.net/2433/152508.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
29

Hassanain, Elham. "Novel cepstral techniques applied to speech synthesis". Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/842745/.

Full text of the source
Abstract:
The aim of this research was to develop an improved analysis and synthesis model for utilization in speech synthesis. Conventionally, linear prediction has been used in speech synthesis but is restricted by the requirement of an all-pole, minimum phase model. Here, cepstral homomorphic deconvolution techniques were used to approach the problem, since there are fewer constraints on the model and some evidence in the literature that shows that cepstral homomorphic deconvolution can give improved performance. Specifically the spectral root cepstrum was developed in an attempt to separate the magnitude and phase spectra. Analysis and synthesis filters were developed on these two data streams independently in an attempt to improve the process. It is shown that independent analysis of the magnitude and phase spectra is preferable to a combined analysis, and so the concept of a phase cepstrum is introduced, and a number of different phase cepstra are defined. Although extremely difficult for many types of signals, phase analysis via a root cepstrum and the Hartley phase cepstrum give encouraging results for a wide range of both minimum and maximum phase signals. Overall, this research has shown that improved synthesis can be achieved with these techniques.
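
The basic homomorphic step underlying these techniques can be sketched briefly: the real cepstrum is the inverse transform of the log magnitude spectrum, and low-quefrency liftering retains the spectral envelope. A generic illustration only; the spectral root and phase cepstra developed in the thesis go beyond this.

```python
import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.fft(frame)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

def spectral_envelope(frame, n_lifter=30):
    c = real_cepstrum(frame)
    c[n_lifter:-n_lifter] = 0.0        # keep only low-quefrency coefficients
    return np.exp(np.fft.fft(c).real)  # smoothed magnitude spectrum
```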
APA, Harvard, Vancouver, ISO, and other styles.
30

Vepa, Jithendra. "Join cost for unit selection speech synthesis". Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/1452.

Full text of the source
Abstract:
Undoubtedly, state-of-the-art unit selection-based concatenative speech systems produce very high quality synthetic speech. This is due to a large speech database containing many instances of each speech unit, with a varied and natural distribution of prosodic and spectral characteristics. The join cost, which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from this large speech database. The ideal join cost is one that measures perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. During the first part of my research, I investigated various spectrally based distance measures for use in computation of the join cost by designing a perceptual listening experiment. A variation to the usual perceptual test paradigm is proposed in this thesis by deliberately including a wide range of qualities of join in polysyllabic words. The test stimuli are obtained using a state-of-the-art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three spectral features - Mel-frequency cepstral coefficients (MFCC), line spectral frequencies (LSF) and multiple centroid analysis (MCA) parameters - and various statistical distances - Euclidean, Kullback-Leibler, Mahalanobis - are used to obtain distance measures. Based on the correlations between perceptual scores and these spectral distances, I proposed new spectral distance measures, which correlate well with human perception of concatenation discontinuities. The second part of my research concentrates on combining join cost computation and the smoothing operation, which is required to disguise joins, by learning an underlying representation from the acoustic signal. In order to accomplish this task, I have chosen linear dynamic models (LDM), sometimes known as Kalman filters. Three different initialisation schemes are used prior to Expectation-Maximisation (EM) in LDM training. Once the models are trained, the join cost is computed based on the error between model predictions and actual observations. Analytical measures are derived based on the shape of this error plot. These measures and initialisation schemes are compared by computing correlations using the perceptual data. The LDMs are also able to smooth the observations, which are then used to synthesise speech. To evaluate the LDM smoothing operation, another listening test is performed where it is compared with the standard methods (simple linear interpolation). I have compared the best three join cost functions, chosen from the first and second parts of my research, subjectively using a listening test in the third part of my research. In this test, I also evaluated different smoothing methods: no smoothing, linear smoothing, and smoothing achieved using LDMs.
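
A join cost of the spectrally based kind investigated here reduces, at its simplest, to a distance between the feature frames on either side of a proposed concatenation point. A minimal sketch assuming MFCC features and a Euclidean distance; the measures actually proposed in the thesis are more elaborate.

```python
import numpy as np

def join_cost(left_unit_mfcc, right_unit_mfcc):
    """Each argument: (frames x coefficients) MFCC matrix of a candidate unit."""
    boundary_left = left_unit_mfcc[-1]   # last frame of the left-hand unit
    boundary_right = right_unit_mfcc[0]  # first frame of the right-hand unit
    return np.linalg.norm(boundary_left - boundary_right)
```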
APA, Harvard, Vancouver, ISO, and other styles.
31

Watts, Oliver Samuel. "Unsupervised learning for text-to-speech synthesis". Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7982.

Full text of the source
Abstract:
This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.
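
The distributional analysis at the core of this method can be illustrated with a toy example: words are placed in a continuous-valued space via a truncated SVD of a word/context co-occurrence matrix, with no manually labelled data. The window size, log weighting and dimensionality below are assumptions for illustration, not the thesis recipe.

```python
import numpy as np

def word_vectors(tokens, dim=10, window=2):
    """tokens: a corpus as a list of words (toy input)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):  # accumulate co-occurrence counts
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[index[w], index[tokens[j]]] += 1.0
    u, s, _ = np.linalg.svd(np.log1p(counts), full_matrices=False)
    return vocab, u[:, :dim] * s[:dim]  # continuous word representations
```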
APA, Harvard, Vancouver, ISO, and other styles.
32

Vine, Daniel Samuel Gordon. "Time-domain concatenative text-to-speech synthesis". Thesis, Bournemouth University, 1998. http://eprints.bournemouth.ac.uk/351/.

Full text of the source
Abstract:
A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using 'default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a 'judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six.
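
The concatenation rule of the framework is simple enough to sketch: each stored unit waveform is truncated to the initial portion given by its duration parameter, and the portions are abutted without smoothing. Waveforms, durations and the sample rate below are placeholders.

```python
import numpy as np

def concatenate_units(waveforms, durations, sample_rate=16000):
    """waveforms: list of 1-D arrays; durations: seconds of each unit to keep."""
    parts = []
    for wav, dur in zip(waveforms, durations):
        n = min(len(wav), int(dur * sample_rate))
        parts.append(wav[:n])  # output only the initial portion of the unit
    return np.concatenate(parts)
```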
APA, Harvard, Vancouver, ISO, and other styles.
33

Edge, James D. "Techniques for the synthesis of visual speech". Thesis, University of Sheffield, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.419276.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
34

SOLEWICZ, JOSE ALBERTO. "TEXT-TO-SPEECH SYNTHESIS FOR BRAZILIAN PORTUGUESE". PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 1993. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=8690@1.

Full text of the source
Abstract:
This work presents an unrestricted text-to-speech synthesis system for Brazilian Portuguese. The system is based on the concatenation by rules of previously coded speech units. An extremely reduced set of synthesis units (149) is proposed. This set is mostly comprised of consonant-vowel (CV) transitions, which represent crucial acoustic segments in the speech production process. Production of highly intelligible speech is shown to be possible through concatenation of these units. A CELP model is also proposed as a compression and synthesis structure, which includes the adaptations necessary to modify the speech prosody during its decoding phase. Subjective tests showed that speech synthesized through the proposed CELP model is judged superior to that obtained through an LPC vocoder (mono-pulse/noise excited), which is traditionally used in text-to-speech synthesis systems.
APA, Harvard, Vancouver, ISO, and other styles.
35

Hardwick, John C. (John Clark). "A high quality speech analysis/synthesis system". Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/14901.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles.
36

Halabi, Nawar. "Modern standard Arabic phonetics for speech synthesis". Thesis, University of Southampton, 2016. https://eprints.soton.ac.uk/409695/.

Full text of the source
Abstract:
Arabic phonetics and phonology have not been adequately studied for the purposes of speech synthesis and speech synthesis corpus design. The only sources of knowledge available are either archaic or targeted towards other disciplines such as education. This research conducted a three-stage study. First, Arabic phonology research was reviewed in general, and the results of this review were triangulated with expert opinions – gathered throughout the project – to create a novel formalisation of Arabic phonology for speech synthesis. Secondly, this formalisation was used to create a speech corpus in Modern Standard Arabic and this corpus was used to produce a speech synthesiser. This corpus was the first to be constructed and published for this dialect of Arabic using scientifically-supported phonological formalisms. The corpus was semi-automatically annotated with phoneme boundaries and stress marks; it is word-aligned with the orthographical transcript. The accuracy of these alignments was compared with previous published work, which showed that even slightly less accurate alignments are sufficient for producing high quality synthesis. Finally, objective and subjective evaluations were conducted to assess the quality of this corpus. The objective evaluation showed that the corpus based on the proposed phonological formalism had sufficient phonetic coverage compared with previous work. The subjective evaluation showed that this corpus can be used to produce high quality parametric and unit selection speech synthesisers. In addition, it showed that the use of orthographically extracted stress marks can improve the quality of the generated speech for general purpose synthesis. These stress marks are the first to be tested for Modern Standard Arabic, which thus opens this subject for future research.
APA, Harvard, Vancouver, ISO, and other styles.
37

Beněk, Tomáš. "Implementing and Improving a Speech Synthesis System". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236079.

Full text of the source
Abstract:
This thesis deals with text-to-speech synthesis. It provides a basic theoretical introduction to text-to-speech synthesis. The work is built on the MARY TTS system, which makes it possible to use existing modules to create one's own text-to-speech system, and on speech synthesis using hidden Markov models trained on the speech database that was created. Several simple programs easing the creation of the database were written, and the addition of a new language and voice to the MARY TTS system was demonstrated. A module and a voice for the Czech language were created and published. An algorithm for grapheme-to-phoneme transcription was described and implemented.
APA, Harvard, Vancouver, ISO, and other styles.
38

Näslund, Per. "Artificial Neural Networks in Swedish Speech Synthesis". Thesis, KTH, Tal-kommunikation, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239350.

Full text of the source
Abstract:
Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods. In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed "Tacotron" and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.
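
For a preference test of this size, significance can be checked with a binomial (sign) test against chance. A sketch of the idea with a hypothetical count; the thesis used a graded strength-of-preference design, so this is an illustrative simplification.

```python
from scipy.stats import binomtest

n_subjects = 20
n_prefer_ann = 16  # hypothetical number of subjects preferring the ANN-based TTS
result = binomtest(n_prefer_ann, n_subjects, p=0.5, alternative="two-sided")
print(result.pvalue)  # below 0.05 would indicate a significant preference
```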
APA, Harvard, Vancouver, ISO, and other styles.
39

Hagrot, Joel. "A Data-Driven Approach For Automatic Visual Speech In Swedish Speech Synthesis Applications". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-246393.

Full text of the source
Abstract:
This project investigates the use of artificial neural networks for visual speech synthesis. The objective was to produce a framework for animated chat bots in Swedish. A survey of the literature on the topic revealed that the state-of-the-art approach was using ANNs with either audio or phoneme sequences as input. Three subjective surveys were conducted, both in the context of the final product, and in a more neutral context with less post-processing. They compared the ground truth, captured using the depth-sensing camera of the iPhone X, against both the ANN model and a baseline model. The statistical analysis used mixed effects models to find any statistically significant differences. Also, the temporal dynamics and the error were analyzed. The results show that a relatively simple ANN was capable of learning a mapping from phoneme sequences to blend shape weight sequences with satisfactory results, except for the fact that certain consonant requirements were unfulfilled. The issues with certain consonants were also observed in the ground truth, to some extent. Post-processing with consonant-specific overlays made the ANN's animations indistinguishable from the ground truth and the subjects perceived them as more realistic than the baseline model's animations. The ANN model proved useful in learning the temporal dynamics and coarticulation effects for vowels, but may have needed more data to properly satisfy the requirements of certain consonants. For the purposes of the intended product, these requirements can be satisfied using consonant-specific overlays.
APA, Harvard, Vancouver, ISO, and other styles.
40

Gordon, Jane S. "Use of synthetic speech in tests of speech discrimination". PDXScholar, 1985. https://pdxscholar.library.pdx.edu/open_access_etds/3443.

Full text of the source
Abstract:
The purpose of this study was to develop two tape-recorded synthetic speech discrimination test tapes and assess their intelligibility in order to determine whether or not synthetic speech was intelligible and if it would prove useful in speech discrimination testing. Four scramblings of the second MU-6 monosyllable word list were generated by the ECHO l C speech synthesizer using two methods of generating synthetic speech called TEXTALKER and SPEAKEASY. These stimuli were presented in one ear to forty normal-hearing adult subjects, 36 females and 4 males, at 60 dB HL under headphones. Each subject listened to two different scramblings of the 50 monosyllable word list, one scrambling generated by TEXTALKER and the other scrambling generated by SPEAKEASY. The order in which the TEXTALKER and SPEAKEASY mode of presentation occurred as well as which ear to test per subject was randomly determined.
Style APA, Harvard, Vancouver, ISO itp.
41

Chung, Jae H. "A new homomorphic vocoder framework using analysis-by-synthesis excitation analysis". Diss., Georgia Institute of Technology, 1991. http://hdl.handle.net/1853/15471.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
42

Alissali, Mamoun. "Architecture logicielle pour la synthèse multilingue de la parole". Grenoble INPG, 1993. http://www.theses.fr/1993INPG0037.

Pełny tekst źródła
Streszczenie:
This thesis presents a study of software specifications for multilingual speech synthesis. The objective is the design and implementation of a software architecture defining a collection of tools suited to the development and use of multilingual speech synthesis systems. The resulting environment, called COMPOST, allows reconfigurable systems to be built from collections of modules written in two programming languages: a specialized rewriting language, also called COMPOST, and a traditional programming language (C). A standardized interface allows co-programming in these two languages. Examples of the development and operation of speech synthesis systems under COMPOST are then presented, followed by its multilingual capabilities and its distributed architecture. In conclusion, the main lines of the future evolution of this environment are traced.
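The abstract does not detail the COMPOST rewriting language itself, but the general pattern of a reconfigurable pipeline of rewrite modules can be sketched as follows. All rule formats, rules, and module names here are invented for illustration; they are not the actual COMPOST language.

```python
# An illustrative sketch of a reconfigurable text-rewriting pipeline in the
# spirit of the architecture described above. The rule format and modules are
# invented for illustration; they are not the actual COMPOST language.
import re

def make_rewrite_module(rules):
    """Compile (pattern, replacement) pairs into a text-transforming module."""
    compiled = [(re.compile(p), r) for p, r in rules]
    def module(text):
        for pattern, replacement in compiled:
            text = pattern.sub(replacement, text)
        return text
    return module

# Hypothetical French letter-to-sound fragments, one module per stage.
graphemes_to_phones = make_rewrite_module([(r"ch", "S"), (r"ou", "u")])
postlexical = make_rewrite_module([(r"u$", "u:")])  # toy final lengthening

# A system is just an ordered, reconfigurable list of modules.
pipeline = [graphemes_to_phones, postlexical]

text = "chou"
for module in pipeline:
    text = module(text)
print(text)  # "Su:" under the toy rules above
```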
Style APA, Harvard, Vancouver, ISO itp.
43

Kain, Alexander Blouke. "High resolution voice transformation /". Full text open access at:, 2001. http://content.ohsu.edu/u?/etd,189.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
44

Peters, Richard Alan II. "A LINEAR PREDICTION CODING MODEL OF SPEECH (SYNTHESIS, LPC, COMPUTER, ELECTRONIC)". Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/291240.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
45

Strömbergsson, Sofia. "The /k/s, the /t/s, and the inbetweens : Novel approaches to examining the perceptual consequences of misarticulated speech". Doctoral thesis, KTH, Tal-kommunikation, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-143102.

Pełny tekst źródła
Streszczenie:
This thesis comprises investigations of the perceptual consequences of children’s misarticulated speech – as perceived by clinicians, by everyday listeners, and by the children themselves. By inviting methods from other areas to the study of speech disorders, this work demonstrates some successful cases of cross-fertilization. The population in focus is children with a phonological disorder (PD), who misarticulate /t/ and /k/. A theoretical assumption underlying this work is that errors in speech production are often paralleled in perception, e.g. that children base their decision on whether a speech sound is a /t/ or a /k/ on other acoustic-phonetic criteria than those employed by proficient language users. This assumption, together with an aim at stimulating self-monitoring in these children, motivated two of the included studies. Through these studies, new insights into children’s perception of their own speech were achieved – insights entailing both clinical and psycholinguistic implications. For example, the finding that children with PD generally recognize themselves as the speaker in recordings of their own utterances lends support to the use of recordings in therapy, to attract children’s attention to their own speech production. Furthermore, through the introduction of a novel method for automatic correction of children’s speech errors, these findings were extended with the observation that children with PD tend to evaluate misarticulated utterances as correct when just having produced them, and to perceive inaccuracies better when time has passed. Another theme in this thesis is the gradual nature of speech perception related to phonological categories, and a concern that perceptual sensitivity is obscured in descriptions based solely on discrete categorical labels. This concern is substantiated by the finding that listeners rate “substitutions” of [t] for /k/ as less /t/-like than correct productions of [t] for intended /t/. Finally, a novel method of registering listener reactions during the continuous playback of misarticulated speech is introduced, demonstrating a viable approach to exploring how different speech errors influence intelligibility and/or acceptability. By integrating such information in the prioritizing of therapeutic targets, intervention may be better directed at those patterns that cause the most problems for the child in his or her everyday life.
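The continuous reaction-registration method mentioned above lends itself to a simple implementation sketch: listeners press a key whenever they notice an error, and keypress times are logged relative to playback onset. The stdlib-only code below is an illustration under that assumption, not the thesis's actual software; audio playback is stubbed out.

```python
# An illustrative sketch of continuous listener-reaction logging: the listener
# presses Enter each time something sounds wrong while a recording plays, and
# each reaction is timestamped relative to playback onset. Playback itself is
# stubbed out here; the interface is an assumption, not the thesis's software.
import time

def log_reactions(stimulus_name, duration_s):
    print(f"Playing '{stimulus_name}' ({duration_s:.0f} s). "
          "Press Enter when you hear an error; type 'q' + Enter to finish.")
    start = time.monotonic()
    reactions = []
    while time.monotonic() - start < duration_s:
        if input() == "q":
            break
        reactions.append(time.monotonic() - start)  # seconds into playback
    return reactions

if __name__ == "__main__":
    times = log_reactions("misarticulated_story_01", duration_s=30.0)
    print("Reaction times (s):", [round(t, 2) for t in times])
```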

Style APA, Harvard, Vancouver, ISO itp.
46

Low, Phuay Hui. "Statistical analysis, modelling and synthesis of voice for text to speech synthesis". Thesis, Brunel University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.401342.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
47

Bulyko, Ivan. "Flexible speech synthesis using weighted finite-state transducers /". Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6081.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
48

Crosmer, Joel R. "Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients". Diss., Georgia Institute of Technology, 1985. http://hdl.handle.net/1853/15739.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
49

Qader, Raheel. "Pronunciation and disfluency modeling for expressive speech synthesis". Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S076/document.

Pełny tekst źródła
Streszczenie:
In numerous domains, the usage of synthetic speech is conditioned upon the ability of speech synthesis systems to generate natural and expressive speech. In this context, we address the problem of expressivity in TTS by incorporating two phenomena with a high impact on speech: pronunciation variants and speech disfluencies. In the first part of this thesis, we present a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and acoustic features and in using a probabilistic machine learning framework, namely conditional random fields (CRFs) and language models. Extensive experiments on the Buckeye corpus demonstrate the effectiveness of this approach through objective and subjective evaluations. Listening tests on synthetic speech show that adapted pronunciations are judged as more spontaneous than standard ones, and even more so than those realized by real speakers. Furthermore, we show that the method can be extended to other adaptation tasks, for instance to solve the problem of inconsistency between the phoneme sequences handled in TTS systems. The second part of this thesis explores a novel approach to the automatic generation of speech disfluencies for TTS. Speech disfluencies are among the most pervasive phenomena in spontaneous speech, so being able to generate them automatically is crucial for more expressive synthetic speech. The proposed approach provides the advantage of generating several types of disfluencies: pauses, repetitions and revisions. To achieve this task, we formalize the problem as a theoretical process in which transformation functions are iteratively composed. We present a first implementation of the proposed process using CRFs and language models, before conducting objective and perceptual evaluations. These experiments lead to the conclusion that our proposition is effective at generating disfluencies, and they highlight perspectives for future improvements.
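The iterative composition of transformation functions described above can be sketched directly. The toy functions below insert a pause, a repetition, and a revision into a word sequence; their insertion decisions are hard-coded stand-ins for what, in the thesis, is decided by CRFs and language models.

```python
# An illustrative sketch of disfluency generation as iterated composition of
# transformation functions, in the spirit of the process described above. The
# insertion decisions here are hard-coded toys; in the thesis they are made by
# CRFs and language models.
from functools import reduce

def insert_pause(words):
    # Toy rule: insert a filled pause after the first word.
    return words[:1] + ["euh"] + words[1:]

def insert_repetition(words):
    # Toy rule: repeat the first word.
    return words[:1] + words

def insert_revision(words):
    # Toy rule: start, break off, and revise the beginning of the utterance.
    return [words[0], "--", "I mean,"] + words

def compose(transforms, words):
    """Apply disfluency transforms left to right, each on the previous output."""
    return reduce(lambda ws, f: f(ws), transforms, words)

utterance = "the train leaves at noon".split()
disfluent = compose([insert_repetition, insert_pause], utterance)
print(" ".join(disfluent))  # "the euh the train leaves at noon"
```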
Style APA, Harvard, Vancouver, ISO itp.
50

Micallef, Paul. "A text to speech synthesis system for Maltese". Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/842702/.

Pełny tekst źródła
Streszczenie:
The subject of this thesis covers a considerably varied multidisciplinary area that needs to be addressed to achieve a high-quality text-to-speech synthesis system in any language. This is the first time such a system has been built for Maltese, and there was therefore the additional problem of having no computerised sources or corpora. However, many problems and much of the system design are common to all languages. This thesis focuses on two general problems. The first is the automatic labelling of phonemic data, which is crucial for setting up Maltese speech corpora that can in turn be used to improve the system. A novel way of achieving such automatic segmentation was investigated, using a mixed parameter model with maximum likelihood training of the first derivative of the features across a set of phonetic class boundaries. This was found to give good results even for continuous speech, provided that a phonemic labelling of the text is available. A second general problem is that of segment concatenation, since the end and beginning of subsequent diphones can have mismatches in amplitude, frequency, phase and spectral envelope. The use of intermediate frames, built up from the last and first frames of two concatenated diphones, to achieve smoother continuity was analysed, both in time and in frequency. The use of wavelet theory for separating the spectral envelope from the excitation was also investigated. The linguistic system modules were built for this thesis. In particular, a rule-based grapheme-to-phoneme conversion system that is serial rather than hierarchical was developed. The morphological analysis required the design of a system that allowed two dissimilar lexical structures (Semitic and Romance) to be integrated into one overall morphological analyser. Appendices with detailed rules of the linguistic modules developed are included at the back. The present system, while giving satisfactory intelligibility and the capability of modifying duration, does not yet include a prosodic module.
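The intermediate-frame smoothing described above amounts to interpolating between the last frame of one diphone and the first frame of the next. A minimal sketch over spectral feature vectors follows; treating frames as feature vectors and using linear interpolation are assumptions made here for illustration.

```python
# A minimal sketch of concatenation smoothing with intermediate frames: the
# last frame of one diphone and the first frame of the next are linearly
# interpolated to bridge the join. Treating frames as spectral-envelope
# feature vectors and using linear interpolation are assumptions made here
# for illustration.
import numpy as np

def intermediate_frames(last_frame, first_frame, n=3):
    """Build n frames that move gradually from last_frame to first_frame."""
    alphas = np.linspace(0.0, 1.0, n + 2)[1:-1]   # exclude the endpoints
    return [(1 - a) * last_frame + a * first_frame for a in alphas]

def concatenate_diphones(d1, d2, n=3):
    """Join two diphones (arrays of feature frames) with a smoothed bridge."""
    bridge = intermediate_frames(d1[-1], d2[0], n)
    return np.vstack([d1, bridge, d2])

# Toy diphones: 10 frames each of a 12-dimensional spectral feature vector.
rng = np.random.default_rng(0)
d1, d2 = rng.random((10, 12)), rng.random((10, 12))
joined = concatenate_diphones(d1, d2)
print(joined.shape)  # (23, 12): 10 + 3 intermediate + 10 frames
```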
Style APA, Harvard, Vancouver, ISO itp.