Theses on the topic "Speech synthesis"
Consult the top 50 theses for your research on the topic "Speech synthesis".
Explore theses on a wide variety of disciplines and organize your bibliography correctly.
Donovan, R. E. "Trainable speech synthesis". Thesis, University of Cambridge, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.598598.
Greenwood, Andrew Richard. "Articulatory speech synthesis". Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386773.
Tsukanova, Anastasiia. "Articulatory speech synthesis". Electronic Thesis or Diss., Université de Lorraine, 2019. http://www.theses.fr/2019LORR0166.
The thesis is set in the domain of articulatory speech synthesis and consists of three major parts: the first two are dedicated to the development of two articulatory speech synthesizers, and the third addresses how the two can be related to each other. The first is a rule-based approach to articulatory speech synthesis that aims at comprehensive control over the articulators (the jaw, the tongue, the lips, the velum, the larynx and the epiglottis). This approach used a dataset of static mid-sagittal magnetic resonance imaging (MRI) captures showing blocked articulation of French vowels and a set of consonant-vowel syllables; that dataset was encoded with a PCA-based vocal tract model. The system then comprised several components: using the recorded articulatory configurations to drive a rule-based articulatory speech synthesizer as a source of target positions to attain (the main contribution of this first part); adjusting the obtained vocal tract shapes from the phonetic perspective; and running an acoustic simulation unit to obtain the sound. The results of this synthesis were evaluated visually, acoustically and perceptually, and the problems encountered were broken down by their origin: the dataset, its modeling, the algorithm for managing the vocal tract shapes, their translation to area functions, and the acoustic simulation. We concluded that, among our test examples, the articulatory strategies for vowels and stops are most correct, followed by those of nasals and fricatives. The second approach started from a baseline deep feed-forward neural network speech synthesizer trained with the standard Merlin recipe on the audio recorded during real-time MRI (RT-MRI) acquisitions: denoised (yet still containing a considerable amount of MRI machine noise) speech in French and force-aligned state labels encoding phonetic and linguistic information.
This synthesizer was augmented with eight parameters representing articulatory information (the lip opening and protrusion, and the distances between the tongue and the velum, between the velum and the pharyngeal wall, and between the tongue and the pharyngeal wall) that were automatically extracted from the captures and aligned with the audio signal and the linguistic specification. The jointly synthesized speech and articulatory sequences were evaluated objectively with dynamic time warping (DTW) distance, mean mel-cepstral distortion (MCD), band aperiodicity prediction error (BAP), and three measures for F0: root mean square error (RMSE), correlation coefficient (CORR) and frame-level voiced/unvoiced error (V/UV). The consistency of the articulatory parameters with the phonetic labels was analyzed as well. I concluded that the generated articulatory parameter sequences matched the original ones acceptably closely, despite struggling more to attain contact between the articulators, and that the addition of articulatory parameters did not hinder the original acoustic model. The two approaches are linked through their use of two different kinds of MRI speech data. This motivated investigating whether coarticulation-aware targets such as those available in the static case are present or absent in the real-time data. To compare static and real-time MRI captures, the measures of structural similarity, Earth mover's distance, and SIFT were used; having analyzed these measures for validity and consistency, I qualitatively and quantitatively studied their temporal behavior, interpreted it and analyzed the identified similarities. I concluded that SIFT and structural similarity did capture some articulatory information and that their behavior, overall, validated the static MRI dataset. [...]
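Several of the objective measures listed above are simple to compute from frame-aligned parameter sequences. A minimal sketch, assuming hypothetical NumPy arrays of mel-cepstra (frames × coefficients) and F0 tracks in which 0 marks an unvoiced frame:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    # Mean frame-wise MCD in dB over aligned frames, skipping the 0th
    # (energy) coefficient, as is conventional.
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def f0_metrics(f0_ref, f0_syn):
    # F0 > 0 marks a voiced frame; RMSE and CORR are computed only on frames
    # voiced in both tracks, V/UV is the frame-level voicing disagreement rate.
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    vuv_error = float(np.mean(v_ref != v_syn))
    both = v_ref & v_syn
    rmse = float(np.sqrt(np.mean((f0_ref[both] - f0_syn[both]) ** 2)))
    corr = float(np.corrcoef(f0_ref[both], f0_syn[both])[0, 1])
    return rmse, corr, vuv_error
```

DTW distance would additionally require aligning sequences of unequal length before applying such frame-wise measures.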
Sun, Felix (Felix W.). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition". Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-63).
The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
by Felix Sun.
M. Eng.
Morton, K. "Speech production and synthesis". Thesis, University of Essex, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.377930.
Jin, Yi-Xuan. "A high speed digital implementation of LPC speech synthesizer using the TMS320". Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/275309.
Wong, Chun-ho Eddy. "Reliability of rating synthesized hypernasal speech signals in connected speech and vowels". Click to view the E-thesis via HKU Scholars Hub, 2007. http://lookup.lib.hku.hk/lookup/bib/B4200617X.
"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, June 30, 2007." Includes bibliographical references (p. 28-30). Also available in print.
Peng, Antai. "Speech expression modeling and synthesis". Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13560.
Brierton, Richard A. "Variable frame-rate speech synthesis". Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.357363.
Klompje, Gideon. "A parametric monophone speech synthesis system". Thesis, Link to online version, 2006. http://hdl.handle.net/10019/561.
Moers-Prinz, Donata. "Fast Speech in Unit Selection Speech Synthesis". Bielefeld: Universitätsbibliothek Bielefeld, 2020. http://d-nb.info/1219215201/34.
Andersson, Johan Sebastian. "Synthesis and evaluation of conversational characteristics in speech synthesis". Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8891.
Liu, Zhu Lin. "Speech synthesis via adaptive Fourier decomposition". Thesis, University of Macau, 2011. http://umaclib3.umac.mo/record=b2493215.
Jauk, Igor. "Unsupervised learning for expressive speech synthesis". Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/460814.
Nowadays, especially with the rise of neural networks, speech synthesis is almost entirely data-driven. The goal of this thesis is to provide automatic, unsupervised, data-driven training methods for expressive speech synthesis. Compared to "neutral" synthesis systems, reliable training data for expressive synthesis is harder to find, despite the wide availability of resources such as the internet. The main difficulty stems from the nature of expressive speech, which is highly dependent on the speaker and the situation and therefore shows many acoustic variations. The consequences are, first, that it is very difficult to define labels that reliably identify all the details of expressive speech. The typical definition of six basic emotions is a simplification that has inexcusable consequences when dealing with data outside the laboratory. Second, even if a label set could be defined, apart from the enormous manual effort it would require, it would be very difficult to obtain enough training data for each variant while respecting all its nuances. The goal of this thesis is therefore to study automatic, label-free training methods for expressive speech synthesis and to develop applications based on these proposals. The approach covers the acoustic and semantic domains. In the acoustic domain, the goal is to find acoustic features suitable for representing expressive speech, especially in the multi-speaker domain, moving closer to real, uncontrolled data. To this end, the perspective shifts away from traditional, mainly prosody-based features towards features obtained through factor analysis, attempting to identify the principal components of expressivity, namely i-vectors.
The results show that a combination of traditional and i-vector-based features performs better in the task of unsupervised clustering of expressive speech than the traditional features alone, and even better than large state-of-the-art feature sets in the multi-speaker domain. Once defined, the feature set is used for unsupervised clustering of an audiobook, training a voice from each cluster. The method was evaluated in an audiobook-editing application in which users employed the synthetic voices to create their own dialogues. The results obtained validate the proposal. In the editing application, users choose synthetic voices and assign them to sentences based on the characters and the expressivity. By involving the semantic domain, this assignment could be performed automatically. In this part of the thesis, words and sentences are represented numerically in trainable vector spaces, called embeddings, which can be used to predict expressivity. This method not only allows automatic reading of text passages, taking the local context into account, but can also be used as a semantic search tool for training data. Both applications were evaluated in a perceptual experiment demonstrating the potential of the proposed methodology. Finally, following the new trends in neural-network-based speech synthesis, an expressive voice synthesis system using this technology was developed and evaluated. Emotionally motivated semantic representations of text, called "sentiment embeddings", trained on movie reviews, are used as additional input to the system. The neural network now learns not only from segmental and contextual information but also from this sentiment representation, which especially affects prosody.
The system was evaluated in two perceptual experiments, demonstrating a preference for the system that includes this new representation.
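The feature-combination-plus-clustering step described in this abstract can be sketched with a plain k-means over concatenated features. The data below is synthetic and the feature dimensions (3 prosodic, 5 i-vector-like) are hypothetical; a real system would extract prosodic statistics and i-vectors from recorded utterances:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    # Plain k-means: random initial centers, then alternate assign/update.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic per-utterance features: 3 prosodic dimensions plus 5 i-vector-like
# dimensions in which two expressive styles are well separated (hypothetical).
prosodic = rng.normal(size=(40, 3))
ivectors = np.concatenate([rng.normal(-2.0, 0.3, size=(20, 5)),
                           rng.normal(+2.0, 0.3, size=(20, 5))])
X = np.concatenate([prosodic, ivectors], axis=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before clustering
labels = kmeans(X, k=2)
```

In the audiobook application, each resulting cluster would supply the training material for one synthetic voice.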
Macon, Michael W. "Speech synthesis based on sinusoidal modeling". Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13904.
Wang, Min. "Formant-based synthesis of Chinese speech". Thesis, McGill University, 1986. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=66001.
Rahim, Mazin. "Neural networks in articulatory speech synthesis". Thesis, University of Liverpool, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.317191.
Fekkai, Souhila. "Fractal based speech recognition and synthesis". Thesis, De Montfort University, 2002. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.269246.
Wright, Richard Douglas. "An investigation of speech synthesis parameters". Thesis, University of Southampton, 1988. https://eprints.soton.ac.uk/52279/.
Campbell, Wilhelm. "Multi-level speech timing control". Thesis, University of Sussex, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.283832.
Moakes, Paul Alan. "On-line adaptive nonlinear modelling of speech". Thesis, University of Sheffield, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364189.
Shannon, Sean Matthew. "Probabilistic acoustic modelling for parametric speech synthesis". Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708415.
Cummings, Kathleen E. "Analysis, synthesis, and recognition of stressed speech". Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/15673.
Varga, A. P. "Multipulse excited linear predictive analysis in speech coding and constructive speech synthesis". Thesis, University of Cambridge, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.372909.
Mazel, David S. "Sinusoidal modeling of speech". Thesis, Georgia Institute of Technology, 1986. http://hdl.handle.net/1853/13873.
Merritt, Thomas. "Overcoming the limitations of statistical parametric speech synthesis". Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/22071.
Engwall, Olov. "Tongue Talking : Studies in Intraoral Speech Synthesis". Doctoral thesis, KTH, Tal, musik och hörsel, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3380.
Sakai, Shinsuke. "A Probabilistic Approach to Concatenative Speech Synthesis". 京都大学 (Kyoto University), 2012. http://hdl.handle.net/2433/152508.
Hassanain, Elham. "Novel cepstral techniques applied to speech synthesis". Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/842745/.
Vepa, Jithendra. "Join cost for unit selection speech synthesis". Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/1452.
Watts, Oliver Samuel. "Unsupervised learning for text-to-speech synthesis". Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7982.
Vine, Daniel Samuel Gordon. "Time-domain concatenative text-to-speech synthesis". Thesis, Bournemouth University, 1998. http://eprints.bournemouth.ac.uk/351/.
Edge, James D. "Techniques for the synthesis of visual speech". Thesis, University of Sheffield, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.419276.
Solewicz, Jose Alberto. "Text-to-speech synthesis for Brazilian Portuguese". Pontifícia Universidade Católica do Rio de Janeiro, 1993. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=8690@1.
This work presents an unrestricted text-to-speech synthesis system for Brazilian Portuguese. The system is based on the concatenation by rules of previously coded speech units. An extremely reduced set of synthesis units (149) is proposed. This set is mostly comprised of consonant-vowel (CV) transitions, which represent crucial acoustic segments in the speech production process. Production of highly intelligible speech is shown to be possible through concatenation of these units. A CELP model is also proposed as a compression and synthesis structure, which includes the adaptations necessary to modify the speech prosody during the decoding phase. Subjective tests showed that speech synthesized through the proposed CELP model is judged superior to that obtained through an LPC vocoder (mono-pulse/noise excited), which is traditionally used in text-to-speech synthesis systems.
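The concatenation-by-rules idea can be sketched in a few lines: look up prerecorded CV units and join them with a short cross-fade at each boundary. Everything here is hypothetical for illustration; the inventory holds synthetic sine bursts rather than coded speech units, and a real system would additionally modify prosody during CELP decoding:

```python
import numpy as np

SR = 16000  # sample rate in Hz

def fake_unit(freq, dur=0.1):
    # Stand-in for a recorded CV unit: a 100 ms sine burst.
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

# Hypothetical unit inventory keyed by CV syllable.
inventory = {"ka": fake_unit(220), "sa": fake_unit(330), "ta": fake_unit(440)}

def concatenate(units, xfade=0.01):
    # Join units with a 10 ms linear cross-fade at each boundary.
    n = int(SR * xfade)
    ramp = np.linspace(0.0, 1.0, n)
    out = inventory[units[0]].copy()
    for name in units[1:]:
        nxt = inventory[name].copy()
        out[-n:] = out[-n:] * (1 - ramp) + nxt[:n] * ramp  # cross-fade joint
        out = np.concatenate([out, nxt[n:]])
    return out

wave = concatenate(["ka", "sa", "ta"])
```

With 0.1 s units at 16 kHz and a 10 ms cross-fade, the three-unit utterance above comes to 1600 + 2 × 1440 = 4480 samples.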
Hardwick, John C. (John Clark). "A high quality speech analysis/synthesis system". Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/14901.
Halabi, Nawar. "Modern standard Arabic phonetics for speech synthesis". Thesis, University of Southampton, 2016. https://eprints.soton.ac.uk/409695/.
Beněk, Tomáš. "Implementing and Improving a Speech Synthesis System". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236079.
Texto completoNäslund, Per. "Artificial Neural Networks in Swedish Speech Synthesis". Thesis, KTH, Tal-kommunikation, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239350.
Texto completoTalsynteser, också kallat TTS (text-to-speech) används i stor utsträckning inom smarta assistenter och många andra applikationer. Samtida forskning applicerar maskininlärning och artificiella neurala nätverk (ANN) för att utföra talsyntes. Det har visats i studier att dessa system presterar bättre än de äldre konkatenativa och parametriska metoderna. I den här rapporten utforskas ANN-baserade TTS-metoder och en av metoderna implementeras för det svenska språket. Den använda metoden kallas “Tacotron” och är ett första steg mot end-to-end TTS baserat på neurala nätverk. Metoden binder samman flertalet olika ANN-tekniker. Det resulterande systemet jämförs med en parametriskt TTS genom ett graderat preferens-test som innefattar 20 svensktalande försökspersoner. En statistiskt säkerställd preferens för det ANN- baserade TTS-systemet fastställs. Försökspersonerna indikerar att det ANN-baserade TTS-systemet presterar bättre än det parametriska när det kommer till ljudkvalitet och naturlighet men visar brister inom tydlighet.
Hagrot, Joel. "A Data-Driven Approach For Automatic Visual Speech In Swedish Speech Synthesis Applications". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-246393.
This project investigates how artificial neural networks can be used for visual speech synthesis. The purpose was to develop a framework for animated chatbots in Swedish. A survey of the literature concluded that the state-of-the-art method was to use artificial neural networks with either audio or phoneme sequences as input. Three surveys were conducted, both in the context of the final product and in a more neutral context with less post-processing. They compared the ground truth, recorded with the iPhone X depth-sensing camera, with both the neural network model and a basic baseline model. The statistical analysis used mixed-effects models to find statistically significant differences in the results. The temporal dynamics were also analyzed. The results show that a relatively simple neural network could learn to generate blendshape sequences from phoneme sequences with satisfactory results, except that requirements such as lip closure for certain consonants were not always met. The problems with consonants could to some extent also be seen in the ground truth. This could be solved by means of consonant-specific post-processing, which made the neural network's animations indistinguishable from the ground truth while also being perceived as better than the baseline model's animations. In summary, the network learned vowels well but would probably have needed more data to satisfactorily meet the requirements for certain consonants. For the final product, these requirements can still be met with the help of consonant-specific post-processing.
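The kind of rule-based baseline the network is compared against can be caricatured in a few lines: one viseme keyframe per phone, linearly interpolated at the animation frame rate. The phoneme-to-viseme table and the blendshape channel names below are hypothetical:

```python
# Hypothetical viseme keyframes per phone; "M" illustrates the lip-closure
# requirement for bilabial consonants discussed in the abstract.
VISEMES = {
    "AA": {"jawOpen": 0.9, "lipsClosed": 0.0},
    "M":  {"jawOpen": 0.0, "lipsClosed": 1.0},
    "S":  {"jawOpen": 0.2, "lipsClosed": 0.0},
}

def animate(phones, fps=60):
    # phones: list of (phone, duration_seconds); one keyframe per phone,
    # linearly interpolated towards the next phone's keyframe.
    frames = []
    for (ph, dur), (nxt, _) in zip(phones, phones[1:] + [phones[-1]]):
        n = max(1, round(dur * fps))
        for i in range(n):
            t = i / n
            frames.append({k: (1 - t) * VISEMES[ph][k] + t * VISEMES[nxt][k]
                           for k in VISEMES[ph]})
    return frames

frames = animate([("M", 0.1), ("AA", 0.2), ("S", 0.1)])
```

A learned model replaces the fixed table and linear ramps with coarticulation-aware trajectories, which is where the neural network's advantage lies.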
Gordon, Jane S. "Use of synthetic speech in tests of speech discrimination". PDXScholar, 1985. https://pdxscholar.library.pdx.edu/open_access_etds/3443.
Chung, Jae H. "A new homomorphic vocoder framework using analysis-by-synthesis excitation analysis". Diss., Georgia Institute of Technology, 1991. http://hdl.handle.net/1853/15471.
Alissali, Mamoun. "Architecture logicielle pour la synthèse multilingue de la parole". Grenoble INPG, 1993. http://www.theses.fr/1993INPG0037.
Kain, Alexander Blouke. "High resolution voice transformation". 2001. Full text open access at: http://content.ohsu.edu/u?/etd,189.
Peters, Richard Alan II. "A linear prediction coding model of speech (synthesis, LPC, computer, electronic)". Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/291240.
Texto completoStrömbergsson, Sofia. "The /k/s, the /t/s, and the inbetweens : Novel approaches to examining the perceptual consequences of misarticulated speech". Doctoral thesis, KTH, Tal-kommunikation, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-143102.
Low, Phuay Hui. "Statistical analysis, modelling and synthesis of voice for text to speech synthesis". Thesis, Brunel University, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.401342.
Bulyko, Ivan. "Flexible speech synthesis using weighted finite-state transducers /". Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6081.
Crosmer, Joel R. "Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients". Diss., Georgia Institute of Technology, 1985. http://hdl.handle.net/1853/15739.
Texto completoQader, Raheel. "Pronunciation and disfluency modeling for expressive speech synthesis". Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S076/document.
In numerous domains, the use of synthetic speech is conditioned upon the ability of speech synthesis systems to generate natural and expressive speech. In this context, we address the problem of expressivity in TTS by incorporating two phenomena with a high impact on speech: pronunciation variants and speech disfluencies. In the first part of this thesis, we present a new pronunciation variant generation method which works by adapting standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and acoustic features and in using a probabilistic machine learning framework, namely conditional random fields (CRFs) and language models. Extensive experiments on the Buckeye corpus demonstrate the effectiveness of this approach through objective and subjective evaluations. Listening tests on synthetic speech show that adapted pronunciations are judged as more spontaneous than standard ones, as well as than those realized by real speakers. Furthermore, we show that the method can be extended to other adaptation tasks, for instance to solve the problem of inconsistency between the phoneme sequences handled in TTS systems. The second part of this thesis explores a novel approach to the automatic generation of speech disfluencies for TTS. Speech disfluencies are among the most pervasive phenomena in spontaneous speech, so being able to generate them automatically is crucial for more expressive synthetic speech. The proposed approach offers the advantage of generating several types of disfluencies: pauses, repetitions and revisions. To achieve this task, we formalize the problem as a theoretical process in which transformation functions are iteratively composed. We present a first implementation of the proposed process using CRFs and language models, before conducting objective and perceptual evaluations.
These experiments lead to the conclusion that our proposal is effective at generating disfluencies, and highlight perspectives for future improvements.
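The iterative composition of transformation functions described above can be illustrated with a toy sketch. Here the insertion points for pauses and repetitions are chosen at random rather than predicted by CRFs and rescored by language models as in the thesis; the token names and function signatures are hypothetical:

```python
import random

def insert_pause(words, rng):
    # Insert a pause token at a random position.
    i = rng.randrange(len(words) + 1)
    return words[:i] + ["<pause>"] + words[i:]

def insert_repetition(words, rng):
    # Repeat a randomly chosen word, yielding an adjacent duplicate.
    i = rng.randrange(len(words))
    return words[:i + 1] + [words[i]] + words[i + 1:]

def compose(words, transforms, rng):
    # Iteratively compose transformation functions, one disfluency per pass.
    for t in transforms:
        words = t(words, rng)
    return words

rng = random.Random(7)
fluent = ["i", "want", "to", "book", "a", "flight"]
disfluent = compose(fluent, [insert_pause, insert_repetition], rng)
```

Revisions would follow the same pattern with a third transformation that replaces a span by a corrected one; the thesis's contribution is learning where each transformation should apply.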
Micallef, Paul. "A text to speech synthesis system for Maltese". Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/842702/.