Academic literature on the topic 'Statistical Parametric Speech Synthesizer'


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Statistical Parametric Speech Synthesizer.'


Journal articles on the topic "Statistical Parametric Speech Synthesizer"

1

Szklanny, Krzysztof, and Jakub Lachowicz. "Implementing a Statistical Parametric Speech Synthesis System for a Patient with Laryngeal Cancer." Sensors 22, no. 9 (April 21, 2022): 3188. http://dx.doi.org/10.3390/s22093188.

Abstract:
Total laryngectomy, i.e., the surgical removal of the larynx, has a profound influence on a patient’s quality of life. The procedure results in a loss of natural voice, which in effect constitutes a significant socio-psychological problem for the patient. The main aim of the study was to develop a statistical parametric speech synthesis system for a patient with laryngeal cancer, on the basis of the patient’s speech samples recorded shortly before the surgery, and to check whether it was possible to generate speech quality close to that of the original recordings. The recording made use of a representative corpus of the Polish language, consisting of 2150 sentences. The recorded voice proved to indicate dysphonia, which was confirmed by the auditory-perceptual RBH scale (roughness, breathiness, hoarseness) and by acoustical analysis using AVQI (the Acoustic Voice Quality Index). The speech synthesis model was trained using the Merlin repository. Twenty-five experts participated in the MUSHRA listening tests, rating the synthetic voice at 69.4 relative to the professional voice-over talent recording, on a 0–100 scale, which is a very good result. The authors compared the quality of the synthetic voice to another model of synthetic speech trained with the same corpus, but where a voice-over talent provided the recorded speech samples. The same experts rated that voice at 63.63, which means the synthetic voice of the patient with laryngeal cancer obtained a higher score than that of the talent-voice recordings. As such, the method enabled the creation of a statistical parametric speech synthesizer for patients awaiting total laryngectomy. As a result, the solution would improve the quality of life as well as the mental wellbeing of the patient.
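The MUSHRA scores reported above (69.4 and 63.63 on a 0–100 scale) are means over the 25 listeners' ratings. As a minimal illustration of how such per-listener ratings are typically aggregated into a system score, here is a hedged Python sketch; the `mushra_summary` helper and the five ratings are invented for illustration and are not the study's data.

```python
import statistics

def mushra_summary(ratings):
    """Aggregate per-listener MUSHRA ratings (0-100) into a mean score
    and a rough 95% confidence half-width (normal approximation)."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    if n > 1:
        half_width = 1.96 * statistics.stdev(ratings) / n ** 0.5
    else:
        half_width = 0.0
    return mean, half_width

# Illustrative ratings from a hypothetical 5-listener panel.
mean, hw = mushra_summary([72, 65, 70, 68, 71])
print(f"{mean:.1f} +/- {hw:.1f}")  # → 69.2 +/- 2.4
```

Real MUSHRA analyses also include hidden-reference and anchor checks per the test protocol; those are omitted here.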
2

Chee Yong, Lau, Oliver Watts, and Simon King. "Combining Lightly-supervised Learning and User Feedback to Construct and Improve a Statistical Parametric Speech Synthesizer for Malay." Research Journal of Applied Sciences, Engineering and Technology 11, no. 11 (December 15, 2015): 1227–32. http://dx.doi.org/10.19026/rjaset.11.2229.

3

Coto-Jiménez, Marvin. "Discriminative Multi-Stream Postfilters Based on Deep Learning for Enhancing Statistical Parametric Speech Synthesis." Biomimetics 6, no. 1 (February 7, 2021): 12. http://dx.doi.org/10.3390/biomimetics6010012.

Abstract:
Statistical parametric speech synthesis based on Hidden Markov Models has been an important technique for the production of artificial voices, due to its ability to produce results with high intelligibility and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite the progress, the quality of the results, mainly based on Hidden Markov Models (HMM), does not reach that of the predominant approaches based on unit selection of speech segments or deep learning. One of the proposals to improve the quality of HMM-based speech has been the incorporation of postfiltering stages, which aim to increase the quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices with the application of discriminative postfilters, with several long short-term memory (LSTM) deep neural networks. Our motivation stems from modeling a specific mapping from synthesized to natural speech on those segments corresponding to voiced or unvoiced sounds, due to the different qualities of those sounds and how HMM-based voices can present distinct degradation on each one. The paper analyses the discriminative postfilters obtained using five voices, evaluated using three objective measures, including Mel cepstral distance, and subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and the non-discriminative postfilters.
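The Mel cepstral distance used for objective evaluation in the entry above is commonly computed with the standard Mel cepstral distortion (MCD) formula, MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²) dB per frame, conventionally excluding the 0th (energy) coefficient. A minimal sketch; the helper name is ours and the paper's exact variant may differ.

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Frame-level Mel cepstral distortion in dB between two cepstral
    vectors, using the common 10/ln(10) * sqrt(2 * sum of squared
    differences) formula. The 0th (energy) coefficient is
    conventionally excluded by the caller."""
    if len(ref) != len(syn):
        raise ValueError("cepstra must have the same dimension")
    sq = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

# Identical frames give zero distortion.
print(mel_cepstral_distortion([1.0, 0.5], [1.0, 0.5]))  # → 0.0
```

An utterance-level score is usually the mean of this quantity over time-aligned frames.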
4

Coto-Jiménez, Marvin. "Improving Post-Filtering of Artificial Speech Using Pre-Trained LSTM Neural Networks." Biomimetics 4, no. 2 (May 28, 2019): 39. http://dx.doi.org/10.3390/biomimetics4020039.

Abstract:
Several researchers have contemplated deep learning-based post-filters to increase the quality of statistical parametric speech synthesis; these perform a mapping of the synthetic speech to the natural speech, considering the different parameters separately and trying to reduce the gap between them. Long Short-term Memory (LSTM) Neural Networks have been applied successfully for this purpose, but there are still many aspects to improve in the results and in the process itself. In this paper, we introduce a new pre-training approach for the LSTM, with the objective of enhancing the quality of the synthesized speech, particularly in the spectrum, in a more efficient manner. Our approach begins with an auto-associative training of one LSTM network, which is used as an initialization for the post-filters. We show the advantages of this initialization for enhancing the Mel-Frequency Cepstral parameters of synthetic speech. Results show that the initialization achieves better enhancement of the statistical parametric speech spectrum in most cases when compared to the common random initialization of the networks.
5

Trinh, Son, and Kiem Hoang. "HMM-Based Vietnamese Speech Synthesis." International Journal of Software Innovation 3, no. 4 (October 2015): 33–47. http://dx.doi.org/10.4018/ijsi.2015100103.

Abstract:
In this paper, improving the naturalness of HMM-based speech synthesis for the Vietnamese language is described. In this synthesis method, trajectories of speech parameters are generated from the trained Hidden Markov models, and a final speech waveform is synthesized from those speech parameters. The main objective of the development is to achieve maximum naturalness in the output speech through three key points. Firstly, the system uses a high-quality recorded Vietnamese speech database appropriate for training, especially in the statistical parametric model approach. Secondly, prosodic information such as tone, POS (part of speech) and features based on the characteristics of the Vietnamese language are added to ensure the quality of the synthetic speech. Thirdly, the system uses STRAIGHT, which has shown its ability to produce high-quality voice manipulation and has been successfully incorporated into HMM-based speech synthesis. The results show that the speech produced by our system achieves the best result when compared with other Vietnamese TTS systems trained from the same speech data.
6

Zen, Heiga, Keiichi Tokuda, and Alan W. Black. "Statistical parametric speech synthesis." Speech Communication 51, no. 11 (November 2009): 1039–64. http://dx.doi.org/10.1016/j.specom.2009.04.004.

7

Ekpenyong, Moses, Eno-Abasi Urua, Oliver Watts, Simon King, and Junichi Yamagishi. "Statistical parametric speech synthesis for Ibibio." Speech Communication 56 (January 2014): 243–51. http://dx.doi.org/10.1016/j.specom.2013.02.003.

8

Chen, Sin‐Horng, Saga Chang, and Su‐Min Lee. "A statistical model based fundamental frequency synthesizer for Mandarin speech." Journal of the Acoustical Society of America 92, no. 1 (July 1992): 114–20. http://dx.doi.org/10.1121/1.404276.

9

Takahashi, Satoshi, Yasuaki Satoh, Takeshi Ohno, and Katsuhiko Shirai. "Statistical modeling of dynamic spectral patterns for a speech synthesizer." Journal of the Acoustical Society of America 84, S1 (November 1988): S23. http://dx.doi.org/10.1121/1.2026230.

10

King, Simon. "An introduction to statistical parametric speech synthesis." Sadhana 36, no. 5 (October 2011): 837–52. http://dx.doi.org/10.1007/s12046-011-0048-y.


Dissertations / Theses on the topic "Statistical Parametric Speech Synthesizer"

1

Hu, Qiong. "Statistical parametric speech synthesis based on sinusoidal models." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28719.

Abstract:
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal models. Vocoders play a crucial role during the parametrisation and reconstruction process, so we first conduct an experimental comparison of a broad range of the leading vocoder types. Although our study shows that for analysis / synthesis, sinusoidal models with complex amplitudes can generate higher-quality speech than source-filter ones, component sinusoids are correlated with each other, and the number of parameters is also high and varies in each frame, which constrains their application to statistical speech synthesis. Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease and fix the number of components typically used in the standard sinusoidal model. Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system (HTS), two strategies for modelling sinusoidal parameters have been compared. In the first method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM are statistically modelled directly. In the second method (INT parameterisation), we convert both static amplitude and dynamic slope from all the harmonics of a signal, which we term the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients (RDC)) for modelling. Our results show that HDM with intermediate parameters can generate comparable quality to STRAIGHT. As correlations between features in the dynamic model cannot be modelled satisfactorily by a typical HMM-based system with diagonal covariance, we have applied and tested a deep neural network (DNN) for modelling features from these two methods. To fully exploit DNN capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling and waveform generation.
For DNN training, we propose to use multi-task learning to model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We conclude from our results that sinusoidal models are indeed highly suited for statistical parametric synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based equivalent when used in conjunction with DNNs. To further improve the voice quality, phase features generated from the proposed vocoder also need to be parameterised and integrated into statistical modelling. Here, an alternative statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued back-propagation algorithm using a logarithmic minimisation criterion which includes both amplitude and phase errors is used as a learning rule. Three parameterisation methods are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued amplitude with minimum phase and complex-valued amplitude with mixed phase. Our results show the potential of using CVNNs for modelling both real and complex-valued acoustic features. Overall, this thesis has established competitive alternative vocoders for speech parametrisation and reconstruction. The utilisation of proposed vocoders on various acoustic models (HMM / DNN / CVNN) clearly demonstrates that it is compelling to apply them for the parametric statistical speech synthesis.
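The sinusoidal models discussed in the thesis above represent each speech frame as a sum of sinusoids with individual amplitudes, frequencies, and phases. A textbook-level sketch of that synthesis equation follows; this is not the thesis's PDM/HDM vocoder, and all parameter values are illustrative.

```python
import math

def synth_frame(sinusoids, n_samples, sample_rate):
    """Render one frame of a basic sinusoidal model:
    s[n] = sum_k A_k * cos(2*pi*f_k*n/fs + phi_k).
    `sinusoids` is a list of (amplitude, frequency_hz, phase) tuples."""
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        frame.append(sum(a * math.cos(2 * math.pi * f * t + p)
                         for a, f, p in sinusoids))
    return frame

# A 200 Hz fundamental plus two weaker harmonics (illustrative values).
harmonics = [(1.0, 200.0, 0.0), (0.5, 400.0, 0.0), (0.25, 600.0, 0.0)]
frame = synth_frame(harmonics, n_samples=160, sample_rate=16000)
print(frame[0])  # all components at phase 0 → 1.0 + 0.5 + 0.25 = 1.75
```

In a full vocoder, per-frame parameters are interpolated and frames are overlap-added; that machinery is omitted here.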
2

Merritt, Thomas. "Overcoming the limitations of statistical parametric speech synthesis." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/22071.

Abstract:
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses in the literature for the causes of reduced synthesis quality, and the improvements subsequently required, in HMM speech synthesis. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However, this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no way of knowing whether a real underlying issue has been fixed or merely a more minor one. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis.
3

Dall, Rasmus. "Statistical parametric speech synthesis using conversational data and phenomena." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/29016.

Abstract:
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation, something which is particularly poor for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of spontaneous speech based voices still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for the mixing of standard prompts with spontaneous utterances in order to retain the higher quality of standard speech based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage.
The various approaches are evaluated and their improvements documented throughout the thesis; at the end, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the combination of all developed methods.
4

Fonseca, De Sam Bento Ribeiro Manuel. "Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31338.

Abstract:
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. 
In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
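One of the f0 representations mentioned in the thesis above is based on the discrete cosine transform: a short contour is summarised by its first few DCT coefficients. A self-contained sketch of a plain (unnormalised) DCT-II round trip; the specific variant and scaling here are our assumptions, and the thesis's exact parametrisation may differ.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a sequence, as used for compact
    parametrisation of short f0 contours."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

def idct2(coeffs):
    """Inverse of dct2 above (DCT-III with matching scaling)."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(coeffs[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                   for k in range(1, n))) * 2 / n for i in range(n)]

# An illustrative rising-falling f0 contour (Hz) over one syllable.
f0 = [110.0, 118.0, 126.0, 130.0, 127.0, 119.0, 112.0, 108.0]
coeffs = dct2(f0)
reconstructed = idct2(coeffs)
print(max(abs(a - b) for a, b in zip(f0, reconstructed)))  # ~0 (round trip)
```

Truncating `coeffs` to its first few entries before inverting yields the kind of smooth, low-dimensional contour approximation used in such representations.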
5

Hong, Jung. "Statistical Parametric Models and Inference for Biomedical Signal Processing: Applications in Speech and Magnetic Resonance Imaging." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10074.

Abstract:
In this thesis, we develop statistical methods for extracting significant information from biomedical signals. Biomedical signals are not only generated from a complex system but also affected by various random factors during their measurement. The biomedical signals may then be studied in two aspects: the observational noise that biomedical signals experience and the intrinsic nature that noise-free signals possess. We study Magnetic Resonance (MR) images and speech signals as applications in the one- and two-dimensional signal representation. In MR imaging, we study how observational noise can be effectively modeled and then removed. Magnitude MR images suffer from Rician-distributed signal-dependent noise. Observing that the squared-magnitude MR image follows a scaled non-central Chi-square distribution on two degrees of freedom, we optimize the parameters involved in the proposed Rician-adapted Non-local Mean (RNLM) estimator by minimizing the Chi-square unbiased risk estimate in the minimum mean square error sense. A linear expansion of RNLMs is considered in order to achieve the global optimality of the parameters without data-dependency. Parallel computations and convolution operations are considered as acceleration techniques. Experiments show the proposed method compares favorably with benchmark denoising algorithms. Parametric modeling of noise-free signals is studied for robust speech applications. Voiced speech signals are often modeled with the harmonic model, whose fundamental frequency is commonly assumed to be a smooth function of time. As an important feature in various speech applications, pitch, the perceived tone, is obtained by way of estimating the fundamental frequency. In this thesis, two model-based pitch estimation schemes are introduced. In the first, an iterative Auto Regressive Moving Average technique estimates harmonically tied sinusoidal components in noisy speech signals.
Dynamic programming enforces the smoothness of the fundamental frequency. The second introduces the Continuous-time Voiced Speech (CVS) model, which models the smooth fundamental frequency as a linear combination of block-wise continuous polynomial bases. The model parameters are obtained via a convex optimization with constraints, providing an estimate of the instantaneous fundamental frequency. Experiments validate the robustness and accuracy of the proposed methods compared with some current state-of-the-art pitch estimation algorithms.
Engineering and Applied Sciences
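The thesis above builds model-based pitch estimators (ARMA- and CVS-based). As background only, a deliberately naive autocorrelation pitch estimator illustrates the underlying task of recovering the fundamental frequency from a harmonic signal; it is not the thesis's method, and the function name and constants are ours.

```python
import math

def estimate_f0(signal, sample_rate, f0_min=80.0, f0_max=400.0):
    """Naive autocorrelation pitch estimator: pick the lag with the
    highest normalised autocorrelation inside the plausible lag range.
    Silent (all-zero) frames are not handled in this sketch."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    energy = sum(s * s for s in signal)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max + 1, len(signal))):
        r = sum(signal[i] * signal[i + lag]
                for i in range(len(signal) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag

fs = 16000
# A synthetic voiced frame: 200 Hz fundamental plus one harmonic.
voiced = [math.sin(2 * math.pi * 200 * n / fs)
          + 0.4 * math.sin(2 * math.pi * 400 * n / fs)
          for n in range(640)]
print(round(estimate_f0(voiced, fs)))  # → 200
```

Real estimators add voicing decisions, octave-error handling, and temporal smoothing (the role dynamic programming plays in the thesis).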
6

Evrard, Marc. "Synthèse de parole expressive à partir du texte : Des phonostyles au contrôle gestuel pour la synthèse paramétrique statistique." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112202.

Abstract:
The subject of this thesis was the study and conception of a platform for expressive speech synthesis. The LIPS3 Text-to-Speech system — developed in the context of this thesis — includes a linguistic module and a parametric statistical module (built upon HTS and STRAIGHT). The system was based on a new single-speaker corpus, designed, recorded and annotated. The first study analyzed the influence of the precision of the training corpus phonetic labeling on the synthesis quality. It showed that statistical parametric synthesis is robust to labeling and alignment errors. This addresses the issue of variation in phonetic realizations for expressive speech. The second study presents an acoustico-phonetic analysis of the corpus, characterizing the expressive space used by the speaker to instantiate the instructions that described the different expressive conditions. Voice source parameters and articulatory settings were analyzed according to their phonetic classes, which allowed for a fine phonostylistic characterization. The third study focused on intonation and rhythm. Calliphony 2.0 is a real-time chironomic interface that controls the f0 and rhythmic parameters of prosody, using drawing/writing hand gestures with a stylus and a graphic tablet. These hand-controlled modulations are used to enhance the TTS output, producing speech that is more realistic, without degradation, as they are directly applied to the vocoder parameters. Intonation and rhythm stylization using this interface brings significant improvement to the prototypicality of expressivity, as well as to the general quality of synthetic speech. These studies show that parametric statistical synthesis, combined with a chironomic interface, offers an efficient solution for expressive speech synthesis, as well as a powerful tool for the study of prosody.
7

"Statistical Parametric Speech Synthesis using Deep Learning Architectures." 2016. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1292251.

Abstract:
This thesis presents a statistical parametric speech synthesis framework using deep learning techniques and models. Existing speech synthesis systems face two main challenges: the complexity of expressing speech prosody with its acoustic realizations, and the sparsity of training data. Both limit the naturalness of synthesized speech. This thesis attempts to improve synthesis performance in terms of speech naturalness by leveraging the modeling power of deep learning architectures.
To precisely represent the linguistic contexts, we defined a hierarchical prosodic structure to organize both the segmental and suprasegmental features, and proposed a syllable-level representation of the hierarchical structure for speech synthesis using deep learning architectures.
Inspired by the Deep Belief Network’s (DBN’s) success in handwriting digit image recognition and generation, we propose to model the speech spectrograms in addition to F0 contours as 2-D images in the DBN framework. In order to fit the speech prosodic and acoustic parameters consisting of data with various distributions, we adapt the original model into a Weighted Multi-Distribution DBN (wMD-DBN). Compared with the predominant HMM-based approach, objective evaluation shows that the spectrum generated from wMD-DBN has less distortion. Subjective tests also confirm the advantage of the spectrum from wMD-DBN, and the wMD-DBN system gives a similar overall quality as the HMM baseline.
Previous work on DNN in the speech community mainly focused on using it as a classifier for better acoustic modeling in the speech recognition task. Here we treat DNN as a generative model and use it for linguistic-to-acoustic feature mapping in speech synthesis. Compared to the DBN model, DNN only requires a single computing pass for feature prediction, making it more suitable for real-time synthesis. On the other hand, DNN models the conditional probability instead of the joint probability as in the DBN model, which is more intuitive for the feature mapping task. As with wMD-DBN, we adapt the output layer of a plain DNN into a Multi-Distribution (MD) output layer. We also design specialized loss functions for acoustic features with uncommon distributions. To achieve good performance with a deep model structure, we use the generatively pre-trained DBN as the model initialization to build the MD-DNN architecture. Both objective and subjective evaluations show that the MD-DNN model outperforms the wMD-DBN and HMM in terms of the naturalness of synthesized speech.
Kang, Shiyin.
Thesis (Ph.D.), Chinese University of Hong Kong, 2016.
Includes bibliographical references (leaves ).
Abstracts also in Chinese.
Title from PDF title page (viewed on …).
Detailed summary in vernacular field only.

Books on the topic "Statistical Parametric Speech Synthesizer"

1

Rao, K. Sreenivasa, and N. P. Narendra. Source Modeling Techniques for Quality Enhancement in Statistical Parametric Speech Synthesis. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-02759-9.

2

Source Modeling Techniques for Quality Enhancement in Statistical Parametric Speech Synthesis. Springer, 2018.


Book chapters on the topic "Statistical Parametric Speech Synthesizer"

1

Smruti, Soumya, Jagyanseni Sahoo, Monalisa Dash, and Mihir N. Mohanty. "An Approach to Design an Intelligent Parametric Synthesizer for Emotional Speech." In Advances in Intelligent Systems and Computing, 367–74. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-12012-6_40.

2

Al-Radhi, Mohammed Salah, Tamás Gábor Csapó, and Géza Németh. "A Continuous Vocoder Using Sinusoidal Model for Statistical Parametric Speech Synthesis." In Speech and Computer, 11–20. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-99579-3_2.

3

Coto-Jiménez, Marvin. "Measuring the Effect of Reverberation on Statistical Parametric Speech Synthesis." In Communications in Computer and Information Science, 369–82. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-41005-6_25.

4

An, Xiaochun, Hongwu Yang, and Zhenye Gan. "Towards Realizing Sign Language-to-Speech Conversion by Combining Deep Learning and Statistical Parametric Speech Synthesis." In Communications in Computer and Information Science, 678–90. Singapore: Springer Singapore, 2016. http://dx.doi.org/10.1007/978-981-10-2053-7_61.

5

Ma, Dabiao, Zhiba Su, Wenxuan Wang, Yuhao Lu, and Zhen Li. "UFANS: U-Shaped Fully-Parallel Acoustic Neural Structure for Statistical Parametric Speech Synthesis." In PRICAI 2019: Trends in Artificial Intelligence, 273–78. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-29894-4_22.

6

Singh, Harman, Parminder Singh, and Manjot Kaur Gill. "Statistical Parametric Speech Synthesis for Punjabi Language using Deep Neural Network." In SCRS CONFERENCE PROCEEDINGS ON INTELLIGENT SYSTEMS, 431–41. Soft Computing Research Society, 2021. http://dx.doi.org/10.52458/978-93-91842-08-6-41.

Abstract:
In recent years, speech technology has advanced considerably, making speech synthesis an interesting area of study for researchers. A Text-To-Speech (TTS) system generates speech from text using a synthesis technique such as concatenative, formant, articulatory, or Statistical Parametric Speech Synthesis (SPSS). This research work uses DNN-based SPSS for the Punjabi language. The database used for this work contains 674 audio files and a single text file containing 674 sentences; it was created at the Language Technologies Institute at Carnegie Mellon University (CMU) and is provided under the Festvox distribution. The Ossian toolkit is used as a front-end for text processing, and two DNNs are modeled using the Merlin toolkit: the duration DNN maps linguistic features to duration features, and the acoustic DNN maps linguistic features to acoustic features. Subjective evaluation using the Mean Opinion Score (MOS) shows that this TTS system achieves good naturalness, with a score of 80.2%.
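The two-model pipeline described in the abstract (a duration model followed by a frame-level acoustic model) can be sketched roughly as below. This is a toy NumPy sketch with single linear layers standing in for the DNNs; all feature sizes and the log-duration parameterization are assumptions made for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature sizes, not taken from the paper.
D_LING, D_ACOU = 50, 60
Wd = rng.normal(0, 0.1, (D_LING, 1))            # "duration DNN" as one layer
Wa = rng.normal(0, 0.1, (D_LING + 1, D_ACOU))   # "acoustic DNN" as one layer

phones = rng.normal(size=(5, D_LING))           # 5 phones in the utterance

# Step 1: the duration model predicts a frame count per phone
# (here via an exponential of a linear score, at least one frame).
durations = np.maximum(1, np.exp(phones @ Wd).round().astype(int)).ravel()

# Step 2: upsample linguistic features to frame level, appending a
# within-phone position feature, then map frames to acoustic features.
frames = []
for feat, dur in zip(phones, durations):
    for t in range(dur):
        frames.append(np.concatenate([feat, [t / dur]]))
frames = np.array(frames)
acoustic = np.tanh(frames @ Wa)                 # frame-level acoustic trajectory
```

In a real Merlin-style setup each linear layer would be a trained deep network, and the acoustic features would be fed to a vocoder to generate the waveform.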
7

Tits, Noé, Kevin El Haddad, and Thierry Dutoit. "The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach." In Human 4.0 - From Biology to Cybernetic. IntechOpen, 2021. http://dx.doi.org/10.5772/intechopen.89849.

Abstract:
As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of text-to-speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, with the latest techniques modeling text-to-speech synthesis as a sequence-to-sequence problem. This enables the use of deep learning blocks such as convolutional and recurrent neural networks, as well as attention mechanisms. The last part of the chapter assembles the different aspects of the theory and summarizes the concepts.
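The attention mechanism mentioned in the abstract, which lets a sequence-to-sequence decoder weight encoded text positions when producing each acoustic frame, can be illustrated with a minimal dot-product attention step. The dimensions are arbitrary and this is a generic sketch, not code from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

def attention(query, keys, values):
    """Scaled dot-product attention over a sequence of encoder states."""
    scores = keys @ query / np.sqrt(query.shape[0])  # one score per position
    weights = np.exp(scores - scores.max())          # stable softmax
    weights /= weights.sum()
    context = weights @ values                       # weighted sum of values
    return context, weights

enc = rng.normal(size=(12, 16))  # 12 encoded text positions, dimension 16
q = rng.normal(size=16)          # current decoder state as the query
ctx, w = attention(q, enc, enc)
```

At each decoder step the context vector `ctx` summarizes the input text, with `w` indicating which positions the model is currently "reading".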

Conference papers on the topic "Statistical Parametric Speech Synthesizer"

1

Zen, Heiga, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, and Przemysław Szczepaniak. "Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices." In Interspeech 2016. ISCA, 2016. http://dx.doi.org/10.21437/interspeech.2016-522.

2

Black, Alan W., Heiga Zen, and Keiichi Tokuda. "Statistical Parametric Speech Synthesis." In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07. IEEE, 2007. http://dx.doi.org/10.1109/icassp.2007.367298.

3

Fagel, Sascha. "Video-realistic synthetic speech with a parametric visual speech synthesizer." In Interspeech 2004. ISCA, 2004. http://dx.doi.org/10.21437/interspeech.2004-422.

4

Aroon, Athira, and S. B. Dhonde. "Statistical Parametric Speech Synthesis: A review." In 2015 IEEE 9th International Conference on Intelligent Systems and Control (ISCO). IEEE, 2015. http://dx.doi.org/10.1109/isco.2015.7282379.

5

Gutscher, Lorenz, Michael Pucher, Carina Lozo, Marisa Hoeschele, and Daniel C. Mann. "Statistical parametric synthesis of budgerigar songs." In 10th ISCA Speech Synthesis Workshop. ISCA, 2019. http://dx.doi.org/10.21437/ssw.2019-23.

6

Ni, Jinfu, Yoshinori Shiga, Hisashi Kawai, and Hideki Kashioka. "Experiments on unsupervised statistical parametric speech synthesis." In 2012 8th International Symposium on Chinese Spoken Language Processing (ISCSLP 2012). IEEE, 2012. http://dx.doi.org/10.1109/iscslp.2012.6423518.

7

Shylendra, Ahish, Sina Haji Alizad, Priyesh Shukla, and Amit Ranjan Trivedi. "Non-parametric Statistical Density Function Synthesizer and Monte Carlo Sampler in CMOS." In 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID). IEEE, 2020. http://dx.doi.org/10.1109/vlsid49098.2020.00021.

8

Ze, Heiga, Andrew Senior, and Mike Schuster. "Statistical parametric speech synthesis using deep neural networks." In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6639215.

9

An, Shumin, Zhenhua Ling, and Lirong Dai. "Emotional statistical parametric speech synthesis using LSTM-RNNs." In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2017. http://dx.doi.org/10.1109/apsipa.2017.8282282.

10

Tokuda, Keiichi, Kei Hashimoto, Keiichiro Oura, and Yoshihiko Nankaku. "Temporal modeling in neural network based statistical parametric speech synthesis." In 9th ISCA Speech Synthesis Workshop. ISCA, 2016. http://dx.doi.org/10.21437/ssw.2016-18.

