Dissertations / Theses: 'Signal processing; Voice recognition'

1

Nayfeh, Taysir H. "Multi-signal processing for voice recognition in noisy environments." Thesis, 1991. Available online: http://scholar.lib.vt.edu/theses/available/etd-10222009-125021/.

2

Fredrickson, Steven Eric. "Neural networks for speaker identification." Thesis, University of Oxford, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.294364.

3

Little, M. A. "Biomechanically informed nonlinear speech signal processing." Thesis, University of Oxford, 2007. http://ora.ox.ac.uk/objects/uuid:6f5b84fb-ab0b-42e1-9ac2-5f6acc9c5b80.

Abstract:

Linear digital signal processing based around linear, time-invariant systems theory finds substantial application in speech processing. The linear acoustic source-filter theory of speech production provides ready biomechanical justification for using linear techniques. Nonetheless, biomechanical studies surveyed in this thesis display significant nonlinearity and non-Gaussianity, casting doubt on the linear model of speech production. Surrogate data techniques can therefore be used to test the appropriateness of linear systems assumptions for speech production. This study uncovers systematic flaws in the design and use of existing surrogate data techniques, and, by making novel improvements, develops a more reliable technique. Collating the largest set of speech signals to date compatible with this new technique, this study next demonstrates that the linear assumptions are not appropriate for all speech signals. Detailed analysis shows that while vowel production from healthy subjects cannot be explained within the linear assumptions, consonants can. Linear assumptions also fail for most vowel production by pathological subjects with voice disorders. Combining this new empirical evidence with information from biomechanical studies leads to the conclusion that the most parsimonious model for speech production, explaining all these findings in one unified set of mathematical assumptions, is a stochastic nonlinear, non-Gaussian model, which subsumes both Gaussian linear and deterministic nonlinear models. As a case study, to demonstrate the engineering value of nonlinear signal processing techniques based upon the proposed biomechanically-informed, unified model, the study investigates the biomedical engineering application of disordered voice measurement. A new state space recurrence measure is devised and combined with an existing measure of the fractal scaling properties of stochastic signals.
Using a simple pattern classifier these two measures outperform all combinations of linear methods for the detection of voice disorders on a large database of pathological and healthy vowels, making explicit the effectiveness of such biomechanically-informed, nonlinear signal processing techniques.

4

Regnier, Lise. "Localization, Characterization and Recognition of Singing Voices." PhD thesis, Université Pierre et Marie Curie - Paris VI, 2012. http://tel.archives-ouvertes.fr/tel-00687475.

Abstract:

This dissertation is concerned with the problem of describing the singing voice within the audio signal of a song. This work is motivated by the fact that the lead vocal is the element that attracts the attention of most listeners. For this reason it is common for music listeners to organize and browse music collections using information related to the singing voice, such as the singer's name. Our research concentrates on the three major problems of music information retrieval: the localization of the source to be described (i.e. the recognition of the elements corresponding to the singing voice in the signal of a mixture of instruments), the search for pertinent features to describe the singing voice, and finally the development of pattern recognition methods based on these features to identify the singer. For this purpose we propose a set of novel features computed on the temporal variations of the fundamental frequency of the sung melody. These features, which aim to describe the vibrato and the portamento, are obtained with the aid of a dedicated model. In practice, these features are computed on the time-varying frequency of partials obtained using the sinusoidal model. In the first experiment we show that partials corresponding to the singing voice can be accurately differentiated from the partials produced by other instruments using decisions based on the parameters of the vibrato and the portamento. Once the partials emitted by the singer are identified, the segments of the song containing singing can be directly localized. To improve the recognition of the partials emitted by the singer we propose to group partials that are related harmonically. Partials are clustered according to their degree of similarity. This similarity is computed using a set of CASA cues including their temporal frequency variations (i.e. the vibrato and the portamento).
The clusters of harmonically related partials corresponding to the singing voice are identified using the vocal vibrato and the portamento parameters. Groups of vocal partials can then be re-synthesized to isolate the voice. The result of the partial grouping can also be used to transcribe the sung melody. We then propose to go further with these features and study if the vibrato and portamento characteristics can be considered as a part of the singers' signature. Previous works on singer identification describe audio signals using features extracted on the short-term amplitude spectrum. The latter features aim to characterize the timbre of the sound, which, in the case of singing, is related to the vocal tract of the singer. The features we develop in this document capture long-term information related to the intonation of the singer, which is relevant to the style and the technique of the singer. We propose a method to combine these two complementary descriptions of the singing voice to increase the recognition rate of singer identification. In addition we evaluate the robustness of each type of feature against a set of variations. We show the singing voice is a highly variable instrument. To obtain a representative model of a singer's voice it is thus necessary to build models using a large set of examples covering the full tessitura of a singer. In addition, we show that features extracted directly from the partials are more robust to the presence of an instrumental accompaniment than features derived from the amplitude spectrum.

5

Adami, Andre Gustavo. "Sistema de reconhecimento de locutor utilizando redes neurais artificiais." Biblioteca Digital de Teses e Dissertações da UFRGS, 1997. http://hdl.handle.net/10183/18277.

Abstract:

This work applies recent technologies from the promising field of Computational Intelligence and from the traditional field of Digital Signal Processing (DSP) to a specific Voice Processing application: speaker recognition. Many applications, mainly in security and control, become possible once speaker recognition technology is mastered, for both the identification and the verification of speakers. There are two types of speaker recognition: identification (recognizing the speaker within a population) and verification (checking whether a claimed identity is true). The recognition process can be divided into two main phases: feature extraction (signal acquisition, pre-processing, and extraction of the characteristic parameters of the signal) and classification (classifying the sampled signal to identify or verify the speaker). Several techniques for representing the signal are presented, such as spectral analysis, energy measures, autocorrelation, and Linear Predictive Coding (LPC), along with techniques for extracting signal features such as the fundamental frequency and the formant frequencies. In the extraction phase, the aim was to apply recent advances in DSP to the problem. The fundamental frequency and the formant frequencies were used as the parameters that identify the speaker; the former was obtained through autocorrelation and the latter through the Fourier transform. These parameters were extracted from the portion of speech where the vocal tract presents a coarticulation between two vowel sounds, an approach intended to capture the characteristics of this change in the vocal apparatus. In the classification phase, several conventional methods can be used, such as Markov chains and Euclidean distance, among others. In addition, Artificial Neural Networks (ANNs) are considered powerful classifiers and have already been used in problems involving the classification of voice signals. The models most commonly used for speaker recognition are studied, and the main topic of the dissertation is the implementation of a speaker recognition system that uses ANNs to classify the speaker. The Multi-Layer Perceptron (MLP) architecture was investigated in conjunction with the backpropagation learning algorithm. Some of the main features extracted from the voice signal were used as input parameters to the network, and the output of the MLP, previously trained with the speakers' features, indicates the authenticity of the signal. Tests were performed with 10 male speakers aged 18 to 24, and the results are very promising. An approach to implementing a speaker recognition system with conventional classification techniques, namely Dynamic Time Warping (DTW) and Vector Quantization (VQ), is also presented.
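
The abstract above states that the fundamental frequency was obtained through autocorrelation. The following is only a minimal sketch of that general idea, not the thesis's implementation: the function name `estimate_f0`, the 50-400 Hz search range, and the synthetic test frame are all assumptions made for illustration.

```python
import math

def estimate_f0(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by autocorrelation.

    The search range f_min..f_max is an assumed, typical range for speech.
    """
    # Remove the mean so a DC offset does not dominate the correlation.
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]

    # Only consider lags that correspond to plausible pitch periods.
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(x) - 1)

    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation of the frame at this lag.
        value = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if value > best_value:
            best_lag, best_value = lag, value

    # The lag with the strongest self-similarity is taken as the pitch period.
    return sample_rate / best_lag

# Synthetic "voiced" frame: a 125 Hz tone plus a weaker second harmonic.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 125 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 250 * n / sample_rate)
         for n in range(800)]
f0 = estimate_f0(frame, sample_rate)
```

On real speech, such an estimator is normally applied frame by frame to voiced segments only, and practical systems usually add normalisation and peak-picking safeguards that this sketch omits.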

6

Stolfi, Rumiko Oishi. "Sintese e reconhecimento da fala humana." [s.n.], 2006. http://repositorio.unicamp.br/jspui/handle/REPOSIP/276267.

Abstract:

Advisors: Fabio Violaro, Anamaria Gomide
Dissertation (professional master's) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: The goal of this dissertation is to review the main concepts and methods relating to the synthesis, processing, and recognition of human speech by computer. These technologies have many applications, which have increased substantially in recent years with the spread of portable communication equipment (mobile phones, laptops, palmtops) and universal access to the Internet. The first part of this work is a review of fundamental concepts of signal processing, including the Fourier transform, power spectrum and spectrogram, filters, signal digitization, and Nyquist's theorem. The second part describes the main characteristics of human speech, the mechanisms involved in its production and perception, and the concept of phone (linguistic unit of sound). In this part we also briefly describe the main techniques used for orthographic-phonetic conversion, for speech synthesis from a phonetic description, and for the recognition of natural speech. The third part describes a practical project we developed to consolidate the knowledge acquired in our Master's studies: a program that generates Japanese popular songs from a textual description of the lyrics and music, using the concatenative synthesis method. At the end of this dissertation, we list some available software products (free and commercial) for speech synthesis and speech recognition.
Master's degree
Computer Engineering
Master of Computer Science

7

Clotworthy, Christopher John. "A study of automated voice recognition." Thesis, Queen's University Belfast, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.356909.

8

Wells, Ian. "Digital signal processing architectures for speech recognition." Thesis, University of the West of England, Bristol, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.294705.

9

Aggoun, Amar. "DPCM video signal/image processing." Thesis, University of Nottingham, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.335792.

10

Morris, Robert W. "Enhancement and recognition of whispered speech." Diss., Georgia Institute of Technology, 2003. http://etd.gatech.edu/theses/available/etd-04082004-180338/unrestricted/morris%5frobert%5fw%5f200312%5fphd.pdf.

11

Rex, James Alexander. "Microphone signal processing for speech recognition in cars." Thesis, University of Southampton, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.326728.

12

Shah, Afnan Arafat. "Improving automatic speech recognition transcription through signal processing." Thesis, University of Southampton, 2017. https://eprints.soton.ac.uk/418970/.

Abstract:

Automatic speech recognition (ASR) in the educational environment could be a solution to address the problem of gaining access to the spoken words of a lecture for many students who find lectures hard to understand, such as those whose mother tongue is not English or who have a hearing impairment. In such an environment, it is difficult for ASR to provide transcripts with Word Error Rates (WER) less than 25% for the wide range of speakers. Reducing the WER reduces the time and therefore cost of correcting errors in the transcripts. To deal with the variation of acoustic features between speakers, ASR systems implement automatic vocal tract normalisation (VTN) that warps the formants (resonant frequencies) of the speaker to better match the formants of the speakers in the training set. The ASR also implements automatic dynamic time warping (DTW) to deal with variation in the speaker’s rate of speaking, by aligning the time series of the new spoken words with the time series of the matching spoken words of the training set. This research investigates whether the ASR’s automatic estimation of VTN and DTW can be enhanced through pre-processing the recording by manually warping the formants and speaking rate of the recordings using sound processing libraries (Rubber Band and SoundTouch) before transcribing the pre-processed recordings using ASR. An initial experiment, performed with the recordings of two male and two female speakers, showed that pre-processing the recording could improve the WER by an average of 39.5% for male speakers and 36.2% for female speakers. However the selection of the best warp factors was achieved through an iterative ‘trial and error’ approach that involved many hours calculating the word error rate for each warp factor setting. Finding a more efficient approach for selecting the warp factors for pre-processing was then investigated. 
The second experiment investigated the development of a modification function using, as its training set, the best warp factors from the ‘trial and error’ approach to estimate the modification percentage required to improve the WER of a recording. A modification function was found that on average improved the WER by 16% for female speakers and 7% for male speakers.
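
The abstract above relies on the Word Error Rate (WER) throughout. As a hedged illustration only, WER is commonly computed as the word-level edit distance between a reference transcript and the recogniser's output, divided by the number of reference words. The function names, the example sentences, and the reading of "improved the WER by N%" as a relative reduction are assumptions made here, not details taken from the thesis.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + substitution_cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def relative_improvement(wer_before, wer_after):
    """Relative WER reduction; assumes 'improved by N%' means a relative change."""
    return (wer_before - wer_after) / wer_before

# One substitution ("start") and one insertion ("today") against five reference words.
wer = word_error_rate("the lecture starts at nine",
                      "the lecture start at nine today")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.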

13

Doukas, Nikolaos. "Voice activity detection using energy based measures and source separation." Thesis, Imperial College London, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245220.

14

Oddiraju, Swetha. "Improving performance for adaptive filtering with voice applications." Diss., Columbia, Mo. : University of Missouri-Columbia, 2007. http://hdl.handle.net/10355/6271.

Abstract:

Thesis (M.S.)--University of Missouri-Columbia, 2007.

15

Hanna, Salim Alia. "Digital signal processing algorithms for speech coding and recognition." Thesis, Imperial College London, 1987. http://hdl.handle.net/10044/1/46268.

16

Youssif, Roshdy S. "Hybrid Intelligent Systems for Pattern Recognition and Signal Processing." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1085714219.

17

Nosa, Ogbewi. "Signal Processing and pattern recognition algorithm for monitoring Parkinson’s disease." Thesis, Högskolan Dalarna, Datateknik, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:du-2376.

Abstract:

This master's thesis describes the development of signal processing and pattern recognition for monitoring Parkinson’s disease. It involves the development of a signal processing algorithm whose output is passed to a pattern recognition algorithm. These algorithms are used to assess, predict, and draw conclusions in the study of Parkinson’s disease, and they help in understanding how the disease presents in humans.

18

Wilson, Shawn C. "Voice recognition systems: assessment of implementation aboard U.S. naval ships." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2003. http://library.nps.navy.mil/uhtbin/hyperion-image/03Mar%5FWilson.pdf.

Abstract:

Thesis (M.S. in Information Systems and Operations)--Naval Postgraduate School, March 2003.
Thesis advisor(s): Michael T. McMaster, Kenneth J. Hagan. Includes bibliographical references (p. 47-49). Also available online.

19

Wu, Ping. "Kohonen self-organising neural networks in speech signal processing." Thesis, University of Reading, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386985.

20

Johnson, Joanna. "The effectiveness of voice recognition technology as used by persons with disabilities." Online version, 1998. http://www.uwstout.edu/lib/thesis/1998/1998johnsonj.pdf.

21

Watkins, L. R. "Optical fibre communications: signal processing to accommodate system impairments." Thesis, Bangor University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.279143.

22

Smith, Philip F. "Surface evaluation by the signal processing of ultrasonic pulses." Thesis, University of Aberdeen, 1990. http://digitool.abdn.ac.uk/R?func=search-advanced-go&find_code1=WSN&request1=AAIU024863.

Abstract:

The development of a surface texture evaluation technique for the study of roughnesses of the order of a few microns using the signal processing of ultrasonic pulse-echo signals is described. The technique of extracting surface information by means of deconvolution is introduced. Strictly, a solution to the deconvolution problem normally does not exist or is not unique. The chosen method of approaching a solution is the nonlinear Maximum Entropy Method (MEM), which offers superior image quality over many other filters. The algorithm is described and translated into a standalone computer programme; the development of this software is described in detail. The performance of the algorithm in the field of ultrasonics is assessed by means of simulations involving images similar to those obtainable in a real application. Comparison with the linear Wiener-Hopf filter is provided, particularly in instances where the comparison shows weaknesses of either technique. Also examined is the frequency restoration property of the algorithm (not shown by the Wiener-Hopf filter); potential applications of this property are also described. The final part of the study of the MEM is an examination of the effect on performance of some of the algorithm's parameters and of computer system dependencies. A brief overview of some of the surface metrology techniques currently used is given. The aim is an introduction to surface metrology and an assessment of where the technique described here fits into the general surface metrology field. The experimental system, which is essential to practical applications, is considered in some detail. Also considered is a wide range of ultrasonic transducers available for the research. These show a considerable variety of characteristics. Some assessment is carried out using the Maximum Entropy Method with simulated and real data to try to establish the properties of a transducer best suited to the intended application.
Finally, results from grating-type test surfaces and more general rough surfaces are presented. The former are intended as a means of establishing the potential performance of the technique; the latter build on the grating results to analyse real surfaces as made by a variety of engineering techniques. Results are compared with those obtained by a stylus instrument. Generally good agreement is found, with roughnesses of around 2 microns being accurately assessed. With the accuracy of these results being better than a micron, it is concluded that this technique makes a valuable contribution to the surface metrology field.

23

Wang, Yuanxun. "Radar signature prediction and feature extraction using advanced signal processing techniques." 1999. Digital version accessible at: http://wwwlib.umi.com/cr/utexas/main.

24

Santos Júnior, Gutemberg Gonçalves dos. "Redução de ruído para sistemas de reconhecimento de voz utilizando subespaços vetoriais." Universidade Federal de Campina Grande, 2009. http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/1508.

Abstract:

The establishment of a speech-based communication interface between humans and computers has been pursued since the beginning of the computer era. Many advances have been made over the last six decades toward this goal, making commercial use of speech recognition applications possible today. However, factors such as noise, reverberation, and distortion, among others, degrade the performance of these systems, reducing their success rate when they operate in adverse environments. The study of techniques that reduce the impact of these problems is therefore of great value and has gained prominence in recent decades. The work presented in this dissertation aims to reduce problems related to the noise characteristic of automotive environments, making the speech recognition systems used there more robust. In this way, non-critical features of a car, that is, features that do not put the user's life at risk, such as the music player and air conditioning, can be controlled through voice commands. The proposed system is based on a speech signal preprocessing step using the signal subspace method. The performance of this method is directly related to the dimensions (rows × columns) of the matrices that represent the input signal. Taking this into account, the ULLV decomposition, although it is only an approximation of the signal subspace method, was used because it offers lower computational complexity than traditional methods based on the SVD. The Julius speech recognizer was chosen for the case study because it is open-source software that offers high performance. A noisy speech database with 44800 samples was generated using a model of an automotive environment. Finally, the robustness of the system was evaluated and compared with a traditional noise reduction method called spectral subtraction.
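
The abstract names spectral subtraction as the traditional baseline. The sketch below illustrates only the basic magnitude-domain form of that baseline on a single frame, under stated assumptions: a noise magnitude estimate is already available for every frequency bin, the noisy phase is reused, and a naive DFT stands in for an FFT. The function names and the synthetic signal are hypothetical, and this is not the dissertation's subspace (ULLV) method.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short demo frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtraction(frame, noise_magnitude, floor=0.0):
    """Subtract an estimated noise magnitude from each bin, keeping the noisy phase.

    `floor` is an assumed spectral floor, expressed as a fraction of the noisy
    magnitude, below which no bin is allowed to fall.
    """
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitude):
        magnitude = max(abs(value) - noise, floor * abs(value))
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)

# Demo: a unit tone in bin 4 plus a weak disturbance in bin 10, 64-sample frame.
n = 64
noisy = [math.sin(2 * math.pi * 4 * t / n) + 0.1 * math.sin(2 * math.pi * 10 * t / n)
         for t in range(n)]
# Flat noise estimate: larger than the disturbance (magnitude 3.2), smaller than the tone (32).
cleaned = spectral_subtraction(noisy, [4.0] * n)
```

In this demo the disturbance is removed entirely, while the tone is attenuated from amplitude 1 to 0.875, which shows the usual trade-off: subtracting the noise estimate also removes some of the signal.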

25

Osanlou, Ardeshir. "Soft computing and fractal geometry in signal processing and pattern recognition." Thesis, De Montfort University, 2000. http://hdl.handle.net/2086/4242.

26

Elvira, Jose M. "Neural networks for speech and speaker recognition." Thesis, Staffordshire University, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.262314.

27

El, Malki Karim. "A novel approach to high quality voice using echo cancellation and silence detection." Thesis, University of Sheffield, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.286579.

28

Hartley, David Andrew. "Image correlation using digital signal processors." Thesis, Liverpool John Moores University, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.304465.

29

Jalalinajafabadi, Farideh. "Computerised GRBAS assessement of voice quality." Thesis, University of Manchester, 2016. https://www.research.manchester.ac.uk/portal/en/theses/computerised-grbas-assessement-of-voice-quality(7efd3263-b109-4137-87cf-b9559c61730b).html.

Abstract:

Vocal cord vibration is the source of voiced phonemes in speech. Voice quality depends on the nature of this vibration. Vocal cords can be damaged by infection, neck or chest injury, tumours and more serious diseases such as laryngeal cancer. This kind of physical damage can cause loss of voice quality. To support the diagnosis of such conditions and also to monitor the effect of any treatment, voice quality assessment is required. Traditionally, this is done ‘subjectively’ by Speech and Language Therapists (SLTs) who, in Europe, use a well-known assessment approach called ‘GRBAS’. GRBAS is an acronym for a five dimensional scale of measurements of voice properties. The scale was originally devised and recommended by the Japanese Society of Logopedics and Phoniatrics and several European research publications. The properties are ‘Grade’, ‘Roughness’, ‘Breathiness’, ‘Asthenia’ and ‘Strain’. An SLT listens to and assesses a person’s voice while the person performs specific vocal maneuvers. The SLT is then required to record a discrete score for the voice quality in the range 0 to 3 for each GRBAS component. In requiring the services of trained SLTs, this subjective assessment makes the traditional GRBAS procedure expensive and time-consuming to administer. This thesis considers the possibility of using computer programs to perform objective assessments of voice quality conforming to the GRBAS scale. To do this, Digital Signal Processing (DSP) algorithms are required for measuring voice features that may indicate voice abnormality. The computer must be trained to convert DSP measurements to GRBAS scores and a ‘machine learning’ approach has been adopted to achieve this. This research was made possible by the development, by Manchester Royal Infirmary (MRI) Hospital Trust, of a ‘speech database’ with the participation of clinicians, SLTs, patients and controls.
The participation of five SLT scorers allowed norms to be established for GRBAS scoring which provided ‘reference’ data for the machine learning approach. To support the scoring procedure carried out at MRI, a software package, referred to as GRBAS Presentation and Scoring Package (GPSP), was developed for presenting voice recordings to each of the SLTs and recording their GRBAS scores. A means of assessing intra-scorer consistency was devised and built into this system. Also, the assessment of inter-scorer consistency was advanced by the invention of a new form of the ‘Fleiss Kappa’ which is applicable to ordinal as well as categorical scoring. The means of taking these assessments of scorer consistency into account when producing ‘reference’ GRBAS scores are presented in this thesis. Such reference scores are required for training the machine learning algorithms. The DSP algorithms required for feature measurements are generally well known and available as published or commercial software packages. However, an appraisal of these algorithms and the development of some DSP ‘thesis software’ was found to be necessary. Two ‘machine learning’ regression models have been developed for mapping the measured voice features to GRBAS scores. These are K Nearest Neighbor Regression (KNNR) and Multiple Linear Regression (MLR). Our research is based on sets of features, sets of data and prediction models that are different from the approaches in the current literature. The performance of the computerised system is evaluated against reference scores using a Normalised Root Mean Squared Error (NRMSE) measure. The performances of MLR and KNNR for objective prediction of GRBAS scores are compared and analysed ‘with feature selection’ and ‘without feature selection’. It was found that MLR with feature selection was better than MLR without feature selection and KNNR with and without feature selection, for all five GRBAS components.
It was also found that MLR with feature selection gives scores for ‘Asthenia’ and ‘Strain’ which are closer to the reference scores than the scores given by all five individual SLT scorers. The best objective score for ‘Roughness’ was closer than the scores given by two SLTs, roughly equal to the score of one SLT and worse than the other two SLT scores. The best objective scores for ‘Breathiness’ and ‘Grade’ were further from the reference scores than the scores produced by all five SLT scorers. However, the worst ‘MLR with feature selection’ result has a normalised RMS error which is only about 3% worse than the worst SLT scoring. The results obtained indicate that objective GRBAS measurements have the potential for further development towards a commercial product that may at least be useful in augmenting the subjective assessments of SLT scorers.
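To make the evaluation measure in this abstract concrete, the following is a minimal sketch of a Normalised Root Mean Squared Error between predicted and reference scores. The abstract does not say how the error is normalised, so dividing by the width of the 0 to 3 scoring range is an assumption, and the function name `nrmse` and its parameters are hypothetical.

```python
import math


def nrmse(predicted, reference, score_min=0.0, score_max=3.0):
    """Root mean squared error divided by the score range.

    Assumption: normalisation by (score_max - score_min); the thesis
    may normalise differently (for example by the mean or the
    standard deviation of the reference scores).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean of the squared differences between prediction and reference.
    mse = sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)
    return math.sqrt(mse) / (score_max - score_min)
```

Under this normalisation a value of 0 means the predictions match the reference scores exactly, and a value of 1 means every prediction is wrong by the full width of the scale.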

APA, Harvard, Vancouver, ISO, and other styles

30

DeVilliers, Edward Michael. "Implementing voice recognition and natural language processing in the NPSNET networked virtual environment." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1996. http://handle.dtic.mil/100.2/ADA320340.

Full text

Abstract:

Thesis (M.S. in Computer Science) Naval Postgraduate School, September 1996.
Thesis advisor(s): Nelson D. Ludlow, John S. Falby. "September 1996." Includes bibliographical references (p. 171-175). Also available online.

APA, Harvard, Vancouver, ISO, and other styles

31

Calitz, Wietsche Roets. "Independent formant and pitch control applied to singing voice." Thesis, Stellenbosch : University of Stellenbosch, 2004. http://hdl.handle.net/10019.1/16267.

Full text

Abstract:

Thesis (MScIng)--University of Stellenbosch, 2004.
ENGLISH ABSTRACT: A singing voice can be manipulated artificially by means of a digital computer for the purposes of creating new melodies or to correct existing ones. When the fundamental frequency of an audio signal that represents a human voice is changed by simple algorithms, the formants of the voice tend to move to new frequency locations, making it sound unnatural. The main purpose is to design a technique by which the pitch and formants of a singing voice can be controlled independently.
AFRIKAANSE OPSOMMING: Onafhanklike formant- en toonhoogte beheer toegepas op ’n sangstem: ’n Sangstem kan deur ’n digitale rekenaar gemanipuleer word om nuwe melodieë te skep, of om bestaandes te verbeter. Wanneer die fundamentele frekwensie van ’n klanksein (wat ’n menslike stem voorstel) deur ’n eenvoudige algoritme verander word, skuif die oorspronklike formante na nuwe frekwensie gebiede. Dit veroorsaak dat die resultaat onnatuurlik klink. Die hoof oogmerk is om ’n tegniek te ontwerp wat die toonhoogte en die formante van ’n sangstem apart kan beheer.

APA, Harvard, Vancouver, ISO, and other styles

32

Yiu, Siu Fung. "Recursive state-space approach to Ground Probing Radar signal processing." Thesis, Lancaster University, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.278379.

Full text

APA, Harvard, Vancouver, ISO, and other styles

33

Zhu, Yong. "Digital signal and image processing techniques for ultrasonic nondestructive evaluation." Thesis, City University London, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.336431.

Full text

APA, Harvard, Vancouver, ISO, and other styles

34

Ansourian, Megeurditch N. "Digital signal processing for the analysis of fetal breathing movements." Thesis, University of Edinburgh, 1989. http://hdl.handle.net/1842/13595.

Full text

APA, Harvard, Vancouver, ISO, and other styles

35

Chan, Arthur Yu Chung. "Robust speech recognition against unknown short-time noise /." View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20CHAN.

Full text

Abstract:

Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2002.
Includes bibliographical references (leaves 119-125). Also available in electronic version. Access restricted to campus users.

APA, Harvard, Vancouver, ISO, and other styles

36

Smith, Quentin D. "Multichannel Digital Signal Processor Based Red/Black Keyset." International Foundation for Telemetering, 1992. http://hdl.handle.net/10150/611927.

Full text

Abstract:

International Telemetering Conference Proceedings / October 26-29, 1992 / Town and Country Hotel and Convention Center, San Diego, California
This paper addresses a method to provide both secure and non-secure voice communications to a DS-1 network from a common keyset. In order to comply with both the electrical isolation requirements and the operational security issues regarding voice communications, an all-digital approach to the keyset was developed based upon the AD2101 DSP. Protocols that are handled by the keyset include: Multiple PTT modes, hot mike, telephone access, priority override, direct access, indirect access, paging, and monitor only. Special features that are addressed include: independent channel by channel assignment of access protocols, headset assignment, speaker assignment, and PTT assignment. Multiple microprocessors are used to implement the foregoing as well as down-loadable configurations, remote keyset control and monitoring, and composite audio outputs. Partitioning of the digital design provides RED to BLACK channel isolation and RED channel to AC power isolation of greater than 107 dB.

APA, Harvard, Vancouver, ISO, and other styles

37

Nylén, Helmer. "Detecting Signal Corruptions in Voice Recordings for Speech Therapy." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291429.

Full text

Abstract:

When recording voice samples from a patient in speech therapy the quality of the recording may be affected by different signal corruptions, for example background noise or clipping. The equipment and expertise required to identify small disturbances are not always present at smaller clinics. Therefore, this study investigates possible machine learning algorithms to automatically detect selected corruptions in speech signals, including infrasound and random muting. Five algorithms are analyzed: kernel substitution based Support Vector Machine, Convolutional Neural Network, Long Short-term Memory (LSTM), Gaussian Mixture Model based Hidden Markov Model and Generative Model based Hidden Markov Model. A tool to generate datasets of corrupted recordings is developed to test the algorithms in both single-label and multi-label settings. Mel-frequency Cepstral Coefficients are used as the main features. For each type of corruption different ways to increase the classification accuracy are tested, for example by using a Voice Activity Detector to filter out less relevant parts of the recording, changing the feature parameters, or using an ensemble of classifiers. The experiments show that a machine learning approach is feasible for this problem as a balanced accuracy of at least 75% is reached on all tested corruptions. While the single-label study gave mixed results with no algorithm clearly outperforming the others, in the multi-label case the LSTM in general performs better than other algorithms. Notably it achieves over 95% balanced accuracy on both white noise and infrasound. As the algorithms are trained only on spoken English phrases the usability of this tool in its current state is limited, but the experiments are easily expanded upon with other types of audio recordings, corruptions, features, or classification algorithms.
När en patients röst spelas in för analys i talterapi kan inspelningskvaliteten påverkas av olika signalproblem, till exempel bakgrundsljud eller klippning. Utrustningen och expertisen som behövs för att upptäcka små störningar finns dock inte alltid tillgänglig på mindre kliniker. Därför undersöker denna studie olika maskininlärningsalgoritmer för att automatiskt kunna upptäcka utvalda problem i talinspelningar, bland andra infraljud och slumpmässig utsläckning av signalen. Fem algoritmer analyseras: stödvektormaskin, Convolutional Neural Network, Long Short-term Memory (LSTM), Gaussian mixture model-baserad dold Markovmodell och generatorbaserad dold Markovmodell. Ett verktyg för att skapa datamängder med försämrade inspelningar utvecklas för att kunna testa algoritmerna. Vi undersöker separat fallen där inspelningarna tillåts ha en eller flera problem samtidigt, och använder framförallt en slags kepstralkoefficienter, MFCC:er, som särdrag. För varje typ av problem undersöker vi också sätt att förbättra noggrannheten, till exempel genom att filtrera bort irrelevanta delar av signalen med hjälp av en röstupptäckare, ändra särdragsparametrarna, eller genom att använda en ensemble av klassificerare. Experimenten visar att maskininlärning är ett rimligt tillvägagångssätt för detta problem då den balanserade träffsäkerheten överskrider 75 % för samtliga testade störningar. Den delen av studien som fokuserade på enproblemsinspelningar gav inga resultat som tydde på att en algoritm var klart bättre än de andra, men i flerproblemsfallet överträffade LSTM:en generellt övriga algoritmer. Värt att notera är att den nådde över 95 % balanserad träffsäkerhet på både vitt brus och infraljud. Eftersom algoritmerna enbart tränats på engelskspråkiga, talade meningar så har detta verktyg i nuläget begränsad praktisk användbarhet. Däremot är det lätt att utöka dessa experiment med andra typer av inspelningar, signalproblem, särdrag eller algoritmer.
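The abstract above reports results as balanced accuracy. As a minimal sketch, assuming the usual definition (the unweighted mean of per-class recall), the metric can be computed as follows. The function name and the toy labels are illustrative and do not come from the thesis.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = corrupted recording, 0 = clean recording.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]  # a detector that always answers "corrupted"
score = balanced_accuracy(y_true, y_pred)
```

With these toy labels plain accuracy would be 0.8, while the balanced accuracy is 0.5, which is why the metric suits datasets where corrupted and clean recordings are unevenly represented.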

APA, Harvard, Vancouver, ISO, and other styles

38

Vemulapalli, Smita. "Audio-video based handwritten mathematical content recognition." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/45958.

Full text

Abstract:

Recognizing handwritten mathematical content is a challenging problem, and more so when such content appears in classroom videos. However, given the fact that in such videos the handwritten text and the accompanying audio refer to the same content, a combination of video and audio based recognizers has the potential to significantly improve the content recognition accuracy. This dissertation, using a combination of video and audio based recognizers, focuses on improving the recognition accuracy associated with handwritten mathematical content in such videos. Our approach makes use of a video recognizer as the primary recognizer, and a multi-stage assembly, developed as part of this research, is used to facilitate effective combination with an audio recognizer. Specifically, we address the following challenges related to audio-video based handwritten mathematical content recognition: (1) Video Preprocessing - generates a timestamped sequence of segmented characters from the classroom video in the face of occlusions and shadows caused by the instructor, (2) Ambiguity Detection - determines the subset of input characters that may have been incorrectly recognized by the video based recognizer and forwards this subset for disambiguation, (3) A/V Synchronization - establishes correspondence between the handwritten character and the spoken content, (4) A/V Combination - combines the synchronized outputs from the video and audio based recognizers and generates the final recognized character, and (5) Grammar Assisted A/V Based Mathematical Content Recognition - utilizes a base mathematical speech grammar for both character and structure disambiguation. Experiments conducted using videos recorded in a classroom-like environment demonstrate the significant improvements in recognition accuracy that can be achieved using our techniques.

APA, Harvard, Vancouver, ISO, and other styles

39

Loscos, Àlex. "Spectral processing of the singing voice." Doctoral thesis, Universitat Pompeu Fabra, 2007. http://hdl.handle.net/10803/7542.

Full text

Abstract:

Aquesta tesi doctoral versa sobre el processament digital de la veu cantada, més concretament, sobre l'anàlisi, transformació i síntesi d'aquest tipus de veu en el domini espectral, amb especial èmfasi en aquelles tècniques rellevants per al desenvolupament d'aplicacions musicals.

La tesi presenta nous procediments i formulacions per a la descripció i transformació d'aquells atributs específicament vocals de la veu cantada. La tesi inclou, entre d'altres, algorismes per l'anàlisi i la generació de desordres vocals com ara rugositat, ronquera, o veu aspirada, detecció i modificació de la freqüència fonamental de la veu, detecció de nasalitat, conversió de veu cantada a melodia, detecció de cops de veu, mutació de veu cantada, i transformació de veu a instrument; exemplificant alguns d'aquests algorismes en aplicacions concretes.
Esta tesis doctoral versa sobre el procesado digital de la voz cantada, más concretamente, sobre el análisis, transformación y síntesis de este tipo de voz en el dominio espectral, con especial énfasis en aquellas técnicas relevantes para el desarrollo de aplicaciones musicales.

La tesis presenta nuevos procedimientos y formulaciones para la descripción y transformación de aquellos atributos específicamente vocales de la voz cantada. La tesis incluye, entre otros, algoritmos para el análisis y la generación de desórdenes vocales como rugosidad, ronquera, o voz aspirada, detección y modificación de la frecuencia fundamental de la voz, detección de nasalidad, conversión de voz cantada a melodía, detección de los golpes de voz, mutación de voz cantada, y transformación de voz a instrumento; ejemplificando algunos de éstos en aplicaciones concretas.
This dissertation is centered on the digital processing of the singing voice, more concretely on the analysis, transformation and synthesis of this type of voice in the spectral domain, with special emphasis on those techniques relevant for music applications.

The thesis presents new formulations and procedures for both describing and transforming those attributes of the singing voice that can be regarded as voice specific. The thesis includes, among others, algorithms for rough and growl analysis and transformation, breathiness estimation and emulation, pitch detection and modification, nasality identification, voice to melody conversion, voice beat onset detection, singing voice morphing, and voice to instrument transformation; some of these are exemplified with concrete applications.

APA, Harvard, Vancouver, ISO, and other styles

40

Barton, Antony James. "Signal processing techniques for data reduction and event recognition in cough counting." Thesis, University of Manchester, 2013. https://www.research.manchester.ac.uk/portal/en/theses/signal-processing-techniques-for-data-reduction-and-event-recognition-in-cough-counting(dc73495a-35b0-4d17-a6f8-cc2f88008659).html.

Full text

Abstract:

This thesis presents novel techniques for the reduction of audio recordings and signal processing techniques as part of cough recognition. Evidence collected shows the reduction technique to be effective and the recognition techniques to give consistent performance across different patients. Cough is one of the commonest symptoms reported by patients to GPs. Despite this, it remains a significantly unmet medical need. At present, there exists no practical and validated technique for assessing the efficacy of therapies to treat cough on a large enough scale. Research that is presently undertaken requires fitting a patient with a recording system which will record their coughing and all other sound for a predefined period, usually 24 hours or less. This audio is then counted manually by trained cough counters to produce counts for each record which can be used as data for cough studies. Research in this field is relatively new, but a number of attempts have been made to automate this process. None so far have shown sufficient reliability or precision to be of sufficient use. The aim of this research is to analyse from the ground up signal processing techniques which can aid cough research. Specifically, the research will look into data minimisation techniques to improve the efficiency of manual counting techniques and recognition algorithms. The research has produced a published record reduction system which can reduce 24 hour cough records down to around 10% of their original size without compromising the statistics of subsequent manual counts. Additionally, a review of signal processing techniques for cough recognition has produced a robust event detection technique and measurement techniques which have shown remarkable consistency between patients and conditions. Throughout the research a clear understanding of the limitations and possible solutions is pursued and reported on to aid further progress in what is a young and developing research field.

APA, Harvard, Vancouver, ISO, and other styles

41

Schelinski, Stefanie. "Mechanisms of Voice Processing: Evidence from Autism Spectrum Disorder." Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/19091.

Full text

Abstract:

Die korrekte Wahrnehmung stimmlicher Information ist eine Grundvoraussetzung erfolgreicher zwischenmenschlicher Kommunikation. Die Stimme einer anderen Person liefert Information darüber wer spricht (Sprechererkennung), was gesagt wird (stimmliche Spracherkennung) und über den emotionalen Zustand einer Person (stimmliche Emotionserkennung). Autismus Spektrum Störungen (ASS) sind mit Einschränkungen in der Sprechererkennung und der stimmlichen Emotionserkennung assoziiert, während die Wahrnehmung stimmlicher Sprache relativ intakt ist. Die zugrunde liegenden Mechanismen dieser Einschränkungen sind bisher jedoch unklar. Es ist beispielsweise unklar, auf welcher Verarbeitungsstufe diese Einschränkungen in der Stimmenwahrnehmung entstehen oder ob sie mit einer Dysfunktion stimmensensitiver Hirnregionen in Verbindung stehen. Im Rahmen meiner Dissertation haben wir systematisch Stimmenverarbeitung und dessen Einschränkungen bei Erwachsenen mit hochfunktionalem ASS und typisch entwickelten Kontrollprobanden (vergleichbar in Alter, Geschlecht und intellektuellen Fähigkeiten) untersucht. In den ersten beiden Studien charakterisierten wir Sprechererkennung bei ASS mittels einer umfassenden verhaltensbezogenen Testbatterie und zweier funktionaler Magnet Resonanz Tomographie (fMRT) Experimente. In der dritten Studie untersuchten wir Mechanismen eingeschränkter stimmlicher Emotionserkennung bei ASS. Unsere Ergebnisse bringen neue Kenntnisse für Modelle zwischenmenschlicher Kommunikation und erhöhen unser Verständnis elementarer Mechanismen, die den Kernsymptomen in ASS wie Schwierigkeiten in der Kommunikation, zugrunde liegen könnten. Beispielsweise unterstützen unsere Ergebnisse die Annahme, dass Einschränkungen in der Wahrnehmung und Integration basaler sensorischer Merkmale (i.S. akustischer Merkmale der Stimme) entscheidend zu Einschränkungen in sozialer Kognition (i.S. Sprechererkennung und stimmliche Emotionserkennung) beitragen.
The correct perception of information carried by the voice is a key requirement for successful human communication. Hearing another person’s voice provides information about who is speaking (voice identity), what is said (vocal speech) and the emotional state of a person (vocal emotion). Autism spectrum disorder (ASD) is associated with impaired voice identity and vocal emotion perception while the perception of vocal speech is relatively intact. However, the underlying mechanisms of these voice perception impairments are unclear. For example, it is unclear at which processing stage voice perception difficulties occur, i.e. whether they are rather of apperceptive or associative nature or whether impairments in voice identity processing in ASD are associated with dysfunction of voice-sensitive brain regions. Within the scope of my dissertation we systematically investigated voice perception and its impairments in adults with high-functioning ASD and typically developing matched controls (matched pairwise on age, gender, and intellectual abilities). In the first two studies we characterised the behavioural and neuronal profile of voice identity recognition in ASD using two functional magnetic resonance imaging (fMRI) experiments and a comprehensive behavioural test battery. In the third study we investigated the underlying behavioural mechanisms of impaired vocal emotion recognition in ASD. Our results inform models of human communication and advance our understanding of basic mechanisms which might contribute to core symptoms in ASD, such as difficulties in communication. For example, our results converge to support the view that in ASD difficulties in perceiving and integrating lower-level sensory features, i.e. acoustic characteristics of the voice, might critically contribute to difficulties in higher-level social cognition, i.e. voice identity and vocal emotion recognition.

APA, Harvard, Vancouver, ISO, and other styles

42

Sukittanon, Somsak. "Modulation scale analysis : theory and application for nonstationary signal classification /." Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/5875.

Full text

APA, Harvard, Vancouver, ISO, and other styles

43

Bakheet, Mohammed. "Improving Speech Recognition for Arabic language Using Low Amounts of Labeled Data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176437.

Full text

Abstract:

The importance of Automatic Speech Recognition (ASR) systems, whose job is to generate text from audio, is increasing as the number of applications of these systems is rapidly going up. However, when it comes to training ASR systems, the process is difficult and rather tedious, and that could be attributed to the lack of training data. ASRs require huge amounts of annotated training data containing the audio files and the corresponding accurately written transcript files. This annotated (labeled) training data is very difficult to find for most languages; it usually requires people to perform the annotation manually, which, apart from its monetary cost, is error-prone. A supervised training task is impractical for this scenario. The Arabic language is one of the languages that do not have an abundance of labeled data, which makes its ASR system's accuracy very low compared to other resource-rich languages such as English, French, or Spanish. In this research, we take advantage of unlabeled voice data by learning general data representations from unlabeled training data (only audio files) in a self-supervised task or pre-training phase. This phase uses the wav2vec 2.0 framework, which masks out input in the latent space and solves a contrastive task. The model is then fine-tuned on a small amount of labeled data. We also exploit models that have been pre-trained on different languages with wav2vec 2.0 by fine-tuning them on annotated Arabic data. We show that using the wav2vec 2.0 framework for pre-training on Arabic is considerably time- and resource-consuming. It took the model 21.5 days (about 3 weeks) to complete 662 epochs and reach a validation accuracy of 58%.
Arabic is a right-to-left (RTL) language with many diacritics that indicate how letters should be pronounced. These two features make it difficult for Arabic to fit into these models, as heavy pre-processing of the transcript files is required. We demonstrate that we can fine-tune a cross-lingual model, trained on raw waveforms of speech in multiple languages, on Arabic data and obtain a low word error rate of 36.53%. We also show that by fine-tuning the model parameters we can increase the accuracy and thus decrease the word error rate from 54.00% to 36.69%.
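The results above are reported as word error rate (WER). A minimal sketch of the standard definition follows: the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. Whitespace tokenisation and the function name are assumptions made for illustration; the thesis's own scoring pipeline, including any diacritic normalisation for Arabic, is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, keeping only two rows of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.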

APA, Harvard, Vancouver, ISO, and other styles

44

Kwok, Kwok Sai. "Algorithms for image segmentation and their applications to video signal processing." Thesis, Imperial College London, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.244298.

Full text

APA, Harvard, Vancouver, ISO, and other styles

45

Birkenes, Øystein. "A Framework for Speech Recognition using Logistic Regression." Doctoral thesis, Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, 2007. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-1599.

Full text

Abstract:

Although discriminative approaches like the support vector machine or logistic regression have had great success in many pattern recognition applications, they have only achieved limited success in speech recognition. Two of the difficulties often encountered include 1) speech signals typically have variable lengths, and 2) speech recognition is a sequence labeling problem, where each spoken utterance corresponds to a sequence of words or phones.

In this thesis, we present a framework for automatic speech recognition using logistic regression. We solve the difficulty of variable length speech signals by including a mapping in the logistic regression framework that transforms each speech signal into a fixed-dimensional vector. The mapping is defined either explicitly with a set of hidden Markov models (HMMs) for the use in penalized logistic regression (PLR), or implicitly through a sequence kernel to be used with kernel logistic regression (KLR). Unlike previous work that has used HMMs in combination with a discriminative classification approach, we jointly optimize the logistic regression parameters and the HMM parameters using a penalized likelihood criterion.

Experiments show that joint optimization improves the recognition accuracy significantly. The sequence kernel we present is motivated by the dynamic time warping (DTW) distance between two feature vector sequences. Instead of considering only the optimal alignment path, we sum up the contributions from all alignment paths. Preliminary experiments with the sequence kernel show promising results.

A two-step approach is used for handling the sequence labeling problem. In the first step, a set of HMMs is used to generate an N-best list of sentence hypotheses for a spoken utterance. In the second step, these sentence hypotheses are rescored using logistic regression on the segments in the N-best list. A garbage class is introduced in the logistic regression framework in order to get reliable probability estimates for the segments in the N-best lists. We present results on both a connected digit recognition task and a continuous phone recognition task.
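The abstract describes a sequence kernel that sums contributions from all alignment paths instead of keeping only the optimal DTW path. The sketch below illustrates that general idea with a dynamic programming recursion over monotone alignments. It is not the kernel defined in the thesis: the Gaussian local similarity, the `gamma` parameter, and the function names are all assumptions chosen for illustration.

```python
import math


def local_similarity(a, b, gamma=1.0):
    """Gaussian similarity between two feature vectors (an assumed choice)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def all_paths_kernel(seq_x, seq_y, gamma=1.0):
    """Sum, over every monotone alignment path, of the product of local
    similarities along that path. DTW would keep only the best path."""
    n, m = len(seq_x), len(seq_y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # table[i][j] accumulates all paths aligning seq_x[:i] with seq_y[:j].
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = local_similarity(seq_x[i - 1], seq_y[j - 1], gamma)
            # A path reaches (i, j) by a vertical, horizontal or diagonal step.
            table[i][j] = k * (table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1])
    return table[n][m]
```

The two sequences may differ in length, which is the property that lets such a kernel compare variable-length utterances. For long sequences the products can underflow, so a practical version would likely work in the log domain.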

APA, Harvard, Vancouver, ISO, and other styles

46

Faubel, Friedrich [Verfasser], and Dietrich [Akademischer Betreuer] Klakow. "Statistical signal processing techniques for robust speech recognition / Friedrich Faubel. Betreuer: Dietrich Klakow." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2016. http://d-nb.info/1090875703/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

47

Meyer, Georg. "Models of neurons in the ventral cochlear nucleus : signal processing and speech recognition." Thesis, Keele University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.334715.

Full text

APA, Harvard, Vancouver, ISO, and other styles

48

Gooch, Richard M. "Machine learning techniques for signal processing, pattern recognition and knowledge extraction from examples." Thesis, University of Bristol, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.294898.

Full text

APA, Harvard, Vancouver, ISO, and other styles

49

健紘, 大田, and Kenko Ota. "Studies in signal processing for robust speech recognition in noisy and reverberant environments." Thesis, https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB10268908/?lang=0, 2008. https://doors.doshisha.ac.jp/opac/opac_link/bibid/BB10268908/?lang=0.

Full text

APA, Harvard, Vancouver, ISO, and other styles

50

Chai, Xiaoyong. "Sensor-based multiple-goal recognition /." View abstract or full-text, 2005. http://library.ust.hk/cgi/db/thesis.pl?COMP%202005%20CHAI.

Full text

APA, Harvard, Vancouver, ISO, and other styles
