Dissertations / Theses on the topic 'Mel-Frequency Cepstral coefficients'

To see the other types of publications on this topic, follow the link: Mel-Frequency Cepstral coefficients.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 45 dissertations / theses for your research on the topic 'Mel-Frequency Cepstral coefficients.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Darch, Jonathan J. A. "Robust acoustic speech feature prediction from Mel frequency cepstral coefficients." Thesis, University of East Anglia, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445206.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Edman, Sebastian. "Radar target classification using Support Vector Machines and Mel Frequency Cepstral Coefficients." Thesis, KTH, Optimeringslära och systemteori, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-214794.

Full text
Abstract:
In radar applications, one often wants to know not only that a target is reflecting the transmitted signals but also what kind of target is reflecting them. This project investigates the possibility of transforming reflected signals from raw radar data and making use of human perception, in particular our hearing, together with a machine learning approach in which patterns and characteristics in the data are used to answer this question. More specifically, the investigation treats two kinds of fairly comparable targets, namely smaller Unmanned Aerial Vehicles (UAVs) and birds. Complex-valued radar video, so-called I/Q data, generated by these targets is extracted using signal processing techniques and transformed into real signals, which are then converted into audible signals. A feature set commonly used in speech recognition, the Mel Frequency Cepstral Coefficients, is used to describe these signals together with two Support Vector Machine classification models. The two models were tested with an independent test set: the linear model achieved an overall prediction accuracy of 93.33 %, with individually 93.33 % correct classification on the UAVs and 93.33 % on the birds. The radial basis model achieved an overall prediction accuracy of 98.33 %, with individually 100 % correct classification on the UAVs and 96.76 % on the birds. The project is partly done in collaboration with J. Clemedson [2], whose focus is, as mentioned earlier, to transform the signals into audible signals.
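The step from complex-valued I/Q radar data to an audible real signal can be sketched in a few lines. The following is a minimal numpy illustration under my own assumptions (the function name and the 1 kHz frequency shift are illustrative, not taken from the thesis):

```python
import numpy as np

def iq_to_audible(iq, fs, f_shift=1000.0):
    """Mix complex baseband I/Q samples up to an audible carrier and keep the real part.

    iq: complex-valued samples, fs: sample rate in Hz,
    f_shift: shift frequency placing the signal in the audible band.
    """
    t = np.arange(len(iq)) / fs
    shifted = iq * np.exp(2j * np.pi * f_shift * t)  # frequency shift
    audio = shifted.real                              # real, audible signal
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio        # normalise to [-1, 1]
```

The resulting real signal can then be fed to an MFCC front end exactly as an ordinary audio recording would be.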
APA, Harvard, Vancouver, ISO, and other styles
3

Yang, Chenguang. "Security in Voice Authentication." Digital WPI, 2014. https://digitalcommons.wpi.edu/etd-dissertations/79.

Full text
Abstract:
We evaluate the security of human voice password databases from an information theoretical point of view. More specifically, we provide a theoretical estimation of the amount of entropy in human voice when processed using the conventional GMM-UBM technologies and MFCCs as the acoustic features. The theoretical estimation gives rise to a methodology for analyzing the security level in a corpus of human voice. That is, given a database containing speech signals, we provide a method for estimating the relative entropy (Kullback-Leibler divergence) of the database, thereby establishing the security level of the speaker verification system. To demonstrate this, we analyze the YOHO database, a corpus of voice samples collected from 138 speakers, and show that the amount of entropy extracted is less than 14 bits. We also present a practical attack that succeeds in impersonating the voice of any speaker within the corpus with a 98% success probability with as few as 9 trials. The attack will still succeed at a rate of 62.50% if 4 attempts are permitted. Further, based on the same attack rationale, we mount an attack on the ALIZE speaker verification system. We show through experimentation that the attacker can impersonate any user in a database of 69 people with about a 25% success rate with only 5 trials. The success rate exceeds 50% when the allowed authentication attempts are increased to 20. Finally, when the practical attack is cast in terms of an entropy metric, we find that the theoretical entropy estimate almost perfectly predicts the success rate of the practical attack, giving further credence to the theoretical model and the associated entropy estimation technique.
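The security argument above rests on relative entropy (Kullback-Leibler divergence) between speaker models. For two univariate Gaussians the divergence has a closed form, which gives a feel for the quantity being estimated; this is a simplified sketch of the concept, not the thesis's GMM-UBM computation:

```python
import math

def kl_gaussian(m1, s1, m2, s2):
    """KL divergence D( N(m1, s1^2) || N(m2, s2^2) ) in nats."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_gaussian_bits(m1, s1, m2, s2):
    """Same divergence expressed in bits, the unit used for the entropy estimate."""
    return kl_gaussian(m1, s1, m2, s2) / math.log(2)
```

Note that KL divergence is asymmetric: D(p||q) generally differs from D(q||p), which matters when interpreting it as a distance between a speaker model and the universal background model.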
APA, Harvard, Vancouver, ISO, and other styles
4

Wu, Qiming. "A robust audio-based symbol recognition system using machine learning techniques." University of the Western Cape, 2020. http://hdl.handle.net/11394/7614.

Full text
Abstract:
Masters of Science
This research investigates the creation of an audio-shape recognition system that is able to interpret a user’s drawn audio shapes—fundamental shapes, digits and/or letters—on a given surface such as a table-top using a generic stylus such as the back of a pen. The system aims to make use of one, two or three Piezo microphones, as required, to capture the sound of the audio gestures, and a combination of the Mel-Frequency Cepstral Coefficients (MFCC) feature descriptor and Support Vector Machines (SVMs) to recognise audio shapes. The novelty of the system is in the use of piezo microphones, which are low cost, lightweight and portable, and the main investigation is around determining whether these microphones are able to provide sufficiently rich information to recognise the audio shapes mentioned in such a framework.
APA, Harvard, Vancouver, ISO, and other styles
5

Candel, Ramón Antonio José. "Verificación automática de locutores aplicando pruebas diagnósticas múltiples en serie y en paralelo basadas en DTW (Dynamic Time Warping) y NFCC (Mel-Frequency Cepstral coefficients)." Doctoral thesis, Universidad de Murcia, 2015. http://hdl.handle.net/10803/300433.

Full text
Abstract:
This doctoral thesis presents the design of a system capable of performing automatic speaker verification, based on modelling with the DTW (Dynamic Time Warping) and MFCC (Mel-Frequency Cepstral Coefficients) procedures. Once designed, the system was evaluated both with individual tests, DTW and MFCC separately, and with multiple tests, combining the two in series and in parallel, on recordings obtained from the AHUMADA database of the Guardia Civil. All results were examined taking into account their statistical significance, derived from performing a given finite number of tests. Statistical results were obtained for different sizes of the databases used, which allows conclusions to be drawn about their influence on the method. In conclusion, we can identify the best system, defined by the type of model and the sample size, to use in a forensic study depending on the intended purpose.
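Combining two diagnostic tests in series and in parallel, as this thesis does with DTW and MFCC, follows standard rules: a series scheme accepts only when both tests accept, while a parallel scheme accepts when either does. Under an independence assumption (mine, not necessarily the thesis's), the combined sensitivity and specificity are:

```python
def combine_series(se1, sp1, se2, sp2):
    """Series combination: positive only if BOTH tests are positive (assumes independence)."""
    sensitivity = se1 * se2                           # both must detect
    specificity = 1.0 - (1.0 - sp1) * (1.0 - sp2)     # false alarm needs both to fail
    return sensitivity, specificity

def combine_parallel(se1, sp1, se2, sp2):
    """Parallel combination: positive if EITHER test is positive (assumes independence)."""
    sensitivity = 1.0 - (1.0 - se1) * (1.0 - se2)     # a miss needs both to miss
    specificity = sp1 * sp2                            # both must reject a non-match
    return sensitivity, specificity
```

The series scheme trades sensitivity for specificity and the parallel scheme the reverse, which is why the choice between them depends on the forensic purpose pursued.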
APA, Harvard, Vancouver, ISO, and other styles
6

Lindstål, Tim, and Daniel Marklund. "Application of LabVIEW and myRIO to voice controlled home automation." Thesis, Uppsala universitet, Signaler och System, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-380866.

Full text
Abstract:
The aim of this project is to use NI myRIO and LabVIEW for voice controlled home automation. The NI myRIO is an embedded device which has a Xilinx FPGA and a dual-core ARM Cortex-A9 processor as well as analog and digital input/output, and is programmed with LabVIEW, a graphical programming language. The voice control is implemented in two different systems. The first system is based on an Amazon Echo Dot for voice recognition, a commercial smart speaker developed by Amazon Lab126. The Echo Dot devices are connected via the Internet to the voice-controlled intelligent personal assistant service known as Alexa (developed by Amazon), which is capable of voice interaction, music playback, and controlling smart devices for home automation. In this system, the present thesis project focuses more on the myRIO used for the wireless control of smart home devices, where smart lamps, sensors, speakers and an LCD display were implemented. The other system focuses more on the myRIO for speech recognition and was built on the myRIO with a microphone connected. The speech recognition was implemented using mel frequency cepstral coefficients and dynamic time warping. A few commands could be recognized, including a wake word "Bosse" as well as four other commands for controlling the colors of a smart lamp. The thesis project is shown to be successful, having demonstrated that the implementation of home automation using the NI myRIO with two voice-controlled systems can correctly control home devices such as smart lamps, sensors, speakers and an LCD display.
APA, Harvard, Vancouver, ISO, and other styles
7

Larsson, Alm Kevin. "Automatic Speech Quality Assessment in Unified Communication : A Case Study." Thesis, Linköpings universitet, Programvara och system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159794.

Full text
Abstract:
Speech as a medium for communication has always been important in its ability to convey our ideas, personality and emotions. It is therefore not strange that Quality of Experience (QoE) becomes central to any business relying on voice communication. Using Unified Communication (UC) systems, users can communicate with each other in several ways using many different devices, making QoE an important aspect of such systems. In this thesis, automatic methods for assessing the speech quality of voice calls in Briteback's UC application are studied, including a comparison of the researched methods. Three methods, all using a Gaussian Mixture Model (GMM) as a regressor, paired respectively with extraction of Human Factor Cepstral Coefficients (HFCC), Gammatone Frequency Cepstral Coefficients (GFCC) and Modified Mel Frequency Cepstrum Coefficients (MMFCC) features, are studied. The method based on HFCC feature extraction shows better performance in general compared to the two other methods, but all methods show comparatively low performance relative to the literature. This most likely stems from implementation errors, showing the difference between theory and practice in the literature, together with the lack of reference implementations. Further work with practical aspects in mind, such as reference implementations or verification tools, can make the field more popular and increase its use in the real world.
APA, Harvard, Vancouver, ISO, and other styles
8

Neville, Katrina Lee, and katrina neville@rmit edu au. "Channel Compensation for Speaker Recognition Systems." RMIT University. Electrical and Computer Engineering, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080514.093453.

Full text
Abstract:
This thesis attempts to address the problem of how best to remedy different types of channel distortions on speech when that speech is to be used in automatic speaker recognition and verification systems. Automatic speaker recognition is when a person's voice is analysed by a machine and the person's identity is worked out by the comparison of speech features to a known set of speech features. Automatic speaker verification is when a person claims an identity and the machine determines if that claimed identity is correct or whether that person is an impostor. Channel distortion occurs whenever information is sent electronically through any type of channel whether that channel is a basic wired telephone channel or a wireless channel. The types of distortion that can corrupt the information include time-variant or time-invariant filtering of the information or the addition of 'thermal noise' to the information, both of these types of distortion can cause varying degrees of error in information being received and analysed. The experiments presented in this thesis investigate the effects of channel distortion on the average speaker recognition rates and testing the effectiveness of various channel compensation algorithms designed to mitigate the effects of channel distortion. The speaker recognition system was represented by a basic recognition algorithm consisting of: speech analysis, extraction of feature vectors in the form of the Mel-Cepstral Coefficients, and a classification part based on the minimum distance rule. 
Two types of channel distortion were investigated:
• Convolutional (or lowpass filtering) effects
• Addition of white Gaussian noise

Three different methods of channel compensation were tested:
• Cepstral Mean Subtraction (CMS)
• RelAtive SpecTrAl (RASTA) Processing
• Constant Modulus Algorithm (CMA)

The results from the experiments showed that for both CMS and RASTA processing, filtering at low cutoff frequencies (3 or 4 kHz) produced improvements in the average speaker recognition rates compared to speech with no compensation. The levels of improvement due to RASTA processing were higher than those achieved with the CMS method. Neither the CMS nor the RASTA method was able to improve the accuracy of the speaker recognition system for cutoff frequencies of 5 kHz, 6 kHz or 7 kHz. In the case of noisy speech, all methods analysed were able to compensate for high SNRs of 40 dB and 30 dB, and only RASTA processing was able to compensate and improve the average recognition rate for speech corrupted with a high level of noise (SNR of 20 dB and 10 dB).
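Of the three compensation methods, Cepstral Mean Subtraction is the simplest to state: a time-invariant convolutional channel becomes a constant additive offset in the cepstral domain, so subtracting the per-utterance mean of each coefficient track removes it. A minimal numpy sketch (the frames-by-coefficients layout is my assumption, not the thesis's code):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient track.

    cepstra: (n_frames, n_coeffs) array. A stationary convolutional channel
    shows up as a constant additive bias on every frame, which the mean
    subtraction cancels exactly.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

A useful sanity check is that any constant bias added to the input leaves the output unchanged, which is precisely the channel invariance the method provides.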
APA, Harvard, Vancouver, ISO, and other styles
9

Alvarenga, Rodrigo Jorge. "Reconhecimento de comandos de voz por redes neurais." Universidade de Taubaté, 2012. http://www.bdtd.unitau.br/tedesimplificado/tde_busca/arquivo.php?codArquivo=587.

Full text
Abstract:
Systems for speech recognition have widespread use in the industrial universe, in the improvement of human operations and procedures, and in the area of entertainment and recreation. The specific objective of this study was to design and develop a voice recognition system capable of identifying voice commands, regardless of the speaker. The main purpose of the system is to control movement of robots, with applications in industry and in aid of disabled people. We used a decision-making approach by means of a neural network trained with the distinctive features of the speech of 16 speakers. The samples of the voice commands were collected under the criterion of convenience (age and sex), to ensure greater discrimination between the voice characteristics and to reach the generalization of the neural network. Preprocessing consisted in the determination of the endpoints of each command signal and in adaptive Wiener filtering. Each speech command was segmented into 200 windows with an overlap of 25%. The features used were the zero-crossing rate, the short-term energy and the mel-frequency cepstral coefficients. The first two coefficients of linear predictive coding and its error were also tested. The neural network classifier was a multilayer perceptron, trained by the backpropagation algorithm. Several experiments were performed for the choice of thresholds, practical values, features and neural network configurations. Results were considered very good, reaching a success rate of 89.16% under the worst-case conditions for the sampling of the commands.
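The frame-level features named in the abstract (windowing with 25% overlap, zero-crossing rate, short-term energy) can be sketched as follows; the frame length and array layout here are illustrative choices of mine, not values from the dissertation:

```python
import numpy as np

def frame_signal(x, frame_len, overlap=0.25):
    """Slice a 1-D signal into fixed-length frames with the given fractional overlap."""
    hop = int(frame_len * (1.0 - overlap))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample pairs per frame whose sign changes."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_term_energy(frames):
    """Sum of squared samples per frame."""
    return np.sum(frames ** 2, axis=1)
```

ZCR is near zero for voiced (low-frequency dominated) frames and high for fricatives or noise, while short-term energy separates speech from silence, which is why the pair complements the MFCCs.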
APA, Harvard, Vancouver, ISO, and other styles
10

Larsson, Joel. "Optimizing text-independent speaker recognition using an LSTM neural network." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-26312.

Full text
Abstract:
In this paper a novel speaker recognition system is introduced. Automated speaker recognition has become increasingly popular, with the advances in computer science, as an aid in crime investigations and authorization processes. Here, a recurrent neural network approach is used to learn to identify ten speakers within a set of 21 audio books. Audio signals are processed via spectral analysis into Mel Frequency Cepstral Coefficients that serve as speaker-specific features, which are input to the neural network. The Long Short-Term Memory algorithm is examined for the first time within this area, with interesting results. Experiments are performed to find the optimum network model for the problem. These show that the network learns to identify the speakers well, text-independently, when the recording situation is the same. However, the system has problems recognizing speakers from different recordings, which is probably due to noise sensitivity of the speech processing algorithm in use.
APA, Harvard, Vancouver, ISO, and other styles
11

Hrabina, Martin. "VÝVOJ ALGORITMŮ PRO ROZPOZNÁVÁNÍ VÝSTŘELŮ." Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-409087.

Full text
Abstract:
This thesis deals with gunshot recognition and related problems. First, the whole task is introduced and broken down into smaller steps. Next, an overview is given of sound databases, significant publications, events and the current state of the art, together with a survey of possible applications of gunshot detection. The second part consists of comparing features using various metrics, together with a comparison of their recognition performance. A comparison of recognition algorithms follows, and new features usable for recognition are introduced. The work culminates in the design of a two-stage gunshot recognition system that monitors its surroundings in real time. The conclusion summarizes the achieved results and outlines further work.
APA, Harvard, Vancouver, ISO, and other styles
12

Zezula, Miroslav. "Online detekce jednoduchých příkazů v audiosignálu." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2011. http://www.nusl.cz/ntk/nusl-229484.

Full text
Abstract:
This thesis describes the development of a voice module that can recognize simple speech commands by comparing the input sound with recorded templates. The first part of the thesis contains a description of the algorithm used and a verification of its functionality. The algorithm is based on Mel-frequency cepstral coefficients and dynamic time warping. Thereafter, the hardware of the voice module is designed around the Freescale 56F805 signal controller. The signal from the microphone is conditioned by operational amplifiers and a digital filter. The third part deals with the development of software for the controller and describes the fixed-point implementation of the algorithm, respecting the limited capabilities of the controller. A final test proves the usability of the voice module in a low-noise environment.
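Template matching with dynamic time warping, as used in this voice module, aligns two feature sequences of possibly different lengths by dynamic programming. A minimal quadratic-time sketch in floating point (the Euclidean local cost is my assumption; the thesis's fixed-point controller version necessarily differs):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: arrays of shape (T, d); 1-D inputs are treated as d = 1.
    Returns the cumulative cost of the cheapest monotonic alignment.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Recognition then amounts to computing this distance between the incoming command's MFCC sequence and each stored template and picking the smallest, with time stretching in the utterance absorbed by the warping path.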
APA, Harvard, Vancouver, ISO, and other styles
13

Mahajan, Mayur. "Development of a speech recognition system using the Mel Frequency Cepstrum Coefficient method." Thesis, California State University, Long Beach, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10141515.

Full text
Abstract:

Voice recognition systems have found widespread use in applications such as tele-shopping, tele-banking, information services, home automation, voice message security, and voice call dialing, which allows a driver to make calls safely while driving.

This project presents the development of a high-performance speech recognition system using human voice models. Taking into account the behavior of the human ear, the Mel Frequency Cepstral Coefficient (MFCC) method is used to develop the system's capability for feature extraction. Vector quantization optimized by the Linde-Buzo-Gray (LBG) algorithm is used for feature matching. Experimental results show that the system has over a 90% success rate in the noise-free case, but the system performance deteriorates in the presence of noise. The system, however, has better recognition ability when the noise signal consists of harmonic components, as compared to a non-stationary, non-harmonic signal.
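The Linde-Buzo-Gray algorithm grows a vector-quantization codebook by repeatedly splitting each codeword and refining the result with Lloyd iterations. A compact numpy sketch of the idea (the split factor and iteration count are illustrative choices, not values from this project):

```python
import numpy as np

def lbg(data, n_codewords, eps=0.01, n_iter=20):
    """Train an LBG codebook by iterative splitting (n_codewords: a power of 2).

    data: (N, d) array of feature vectors (e.g. MFCC frames).
    Returns an (n_codewords, d) codebook.
    """
    codebook = data.mean(axis=0, keepdims=True)          # start from the global mean
    while len(codebook) < n_codewords:
        # split every codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                           # Lloyd refinement
            d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)                    # assign vectors to codewords
            for k in range(len(codebook)):
                members = data[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)    # recentre on the cell mean
    return codebook
```

For speaker or command matching, each enrolled class gets its own codebook, and a test utterance is scored by its average quantization distortion against each codebook.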

APA, Harvard, Vancouver, ISO, and other styles
14

Hrušovský, Enrik. "Automatická klasifikace výslovnosti hlásky R." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2018. http://www.nusl.cz/ntk/nusl-377664.

Full text
Abstract:
This diploma thesis deals with automatic classification of the pronunciation of the phoneme R. The purpose of this thesis is to create a program for detecting defective pronunciation of the phoneme R in children's speech. The thesis covers topics such as speech production, speech therapy, dyslalia, and subsequently speech signal processing and analysis methods. In the last part, software for automatic detection of the pronunciation of the phoneme R is designed. For recognition of the pronunciation, the MFCC algorithm is used for feature extraction. These features are subsequently classified by a neural network into a correct- or incorrect-pronunciation group, and the classification success rate is evaluated.
APA, Harvard, Vancouver, ISO, and other styles
15

Okuyucu, Cigdem. "Semantic Classification And Retrieval System For Environmental Sounds." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12615114/index.pdf.

Full text
Abstract:
The growth of multimedia content in recent years has motivated research in the area of audio classification and content retrieval. In this thesis, a general environmental audio classification and retrieval approach is proposed in which higher-level semantic classes (outdoor, nature, meeting and violence) are obtained from lower-level acoustic classes (emergency alarm, car horn, gun-shot, explosion, automobile, motorcycle, helicopter, wind, water, rain, applause, crowd and laughter). In order to classify an audio sample into acoustic classes, MPEG-7 audio features, the Mel Frequency Cepstral Coefficients (MFCC) feature and the Zero Crossing Rate (ZCR) feature are used with Hidden Markov Model (HMM) and Support Vector Machine (SVM) classifiers. Additionally, a new classification method using a Genetic Algorithm (GA) is proposed for classification of semantic classes. Query by Example (QBE) and keyword-based query capabilities are implemented for content retrieval.
APA, Harvard, Vancouver, ISO, and other styles
16

Assaad, Firas Souhail. "Biometric Multi-modal User Authentication System based on Ensemble Classifier." University of Toledo / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1418074931.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Dušil, Lubomír. "Automatické rozpoznávání logopedických vad v řečovém projevu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2009. http://www.nusl.cz/ntk/nusl-218161.

Full text
Abstract:
The thesis is aimed at the analysis and automatic detection of logopaedic defects in speech utterances. Its objective is to facilitate and accelerate the work of logopaedists and to increase the percentage of logopaedic defects detected in children at the youngest possible age, followed by the most successful treatment. It presents methods of working with speech, a classification of the defects within individual stages of child development, and words appropriate for identification of the speech defects and their subsequent remedy. It then analyses methods of calculating coefficients which best reflect human speech, as well as the classifiers used to determine whether or not a speech defect is present; the classifiers exploit these coefficients for their work. The coefficients and classifiers are tested, and their best combination is sought in order to achieve the highest possible success rate of the automatic detection of the speech defects. All programming and testing work has been conducted in the Matlab program.
APA, Harvard, Vancouver, ISO, and other styles
18

Pešek, Milan. "Detekce logopedických vad v řeči." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2009. http://www.nusl.cz/ntk/nusl-218106.

Full text
Abstract:
The thesis deals with the design and implementation of software for the detection of logopaedic speech defects. Due to the need for detecting logopaedic defects early, this software is aimed at speakers of child age. The introductory part describes the theory of speech production, modelling of speech production for numerical processing, phonetics, logopaedia and basic logopaedic speech defects. The methods used for feature extraction, for segmentation of words into speech sounds, and for classification of features into either a correct- or an incorrect-pronunciation class are also described. The next part of the thesis presents the results of testing the selected methods. For recognition of logopaedic speech defects, the MFCC and PLP algorithms are used to extract the features. The segmentation of words into speech sounds is performed using the Differential Function method. The extracted features of a sound are classified into either a correct- or an incorrect-pronunciation class with one of the tested pattern recognition methods: k-NN, SVM, ANN and GMM.
APA, Harvard, Vancouver, ISO, and other styles
19

Wang, Yihan. "Automatic Speech Recognition Model for Swedish using Kaldi." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285538.

Full text
Abstract:
With the development of the intelligent era, speech recognition has been a hot topic. Although many automatic speech recognition (ASR) tools have been put into the market, a considerable number of them do not support Swedish because of its small number of speakers. In this project, a Swedish ASR model based on Hidden Markov Models and Gaussian Mixture Models is established using Kaldi, which aims to help ICA Banken complete the classification of after-sales voice calls. A variety of model patterns have been explored, with different phoneme combination methods and eigenvalue extraction and processing methods. Word Error Rate and Real Time Factor are selected as evaluation criteria to compare the recognition accuracy and speed of the models. As far as large-vocabulary continuous speech recognition is concerned, triphone models are much better than monophone models. Adding feature transformation further improves both accuracy and speed. The combination of linear discriminant analysis, maximum likelihood linear transform and speaker adaptive training obtains the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients are more conducive to higher accuracy, while perceptual linear prediction tends to improve the overall speed.
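Word Error Rate, one of the two evaluation criteria above, is the word-level Levenshtein distance between the reference transcript and the hypothesis, normalized by the reference length. A plain-Python sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference and j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Because insertions also count, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth remembering when comparing models.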
APA, Harvard, Vancouver, ISO, and other styles
20

Лавриненко, Олександр Юрійович, Александр Юрьевич Лавриненко, and Oleksandr Lavrynenko. "Методи підвищення ефективності семантичного кодування мовних сигналів." Thesis, Національний авіаційний університет, 2021. https://er.nau.edu.ua/handle/NAU/52212.

Full text
Abstract:
The thesis is devoted to the solution of the actual scientific and practical problem in telecommunication systems, namely increasing the bandwidth of the semantic speech data transmission channel due to their efficient coding, that is the question of increasing the efficiency of semantic coding is formulated, namely – at what minimum speed it is possible to encode semantic features of speech signals with the set probability of their error-free recognition? It is on this question will be answered in this research, which is an urgent scientific and technical task given the growing trend of remote human interaction and robotic technology through speech, where the accurateness of this type of system directly depends on the effectiveness of semantic coding of speech signals. In the thesis the well-known method of increasing the efficiency of semantic coding of speech signals based on mel-frequency cepstral coefficients is investigated, which consists in finding the average values of the coefficients of the discrete cosine transformation of the prologarithmic energy of the spectrum of the discrete Fourier transform treated by a triangular filter in the mel-scale. The problem is that the presented method of semantic coding of speech signals based on mel-frequency cepstral coefficients does not meet the condition of adaptability, therefore the main scientific hypothesis of the study was formulated, which is that to increase the efficiency of semantic coding of speech signals is possible through the use of adaptive empirical wavelet transform followed by the use of Hilbert spectral analysis. Coding efficiency means a decrease in the rate of information transmission with a given probability of error-free recognition of semantic features of speech signals, which will significantly reduce the required passband, thereby increasing the bandwidth of the communication channel. 
In the process of proving the formulated scientific hypothesis of the study, the following results were obtained: 1) the first time the method of semantic coding of speech signals based on empirical wavelet transform is developed, which differs from existing methods by constructing a sets of adaptive bandpass wavelet-filters Meyer followed by the use of Hilbert spectral analysis for finding instantaneous amplitudes and frequencies of the functions of internal empirical modes, which will determine the semantic features of speech signals and increase the efficiency of their coding; 2) the first time it is proposed to use the method of adaptive empirical wavelet transform in problems of multiscale analysis and semantic coding of speech signals, which will increase the efficiency of spectral analysis due to the decomposition of high-frequency speech oscillations into its low-frequency components, namely internal empirical modes; 3) received further development the method of semantic coding of speech signals based on mel-frequency cepstral coefficients, but using the basic principles of adaptive spectral analysis with the application empirical wavelet transform, which increases the efficiency of this method. Conducted experimental research in the software environment MATLAB R2020b showed, that the developed method of semantic coding of speech signals based on empirical wavelet transform allows you to reduce the encoding speed from 320 to 192 bit/s and the required passband from 40 to 24 Hz with a probability of error-free recognition of about 0.96 (96%) and a signal-to-noise ratio of 48 dB, according to which its efficiency increases 1.6 times in contrast to the existing method. 
The results obtained in the thesis can be used to build systems for remote interaction of people and robotic equipment using speech technologies, such as speech recognition and synthesis, voice control of technical objects, low-speed encoding of speech information, voice translation from foreign languages, etc.
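The Hilbert spectral analysis step that the dissertation applies to its empirical modes — extracting instantaneous amplitudes and frequencies via the analytic signal — can be sketched with an FFT-based analytic signal applied to a single synthetic mode. This is an illustrative sketch only, not the dissertation's implementation; the 50 Hz tone and all parameters are hypothetical.

```python
import numpy as np

def analytic(x):
    # Analytic signal via the FFT (the classic Hilbert-transform trick):
    # keep DC and Nyquist, double positive frequencies, zero negative ones.
    X = np.fft.fft(x)
    h = np.zeros(len(x))
    h[0] = 1.0
    h[1:len(x) // 2] = 2.0
    h[len(x) // 2] = 1.0  # even-length signal assumed
    return np.fft.ifft(X * h)

fs = 1000
t = np.arange(1000) / fs
x = 2.0 * np.sin(2 * np.pi * 50 * t)       # one synthetic "empirical mode"

z = analytic(x)
amp = np.abs(z)                             # instantaneous amplitude
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency in Hz
```

For this pure tone the instantaneous amplitude recovers the 2.0 envelope and the instantaneous frequency sits at 50 Hz.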
APA, Harvard, Vancouver, ISO, and other styles
21

Bekli, Zeid, and William Ouda. "A performance measurement of a Speaker Verification system based on a variance in data collection for Gaussian Mixture Model and Universal Background Model." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20122.

Full text
Abstract:
Voice recognition has become a more focused and researched field in the last century, and new techniques to identify speech have been introduced. A part of voice recognition is speaker verification, which is divided into a front-end and a back-end. The first component is the front-end, or feature extraction, where techniques such as Mel-Frequency Cepstrum Coefficients (MFCC) are used to extract the speaker-specific features of a speech signal; MFCC is mostly used because it is based on the known variations of the human ear's critical frequency bandwidth. The second component is the back-end, which handles the speaker modeling. The back-end is based on the Gaussian Mixture Model (GMM) and Gaussian Mixture Model-Universal Background Model (GMM-UBM) methods for enrollment and verification of the specific speaker. In addition, normalization techniques such as Cepstral Mean Subtraction (CMS) and feature warping are also used for robustness against noise and distortion. In this paper, we build a speaker verification system, experiment with variation in the amount of training data for the true speaker model, and evaluate system performance. To further investigate the area of security in a speaker verification system, the two methods (GMM and GMM-UBM) are compared to determine which is more secure depending on the amount of training data available. This research therefore contributes to understanding how much data is really necessary for a secure system where the False Positive rate is as close to zero as possible, how the amount of training data affects the False Negative (FN) rate, and how this differs between GMM and GMM-UBM. The results show that an increase in speaker-specific training data increases the performance of the system.
However, too much training data has been proven to be unnecessary, because system performance eventually reaches its highest point; in this case that was around 48 minutes of data, and the results also show that the GMM-UBM models trained on 48 to 60 minutes of data outperformed the GMM models.
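The GMM-UBM verification idea described above can be illustrated with a small sketch. Note the simplifications, which are mine and not the thesis's: synthetic 2-D "features" stand in for MFCC vectors, scikit-learn's GaussianMixture is assumed, and the speaker model is trained directly on enrollment data, whereas a full GMM-UBM system would MAP-adapt the UBM means instead.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical 2-D "MFCC-like" features for a background population,
# one enrolled speaker, and two test utterances.
background = rng.normal(0.0, 2.0, size=(2000, 2))
speaker_train = rng.normal([3.0, 3.0], 0.5, size=(300, 2))
genuine_test = rng.normal([3.0, 3.0], 0.5, size=(100, 2))
impostor_test = rng.normal([-3.0, -1.0], 0.5, size=(100, 2))

# UBM: a GMM trained on the background population.
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(background)
# Speaker model: here simply trained on enrollment data (a full system
# would MAP-adapt the UBM instead of refitting from scratch).
spk = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(speaker_train)

def llr(x):
    # Verification score: average log-likelihood ratio, speaker vs. UBM.
    return spk.score(x) - ubm.score(x)
```

A genuine trial should score above the impostor trial, and thresholding this ratio gives the accept/reject decision.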
APA, Harvard, Vancouver, ISO, and other styles
22

Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition." Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.

Full text
Abstract:
In automatic speech recognition systems (ASRs), training is a critical phase for the system's success. Communication media, either analog (such as analog landline phones) or digital (VoIP), distort the speaker's speech signal, often in very complex ways: linear distortion occurs in all channels, either in the magnitude or the phase spectrum. Non-linear but time-invariant distortion will always appear in all real systems. In digital systems we also have network effects, which produce packet losses, delays, and repeated packets. Finally, one cannot really assert what path a signal will take, so error or distortion along the way is almost a certainty. The channel introduces an acoustic mismatch between the speaker's signal and the training data in the ASR, which results in poor recognition performance. The approach so far has been to try to undo the havoc produced by the channels, i.e., to compensate for the channel's behavior. In this thesis, we instead try to characterize the effects of different transmission media and use that as an inexpensive and repeatable way to train ASR systems.
APA, Harvard, Vancouver, ISO, and other styles
23

Kuo, Yo-zhen, and 郭又禎. "Improved Mel-scale Frequency Cepstral Coefficients for Keyword Spotting Technique." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/27592493670347223949.

Full text
Abstract:
Master's thesis
National Central University
Department of Electrical Engineering
102
In speech recognition systems, Mel-frequency cepstral coefficients (MFCCs) are widely used feature parameters. Because of the wide application of MFCCs in audio signal processing, many studies on improving MFCCs have been presented. In this study, we use the particle swarm optimization algorithm to optimize the weights of the MFCC filter bank. We use the difference between the energy statistics curve of the voice training database and the envelope of the MFCC filter bank as the fitness function. Experimental results show that the proposed MFCC method improves the recognition rate. In noisy-environment experiments, the presented MFCC method also improves recognition performance.
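As a rough illustration of particle swarm optimization in this kind of weight-tuning role, here is a minimal global-best PSO minimizing a toy quadratic objective. The "target envelope" below merely stands in for the thesis's curve-matching fitness function and is not taken from it; all parameters are generic defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(fitness, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    # Standard global-best particle swarm optimization (minimization).
    pos = rng.uniform(-1, 1, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.apply_along_axis(fitness, 1, pos)
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity update: inertia + cognitive pull + social pull.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        val = np.apply_along_axis(fitness, 1, pos)
        better = val < pbest_val
        pbest[better], pbest_val[better] = pos[better], val[better]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

# Toy stand-in for the thesis's objective: distance between candidate
# "filter weights" and a hypothetical target energy envelope.
target = np.linspace(0.2, 0.8, 5)
best = pso(lambda x: np.sum((x - target) ** 2), dim=5)
```

On this convex toy problem the swarm converges essentially onto the target weights.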
APA, Harvard, Vancouver, ISO, and other styles
24

Tang, Chu-Liang, and 唐曲亮. "Improved Mel Frequency Cepstral Coefficients Combined with Multiple Speech Features." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/57856949340151071584.

Full text
Abstract:
Master's thesis
National Central University
Department of Electrical Engineering
103
This thesis studies speech feature extraction and feature compensation in speech recognition. Several speech features are selected for combination; the best combination cascades Linear Prediction Cepstral Coefficients (LPCC) with Mel-Frequency Cepstral Coefficients (MFCC). The MFCCs used here are obtained with a Gaussian Mel-frequency filter bank instead of a triangular filter bank. Experiments show that the best combination ratio of LPCC to MFCC is 1:1. The thesis also shows that further improvement is possible if Cepstral Mean and Variance Normalization (CMVN) is added.
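The CMVN step mentioned at the end of the abstract is simple to state: each cepstral dimension is normalized to zero mean and unit variance over the utterance. A minimal NumPy sketch, where the random "features" are placeholders for real cepstral frames:

```python
import numpy as np

def cmvn(features):
    # Cepstral Mean and Variance Normalization: normalize each cepstral
    # dimension to zero mean, unit variance over the utterance
    # (frames on axis 0, coefficients on axis 1).
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10  # guard against zero variance
    return (features - mu) / sigma

# Placeholder "cepstral features": 200 frames of 13 coefficients.
feats = np.random.default_rng(1).normal(5.0, 3.0, size=(200, 13))
norm = cmvn(feats)
```

The normalization removes static channel offsets (the mean) and scales per-coefficient dynamic range, which is why it helps robustness.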
APA, Harvard, Vancouver, ISO, and other styles
25

林士棻. "Bird songs recognition using two-dimensional Mel-scale frequency cepstral coefficients." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/38302762655714685237.

Full text
Abstract:
Master's thesis
Chung Hua University
Department of Computer Science and Information Engineering
94
We propose a method to automatically identify birds from their sounds. First, each syllable corresponding to a piece of vocalization is segmented. The average LPCC (ALPCC), average MFCC (AMFCC), static MFCC (SMFCC), two-dimensional MFCC (TDMFCC), dynamic two-dimensional MFCC (DTDMFCC), and TDMFCC+DTDMFCC over all frames in a syllable are calculated as vocalization features. Linear discriminant analysis (LDA) is exploited to increase classification accuracy in a lower-dimensional feature vector space. A clustering algorithm, called the progressive constructive clustering (PCC) algorithm, is used to divide the feature vectors computed from the same bird species into several subclasses. In our experiments, TDMFCC+DTDMFCC achieves average classification accuracies of 90% and 89% for 420 and 561 bird species, respectively.
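The LDA step the abstract mentions — projecting features into a lower-dimensional, more class-separable space and classifying there — can be sketched with scikit-learn on synthetic "syllable features". The three classes, dimensions, and means below are all hypothetical, not taken from the thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical 10-D syllable feature vectors for three bird species,
# 60 syllables each, with well-separated class means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(60, 10))
               for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 60)

# LDA both projects to at most (n_classes - 1) dimensions and classifies.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)     # reduced 2-D discriminant space
acc = lda.score(X, y)    # training accuracy on the toy data
```

The projection keeps only the directions that best separate the species, which is exactly why it helps a downstream classifier at lower dimensionality.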
APA, Harvard, Vancouver, ISO, and other styles
26

Lin, Shih-Fen, and 林士棻. "Bird songs recognition using two-dimensional Mel-scale frequency cepstral coefficients." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/94553686394732089037.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

HUANG, CHUAN-HAO, and 黃川豪. "Multi-feature Speaker Verification Based on Mel-frequency cepstral coefficients and Formants." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/4nbqev.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

CHIANG, MING-DA, and 蔣明達. "Speaker Recognition Using Mel-Scale Frequency Cepstral Coefficients by Time Domain Filtering method." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/13444981721982290438.

Full text
Abstract:
Master's thesis
China Institute of Technology
Master's Program, Institute of Electronic Engineering
96
According to past papers, we find that algorithms based on Mel-frequency cepstral coefficients (MFCCs) perform better than algorithms based on other feature parameters [1-7]. The Mel-frequency cepstral coefficients are obtained by the following procedure: framing, multiplication by a Hamming window, taking the fast Fourier transform (FFT), filtering in the frequency domain by a Mel-frequency triangular filter bank, calculating the logarithmic energy of the filter outputs, and taking the discrete cosine transform (DCT) to obtain the Mel-frequency cepstral coefficients. In this paper, for computing the Mel-frequency cepstral coefficients, the conventional frequency-domain filtering procedure [1] is replaced by the direct time-domain filtering procedure proposed in this thesis. The simulation results show that the performance of our new method and of the previous approach [1] is quite similar.
APA, Harvard, Vancouver, ISO, and other styles
29

Xu, Sheng-Bin, and 徐勝斌. "Continuous Birdsong Recognition Using Dynamic and Temporal Two-Dimensional Mel-Frequency Cepstral Coefficients." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/21749503795140776068.

Full text
Abstract:
Master's thesis
Chung Hua University
Department of Computer Science and Information Engineering
97
In this paper, we propose an approach for classifying bird species using fixed-duration sound segments extracted from continuous birdsong recordings. First, each sound segment is divided into a number of overlapping texture windows. Each texture window is classified individually, and a fusion approach then determines the classification result for the input segment. Features derived from the static, transitional, and temporal information of two-dimensional Mel-frequency cepstral coefficients (TDMFCC) are extracted for the classification of each texture window. TDMFCC describes both static and dynamic characteristics of a texture window; dynamic TDMFCC (DTDMFCC) describes sharp transitions within a texture window; and global dynamic TDMFCC (GDTDMFCC) describes long-term temporal variations in a texture window. The concepts of DTDMFCC, which computes local regression coefficients, and GDTDMFCC, which evaluates global contrast information, can be integrated to form a new feature vector called global and local DTDMFCC (GLDTDMFCC). Furthermore, we use principal component analysis (PCA) to reduce the feature dimension, Gaussian mixture models (GMM) to model the sounds of different bird species, and linear discriminant analysis (LDA) to improve classification accuracy in a lower-dimensional feature vector space. In our experiments, the highest average classification accuracy is 94.62% for the classification of 28 bird species.
APA, Harvard, Vancouver, ISO, and other styles
30

Lin, Bo-Zhi, and 林柏志. "Speaker Recognition Algorithm Using Mel-Scale Frequency Cepstral Coefficients with Two Stages Linear Prediction Filters." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/18209732501243789128.

Full text
Abstract:
Master's thesis
China Institute of Technology
Master's Program, Institute of Electronic Engineering
94
The development of computer and communication technologies has accelerated the application requirements of speaker recognition and speech recognition. The purpose of this paper is to present a new algorithm to improve the performance of speaker recognition. The algorithm uses two-stage linear prediction error filters to estimate the spectrogram of the processed speech signal. Then it uses a Mel-scale triangular bandpass filter bank to obtain the Mel-scale frequency cepstral coefficients (MFCC) needed to build the Gaussian mixture model for speaker recognition. To verify that the algorithm works well and to compare its performance with other algorithms, we use the Mandarin speech database MAT-400, purchased from the Association for Computational Linguistics and Chinese Language Processing. The experimental results show that the proposed algorithm has the best performance at higher signal-to-noise ratios.
APA, Harvard, Vancouver, ISO, and other styles
31

Yang-Ming, Cheng, and 鄭陽銘. "A Mel-Scale Frequency Cepstral Coefficients Speaker Recognition Algorithm Based on Linear Prediction Spectrum Estimation." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/38345345070598427641.

Full text
Abstract:
Master's thesis
China Institute of Technology
Master's Program, Institute of Electronic Engineering
93
According to past research, spectrum estimation based on linear prediction is more robust than spectrum estimation based on the FFT at lower SNR. In this paper, we propose a new speaker identification algorithm based on linear prediction spectrum estimation. In this algorithm, spectrum estimation based on the short-time fast Fourier transform is replaced by linear prediction spectrum estimation; the Mel-scale frequency cepstral coefficients are then obtained using the Mel-scale triangular filter bank. Experimental results show that the new algorithm performs better than the FFT-based algorithm at lower SNR.
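Linear-prediction spectrum estimation as described above can be sketched with the autocorrelation method: solve the Yule-Walker equations for the predictor coefficients, then evaluate 1/|A(e^{jw})|^2. A NumPy sketch on a synthetic noisy 1 kHz tone; the prediction order and frame size are illustrative choices, not the thesis's.

```python
import numpy as np

def lpc(frame, order=10):
    # Autocorrelation method: solve the Yule-Walker normal equations
    # R a = r for the predictor, then return A(z) = 1 - sum a_k z^-k.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def lp_spectrum(frame, order=10, n_fft=512):
    # LP-based power spectrum estimate: 1 / |A(e^{jw})|^2.
    A = np.fft.rfft(lpc(frame, order), n_fft)
    return 1.0 / np.abs(A) ** 2

sr = 8000
rng = np.random.default_rng(0)
t = np.arange(256) / sr
frame = (np.sin(2 * np.pi * 1000 * t)
         + 0.01 * rng.standard_normal(256)) * np.hamming(256)

spec = lp_spectrum(frame)
peak_hz = np.argmax(spec) * sr / 512  # location of the spectral peak
```

The all-pole estimate puts a sharp peak at the tone frequency, which is the smoothness property that makes LP spectra attractive at low SNR.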
APA, Harvard, Vancouver, ISO, and other styles
32

Bowman, Casady. "Perceiving Emotion in Sounds: Does Timbre Play a Role?" Thesis, 2011. http://hdl.handle.net/1969.1/ETD-TAMU-2011-12-10656.

Full text
Abstract:
Acoustic features of sound such as pitch, loudness, perceived duration, and timbre have been shown to be related to perceived emotion, yet an account of the connection between perceived emotions and timbre is still lacking. This study investigates the relationship between acoustic features of sound and emotion with respect to timbre. In two experiments we investigated whether particular acoustic components of sound can predict timbre and particular categories of emotion, and how these attributes are related. Two behavioral experiments related perceived emotion ratings to synthetically created sounds and to the International Affective Digitized Sounds (IADS; Bradley & Lang, 2007). Two timbre experiments extracted the acoustic components of the synthetically created sounds and of the IADS. Regression analyses uncovered relationships between emotion, timbre, and acoustic features of sound. The results indicate that emotion is perceived differently for synthetic instrumental sounds and for the IADS. Mel-frequency cepstral coefficients were a strong predictor of the perceived emotion of instrumental sounds; however, this was not the case for the IADS. This difference lends itself to the idea that there is a strong relationship between emotion and timbre for instrumental sounds, perhaps in part because of their relationship to speech and the way these different sounds are processed.
APA, Harvard, Vancouver, ISO, and other styles
33

Wu, Sunrise, and 吳尚叡. "Design Time Domain Filter Banks Using Least Squares Method to Calculate the Mel-Frequency Cepstral Coefficients for Speaker Recognition." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/08178129842426697899.

Full text
Abstract:
Master's thesis
China Institute of Technology
Master's Program, Institute of Electronic Engineering
96
Up to now, the best speaker recognition techniques have been based on the Mel-frequency cepstral coefficients (MFCCs) method [1-4,11]. The main procedure for obtaining MFCCs is: framing, Hamming windowing, applying the FFT (Fast Fourier Transform) [7], filtering by a Mel-scale triangular filter bank, taking the logarithmic energies of the outputs, and applying the DCT (Discrete Cosine Transform) [1-8]. After these processes, the MFCCs are obtained. The main topic of this thesis is to replace the FFT [7] and the filtering with a frequency-domain Mel-scale triangular filter bank [15] by filtering with a time-domain Mel-scale triangular filter bank. The time-domain Mel-scale triangular filter bank [1-8,14] is obtained by the least squares method [10,13] and is used to obtain the Mel-frequency cepstral coefficients of speaker speech. From the results of our experiments, we find that the successful speaker recognition rates of the conventional MFCC method [2,3,6,14] and of our new approach are very similar.
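A least-squares design of a single time-domain triangular (mel-style) band filter can be sketched by fitting a linear-phase FIR amplitude response to a triangular frequency target. The band edges, tap count, and grid size below are hypothetical illustration values, not the thesis's design.

```python
import numpy as np

fs = 16000
M = 100                                  # filter length 2M + 1 = 201 taps
grid = np.linspace(0, np.pi, 2048)       # dense frequency grid [0, pi]
f = grid * fs / (2 * np.pi)

# Hypothetical triangular mel-style band: rises 500 -> 750 Hz,
# falls 750 -> 1050 Hz, zero elsewhere.
lo, ctr, hi = 500.0, 750.0, 1050.0
target = np.interp(f, [0, lo, ctr, hi, fs / 2], [0, 0, 1, 0, 0])

# Linear-phase (type I) amplitude model: A(w) = c0 + 2 * sum c_n cos(n w).
n = np.arange(M + 1)
C = np.cos(grid[:, None] * n[None, :]) * np.where(n == 0, 1.0, 2.0)
c, *_ = np.linalg.lstsq(C, target, rcond=None)  # least-squares fit

# Symmetric time-domain impulse response built from the cosine weights.
h = np.concatenate([c[::-1], c[1:]])

# Verify the filter's frequency response against the triangular target.
H = np.abs(np.fft.rfft(h, 8192))
fr = np.fft.rfftfreq(8192, 1 / fs)
```

Filtering frames with `h` in the time domain then plays the role of one frequency-domain triangular mel filter; a full bank repeats the design per band.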
APA, Harvard, Vancouver, ISO, and other styles
34

Yuan, Hor, and 原禾. "Design Time Domain Filter Banks Using Least Squares Method to Calculate the Mel-Frequency Cepstral Coefficients for Non-Continuous Speech Recognition." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/76162451347630250736.

Full text
Abstract:
Master's thesis
China Institute of Technology
Master's Program, Institute of Electronic Engineering
97
In speech recognition, Mel frequency cepstral coefficients (MFCCs) are currently popular in both speech recognition and speaker recognition [2,8-11,14,15]. To obtain MFCCs, the main procedure filters the speech signal with a set of triangular Mel-scale filters in the frequency domain, takes the logarithm of the filter-bank output powers, and then applies the Discrete Cosine Transform. In this paper, the frequency-domain triangular Mel-scale filter bank is replaced by a newly designed time-domain triangular Mel-scale filter bank. The experimental results show that the performance of speech recognition algorithms extracting MFCCs with the conventional triangular Mel-scale filter bank and with the newly designed time-domain Mel-scale filter bank is very similar.
APA, Harvard, Vancouver, ISO, and other styles
35

Sujatha, J. "Improved MFCC Front End Using Spectral Maxima For Noisy Speech Recognition." Thesis, 2005. http://etd.iisc.ernet.in/handle/2005/1506.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Lei, Ying, and 雷穎. "Chip Design of Mel Frequency Cepstral Coefficient for Speech Recognition." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/69494964042006361607.

Full text
Abstract:
Master's thesis
National Chi Nan University
Department of Electrical Engineering
94
This paper proposes a chip design of speech recognition for multimedia systems. It is composed of three cores: a low-power, high-performance fast Fourier transform (FFT) processor, a Mel-scale frequency cepstral coefficient (MFCC) circuit, and a dual-ALU digital signal processing (DSP) processor running a dynamic time warping speech recognition algorithm. The DSP processor had been implemented by a previous researcher; in this paper, we mainly propose the FFT processor and the MFCC chip. In the FFT processor, we propose a novel register-array-based pipelined radix-2 structure to reduce power consumption and computation cycles. In the MFCC circuit, we adopt a paired accumulation procedure to reduce the computation of the Mel frequency bank. In addition, we minimize the look-up table size for logarithm operations and use clock gating to reduce power consumption. The two chips are synthesized with the TSMC 0.18um cell library. The die size of the FFT/IFFT processor is approximately 4.73 mm2, and the die size of the MFCC chip is approximately 1.71 mm2. Both chips work at 100 MHz.
APA, Harvard, Vancouver, ISO, and other styles
37

Liu, Yi-Ming. "The Chip Design of Reconfigurable FFT-Based Mel Frequency Cepstrum Coefficient." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0020-1607200815503000.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Liu, Yi-Ming, and 劉益銘. "The Chip Design of Reconfigurable FFT-Based Mel Frequency Cepstrum Coefficient." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/51081196289542485757.

Full text
Abstract:
Master's thesis
National Chi Nan University
Department of Electrical Engineering
96
This thesis proposes a chip design of speech recognition for multimedia systems. It is composed of three cores: a reconfigurable fast Fourier transform (FFT) processor, a Mel-scale frequency cepstrum coefficient (MFCC) circuit, and a dual-ALU digital signal processing (DSP) processor running a dynamic time warping speech recognition algorithm. The FFT processor uses a reconfigurable architecture; we propose a novel register-array-based pipelined radix-2^2 structure to reduce power consumption and computation cycles. The FFT processor can be widely used in speech recognition, image processing, and communication systems. The MFCC chip is synthesized with the TSMC 0.13um cell library. The gate count of the MFCC chip is about 4767, the latency is about 2.60 us, and the chip works at 100 MHz.
APA, Harvard, Vancouver, ISO, and other styles
39

(6642491), Jingzhao Dai. "SPARSE DISCRETE WAVELET DECOMPOSITION AND FILTER BANK TECHNIQUES FOR SPEECH RECOGNITION." Thesis, 2019.

Find full text
Abstract:

Speech recognition is widely applied to translation from speech to text, voice-driven commands, human-machine interfaces, and so on [1]-[8]. It has increasingly pervaded human life in the modern age. To improve the accuracy of speech recognition, various algorithms such as artificial neural networks and hidden Markov models have been developed [1], [2].

In this thesis work, speech recognition tasks with various classifiers are investigated. The classifiers employed include the support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), and convolutional neural network (CNN). Two novel feature extraction methods, sparse discrete wavelet decomposition (SDWD) and bandpass filtering (BPF) based on the Mel filter banks [9], are developed and proposed. To meet the diversity of the classification algorithms, both one-dimensional (1D) and two-dimensional (2D) features are required. The 1D features are arrays of power coefficients in frequency bands, used for training the SVM, KNN, and RF classifiers, while the 2D features capture both frequency content and temporal variation: each 2D feature consists of the power values in the decomposed bands versus consecutive speech frames. Most importantly, the 2D features with geometric transformation are adopted to train the CNN.

Speech recordings of both males and females come from the recorded data set as well as the standard data set. Firstly, recordings with little noise and clear pronunciation are processed with the proposed feature extraction methods. After many trials and experiments on this dataset, high recognition accuracy is achieved. Then these feature extraction methods are further applied to the standard recordings, which have random characteristics with ambient noise and unclear pronunciation. Many experimental results validate the effectiveness of the proposed feature extraction techniques.
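The idea of wavelet-decomposition features — one power value per decomposed band — can be sketched with a plain Haar DWT. This is a generic stand-in for the thesis's sparse decomposition, not its implementation; the test signal and level count are arbitrary.

```python
import numpy as np

def haar_dwt(x):
    # One level of the orthonormal Haar DWT: scaled sums (approximation)
    # and scaled differences (detail) of adjacent samples.
    x = x[:len(x) // 2 * 2]
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def band_powers(x, levels=4):
    # Decompose recursively; keep the mean power of each detail band
    # plus the final approximation -> a 1-D feature vector per signal.
    feats = []
    approx = x
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        feats.append(np.mean(detail ** 2))
    feats.append(np.mean(approx ** 2))
    return np.array(feats)

sig = np.sin(2 * np.pi * 1000 * np.arange(2048) / 8000.0)
f = band_powers(sig)  # 4 detail bands + 1 approximation band
```

Because the transform is orthonormal, the band energies (mean power times band length) sum exactly to the signal energy, so nothing is lost by the decomposition; computing such a vector per frame yields the 2-D bands-versus-frames feature described above.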

APA, Harvard, Vancouver, ISO, and other styles
40

Weng, Yu-Sheng, and 翁育生. "The Chip Design of Mel Frequency Cepstrum Coefficient for HMM Speech Recognition." Thesis, 1998. http://ndltd.ncl.edu.tw/handle/80385881218692562375.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Department of Electrical Engineering
86
The Mel Frequency Cepstrum Coefficient is one kind of speech feature parameter, derived from the physical characteristics of human hearing; it models human speech well, is more straightforward to compute than the LPC cepstrum, and yields a good recognition rate. Speech recognition has recently adopted the hidden Markov model (HMM) as its main framework because of its good results in many related applications, so that model is used here to verify recognition after parameter extraction. Our purpose is to implement the MFCC algorithm in hardware, functioning as the speech feature extraction module in the overall recognition system. In this thesis, we first study the original MFCC algorithm in detail and analyze its required computational load. We utilize the simplified cosine table-lookup method to reduce the memory requirement and the number of multiplications to one half. Secondly, both the multiplication operations and the memory size concerning the weighted energy spectrum are cut down to one half by taking advantage of the mapping between the mel scale and the frequency scale. Finally, we perform the logarithm operations by means of the modified partitioned logarithm table look-up method, which maintains the original accuracy while requiring fewer intermediate operations during table look-up and decreasing the required table storage by as much as 50%. The chip has been implemented using the TSMC 0.6um cell library. The chip size is 3.2*3.3 mm2, it contains 120 I/O pads, and the gate count is about 10000. The maximum working frequency is 50 MHz, which fully meets the requirement of real-time speech feature calculation.
APA, Harvard, Vancouver, ISO, and other styles
41

Chen, Chia-Yu, and 陳佳妤. "The Investigation of Chinese Vowel Recognition for Mel-Frequency Cepstrum Coefficient Feature." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/9jf6bu.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Institute of Statistics
99
This paper mainly discusses speaker-dependent vowel recognition of 337 isolated Mandarin words. We use Mel-Frequency Cepstrum Coefficient (MFCC) features and consider three experimental factors: the length of the frame, the duration of the consonant, and the dimension of the speech feature. The k-nearest neighbor method is used for recognition, and the optimal combination of the three experimental factors is found. In the experimental results, the best recognition rate reaches 98.5%.
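The k-nearest-neighbor rule used for recognition here is easy to sketch. The 2-D "features" and two classes below are synthetic placeholders for MFCC vectors and vowel classes, not data from the thesis.

```python
import numpy as np

def knn_classify(x, train_feats, train_labels, k=3):
    # k-nearest-neighbor rule with Euclidean distance: the test vector
    # takes the majority label among its k closest training vectors.
    d = np.linalg.norm(train_feats - x, axis=1)
    nearest = train_labels[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical 2-D "MFCC-like" features for two well-separated classes.
rng = np.random.default_rng(0)
a = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
b = rng.normal([4.0, 4.0], 0.5, size=(50, 2))
X = np.vstack([a, b])
y = np.array([0] * 50 + [1] * 50)

pred = knn_classify(np.array([3.8, 4.1]), X, y, k=5)
```

Grid-searching k together with the frame length, consonant duration, and feature dimension is the kind of factor sweep the abstract describes.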
APA, Harvard, Vancouver, ISO, and other styles
42

Chen, Wan-Yu, and 陳宛余. "The Investigation of Capturing Mel-Frequency Cepstrum Coefficient Features on Mandarin Consonant Word Recognition." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/33299797950806290067.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Graduate Institute of Statistics
99
The aim of this paper is to discuss recognition of the consonants of 337 Mandarin words, given that the vowel has been recognized correctly. Mel-frequency cepstral coefficient (MFCC) features and the k-nearest neighbor (KNN) method are used for recognition. Four experimental factors are considered: the length of the frame, the dimension of the MFCC, the swing of the frame, and the duration of the consonant. We find the optimal values of the four experimental factors, and the highest recognition rate is 95.84%.
APA, Harvard, Vancouver, ISO, and other styles
43

Chu, Feng-Seng, and 朱峰森. "Improved Approaches of Processing Perceptual Linear Prediction(PLP)and Mel Frequency Cepstrum Coefficient(MFCC)Parameters for Robust Speech Recognition." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/26578739886453071884.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Jhong, Jing-Jyue, and 鍾靖爵. "Using the Method of Common Vector to Recognize Isolated Mandarin Word for Speaker-dependent System with Optimal Mel-frequency Cepstrum Coefficient." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/zwc24k.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Graduate Institute of Statistics
99
This paper investigates the recognition of 340 isolated Mandarin words using Mel-frequency cepstral coefficient features, the best-known speech features and among the least susceptible to noise interference. The speech model is constructed by the common-vector method, and a two-stage method is also used to improve recognition of similar consonants. The speech database for this experiment was recorded by twelve different speakers, with each isolated Mandarin word recorded ten times. The optimal parameter set giving the highest recognition rate is found by cross-validation over all parameters. Finally, the best average recognition rate in this experiment is 91.80%, with a variance of 0.008.
APA, Harvard, Vancouver, ISO, and other styles
45

Wu, Jhong-Da, and 吳忠達. "Using K-Nearest Neighbor Method and the Optimal Mel-Frequency Cepstrum Coefficient Feature to Recognize Isolated Mandarin Word for Speaker-Dependent System." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/47366663775116977709.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Graduate Institute of Statistics
99
This paper mainly discusses speaker-dependent speech recognition of 337 isolated Mandarin words. The features are Mel-frequency cepstral coefficients (MFCC) and the recognition method is k-nearest neighbor (KNN); we search for the optimal parameters to obtain high recognition performance. Six experimental factors are considered in this work: the length of the frame, the dimension of the MFCC, the number of frames, the weights of the consonant and vowel, the swing of the frame, and the duration of the consonant. We find that the best average recognition rate on the database reaches 91.5%.
APA, Harvard, Vancouver, ISO, and other styles
