Dissertations on the topic "Speech Activity Detection (SAD)"
Cite a source in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 23 dissertations for research on the topic "Speech Activity Detection (SAD)".
Next to every work in the bibliography there is an "Add to bibliography" option. Use it, and your bibliographic reference to the chosen work will be formatted automatically according to the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).
You can also download the full text of the scholarly publication as a PDF and read the online abstract of the work, if the relevant parameters are available in its metadata.
Browse dissertations from a wide variety of disciplines and compile your bibliography correctly.
Näslund, Anton, and Charlie Jeansson. „Robust Speech Activity Detection and Direction of Arrival Using Convolutional Neural Networks“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297756.
Social robots are becoming more and more common in our everyday lives. In conversational robotics, development is moving towards socially engaging robots that can hold human-like conversations. This project looks at one of the technical aspects of speech recognition, namely speech activity detection. The presented solution uses a convolutional neural network (CNN) based system to detect speech within a forward-facing azimuth range. The project used a dataset from FestVox called CMU Arctic, supplemented with a number of recorded interfering noises. A library called Pyroomacoustics was used to simulate a realistic environment in order to create a robust system. A simplified model that only detected speech activity was constructed and reached an accuracy of 95%. The complete system resulted in an accuracy of 93%. It was compared with a related approach, the WebRTC voice activity detection (VAD) algorithm combined with beamforming, since no previously published solutions to our exact problem were found. Our solutions turned out to achieve higher accuracy than WebRTC did on our dataset.
Bachelor's degree project in electrical engineering, 2020, KTH, Stockholm
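Note: the CNN architecture itself is not reproduced in this listing. The sketch below is only a rough illustration of the general idea of a CNN-based speech activity detector operating on spectrogram patches; PyTorch, the 40-band log-mel input, the layer sizes and the two-class output are assumptions for illustration, not the authors' design.

```python
# Minimal sketch of a CNN speech/non-speech classifier over log-mel patches.
# Assumptions (not from the thesis): 40 mel bands x 32 frames per patch, 2 output classes.
import torch
import torch.nn as nn

class SpeechActivityCNN(nn.Module):
    def __init__(self, n_mels=40, n_frames=32, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), n_classes)

    def forward(self, x):            # x: (batch, 1, n_mels, n_frames)
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# Example: score one random patch (stand-in for a real log-mel spectrogram excerpt).
model = SpeechActivityCNN()
patch = torch.randn(1, 1, 40, 32)
probs = torch.softmax(model(patch), dim=-1)
print(probs)  # [p(non-speech), p(speech)] for the untrained network
```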
Wejdelind, Marcus, and Nils Wägmark. „Multi-speaker Speech Activity Detection From Video“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297701.
A social robot will in many cases have to handle conversations with several interlocutors, where different people speak at the same time. To achieve this, it is important that the robot can identify the speaker in order to then assist or interact with that person. This project has investigated the problem from a visual standpoint: a convolutional neural network (CNN) was implemented and trained on video input from an existing dataset (AVA-Speech). The goal of the network has been to detect, for each face and at each point in time, the probability that that person is speaking. Our best result using optical flow was 0.753, while we reached 0.781 with another type of preprocessing of the data. These results matched the existing scientific literature in the field surprisingly well, where 0.77 has proven to be a suitable reference value.
Bachelor's degree project in electrical engineering, 2020, KTH, Stockholm
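Note: optical flow is named above as one of the video preprocessing options. The following sketch shows a generic dense optical-flow feature (mean motion magnitude between consecutive face crops) using OpenCV's Farnebäck method; the library choice, parameters and crop size are assumptions for illustration, not details taken from the thesis.

```python
# Sketch: dense optical flow between two consecutive (grayscale) face crops,
# reduced to a single mean-motion-magnitude feature per frame pair.
# The Farnebäck parameters are generic values, not taken from the thesis.
import cv2
import numpy as np

def motion_magnitude(prev_gray, next_gray):
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return float(np.mean(mag))

# Example with synthetic frames standing in for two consecutive face crops.
prev_crop = np.random.randint(0, 256, (96, 96), dtype=np.uint8)
next_crop = np.roll(prev_crop, 2, axis=1)  # fake horizontal motion
print(motion_magnitude(prev_crop, next_crop))
```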
Murrin, Paul. „Objective measurement of voice activity detectors“. Thesis, University of York, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.325647.
Laverty, Stephen William. „Detection of Nonstationary Noise and Improved Voice Activity Detection in an Automotive Hands-free Environment“. Link to electronic thesis, 2005. http://www.wpi.edu/Pubs/ETD/Available/etd-051105-110646/.
Minotto, Vicente Peruffo. „Audiovisual voice activity detection and localization of simultaneous speech sources“. Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/77231.
Given the tendency of creating interfaces between humans and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional means of communication humans use: speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, commonly aided by visual cues, such as mouth movement and the location of the speakers. This processing done by the brain includes two important components that speech-based communication requires: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as in the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose some approaches for tackling these problems using audiovisual information, both for the single-source and the competing-sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the state-of-the-art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10 cm when locating such speakers.
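Note: the SRP method mentioned above builds on pairwise time-delay estimation between microphones. As a hedged illustration only (the sampling rate, signal length and maximum delay are assumptions, and this is not the thesis' GPU implementation), the sketch below computes a GCC-PHAT delay estimate for a single microphone pair, the building block that SRP-style localization sums over many pairs and candidate directions.

```python
# Sketch: GCC-PHAT time-delay estimation between two microphone channels.
# Sampling rate and maximum delay are illustrative assumptions.
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=0.001):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n)
    REF = np.fft.rfft(ref, n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                  # phase transform (PHAT) weighting
    cc = np.fft.irfft(cross, n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs
    return delay                                     # seconds; positive if 'sig' lags 'ref'

# Toy example: the second channel is the first delayed by 8 samples (0.5 ms at 16 kHz).
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
y = np.concatenate((np.zeros(8), x[:-8]))
print(gcc_phat(y, x))   # about 8 / 16000 = 5e-4 s
```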
Ent, Petr. „Voice Activity Detection“. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-235483.
Cho, Yong Duk. „Speech detection, enhancement and compression for voice communications“. Thesis, University of Surrey, 2001. http://epubs.surrey.ac.uk/842991/.
Doukas, Nikolaos. „Voice activity detection using energy based measures and source separation“. Thesis, Imperial College London, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245220.
Der volle Inhalt der QuelleSinclair, Mark. „Speech segmentation and speaker diarisation for transcription and translation“. Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20970.
Der volle Inhalt der QuelleThorell, Hampus. „Voice Activity Detection in the Tiger Platform“. Thesis, Linköping University, Department of Electrical Engineering, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-6586.
Sectra Communications AB has developed a terminal for encrypted communication called the Tiger platform. During voice communication, delays have sometimes been experienced, resulting in conversational complications.
A solution to this problem, proposed by Sectra, is to introduce voice activity detection to the Tiger platform, that is, to separate the speech parts of the input signal from the non-speech parts. By transferring only the speech parts to the receiver, the required bandwidth should be dramatically decreased, and with a lower bandwidth requirement the accumulated delays should gradually disappear. The problem is then to find a method that reliably distinguishes the speech parts of the input signal. Fortunately, a great deal of theory exists on the subject, and numerous voice activity detection methods are available today.
In this thesis the theory of voice activity detection has been studied. A review of voice activity detectors available on the market today, followed by an evaluation of some of them, was performed in order to select a suitable candidate for the Tiger platform. This evaluation later became the foundation for the selection of a voice activity detector for implementation.
Finally, the implementation of the chosen voice activity detector, including a comfort noise generator, was done on the platform. This implementation was based on the special requirements of the platform. Tests of the implementation in office environments show that possible delays are steadily being reduced during periods of speech inactivity, while the active speech quality is preserved.
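Note: the abstract does not state which detector was chosen. As a hedged illustration of the underlying principle only (transmit the frames classified as speech, stay silent otherwise), the sketch below implements a classic frame-energy VAD with a hangover; the frame length, margin and hangover count are arbitrary assumptions.

```python
# Sketch: frame-energy VAD with hangover. Frames whose log-energy exceeds the
# estimated noise floor by a margin are marked as speech; a hangover keeps the
# decision active for a few frames to avoid clipping weak word endings.
# All constants (margin, hangover length, frame size) are illustrative assumptions.
import numpy as np

def energy_vad(x, frame_len=160, margin_db=6.0, hangover=5):
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    noise_floor = np.percentile(log_e, 10)       # crude noise-floor estimate
    decisions, counter = np.zeros(n_frames, dtype=bool), 0
    for i, e in enumerate(log_e):
        if e > noise_floor + margin_db:
            counter = hangover
        decisions[i] = counter > 0
        counter = max(counter - 1, 0)
    return decisions

# Example on a synthetic signal: silence, then a burst, then silence (16 kHz assumed).
sig = np.concatenate([0.01 * np.random.randn(8000),
                      np.random.randn(8000),
                      0.01 * np.random.randn(8000)])
print(energy_vad(sig).astype(int))
```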
Cooper, Douglas. „Speech Detection using Gammatone Features and One-Class Support Vector Machine“. Master's thesis, University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5923.
M.S.E.E. (Masters), Department of Electrical Engineering and Computer Science, College of Engineering and Computer Science; program: Electrical Engineering, Accelerated BS to MS.
Temko, Andriy. „Acoustic event detection and classification“. Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity.
Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, which are motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than the GMM-based technique is obtained, achieving a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results
are reported. The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since, in fact, it is an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes from this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room and work in real time for purposes of testing and demonstration.
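Note: as a rough sketch of a sliding-window SVM detection scheme of the kind described above, the code below classifies mean-pooled windows of a feature sequence and merges consecutive positive windows into event intervals. scikit-learn is assumed for the classifier, and the window length, hop, pooling and post-processing are illustrative choices rather than the actual system's settings.

```python
# Sketch: sliding-window SVM detection over a frame-level feature sequence.
# A window of frames is summarized (mean-pooled here), classified by an SVM,
# and consecutive positive windows are merged into detected event intervals.
import numpy as np
from sklearn.svm import SVC

def detect_events(features, clf, win=20, hop=10):
    hits = []
    for start in range(0, len(features) - win + 1, hop):
        pooled = features[start:start + win].mean(axis=0)
        if clf.predict(pooled[None, :])[0] == 1:
            hits.append((start, start + win))
    # Merge overlapping/adjacent positive windows into event intervals.
    merged = []
    for s, e in hits:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# Toy usage: train on random "event" vs "background" feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(3, 1, (50, 13))])
y = np.array([0] * 50 + [1] * 50)
clf = SVC(kernel="rbf").fit(X, y)
stream = np.vstack([rng.normal(0, 1, (100, 13)), rng.normal(3, 1, (40, 13)),
                    rng.normal(0, 1, (60, 13))])
print(detect_events(stream, clf))  # frame-index intervals of detected events
```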
Danko, Michal. „Identifikace hudby, řeči, křiku, zpěvu v audio (video) záznamu“ [Identification of music, speech, shouting, and singing in audio (video) recordings]. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255309.
Podloucká, Lenka. „Identifikace pauz v rušeném řečovém signálu“ [Identification of pauses in a noisy speech signal]. Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2008. http://www.nusl.cz/ntk/nusl-217266.
Min-Chang, Chang, and 張民昌. „Voice Activity Detection and Its Application to Speech Coding“. Thesis, 2003. http://ndltd.ncl.edu.tw/handle/86106727484751378312.
國立臺北科技大學 (National Taipei University of Technology), 電機工程系碩士班 (Master's program, Department of Electrical Engineering), ROC year 91 (2002/2003).
A voice activity detector is usually used as the preprocessor of a speech encoder in order to determine whether the incoming signal is a speech segment or not. If it is, a normal speech coder is used to encode the segment. If it is not, only a few parameters, called the silence insertion descriptor (SID), are transmitted to the decoder, and a comfort noise generator (CNG) is used to mimic the background noise. According to conversational statistics, more than 40%, and up to 60%, of the time consists of silence between talk spurts, so a considerable amount of bit rate and bandwidth can be saved. The subject of this thesis is to develop an efficient voice activity detection (VAD) algorithm. Five speech parameters are used to classify the input signal into voiced segments (speech-like segments) and unvoiced segments (non-speech-like segments): the segmental energy, the spectral distortion, the zero-crossing rate, the fundamental period (pitch), and the sum of the vocal areas. The determination of the proposed VAD model's parameters and thresholds is based on the steepest descent algorithm. About two-thirds of the teaching material of "Let's talk in English" from March 2003 is used as the training database, and the rest as the testing database. Finally, performance in terms of objective error rate and subjective listening tests is studied and compared with the VAD methods of the well-known half-rate GSM and G.729 speech coders.
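Note: of the five parameters listed above, the sketch below computes only the two simplest, segmental energy and zero-crossing rate, per frame. The frame length is an assumption, and the spectral distortion, pitch and vocal-area parameters, as well as the steepest-descent threshold tuning, are not reproduced here.

```python
# Sketch: two of the frame-level parameters named above, segmental (log-)energy
# and zero-crossing rate. The remaining parameters and the threshold training
# are outside the scope of this illustration.
import numpy as np

def frame_features(x, frame_len=240):                 # 30 ms at 8 kHz (assumed)
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)               # segmental energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    return energy, zcr

# Toy signal: low-level noise followed by a tone (stand-in for a voiced segment).
sig = np.concatenate([0.01 * np.random.randn(2400), np.sin(0.3 * np.arange(2400))])
e, z = frame_features(sig)
print(np.round(e, 2), np.round(z, 2))  # voiced frames: higher energy, lower ZCR
```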
Lai, Chen-Wei, and 賴辰瑋. „The Research on the Voice Activity Detection and Speech Enhancement for Noisy Speech Recognition“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/06070640933840270072.
國立暨南國際大學 (National Chi Nan University), 電機工程學系 (Department of Electrical Engineering), ROC year 93 (2004/2005).
When a speech recognizer is applied in a real environment, its performance is often degraded seriously by additive noise. In order to improve the robustness of the recognition system under noisy conditions, various approaches have been proposed; one direction is to detect the presence of noise, estimate its characteristics, and then remove or alleviate the noise in the speech signal. In this thesis, we first study several voice activity detection (endpoint detection) approaches, which can detect the noise-only portions of a speech sequence so that the noise statistics can be estimated from those portions. These approaches include the order statistic filter (OSF), the subband order statistic filter (SOSF), long-term spectral divergence (LTSD), the Kullback-Leibler (KL) distance, energy, and entropy. Experimental results show that the KL distance method performs best; that is, it gives the endpoints of the noise-only portions closest to those obtained manually. Secondly, speech enhancement approaches are studied, which try to reduce the noise component of the speech signal in different domains. For example, nonlinear spectral subtraction (NSS) and the Wiener filter (WF) operate in the linear spectral domain, while mel spectral subtraction (MSS) operates in the mel spectral domain. Furthermore, we propose the cepstral statistics compensation (CSC) method, which operates in the cepstral domain. It is found that the effect of these back-end speech enhancement approaches generally depends on the accuracy of the front-end VAD, and that CSC gives the best recognition rates among all the approaches; CSC even performs better than two popular temporal filtering approaches, cepstral mean subtraction (CMS) and cepstral normalization (CN). In conclusion, robust VAD and speech enhancement approaches can effectively improve noisy speech recognition and have one particular advantage: since they operate only on the speech to be recognized, there is no need to adjust the recognition models.
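Note: among the enhancement methods listed, spectral subtraction is the easiest to sketch. The code below shows a generic magnitude-domain version with the noise spectrum estimated from the leading frames; the oversubtraction factor, spectral floor and the assumption that the first ten frames are noise-only are textbook defaults, not the thesis' NSS/MSS settings.

```python
# Sketch: basic magnitude spectral subtraction. The noise magnitude spectrum is
# estimated from the first few frames (assumed noise-only), subtracted from each
# frame's magnitude with an oversubtraction factor, and floored; the noisy phase
# is reused for reconstruction. Constants are generic textbook choices.
import numpy as np

def spectral_subtraction(x, frame_len=256, hop=128, noise_frames=10,
                         alpha=2.0, floor=0.02):
    win = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    spectra = [np.fft.rfft(win * x[s:s + frame_len]) for s in starts]
    noise_mag = np.mean([np.abs(S) for S in spectra[:noise_frames]], axis=0)
    out = np.zeros(len(x))
    for s, S in zip(starts, spectra):
        mag = np.maximum(np.abs(S) - alpha * noise_mag, floor * np.abs(S))
        out[s:s + frame_len] += np.fft.irfft(mag * np.exp(1j * np.angle(S)), frame_len)
    return out

# Toy usage: a 440 Hz tone in white noise (16 kHz assumed).
noisy = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) + 0.3 * np.random.randn(16000)
print(np.std(noisy), np.std(spectral_subtraction(noisy)))  # rough before/after comparison
```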
WU, DONG-HAN, and 吳東翰. „Reduced Computation of Speech Coder Using a Voice Activity Detection Algorithm“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/83961092562227492103.
南臺科技大學 (Southern Taiwan University of Science and Technology), 資訊工程系 (Department of Computer Science and Information Engineering), ROC year 105 (2016/2017).
With the explosive growth of Internet use and multimedia technology, multimedia communication is nowadays integrated into personal information devices, and because of their limited computational capability the need has arisen for a coder with low computational complexity that can match different hardware platforms and integrate the services of various media sources. For an Internet or wireless speech communicator, heavy computation consumes more power and contributes to a higher price or reduced battery life. In order to achieve real-time, continuous speech communication, reducing the computational complexity of the speech coder is desirable for modern communication systems. In this thesis, we use a Voice Activity Detection (VAD) algorithm merely to classify the speech signal into two types of frames, active frames and inactive frames, in our proposed method. We analyzed the characteristics of the inactive speech signals in our experiments, and the results clearly show that the encoding parameters are uniformly distributed for the inactive speech subframes. Therefore, if the current frame is an inactive speech frame, its code-excited signal is not encoded; instead, the encoding parameters of the codebook structure are assigned randomly. The overall simulation results indicate that the average perceptual evaluation of speech quality score is degraded only slightly, by 0.023, and that our proposed methods can reduce the total computational complexity by about 30% relative to the original G.723.1 encoder with perceptually negligible degradation.
Tu, Wen Hsiang, and 杜文祥. „Study on the Voice Activity Detection Techniques for Robust Speech Feature Extraction“. Thesis, 2007. http://ndltd.ncl.edu.tw/handle/76966247400637028949.
國立暨南國際大學 (National Chi Nan University), 電機工程學系 (Department of Electrical Engineering), ROC year 95 (2006/2007).
The performance of a speech recognition system is often degraded by the mismatch between the development and application environments. One of the major sources of this mismatch is additive noise. The approaches for handling the problem of additive noise can be divided into three classes: speech enhancement, robust speech feature extraction, and compensation of speech models. In this thesis, we focus on the second class, robust speech feature extraction. Robust feature extraction approaches are often combined with voice activity detection in order to estimate the noise characteristics; a voice activity detector (VAD) is used to discriminate the speech and noise-only portions within an utterance. This thesis primarily investigates the effectiveness of various features for the VAD. These features include the low-frequency spectral magnitude (LFSM), the full-band spectral magnitude (FBSM), the cumulative quantized spectrum (CQS) and the high-pass log-energy. The resulting VAD provides noise information to two noise-robustness techniques, spectral subtraction (SS) and silence log-energy normalization (SLEN), in order to reduce the influence of additive noise in speech recognition. The recognition experiments are conducted on the Aurora-2 database. Experimental results show that the proposed VAD is capable of providing accurate noise information, with which the subsequent processes, SS and SLEN, significantly improve speech recognition performance in various noise-corrupted environments. As a result, we confirm that an appropriate selection of features for VAD implicitly improves the noise robustness of a speech recognition system.
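Note: the exact SLEN formulation is not given in this abstract. The sketch below illustrates one plausible reading (an assumption, not the thesis' formula): frames labelled as silence by the VAD have their log-energy replaced by a constant low value, removing noise-driven fluctuation in the non-speech regions.

```python
# Sketch of silence log-energy normalization as interpreted here (an assumption):
# log-energies of frames flagged as non-speech by a VAD are replaced by a fixed
# small constant, while speech-frame log-energies are left untouched.
import numpy as np

def silence_log_energy_normalization(log_energy, speech_flags, silence_value=None):
    log_energy = np.asarray(log_energy, dtype=float).copy()
    if silence_value is None:
        silence_value = log_energy.min()       # assumed choice of the constant
    log_energy[~np.asarray(speech_flags, dtype=bool)] = silence_value
    return log_energy

log_e = np.array([2.1, 2.3, 6.5, 7.0, 6.8, 2.6, 2.2])
flags = np.array([0, 0, 1, 1, 1, 0, 0])
print(silence_log_energy_normalization(log_e, flags))
```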
楊佳興. „A Real-time Speech Purification and Voice Activity Detection System Using Microphone Array“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/qy6qq9.
Hsuei, Yan-Jung, and 許晏榮. „SOPC Implementation of Speech Purification and Voice Activity Detection System Using Microphone Array“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/mkjwd4.
國立交通大學 (National Chiao Tung University), 電機與控制工程系所 (Department of Electrical and Control Engineering), ROC year 94 (2005/2006).
A real-time speech purification and voice activity detection (VAD) system for noisy indoor environments is proposed in this thesis. The system contains a real-time eight-channel microphone array signal processing platform. An adaptive spatial filter is also designed on the platform to give the system the ability to adapt to the environmental characteristics and the noise. All the algorithms are realized on a Nios embedded system-on-programmable-chip (SOPC) platform: the VAD algorithm is executed by the Nios processor, and the adaptive filter is accelerated by self-designed hardware implemented as a customized peripheral. The communication between the Nios processor and the customized peripheral is achieved via the Avalon bus. Since the order of the spatial filter is flexible, the system can be adjusted for better speech purification results. The experimental results verify that the system can suppress the effect of environmental noise and improve the SNR effectively.
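Note: the adaptive spatial filter in this work is realised in hardware on the SOPC platform. Purely as a software illustration of the adaptation principle, and not of the thesis' design, the sketch below runs a normalized LMS filter that adapts a noise-only reference channel to cancel the correlated noise in a primary channel; the filter order and step size are arbitrary assumptions.

```python
# Sketch: normalized LMS (NLMS) adaptive noise cancellation. A reference channel
# observing only (filtered) noise is adapted to match the noise component in the
# primary channel; subtracting the filter output leaves an enhanced signal.
import numpy as np

def nlms_cancel(primary, reference, order=32, mu=0.1, eps=1e-8):
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]   # current and past reference samples
        y = w @ x                                  # estimate of the noise in 'primary'
        e = primary[n] - y                         # enhanced output sample
        w += mu * e * x / (x @ x + eps)            # NLMS weight update
        out[n] = e
    return out

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))        # slowly varying "speech"
noise = rng.normal(size=8000)
primary = clean + np.convolve(noise, [0.6, 0.3, 0.1])[:8000]   # noise reaches the mic filtered
enhanced = nlms_cancel(primary, noise)
print(np.std(primary - clean), np.std(enhanced[2000:] - clean[2000:]))  # error before/after
```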
Chen, Hung-Bin, and 陳鴻彬. „On the Study of Energy-Based Speech Feature Normalization and Application to Voice Activity Detection“. Thesis, 2007. http://ndltd.ncl.edu.tw/handle/41039482721804356460.
國立臺灣師範大學 (National Taiwan Normal University), 資訊工程研究所 (Graduate Institute of Computer Science and Information Engineering), ROC year 95 (2006/2007).
This thesis considers robust speech recognition in various noise environments, with a special focus on ways to reconstruct clean time-domain log-energy features from noise-contaminated ones. Based on the distribution characteristics of the log-energy features of each speech utterance, we aim to develop an efficient approach that rescales the log-energy features of a noisy utterance so as to alleviate the mismatch caused by environmental noise and thereby improve recognition performance. Since the time-domain behaviour of speech signals shows that lower-energy frames are more vulnerable to additive noise than higher-energy ones, and that the log-energy features of an utterance tend to be lifted up when it is seriously corrupted by additive noise, we propose a simple but effective approach, named log-energy rescaling normalization (LERN), to appropriately rescale the log-energy features of noisy speech towards those of the desirable clean speech. The speech recognition experiments were conducted under various noise conditions using the European Telecommunications Standards Institute (ETSI) Aurora-2.0 database, which contains connected digit utterances spoken in English and offers eight noise sources and seven different signal-to-noise ratios (SNRs). The experimental results show that the proposed LERN approach performs considerably better than other conventional energy or log-energy feature normalization methods. Another set of experiments, on large vocabulary continuous speech recognition (LVCSR) of Mandarin broadcast news, also evidenced the effectiveness of LERN.
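Note: the abstract does not give the LERN formula. The sketch below is only a loose illustration under an assumed linear rescaling: the observed log-energy range of a noisy utterance is mapped onto a target range so that noise-lifted low-energy frames are pushed back down; it should not be read as the published method.

```python
# Illustrative (assumed) linear rescaling of utterance log-energies: the observed
# noisy range [min, max] is mapped onto a target "clean" range, pushing the
# noise-lifted silence frames back toward a low log-energy value. This is a
# plausible reading of log-energy rescaling, not the LERN formula itself.
import numpy as np

def rescale_log_energy(log_e, target_min=-2.0, target_max=None):
    log_e = np.asarray(log_e, dtype=float)
    lo, hi = log_e.min(), log_e.max()
    if target_max is None:
        target_max = hi                        # keep the high-energy end in place
    scale = (target_max - target_min) / max(hi - lo, 1e-6)
    return target_min + scale * (log_e - lo)

noisy_log_e = np.array([4.0, 4.2, 7.5, 8.0, 7.8, 4.1])   # silences lifted up by noise
print(np.round(rescale_log_energy(noisy_log_e), 2))
```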
ZHENG, SU-XING, and 鄭素幸. „A study on wireless digital subscriber loop and the channel sharing efficiency through speech activity detection“. Thesis, 1992. http://ndltd.ncl.edu.tw/handle/03260404361617659124.
Der volle Inhalt der QuelleVenter, Petrus Jacobus. „Recording and automatic detection of African elephant (Loxodonta africana) infrasonic rumbles“. Diss., 2008. http://hdl.handle.net/2263/28329.
Dissertation (MEng)--University of Pretoria, 2008. Department of Electrical, Electronic and Computer Engineering.