Dissertations on the topic "Speech Activity Detection (SAD)"
Cite a source in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 23 dissertations for research on the topic "Speech Activity Detection (SAD)".
Next to every work in the bibliography there is an "Add to bibliography" option. Use it, and your bibliographic reference to the chosen work will be formatted automatically according to the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).
You can also download the full text of the scholarly publication as a PDF and read the online abstract of the work, if the relevant parameters are available in its metadata.
Browse dissertations from a wide variety of disciplines and compile your bibliography correctly.
Näslund, Anton, and Charlie Jeansson. „Robust Speech Activity Detection and Direction of Arrival Using Convolutional Neural Networks“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297756.
Social robots are becoming more and more common in our everyday lives. In conversational robotics, development is moving towards socially engaging robots that can hold human-like conversations. This project looks at one of the technical aspects of speech recognition, namely speech activity detection. The presented solution uses a convolutional neural network (CNN) based system to detect speech within a forward-facing azimuth range. The project used a dataset from FestVox called CMU Arctic, supplemented with a number of recorded interfering noises. A library called Pyroomacoustics was used to simulate a realistic environment in order to create a robust system. A simplified model that only detected speech activity was constructed and reached an accuracy of 95%. The complete system resulted in an accuracy of 93%. It was compared with a related approach, the WebRTC voice activity detection (VAD) algorithm combined with beamforming, since no previously published solutions to our exact problem were found. Our solutions turned out to achieve higher accuracy than WebRTC did on our dataset.
Bachelor's degree project in electrical engineering, 2020, KTH, Stockholm
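Note: the CNN architecture itself is not reproduced in this listing. The sketch below is only a rough illustration of the general idea of a CNN-based speech activity detector operating on spectrogram patches; PyTorch, the 40-band log-mel input, the layer sizes and the two-class output are assumptions for illustration, not the authors' design.

```python
# Minimal sketch of a CNN speech/non-speech classifier over log-mel patches.
# Assumptions (not from the thesis): 40 mel bands x 32 frames per patch, 2 output classes.
import torch
import torch.nn as nn

class SpeechActivityCNN(nn.Module):
    def __init__(self, n_mels=40, n_frames=32, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), n_classes)

    def forward(self, x):            # x: (batch, 1, n_mels, n_frames)
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# Example: score one random patch (stand-in for a real log-mel spectrogram excerpt).
model = SpeechActivityCNN()
patch = torch.randn(1, 1, 40, 32)
probs = torch.softmax(model(patch), dim=-1)
print(probs)  # [p(non-speech), p(speech)] for the untrained network
```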
Wejdelind, Marcus, and Nils Wägmark. „Multi-speaker Speech Activity Detection From Video“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297701.
A social robot will in many cases have to handle conversations with several interlocutors, where different people speak at the same time. To achieve this, it is important that the robot can identify the speaker in order to then assist or interact with that person. This project has investigated the problem from a visual standpoint: a convolutional neural network (CNN) was implemented and trained on video input from an existing dataset (AVA-Speech). The goal of the network has been to detect, for each face and at each point in time, the probability that that person is speaking. Our best result using optical flow was 0.753, while we reached 0.781 with another type of preprocessing of the data. These results matched the existing scientific literature in the field surprisingly well, where 0.77 has proven to be a suitable reference value.
Bachelor's degree project in electrical engineering, 2020, KTH, Stockholm
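Note: optical flow is named above as one of the video preprocessing options. The following sketch shows a generic dense optical-flow feature (mean motion magnitude between consecutive face crops) using OpenCV's Farnebäck method; the library choice, parameters and crop size are assumptions for illustration, not details taken from the thesis.

```python
# Sketch: dense optical flow between two consecutive (grayscale) face crops,
# reduced to a single mean-motion-magnitude feature per frame pair.
# The Farnebäck parameters are generic values, not taken from the thesis.
import cv2
import numpy as np

def motion_magnitude(prev_gray, next_gray):
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return float(np.mean(mag))

# Example with synthetic frames standing in for two consecutive face crops.
prev_crop = np.random.randint(0, 256, (96, 96), dtype=np.uint8)
next_crop = np.roll(prev_crop, 2, axis=1)  # fake horizontal motion
print(motion_magnitude(prev_crop, next_crop))
```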
Murrin, Paul. „Objective measurement of voice activity detectors“. Thesis, University of York, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.325647.
Laverty, Stephen William. „Detection of Nonstationary Noise and Improved Voice Activity Detection in an Automotive Hands-free Environment“. Link to electronic thesis, 2005. http://www.wpi.edu/Pubs/ETD/Available/etd-051105-110646/.
Minotto, Vicente Peruffo. „Audiovisual voice activity detection and localization of simultaneous speech sources“. Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/77231.
Given the tendency of creating interfaces between humans and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional means of communication humans use: speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, commonly aided by visual cues, such as mouth movement and the location of the speakers. This processing done by the brain includes two important components that speech-based communication requires: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as in the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose some approaches for tackling these problems using audiovisual information, both for the single-source and the competing-sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the state-of-the-art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10 cm when locating such speakers.
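Note: the SRP method mentioned above builds on pairwise time-delay estimation between microphones. As a hedged illustration only (the sampling rate, signal length and maximum delay are assumptions, and this is not the thesis' GPU implementation), the sketch below computes a GCC-PHAT delay estimate for a single microphone pair, the building block that SRP-style localization sums over many pairs and candidate directions.

```python
# Sketch: GCC-PHAT time-delay estimation between two microphone channels.
# Sampling rate and maximum delay are illustrative assumptions.
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=0.001):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n)
    REF = np.fft.rfft(ref, n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                  # phase transform (PHAT) weighting
    cc = np.fft.irfft(cross, n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = (np.argmax(np.abs(cc)) - max_shift) / fs
    return delay                                     # seconds; positive if 'sig' lags 'ref'

# Toy example: the second channel is the first delayed by 8 samples (0.5 ms at 16 kHz).
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
y = np.concatenate((np.zeros(8), x[:-8]))
print(gcc_phat(y, x))   # about 8 / 16000 = 5e-4 s
```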
Ent, Petr. „Voice Activity Detection“. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-235483.
Cho, Yong Duk. „Speech detection, enhancement and compression for voice communications“. Thesis, University of Surrey, 2001. http://epubs.surrey.ac.uk/842991/.
Doukas, Nikolaos. „Voice activity detection using energy based measures and source separation“. Thesis, Imperial College London, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245220.
Der volle Inhalt der QuelleSinclair, Mark. „Speech segmentation and speaker diarisation for transcription and translation“. Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20970.
Der volle Inhalt der QuelleThorell, Hampus. „Voice Activity Detection in the Tiger Platform“. Thesis, Linköping University, Department of Electrical Engineering, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-6586.
Sectra Communications AB has developed a terminal for encrypted communication called the Tiger platform. During voice communication, delays have sometimes been experienced, resulting in conversational complications.
A solution to this problem, proposed by Sectra, is to introduce voice activity detection to the Tiger platform, that is, to separate the speech parts of the input signal from the non-speech parts. By transferring only the speech parts to the receiver, the required bandwidth should be dramatically decreased, and with a lower bandwidth requirement the accumulated delays should gradually disappear. The problem is then to find a method that reliably distinguishes the speech parts of the input signal. Fortunately, a great deal of theory exists on the subject, and numerous voice activity detection methods are available today.
In this thesis the theory of voice activity detection has been studied. A review of voice activity detectors available on the market today, followed by an evaluation of some of them, was performed in order to select a suitable candidate for the Tiger platform. This evaluation later became the foundation for the selection of a voice activity detector for implementation.
Finally, the implementation of the chosen voice activity detector, including a comfort noise generator, was done on the platform. This implementation was based on the special requirements of the platform. Tests of the implementation in office environments show that possible delays are steadily being reduced during periods of speech inactivity, while the active speech quality is preserved.
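Note: the abstract does not state which detector was chosen. As a hedged illustration of the underlying principle only (transmit the frames classified as speech, stay silent otherwise), the sketch below implements a classic frame-energy VAD with a hangover; the frame length, margin and hangover count are arbitrary assumptions.

```python
# Sketch: frame-energy VAD with hangover. Frames whose log-energy exceeds the
# estimated noise floor by a margin are marked as speech; a hangover keeps the
# decision active for a few frames to avoid clipping weak word endings.
# All constants (margin, hangover length, frame size) are illustrative assumptions.
import numpy as np

def energy_vad(x, frame_len=160, margin_db=6.0, hangover=5):
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    noise_floor = np.percentile(log_e, 10)       # crude noise-floor estimate
    decisions, counter = np.zeros(n_frames, dtype=bool), 0
    for i, e in enumerate(log_e):
        if e > noise_floor + margin_db:
            counter = hangover
        decisions[i] = counter > 0
        counter = max(counter - 1, 0)
    return decisions

# Example on a synthetic signal: silence, then a burst, then silence (16 kHz assumed).
sig = np.concatenate([0.01 * np.random.randn(8000),
                      np.random.randn(8000),
                      0.01 * np.random.randn(8000)])
print(energy_vad(sig).astype(int))
```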
Cooper, Douglas. „Speech Detection using Gammatone Features and One-Class Support Vector Machine“. Master's thesis, University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5923.
M.S.E.E. (Masters), Department of Electrical Engineering and Computer Science, College of Engineering and Computer Science; program: Electrical Engineering, Accelerated BS to MS.
Temko, Andriy. „Acoustic event detection and classification“. Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity.
Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, which are motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than the GMM-based technique is obtained, achieving a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results
are reported. The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our developed systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since, in fact, it is an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes from this thesis work. As it has been carried out in the framework of the CHIL EU project, the author has been responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room and work in real time for purposes of testing and demonstration.
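Note: as a rough sketch of a sliding-window SVM detection scheme of the kind described above, the code below classifies mean-pooled windows of a feature sequence and merges consecutive positive windows into event intervals. scikit-learn is assumed for the classifier, and the window length, hop, pooling and post-processing are illustrative choices rather than the actual system's settings.

```python
# Sketch: sliding-window SVM detection over a frame-level feature sequence.
# A window of frames is summarized (mean-pooled here), classified by an SVM,
# and consecutive positive windows are merged into detected event intervals.
import numpy as np
from sklearn.svm import SVC

def detect_events(features, clf, win=20, hop=10):
    hits = []
    for start in range(0, len(features) - win + 1, hop):
        pooled = features[start:start + win].mean(axis=0)
        if clf.predict(pooled[None, :])[0] == 1:
            hits.append((start, start + win))
    # Merge overlapping/adjacent positive windows into event intervals.
    merged = []
    for s, e in hits:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# Toy usage: train on random "event" vs "background" feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(3, 1, (50, 13))])
y = np.array([0] * 50 + [1] * 50)
clf = SVC(kernel="rbf").fit(X, y)
stream = np.vstack([rng.normal(0, 1, (100, 13)), rng.normal(3, 1, (40, 13)),
                    rng.normal(0, 1, (60, 13))])
print(detect_events(stream, clf))  # frame-index intervals of detected events
```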
Danko, Michal. „Identifikace hudby, řeči, křiku, zpěvu v audio (video) záznamu“ [Identification of music, speech, shouting, and singing in audio (video) recordings]. Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255309.
Podloucká, Lenka. „Identifikace pauz v rušeném řečovém signálu“ [Identification of pauses in a noisy speech signal]. Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2008. http://www.nusl.cz/ntk/nusl-217266.
Min-Chang, Chang, and 張民昌. „Voice Activity Detection and Its Application to Speech Coding“. Thesis, 2003. http://ndltd.ncl.edu.tw/handle/86106727484751378312.
國立臺北科技大學 (National Taipei University of Technology), 電機工程系碩士班 (Master's program, Department of Electrical Engineering), ROC year 91 (2002/2003).
A voice activity detector is usually used as the preprocessor of a speech encoder in order to determine whether the incoming signal is a speech segment or not. If it is, a normal speech coder is used to encode the segment. If it is not, only a few parameters, called the silence insertion descriptor (SID), are transmitted to the decoder, and a comfort noise generator (CNG) is used to mimic the background noise. According to conversational statistics, more than 40%, and up to 60%, of the time consists of silence between talk spurts, so a considerable amount of bit rate and bandwidth can be saved. The subject of this thesis is to develop an efficient voice activity detection (VAD) algorithm. Five speech parameters are used to classify the input signal into voiced segments (speech-like segments) and unvoiced segments (non-speech-like segments): the segmental energy, the spectral distortion, the zero-crossing rate, the fundamental period (pitch), and the sum of the vocal areas. The determination of the proposed VAD model's parameters and thresholds is based on the steepest descent algorithm. About two-thirds of the teaching material of "Let's talk in English" from March 2003 is used as the training database, and the rest as the testing database. Finally, performance in terms of objective error rate and subjective listening tests is studied and compared with the VAD methods of the well-known half-rate GSM and G.729 speech coders.
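Note: of the five parameters listed above, the sketch below computes only the two simplest, segmental energy and zero-crossing rate, per frame. The frame length is an assumption, and the spectral distortion, pitch and vocal-area parameters, as well as the steepest-descent threshold tuning, are not reproduced here.

```python
# Sketch: two of the frame-level parameters named above, segmental (log-)energy
# and zero-crossing rate. The remaining parameters and the threshold training
# are outside the scope of this illustration.
import numpy as np

def frame_features(x, frame_len=240):                 # 30 ms at 8 kHz (assumed)
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)               # segmental energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    return energy, zcr

# Toy signal: low-level noise followed by a tone (stand-in for a voiced segment).
sig = np.concatenate([0.01 * np.random.randn(2400), np.sin(0.3 * np.arange(2400))])
e, z = frame_features(sig)
print(np.round(e, 2), np.round(z, 2))  # voiced frames: higher energy, lower ZCR
```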
Lai, Chen-Wei, and 賴辰瑋. „The Research on the Voice Activity Detection and Speech Enhancement for Noisy Speech Recognition“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/06070640933840270072.
國立暨南國際大學 (National Chi Nan University), 電機工程學系 (Department of Electrical Engineering), ROC year 93 (2004/2005).
When a speech recognizer is applied in a real environment, its performance is often degraded seriously by additive noise. In order to improve the robustness of the recognition system under noisy conditions, various approaches have been proposed; one direction is to detect the presence of noise, estimate its characteristics, and then remove or alleviate the noise in the speech signal. In this thesis, we first study several voice activity detection (endpoint detection) approaches, which can detect the noise-only portions of a speech sequence so that the noise statistics can be estimated from those portions. These approaches include the order statistic filter (OSF), the subband order statistic filter (SOSF), long-term spectral divergence (LTSD), the Kullback-Leibler (KL) distance, energy, and entropy. Experimental results show that the KL distance method performs best; that is, it gives the endpoints of the noise-only portions closest to those obtained manually. Secondly, speech enhancement approaches are studied, which try to reduce the noise component of the speech signal in different domains. For example, nonlinear spectral subtraction (NSS) and the Wiener filter (WF) operate in the linear spectral domain, while mel spectral subtraction (MSS) operates in the mel spectral domain. Furthermore, we propose the cepstral statistics compensation (CSC) method, which operates in the cepstral domain. It is found that the effect of these back-end speech enhancement approaches generally depends on the accuracy of the front-end VAD, and that CSC gives the best recognition rates among all the approaches; CSC even performs better than two popular temporal filtering approaches, cepstral mean subtraction (CMS) and cepstral normalization (CN). In conclusion, robust VAD and speech enhancement approaches can effectively improve noisy speech recognition and have one particular advantage: since they operate only on the speech to be recognized, there is no need to adjust the recognition models.
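Note: among the enhancement methods listed, spectral subtraction is the easiest to sketch. The code below shows a generic magnitude-domain version with the noise spectrum estimated from the leading frames; the oversubtraction factor, spectral floor and the assumption that the first ten frames are noise-only are textbook defaults, not the thesis' NSS/MSS settings.

```python
# Sketch: basic magnitude spectral subtraction. The noise magnitude spectrum is
# estimated from the first few frames (assumed noise-only), subtracted from each
# frame's magnitude with an oversubtraction factor, and floored; the noisy phase
# is reused for reconstruction. Constants are generic textbook choices.
import numpy as np

def spectral_subtraction(x, frame_len=256, hop=128, noise_frames=10,
                         alpha=2.0, floor=0.02):
    win = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    spectra = [np.fft.rfft(win * x[s:s + frame_len]) for s in starts]
    noise_mag = np.mean([np.abs(S) for S in spectra[:noise_frames]], axis=0)
    out = np.zeros(len(x))
    for s, S in zip(starts, spectra):
        mag = np.maximum(np.abs(S) - alpha * noise_mag, floor * np.abs(S))
        out[s:s + frame_len] += np.fft.irfft(mag * np.exp(1j * np.angle(S)), frame_len)
    return out

# Toy usage: a 440 Hz tone in white noise (16 kHz assumed).
noisy = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) + 0.3 * np.random.randn(16000)
print(np.std(noisy), np.std(spectral_subtraction(noisy)))  # rough before/after comparison
```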
WU, DONG-HAN, and 吳東翰. „Reduced Computation of Speech Coder Using a Voice Activity Detection Algorithm“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/83961092562227492103.
南臺科技大學 (Southern Taiwan University of Science and Technology), 資訊工程系 (Department of Computer Science and Information Engineering), ROC year 105 (2016/2017).
With the explosive growth of Internet use and multimedia technology, multimedia communication is nowadays integrated into personal information devices, and because of their limited computational capability the need has arisen for a coder with low computational complexity that can match different hardware platforms and integrate the services of various media sources. For an Internet or wireless speech communicator, heavy computation consumes more power and contributes to a higher price or reduced battery life. In order to achieve real-time, continuous speech communication, reducing the computational complexity of the speech coder is desirable for modern communication systems. In this thesis, we use a Voice Activity Detection (VAD) algorithm merely to classify the speech signal into two types of frames, active frames and inactive frames, in our proposed method. We analyzed the characteristics of the inactive speech signals in our experiments, and the results clearly show that the encoding parameters are uniformly distributed for the inactive speech subframes. Therefore, if the current frame is an inactive speech frame, its code-excited signal is not encoded; instead, the encoding parameters of the codebook structure are assigned randomly. The overall simulation results indicate that the average perceptual evaluation of speech quality score is degraded only slightly, by 0.023, and that our proposed methods can reduce the total computational complexity by about 30% relative to the original G.723.1 encoder with perceptually negligible degradation.
Tu, Wen Hsiang, and 杜文祥. „Study on the Voice Activity Detection Techniques for Robust Speech Feature Extraction“. Thesis, 2007. http://ndltd.ncl.edu.tw/handle/76966247400637028949.
國立暨南國際大學 (National Chi Nan University), 電機工程學系 (Department of Electrical Engineering), ROC year 95 (2006/2007).
The performance of a speech recognition system is often degraded by the mismatch between the development and application environments. One of the major sources of this mismatch is additive noise. The approaches for handling the problem of additive noise can be divided into three classes: speech enhancement, robust speech feature extraction, and compensation of speech models. In this thesis, we focus on the second class, robust speech feature extraction. Robust feature extraction approaches are often combined with voice activity detection in order to estimate the noise characteristics; a voice activity detector (VAD) is used to discriminate the speech and noise-only portions within an utterance. This thesis primarily investigates the effectiveness of various features for the VAD. These features include the low-frequency spectral magnitude (LFSM), the full-band spectral magnitude (FBSM), the cumulative quantized spectrum (CQS) and the high-pass log-energy. The resulting VAD provides noise information to two noise-robustness techniques, spectral subtraction (SS) and silence log-energy normalization (SLEN), in order to reduce the influence of additive noise in speech recognition. The recognition experiments are conducted on the Aurora-2 database. Experimental results show that the proposed VAD is capable of providing accurate noise information, with which the subsequent processes, SS and SLEN, significantly improve speech recognition performance in various noise-corrupted environments. As a result, we confirm that an appropriate selection of features for VAD implicitly improves the noise robustness of a speech recognition system.
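Note: the exact SLEN formulation is not given in this abstract. The sketch below illustrates one plausible reading (an assumption, not the thesis' formula): frames labelled as silence by the VAD have their log-energy replaced by a constant low value, removing noise-driven fluctuation in the non-speech regions.

```python
# Sketch of silence log-energy normalization as interpreted here (an assumption):
# log-energies of frames flagged as non-speech by a VAD are replaced by a fixed
# small constant, while speech-frame log-energies are left untouched.
import numpy as np

def silence_log_energy_normalization(log_energy, speech_flags, silence_value=None):
    log_energy = np.asarray(log_energy, dtype=float).copy()
    if silence_value is None:
        silence_value = log_energy.min()       # assumed choice of the constant
    log_energy[~np.asarray(speech_flags, dtype=bool)] = silence_value
    return log_energy

log_e = np.array([2.1, 2.3, 6.5, 7.0, 6.8, 2.6, 2.2])
flags = np.array([0, 0, 1, 1, 1, 0, 0])
print(silence_log_energy_normalization(log_e, flags))
```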
楊佳興. „A Real-time Speech Purification and Voice Activity Detection System Using Microphone Array“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/qy6qq9.
Hsuei, Yan-Jung, and 許晏榮. „SOPC Implementation of Speech Purification and Voice Activity Detection System Using Microphone Array“. Thesis, 2005. http://ndltd.ncl.edu.tw/handle/mkjwd4.
國立交通大學 (National Chiao Tung University), 電機與控制工程系所 (Department of Electrical and Control Engineering), ROC year 94 (2005/2006).
A real-time speech purification and voice activity detection (VAD) system for noisy indoor environments is proposed in this thesis. The system contains a real-time eight-channel microphone array signal processing platform. An adaptive spatial filter is also designed on the platform to give the system the ability to adapt to the environmental characteristics and the noise. All the algorithms are realized on a Nios embedded system-on-programmable-chip (SOPC) platform: the VAD algorithm is executed by the Nios processor, and the adaptive filter is accelerated by self-designed hardware implemented as a customized peripheral. The communication between the Nios processor and the customized peripheral is achieved via the Avalon bus. Since the order of the spatial filter is flexible, the system can be adjusted for better speech purification results. The experimental results verify that the system can suppress the effect of environmental noise and improve the SNR effectively.
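Note: the adaptive spatial filter in this work is realised in hardware on the SOPC platform. Purely as a software illustration of the adaptation principle, and not of the thesis' design, the sketch below runs a normalized LMS filter that adapts a noise-only reference channel to cancel the correlated noise in a primary channel; the filter order and step size are arbitrary assumptions.

```python
# Sketch: normalized LMS (NLMS) adaptive noise cancellation. A reference channel
# observing only (filtered) noise is adapted to match the noise component in the
# primary channel; subtracting the filter output leaves an enhanced signal.
import numpy as np

def nlms_cancel(primary, reference, order=32, mu=0.1, eps=1e-8):
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]   # current and past reference samples
        y = w @ x                                  # estimate of the noise in 'primary'
        e = primary[n] - y                         # enhanced output sample
        w += mu * e * x / (x @ x + eps)            # NLMS weight update
        out[n] = e
    return out

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))        # slowly varying "speech"
noise = rng.normal(size=8000)
primary = clean + np.convolve(noise, [0.6, 0.3, 0.1])[:8000]   # noise reaches the mic filtered
enhanced = nlms_cancel(primary, noise)
print(np.std(primary - clean), np.std(enhanced[2000:] - clean[2000:]))  # error before/after
```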
Chen, Hung-Bin, and 陳鴻彬. „On the Study of Energy-Based Speech Feature Normalization and Application to Voice Activity Detection“. Thesis, 2007. http://ndltd.ncl.edu.tw/handle/41039482721804356460.
國立臺灣師範大學 (National Taiwan Normal University), 資訊工程研究所 (Graduate Institute of Computer Science and Information Engineering), ROC year 95 (2006/2007).
This thesis considers robust speech recognition in various noise environments, with a special focus on ways to reconstruct clean time-domain log-energy features from noise-contaminated ones. Based on the distribution characteristics of the log-energy features of each speech utterance, we aim to develop an efficient approach that rescales the log-energy features of a noisy utterance so as to alleviate the mismatch caused by environmental noise and thereby improve recognition performance. Since the time-domain behaviour of speech signals shows that lower-energy frames are more vulnerable to additive noise than higher-energy ones, and that the log-energy features of an utterance tend to be lifted up when it is seriously corrupted by additive noise, we propose a simple but effective approach, named log-energy rescaling normalization (LERN), to appropriately rescale the log-energy features of noisy speech towards those of the desirable clean speech. The speech recognition experiments were conducted under various noise conditions using the European Telecommunications Standards Institute (ETSI) Aurora-2.0 database, which contains connected digit utterances spoken in English and offers eight noise sources and seven different signal-to-noise ratios (SNRs). The experimental results show that the proposed LERN approach performs considerably better than other conventional energy or log-energy feature normalization methods. Another set of experiments, on large vocabulary continuous speech recognition (LVCSR) of Mandarin broadcast news, also evidenced the effectiveness of LERN.
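Note: the abstract does not give the LERN formula. The sketch below is only a loose illustration under an assumed linear rescaling: the observed log-energy range of a noisy utterance is mapped onto a target range so that noise-lifted low-energy frames are pushed back down; it should not be read as the published method.

```python
# Illustrative (assumed) linear rescaling of utterance log-energies: the observed
# noisy range [min, max] is mapped onto a target "clean" range, pushing the
# noise-lifted silence frames back toward a low log-energy value. This is a
# plausible reading of log-energy rescaling, not the LERN formula itself.
import numpy as np

def rescale_log_energy(log_e, target_min=-2.0, target_max=None):
    log_e = np.asarray(log_e, dtype=float)
    lo, hi = log_e.min(), log_e.max()
    if target_max is None:
        target_max = hi                        # keep the high-energy end in place
    scale = (target_max - target_min) / max(hi - lo, 1e-6)
    return target_min + scale * (log_e - lo)

noisy_log_e = np.array([4.0, 4.2, 7.5, 8.0, 7.8, 4.1])   # silences lifted up by noise
print(np.round(rescale_log_energy(noisy_log_e), 2))
```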
ZHENG, SU-XING, and 鄭素幸. „A study on wireless digital subscriber loop and the channel sharing efficiency through speech activity detection“. Thesis, 1992. http://ndltd.ncl.edu.tw/handle/03260404361617659124.
Der volle Inhalt der QuelleVenter, Petrus Jacobus. „Recording and automatic detection of African elephant (Loxodonta africana) infrasonic rumbles“. Diss., 2008. http://hdl.handle.net/2263/28329.
Dissertation (MEng)--University of Pretoria, 2008. Department of Electrical, Electronic and Computer Engineering.