A selection of scholarly literature on the topic "Speech Activity Detection (SAD)"

Cite a source in APA, MLA, Chicago, Harvard, and other citation styles


Consult the lists of current articles, books, dissertations, reports, and other scholarly sources on the topic "Speech Activity Detection (SAD)".

Next to every work in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference for the selected work will be formatted automatically according to the citation style you need (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the scholarly publication in PDF format and read its online annotation, provided the relevant parameters are present in the metadata.

Journal articles on the topic "Speech Activity Detection (SAD)"

1. Kaur, Sukhvinder, and J. S. Sohal. "Speech Activity Detection and its Evaluation in Speaker Diarization System". International Journal of Computers & Technology 16, no. 1 (March 13, 2017): 7567–72. http://dx.doi.org/10.24297/ijct.v16i1.5893.

Abstract:
In speaker diarization, speech/voice activity detection is performed to separate speech, non-speech, and silent frames. The zero-crossing rate and root-mean-square value of frames of audio clips have been used to select training data for the silent, speech, and non-speech models. The trained models are used by two classifiers, a Gaussian mixture model (GMM) and an artificial neural network (ANN), to classify the speech and non-speech frames of an audio clip. The results of the ANN and GMM classifiers are compared using receiver operating characteristic (ROC) curves and detection error tradeoff (DET) graphs. It is concluded that the neural-network-based SAD performs comparatively better than the Gaussian-mixture-model-based SAD.
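The two frame measurements used above to bootstrap the training data are easy to reproduce. The following sketch, in plain NumPy, computes the zero-crossing rate and RMS energy per frame; the 25 ms / 10 ms framing at 16 kHz is an assumption, since the paper does not state its parameters.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Zero-crossing rate and RMS energy per frame (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    zcr = np.empty(n_frames)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Fraction of consecutive sample pairs whose sign differs.
        zcr[i] = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return zcr, rms

# Toy check: white noise has a far higher ZCR than a low-frequency tone.
rng = np.random.default_rng(0)
noise = rng.standard_normal(16000)
tone = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)
print(frame_features(noise)[0].mean(), frame_features(tone)[0].mean())
```

Thresholding or clustering these two values per frame yields the silent/speech/non-speech training labels on which the GMM and ANN models are then trained.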
2. Dutta, Satwik, Prasanna Kothalkar, Johanna Rudolph, Christine Dollaghan, Jennifer McGlothlin, Thomas Campbell, and John H. Hansen. "Advancing speech activity detection for automatic speech assessment of pre-school children prompted speech using COMBO-SAD". Journal of the Acoustical Society of America 148, no. 4 (October 2020): 2469–67. http://dx.doi.org/10.1121/1.5146831.

3. Mahalakshmi, P. "A Review on Voice Activity Detection and Mel-Frequency Cepstral Coefficients for Speaker Recognition (Trend Analysis)". Asian Journal of Pharmaceutical and Clinical Research 9, no. 9 (December 1, 2016): 360. http://dx.doi.org/10.22159/ajpcr.2016.v9s3.14352.

Abstract:
Objective: The objective of this review article is to give a complete review of the various techniques that have been used for speech recognition purposes over two decades. Methods: VAD (Voice Activity Detection) and SAD (Speech Activity Detection) techniques, used to distinguish voiced from unvoiced signals, are discussed, as is the MFCC (Mel-Frequency Cepstral Coefficient) technique, which detects specific features. Results: The review shows that research on MFCC has been dominant in signal processing in comparison to VAD and other existing techniques. Conclusion: Speaker recognition techniques used previously and those in current research are compared, and a clear picture of the better technique emerges from the review of the literature spanning more than two decades. Keywords: cepstral analysis, Mel-frequency cepstral coefficients, signal processing, speaker recognition, voice activity detection.
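As a concrete reference point for the MFCC pipeline the review surveys (FFT, mel filterbank, log compression, DCT), a minimal extraction with the librosa library might look as follows; the coefficient count and window sizes are illustrative choices, not values taken from the review.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation follows the same chain

sr = 16000
# One second of synthetic audio stands in for real speech here.
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

# 13 cepstral coefficients per 25 ms window with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```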
4. Zhao, Hui, Yu Tai Wang, and Xing Hai Yang. "Emotion Detection System Based on Speech and Facial Signals". Advanced Materials Research 459 (January 2012): 483–87. http://dx.doi.org/10.4028/www.scientific.net/amr.459.483.

Abstract:
This paper introduces the present status of speech emotion detection. In order to improve on the emotion recognition rate of a single modality, a bimodal fusion method based on speech and facial expression is proposed. First, an emotional database comprising speech and facial expressions is established. For the emotions calm, happy, surprise, anger, and sad, ten speech parameters are extracted, and the PCA method is used to detect the speech emotion. Bimodal emotion detection that fuses in facial expression information is then analyzed. The experimental results show that the emotion recognition rate with bimodal fusion is about 6 percentage points higher than the recognition rate obtained with speech prosodic features alone.
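The PCA step named in the abstract can be illustrated in a few lines with scikit-learn; since the ten speech parameters are not enumerated, random stand-in features are used here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 utterances x 10 prosodic parameters (the actual
# parameters are not listed in the abstract, so random values are used).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))

# Keep the principal components that explain 95% of the variance, the
# usual way PCA compresses correlated prosodic features before classification.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```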
5. Gelly, Gregory, and Jean-Luc Gauvain. "Optimization of RNN-Based Speech Activity Detection". IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, no. 3 (March 2018): 646–56. http://dx.doi.org/10.1109/taslp.2017.2769220.

6. Koh, Min-sung, and Margaret Mortz. "Improved voice activity detection of noisy speech". Journal of the Acoustical Society of America 107, no. 5 (May 2000): 2907–8. http://dx.doi.org/10.1121/1.428823.

7. Quan, Changqin, Bin Zhang, Xiao Sun, and Fuji Ren. "A combined cepstral distance method for emotional speech recognition". International Journal of Advanced Robotic Systems 14, no. 4 (July 1, 2017): 172988141771983. http://dx.doi.org/10.1177/1729881417719836.

Abstract:
Affective computing is not only a direction of reform in artificial intelligence but also an exemplification of advanced intelligent machines. Emotion is the biggest difference between human and machine; if a machine behaves with emotion, it will be accepted by more people. Voice is the most natural manner of daily communication and the one most easily understood and accepted. The recognition of emotional voice is an important field of artificial intelligence. However, in emotion recognition there often exists the phenomenon that two emotions are particularly vulnerable to confusion. This article presents a combined cepstral distance method for two-group multi-class emotion classification in emotional speech recognition. Cepstral distance combined with speech energy is widely used for speech-signal endpoint detection in speech recognition. In this work, the cepstral distance is used to measure the similarity between frames in emotional signals and in neutral signals. These features are input to a directed acyclic graph support vector machine for classification. Finally, a two-group classification strategy is adopted to resolve confusion in multi-emotion recognition. In the experiments, a Chinese Mandarin emotion database is used, and a large training set (1134 + 378 utterances) ensures a powerful modelling capability for predicting emotion. The experimental results show that the cepstral distance increases the recognition rate for the emotion sad and balances the recognition results while eliminating overfitting. For the German-language Berlin emotional speech database, the recognition rate between sad and boring, which are very difficult to distinguish, reaches 95.45%.
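A frame-to-frame cepstral distance of the kind used above can be sketched as follows; this is a real-cepstrum variant in plain NumPy, and the paper's exact definition may differ.

```python
import numpy as np

def real_cepstrum(frame):
    # Log-magnitude spectrum followed by an inverse FFT.
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12  # avoid log(0)
    return np.fft.irfft(np.log(spectrum))

def cepstral_distance(frame_a, frame_b, n_coeffs=12):
    # Euclidean distance over low-order cepstral coefficients;
    # coefficient 0 carries overall energy and is excluded.
    ca = real_cepstrum(frame_a)[1 : n_coeffs + 1]
    cb = real_cepstrum(frame_b)[1 : n_coeffs + 1]
    return np.linalg.norm(ca - cb)

# Toy check: a frame is much closer to a scaled copy of itself (same
# spectral envelope) than to a noise frame.
rng = np.random.default_rng(2)
t = np.arange(400) / 16000
voiced = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(cepstral_distance(voiced, 0.8 * voiced))              # near 0
print(cepstral_distance(voiced, rng.standard_normal(400)))  # much larger
```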
8. Dash, Debadatta, Paul Ferrari, Satwik Dutta, and Jun Wang. "NeuroVAD: Real-Time Voice Activity Detection from Non-Invasive Neuromagnetic Signals". Sensors 20, no. 8 (April 16, 2020): 2248. http://dx.doi.org/10.3390/s20082248.

Abstract:
Neural speech decoding-driven brain-computer interfaces (BCIs), or speech-BCIs, are a novel paradigm for exploring communication restoration for locked-in (fully paralyzed but aware) patients. Speech-BCIs aim to map a direct transformation from neural signals to text or speech, which has the potential for a higher communication rate than current BCIs. Although recent progress has demonstrated the potential of speech-BCIs from either invasive or non-invasive neural signals, the majority of the systems developed so far still assume that the onset and offset of the speech utterances within the continuous neural recordings are known. This lack of real-time voice/speech activity detection (VAD) is a current obstacle for future applications of neural speech decoding in which BCI users can have a continuous conversation with other speakers. To address this issue, in this study we attempted to detect voice/speech activity automatically and directly from neural signals recorded using magnetoencephalography (MEG). First, we classified whole segments of pre-speech, speech, and post-speech in the neural signals using a support vector machine (SVM). Second, for continuous prediction, we used a long short-term memory recurrent neural network (LSTM-RNN) to efficiently decode the voice activity at each time point via its sequential pattern-learning mechanism. Experimental results demonstrated the possibility of real-time VAD directly from non-invasive neural signals with about 88% accuracy.
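The frame-wise LSTM decoding described above follows a standard sequence-labelling pattern. A minimal PyTorch sketch is shown below; the channel count and hidden size are invented, since the study's MEG configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class FrameVAD(nn.Module):
    """LSTM that emits a speech/non-speech probability at every time step."""

    def __init__(self, n_channels=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, time, channels)
        out, _ = self.lstm(x)          # hidden state at every time step
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, time)

# Toy forward pass: 2 recordings, 500 time points, 64 sensor channels.
model = FrameVAD()
probs = model(torch.randn(2, 500, 64))
print(probs.shape)  # torch.Size([2, 500])
```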
9. Mattys, Sven L., and Jamie H. Clark. "Lexical activity in speech processing: evidence from pause detection". Journal of Memory and Language 47, no. 3 (October 2002): 343–59. http://dx.doi.org/10.1016/s0749-596x(02)00037-2.

10. Potamitis, I., and E. Fishler. "Speech activity detection of moving speaker using microphone arrays". Electronics Letters 39, no. 16 (2003): 1223. http://dx.doi.org/10.1049/el:20030726.


Dissertations on the topic "Speech Activity Detection (SAD)"

1. Näslund, Anton, and Charlie Jeansson. "Robust Speech Activity Detection and Direction of Arrival Using Convolutional Neural Networks". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297756.

Abstract:
Social robots are becoming more and more common in our everyday lives. In the field of conversational robotics, development is moving towards socially engaging robots capable of humanlike conversation. This project looked into one of the technical aspects of recognizing speech, namely speech activity detection (SAD). The presented solution uses a convolutional neural network (CNN) based system to detect speech in a forward azimuth area. The project used a dataset from FestVox called CMU Arctic, complemented with recorded noises. A library called Pyroomacoustics was used to simulate a real-world setup in order to create a robust system. A simplified version that only detected speech activity reached an accuracy of 95%; the finished model resulted in an accuracy of 93%. Since no previously published solutions to our exact problem were found, the system was compared with a related approach, the voice activity detection (VAD) algorithm WebRTC with beamforming. In both cases, our solution proved to be higher in accuracy than what WebRTC achieved on our dataset.
Bachelor's thesis in electrical engineering, 2020, KTH, Stockholm
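Pyroomacoustics, the simulation library named in the abstract, builds room impulse responses from a shoebox model and convolves the sources with them. A minimal session looks roughly like this; the room geometry, absorption, and positions are invented for illustration.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.default_rng(3).standard_normal(fs)  # stand-in for a CMU Arctic clip

# A 6 m x 5 m x 3 m shoebox room with moderately absorbent walls.
room = pra.ShoeBox([6, 5, 3], fs=fs, materials=pra.Material(0.35), max_order=10)
room.add_source([2.0, 2.5, 1.5], signal=speech)

# Two microphones 10 cm apart, roughly where a robot's head might sit.
mics = np.c_[[3.0, 2.0, 1.2], [3.1, 2.0, 1.2]]
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()                      # convolves the source with simulated RIRs
print(room.mic_array.signals.shape)  # (2, number_of_samples)
```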
2. Wejdelind, Marcus, and Nils Wägmark. "Multi-speaker Speech Activity Detection From Video". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-297701.

Abstract:
A conversational robot will in many cases have to deal with multi-party spoken interaction in which one or more people could be speaking simultaneously. To do this, the robot must be able to identify the speakers in order to attend to them. Our project has approached this problem from a visual point of view: a convolutional neural network (CNN) was implemented and trained on video-stream input containing one or more faces from an existing dataset (AVA-Speech). The goal for the network has been to detect, for each face and at each point in time, the probability of that person speaking. Our best result using an added optical-flow function was 0.753, while we reached 0.781 using another pre-processing method. These numbers corresponded surprisingly well with the existing scientific literature in the area, where 0.77 proved to be an appropriate benchmark level.
Bachelor's thesis in electrical engineering, 2020, KTH, Stockholm
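Dense optical flow of the kind used as a pre-processing step here is commonly computed with OpenCV's Farnebäck algorithm; the sketch below runs it on synthetic frames and is an assumption about the pipeline, since the thesis does not specify its implementation.

```python
import numpy as np
import cv2

# Two synthetic 64x64 grayscale "face crops"; the second is shifted two
# pixels to the right, mimicking mouth or head motion between video frames.
rng = np.random.default_rng(4)
prev = (rng.random((64, 64)) * 255).astype(np.uint8)
nxt = np.roll(prev, 2, axis=1)

# Dense Farneback optical flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print(flow.shape, flow[..., 0].mean())  # (64, 64, 2); mean dx should be near 2
```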
3. Murrin, Paul. "Objective measurement of voice activity detectors". Thesis, University of York, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.325647.

4. Laverty, Stephen William. "Detection of Nonstationary Noise and Improved Voice Activity Detection in an Automotive Hands-free Environment". Thesis, Worcester Polytechnic Institute, 2005. http://www.wpi.edu/Pubs/ETD/Available/etd-051105-110646/.

5. Minotto, Vicente Peruffo. "Audiovisual voice activity detection and localization of simultaneous speech sources". Thesis, Universidade Federal do Rio Grande do Sul, 2013. http://hdl.handle.net/10183/77231.

Abstract:
Given the tendency of creating interfaces between humans and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional means of communication humans use: speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, commonly aided by visual cues such as mouth movement and the location of the speakers. This processing done by the brain includes two important components that speech-based communication requires: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as in the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation, and multiple simultaneous speakers. In this work we propose approaches for tackling these problems using audiovisual information, both for the single-source and the competing-sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the state-of-the-art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10 cm when locating such speakers.
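The SRP method used here sums generalized cross-correlations with phase transform (GCC-PHAT) over all microphone pairs and candidate directions; the pairwise building block can be sketched in plain NumPy as follows.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Time delay between two microphone channels via GCC-PHAT,
    the cross-correlation that SRP aggregates over microphone pairs."""
    n = len(x) + len(y)
    cross = np.fft.rfft(x, n=n) * np.conj(np.fft.rfft(y, n=n))
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds

# Toy check: the second channel lags the first by 5 samples.
fs = 16000
rng = np.random.default_rng(5)
sig = rng.standard_normal(4096)
delayed = np.concatenate((np.zeros(5), sig[:-5]))
print(gcc_phat(sig, delayed, fs) * fs)  # about -5 (negative lag = y delayed)
```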
6. Ent, Petr. "Voice Activity Detection". Master's thesis, Vysoké učení technické v Brně, Fakulta informačních technologií, 2009. http://www.nusl.cz/ntk/nusl-235483.

Abstract:
This thesis deals with the use of support vector machines for voice activity detection. In the first part, various kinds of features, their extraction, and their processing are examined, and the optimal combination yielding the best results is identified. The second part presents the voice activity detection system itself and the tuning of its parameters. Finally, the results are compared with two other systems based on different principles. The ERT broadcast news database was used for testing and tuning; the comparison between the systems was then carried out on the database from the NIST06 Rich Test Evaluations.
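The classifier setup at the core of the thesis can be sketched with scikit-learn; the two toy features below (log energy and zero-crossing rate) merely stand in for the richer feature combinations the thesis actually evaluates.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy two-feature frames: [log energy, zero-crossing rate].
rng = np.random.default_rng(6)
speech = np.column_stack((rng.normal(0.0, 1.0, 300), rng.normal(0.1, 0.03, 300)))
nonspeech = np.column_stack((rng.normal(-4.0, 1.0, 300), rng.normal(0.4, 0.1, 300)))
X = np.vstack((speech, nonspeech))
y = np.array([1] * 300 + [0] * 300)

# RBF-kernel support vector machine, the classifier family the thesis builds on.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the nearly separable toy data
```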
7. Cho, Yong Duk. "Speech detection, enhancement and compression for voice communications". Thesis, University of Surrey, 2001. http://epubs.surrey.ac.uk/842991/.

Abstract:
Speech signal processing for voice communications can be characterised in terms of silence compression, noise reduction, and speech compression. The limited channel bandwidth of voice communication systems requires efficient compression of speech and silence signals while retaining voice quality. Silence compression, by means of both voice activity detection (VAD) and comfort noise generation, can provide transparent speech quality while substantially lowering the transmission bit-rate, since pause regions between talk spurts carry no voice information. This thesis therefore proposes a smoothed likelihood-ratio VAD, designed on the basis of a behavioural analysis and improvement of a statistical-model-based voice activity detector. Input speech may be noisy, which can make voice communication fatiguing and less intelligible; this can be alleviated by noise reduction as a preprocessor for speech coding. Noise characteristics in speech enhancement are typically adapted during the pause regions classified by a voice activity detector. However, VAD errors can lead to over- or under-estimation of the noise statistics. This thesis therefore proposes mixed-decision-based noise adaptation, built on an integration of soft- and hard-decision-based methods, defined by the speech presence uncertainty and the VAD result, respectively. In low bit-rate speech coding, the sinusoidal model has been widely applied because it exploits the phase redundancy of speech signals well. Its performance, however, can be severely degraded by mis-estimation of the pitch. This thesis therefore proposes a robust pitch estimation technique based on the autocorrelation of spectral amplitudes. Another important parameter in sinusoidal speech coding is the spectral magnitude of the LP-residual signal. It is not easy to quantise the magnitudes directly, because the dimensions of the spectral vectors vary from frame to frame depending on the pitch. To alleviate this problem, this thesis proposes mel-scale-based dimension conversion, which converts the spectral vectors to a fixed dimension based on mel-scale warping. A predictive coding scheme is also employed in order to exploit the inter-frame redundancy between the spectral vectors. Experimental results show that each proposed technique is suitable for enhancing speech quality for voice communications. Furthermore, an improved speech coder incorporating the proposed techniques is developed. The vocoder gives speech quality comparable to TIA/EIA IS-127 for noisy speech whilst operating at lower than half the bit-rate of the reference coder. Keywords: voice activity detection, speech enhancement, pitch, spectral magnitude quantisation, low bit-rate coding.
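The statistical-model detector this thesis starts from forms, per frequency bin, a likelihood ratio between Gaussian speech-present and speech-absent hypotheses and smooths it over time. The sketch below is a compact illustration in which the a priori SNR is supplied directly rather than estimated recursively, a deliberate simplification.

```python
import numpy as np

def lr_vad(frames_power, noise_power, a_priori_snr, alpha=0.95):
    """Smoothed mean log-likelihood ratio over frequency bins.

    frames_power: (n_frames, n_bins) noisy-speech periodogram
    noise_power:  (n_bins,) noise PSD estimate
    a_priori_snr: (n_frames, n_bins); in practice estimated recursively
                  (decision-directed), assumed known in this sketch.
    """
    gamma = frames_power / noise_power            # a posteriori SNR
    xi = a_priori_snr                             # a priori SNR
    # Per-bin log likelihood ratio of the two Gaussian hypotheses.
    llr = gamma * xi / (1.0 + xi) - np.log1p(xi)
    frame_llr = llr.mean(axis=1)
    smoothed = np.empty_like(frame_llr)           # first-order smoothing
    acc = 0.0
    for t, v in enumerate(frame_llr):
        acc = alpha * acc + (1 - alpha) * v
        smoothed[t] = acc
    return smoothed > 0.0                         # threshold at LR = 1

# Toy run: 100 noise-only frames followed by 100 much louder frames.
rng = np.random.default_rng(7)
frames = np.vstack((rng.chisquare(2, (100, 129)) / 2,
                    rng.chisquare(2, (100, 129)) * 3))
xi = np.vstack((np.full((100, 129), 0.1), np.full((100, 129), 5.0)))
print(lr_vad(frames, np.ones(129), xi).mean())  # fraction flagged as speech
```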
8. Doukas, Nikolaos. "Voice activity detection using energy based measures and source separation". Thesis, Imperial College London, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.245220.

9. Sinclair, Mark. "Speech segmentation and speaker diarisation for transcription and translation". Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20970.

Abstract:
This dissertation outlines work related to Speech Segmentation, segmenting an audio recording into regions of speech and non-speech, and Speaker Diarization, further segmenting those regions into those pertaining to homogeneous speakers. Knowing not only what was said but also who said it and when has many useful applications. As well as providing a richer level of transcription for speech, we will show how such knowledge can improve Automatic Speech Recognition (ASR) system performance and can also benefit downstream Natural Language Processing (NLP) tasks such as machine translation and punctuation restoration. While segmentation and diarization may appear to be relatively simple tasks to describe, in practice we find that they are very challenging and are, in general, ill-defined problems. Therefore, we first provide a formalisation of each of the problems as the sub-division of speech within acoustic space and time. Here, we see that the task can become very difficult when we want to partition this domain into our target classes of speakers, whilst avoiding other classes that reside in the same space, such as phonemes. We present a theoretical framework for describing and discussing the tasks as well as introducing existing state-of-the-art methods and research. Current Speaker Diarization systems are notoriously sensitive to hyper-parameters and lack robustness across datasets. Therefore, we present a method which uses a series of oracle experiments to expose the limitations of current systems and to which system components these limitations can be attributed. We also demonstrate how Diarization Error Rate (DER), the dominant error metric in the literature, is not a comprehensive or reliable indicator of overall performance or of error propagation to subsequent downstream tasks. These results inform our subsequent research. We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation is a crucial first step in the system chain. Current methods typically do not account for the inherent structure of spoken discourse. As such, we explored a novel method which exploits an utterance-duration prior in order to better model the segment distribution of speech. We show how this method improves not only segmentation, but also the performance of subsequent speech recognition, machine translation and speaker diarization systems. Typical ASR transcriptions do not include punctuation, and the task of enriching transcriptions with this information is known as 'punctuation restoration'. The benefit is not only improved readability but also better compatibility with NLP systems that expect sentence-like units, such as conventional machine translation. We show how segmentation and diarization are related tasks that are able to contribute acoustic information that complements existing linguistically-based punctuation approaches. There is a growing demand for speech technology applications in the broadcast media domain. This domain presents many new challenges, including diverse noise and recording conditions. We show that the capacity of existing GMM-HMM based speech segmentation systems is limited for such scenarios and present a Deep Neural Network (DNN) based method which offers a more robust speech segmentation approach, resulting in improved speech recognition performance for a television broadcast dataset. Ultimately, we are able to show that speech segmentation is an inherently ill-defined problem for which the solution is highly dependent on the downstream task that it is intended for.
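For reference, the DER criticized above is the sum of missed speech, false-alarm, and speaker-confusion time over total reference speech time. A frame-level sketch, omitting the optimal speaker mapping a real scorer performs, is given below.

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate.

    ref, hyp: integer speaker labels per frame, 0 meaning non-speech.
    The usual Hungarian speaker mapping is omitted, so labels are assumed
    to correspond between reference and hypothesis already.
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != 0
    miss = np.sum(speech & (hyp == 0))
    false_alarm = np.sum(~speech & (hyp != 0))
    confusion = np.sum(speech & (hyp != 0) & (ref != hyp))
    return (miss + false_alarm + confusion) / max(np.sum(speech), 1)

ref = [0, 0, 1, 1, 1, 2, 2, 2, 0, 0]
hyp = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]
print(frame_der(ref, hyp))  # 0.5: 2 false-alarm + 1 confused frame / 6 speech frames
```

Note how a single number conflates three very different error types, which is exactly why the dissertation argues DER alone is not a reliable indicator of downstream impact.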
10. Thorell, Hampus. "Voice Activity Detection in the Tiger Platform". Thesis, Linköping University, Department of Electrical Engineering, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-6586.

Abstract:

Sectra Communications AB has developed a terminal for encrypted communication called the Tiger platform. During voice communication, delays have sometimes been experienced, resulting in conversational complications.

A solution to this problem, as proposed by Sectra, would be to introduce voice activity detection to the Tiger platform, that is, a separation of the speech parts of the input signal from the non-speech parts. By transferring only the speech parts to the receiver, the bandwidth needed is dramatically decreased, and with a lower required bandwidth the delays should gradually disappear. The problem is then to find a method that reliably distinguishes the speech parts of the input signal. Fortunately, a great deal of theory exists on the subject, and numerous voice activity detection methods are available today.

In this thesis the theory of voice activity detection has been studied. A review of voice activity detectors on the market today, followed by an evaluation of some of them, was performed in order to select a suitable candidate for the Tiger platform. This evaluation became the foundation for the selection of a voice activity detector for implementation.

Finally, the chosen voice activity detector, including a comfort noise generator, was implemented on the platform, based on the platform's special requirements. Tests of the implementation in office environments show that possible delays are steadily reduced during periods of speech inactivity, while active speech quality is preserved.

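The bandwidth argument above is easy to make concrete: an energy gate with a hangover counter transmits only speech frames. The sketch below is purely illustrative; the detector chosen in the thesis is more sophisticated, and the threshold and hangover values here are invented.

```python
import numpy as np

def energy_vad(frames_rms, threshold=0.05, hangover=8):
    """Energy-gate VAD with a hangover counter, the classic low-cost scheme.

    After the gate closes, `hangover` extra frames stay marked as speech
    so that trailing low-energy sounds are not clipped.
    """
    active = np.zeros(len(frames_rms), dtype=bool)
    countdown = 0
    for i, e in enumerate(frames_rms):
        if e > threshold:
            countdown = hangover
        if countdown > 0:
            active[i] = True
            countdown -= 1
    return active

# Toy trace: 2 s of silence, 1 s of speech-like energy, 2 s of silence
# (10 ms frames). Only the active frames would be transmitted.
rms = np.concatenate((np.full(200, 0.01), np.full(100, 0.2), np.full(200, 0.01)))
active = energy_vad(rms)
print(f"transmitted {active.mean():.0%} of frames")  # ~22% instead of 100%
```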

Book chapters on the topic "Speech Activity Detection (SAD)"

1. Alam, Tanvirul, and Akib Khan. "Lightweight CNN for Robust Voice Activity Detection". In Speech and Computer, 1–12. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-60276-5_1.

2. Solé-Casals, Jordi, Pere Martí-Puig, Ramon Reig-Bolaño, and Vladimir Zaiats. "Score Function for Voice Activity Detection". In Advances in Nonlinear Speech Processing, 76–83. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-11509-7_10.

3. Pertilä, Pasi, Alessio Brutti, Piergiorgio Svaizer, and Maurizio Omologo. "Multichannel Source Activity Detection, Localization, and Tracking". In Audio Source Separation and Speech Enhancement, 47–64. Chichester, UK: John Wiley & Sons Ltd, 2018. http://dx.doi.org/10.1002/9781119279860.ch4.

4. Málek, Jiří, and Jindřich Žďánský. "Voice-Activity and Overlapped Speech Detection Using x-Vectors". In Text, Speech, and Dialogue, 366–76. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58323-1_40.

5. Huang, Zhongqiang, and Mary P. Harper. "Speech Activity Detection on Multichannels of Meeting Recordings". In Machine Learning for Multimodal Interaction, 415–27. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11677482_35.

6. Zelinka, Jan. "Deep Learning and Online Speech Activity Detection for Czech Radio Broadcasting". In Text, Speech, and Dialogue, 428–35. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-00794-2_46.

7. Chu, Stephen M., Etienne Marcheret, and Gerasimos Potamianos. "Automatic Speech Recognition and Speech Activity Detection in the CHIL Smart Room". In Machine Learning for Multimodal Interaction, 332–43. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11677482_29.

8. Górriz, J. M., C. G. Puntonet, J. Ramírez, and J. C. Segura. "Bispectrum Estimators for Voice Activity Detection and Speech Recognition". In Nonlinear Analyses and Algorithms for Speech Processing, 174–85. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11613107_15.

9. Macho, Dušan, Climent Nadeu, and Andrey Temko. "Robust Speech Activity Detection in Interactive Smart-Room Environments". In Machine Learning for Multimodal Interaction, 236–47. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11965152_21.

10. Honarmandi Shandiz, Amin, and László Tóth. "Voice Activity Detection for Ultrasound-Based Silent Speech Interfaces Using Convolutional Neural Networks". In Text, Speech, and Dialogue, 499–510. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-83527-9_43.


Conference papers on the topic "Speech Activity Detection (SAD)"

1. Abdulla, Waleed H., Zhou Guan, and Hou Chi Sou. "Noise robust speech activity detection". In 2009 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2009. http://dx.doi.org/10.1109/isspit.2009.5407509.

2. Matic, A., V. Osmani, and O. Mayora. "Speech activity detection using accelerometer". In 2012 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2012. http://dx.doi.org/10.1109/embc.2012.6346377.

3. Tsai, TJ, and Nelson Morgan. "Speech activity detection: An economics approach". In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6638987.

4. Khoury, Elie, and Matt Garland. "I-Vectors for speech activity detection". In Odyssey 2016. ISCA, 2016. http://dx.doi.org/10.21437/odyssey.2016-48.

5. K, Punnoose A. "New Features for Speech Activity Detection". In SMM19, Workshop on Speech, Music and Mind 2019. ISCA, 2019. http://dx.doi.org/10.21437/smm.2019-6.

6. Laskowski, Kornel, Qin Jin, and Tanja Schultz. "Crosscorrelation-based multispeaker speech activity detection". In Interspeech 2004. ISCA, 2004. http://dx.doi.org/10.21437/interspeech.2004-350.

7. Sarfjoo, Seyyed Saeed, Srikanth Madikeri, and Petr Motlicek. "Speech Activity Detection Based on Multilingual Speech Recognition System". In Interspeech 2021. ISCA, 2021. http://dx.doi.org/10.21437/interspeech.2021-1058.

8. Harsha, B. V. "A noise robust speech activity detection algorithm". In Proceedings of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing. IEEE, 2004. http://dx.doi.org/10.1109/isimp.2004.1434065.

9. Shahsavari, Sajad, Hossein Sameti, and Hossein Hadian. "Speech activity detection using deep neural networks". In 2017 Iranian Conference on Electrical Engineering (ICEE). IEEE, 2017. http://dx.doi.org/10.1109/iraniancee.2017.7985293.

10. Heese, Florian, Markus Niermann, and Peter Vary. "Speech-codebook based soft Voice Activity Detection". In ICASSP 2015 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. http://dx.doi.org/10.1109/icassp.2015.7178789.
