Academic literature on the topic 'Speech diarization'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Speech diarization.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Speech diarization"

1

Mertens, Robert, Po-Sen Huang, Luke Gottlieb, Gerald Friedland, Ajay Divakaran, and Mark Hasegawa-Johnson. "On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks." International Journal of Multimedia Data Engineering and Management 3, no. 3 (July 2012): 1–19. http://dx.doi.org/10.4018/jmdem.2012070101.

Abstract:
A video’s soundtrack is usually highly correlated to its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as “engine sounds” or “outdoor/indoor sounds.” These approaches come with three major drawbacks: manual definitions do not scale as they are highly domain-dependent, manual definitions are highly subjective with respect to annotators, and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question “who spoke when?” by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned in order to better reflect the acoustic properties of general sounds as opposed to speech and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
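The core idea here, clustering whatever sounds alike rather than only speech, is easy to prototype. Below is a minimal sketch assuming librosa and scikit-learn: the soundtrack is cut into fixed 2-second segments, each summarized by MFCC statistics, and the segments are grouped by agglomerative clustering. The file name, segment length, and cluster count are illustrative assumptions, not values from the paper.

```python
import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering

y, sr = librosa.load("soundtrack.wav", sr=16000)  # hypothetical input file
seg_len = 2 * sr  # 2-second segments
segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len, seg_len)]

def describe(seg):
    # Mean and standard deviation of 13 MFCCs as a crude acoustic summary
    m = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

feats = np.array([describe(s) for s in segments])
labels = AgglomerativeClustering(n_clusters=8).fit_predict(feats)
for i, lab in enumerate(labels):
    print(f"{2 * i:5d}s-{2 * i + 2:5d}s  sound cluster {lab}")
```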
2

Astapov, Sergei, Aleksei Gusev, Marina Volkova, Aleksei Logunov, Valeriia Zaluskaia, Vlada Kapranova, Elena Timofeeva, Elena Evseeva, Vladimir Kabarov, and Yuri Matveev. "Application of Fusion of Various Spontaneous Speech Analytics Methods for Improving Far-Field Neural-Based Diarization." Mathematics 9, no. 23 (November 23, 2021): 2998. http://dx.doi.org/10.3390/math9232998.

Abstract:
Recently developed methods in spontaneous speech analytics require the use of speaker separation based on audio data, referred to as diarization. It is applied to widespread use cases, such as meeting transcription based on recordings from distant microphones and the extraction of the target speaker’s voice profiles from noisy audio. However, speech recognition and analysis can be hindered by background and point-source noise, overlapping speech, and reverberation, which all affect diarization quality in conjunction with each other. To compensate for the impact of these factors, there are a variety of supportive speech analytics methods, such as quality assessments in terms of SNR and RT60 reverberation time metrics, overlapping speech detection, instant speaker number estimation, etc. The improvements in speaker verification methods have benefits in the area of speaker separation as well. This paper introduces several approaches aimed at improving diarization system quality. The presented experimental results demonstrate the possibility of refining initial speaker labels from neural-based VAD data by means of fusion with labels from quality estimation models, overlapping speech detectors, and speaker number estimation models, which contain CNN and LSTM modules. Such fusion approaches allow us to significantly decrease DER values compared to standalone VAD methods. Cases of ideal VAD labeling are utilized to show the positive impact of ResNet-101 neural networks on diarization quality in comparison with basic x-vectors and ECAPA-TDNN architectures trained on 8 kHz data. Moreover, this paper highlights the advantage of spectral clustering over other clustering methods applied to diarization. The overall quality of diarization is improved at all stages of the pipeline, and the combination of various speech analytics methods makes a significant contribution to the improvement of diarization quality.
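The DER figures cited above follow the standard definition: missed speech, false-alarm speech, and speaker-confusion time, summed and divided by the total scored speech time. A minimal sketch with invented durations:

```python
def der(missed: float, false_alarm: float, confusion: float,
        total_speech: float) -> float:
    """Diarization Error Rate: missed speech, false-alarm speech, and
    speaker-confusion time, summed and divided by total scored speech."""
    return (missed + false_alarm + confusion) / total_speech

# Invented durations, in seconds, for a 10-minute recording
print(f"DER = {der(12.0, 8.5, 21.3, 600.0):.2%}")  # DER = 6.97%
```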
3

Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.

Abstract:
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI’s Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system’s ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
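A pipeline in the spirit of this paper can be sketched as follows: Whisper supplies time-stamped segments, and a speaker-embedding model assigns each segment to a speaker by cosine similarity against running speaker centroids. The whisper calls below match the openai-whisper package; embed_segment is a hypothetical placeholder to be replaced by a real extractor (e.g., an x-vector or ECAPA model), and the file name and threshold are assumptions.

```python
import numpy as np
import whisper  # openai-whisper

model = whisper.load_model("base")
result = model.transcribe("talkshow.wav")  # hypothetical input file

centroids = []  # one embedding per speaker seen so far

def embed_segment(path, start, end):
    # Placeholder: swap in a real speaker-embedding extractor here.
    rng = np.random.default_rng(int(start * 100))
    return rng.standard_normal(192)

def assign(emb, threshold=0.7):
    # Cosine similarity against known speakers; open a new speaker if none match.
    for spk, c in enumerate(centroids):
        sim = np.dot(emb, c) / (np.linalg.norm(emb) * np.linalg.norm(c))
        if sim > threshold:
            return spk
    centroids.append(emb)
    return len(centroids) - 1

for seg in result["segments"]:
    spk = assign(embed_segment("talkshow.wav", seg["start"], seg["end"]))
    print(f"[{seg['start']:7.1f}-{seg['end']:7.1f}] speaker {spk}: {seg['text']}")
```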
4

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION." Applied Mathematics and Sciences An International Journal (MathSJ) 10, no. 1/2 (June 26, 2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains an unknown amount of speech from an unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This paper presents a novel approach to speaker diarization, which defines loosely labeled data and employs x-vector embedding and a formalized approach for threshold searching with a given abstract similarity metric to cluster temporal segments into unique user segments. The proposed algorithm uses concepts of graph theory, matrix algebra, and genetic algorithms to formulate and solve the optimization problem. Additionally, the algorithm is applied to English, Spanish, and Chinese audio, and the performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The findings of this research have significant implications for speech processing and speaker identification, including languages with tonal differences. The proposed method offers a practical and efficient solution for speaker diarization in real-world scenarios where there are labeling time and cost constraints.
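The graph-theoretic formulation can be rendered as a toy: treat segments as nodes, connect pairs whose embedding similarity exceeds a threshold, and read speaker clusters off the connected components. In this sketch a simple grid sweep stands in for the paper's genetic algorithm, and random vectors stand in for x-vectors:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def cluster_at(sim: np.ndarray, thr: float) -> np.ndarray:
    # Edges where similarity exceeds the threshold; clusters = components
    adj = sim > thr
    np.fill_diagonal(adj, False)
    _, labels = connected_components(adj, directed=False)
    return labels

def search_threshold(sim: np.ndarray, n_speakers: int) -> float:
    # Pick the smallest threshold yielding the expected cluster count
    for thr in np.linspace(0.0, 1.0, 101):
        if len(set(cluster_at(sim, thr))) == n_speakers:
            return thr
    return 1.0

rng = np.random.default_rng(0)
emb = rng.standard_normal((20, 64))            # stand-in x-vectors
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                              # cosine similarity matrix
thr = search_threshold(sim, n_speakers=3)
print("threshold:", thr, "labels:", cluster_at(sim, thr))
```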
5

V, Sethuram, Ande Prasad, and R. Rajeswara Rao. "Metaheuristic adapted convolutional neural network for Telugu speaker diarization." Intelligent Decision Technologies 15, no. 4 (January 10, 2022): 561–77. http://dx.doi.org/10.3233/idt-211005.

Abstract:
Speaker diarization plays a pivotal role in speech technology. In general, speaker diarization is the mechanism of partitioning the input audio stream into homogeneous segments based on the identity of the speakers. Automatic transcription readability can be improved with speaker diarization, as it recognizes speaker turns in the audio stream and often provides the true speaker identity. In this research work, a novel speaker diarization approach is introduced under three major phases: Feature Extraction, Speech Activity Detection (SAD), and Speaker Segmentation and Clustering. Initially, Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the collected input audio stream (Telugu language). Subsequently, in Speech Activity Detection (SAD), the music and silence signals are removed. Then, the acquired speech signals are segmented for each individual speaker. Finally, the segmented signals are subjected to the speaker clustering process, where an Optimized Convolutional Neural Network (CNN) is used. To make the clustering more appropriate, the weights and activation function of the CNN are fine-tuned by a new Self Adaptive Sea Lion Algorithm (SA-SLnO). Finally, a comparative analysis is made to exhibit the superiority of the proposed speaker diarization work. Accordingly, the accuracy of the proposed method is 0.8073, which surpasses the existing works by 5.255%, 2.45%, and 0.075%, respectively.
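The first two stages of this pipeline, MFCC extraction and SAD, can be prototyped in a few lines with librosa. A plain energy threshold stands in for the paper's SAD here; the file name, frame settings, and threshold factor are illustrative assumptions:

```python
import librosa
import numpy as np

y, sr = librosa.load("telugu_speech.wav", sr=16000)  # hypothetical input

# 25 ms frames with 10 ms hop, matching common ASR front-end settings
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]

# Crude SAD: keep frames whose energy exceeds 1.5x the median energy
n = min(mfcc.shape[1], rms.shape[0])
speech_mask = rms[:n] > 1.5 * np.median(rms)
speech_mfcc = mfcc[:, :n][:, speech_mask]
print(f"kept {speech_mask.mean():.0%} of frames as speech; "
      f"feature matrix shape {speech_mfcc.shape}")
```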
6

Murali, Abhejay, Satwik Dutta, Meena Chandra Shekar, Dwight Irvin, Jay Buzhardt, and John H. Hansen. "Towards developing speaker diarization for parent-child interactions." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A61. http://dx.doi.org/10.1121/10.0015551.

Abstract:
Daily interactions of children with their parents are crucial for spoken language skills and overall development. Capturing such interactions can help to provide meaningful feedback to parents as well as practitioners. Naturalistic audio capture and the development of a speech processing pipeline for parent-child interactions is a challenging problem. One of the first important steps in the speech processing pipeline is Speaker Diarization—to identify who spoke when. Speaker Diarization is the method of separating a captured audio stream into analogous segments that are differentiated by the speaker’s (child’s or parent’s) identity. Following ongoing COVID-19 restrictions and human subjects research IRB protocols, an unsupervised data collection approach was formulated to collect parent-child interactions (of consented families) using the LENA device—a lightweight audio recorder. Different interaction scenarios were explored: a book reading activity at home and spontaneous interactions in a science museum. To distinguish a child’s speech from a parent’s, we train the diarization models on open-source adult speech data and children’s speech data acquired from the LDC (Linguistic Data Consortium). Various speaker embeddings (e.g., x-vectors, i-vectors, resnets) will be explored. Results will be reported using Diarization Error Rate. [Work sponsored by NSF via Grant Nos. 1918032 and 1918012.]
7

Taha, Thaer Mufeed, Zaineb Ben Messaoud, and Mondher Frikha. "Convolutional Neural Network Architectures for Gender, Emotional Detection from Speech and Speaker Diarization." International Journal of Interactive Mobile Technologies (iJIM) 18, no. 03 (February 9, 2024): 88–103. http://dx.doi.org/10.3991/ijim.v18i03.43013.

Abstract:
This paper introduces three system architectures for speaker identification that aim to overcome the limitations of diarization and voice-based biometric systems. Diarization systems utilize unsupervised algorithms to segment audio data based on the time boundaries of utterances, but they do not distinguish individual speakers. On the other hand, voice-based biometric systems can only identify individuals in recordings with a single speaker. Identifying speakers in recordings of natural conversations can be challenging, especially when emotional shifts can alter voice characteristics, making gender identification difficult. To address this issue, the proposed architectures include techniques for gender, emotion, and diarization at either the segment or group level. The evaluation of these architectures utilized two speech databases, namely VoxCeleb and RAVDESS (Ryerson audio-visual database of emotional speech and song) datasets. The findings reveal that the proposed approach outperforms the strategy level in terms of recognition results, despite the real-time processing advantage of the latter. The challenge of identifying multiple speakers engaging in a conversation while considering emotional changes that impact speech is effectively addressed by the proposed architectures. The data indicates that the gender and emotion classification of diarization achieves an accuracy of over 98 percent. These results suggest that the proposed speech-based approach can achieve highly accurate speaker identification.
8

Kothalkar, Prasanna V., John H. L. Hansen, Dwight Irvin, and Jay Buzhardt. "Child-adult speech diarization in naturalistic conditions of preschool classrooms using room-independent ResNet model and automatic speech recognition-based re-segmentation." Journal of the Acoustical Society of America 155, no. 2 (February 1, 2024): 1198–215. http://dx.doi.org/10.1121/10.0024353.

Abstract:
Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs controlled lab recordings to measure both the quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates alternate deep learning-based lightweight, knowledge-distilled diarization solutions for segmenting classroom interactions of 3–5 year old children with teachers. In this context, the focus is on speech-type diarization, which classifies speech segments as being either from adults or children, partitioned across multiple classrooms. Our lightest CNN model achieves a best F1-score of ∼76.0% on data from two classrooms, based on the dev and test sets of each classroom. It is utilized with automatic speech recognition-based re-segmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both the security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations.
9

Kshirod, Kshirod Sarmah. "Speaker Diarization with Deep Learning Techniques." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 11, no. 3 (December 15, 2020): 2570–82. http://dx.doi.org/10.61841/turcomat.v11i3.14309.

Abstract:
Speaker diarization is the task of identifying the speakers in an audio or video recording. Artificial intelligence (AI) fields have effectively used Deep Learning (DL) to solve a variety of real-world application challenges. With effective applications in a wide range of subdomains, such as natural language processing, image processing, computer vision, speech and speaker recognition, emotion recognition, cyber security, and many others, DL, an innovative field of Machine Learning (ML), is quickly emerging as the most potent machine learning technique. DL techniques have recently outperformed conventional approaches in speaker diarization as well as speaker recognition. Speaker diarization assigns classes to speech recordings that correspond to the speaker's identity, allowing one to determine who spoke when; it is a crucial step in speech processing that divides an audio recording into different speaker regions. This research paper presents an in-depth analysis of speaker diarization utilizing a variety of deep learning algorithms. NIST-2000 CALLHOME and our in-house database ALSD-DB are the two voice corpora used for this study's tests. TDNN-based embeddings with x-vectors, LSTM-based embeddings with d-vectors, and lastly a fusion of both x-vector and d-vector embeddings are used in the tests for the basic system. For the NIST-2000 CALLHOME database, LSTM-based embeddings with d-vectors and embeddings integrating both x-vector and d-vector exhibit improved performance with DER of 8.25% and 7.65%, respectively, and of 10.45% and 9.65% for the local ALSD-DB database.
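One common recipe for the x-vector/d-vector fusion evaluated here is to length-normalize each embedding and concatenate them, so that both contribute comparably to downstream cosine scoring. The paper's exact fusion rule is not spelled out in the abstract, so treat this as an assumption; the vectors are random stand-ins:

```python
import numpy as np

def fuse(x_vector: np.ndarray, d_vector: np.ndarray) -> np.ndarray:
    # Length-normalize each embedding, then concatenate
    x = x_vector / np.linalg.norm(x_vector)
    d = d_vector / np.linalg.norm(d_vector)
    return np.concatenate([x, d])

rng = np.random.default_rng(1)
xv, dv = rng.standard_normal(512), rng.standard_normal(256)  # stand-ins
fused = fuse(xv, dv)
print(fused.shape)  # (768,)
```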
10

Lleida, Eduardo, Alfonso Ortega, Antonio Miguel, Virginia Bazán-Gil, Carmen Pérez, Manuel Gómez, and Alberto de Prada. "Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media." Applied Sciences 9, no. 24 (December 11, 2019): 5412. http://dx.doi.org/10.3390/app9245412.

Abstract:
The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). The series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporacion Radio Television Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available for scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, and specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained.

Dissertations / Theses on the topic "Speech diarization"

1

Zelenák, Martin. "Detection and handling of overlapping speech for speaker diarization." Doctoral thesis, Universitat Politècnica de Catalunya, 2012. http://hdl.handle.net/10803/72431.

Abstract:
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also due to the presence of overlapping speech. Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we also investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps: first, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features. The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of the diarization algorithm. During the system development it was discovered that it is favorable to do an independent optimization of overlap exclusion and labeling with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT ’09 meeting recordings as well. The addition of beamforming and a TDOA feature stream into the baseline diarization system, which was aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings as well as between various settings of the overlap detection operating point. However, a high performance variability across different recordings is also typical of the baseline diarization system, without any overlap handling.
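The spatial cues used for overlap detection come from cross-correlating distant microphone channels. A minimal GCC-PHAT sketch on synthetic two-channel data is shown below; GCC-PHAT is a standard choice for such features, though the thesis's exact parameterization is not reproduced here:

```python
import numpy as np

def gcc_phat(sig_a: np.ndarray, sig_b: np.ndarray) -> np.ndarray:
    # Generalized cross-correlation with phase transform (PHAT) weighting
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12  # keep only phase information
    return np.fft.irfft(cross, n=n)

rng = np.random.default_rng(2)
ch1 = rng.standard_normal(1024)
ch2 = np.roll(ch1, 5) + 0.1 * rng.standard_normal(1024)  # delayed noisy copy
cc = gcc_phat(ch1, ch2)
lag = int(np.argmax(cc))
print("peak lag (samples):", lag if lag < 1024 else lag - 2048)  # -> -5
```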
2

Otterson, Scott. "Use of speaker location features in meeting diarization." Thesis, University of Washington, 2008. http://hdl.handle.net/1773/15463.

3

Peso, Pablo. "Spatial features of reverberant speech : estimation and application to recognition and diarization." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/45664.

Abstract:
Distant talking scenarios, such as hands-free calling or teleconference meetings, are essential for natural and comfortable human-machine interaction, and they are being increasingly used in multiple contexts. The acquired speech signal in such scenarios is reverberant and affected by additive noise. This signal distortion degrades the performance of speech recognition and diarization systems, creating troublesome human-machine interactions. This thesis proposes a method to non-intrusively estimate room acoustic parameters, paying special attention to a room acoustic parameter highly correlated with speech recognition degradation: the clarity index. In addition, a method to provide information regarding the estimation accuracy is proposed. An analysis of phoneme recognition performance for multiple reverberant environments is presented, from which a confusability metric for each phoneme is derived. This confusability metric is then employed to improve reverberant speech recognition performance. Additionally, room acoustic parameters can be used in speech recognition to provide robustness against reverberation. A method to exploit clarity index estimates in order to perform reverberant speech recognition is introduced. Finally, room acoustic parameters can also be used to diarize reverberant speech. A room acoustic parameter is proposed as an additional source of information for single-channel diarization purposes in reverberant environments. In multi-channel environments, the time delay of arrival is a feature commonly used to diarize the input speech; however, the computation of this feature is affected by reverberation. A method is presented to model the time delay of arrival in a robust manner so that speaker diarization is performed more accurately.
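The clarity index (typically C50) compares the energy arriving within the first 50 ms after the direct path to the energy arriving later, in dB. The sketch below computes the textbook definition on a synthetic impulse response; note the thesis estimates the parameter non-intrusively, i.e., without access to the impulse response itself:

```python
import numpy as np

def clarity_index(rir: np.ndarray, fs: int, early_ms: float = 50.0) -> float:
    # C50 = 10 log10(early energy / late energy) around the direct path
    onset = int(np.argmax(np.abs(rir)))            # direct-path arrival
    split = onset + int(fs * early_ms / 1000.0)
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / late)

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(3)
rir = rng.standard_normal(fs) * np.exp(-t / 0.1)   # decaying noise tail
rir[0] = 5.0                                       # strong direct path
print(f"C50 = {clarity_index(rir, fs):.1f} dB")
```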
4

Sinclair, Mark. "Speech segmentation and speaker diarisation for transcription and translation." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20970.

Abstract:
This dissertation outlines work related to Speech Segmentation – segmenting an audio recording into regions of speech and non-speech, and Speaker Diarization – further segmenting those regions into those pertaining to homogeneous speakers. Knowing not only what was said but also who said it and when, has many useful applications. As well as providing a richer level of transcription for speech, we will show how such knowledge can improve Automatic Speech Recognition (ASR) system performance and can also benefit downstream Natural Language Processing (NLP) tasks such as machine translation and punctuation restoration. While segmentation and diarization may appear to be relatively simple tasks to describe, in practice we find that they are very challenging and are, in general, ill-defined problems. Therefore, we first provide a formalisation of each of the problems as the sub-division of speech within acoustic space and time. Here, we see that the task can become very difficult when we want to partition this domain into our target classes of speakers, whilst avoiding other classes that reside in the same space, such as phonemes. We present a theoretical framework for describing and discussing the tasks as well as introducing existing state-of-the-art methods and research. Current Speaker Diarization systems are notoriously sensitive to hyper-parameters and lack robustness across datasets. Therefore, we present a method which uses a series of oracle experiments to expose the limitations of current systems and to which system components these limitations can be attributed. We also demonstrate how Diarization Error Rate (DER), the dominant error metric in the literature, is not a comprehensive or reliable indicator of overall performance or of error propagation to subsequent downstream tasks. These results inform our subsequent research. We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation is a crucial first step in the system chain. Current methods typically do not account for the inherent structure of spoken discourse. As such, we explored a novel method which exploits an utterance-duration prior in order to better model the segment distribution of speech. We show how this method improves not only segmentation, but also the performance of subsequent speech recognition, machine translation and speaker diarization systems. Typical ASR transcriptions do not include punctuation and the task of enriching transcriptions with this information is known as ‘punctuation restoration’. The benefit is not only improved readability but also better compatibility with NLP systems that expect sentence-like units such as in conventional machine translation. We show how segmentation and diarization are related tasks that are able to contribute acoustic information that complements existing linguistically-based punctuation approaches. There is a growing demand for speech technology applications in the broadcast media domain. This domain presents many new challenges including diverse noise and recording conditions. We show that the capacity of existing GMM-HMM based speech segmentation systems is limited for such scenarios and present a Deep Neural Network (DNN) based method which offers a more robust speech segmentation method resulting in improved speech recognition performance for a television broadcast dataset.
Ultimately, we are able to show that speech segmentation is an inherently ill-defined problem for which the solution is highly dependent on the downstream task that it is intended for.
5

Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123834.

6

Yin, Ruiqing. "Steps towards end-to-end neural speaker diarization." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS261/document.

Abstract:
Speaker diarization is the task of determining "who speaks when" in an audio stream that usually contains an unknown amount of speech from an unknown number of speakers. Speaker diarization systems are usually built as the combination of four main stages. First, non-speech regions such as silence, music, and noise are removed by Voice Activity Detection (VAD). Next, speech regions are split into speaker-homogeneous segments by Speaker Change Detection (SCD), later grouped according to the identity of the speaker thanks to unsupervised clustering approaches. Finally, speech turn boundaries and labels are (optionally) refined with a re-segmentation stage. In this thesis, we propose to address these four stages with neural network approaches. We first formulate both the initial segmentation (voice activity detection and speaker change detection) and the final re-segmentation as a set of sequence labeling problems and then address them with Bidirectional Long Short-Term Memory (Bi-LSTM) networks. In the speech turn clustering stage, we propose to use affinity propagation on top of neural speaker embeddings. Experiments on a broadcast TV dataset show that affinity propagation clustering is more suitable than hierarchical agglomerative clustering when applied to neural speaker embeddings. The LSTM-based segmentation and affinity propagation clustering are also combined and jointly optimized to form a speaker diarization pipeline. Compared to the pipeline with independently optimized modules, the new pipeline brings a significant improvement. In addition, we propose to improve the similarity matrix by bidirectional LSTM and then apply spectral clustering on top of the improved similarity matrix. The proposed system achieves state-of-the-art performance in the CALLHOME telephone conversation dataset. Finally, we formulate sequential clustering as a supervised sequence labeling task and address it with stacked RNNs. To better understand its behavior, the analysis is based on a proposed encoder-decoder architecture. Our proposed systems bring a significant improvement compared with traditional clustering methods on toy examples.
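The clustering stage can be reproduced in miniature with scikit-learn's AffinityPropagation which, unlike agglomerative clustering, does not need the number of speakers in advance. The 'precomputed' affinity option lets a cosine similarity matrix be passed directly; the embeddings below are random stand-ins for neural speaker embeddings:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(4)
# 30 segment embeddings drawn around 3 hidden speaker centroids
centers = rng.standard_normal((3, 128))
emb = np.vstack([c + 0.3 * rng.standard_normal((10, 128)) for c in centers])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T  # cosine similarity used as the affinity
labels = AffinityPropagation(affinity="precomputed",
                             random_state=0).fit_predict(sim)
print("estimated speakers:", len(set(labels)))
```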
7

Cui, Can. "Séparation, diarisation et reconnaissance de la parole conjointes pour la transcription automatique de réunions." Electronic Thesis or Diss., Université de Lorraine, 2024. http://www.theses.fr/2024LORR0103.

Abstract:
Far-field microphone-array meeting transcription is particularly challenging due to overlapping speech, ambient noise, and reverberation. To address these issues, we explored three approaches. First, we employ a multichannel speaker separation model to isolate individual speakers, followed by a single-channel, single-speaker automatic speech recognition (ASR) model to transcribe the separated and enhanced audio. This method effectively enhances speech quality for ASR. Second, we propose an end-to-end multichannel speaker-attributed ASR (MC-SA-ASR) model, which builds on an existing single-channel SA-ASR model and incorporates a multichannel Conformer-based encoder with multi-frame cross-channel attention (MFCCA). Unlike traditional approaches that require a multichannel front-end speech enhancement model, the MC-SA-ASR model handles far-field microphones in an end-to-end manner. We also experimented with different input features, including Mel filterbank and phase features, for that model. Lastly, we incorporate a multichannel beamforming and enhancement model as a front-end processing step, followed by a single-channel SA-ASR model to process the enhanced multi-speaker speech signals. We tested different fixed, hybrid, and fully neural network-based beamformers and proposed to jointly optimize the neural beamformer and SA-ASR models using the training objective for the latter. In addition to these methods, we developed a meeting transcription pipeline that integrates voice activity detection, speaker diarization, and SA-ASR to process real meeting recordings effectively. Experimental results indicate that, while using a speaker separation model can enhance speech quality, separation errors can propagate to ASR, resulting in suboptimal performance. A guided speaker separation approach proves to be more effective. Our proposed MC-SA-ASR model demonstrates efficiency in integrating multichannel information and the shared information between the ASR and speaker blocks. Experiments with different input features reveal that models trained with Mel filterbank features perform better in terms of word error rate (WER) and speaker error rate (SER) when the number of channels and speakers is low (2 channels with 1 or 2 speakers). However, for settings with 3 or 4 channels and 3 speakers, models trained with additional phase information outperform those using only Mel filterbank features. This suggests that phase information can enhance ASR by leveraging localization information from multiple channels. Although MFCCA-based MC-SA-ASR outperforms the single-channel SA-ASR and MC-ASR models without a speaker block, the joint beamforming and SA-ASR model further improves the performance. Specifically, joint training of the neural beamformer and SA-ASR yields the best performance, indicating that improving speech quality might be a more direct and efficient approach than using an end-to-end MC-SA-ASR model for multichannel meeting transcription. Furthermore, the study of the real meeting transcription pipeline underscores the potential for better end-to-end models. In our investigation on improving speaker assignment in SA-ASR, we found that the speaker block does not effectively help improve the ASR performance. This highlights the need for improved architectures that more effectively integrate ASR and speaker information.
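Delay-and-sum is the simplest of the fixed beamformers compared in this work: each channel is shifted by its time delay of arrival (TDOA) toward a reference microphone, then the channels are averaged so that coherent speech adds up while noise partially cancels. The sketch below assumes integer-sample TDOAs that are already known; real front-ends estimate them, e.g., with GCC-PHAT:

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: list) -> np.ndarray:
    # channels: (n_mics, n_samples); delays: per-channel shift in samples
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(5)
clean = rng.standard_normal(16000)
mics = np.stack([np.roll(clean, d) + 0.5 * rng.standard_normal(16000)
                 for d in (0, 3, 7, 12)])          # 4 mics, varying delays
out = delay_and_sum(mics, delays=[0, 3, 7, 12])    # TDOAs assumed known
print("noise power per mic ~0.25; residual after beamforming:",
      round(float(np.var(out - clean)), 3))        # ~0.25 / 4
```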
8

Mariotte, Théo. "Traitement automatique de la parole en réunion par dissémination de capteurs." Electronic Thesis or Diss., Le Mans, 2024. http://www.theses.fr/2024LEMA1001.

Abstract:
This thesis work focuses on automatic speech processing, and more specifically on speaker diarization. This task requires the signal to be segmented to identify events such as voice activity, overlapped speech, or speaker changes. This work tackles the scenario where the signal is recorded by a device located in the center of a group of speakers, as in meetings. These conditions lead to a degradation in signal quality due to the distance between the device and the speakers (distant speech). To mitigate this degradation, one approach is to record the signal using a microphone array. The resulting multichannel signal provides information on the spatial distribution of the acoustic field. Two lines of research are being explored for speech segmentation using microphone arrays. The first introduces a method combining acoustic features with spatial features. We propose a new set of features based on the circular harmonics expansion. This approach improves segmentation performance under distant speech conditions while reducing the number of model parameters and improving robustness in case of change in the array geometry. The second proposes several approaches that combine channels using self-attention. Different models, inspired by an existing architecture, are developed. Combining channels also improves segmentation under distant speech conditions. Two of these approaches make feature extraction more interpretable. The proposed distant speech segmentation systems also improve speaker diarization. Channel combination shows poor robustness to changes in the array geometry during inference. To avoid this behavior, a learning procedure is proposed, which improves the robustness in case of array mismatch. Finally, we identified a gap in the public datasets available for distant multichannel automatic speech processing. An acquisition protocol is introduced to build a new dataset, integrating speaker position annotation in addition to speaker diarization. Thus, this work aims to improve the quality of multichannel distant speech segmentation. The proposed methods exploit the spatial information provided by microphone arrays while improving the robustness in case of array mismatch.
9

Hsu, Wu-Hua, and 許吳華. "A Preliminary Study on Speaker Diarization for Automatic Transcription of Broadcast Radio Speech." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/a3z9vr.

Abstract:
Master's thesis, National Taipei University of Technology, Department of Electronic Engineering (ROC academic year 106).
We use a Time-Delay Neural Network (TDNN) for speaker diarization. The average DER is 27.74%, which is better than the 31.08% of the GMM baseline. We use the trained automatic speaker diarization system to label the unmarked speakers in the NER-210 corpus and retrain the ASR system with the resulting speaker-timeline annotations. The experimental results show that, with speaker information provided by the speaker diarization system, the ASR system can reduce the original CER from 20.01% to 19.13%. In addition, the average CER of the basic LSTM model on the automatic speech recognition system is 17.2%; using the multi-layer serial neural network CNN-TDNN-LSTM model, the average CER can be reduced to 13.12%. Then, by using confidence-measure-based data selection and adding more word sequences to the language model to increase the recognition rate, the average CER can be reduced to 9.2%.
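The CER quoted throughout this abstract is the character-level Levenshtein distance between hypothesis and reference transcripts, divided by the reference length. A self-contained sketch of that standard definition:

```python
def cer(reference: str, hypothesis: str) -> float:
    # Levenshtein distance via dynamic programming, divided by |reference|
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

print(f"CER = {cer('speaker diarization', 'speeker diarizasion'):.2%}")
```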

Book chapters on the topic "Speech diarization"

1

Avdeeva, Anastasia, and Sergey Novoselov. "Deep Speaker Embeddings Based Online Diarization." In Speech and Computer, 24–32. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-20980-2_3.

2

Zajíc, Zbyněk, Josef V. Psutka, and Luděk Müller. "Diarization Based on Identification with X-Vectors." In Speech and Computer, 667–78. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-60276-5_64.

3

Edwards, Erik, Michael Brenndoerfer, Amanda Robinson, Najmeh Sadoughi, Greg P. Finley, Maxim Korenevsky, Nico Axtmann, Mark Miller, and David Suendermann-Oeft. "A Free Synthetic Corpus for Speaker Diarization Research." In Speech and Computer, 113–22. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-99579-3_13.

4

Zajíc, Zbyněk, Josef V. Psutka, Lucie Zajícová, Luděk Müller, and Petr Salajka. "Diarization of the Language Consulting Center Telephone Calls." In Speech and Computer, 549–58. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-26061-3_56.

5

Edwards, Erik, Amanda Robinson, Najmeh Sadoughi, Greg P. Finley, Maxim Korenevsky, Michael Brenndoerfer, Nico Axtmann, Mark Miller, and David Suendermann-Oeft. "Speaker Diarization: A Top-Down Approach Using Syllabic Phonology." In Speech and Computer, 123–33. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-99579-3_14.

6

Kudashev, Oleg, and Alexander Kozlov. "The Diarization System for an Unknown Number of Speakers." In Speech and Computer, 340–44. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-319-01931-4_45.

7

Nguyen, Trung Hieu, Eng Siong Chng, and Haizhou Li. "Speaker Diarization: An Emerging Research." In Speech and Audio Processing for Coding, Enhancement and Recognition, 229–77. New York, NY: Springer New York, 2014. http://dx.doi.org/10.1007/978-1-4939-1456-2_8.

8

Kynych, Frantisek, Jindrich Zdansky, Petr Cerva, and Lukas Mateju. "Online Speaker Diarization Using Optimized SE-ResNet Architecture." In Text, Speech, and Dialogue, 176–87. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-40498-6_16.

9

Zajíc, Zbyněk, Jan Zelinka, and Luděk Müller. "Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech." In Speech and Computer, 555–63. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-66429-3_55.

10

Kunešová, Marie, Marek Hrúz, Zbyněk Zajíc, and Vlasta Radová. "Detection of Overlapping Speech for the Purposes of Speaker Diarization." In Speech and Computer, 247–57. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-26061-3_26.


Conference papers on the topic "Speech diarization"

1

Von Neumann, Thilo, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, and Reinhold Haeb-Umbach. "Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization." In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 775–79. IEEE, 2024. http://dx.doi.org/10.1109/icasspw62465.2024.10625894.

2

Lamel, Lori, Jean-Luc Gauvain, and Leonardo Canseco-Rodriguez. "Speaker diarization from speech transcripts." In Interspeech 2004. ISCA: ISCA, 2004. http://dx.doi.org/10.21437/interspeech.2004-250.

3

Bounazou, Hadjer, Nassim Asbai, and Sihem Zitouni. "Speaker Diarization in Overlapped Speech." In 2022 19th International Multi-Conference on Systems, Signals & Devices (SSD). IEEE, 2022. http://dx.doi.org/10.1109/ssd54932.2022.9955684.

4

Jiang, Yidi, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, and Haizhou Li. "Prompt-Driven Target Speech Diarization." In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024. http://dx.doi.org/10.1109/icassp48485.2024.10446072.

5

Xie, Jiamin, Leibny Paola García-Perera, Daniel Povey, and Sanjeev Khudanpur. "Multi-PLDA Diarization on Children’s Speech." In Interspeech 2019. ISCA: ISCA, 2019. http://dx.doi.org/10.21437/interspeech.2019-2961.

6

Gebre, Binyam Gebrekidan, Peter Wittenburg, Sebastian Drude, Marijn Huijbregts, and Tom Heskes. "Speaker diarization using gesture and speech." In Interspeech 2014. ISCA: ISCA, 2014. http://dx.doi.org/10.21437/interspeech.2014-141.

7

Lupu, Eugen, Anca Apatean, and Radu Arsinte. "Speaker diarization experiments for Romanian parliamentary speech." In 2015 International Symposium on Signals, Circuits and Systems (ISSCS). IEEE, 2015. http://dx.doi.org/10.1109/isscs.2015.7204023.

8

Lyu, Dau-Cheng, Eng-Siong Chng, and Haizhou Li. "Language diarization for code-switch conversational speech." In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6639083.

9

Imseng, David, and Gerald Friedland. "Robust Speaker Diarization for short speech recordings." In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU). IEEE, 2009. http://dx.doi.org/10.1109/asru.2009.5373254.

10

Wang, Yingzhi, Mirco Ravanelli, and Alya Yacoubi. "Speech Emotion Diarization: Which Emotion Appears When?" In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023. http://dx.doi.org/10.1109/asru57964.2023.10389790.


Reports on the topic "Speech diarization"

1

Hansen, John H. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment. Fort Belvoir, VA: Defense Technical Information Center, October 2015. http://dx.doi.org/10.21236/ada623029.

