Academic literature on the topic 'Speech diarization'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Speech diarization.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.
Journal articles on the topic "Speech diarization"
Mertens, Robert, Po-Sen Huang, Luke Gottlieb, Gerald Friedland, Ajay Divakaran, and Mark Hasegawa-Johnson. "On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks." International Journal of Multimedia Data Engineering and Management 3, no. 3 (July 2012): 1–19. http://dx.doi.org/10.4018/jmdem.2012070101.
Astapov, Sergei, Aleksei Gusev, Marina Volkova, Aleksei Logunov, Valeriia Zaluskaia, Vlada Kapranova, Elena Timofeeva, Elena Evseeva, Vladimir Kabarov, and Yuri Matveev. "Application of Fusion of Various Spontaneous Speech Analytics Methods for Improving Far-Field Neural-Based Diarization." Mathematics 9, no. 23 (November 23, 2021): 2998. http://dx.doi.org/10.3390/math9232998.
Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-Time Multilingual Speech Recognition and Speaker Diarization System Based on Whisper Segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.
Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation." Applied Mathematics and Sciences: An International Journal (MathSJ) 10, no. 1/2 (June 26, 2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.
V, Sethuram, Ande Prasad, and R. Rajeswara Rao. "Metaheuristic Adapted Convolutional Neural Network for Telugu Speaker Diarization." Intelligent Decision Technologies 15, no. 4 (January 10, 2022): 561–77. http://dx.doi.org/10.3233/idt-211005.
Murali, Abhejay, Satwik Dutta, Meena Chandra Shekar, Dwight Irvin, Jay Buzhardt, and John H. Hansen. "Towards Developing Speaker Diarization for Parent-Child Interactions." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A61. http://dx.doi.org/10.1121/10.0015551.
Taha, Thaer Mufeed, Zaineb Ben Messaoud, and Mondher Frikha. "Convolutional Neural Network Architectures for Gender, Emotional Detection from Speech and Speaker Diarization." International Journal of Interactive Mobile Technologies (iJIM) 18, no. 03 (February 9, 2024): 88–103. http://dx.doi.org/10.3991/ijim.v18i03.43013.
Kothalkar, Prasanna V., John H. L. Hansen, Dwight Irvin, and Jay Buzhardt. "Child-Adult Speech Diarization in Naturalistic Conditions of Preschool Classrooms Using Room-Independent ResNet Model and Automatic Speech Recognition-Based Re-Segmentation." Journal of the Acoustical Society of America 155, no. 2 (February 1, 2024): 1198–215. http://dx.doi.org/10.1121/10.0024353.
Sarmah, Kshirod. "Speaker Diarization with Deep Learning Techniques." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 11, no. 3 (December 15, 2020): 2570–82. http://dx.doi.org/10.61841/turcomat.v11i3.14309.
Lleida, Eduardo, Alfonso Ortega, Antonio Miguel, Virginia Bazán-Gil, Carmen Pérez, Manuel Gómez, and Alberto de Prada. "Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media." Applied Sciences 9, no. 24 (December 11, 2019): 5412. http://dx.doi.org/10.3390/app9245412.
Full textDissertations / Theses on the topic "Speech diarization"
Zelenák, Martin. "Detection and handling of overlapping speech for speaker diarization." Doctoral thesis, Universitat Politècnica de Catalunya, 2012. http://hdl.handle.net/10803/72431.
Otterson, Scott. "Use of Speaker Location Features in Meeting Diarization." Thesis, University of Washington, 2008. http://hdl.handle.net/1773/15463.
Peso, Pablo. "Spatial Features of Reverberant Speech: Estimation and Application to Recognition and Diarization." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/45664.
Sinclair, Mark. "Speech Segmentation and Speaker Diarisation for Transcription and Translation." Thesis, University of Edinburgh, 2016. http://hdl.handle.net/1842/20970.
Ishizuka, Kentaro. "Studies on Acoustic Features for Automatic Speech Recognition and Speaker Diarization in Real Environments." Kyoto University, 2009. http://hdl.handle.net/2433/123834.
Yin, Ruiqing. "Steps Towards End-to-End Neural Speaker Diarization." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS261/document.
Full textSpeaker diarization is the task of determining "who speaks when" in an audio stream that usually contains an unknown amount of speech from an unknown number of speakers. Speaker diarization systems are usually built as the combination of four main stages. First, non-speech regions such as silence, music, and noise are removed by Voice Activity Detection (VAD). Next, speech regions are split into speaker-homogeneous segments by Speaker Change Detection (SCD), later grouped according to the identity of the speaker thanks to unsupervised clustering approaches. Finally, speech turn boundaries and labels are (optionally) refined with a re-segmentation stage. In this thesis, we propose to address these four stages with neural network approaches. We first formulate both the initial segmentation (voice activity detection and speaker change detection) and the final re-segmentation as a set of sequence labeling problems and then address them with Bidirectional Long Short-Term Memory (Bi-LSTM) networks. In the speech turn clustering stage, we propose to use affinity propagation on top of neural speaker embeddings. Experiments on a broadcast TV dataset show that affinity propagation clustering is more suitable than hierarchical agglomerative clustering when applied to neural speaker embeddings. The LSTM-based segmentation and affinity propagation clustering are also combined and jointly optimized to form a speaker diarization pipeline. Compared to the pipeline with independently optimized modules, the new pipeline brings a significant improvement. In addition, we propose to improve the similarity matrix by bidirectional LSTM and then apply spectral clustering on top of the improved similarity matrix. The proposed system achieves state-of-the-art performance in the CALLHOME telephone conversation dataset. Finally, we formulate sequential clustering as a supervised sequence labeling task and address it with stacked RNNs. To better understand its behavior, the analysis is based on a proposed encoder-decoder architecture. Our proposed systems bring a significant improvement compared with traditional clustering methods on toy examples
Cui, Can. "Séparation, diarisation et reconnaissance de la parole conjointes pour la transcription automatique de réunions." Electronic Thesis or Diss., Université de Lorraine, 2024. http://www.theses.fr/2024LORR0103.
Far-field microphone-array meeting transcription is particularly challenging due to overlapping speech, ambient noise, and reverberation. To address these issues, we explored three approaches. First, we employ a multichannel speaker separation model to isolate individual speakers, followed by a single-channel, single-speaker automatic speech recognition (ASR) model to transcribe the separated and enhanced audio. This method effectively enhances speech quality for ASR. Second, we propose an end-to-end multichannel speaker-attributed ASR (MC-SA-ASR) model, which builds on an existing single-channel SA-ASR model and incorporates a multichannel Conformer-based encoder with multi-frame cross-channel attention (MFCCA). Unlike traditional approaches that require a multichannel front-end speech enhancement model, the MC-SA-ASR model handles far-field microphones in an end-to-end manner. We also experimented with different input features for that model, including Mel filterbank and phase features. Lastly, we incorporate a multichannel beamforming and enhancement model as a front-end processing step, followed by a single-channel SA-ASR model to process the enhanced multi-speaker speech signals. We tested different fixed, hybrid, and fully neural-network-based beamformers and proposed to jointly optimize the neural beamformer and SA-ASR models using the training objective of the latter. In addition to these methods, we developed a meeting transcription pipeline that integrates voice activity detection, speaker diarization, and SA-ASR to process real meeting recordings effectively. Experimental results indicate that, while using a speaker separation model can enhance speech quality, separation errors can propagate to ASR, resulting in suboptimal performance; a guided speaker separation approach proves more effective. Our proposed MC-SA-ASR model demonstrates efficiency in integrating multichannel information and the information shared between the ASR and speaker blocks. Experiments with different input features reveal that models trained with Mel filterbank features perform better in terms of word error rate (WER) and speaker error rate (SER) when the number of channels and speakers is low (2 channels with 1 or 2 speakers). However, for settings with 3 or 4 channels and 3 speakers, models trained with additional phase information outperform those using only Mel filterbank features, suggesting that phase information can enhance ASR by leveraging localization information from multiple channels. Although MFCCA-based MC-SA-ASR outperforms the single-channel SA-ASR and MC-ASR models without a speaker block, the joint beamforming and SA-ASR model further improves the performance. Specifically, joint training of the neural beamformer and SA-ASR yields the best performance, indicating that improving speech quality may be a more direct and efficient approach than an end-to-end MC-SA-ASR model for multichannel meeting transcription. Furthermore, the study of the real meeting transcription pipeline underscores the potential for better end-to-end models. In our investigation into improving speaker assignment in SA-ASR, we found that the speaker block does not effectively help improve ASR performance, which highlights the need for improved architectures that integrate ASR and speaker information more effectively.
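For orientation, the simplest member of the "fixed" beamformer family mentioned above is delay-and-sum. The sketch below is a toy illustration in Python, assuming invented microphone delays and a synthetic signal; it is not the thesis's actual front-end:

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, delays_s: np.ndarray, fs: int) -> np.ndarray:
    """Advance each channel by its arrival delay and average the channels.

    signals:  (n_channels, n_samples) multichannel waveform
    delays_s: (n_channels,) per-channel arrival delays in seconds
    """
    aligned = [np.roll(sig, -int(round(d * fs)))  # wrap-around: toy shortcut
               for sig, d in zip(signals, delays_s)]
    # Averaging time-aligned channels reinforces the target source while
    # uncorrelated noise partially cancels out.
    return np.mean(aligned, axis=0)

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
# Simulate one 440 Hz source reaching 4 microphones with small delays.
delays = np.array([0.0, 1e-4, 2e-4, 3e-4])
clean = np.sin(2 * np.pi * 440 * t)
mics = np.stack([np.interp(t - d, t, clean) + 0.3 * rng.standard_normal(fs)
                 for d in delays])

enhanced = delay_and_sum(mics, delays, fs)  # noticeably less noisy than mics[0]
```

A fixed beamformer like this needs the steering delays as input; the hybrid and fully neural beamformers compared in the thesis instead estimate their spatial filtering from data, which is what makes joint optimization with the SA-ASR training objective possible.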
Mariotte, Théo. "Traitement automatique de la parole en réunion par dissémination de capteurs." Electronic Thesis or Diss., Le Mans, 2024. http://www.theses.fr/2024LEMA1001.
This thesis work focuses on automatic speech processing, and more specifically on speaker diarization. This task requires the signal to be segmented to identify events such as voice activity, overlapped speech, or speaker changes. This work tackles the scenario where the signal is recorded by a device located in the center of a group of speakers, as in meetings. These conditions lead to a degradation in signal quality due to the distance between the speakers and the device (distant speech). To mitigate this degradation, one approach is to record the signal using a microphone array. The resulting multichannel signal provides information on the spatial distribution of the acoustic field. Two lines of research are explored for speech segmentation using microphone arrays. The first introduces a method combining acoustic features with spatial features. We propose a new set of features based on the circular harmonics expansion. This approach improves segmentation performance under distant speech conditions while reducing the number of model parameters and improving robustness to changes in the array geometry. The second proposes several approaches that combine channels using self-attention. Different models, inspired by an existing architecture, are developed. Combining channels also improves segmentation under distant speech conditions, and two of these approaches make feature extraction more interpretable. The proposed distant speech segmentation systems also improve speaker diarization. Channel combination, however, shows poor robustness to changes in the array geometry at inference time. To avoid this behavior, a learning procedure is proposed that improves robustness in case of array mismatch. Finally, we identified a gap in the public datasets available for distant multichannel automatic speech processing; an acquisition protocol is introduced to build a new dataset that includes speaker position annotations in addition to speaker diarization. This work thus aims to improve the quality of multichannel distant speech segmentation, with methods that exploit the spatial information provided by microphone arrays while improving robustness in case of array mismatch.
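To illustrate the channel-combination idea, here is a minimal PyTorch sketch in which per-channel feature vectors attend to one another and are then pooled into a single stream; the module name, dimensions, and mean pooling are assumptions for illustration, not the architecture developed in the thesis:

```python
import torch
import torch.nn as nn

class ChannelCombiner(nn.Module):
    """Self-attention over microphone channels, then mean pooling."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (frames, n_channels, feat_dim) -- channels play the role of the
        # "sequence" axis, so attention mixes information across microphones.
        attended, _ = self.attn(x, x, x)
        return attended.mean(dim=1)  # collapse channels into one stream

# 8 frames from 4 microphones, 64-dim features per channel.
frames = torch.randn(8, 4, 64)
combined = ChannelCombiner()(frames)  # -> (8, 64), one vector per frame
```

Because the attention weights are computed across microphones, sketches of this kind also hint at why such models can be inspected channel by channel, in line with the interpretability claim above.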
Hsu, Wu-Hua [許吳華]. "A Preliminary Study on Speaker Diarization for Automatic Transcription of Broadcast Radio Speech." Thesis, National Taipei University of Technology, Department of Electronic Engineering, 2018. http://ndltd.ncl.edu.tw/handle/a3z9vr.
We use a Time-Delay Neural Network (TDNN) for speaker diarization. The average DER is 27.74%, better than the 31.08% obtained with a GMM. We then use the trained speaker diarization system to label the speech of unannotated speakers in the NER-210 corpus and retrain the ASR system on the resulting speaker-attributed timeline. The experimental results show that the ASR system retrained with this speaker information reduces the CER from 20.01% to 19.13%. In addition, the average CER of the baseline LSTM model for automatic speech recognition is 17.2%; with a cascaded CNN-TDNN-LSTM model, the average CER drops to 13.12%. Finally, by applying confidence-measure-based data selection and adding more word sequences to the language model, the average CER can be reduced to 9.2%.
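For reference, the DER quoted above is the fraction of scored speech time that is missed, falsely detected, or attributed to the wrong speaker. The following frame-based Python sketch is a simplification, assuming fixed-length frames and an already-aligned speaker mapping; standard scoring tools additionally apply a forgiveness collar and an optimal reference-to-hypothesis speaker mapping:

```python
import numpy as np

def frame_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Frame-level DER; ref/hyp hold per-frame speaker ids, 0 = non-speech."""
    scored = ref != 0                               # reference speech frames
    miss = scored & (hyp == 0)                      # speech missed entirely
    false_alarm = (ref == 0) & (hyp != 0)           # speech hypothesized in silence
    confusion = scored & (hyp != 0) & (ref != hyp)  # wrong speaker label
    return (miss.sum() + false_alarm.sum() + confusion.sum()) / scored.sum()

ref = np.array([0, 1, 1, 1, 2, 2, 2, 0, 0, 1])  # toy reference annotation
hyp = np.array([0, 1, 1, 2, 2, 2, 0, 0, 1, 1])  # toy system output
print(f"DER = {frame_der(ref, hyp):.1%}")        # 3 errored frames / 7 scored
```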
Book chapters on the topic "Speech diarization"
Avdeeva, Anastasia, and Sergey Novoselov. "Deep Speaker Embeddings Based Online Diarization." In Speech and Computer, 24–32. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-20980-2_3.
Zajíc, Zbyněk, Josef V. Psutka, and Luděk Müller. "Diarization Based on Identification with X-Vectors." In Speech and Computer, 667–78. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-60276-5_64.
Edwards, Erik, Michael Brenndoerfer, Amanda Robinson, Najmeh Sadoughi, Greg P. Finley, Maxim Korenevsky, Nico Axtmann, Mark Miller, and David Suendermann-Oeft. "A Free Synthetic Corpus for Speaker Diarization Research." In Speech and Computer, 113–22. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-99579-3_13.
Zajíc, Zbyněk, Josef V. Psutka, Lucie Zajícová, Luděk Müller, and Petr Salajka. "Diarization of the Language Consulting Center Telephone Calls." In Speech and Computer, 549–58. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-26061-3_56.
Edwards, Erik, Amanda Robinson, Najmeh Sadoughi, Greg P. Finley, Maxim Korenevsky, Michael Brenndoerfer, Nico Axtmann, Mark Miller, and David Suendermann-Oeft. "Speaker Diarization: A Top-Down Approach Using Syllabic Phonology." In Speech and Computer, 123–33. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-99579-3_14.
Kudashev, Oleg, and Alexander Kozlov. "The Diarization System for an Unknown Number of Speakers." In Speech and Computer, 340–44. Cham: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-319-01931-4_45.
Nguyen, Trung Hieu, Eng Siong Chng, and Haizhou Li. "Speaker Diarization: An Emerging Research." In Speech and Audio Processing for Coding, Enhancement and Recognition, 229–77. New York, NY: Springer New York, 2014. http://dx.doi.org/10.1007/978-1-4939-1456-2_8.
Kynych, Frantisek, Jindrich Zdansky, Petr Cerva, and Lukas Mateju. "Online Speaker Diarization Using Optimized SE-ResNet Architecture." In Text, Speech, and Dialogue, 176–87. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-40498-6_16.
Zajíc, Zbyněk, Jan Zelinka, and Luděk Müller. "Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech." In Speech and Computer, 555–63. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-66429-3_55.
Kunešová, Marie, Marek Hrúz, Zbyněk Zajíc, and Vlasta Radová. "Detection of Overlapping Speech for the Purposes of Speaker Diarization." In Speech and Computer, 247–57. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-26061-3_26.
Full textConference papers on the topic "Speech diarization"
Von Neumann, Thilo, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, and Reinhold Haeb-Umbach. "Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization." In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 775–79. IEEE, 2024. http://dx.doi.org/10.1109/icasspw62465.2024.10625894.
Lamel, Lori, Jean-Luc Gauvain, and Leonardo Canseco-Rodriguez. "Speaker Diarization from Speech Transcripts." In Interspeech 2004. ISCA, 2004. http://dx.doi.org/10.21437/interspeech.2004-250.
Bounazou, Hadjer, Nassim Asbai, and Sihem Zitouni. "Speaker Diarization in Overlapped Speech." In 2022 19th International Multi-Conference on Systems, Signals & Devices (SSD). IEEE, 2022. http://dx.doi.org/10.1109/ssd54932.2022.9955684.
Jiang, Yidi, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, and Haizhou Li. "Prompt-Driven Target Speech Diarization." In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024. http://dx.doi.org/10.1109/icassp48485.2024.10446072.
Xie, Jiamin, Leibny Paola García-Perera, Daniel Povey, and Sanjeev Khudanpur. "Multi-PLDA Diarization on Children’s Speech." In Interspeech 2019. ISCA, 2019. http://dx.doi.org/10.21437/interspeech.2019-2961.
Gebre, Binyam Gebrekidan, Peter Wittenburg, Sebastian Drude, Marijn Huijbregts, and Tom Heskes. "Speaker Diarization Using Gesture and Speech." In Interspeech 2014. ISCA, 2014. http://dx.doi.org/10.21437/interspeech.2014-141.
Lupu, Eugen, Anca Apatean, and Radu Arsinte. "Speaker Diarization Experiments for Romanian Parliamentary Speech." In 2015 International Symposium on Signals, Circuits and Systems (ISSCS). IEEE, 2015. http://dx.doi.org/10.1109/isscs.2015.7204023.
Lyu, Dau-Cheng, Eng-Siong Chng, and Haizhou Li. "Language Diarization for Code-Switch Conversational Speech." In ICASSP 2013 - 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013. http://dx.doi.org/10.1109/icassp.2013.6639083.
Imseng, David, and Gerald Friedland. "Robust Speaker Diarization for Short Speech Recordings." In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU). IEEE, 2009. http://dx.doi.org/10.1109/asru.2009.5373254.
Wang, Yingzhi, Mirco Ravanelli, and Alya Yacoubi. "Speech Emotion Diarization: Which Emotion Appears When?" In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023. http://dx.doi.org/10.1109/asru57964.2023.10389790.
Full textReports on the topic "Speech diarization"
Hansen, John H. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment. Fort Belvoir, VA: Defense Technical Information Center, October 2015. http://dx.doi.org/10.21236/ada623029.