Journal articles on the topic 'Speech diarization'



Consult the top 50 journal articles for your research on the topic 'Speech diarization.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Mertens, Robert, Po-Sen Huang, Luke Gottlieb, Gerald Friedland, Ajay Divakaran, and Mark Hasegawa-Johnson. "On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks." International Journal of Multimedia Data Engineering and Management 3, no. 3 (July 2012): 1–19. http://dx.doi.org/10.4018/jmdem.2012070101.

Abstract:
A video’s soundtrack is usually highly correlated with its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as “engine sounds” or “outdoor/indoor sounds.” These approaches come with three major drawbacks: manual definitions do not scale, as they are highly domain-dependent; manual definitions are highly subjective with respect to annotators; and a large part of the audio content is omitted, since the predefined concepts are usually found in only a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts, and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question “who spoke when?” by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned to better reflect the acoustic properties of general sounds as opposed to speech, and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
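The clustering idea this abstract builds on can be made concrete with a short sketch: cut the soundtrack into fixed-length segments, summarize each segment with averaged MFCCs, and group acoustically similar segments. This is a minimal illustration under assumed settings (segment length, feature type, cluster count), not the authors' tuned system.

```python
# Minimal sketch: diarization-style clustering of general sounds, not just speech.
# Segment length, MFCC features, and cluster count are illustrative assumptions.
import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_sound_segments(wav_path, seg_dur=2.0, n_clusters=8):
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_dur * sr)
    feats = []
    for start in range(0, len(y) - seg_len, seg_len):
        seg = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))               # one vector per segment
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(np.array(feats))
    return labels                                      # cluster id ("acoustic concept") per segment
```

A histogram of cluster occurrences over a soundtrack can then serve as a document-level index feature, mirroring the diarization-based indexing the paper discusses.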
2

Astapov, Sergei, Aleksei Gusev, Marina Volkova, Aleksei Logunov, Valeriia Zaluskaia, Vlada Kapranova, Elena Timofeeva, Elena Evseeva, Vladimir Kabarov, and Yuri Matveev. "Application of Fusion of Various Spontaneous Speech Analytics Methods for Improving Far-Field Neural-Based Diarization." Mathematics 9, no. 23 (November 23, 2021): 2998. http://dx.doi.org/10.3390/math9232998.

Abstract:
Recently developed methods in spontaneous speech analytics require the use of speaker separation based on audio data, referred to as diarization. It is applied to widespread use cases, such as meeting transcription based on recordings from distant microphones and the extraction of the target speaker’s voice profiles from noisy audio. However, speech recognition and analysis can be hindered by background and point-source noise, overlapping speech, and reverberation, which all affect diarization quality in conjunction with each other. To compensate for the impact of these factors, there are a variety of supportive speech analytics methods, such as quality assessments in terms of SNR and RT60 reverberation time metrics, overlapping speech detection, instant speaker number estimation, etc. The improvements in speaker verification methods have benefits in the area of speaker separation as well. This paper introduces several approaches aimed at improving diarization system quality. The presented experimental results demonstrate the possibility of refining initial speaker labels from neural-based VAD data by means of fusion with labels from quality estimation models, overlapping speech detectors, and speaker number estimation models, which contain CNN and LSTM modules. Such fusion approaches allow us to significantly decrease DER values compared to standalone VAD methods. Cases of ideal VAD labeling are utilized to show the positive impact of ResNet-101 neural networks on diarization quality in comparison with basic x-vectors and ECAPA-TDNN architectures trained on 8 kHz data. Moreover, this paper highlights the advantage of spectral clustering over other clustering methods applied to diarization. The overall quality of diarization is improved at all stages of the pipeline, and the combination of various speech analytics methods makes a significant contribution to the improvement of diarization quality.
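The DER values used for comparison above can be illustrated with a simple frame-level computation. This sketch assumes frame-aligned labels whose speaker identities are already optimally mapped between reference and hypothesis; real scoring tools operate on time-stamped segments and apply a forgiveness collar.

```python
# Minimal frame-level Diarization Error Rate: (miss + false alarm + confusion) / total speech.
# None marks non-speech frames; speaker mapping is assumed already resolved.
def der(ref, hyp):
    assert len(ref) == len(hyp)
    total = sum(1 for r in ref if r is not None)
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa   = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    conf = sum(1 for r, h in zip(ref, hyp) if r is not None and h is not None and r != h)
    return (miss + fa + conf) / total

# Example: der(['A', 'A', 'B', None], ['A', 'B', 'B', 'B']) -> (0 + 1 + 1) / 3 ≈ 0.667
```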
3

Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.

Abstract:
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI’s Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system’s ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
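The word diarization error rate (WDER) reported above can be sketched as the fraction of aligned ASR words that carry the wrong speaker tag. This minimal version assumes the word-level alignment between reference and hypothesis has already been computed.

```python
# Minimal WDER: share of aligned words whose hypothesis speaker tag disagrees
# with the reference speaker. Word alignment (e.g., edit distance) is assumed done.
def wder(aligned_pairs):
    """aligned_pairs: list of (ref_speaker, hyp_speaker), one per aligned word."""
    wrong = sum(1 for r, h in aligned_pairs if r != h)
    return wrong / len(aligned_pairs)

# Example: wder([('A', 'A'), ('A', 'B'), ('B', 'B'), ('B', 'B')]) -> 0.25
```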
4

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION." Applied Mathematics and Sciences An International Journal (MathSJ) 10, no. 1/2 (June 26, 2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording containing unknown amounts of speech from an unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This paper presents a novel approach to speaker diarization, which defines loosely labeled data and employs x-vector embeddings and a formalized approach for threshold searching with a given abstract similarity metric to cluster temporal segments into unique user segments. The proposed algorithm uses concepts from graph theory, matrix algebra, and genetic algorithms to formulate and solve the optimization problem. Additionally, the algorithm is applied to English, Spanish, and Chinese audio, and the performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The findings of this research have significant implications for speech processing and speaker identification, including languages with tonal differences. The proposed method offers a practical and efficient solution for speaker diarization in real-world scenarios where there are labeling time and cost constraints.
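The paper's combination of graph concepts with threshold searching might look roughly like the sketch below: segments are graph nodes, pairs whose embedding similarity exceeds a threshold are connected, connected components act as speaker clusters, and the threshold is chosen to maximize a fitness computed against the loosely labeled data. The grid search stands in for the paper's genetic-algorithm optimizer; all names are illustrative.

```python
# Sketch: threshold search over a pairwise x-vector similarity matrix.
# Grid search replaces the genetic algorithm used in the paper.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_at_threshold(sim, t):
    adj = csr_matrix(sim >= t)                        # edges above the threshold
    _, labels = connected_components(adj, directed=False)
    return labels                                      # component id per segment

def search_threshold(sim, fitness, grid=np.linspace(0.1, 0.9, 81)):
    # fitness(labels) scores a clustering against the loosely labeled reference
    return max(grid, key=lambda t: fitness(cluster_at_threshold(sim, t)))
```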
5

V, Sethuram, Ande Prasad, and R. Rajeswara Rao. "Metaheuristic adapted convolutional neural network for Telugu speaker diarization." Intelligent Decision Technologies 15, no. 4 (January 10, 2022): 561–77. http://dx.doi.org/10.3233/idt-211005.

Abstract:
In speech technology, a pivotal role is played by the speaker diarization mechanism. In general, speaker diarization is the mechanism of partitioning the input audio stream into homogeneous segments based on the identity of the speakers. Speaker diarization can improve the readability of automatic transcriptions, as it recognizes speaker turns in the audio stream and often provides the true speaker identity. In this research work, a novel speaker diarization approach is introduced under three major phases: feature extraction, Speech Activity Detection (SAD), and speaker segmentation and clustering. Initially, Mel Frequency Cepstral Coefficient (MFCC) based features are extracted from the collected input audio stream (Telugu language). Subsequently, in Speech Activity Detection (SAD), the music and silence signals are removed. Then, the acquired speech signals are segmented for each individual speaker. Finally, the segmented signals are subjected to the speaker clustering process, where an Optimized Convolutional Neural Network (CNN) is used. To make the clustering more appropriate, the weights and activation function of the CNN are fine-tuned by a new Self Adaptive Sea Lion Algorithm (SA-SLnO). Finally, a comparative analysis is made to exhibit the superiority of the proposed speaker diarization work. Accordingly, the accuracy of the proposed method is 0.8073, which is superior to the existing works by 5.255, 2.45%, and 0.075, respectively.
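The three-phase structure described above (feature extraction, SAD, segmentation and clustering) can be outlined in a few lines. In this sketch a crude energy threshold stands in for the SAD stage and k-means stands in for the SA-SLnO-optimized CNN, so it shows the pipeline shape only, not the proposed method.

```python
# Structural sketch of an MFCC -> SAD -> clustering diarization pipeline.
# Energy-threshold SAD and k-means are stand-ins for the paper's components.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def diarize(wav_path, n_speakers=2, frame_s=0.025, hop_s=0.010):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(hop_s * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(frame_s * sr), hop_length=hop).T
    rms = librosa.feature.rms(y=y, frame_length=int(frame_s * sr), hop_length=hop)[0]
    n = min(len(mfcc), len(rms))
    speech = rms[:n] > 0.5 * rms.mean()               # crude SAD stand-in
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(mfcc[:n][speech])
    return labels                                      # speaker cluster per speech frame
```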
6

Murali, Abhejay, Satwik Dutta, Meena Chandra Shekar, Dwight Irvin, Jay Buzhardt, and John H. Hansen. "Towards developing speaker diarization for parent-child interactions." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A61. http://dx.doi.org/10.1121/10.0015551.

Abstract:
Daily interactions of children with their parents are crucial for spoken language skills and overall development. Capturing such interactions can help to provide meaningful feedback to parents as well as practitioners. Naturalistic audio capture and the development of a further speech processing pipeline for parent-child interactions is a challenging problem. One of the first important steps in the speech processing pipeline is speaker diarization—identifying who spoke when. Speaker diarization is the method of separating a captured audio stream into homogeneous segments that are differentiated by the speaker's (child's or parent's) identity. Following ongoing COVID-19 restrictions and human subjects research IRB protocols, an unsupervised data collection approach was formulated to collect parent-child interactions (of consented families) using the LENA device—a lightweight audio recorder. Different interaction scenarios were explored: a book reading activity at home and spontaneous interactions in a science museum. To distinguish a child's speech from a parent's, we train the diarization models on open-source adult speech data and child speech data acquired from the LDC (Linguistic Data Consortium). Various speaker embeddings (e.g., x-vectors, i-vectors, ResNets) will be explored. Results will be reported using the Diarization Error Rate. [Work sponsored by NSF via Grant Nos. 1918032 and 1918012.]
7

Taha, Thaer Mufeed, Zaineb Ben Messaoud, and Mondher Frikha. "Convolutional Neural Network Architectures for Gender, Emotional Detection from Speech and Speaker Diarization." International Journal of Interactive Mobile Technologies (iJIM) 18, no. 03 (February 9, 2024): 88–103. http://dx.doi.org/10.3991/ijim.v18i03.43013.

Abstract:
This paper introduces three system architectures for speaker identification that aim to overcome the limitations of diarization and voice-based biometric systems. Diarization systems utilize unsupervised algorithms to segment audio data based on the time boundaries of utterances, but they do not distinguish individual speakers. On the other hand, voice-based biometric systems can only identify individuals in recordings with a single speaker. Identifying speakers in recordings of natural conversations can be challenging, especially when emotional shifts can alter voice characteristics, making gender identification difficult. To address this issue, the proposed architectures include techniques for gender, emotion, and diarization at either the segment or group level. The evaluation of these architectures utilized two speech databases, namely VoxCeleb and RAVDESS (Ryerson audio-visual database of emotional speech and song) datasets. The findings reveal that the proposed approach outperforms the strategy level in terms of recognition results, despite the real-time processing advantage of the latter. The challenge of identifying multiple speakers engaging in a conversation while considering emotional changes that impact speech is effectively addressed by the proposed architectures. The data indicates that the gender and emotion classification of diarization achieves an accuracy of over 98 percent. These results suggest that the proposed speech-based approach can achieve highly accurate speaker identification.
8

Kothalkar, Prasanna V., John H. L. Hansen, Dwight Irvin, and Jay Buzhardt. "Child-adult speech diarization in naturalistic conditions of preschool classrooms using room-independent ResNet model and automatic speech recognition-based re-segmentation." Journal of the Acoustical Society of America 155, no. 2 (February 1, 2024): 1198–215. http://dx.doi.org/10.1121/10.0024353.

Abstract:
Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs controlled lab recordings to measure both quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates alternate deep learning-based lightweight, knowledge-distilled diarization solutions for segmenting classroom interactions of 3–5 year old children with teachers. In this context, the focus is on speech-type diarization, which classifies speech segments as being either from adults or children, partitioned across multiple classrooms. Our lightest CNN model achieves a best F1-score of ∼76.0% on data from two classrooms, based on dev and test sets of each classroom. It is utilized with automatic speech recognition-based re-segmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations.
9

Sarmah, Kshirod. "Speaker Diarization with Deep Learning Techniques." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 11, no. 3 (December 15, 2020): 2570–82. http://dx.doi.org/10.61841/turcomat.v11i3.14309.

Abstract:
Speaker diarization is the task of identifying the speakers in an audio or video recording in which different speakers spoke. Artificial intelligence (AI) fields have effectively used Deep Learning (DL) to solve a variety of real-world application challenges. With effective applications in a wide range of subdomains, such as natural language processing, image processing, computer vision, speech and speaker recognition, emotion recognition, cyber security, and many others, DL, an innovative field of Machine Learning (ML), is quickly emerging as the most potent machine learning technique. DL techniques have recently outperformed conventional approaches in speaker diarization as well as speaker recognition. Speaker diarization is the technique of assigning classes to speech recordings that correspond to the speaker's identity, and it allows one to determine who spoke when. A crucial step in speech processing, speaker diarization divides an audio recording into different speaker regions. This research paper presents an in-depth analysis of speaker diarization utilizing a variety of deep learning algorithms. NIST-2000 CALLHOME and our in-house database ALSD-DB are the two voice corpora used for this study's tests. TDNN-based embeddings with x-vectors, LSTM-based embeddings with d-vectors, and finally a fusion of both x-vector and d-vector embeddings are used in the tests for the basic system. For the NIST-2000 CALLHOME database, LSTM-based embeddings with d-vectors and the fused x-vector and d-vector embeddings exhibit improved performance, with DERs of 8.25% and 7.65%, respectively, and of 10.45% and 9.65% for the local ALSD-DB database.
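The abstract does not specify the operator used to fuse x-vectors and d-vectors; a common minimal realization is length-normalized concatenation, sketched below under that assumption, after which the fused vectors feed the same scoring or clustering backend.

```python
# Sketch: fuse an x-vector and a d-vector for one segment by normalizing and
# concatenating them. Concatenation is an assumption, not the paper's stated operator.
import numpy as np

def fuse(x_vec, d_vec):
    x = x_vec / np.linalg.norm(x_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return np.concatenate([x, d])                     # e.g. 512-d + 256-d -> 768-d

def cosine(a, b):
    # similarity between fused embeddings, as used for clustering decisions
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```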
10

Lleida, Eduardo, Alfonso Ortega, Antonio Miguel, Virginia Bazán-Gil, Carmen Pérez, Manuel Gómez, and Alberto de Prada. "Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media." Applied Sciences 9, no. 24 (December 11, 2019): 5412. http://dx.doi.org/10.3390/app9245412.

Abstract:
The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). That series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporacion Radio Television Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available for scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, or specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained.
11

Ahmad, Rehan, Syed Zubair, and Hani Alquhayz. "Speech Enhancement for Multimodal Speaker Diarization System." IEEE Access 8 (2020): 126671–80. http://dx.doi.org/10.1109/access.2020.3007312.

12

Kothalkar, Prasanna V., Dwight Irvin, Jay Buzhardt, and John H. Hansen. "End-to-end child-adult speech diarization in naturalistic conditions of preschool classrooms." Journal of the Acoustical Society of America 153, no. 3_supplement (March 1, 2023): A174. http://dx.doi.org/10.1121/10.0018568.

Abstract:
Speech and language development are early indicators of overall analytical and learning ability in pre-school children. Early childhood researchers are interested in analyzing naturalistic versus controlled lab recordings to assess both quality and quantity of such communication interactions between children and adults/teachers. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in daylong audio streams, automated speaker diarization technology is limited and must be advanced to address this challenging domain for audio segmentation and meta-data information extraction. We investigate a deep learning-based diarization solution for segmenting classroom interactions of 3–5 year-old children engaging with teachers. Here, the focus is on speaker-label diarization, which classifies speech segments as belonging to either adults or children, partitioned across multiple classrooms. Our proposed ECAPA-TDNN model achieves a best F1-score of 65.5% on data from two classrooms, based on open dev and test sets for each classroom. Also, F1-scores for individual speaker labels provide a breakdown of performance across naturalistic child classroom engagement. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults.
13

Kaur, Sukhvinder, and J. S. Sohal. "Speech Activity Detection and its Evaluation in Speaker Diarization System." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 16, no. 1 (March 13, 2017): 7567–72. http://dx.doi.org/10.24297/ijct.v16i1.5893.

Abstract:
In speaker diarization, speech/voice activity detection is performed to separate speech, non-speech, and silent frames. The zero crossing rate and root mean square value of the frames of audio clips have been used to select training data for the silent, speech, and non-speech models. The trained models are used by two classifiers, a Gaussian mixture model (GMM) and an artificial neural network (ANN), to classify the speech and non-speech frames of an audio clip. The results of the ANN and GMM classifiers are compared using the Receiver Operating Characteristic (ROC) curve and the Detection Error Tradeoff (DET) graph. It is concluded that the neural network based SAD is comparatively better than the Gaussian mixture model based SAD.
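The training-data selection step described above can be sketched directly from frame-level zero crossing rate and RMS. The frame sizes and quantile thresholds below are illustrative assumptions, not the paper's values.

```python
# Sketch: per-frame ZCR and RMS, then rough silence/speech/non-speech tags
# to select training data for the GMM and ANN classifiers.
import numpy as np

def frame_features(y, frame_len=400, hop=160):        # 25 ms / 10 ms at 16 kHz
    feats = []
    for s in range(0, len(y) - frame_len, hop):
        f = y[s:s + frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)  # fraction of sign changes
        rms = np.sqrt(np.mean(f ** 2))
        feats.append((zcr, rms))
    return np.array(feats)

def rough_labels(feats, rms_q=0.3, zcr_q=0.7):        # quantile thresholds (assumed)
    rms_thr = np.quantile(feats[:, 1], rms_q)
    zcr_thr = np.quantile(feats[:, 0], zcr_q)
    return np.where(feats[:, 1] < rms_thr, 'silence',
           np.where(feats[:, 0] > zcr_thr, 'non-speech', 'speech'))
```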
14

Hansen, John H., Aditya Joglekar, and Meena Chandra Shekar. "Fearless steps Apollo: Advancements in robust speech technologies and naturalistic corpus development from Earth to the Moon." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A61. http://dx.doi.org/10.1121/10.0015549.

Abstract:
Recent developments in deep learning strategies have revolutionized Speech and Language Technologies (SLT). Deep learning models often rely on massive naturalistic datasets to produce the necessary complexity required for generating superior performance. However, most massive SLT datasets are not publicly available, limiting the potential for academic research. Through this work, we showcase the CRSS-UTDallas-led efforts to recover, digitize, and openly distribute over 50,000 hrs of speech data recorded during the 12 NASA Apollo manned missions, and outline our continuing efforts to digitize and create meta-data through diarization of the remaining 100,000 hrs. We present novel deep learning-based speech processing solutions developed to extract high-level information from this massive dataset. The Fearless Steps APOLLO resource is a 50,000 hr audio collection from 30-track analog tapes originally used to document Apollo missions 1, 7, 8, 10, 11, & 13. A customized tape read-head developed to digitize all 30 channels simultaneously has been deployed to expedite digitization of the remaining mission tapes. Diarized transcripts for these unlabeled audio communications have also been generated to facilitate open research across the speech science, historical archive, education, and speech technology communities. Robust technologies developed to generate human-readable transcripts include: (i) speaker diarization, (ii) speaker tracking, and (iii) text output from speech recognition systems.
15

Sultan, Wael Ali, Mourad Samir Semary, and Sherif Mahdy Abdou. "An Efficient Speaker Diarization Pipeline for Conversational Speech." Benha Journal of Applied Sciences 9, no. 5 (May 29, 2024): 141–46. http://dx.doi.org/10.21608/bjas.2024.284482.1414.

16

Kone, Tenon Charly, Sebastian Ghinet, Sayed Ahmed Dana, and Anant Grewal. "Speech detection models for effective communicable disease risk assessment in air travel environments." Journal of the Acoustical Society of America 155, no. 3_Supplement (March 1, 2024): A277. http://dx.doi.org/10.1121/10.0027492.

Abstract:
In environments characterized by elevated noise levels, such as airports or aircraft cabins, travelers often find themselves involuntarily speaking loudly and drawing closer to one another in an effort to enhance communication and speech intelligibility. Unfortunately, this unintentional behaviour increases the risk of respiratory particle dispersion, potentially carrying infectious agents like bacteria, which makes contagion control more challenging. The accurate characterization of the risk associated with speaking in such a challenging noise environment with multiple overlapping speech sources is therefore of utmost importance. Among the most advanced signal processing strategies that can be used to accurately determine who spoke when, with whom, for how long, and, most importantly, how loudly at one location, artificial intelligence-based speaker diarization approaches were considered and adapted for this task. This article details the implementation of speaker diarization algorithms, customized to extract speaker and speech parameters discreetly. Validation and preliminary study results are also provided. The algorithms calculate speech duration and sound pressure level for each sentence and speaker, aiding in assessing viral contaminant spread. The paper focuses on applying these algorithms in noisy environments, particularly in confined spaces with multiple speakers. The findings contribute to proactive measures for containing and managing communicable diseases in air travel settings.
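The per-speaker quantities the article extracts from diarization output (speech duration and level per sentence and speaker) can be sketched as below. Level is reported in dB relative to full scale as a stand-in, since absolute sound pressure level would require a calibrated microphone reference.

```python
# Sketch: accumulate per-speaker speech duration and mean segment level from
# diarized segments. dBFS stands in for calibrated SPL.
import numpy as np

def speaker_stats(y, sr, segments):
    """segments: list of (start_s, end_s, speaker) from a diarization system."""
    stats = {}
    for start, end, spk in segments:
        seg = y[int(start * sr):int(end * sr)]
        level_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)  # dB re full scale
        dur, levels = stats.get(spk, (0.0, []))
        stats[spk] = (dur + (end - start), levels + [level_db])
    return {spk: {'duration_s': d, 'mean_level_dbfs': float(np.mean(l))}
            for spk, (d, l) in stats.items()}
```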
17

Zelenak, Martin, Carlos Segura, Jordi Luque, and Javier Hernando. "Simultaneous Speech Detection With Spatial Features for Speaker Diarization." IEEE Transactions on Audio, Speech, and Language Processing 20, no. 2 (February 2012): 436–46. http://dx.doi.org/10.1109/tasl.2011.2160167.

18

Viñals, Ignacio, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida. "The Domain Mismatch Problem in the Broadcast Speaker Attribution Task." Applied Sciences 11, no. 18 (September 14, 2021): 8521. http://dx.doi.org/10.3390/app11188521.

Abstract:
The demand for high-quality metadata for the available multimedia content requires the development of new techniques able to correctly identify more and more information, including the speaker information. The task known as speaker attribution aims at identifying all or part of the speakers in the audio under analysis. In this work, we carry out a study of the speaker attribution problem in the broadcast domain. Through our experiments, we illustrate the positive impact of diarization on the final performance. Additionally, we show the influence of the variability present in broadcast data, depicting the broadcast domain as a collection of subdomains with particular characteristics. Taking these two factors into account, we also propose alternative approximations robust against domain mismatch. These approximations include a semisupervised alternative as well as a totally unsupervised new hybrid solution fusing diarization and speaker assignment. Thanks to these two approximations, our performance is boosted by around 50% relative. The analysis has been carried out using the corpus for the Albayzín 2020 challenge, a diarization and speaker attribution evaluation working with broadcast data. These data, provided by Radio Televisión Española (RTVE), the Spanish public radio and TV corporation, include multiple shows and genres to analyze the impact of new speech technologies in real-world scenarios.
19

Indu D. "A Methodology for Speaker Diazaration System Based on LSTM and MFCC Coefficients." Journal of Electrical Systems 20, no. 6s (May 2, 2024): 2938–45. http://dx.doi.org/10.52783/jes.3299.

Abstract:
Research on speaker identification remains difficult. A speaker may be automatically identified by comparing their voice sample with their previously recorded voice, and the machine learning strategy has grown in favor in recent years. Convolutional neural networks (CNNs) and deep neural networks (DNNs) are some of the machine learning techniques that have been employed recently. This article discusses a successful speaker verification system based on the d-vector, used to construct a new approach to speaker diarization. In particular, we use an LSTM to cluster speech segments using MFCC coefficients and identify the speakers in the diarization system. The proposed system will be evaluated using benchmark performance metrics, and a comparative study will be made with other models. The need to consider the LSTM neural network using acoustic data and linguistic dialect is considered. LSTM networks can produce reliable speaker segmentation outputs.
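The LSTM-plus-MFCC d-vector idea referenced above can be sketched in PyTorch: MFCC frames pass through an LSTM, and the final hidden state is projected and length-normalized to give a segment embedding that the clustering stage then groups by speaker. Layer sizes here are illustrative assumptions, not the article's configuration.

```python
# Sketch: d-vector extractor — LSTM over MFCC frames, unit-norm final embedding.
import torch
import torch.nn as nn

class DVectorLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden=256, emb=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, emb)

    def forward(self, mfcc_frames):                   # (batch, time, n_mfcc)
        _, (h, _) = self.lstm(mfcc_frames)
        d = self.proj(h[-1])                          # final state of the top layer
        return d / d.norm(dim=1, keepdim=True)        # unit-length d-vector
```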
20

Sathyapriya, S., and A. Indhumathi. "An Efficient Speaker Diarization using Privacy Preserving Audio Features Based of Speech/Non Speech Detection." International Journal of Computer Trends and Technology 9, no. 4 (March 25, 2014): 184–87. http://dx.doi.org/10.14445/22312803/ijctt-v9p136.

21

Huang, Zili, Marc Delcroix, Leibny Paola Garcia, Shinji Watanabe, Desh Raj, and Sanjeev Khudanpur. "Joint speaker diarization and speech recognition based on region proposal networks." Computer Speech & Language 72 (March 2022): 101316. http://dx.doi.org/10.1016/j.csl.2021.101316.

22

Khoma, Volodymyr, Yuriy Khoma, Vitalii Brydinskyi, and Alexander Konovalov. "Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library." Sensors 23, no. 4 (February 13, 2023): 2082. http://dx.doi.org/10.3390/s23042082.

Abstract:
Diarization is an important task when working with audio data, as it provides a solution to the problem of dividing one analyzed call recording into several speech recordings, each of which belongs to one speaker. Diarization systems segment audio recordings by defining the time boundaries of utterances, and typically use unsupervised methods to group utterances belonging to individual speakers, but do not answer the question “who is speaking?” On the other hand, there are biometric systems that identify individuals on the basis of their voices, but such systems are designed with the prerequisite that only one speaker is present in the analyzed audio recording. However, some applications involve the need to identify multiple speakers that interact freely in an audio recording. This paper proposes two architectures of speaker identification systems based on a combination of diarization and identification methods, which operate on the basis of segment-level or group-level classification. The open-source PyAnnote framework was used to develop the system. The performance of the speaker identification system was verified through the application of the AMI Corpus open-source audio database, which contains 100 h of annotated and transcribed audio and video data. The research method consisted of four experiments to select the best-performing supervised diarization algorithms on the basis of PyAnnote. The first experiment was designed to investigate how the selection of the distance function between vector embeddings affects the reliability of identification of a speaker’s utterance in a segment-level classification architecture. The second experiment examines the architecture of cluster-centroid (group-level) classification, i.e., the selection of the best clustering and classification methods. The third experiment investigates the impact of different segmentation algorithms on the accuracy of identifying speaker utterances, and the fourth examines embedding window sizes. Experimental results demonstrated that the group-level approach offered better identification results compared to the segment-level approach, while the latter had the advantage of real-time processing.
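The two architectures compared in this paper can be contrasted in a few lines: segment-level classification scores every segment embedding against enrolled speaker models, while group-level (cluster-centroid) classification first clusters the segments and scores only each centroid. Cosine similarity is assumed as the distance function in this sketch.

```python
# Sketch: segment-level vs. group-level speaker identification over embeddings.
import numpy as np

def identify_segment_level(seg_embs, enrolled):       # enrolled: {name: embedding}
    names = list(enrolled)
    refs = np.stack([enrolled[n] / np.linalg.norm(enrolled[n]) for n in names])
    segs = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    return [names[i] for i in (segs @ refs.T).argmax(axis=1)]

def identify_group_level(seg_embs, cluster_ids, enrolled):
    # cluster_ids: np.ndarray of cluster labels, one per segment
    decision = {}
    for c in set(cluster_ids.tolist()):
        centroid = seg_embs[cluster_ids == c].mean(axis=0, keepdims=True)
        decision[c] = identify_segment_level(centroid, enrolled)[0]
    return [decision[c] for c in cluster_ids.tolist()]
```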
23

Jung, Dahae, Min-Kyoung Bae, Man Yong Choi, Eui Chul Lee, and Jinoo Joung. "Speaker diarization method of telemarketer and client for improving speech dictation performance." Journal of Supercomputing 72, no. 5 (July 3, 2015): 1757–69. http://dx.doi.org/10.1007/s11227-015-1470-4.

24

Zhu, Qiushi, Jie Zhang, Yu Gu, Yuchen Hu, and Lirong Dai. "Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19768–76. http://dx.doi.org/10.1609/aaai.v38i17.29951.

Abstract:
Self-supervised speech pre-training methods have developed rapidly in recent years and have been shown to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive objectives as training targets to fully exploit the rich information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset recorded in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual speaker diarization (AVSD) tasks.
25

Papala, Gowtham, Aniket Ransing, and Pooja Jain. "Sentiment Analysis and Speaker Diarization in Hindi and Marathi Using Finetuned Whisper." Scalable Computing: Practice and Experience 24, no. 4 (November 17, 2023): 835–46. http://dx.doi.org/10.12694/scpe.v24i4.2248.

Abstract:
Automatic Speech Recognition (ASR) is a crucial technology that enables machines to automatically recognize human voices based on audio signals. In recent years, there has been rapid growth in the development of ASR models with the emergence of new techniques and algorithms. One such model is the Whisper ASR model developed by OpenAI, which is based on a Transformer encoder-decoder architecture and can handle multiple tasks such as language identification, transcription, and translation. However, there are still limitations to the Whisper ASR model, such as speaker diarization, summarization, emotion detection, and performance with Indian regional languages like Hindi, Marathi, and others. This research paper aims to enhance the performance of the Whisper ASR model by adding additional components or features such as speaker diarization, text summarization, emotion detection, text generation, and question answering. Additionally, we aim to improve its performance in Indian regional languages by training the model on the Common Voice 11 dataset from Hugging Face. The research findings have the potential to contribute to the development of more accurate and reliable ASR models, which could improve human-machine communication in various applications.
26

Senoussaoui, Mohammed, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel. "A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, no. 1 (January 2014): 217–27. http://dx.doi.org/10.1109/taslp.2013.2285474.

27

Vryzas, Nikolaos, Nikolaos Tsipas, and Charalampos Dimoulas. "Web Radio Automation for Audio Stream Management in the Era of Big Data." Information 11, no. 4 (April 11, 2020): 205. http://dx.doi.org/10.3390/info11040205.

Abstract:
Radio is evolving in a changing digital media ecosystem. Audio-on-demand has shaped the landscape of big unstructured audio data available online. In this paper, a framework for knowledge extraction is introduced, to improve discoverability and enrichment of the provided content. A web application for live radio production and streaming is developed. The application offers typical live mixing and broadcasting functionality, while performing real-time annotation as a background process by logging user operation events. For the needs of a typical radio station, a supervised speaker classification model is trained for the recognition of 24 known speakers. The model is based on a convolutional neural network (CNN) architecture. Since not all speakers are known in radio shows, a CNN-based speaker diarization method is also proposed. The trained model is used for the extraction of fixed-size identity d-vectors. Several clustering algorithms are evaluated, having the d-vectors as input. The supervised speaker recognition model for 24 speakers scores an accuracy of 88.34%, while unsupervised speaker diarization scores a maximum accuracy of 87.22%, as tested on an audio file with speech segments from three unknown speakers. The results are considered encouraging regarding the applicability of the proposed methodology.
28

Lleida, Eduardo, Luis Javier Rodriguez-Fuentes, Javier Tejedor, Alfonso Ortega, Antonio Miguel, Virginia Bazán, Carmen Pérez, et al. "An Overview of the IberSpeech-RTVE 2022 Challenges on Speech Technologies." Applied Sciences 13, no. 15 (July 25, 2023): 8577. http://dx.doi.org/10.3390/app13158577.

Abstract:
Evaluation campaigns provide a common framework with which the progress of speech technologies can be effectively measured. The aim of this paper is to present a detailed overview of the IberSpeech-RTVE 2022 Challenges, which were organized as part of the IberSpeech 2022 conference under the ongoing series of Albayzin evaluation campaigns. In the 2022 edition, four challenges were launched: (1) speech-to-text transcription; (2) speaker diarization and identity assignment; (3) text and speech alignment; and (4) search on speech. Different databases that cover different domains (e.g., broadcast news, conference talks, parliament sessions) were released for those challenges. The submitted systems also cover a wide range of speech processing methods, which include hidden Markov model-based approaches, end-to-end neural network-based methods, hybrid approaches, etc. This paper describes the databases, the tasks and the performance metrics used in the four challenges. It also provides the most relevant features of the submitted systems and briefly presents and discusses the obtained results. Despite employing state-of-the-art technology, the relatively poor performance attained in some of the challenges reveals that there is still room for improvement. This encourages us to carry on with the Albayzin evaluation campaigns in the coming years.
29

Hansen, John H. L., Maryam Najafian, Rasa Lileikyte, Dwight Irvin, and Beth Rous. "Speech and language processing for assessing child–adult interaction based on diarization and location." International Journal of Speech Technology 22, no. 3 (June 5, 2019): 697–709. http://dx.doi.org/10.1007/s10772-019-09590-0.

30

Cerva, Petr, Jan Silovsky, Jindrich Zdansky, Jan Nouza, and Ladislav Seps. "Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives." Speech Communication 55, no. 10 (November 2013): 1033–46. http://dx.doi.org/10.1016/j.specom.2013.06.017.

31

Joglekar, Aditya, Ivan Lopez-Espejo, and John H. Hansen. "Fearless Steps APOLLO: Challenges in keyword spotting and topic detection for naturalistic audio streams." Journal of the Acoustical Society of America 153, no. 3_supplement (March 1, 2023): A173. http://dx.doi.org/10.1121/10.0018566.

Abstract:
Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas, capturing all communications between NASA-MCC personnel, backroom staff, and astronauts across the manned Apollo missions. Such a massive audio resource without metadata (an unlabeled corpus) provides limited benefit for communities outside Speech and Language Technology (SLT). Supplementing this audio with rich metadata, developed using robust automated mechanisms to transcribe and highlight naturalistic communications, can facilitate open research opportunities for the SLT, speech science, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams. [Work sponsored by NSF via Grant No. 2016725 and the EU's Horizon 2021 R&I Program under MSC Grant No. 101062614.]
32

Xiao, Bo, Chewei Huang, Zac E. Imel, David C. Atkins, Panayiotis Georgiou, and Shrikanth S. Narayanan. "A technology prototype system for rating therapist empathy from audio recordings in addiction counseling." PeerJ Computer Science 2 (April 20, 2016): e59. http://dx.doi.org/10.7717/peerj-cs.59.

Abstract:
Scaling up psychotherapy services such as addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy—a key therapy quality index—from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist’s language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a lattice rescoring method to characterize high- vs. low-empathy language. We estimated therapy-session-level empathy codes using utterance-level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert-annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.
33

Kalanadhabhatta, Manasa, Mohammad Mehdi Rastikerdar, Tauhidur Rahman, Adam S. Grabell, and Deepak Ganesan. "Playlogue: Dataset and Benchmarks for Analyzing Adult-Child Conversations During Play." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 4 (November 21, 2024): 1–34. http://dx.doi.org/10.1145/3699775.

Abstract:
There has been growing interest in developing ubiquitous technologies to analyze adult-child speech in naturalistic settings such as free play in order to support children's social and academic development, language acquisition, and parent-child interactions. However, these technologies often rely on off-the-shelf speech processing tools that have not been evaluated on child speech or child-directed adult speech, whose unique characteristics might result in significant performance gaps when using models trained on adult speech. This work introduces the Playlogue dataset containing over 33 hours of long-form, naturalistic, play-based adult-child conversations from three different corpora of preschool-aged children. Playlogue enables researchers to train and evaluate speaker diarization and automatic speech recognition models on child-centered speech. We demonstrate the lack of generalizability of existing state-of-the-art models when evaluated on Playlogue, and show how fine-tuning models on adult-child speech mitigates the performance gap to some extent but still leaves considerable room for improvement. We further annotate over 5 hours of the Playlogue dataset with 8668 validated adult and child speech act labels, which can be used to train and evaluate models to provide clinically relevant feedback on parent-child interactions. We investigate the performance of state-of-the-art language models at automatically predicting these speech act labels, achieving significant accuracy with simple chain-of-thought prompting or minimal fine-tuning. We use in-home pilot data to validate the generalizability of models trained on Playlogue, demonstrating its utility in improving speech and language technologies for child-centered conversations. The Playlogue dataset is available for download at https://huggingface.co/datasets/playlogue/playlogue-v1.
34

Di Cesare, Michele Giuseppe, David Perpetuini, Daniela Cardone, and Arcangelo Merla. "Machine Learning-Assisted Speech Analysis for Early Detection of Parkinson’s Disease: A Study on Speaker Diarization and Classification Techniques." Sensors 24, no. 5 (February 26, 2024): 1499. http://dx.doi.org/10.3390/s24051499.

Abstract:
Parkinson’s disease (PD) is a neurodegenerative disorder characterized by a range of motor and non-motor symptoms. One of the notable non-motor symptoms of PD is the presence of vocal disorders, attributed to the underlying pathophysiological changes in the neural control of the laryngeal and vocal tract musculature. From this perspective, the integration of machine learning (ML) techniques in the analysis of speech signals has significantly contributed to the detection and diagnosis of PD. Particularly, Mel Frequency Cepstral Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GTCCs) are both feature extraction techniques commonly used in the field of speech and audio signal processing that could exhibit great potential for vocal disorder identification. This study presents a novel approach to the early detection of PD through ML applied to speech analysis, leveraging both MFCCs and GTCCs. The recordings contained in the Mobile Device Voice Recordings at King’s College London (MDVR-KCL) dataset were used. These recordings were collected from healthy individuals and PD patients while they read a passage and during a spontaneous conversation on the phone. Particularly, the speech data regarding the spontaneous dialogue task were processed through speaker diarization, a technique that partitions an audio stream into homogeneous segments according to speaker identity. The ML applied to MFCCs and GTCCs allowed us to classify PD patients with a test accuracy of 92.3%. This research further demonstrates the potential to employ mobile phones as a non-invasive, cost-effective tool for the early detection of PD, significantly improving patient prognosis and quality of life.
35

Yella, Sree Harsha, and Herve Bourlard. "Overlapping Speech Detection Using Long-Term Conversational Features for Speaker Diarization in Meeting Room Conversations." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, no. 12 (December 2014): 1688–700. http://dx.doi.org/10.1109/taslp.2014.2346315.

36

Ghorbani, Shahram, and John H. L. Hansen. "Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition." Journal of the Acoustical Society of America 155, no. 6 (June 1, 2024): 3848–60. http://dx.doi.org/10.1121/10.0026235.

Abstract:
Accurately classifying accents and assessing accentedness in non-native speakers are challenging tasks, due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pretrained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information complement an end-to-end (E2E) accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in AID. Next, leveraging automatic speech recognition (ASR) and AID models is investigated to explore accentedness estimation. The ASR model is an E2E connectionist temporal classification model trained exclusively with American English (en-US) utterances. The ASR error rate and en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between scores estimated by the two models. Additionally, a robust correlation between objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of using AID-based and ASR-based systems for accentedness assessment in non-native speech. Such advanced systems would benefit accent assessment in language learning as well as speech and speaker assessment for intelligibility, quality, and speaker diarization and speech recognition advancements.
37

Anmella, Gerard, Michele De Prisco, Jeremiah B. Joyce, Claudia Valenzuela-Pascual, Ariadna Mas-Musons, Vincenzo Oliva, Giovanna Fico, et al. "Automated Speech Analysis in Bipolar Disorder: The CALIBER Study Protocol and Preliminary Results." Journal of Clinical Medicine 13, no. 17 (August 23, 2024): 4997. http://dx.doi.org/10.3390/jcm13174997.

Abstract:
Background: Bipolar disorder (BD) involves significant mood and energy shifts reflected in speech patterns. Detecting these patterns is crucial for diagnosis and monitoring, currently assessed subjectively. Advances in natural language processing offer opportunities to objectively analyze them. Aims: To (i) correlate speech features with manic-depressive symptom severity in BD, (ii) develop predictive models for diagnostic and treatment outcomes, and (iii) determine the most relevant speech features and tasks for these analyses. Methods: This naturalistic, observational study involved longitudinal audio recordings of BD patients at euthymia, during acute manic/depressive phases, and after-response. Patients participated in clinical evaluations, cognitive tasks, standard text readings, and storytelling. After automatic diarization and transcription, speech features, including acoustics, content, formal aspects, and emotionality, will be extracted. Statistical analyses will (i) correlate speech features with clinical scales, (ii) use lasso logistic regression to develop predictive models, and (iii) identify relevant speech features. Results: Audio recordings from 76 patients (24 manic, 21 depressed, 31 euthymic) were collected. The mean age was 46.0 ± 14.4 years, with 63.2% female. The mean YMRS score for manic patients was 22.9 ± 7.1, reducing to 5.3 ± 5.3 post-response. Depressed patients had a mean HDRS-17 score of 17.1 ± 4.4, decreasing to 3.3 ± 2.8 post-response. Euthymic patients had mean YMRS and HDRS-17 scores of 0.97 ± 1.4 and 3.9 ± 2.9, respectively. Following data pre-processing, including noise reduction and feature extraction, comprehensive statistical analyses will be conducted to explore correlations and develop predictive models. Conclusions: Automated speech analysis in BD could provide objective markers for psychopathological alterations, improving diagnosis, monitoring, and response prediction. This technology could identify subtle alterations, signaling early signs of relapse. Establishing standardized protocols is crucial for creating a global speech cohort, fostering collaboration, and advancing BD understanding.
38

Zeulner, Tobias, Gerhard Johann Hagerer, Moritz Müller, Ignacio Vazquez, and Peter A. Gloor. "Predicting Individual Well-Being in Teamwork Contexts Based on Speech Features." Information 15, no. 4 (April 12, 2024): 217. http://dx.doi.org/10.3390/info15040217.

Abstract:
Current methods for assessing individual well-being in team collaboration at the workplace often rely on manually collected surveys. This limits continuous real-world data collection and proactive measures to improve team member workplace satisfaction. We propose a method to automatically derive social signals related to individual well-being in team collaboration from raw audio and video data collected in teamwork contexts. The goal was to develop computational methods and measurements to facilitate the mirroring of individuals’ well-being to themselves. We focus on how speech behavior is perceived by team members to improve their well-being. Our main contribution is the assembly of an integrated toolchain to perform multi-modal extraction of robust speech features in noisy field settings and to explore which features are predictors of self-reported satisfaction scores. We applied the toolchain to a case study, where we collected videos of 20 teams with 56 participants collaborating over a four-day period in a team project in an educational environment. Our audiovisual speaker diarization extracted individual speech features from a noisy environment. As the dependent variable, team members filled out a daily PERMA (positive emotion, engagement, relationships, meaning, and accomplishment) survey. These well-being scores were predicted using speech features extracted from the videos using machine learning. The results suggest that the proposed toolchain was able to automatically predict individual well-being in teams, leading to better teamwork and happier team members.
APA, Harvard, Vancouver, ISO, and other styles
39

Kaur, Sukhvinder, Chander Prabha, Ravinder Pal Singh, Deepali Gupta, Sapna Juneja, Punit Gupta, and Ali Nauman. "Optimized technique for speaker changes detection in multispeaker audio recording using pyknogram and efficient distance metric." PLOS ONE 19, no. 11 (November 20, 2024): e0314073. http://dx.doi.org/10.1371/journal.pone.0314073.

Full text
Abstract:
Segmentation is a core step in speech recognition, word counting, speaker indexing, and speaker diarization. This paper describes a speaker segmentation system that detects speaker change points in multi-speaker audio recordings using feature extraction and a proposed distance metric algorithm. In this approach, pre-processing of the audio stream includes noise reduction, speech compression using the discrete wavelet transform (Daubechies wavelet ‘db40’ at level 2), and framing. This is followed by two feature extraction algorithms: the pyknogram and the nonlinear energy operator (NEO). The extracted features of each frame are then used to detect speaker change points by applying dissimilarity measures to the distance between two frames. To realize this, a sliding window is moved across the whole data stream to find the highest peak, which corresponds to the speaker change point. The distance metrics incorporated are the standard Bayesian Information Criterion (BIC), Kullback-Leibler divergence (KLD), the T-test, and the proposed algorithm for detecting speaker boundaries. Finally, a threshold is applied and the results are evaluated with recall, precision, and F-measure. The proposed distance metric with the pyknogram achieves the best result of 99.34%, compared to the BIC, KLD, and T-test algorithms.
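For readers unfamiliar with the BIC baseline the authors compare against, the sketch below computes a standard ΔBIC between two adjacent windows of speech features; a positive value favors the two-speaker hypothesis and thus marks a candidate change point. The window sizes and penalty weight λ are illustrative choices, not the paper's settings.

```python
# Sketch: delta-BIC between two adjacent windows of speech features
# (e.g., MFCC frames). A positive score favors modeling the windows as
# two speakers, i.e., a candidate speaker change point.
import numpy as np

def delta_bic(X, Y, lam=1.0):
    Z = np.vstack([X, Y])
    n1, n2, n = len(X), len(Y), len(Z)
    d = Z.shape[1]
    logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False))[1]
    # model-complexity penalty for splitting one Gaussian into two
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(Z) - n1 * logdet(X) - n2 * logdet(Y)) - lam * penalty

rng = np.random.default_rng(1)
same = delta_bic(rng.normal(0, 1, (200, 13)), rng.normal(0, 1, (200, 13)))
diff = delta_bic(rng.normal(0, 1, (200, 13)), rng.normal(3, 1, (200, 13)))
print(f"same speaker: {same:.1f}, different speakers: {diff:.1f}")
```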
APA, Harvard, Vancouver, ISO, and other styles
40

Delgado, Héctor, Anna Matamala, and Javier Serrano. "Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?" Cadernos de Tradução 35, no. 2 (June 17, 2015): 308. http://dx.doi.org/10.5007/2175-7968.2015v35n2p308.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Diez, Mireia, Lukas Burget, Federico Landini, and Jan Cernocky. "Analysis of Speaker Diarization Based on Bayesian HMM With Eigenvoice Priors." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 355–68. http://dx.doi.org/10.1109/taslp.2019.2955293.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Dawalatabad, Nauman, Srikanth Madikeri, C. Chandra Sekhar, and Hema A. Murthy. "Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 14–27. http://dx.doi.org/10.1109/taslp.2020.3036231.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

O’Malley, Ronan, Bahman Mirhedari, Kirsty Harkness, Markus Reuber, Annalena Venneri, Heidi Christensen, and Daniel Blackburn. "055 The digital doctor: a fully automated stratification and monitoring system for patients with memory complaints." Journal of Neurology, Neurosurgery & Psychiatry 90, no. 12 (November 14, 2019): A23.2—A23. http://dx.doi.org/10.1136/jnnp-2019-abn-2.76.

Full text
Abstract:
Introduction: Referrals to specialist memory clinics have increased out of proportion to the incidence of dementia. Time and financial pressures are consequently exerted on a service striving to deliver high-quality care. We have developed a fully automated ‘Digital Doctor’ with the aim of providing pre-clinic risk stratification and ongoing monitoring for patients with memory concerns. Methods: We recruited 15 participants each with Functional Memory Disorder (FMD), Mild Cognitive Impairment (MCI), and Alzheimer’s disease, as well as 15 healthy controls. Participants answered 12 questions posed by the ‘Digital Doctor’. Audio and visual data are analysed using diarization and automatic speech recognition tools and machine learning classifiers. Results: The ‘Digital Doctor’ can distinguish between neurodegenerative dementia and FMD with an accuracy of 95%. We will have results of a 4-way classification (HC, FMD, MCI & AD) at the time of the conference. Discussion: We demonstrate the potential value of the ‘Digital Doctor’ as a stratification and triage tool. Accuracy will be improved with a greater number of users and the inclusion of fluency and picture description data. Patients at low risk could avoid the burden of a clinic appointment, whilst patients at higher risk could benefit from a more streamlined service.
APA, Harvard, Vancouver, ISO, and other styles
44

Ding, Huitong, Adrian Lister, Cody Karjadi, Rhoda Au, Honghuang Lin, Brian Bischoff, and Phillip Hwang. "EARLY DETECTION OF ALZHEIMER’S DISEASE AND RELATED DEMENTIAS FROM VOICE RECORDINGS: THE FRAMINGHAM HEART STUDY." Innovation in Aging 7, Supplement_1 (December 1, 2023): 1024. http://dx.doi.org/10.1093/geroni/igad104.3291.

Full text
Abstract:
With the aging global population and the increasing prevalence of dementia, there is a growing focus on identifying mild cognitive impairment (MCI), a pre-dementia state, to enable timely interventions that could potentially slow down neurodegeneration. Producing speech is a cognitively complex task that engages various cognitive domains, while the ease of audio data collection underscores the cost-effectiveness and noninvasiveness that voice may offer. This study aims to construct a machine learning pipeline that incorporates speaker diarization, feature extraction, feature selection, and classification to identify a set of acoustic features with strong MCI detection capability. The study included 100 MCI cases and 100 healthy controls (HC) matched for age, sex, and education from the Framingham Heart Study. Participants’ speech responses during cognitive tests were recorded, and the recorded audio was processed to separate each participant’s voice from recordings that included the voices of both testers and participants. A comprehensive set of 6385 acoustic features was then extracted from these voice segments using the OpenSMILE and Praat software packages. Subsequently, we constructed a random forest model using the features that exhibited significant differences between the MCI and HC groups. We identified an optimal subset of 29 features that resulted in an AUC of 0.87, with a 90% confidence interval ranging from 0.82 to 0.93. This study showcases the potential of the human voice as a valuable resource for improving early detection of ADRD and motivates future opportunities to use passive voice collection tools, such as hearing aids, to measure brain health.
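A minimal sketch of the select-then-classify step this abstract describes: keep features whose distributions differ between groups, then score a random forest by AUC. The feature matrix is simulated; in the study the features come from OpenSMILE and Praat, and in practice the selection should be nested inside cross-validation to avoid leakage.

```python
# Sketch: select features that differ between MCI and HC groups, then
# classify with a random forest and report AUC. The feature matrix is
# simulated, standing in for OpenSMILE/Praat acoustic features.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)   # 100 HC, 100 MCI
X = rng.normal(size=(200, 500))       # placeholder acoustic features
X[y == 1, :10] += 0.6                 # make 10 features informative

# Keep features whose distributions differ between groups (p < 0.05).
# Note: for an unbiased estimate this step belongs inside the CV folds.
pvals = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(X.shape[1])])
X_sel = X[:, pvals < 0.05]

auc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                      X_sel, y, cv=5, scoring="roc_auc").mean()
print(f"{X_sel.shape[1]} features selected, CV AUC = {auc:.2f}")
```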
APA, Harvard, Vancouver, ISO, and other styles
45

Praharaj, Sambit, Maren Scheffel, Marcel Schmitz, Marcus Specht, and Hendrik Drachsler. "Towards Automatic Collaboration Analytics for Group Speech Data Using Learning Analytics." Sensors 21, no. 9 (May 2, 2021): 3156. http://dx.doi.org/10.3390/s21093156.

Full text
Abstract:
Collaboration is an important 21st-century skill. Co-located (or face-to-face) collaboration (CC) analytics gained momentum with the advent of sensor technology. Most of this work has used the audio modality to detect the quality of CC. CC quality can be detected from simple indicators of collaboration, such as total speaking time, or complex indicators, such as synchrony in the rise and fall of the average pitch. Most past studies focused on “how group members talk” (i.e., spectral and temporal features of audio, such as pitch) and not on “what they talk about”. The “what” of a conversation is more overt than the “how”. Very few studies have examined “what” group members talk about, and those studies were lab-based, presenting specific words as topic clusters rather than analysing the richness of the conversational content by understanding the linkage between those words. To overcome this, this technical paper takes a first step, based on field trials, toward prototyping a tool for automatic collaboration analytics. We designed a technical setup to collect, process, and visualize audio data automatically. The data collection took place while a board game was played among university staff with pre-assigned roles, to create awareness of the connection between learning analytics and learning design. We not only performed a word-level analysis of the conversations but also analysed their richness by interactively visualizing the strength of the linkage between words and phrases. In this visualization, we used a network graph to display the turn-taking exchange between different roles, alongside the word-level and phrase-level analysis. We also used centrality measures to examine the network graph further, based on how much hold particular words have over the network and how influential certain words are. Finally, we found that this approach had certain limitations regarding automation in speaker diarization (i.e., who spoke when) and text data pre-processing. We therefore concluded that, even though the technical setup was only partially automated, it is a way forward for understanding the richness of conversations between different roles and a significant step towards automatic collaboration analytics.
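The turn-taking network described above can be prototyped in a few lines: nodes are roles, a directed edge counts how often one role speaks right after another, and centrality measures indicate who holds the exchange together. The turn sequence below is invented for illustration.

```python
# Sketch: turn-taking exchange as a directed graph, with centrality
# measures to see which role dominates the network. The turn sequence
# is invented for illustration.
import networkx as nx

turns = ["teacher", "analyst", "teacher", "designer", "analyst",
         "teacher", "designer", "teacher"]

G = nx.DiGraph()
for a, b in zip(turns, turns[1:]):  # edge a -> b: b spoke right after a
    w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
    G.add_edge(a, b, weight=w)

print("degree centrality:", nx.degree_centrality(G))
print("betweenness:", nx.betweenness_centrality(G))
```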
APA, Harvard, Vancouver, ISO, and other styles
46

Hershkovich, Leeor, Sabyasachi Bandyopadhyay, Jack Wittmayer, Patrick Tighe, David J. Libon, Catherine C. Price, and Parisa Rashidi. "96 Proof of Principle: Can Paragraph Recall Pauses and Speech Frequencies Correctly Classify Cognitively Compromised Older Adults?" Journal of the International Neuropsychological Society 29, s1 (November 2023): 767–68. http://dx.doi.org/10.1017/s1355617723009530.

Full text
Abstract:
Objective: Recent research has found that machine learning based analysis of patient speech can be used to classify Alzheimer’s disease. We know of no studies, however, that systematically explore the value of pausing events in speech for detecting cognitive limitations. Using retrospectively acquired voice data from paragraph memory tests, we created two types of pause features: (a) the number and duration of pauses, and (b) frequency components in speech immediately following pausing. Multiple machine learning models were used to assess how effectively these features could discriminate individuals classified into two groups: Cognitively Compromised versus Cognitively Well. Participants and Methods: Participants (age > 65 years, n = 67) completed the Newcomer paragraph memory test and a neuropsychological protocol as part of a federally funded, prospective, IRB-approved investigation at the University of Florida. Participant vocal recordings were acquired for the immediate and delay conditions of the test. Speaker diarization was performed on the immediate free recall condition to separate the voices of patients from examiners. Features extracted from both test conditions included (a) 3 pause characteristics (total number of pauses, total pause duration, and length of the longest pause) and (b) 20 Mel Frequency Cepstral Coefficients (MFCC) pertaining to speech immediately (2.7 seconds) following pauses. These were combined with demographics (age, sex, race, education, and handedness) to create a total of 105 features that were used as inputs for multiple machine learning models (random forest, logistic regression, naive Bayes, AdaBoost, gradient boosting, and multi-layered perceptron). External neuropsychological metrics were used to initially classify participants as Cognitively Compromised (i.e., below -1.0 standard deviation on two or more of five test metrics: total immediate, delay, and discrimination scores on the Hopkins Verbal Learning Test-Revised (HVLT-R), the Controlled Oral Word Association (COWA) test, and category fluency ('animals')). Pearson product-moment correlations were used to assess the linear relationships between pause and speech frequency features and neuropsychological metrics. Results: The neuropsychological classification using the -1 SD cut-off identified 27% (18/67) of participants as Cognitively Compromised. The Cognitively Compromised and Cognitively Well groups did not differ in the distributions of individual pause/frequency features (Mann-Whitney U-test, p > 0.11). A negative correlation was found between the total duration of short pauses and HVLT total immediate free recall, while a positive correlation was found between MFCC-10 and HVLT total immediate free recall. The best classification model was the AdaBoost classifier, which predicted the Cognitively Compromised label with an area under the receiver operating characteristic curve of 0.91, accuracy of 0.81, sensitivity of 0.43, specificity of 1.0, precision of 1.0, and an F1 score of 0.6. Conclusions: Pause characteristics and the frequency profiles of speech immediately following pauses from a paragraph memory test accurately identified older adults with compromised cognition, as measured by verbal learning and verbal fluency metrics. Furthermore, individuals with reduced HVLT immediate free recall generated more pauses, while individuals who recalled more words had higher power in mid-frequency bands (10th MFCC).
Future research needs to replicate these findings and determine how paragraph-recall pause characteristics and the frequency profile of speech immediately following pauses could provide a low-resource alternative to automatic speech recognition models for detecting cognitive impairments.
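As a concrete illustration of the first feature family, this sketch derives the three pause characteristics from speech-segment timestamps such as those produced by diarization; the timestamps and the minimum-pause threshold are assumptions made for the example.

```python
# Sketch: compute pause features (count, total duration, longest pause)
# from a speaker's speech-segment timestamps, e.g., after diarization.
# Timestamps and the 0.25 s pause threshold are illustrative.
speech_segments = [(0.0, 3.1), (3.6, 7.2), (8.9, 12.0), (12.4, 15.8)]
MIN_PAUSE = 0.25  # ignore micro-gaps shorter than this (seconds)

pauses = [start2 - end1
          for (_, end1), (start2, _) in zip(speech_segments, speech_segments[1:])
          if start2 - end1 >= MIN_PAUSE]

features = {
    "n_pauses": len(pauses),
    "total_pause_dur": sum(pauses),
    "longest_pause": max(pauses, default=0.0),
}
print(features)
# MFCCs of the 2.7 s after each pause could then be appended, e.g. with
# librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=20).
```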
APA, Harvard, Vancouver, ISO, and other styles
47

McDonald, Margarethe, Taeahn Kwon, Hyunji Kim, Youngki Lee, and Eon-Suk Ko. "Evaluating the Language ENvironment Analysis System for Korean." Journal of Speech, Language, and Hearing Research 64, no. 3 (March 17, 2021): 792–808. http://dx.doi.org/10.1044/2020_jslhr-20-00489.

Full text
Abstract:
Purpose: The algorithm of the Language ENvironment Analysis (LENA) system for calculating language environment measures was trained on American English; thus, its validity with other languages cannot be assumed. This article evaluates the accuracy of the LENA system applied to Korean. Method: We sampled sixty 5-min recording clips involving 38 key children aged 7–18 months from a larger data set. We establish the identification error rate, precision, and recall of LENA classification compared to human coders. We then examine the correlation between standard LENA measures of adult word count, child vocalization count, and conversational turn count and human counts of the same measures. Results: Our identification error rate (64% or 67%), including false alarms, confusion, and misses, was similar to the rate found in Cristia, Lavechin, et al. (2020). The correlation between LENA and human counts for adult word count (r = .78 or .79) was similar to that found in other studies, but the same measure for child vocalization count (r = .34–.47) was lower than the value in Cristia, Lavechin, et al., though it fell within the ranges found for other non-European languages. The correlation between LENA and human conversational turn counts was not high (r = .36–.47), similar to the findings of other studies. Conclusions: LENA technology is about as reliable for Korean language environments as it is for other non-English language environments. Factors affecting the accuracy of diarization include speakers' pitch, duration of utterances, age, and the presence of noise and electronic sounds.
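The validation strategy used here, correlating automatic counts with human-coded counts, reduces to a Pearson correlation per measure; the sketch below shows that comparison on made-up adult word counts.

```python
# Sketch: validate automatic counts against human annotation with a
# Pearson correlation, computed per measure (adult word count, child
# vocalization count, conversational turn count). Counts are made up.
from scipy.stats import pearsonr

lena_awc  = [812, 1450, 655, 990, 1203, 740, 1580, 430]   # automatic
human_awc = [760, 1510, 600, 1050, 1170, 700, 1490, 480]  # human-coded

r, p = pearsonr(lena_awc, human_awc)
print(f"adult word count: r = {r:.2f} (p = {p:.3f})")
```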
APA, Harvard, Vancouver, ISO, and other styles
48

Kumar, Krishna. "Speaker Diarization: A Review." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 07, no. 06 (June 24, 2023). http://dx.doi.org/10.55041/ijsrem24075.

Full text
Abstract:
Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It is challenging due to the variability of human speech, the presence of overlapping speech, and the lack of prior information about the speakers. Speaker diarization labels a speech signal according to the identity of the speakers and is a crucial task in audio signal processing and speech analysis. This paper reviews speaker diarization research since 2018, discussing the historical development of speaker diarization technology and recent advances in neural speaker diarization approaches. Key Words: speaker diarization, speaker clustering, speaker embeddings
APA, Harvard, Vancouver, ISO, and other styles
49

Xu, Sean Shensheng, Xiaoquan Ke, Man-Wai Mak, Ka Ho Wong, Helen Meng, Timothy C. Y. Kwok, Jason Gu, Jian Zhang, Wei Tao, and Chunqi Chang. "Speaker-turn aware diarization for speech-based cognitive assessments." Frontiers in Neuroscience 17 (January 16, 2024). http://dx.doi.org/10.3389/fnins.2023.1351848.

Full text
Abstract:
Introduction: Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessments (MoCA). Methods: This paper proposes three enhancements to conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, a multi-scale channel-interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments, resulting in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments. Results: Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios. Discussion: The results also show that the proposed enhancements can help hypothesize speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.
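To make the idea of a scoring matrix with local and global information concrete, here is a generic sketch: cosine similarities between segment embeddings blended with an adjacency term, then spectral clustering on the resulting affinity. The embeddings, blend weight, and clustering choice are placeholders, not the authors' exact formulation.

```python
# Sketch: a segment-similarity scoring matrix mixing global pairwise
# similarity with a local (adjacent-segment) term, then spectral
# clustering. This is a generic illustration with synthetic embeddings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-3, 1, (10, 64)),   # speaker A segments
                 rng.normal(3, 1, (10, 64))])   # speaker B segments

S = cosine_similarity(emb)                      # global pairwise similarity
local = np.eye(len(emb))                        # local term: favor adjacency
for i in range(len(emb) - 1):
    local[i, i + 1] = local[i + 1, i] = 1.0

# Blend into a non-negative affinity matrix (0.3 weight is a placeholder).
A = 0.7 * (S - S.min()) / (S.max() - S.min()) + 0.3 * local

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```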
APA, Harvard, Vancouver, ISO, and other styles
50

Sánchez Cárdenas, Roberto, and Marvin Coto-Jiménez. "Application of Fischer semi discriminant analysis for speaker diarization in Costa Rican radio broadcasts." Revista Tecnología en Marcha, November 16, 2022. http://dx.doi.org/10.18845/tm.v35i8.6464.

Full text
Abstract:
Automatic segmentation and classification of audio streams is a challenging problem with many applications, such as indexing multimedia digital libraries, information retrieval, and the building of speech (or spoken) corpora for particular languages and accents. Such a corpus is a database of speech audio files and the corresponding text transcriptions. Among the several steps and tasks required for any of those applications, speaker diarization is one of the most relevant, because it aims to find boundaries in the audio recordings according to who speaks in each fragment. Speaker diarization can be performed in a supervised or unsupervised way and is commonly applied to audio consisting of pure speech. In this work, a first annotated dataset and analysis of speaker diarization for Costa Rican radio broadcasting is presented, using two approaches: a classic one based on k-means clustering, and the more recent Fischer Semi Discriminant. We chose publicly available radio broadcasts and compared the two systems' applicability on the complete audio files, which also contain segments of music and challenging acoustic conditions. Results show that performance depends on the number of speakers in each broadcast, especially in terms of average cluster purity. The results also show the need for further exploration, and for combination with other classification and segmentation algorithms, to better extract useful information from the dataset and enable further development of speech corpora.
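The k-means baseline used in this study can be sketched as follows: average MFCC features over short windows and cluster the windows by speaker. The synthetic two-tone signal stands in for two talkers, and all parameters are illustrative.

```python
# Sketch: unsupervised k-means "diarization" baseline: average MFCCs
# over short windows, then cluster the windows by speaker. A synthetic
# two-tone signal stands in for two talkers.
import numpy as np
import librosa
from sklearn.cluster import KMeans

sr = 16000
t = np.linspace(0, 5, 5 * sr, endpoint=False)
audio = np.concatenate([np.sin(2 * np.pi * 150 * t),   # "speaker" 1
                        np.sin(2 * np.pi * 300 * t)])  # "speaker" 2

mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # shape (13, frames)
win = 50                                                # frames per window (~1.6 s)
n = mfcc.shape[1] // win
windows = np.stack([mfcc[:, i * win:(i + 1) * win].mean(axis=1) for i in range(n)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(windows)
print(labels)  # contiguous runs of one label per "speaker" turn
```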
APA, Harvard, Vancouver, ISO, and other styles