Journal articles on the topic 'Automatic speaker recognition'

Consult the top 50 journal articles for your research on the topic 'Automatic speaker recognition.'


1. Aung, Zaw Win. "Automatic Attendance System Using Speaker Recognition." International Journal of Trend in Scientific Research and Development 2, no. 6 (October 31, 2018): 802–6. http://dx.doi.org/10.31142/ijtsrd18763.

2. Singh, Satyanand. "Forensic and Automatic Speaker Recognition System." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 5 (October 1, 2018): 2804. http://dx.doi.org/10.11591/ijece.v8i5.pp2804-2811.

Abstract:
Current Automatic Speaker Recognition (ASR) systems have emerged as an important means of confirming identity in many businesses, e-commerce applications, forensics, and law enforcement. Specialists trained in forensic recognition can perform this task far better by examining a set of acoustic, prosodic, and semantic attributes, an approach referred to as structured listening. Algorithm-based systems have been developed for forensic speaker recognition by physicists and forensic linguists to reduce the probability of contextual bias or a preconceived comparison of a reference model with an unknown audio sample from a suspected individual. Many researchers continue to develop automatic algorithms in signal processing and machine learning to improve performance, so that the automatic system can establish a speaker’s identity as reliably as a human listener. In this paper, I examine the literature on the identification of speakers by machines and humans, emphasizing the key technical advances in automatic speaker recognition over the last decade. I focus on many aspects of automatic speaker recognition (ASR) systems, including speaker-specific features, speaker models, standard assessment data sets, and performance metrics.
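Several of the systems surveyed in this entry are compared by their equal error rate. As a point of reference, the sketch below shows how an EER is typically computed from verification scores; the score arrays and the normal distributions that generate them are purely illustrative.

```python
# A minimal sketch of the equal error rate (EER) metric, assuming two arrays
# of scores from a hypothetical verification trial list.
import numpy as np

def equal_error_rate(genuine, impostor):
    """Return the EER and the threshold at which FAR approximately equals FRR."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Hypothetical scores: higher means "same speaker".
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1_000)
impostor = rng.normal(-2.0, 1.0, 10_000)
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.3%} at threshold {thr:.2f}")
```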
3. Gonzalez-Rodriguez, Joaquin. "Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014)." Loquens 1, no. 1 (June 30, 2014): e007. http://dx.doi.org/10.3989/loquens.2014.007.

4. Algabri, Mohammed, Hassan Mathkour, Mohamed A. Bencherif, Mansour Alsulaiman, and Mohamed A. Mekhtiche. "Automatic Speaker Recognition for Mobile Forensic Applications." Mobile Information Systems 2017 (2017): 1–6. http://dx.doi.org/10.1155/2017/6986391.

Abstract:
Presently, lawyers, law enforcement agencies, and judges in courts use speech and other biometric features to recognize suspects. In general, speaker recognition is used for discriminating people based on their voices. The process of determining whether a suspected speaker is the source of a trace is called forensic speaker recognition. In such applications, the voice samples are most probably noisy, the recording sessions might mismatch each other, the sessions might not contain sufficient recording for recognition purposes, and the suspect voices are recorded through a mobile channel. The identification of a person through his voice within a forensic quality context is challenging. In this paper, we propose a method for forensic speaker recognition for the Arabic language; the King Saud University Arabic Speech Database is used for obtaining experimental results. The advantage of this database is that each speaker’s voice is recorded in both clean and noisy environments, through a microphone and a mobile channel. This diversity facilitates its usage in forensic experimentation. Mel-Frequency Cepstral Coefficients are used for feature extraction, and the Gaussian mixture model-universal background model is used for speaker modeling. Our approach has shown low equal error rates (EER) within noisy environments and with very short test samples.
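For readers who want to experiment with the MFCC + GMM-UBM recipe this entry names, here is a minimal sketch using librosa and scikit-learn. The file names and component count are assumptions, and the simplified re-estimation step stands in for the full MAP adaptation that GMM-UBM systems actually use; it is not the authors' configuration.

```python
# A rough MFCC + GMM-UBM verification sketch, not the paper's exact system.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=8000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)

# 1) Universal background model trained on pooled background speech.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=200)
ubm.fit(np.vstack([mfcc_frames(p) for p in ["bg1.wav", "bg2.wav"]]))  # placeholders

# 2) Speaker model: a GMM re-estimated from the UBM on enrolment data
#    (a simplification of the MAP adaptation used in real GMM-UBM systems).
spk = GaussianMixture(n_components=64, covariance_type="diag", max_iter=10,
                      weights_init=ubm.weights_, means_init=ubm.means_,
                      precisions_init=ubm.precisions_)
spk.fit(mfcc_frames("enrol.wav"))

# 3) Verification score: average log-likelihood ratio over the test frames.
test = mfcc_frames("test.wav")
llr = spk.score(test) - ubm.score(test)  # score() returns mean log-likelihood
print("LLR:", llr)
```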
5. Besacier, Laurent, and Jean-François Bonastre. "Subband architecture for automatic speaker recognition." Signal Processing 80, no. 7 (July 2000): 1245–59. http://dx.doi.org/10.1016/s0165-1684(00)00033-5.

6. Farrús, Mireia. "Voice Disguise in Automatic Speaker Recognition." ACM Computing Surveys 51, no. 4 (September 6, 2018): 1–22. http://dx.doi.org/10.1145/3195832.

7. Drygajlo, A. "Forensic Automatic Speaker Recognition [Exploratory DSP]." IEEE Signal Processing Magazine 24, no. 2 (March 2007): 132–35. http://dx.doi.org/10.1109/msp.2007.323278.

8. Zhang, Cuiling, and Tiejun Tan. "Voice disguise and automatic speaker recognition." Forensic Science International 175, no. 2-3 (March 2008): 118–22. http://dx.doi.org/10.1016/j.forsciint.2007.05.019.

9. Singh, Satyanand. "High level speaker specific features modeling in automatic speaker recognition system." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 2 (April 1, 2020): 1859. http://dx.doi.org/10.11591/ijece.v10i2.pp1859-1867.

Abstract:
Spoken words convey several levels of information. At the primary level, speech conveys words or spoken messages, but at the secondary level, it also reveals information about the speakers. This work is based on high-level speaker-specific features and statistical speaker modeling techniques that express the characteristic sound of the human voice. Hidden Markov model (HMM), Gaussian mixture model (GMM), and Linear Discriminant Analysis (LDA) models are used to build Automatic Speaker Recognition (ASR) systems that are computationally inexpensive and can recognize speakers regardless of what is said. The performance of the ASR system is evaluated from clean speech across a wide range of speech qualities using the standard TIMIT speech corpus. The ASR efficiency of the HMM, GMM, and LDA based modeling techniques is 98.8%, 99.1%, and 98.6%, and the Equal Error Rate (EER) is 4.5%, 4.4%, and 4.55%, respectively. The EER improvement of the GMM modeling technique based ASR system compared with HMM and LDA is 4.25% and 8.51%, respectively.
10. Khalil, Driss, Amrutha Prasad, Petr Motlicek, Juan Zuluaga-Gomez, Iuliia Nigmatulina, Srikanth Madikeri, and Christof Schuepbach. "An Automatic Speaker Clustering Pipeline for the Air Traffic Communication Domain." Aerospace 10, no. 10 (October 10, 2023): 876. http://dx.doi.org/10.3390/aerospace10100876.

Abstract:
In air traffic management (ATM), voice communications are critical for ensuring the safe and efficient operation of aircraft. The pertinent voice communications—air traffic controller (ATCo) and pilot—are usually transmitted in a single channel, which poses a challenge when developing automatic systems for air traffic management. Speaker clustering, i.e., identifying and grouping utterances of the same speaker among different speakers, is one of the challenges in applying speech processing algorithms. We propose a pipeline that deploys (i) speech activity detection (SAD) to identify speech segments, (ii) an automatic speech recognition system to generate the text for audio segments, (iii) text-based speaker role classification to detect the role of the speaker—ATCo or pilot in our case—and (iv) unsupervised speaker clustering to create a cluster of each individual pilot speaker from the obtained speech utterances. The speech segments obtained by SAD are input into an automatic speech recognition (ASR) engine to generate the automatic English transcripts. The speaker role classification system takes the transcript as input and uses it to determine whether the speech was from the ATCo or the pilot. As the main goal of this project is to group the speakers in pilot communication, only pilot data acquired from the classification system is employed. We present a method for separating the speech parts of pilots into different clusters based on the speaker’s voice using agglomerative hierarchical clustering (AHC). The performance of the speaker role classification and speaker clustering is evaluated on two publicly available datasets: the ATCO2 corpus and the Linguistic Data Consortium Air Traffic Control Corpus (LDC-ATCC). Since the pilots’ real identities are unknown, the ground truth is generated based on logical hypotheses regarding the creation of each dataset, timing information, and the information extracted from associated callsigns. In the case of speaker clustering, the proposed algorithm achieves an accuracy of 70% on the LDC-ATCC dataset and 50% on the noisier ATCO2 dataset.
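Step (iv) of the pipeline above, the unsupervised grouping of pilot utterances, can be sketched with off-the-shelf agglomerative clustering. The random embeddings and the distance threshold below are placeholders, not the paper's actual x-vectors or tuning.

```python
# A minimal AHC sketch: group utterance-level speaker embeddings by cosine
# distance and let a threshold decide how many speakers there are.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

embeddings = np.random.default_rng(0).normal(size=(40, 192))  # e.g. x-vectors
embeddings = normalize(embeddings)  # unit norm, so cosine distance is well behaved

ahc = AgglomerativeClustering(
    n_clusters=None,          # number of speakers is unknown in advance
    distance_threshold=1.0,   # tuning knob; an assumption, not from the paper
    linkage="average",
    metric="cosine",
)
labels = ahc.fit_predict(embeddings)
print("estimated speakers:", labels.max() + 1)
```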
11. Kumar, Praveen, and H. S. Jayanna. "Development of Speaker-Independent Automatic Speech Recognition System for Kannada Language." Indian Journal of Science and Technology 15, no. 8 (February 27, 2022): 333–42. http://dx.doi.org/10.17485/ijst/v15i8.2322.

12. Devi, Kharibam Jilenkumari, and Khelchandra Thongam. "A Survey of Automatic Speaker Recognition System Using Artificial Neural Networks." Journal of Advanced Research in Dynamical and Control Systems 11, no. 10-SPECIAL ISSUE (October 31, 2019): 453–56. http://dx.doi.org/10.5373/jardcs/v11sp10/20192832.

13. Marini, Marco, Nicola Vanello, and Luca Fanucci. "Optimising Speaker-Dependent Feature Extraction Parameters to Improve Automatic Speech Recognition Performance for People with Dysarthria." Sensors 21, no. 19 (September 27, 2021): 6460. http://dx.doi.org/10.3390/s21196460.

Abstract:
Within the field of Automatic Speech Recognition (ASR) systems, impaired speech is a big challenge because standard approaches are ineffective in the presence of dysarthria. The first aim of our work is to confirm the effectiveness of a new speech analysis technique for speakers with dysarthria. This new approach exploits the fine-tuning of the size and shift parameters of the spectral analysis window used to compute the initial short-time Fourier transform to improve the performance of a speaker-dependent ASR system. The second aim is to determine whether there exists a correlation between the speaker’s voice features and the optimal window and shift parameters that minimise the error of an ASR system for that specific speaker. For our experiments, we used both impaired and unimpaired Italian speech. Specifically, we used 30 speakers with dysarthria from the IDEA database and 10 professional speakers from the CLIPS database. Both databases are freely available. The results confirm that, if a standard ASR system performs poorly with a speaker with dysarthria, it can be improved by using the new speech analysis. Otherwise, the new approach is ineffective in cases of unimpaired and mildly impaired speech. Furthermore, there exists a correlation between some speakers’ voice features and their optimal parameters.
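The parameter search this entry describes can be illustrated with a small sweep over window and shift values. The objective function below is an explicitly labelled stand-in for a real speaker-dependent ASR evaluation, and the file name is hypothetical.

```python
# A toy sweep over STFT window/shift sizes, keeping the pair that minimises
# a per-speaker error objective. Replace asr_error_rate with a real evaluation.
import itertools
import librosa
import numpy as np

def features(y, sr, win_ms, shift_ms):
    # Analysis window (n_fft) and shift (hop_length) expressed in milliseconds.
    return librosa.feature.mfcc(y=y, sr=sr,
                                n_fft=int(sr * win_ms / 1000),
                                hop_length=int(sr * shift_ms / 1000))

def asr_error_rate(feats):
    # Stand-in objective only: in the real setting this would train/decode a
    # speaker-dependent recogniser and return its word error rate.
    return float(np.var(feats))

y, sr = librosa.load("speaker01.wav", sr=16000)   # hypothetical recording
best = min(itertools.product([15, 20, 25, 30, 40], [5, 10, 15]),  # ms values
           key=lambda ws: asr_error_rate(features(y, sr, *ws)))
print("best (window_ms, shift_ms):", best)
```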
14. Garcia‐Romero, Daniel, and Carol Espy‐Wilson. "Automatic speaker recognition: Advances toward informative systems." Journal of the Acoustical Society of America 128, no. 4 (October 2010): 2394. http://dx.doi.org/10.1121/1.3508584.

15. Singh, Nilu. "A Critical Review on Automatic Speaker Recognition." Science Journal of Circuits, Systems and Signal Processing 4, no. 2 (2015): 14. http://dx.doi.org/10.11648/j.cssp.20150402.12.

16. Giuliani, Diego, Matteo Gerosa, and Fabio Brugnara. "Improved automatic speech recognition through speaker normalization." Computer Speech & Language 20, no. 1 (January 2006): 107–23. http://dx.doi.org/10.1016/j.csl.2005.05.002.

17. Kumar, P. "Automatic Speaker Recognition using LPCC and MFCC." International Journal on Recent and Innovation Trends in Computing and Communication 3, no. 4 (2015): 2106–9. http://dx.doi.org/10.17762/ijritcc2321-8169.150474.

18. Helling, Detlef. "Automatic speaker recognition based on entire words." Journal of the Acoustical Society of America 77, no. 6 (June 1985): 2194. http://dx.doi.org/10.1121/1.391752.

19. Nechanský, Tomáš, Tomáš Bořil, Alžběta Houzar, and Radek Skarnitzl. "The impact of mismatched recordings on an automatic-speaker-recognition system and human listeners." AUC PHILOLOGICA 2022, no. 1 (January 17, 2023): 11–22. http://dx.doi.org/10.14712/24646830.2022.25.

Abstract:
The so-called ‘mismatch’ is a factor which experts in the forensic voice comparison field encounter regularly. We therefore decided to explore to what extent the ability of an automatic-speaker-recognition system and of earwitnesses to identify speakers is influenced when recordings are acquired in different languages and at different times. 100 voices in a database of 300 recordings (100 speakers recorded in three mutually mismatched sessions) were compared with the automatic-speaker-recognition software VOCALISE, based on i-vectors and x-vectors, and by 39 respondents in simulated voice parades. Both the automatic and the perceptual approach yielded similar results in that the less complex the mismatch type, the more successful the identification. The results point to the superiority of the x-vector approach, and also to varying identification abilities of listeners.
20. Auti, Nisha, Atharva Pujari, Anagha Desai, Shreya Patil, Sanika Kshirsagar, and Rutika Rindhe. "Advanced Audio Signal Processing for Speaker Recognition and Sentiment Analysis." International Journal for Research in Applied Science and Engineering Technology 11, no. 5 (May 31, 2023): 1717–24. http://dx.doi.org/10.22214/ijraset.2023.51825.

Abstract:
Automatic Speech Recognition (ASR) technology has revolutionized human-computer interaction by allowing users to communicate with computer interfaces using their voice in a natural way. Speaker recognition is a biometric recognition method that identifies individuals based on their unique speech signal, with potential applications in security, communication, and personalization. Sentiment analysis is a statistical method that analyzes unique acoustic properties of the speaker's voice to identify emotions or sentiments in speech. This allows for automated speech recognition systems to accurately categorize speech as Positive, Neutral, or Negative. While sentiment analysis has been developed for various languages, further research is required for regional languages. This project aims to improve the accuracy of automatic speech recognition systems by implementing advanced audio signal processing and sentiment analysis detection. The proposed system will identify the speaker's voice and analyze the audio signal to detect the context of speech, including the identification of foul language and aggressive speech. The system will be developed for the Marathi Language dataset, with potential for further development in other languages.
21. Singh, Satyanand. "Bayesian distance metric learning and its application in automatic speaker recognition systems." International Journal of Electrical and Computer Engineering (IJECE) 9, no. 4 (August 1, 2019): 2960. http://dx.doi.org/10.11591/ijece.v9i4.pp2960-2967.

Abstract:
This paper proposes a state-of-the-art Automatic Speaker Recognition (ASR) system based on a Bayesian distance learning metric as a feature extractor. In this modeling, I explored the constraints on the distance between modified and simplified i-vector pairs from the same speaker and from different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. Given a speaker tag, I select the data pairs of different speakers with the highest cosine scores to form a set of speaker constraints. This collection captures the most discriminating variability between the speakers in the training data. This Bayesian distance learning approach achieves better performance than the most advanced methods. Furthermore, this method is insensitive to normalization compared to cosine scoring, and it is very effective in the case of limited training data. The modified supervised i-vector based ASR system is evaluated on the NIST SRE 2008 database. The best performance of the combined cosine score, an EER of 1.767%, was obtained using LDA200 + NCA200 + LDA200, and the best performance of Bayes_dml, an EER of 1.775%, was obtained using LDA200 + NCA200 + LDA100. Bayes_dml outperforms the combined normalized cosine scores and gives the best reported result for the short2-short3 condition of the NIST SRE 2008 data.
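The cosine scoring that this paper uses as its comparison baseline reduces to a one-line similarity between enrolment and test i-vectors. The sketch below uses random 400-dimensional vectors and an arbitrary decision threshold purely for illustration.

```python
# Cosine scoring of an i-vector trial: accept when similarity exceeds a threshold.
import numpy as np

def cosine_score(w_enrol, w_test):
    return float(np.dot(w_enrol, w_test) /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

rng = np.random.default_rng(1)
w_enrol = rng.normal(size=400)   # placeholder enrolment i-vector
w_test = rng.normal(size=400)    # placeholder test i-vector
print("accept" if cosine_score(w_enrol, w_test) > 0.3 else "reject")  # toy threshold
```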
22. Weychan, Radoslaw, Tomasz Marciniak, Agnieszka Stankiewicz, and Adam Dabrowski. "Real Time Recognition Of Speakers From Internet Audio Stream." Foundations of Computing and Decision Sciences 40, no. 3 (September 1, 2015): 223–33. http://dx.doi.org/10.1515/fcds-2015-0014.

Abstract:
In this paper we present an automatic speaker recognition technique that uses lossy (encoded) speech streams from Internet radio. We show the influence of the audio encoding (e.g., the bitrate) on the speaker model quality. The model of each speaker was calculated with the use of the Gaussian mixture model (GMM) approach. Both the speaker recognition and the further analysis were realized with the use of short utterances to facilitate real-time processing. The neighborhoods of the speaker models were analyzed with the use of the ISOMAP algorithm. The experiments were based on four 1-hour public debates with 7–8 speakers (including the moderator), acquired from Polish Internet radio services. The presented software was developed in the MATLAB environment.
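As a rough illustration of the ISOMAP neighbourhood analysis mentioned above, one can flatten each speaker's GMM into a supervector and embed the set in two dimensions. The shapes and values below are invented for the sketch, not taken from the paper.

```python
# Embed GMM speaker models (flattened into supervectors) with ISOMAP and
# inspect which speakers lie close together in the low-dimensional layout.
import numpy as np
from sklearn.manifold import Isomap

# Hypothetical supervectors: stacked means of each speaker's 32-component,
# 13-dimensional GMM, one row per speaker.
supervectors = np.random.default_rng(2).normal(size=(8, 32 * 13))

iso = Isomap(n_neighbors=3, n_components=2)
coords = iso.fit_transform(supervectors)   # 2-D layout of the speaker models
for i, (x, y) in enumerate(coords):
    print(f"speaker {i}: ({x:+.2f}, {y:+.2f})")
```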
23. Kamiński, Kamil A., and Andrzej P. Dobrowolski. "Automatic Speaker Recognition System Based on Gaussian Mixture Models, Cepstral Analysis, and Genetic Selection of Distinctive Features." Sensors 22, no. 23 (December 1, 2022): 9370. http://dx.doi.org/10.3390/s22239370.

Abstract:
This article presents an Automatic Speaker Recognition System (ASR System) that successfully resolves problems such as identification within an open set of speakers and the verification of speakers in difficult recording conditions similar to telephone transmission conditions. The article provides complete information on the architecture of the various internal processing modules of the ASR System. The speaker recognition system proposed in the article has been compared closely to other competing systems, achieving improved speaker identification and verification results on a known certified voice dataset. The ASR System owes this to the dual use of genetic algorithms, both in the feature selection process and in the optimization of the system’s internal parameters. This was also influenced by the proprietary feature generation and the corresponding classification process using Gaussian mixture models. This allowed the development of a system that makes an important contribution to the current state of the art in speaker recognition systems for telephone transmission applications with known speech coding standards.
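The genetic feature selection at the heart of this system can be sketched as a simple binary genetic algorithm. The fitness function below is a clearly marked stand-in for training and scoring the GMM-based recogniser on the masked feature set; everything else is generic GA machinery, not the authors' implementation.

```python
# A compact binary GA for feature selection: chromosomes mask features,
# fitness is (in the real system) the recogniser's accuracy on that mask.
import numpy as np

rng = np.random.default_rng(3)
N_FEATS, POP_SIZE, GENERATIONS = 40, 24, 30

def fitness(mask):
    # Stand-in only: replace with "train the GMM system on the masked
    # cepstral features and return its verification accuracy".
    return rng.random() if mask.any() else 0.0

pop = rng.integers(0, 2, size=(POP_SIZE, N_FEATS), dtype=bool)
for _ in range(GENERATIONS):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-POP_SIZE // 2:]]          # truncation selection
    cuts = rng.integers(1, N_FEATS, size=POP_SIZE // 2)
    children = np.array([np.concatenate([parents[i % len(parents)][:c],
                                         parents[(i + 1) % len(parents)][c:]])
                         for i, c in enumerate(cuts)])          # one-point crossover
    children ^= rng.random(children.shape) < 0.02               # bit-flip mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected feature indices:", np.flatnonzero(best))
```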
24. Singh, Nilu, Alka Agrawal, and R. A. Khan. "Automatic Speaker Recognition: Current Approaches and Progress in Last Six Decades." Global Journal of Enterprise Information System 9, no. 3 (September 27, 2017): 45. http://dx.doi.org/10.18311/gjeis/2017/15973.

Abstract:
Automatic speaker recognition is the process of recognizing a speaker automatically from his or her speech/voice on the basis of specific characteristics of the speech signal. These voice-specific characteristics are called speech features. Over the past six decades many advances in the area of speaker recognition have been achieved, but many problems remain to be solved or require better solutions. The main problems in speaker recognition are session variability, channel mismatch, and the recording conditions of the voice. To develop an efficient speaker recognition system, one needs to examine voice feature parameters that are stable over time, unaffected by variation in speaking, background noise, and channel distortion, and robust against variation due to physical problems. This paper overviews recent advances and the general ideas of speaker recognition technology.
25. Radan, N. H., and K. Sidorov. "Development and Research of a System for Automatic Recognition of the Digits Yemeni Dialect of Arabic Speech Using Neural Networks." Proceedings of Telecommunication Universities 9, no. 5 (November 14, 2023): 35–42. http://dx.doi.org/10.31854/1813-324x-2023-9-5-35-42.

Abstract:
The article describes the results of research on the development and testing of an automatic speech recognition system (SAR) for Arabic digits using artificial neural networks. Sound recordings (speech signals) of the Yemeni dialect of Arabic, recorded in the Republic of Yemen, were used for the research. The SAR is an isolated whole-word recognition system, implemented in two modes: a "speaker-dependent system" (the same speakers are used for training and testing the system) and a "speaker-independent system" (the speakers used for training the system differ from those used for testing it). In the process of speech recognition, the speech signal is cleared of noise using filters; then the signal is pre-localized, processed, and analyzed with a Hamming window (a time alignment algorithm is used to compensate for differences in pronunciation). Informative features are extracted from the speech signal using mel-frequency cepstral coefficients. The developed SAR provides high accuracy in the recognition of Arabic digits of the Yemeni dialect: 96.2% (for the speaker-dependent system) and 98.8% (for the speaker-independent system).
26. Singh, Mahesh K., S. Manusha, K. V. Balaramakrishna, and Sridevi Gamini. "Speaker Identification Analysis Based on Long-Term Acoustic Characteristics with Minimal Performance." International Journal of Electrical and Electronics Research 10, no. 4 (December 30, 2022): 848–52. http://dx.doi.org/10.37391/ijeer.100415.

Abstract:
The identity of speakers depends on the phonological properties acquired from their speech, and Mel-Frequency Cepstral Coefficients (MFCCs) are well researched for deriving these acoustic characteristics. This speaker model is based on a sparse representation and the characteristics of the acoustic features, which are derived from the speaker model and its cartographic representation by the MFCCs. The MFCCs are used for text-independent speaker monitoring. Because recognizing speakers from a sparse representation is problematic, the Gaussian Mixture Model (GMM) mean supervector kernel is proposed for training. Unknown vector modules are resolved using sparsity, with experiments based on the TIMIT database. The i-vector algorithm is proposed for the effective improvement of ASR (Automatic Speaker Recognition). The Atom Aligned Sparse Representation (AASR) is used to describe the speaker-based model, and the Sparse Representation Classification (SRC) is used to describe the speaker recognition report. A robust sparse coding based on Maximum Likelihood Estimation (MLE) is used to address the problems of sparse representation. Robust speaker verification is based on a sparse representation of GMM supervectors.
27. Talbot, Mike. "Adapting to the speaker in automatic speech recognition." International Journal of Man-Machine Studies 27, no. 4 (October 1987): 449–57. http://dx.doi.org/10.1016/s0020-7373(87)80008-1.

28. Basztura, Czesław. "Experiments of automatic speaker recognition in open sets." Speech Communication 10, no. 2 (June 1991): 117–27. http://dx.doi.org/10.1016/0167-6393(91)90035-r.

29. Boyer, A., J. Di Martino, P. Divoux, J. P. Haton, J. F. Mari, and K. Smaili. "Statistical methods in multi-speaker automatic speech recognition." Applied Stochastic Models and Data Analysis 6, no. 3 (September 1990): 143–55. http://dx.doi.org/10.1002/asm.3150060302.

30. Castellano, Pierre, Stefan Slomka, and Peter Barger. "Gender Gates for Telephone-Based Automatic Speaker Recognition." Digital Signal Processing 7, no. 2 (April 1997): 65–79. http://dx.doi.org/10.1006/dspr.1997.0276.

31. Srivastav, Shivangi, and Rajiv Ranjan Tewari. "Efficient Approach of Automatic Speech Emotion Recognition (ASR) Using Mutual Information." Information Technology in Industry 9, no. 1 (March 10, 2021): 595–603. http://dx.doi.org/10.17762/itii.v9i1.177.

Abstract:
Speech is a significant quality for distinguishing a person in daily human-to-human interaction and communication. Like other biometric measures, such as the face, iris, and fingerprints, the voice can therefore be used as a biometric measure for perceiving or identifying a person. Speaker recognition is essentially a kind of voice recognition in which the speaker is identified from the expression instead of the message. Automatic Speaker Recognition (ASR) is the way to identify people based on features extracted from speech utterances. Speech signals are a rich communication medium that constantly conveys useful information, such as a speaker's emotion, gender, accent, and other distinctive attributes. In any speaker identification, the essential task is to extract helpful features and allow for significant examination of speaker models. A theoretical description, an organization of the full range of emotional states, and the modalities of emotional articulation are added. A SER framework is developed to conduct this investigation, in view of different classifiers and different techniques for extracting features. In this work various machine learning algorithms are investigated to identify the decision boundary in the feature space of audio signals. Moreover, the novelty of this work lies in improving the performance of classical machine learning algorithms using information-theory-based feature selection methods. The highest accuracy obtained is 96 percent, using the random forest algorithm together with the Joint Mutual Information feature selection method.
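The information-theoretic feature selection plus random forest combination reported above can be approximated with scikit-learn. Note that SelectKBest with univariate mutual information is a simplification of the Joint Mutual Information criterion the paper uses, and the feature matrix and labels below are random placeholders.

```python
# Rank acoustic features by mutual information with the emotion label,
# keep the top k, and train a random forest on the reduced set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 60))        # e.g. 60 prosodic/spectral features
y = rng.integers(0, 4, size=300)      # e.g. 4 emotion classes

clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=20),   # univariate MI, a stand-in for JMI
    RandomForestClassifier(n_estimators=200, random_state=0),
)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```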
32. Kwasny, Damian, and Daria Hemmerling. "Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks." Sensors 21, no. 14 (July 13, 2021): 4785. http://dx.doi.org/10.3390/s21144785.

Abstract:
The speech signal contains a vast spectrum of information about the speaker, such as the speaker’s gender, age, accent, or health state. In this paper, we explored different approaches to an automatic gender classification and age estimation system using speech signals. We applied various Deep Neural Network-based embedder architectures, such as x-vector and d-vector, to the age estimation and gender classification tasks. Furthermore, we applied a transfer learning-based training scheme, pre-training the embedder network for a speaker recognition task using the VoxCeleb1 dataset and then fine-tuning it for the joint age estimation and gender classification task. The best performing system achieves new state-of-the-art results on the age estimation task using the popular TIMIT dataset, with a mean absolute error (MAE) of 5.12 years for male and 5.29 years for female speakers, a root-mean-square error (RMSE) of 7.24 and 8.12 years for male and female speakers, respectively, and an overall gender recognition accuracy of 99.60%.
33. Koo, J. M., H. S. Kim, and C. K. Un. "A Korean Large Vocabulary Speech Recognition System for Automatic Telephone Number Query Service." International Journal of Pattern Recognition and Artificial Intelligence 8, no. 1 (February 1994): 215–32. http://dx.doi.org/10.1142/s0218001494000103.

Abstract:
In this paper, we introduce a Korean large vocabulary speech recognition system. This system recognizes sentence utterances with a vocabulary size of 1160 words, and is designed for an automatic telephone number query service. The system consists of four subsystems. The first is an acoustic processor recognizing words in an input sentence by a Hidden Markov Model (HMM) based speech recognition algorithm. The second subsystem is a linguistic processor which estimates input sentences from the results of the acoustic processor and determines the following words using syntactic information. The third is a time reduction processor reducing recognition time by limiting the number of candidate words to be computed by the acoustic processor. The time reduction processor uses linguistic information and acoustic information contained in the input sentence. The last subsystem is a speaker adaptation processor which quickly adapts parameters of the speech recognition system to new speakers. This subsystem uses VQ adaptation and HMM parameter adaptation based on spectral mapping. We also present our recent work on improving the performance of the large vocabulary speech recognition system. These works focused on the enhancement of the acoustic processor and the time reduction processor for speaker-independent speech recognition. A new approach for speaker adaptation is also described.
34. Lee, Yun Kyung, and Jeon Gue Park. "Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech." Applied Sciences 11, no. 6 (March 16, 2021): 2642. http://dx.doi.org/10.3390/app11062642.

Abstract:
This paper addresses an automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker’s spoken English. Stress and rhythm scores are one of the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions to those of native speakers. In order to compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker’s English sentence is different from the native speaker’s one, we align the phonemic sequences based on a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. Also, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy for all proficiency areas compared to a method employing conventional acoustic models.
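The dynamic time-warping alignment the authors use to match L2 and native phoneme sequences can be written in a few lines. The 0/1 symbol-mismatch cost and the toy phoneme strings below are simplifications for illustration; the paper's actual features and cost differ.

```python
# Bare-bones DTW between two symbol sequences with a 0/1 substitution cost.
import numpy as np

def dtw_align(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match/substitution
    return D[n, m]  # total alignment cost

native = ["HH", "AH", "L", "OW"]        # hypothetical native phoneme string
learner = ["HH", "EH", "L", "L", "OW"]  # hypothetical L2 phoneme string
print("alignment cost:", dtw_align(native, learner))
```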
35. Zergat, Kawthar Yasmine, and Abderrahmane Amrouche. "SVM against GMM/SVM for Dialect Influence on Automatic Speaker Recognition Task." International Journal of Computational Intelligence and Applications 13, no. 2 (June 2014): 1450012. http://dx.doi.org/10.1142/s1469026814500126.

Abstract:
A key issue for current research on automatic speaker recognition is the effectiveness of the speaker modeling techniques, because talkers have their own speaking styles, depending on their specific accents and dialects. This paper investigates the influence of the dialect and of the size of the database on the text-independent speaker verification task using SVM and hybrid GMM/SVM speaker modeling. The Principal Component Analysis (PCA) technique is used in the front-end part of the speaker recognition system in order to extract the most representative features. Experimental results show that the size of the database has an important impact on the SVM and GMM/SVM based speaker verification performance, while the dialect has no significant effect. Applying PCA dimensionality reduction improves the recognition accuracy for both SVM and GMM/SVM based recognition systems. However, it did not yield a clear observation about the dialect effect.
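The PCA front-end plus SVM classifier studied here maps directly onto a scikit-learn pipeline. The feature matrix, speaker labels, and hyperparameters below are illustrative, not the paper's corpus or tuning.

```python
# PCA dimensionality reduction feeding an RBF-kernel SVM speaker classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 120))     # placeholder utterance-level feature vectors
y = rng.integers(0, 10, size=200)   # 10 enrolled speakers

model = make_pipeline(StandardScaler(), PCA(n_components=40), SVC(kernel="rbf"))
model.fit(X[:150], y[:150])
print("held-out accuracy:", model.score(X[150:], y[150:]))
```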
36. Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.

Abstract:
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI’s Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system’s ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
37. Abakarim, Fadwa, and Abdenbi Abenaou. "Comparative study to realize an automatic speaker recognition system." International Journal of Electrical and Computer Engineering (IJECE) 12, no. 1 (February 1, 2022): 376. http://dx.doi.org/10.11591/ijece.v12i1.pp376-382.

Abstract:
In this research, we present an automatic speaker recognition system based on adaptive orthogonal transformations. To obtain the informative features with a minimum dimension from the input signals, we created an adaptive operator, which helped to identify the speaker’s voice in a fast and efficient manner. We test the efficiency and the performance of our method by comparing it with another approach, mel-frequency cepstral coefficients (MFCCs), which is widely used by researchers as their feature extraction method. The experimental results show the importance of creating the adaptive operator, which gives added value to the proposed approach. The performance of the system achieved 96.8% accuracy using Fourier transform as a compression method and 98.1% using Correlation as a compression method.
38. Jokić, Ivan, Stevan Jokić, Vlado Delić, and Zoran Perić. "One Solution of Extension of Mel-Frequency Cepstral Coefficients Feature Vector for Automatic Speaker Recognition." Information Technology And Control 49, no. 2 (June 16, 2020): 224–36. http://dx.doi.org/10.5755/j01.itc.49.2.22258.

Abstract:
One extension of the feature vector for automatic speaker recognition is considered in this paper. The starting feature vector consisted of 18 mel-frequency cepstral coefficients (MFCCs). The extension was done with two additional features derived from the spectrum of the speech signal. The main idea that generated this research is that it is possible to increase the efficiency of automatic speaker recognition by constructing a feature vector which tracks the actually perceived spectrum in the observed speech. The additional features are based on the energy maxima in the appropriate frequency ranges of the observed speech frames. In experiments, accuracy and equal error rate (EER) are compared in the case when feature vectors contain only 18 MFCCs and in cases when the additional features are used. Recognition accuracy increased by around 3%. Values of EER show smaller differentiation, but the results show that adding the proposed additional features produced a lower decision threshold. These results indicate that tracking real occurrences in the spectrum of the speech signal leads to a more efficient automatic speaker recognizer. Determining features which track real occurrences in the speech spectrum will improve the procedure of automatic speaker recognition and make it possible to avoid complex models.
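The feature-vector extension described above, appending spectral energy-maximum information to the MFCCs, might look roughly like this. The band edges and file name are assumptions for illustration, since the paper defines its own frequency ranges.

```python
# Append, to each frame's 18 MFCCs, the frequency of the spectral energy
# maximum in two assumed bands, yielding a 20-dimensional feature vector.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical recording
S = np.abs(librosa.stft(y))                         # default n_fft=2048, hop=512
freqs = librosa.fft_frequencies(sr=sr)              # bin centre frequencies
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=18)  # same default framing

def peak_freq(lo, hi):
    band = (freqs >= lo) & (freqs < hi)
    return freqs[band][S[band].argmax(axis=0)]      # per-frame energy maximum

extra = np.vstack([peak_freq(300, 1000),            # assumed lower band
                   peak_freq(1000, 4000)])          # assumed upper band
features = np.vstack([mfcc, extra])                 # (20, frames)
print(features.shape)
```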
39. Li, Wenjie, Pengyuan Zhang, and Yonghong Yan. "TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition." Electronics Letters 55, no. 14 (July 2019): 816–19. http://dx.doi.org/10.1049/el.2019.1228.

40. Künzel, Hermann, and Paul Alexander. "Forensic Automatic Speaker Recognition with Degraded and Enhanced Speech." Journal of the Audio Engineering Society 62, no. 4 (April 16, 2014): 244–53. http://dx.doi.org/10.17743/jaes.2014.0014.

41. Chaudhari, Shivaji J., and Ramesh M. Kagalkar. "Automatic Speaker Age Estimation and Gender Dependent Emotion Recognition." International Journal of Computer Applications 117, no. 17 (May 20, 2015): 5–10. http://dx.doi.org/10.5120/20644-3383.

42. Alexander, Anil, Damien Dessimoz, Filippo Botti, and Andrzej Drygajlo. "Aural and automatic forensic speaker recognition in mismatched conditions." International Journal of Speech, Language and the Law 12, no. 2 (December 2005): 214–34. http://dx.doi.org/10.1558/sll.2005.12.2.214.

43. Roberts, Linda A., Jay G. Wilpon, Dennis E. Egan, and Jean Bakk. "Improving speaker consistency in an automatic speech recognition framework." Computer Speech & Language 1, no. 1 (March 1986): 61–93. http://dx.doi.org/10.1016/s0885-2308(86)80011-0.

44. Mahmood, Awais, Mansour Alsulaiman, and Ghulam Muhammad. "Automatic Speaker Recognition Using Multi-Directional Local Features (MDLF)." Arabian Journal for Science and Engineering 39, no. 5 (April 9, 2014): 3799–811. http://dx.doi.org/10.1007/s13369-014-1048-0.

45. van Leeuwen, David A., Alvin F. Martin, Mark A. Przybocki, and Jos S. Bouten. "NIST and NFI-TNO evaluations of automatic speaker recognition." Computer Speech & Language 20, no. 2-3 (April 2006): 128–58. http://dx.doi.org/10.1016/j.csl.2005.07.001.

46. Qin, Yuqiang, and Yudong Qi. "EEMD-Based Speaker Automatic Emotional Recognition in Chinese Mandarin." Applied Mathematics & Information Sciences 8, no. 2 (March 1, 2014): 617–24. http://dx.doi.org/10.12785/amis/080219.

47. San Segundo, Eugenia, and Hermann Künzel. "Automatic speaker recognition of spanish siblings: (monozygotic and dizygotic) twins and non-twin brothers." Loquens 2, no. 2 (December 30, 2015): e021. http://dx.doi.org/10.3989/loquens.2015.021.

48. Singh, Satyanand. "High Level Speaker Specific Features as an Efficiency Enhancing Parameters in Speaker Recognition System." International Journal of Electrical and Computer Engineering (IJECE) 9, no. 4 (August 1, 2019): 2443. http://dx.doi.org/10.11591/ijece.v9i4.pp2443-2450.

Abstract:
In this paper, I present high-level speaker-specific feature extraction considering intonation, linguistic rhythm, linguistic stress, and prosodic features directly from speech signals. I assume that rhythm is related to language units such as syllables and appears as changes in measurable parameters such as fundamental frequency (F0), duration, and energy. In this work, syllable-type features are selected as the basic unit for expressing the prosodic features. The approximate segmentation of continuous speech into syllable units is achieved by automatically locating the vowel starting point. The knowledge of high-level speaker-specific characteristics is used as a reference for extracting the prosodic features of the speech signal. High-level speaker-specific features extracted using this method may be useful in applications such as speaker recognition where explicit phoneme/syllable boundaries are not readily available. The efficiency of the specific features used for automatic speaker recognition was evaluated on the TIMIT and HTIMIT corpora, with TIMIT initially sampled at 16 kHz and downsampled to 8 kHz. In the experiments, the baseline discriminative system and the HMM system are built on the TIMIT corpus with a set of 48 phonemes. The proposed ASR system shows efficiency improvements of 1.99%, 2.10%, 2.16%, and 2.19% compared to the traditional ASR system for 16 kHz TIMIT utterances.
49. Geng, Puyang, Qimeng Lu, Hong Guo, and Jinhua Zeng. "The effects of face mask on speech production and its implication for forensic speaker identification-A cross-linguistic study." PLOS ONE 18, no. 3 (March 30, 2023): e0283724. http://dx.doi.org/10.1371/journal.pone.0283724.

Abstract:
This study aims to understand the effects of face masks on speech production in Mandarin Chinese and English, and on the automatic classification of mask/no-mask speech and of individual speakers. A cross-linguistic study on mask speech between Mandarin Chinese and English was conducted. Continuous speech of phonetically balanced texts in both Chinese and English versions was recorded from thirty native speakers of Mandarin Chinese (i.e., 15 males and 15 females) with and without a surgical mask. The results of acoustic analyses showed that mask speech exhibited higher F0, intensity, and HNR, and lower jitter and shimmer, than no-mask speech for Mandarin Chinese, whereas higher HNR and lower jitter and shimmer were observed for English mask speech. The results of classification analyses showed that, based on four supervised learning algorithms (i.e., Linear Discriminant Analysis, Naïve Bayes Classifier, Random Forest, and Support Vector Machine), undesirable performances (i.e., lower than 50%) in classifying the speech with and without a face mask, and highly variable accuracies (i.e., ranging from 40% to 89.2%) in identifying individual speakers, were achieved. These findings imply that speakers tend to make acoustic adjustments to improve their speech intelligibility when wearing a surgical mask. However, a cross-linguistic difference in compensation strategies was observed: Mandarin speech was produced with higher F0, intensity, and HNR, while English was produced with higher HNR. Besides, the highly variable accuracies of speaker identification suggest that surgical masks would impact the general performance of automatic speaker recognition. Overall, wearing a surgical mask appears to affect both acoustic-phonetic and automatic speaker recognition approaches to some extent, suggesting particular caution in the real-case practice of forensic speaker identification.
50. Bezoui, Mouaz. "Speech Recognition of Moroccan Dialect Using Hidden Markov Models." IAES International Journal of Artificial Intelligence (IJ-AI) 8, no. 1 (March 1, 2019): 7. http://dx.doi.org/10.11591/ijai.v8.i1.pp7-13.

Abstract:
This paper addresses the development of an Automatic Speech Recognition (ASR) system for the Moroccan Dialect. Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. In fact, the Moroccan Dialect is very different from Modern Standard Arabic (MSA) because it is highly influenced by the French language. It is observed throughout all Arab countries that standard Arabic is widely written and used for official speech, newspapers, public administration, and school, but not used in everyday conversation, while dialect is widely spoken in everyday life but almost never written. We propose to use Mel Frequency Cepstral Coefficient (MFCC) features to build the speaker identification system. The extracted speech features are quantized to a number of centroids using a vector quantization algorithm. These centroids constitute the codebook of that speaker. MFCCs are calculated in the training phase and again in the testing phase. Speakers uttered the same words once in a training session and once in a later testing session. The Euclidean distance between the MFCCs of each speaker in the training phase and the centroids of the individual speakers in the testing phase is measured, and the speaker is identified according to the minimum Euclidean distance. The code is developed in the MATLAB environment and performs the identification satisfactorily.
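The VQ recipe in this abstract, a per-speaker k-means codebook scored by minimum average quantisation distortion, can be sketched outside MATLAB as well. The speaker names and file paths below are placeholders, and the codebook size is an assumption.

```python
# Per-speaker VQ codebooks from training MFCCs; identification picks the
# codebook with the smallest average Euclidean quantisation distortion.
import librosa
import numpy as np
from scipy.cluster.vq import kmeans, vq

def mfccs(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T.astype(np.float64)

# Training phase: one codebook (here 16 centroids) per enrolled speaker.
codebooks = {name: kmeans(mfccs(f"{name}_train.wav"), 16)[0]
             for name in ["alice", "bob"]}          # placeholder speakers

# Testing phase: average distance of test frames to each speaker's codebook.
test = mfccs("unknown.wav")
scores = {name: vq(test, cb)[1].mean() for name, cb in codebooks.items()}
print("identified:", min(scores, key=scores.get))
```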