Academic literature on the topic 'VoxCeleb2'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'VoxCeleb2.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "VoxCeleb2"

1

Seo, Soonshin, and Ji-Hwan Kim. "Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System." Electronics 9, no. 10 (October 17, 2020): 1706. http://dx.doi.org/10.3390/electronics9101706.

Full text
Abstract:
One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set the ResNet with the scaled channel width and layer depth as a baseline. To control the variability in the training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is used on a recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
APA, Harvard, Vancouver, ISO, and other styles
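For readers unfamiliar with the terminology in the abstract above: "deep length normalization" generally refers to constraining speaker embeddings to a fixed L2 norm during training. Below is a minimal sketch of that idea, assuming PyTorch; the embedding dimensionality and scale constant are illustrative choices, not values taken from Seo and Kim (2020).

```python
import torch
import torch.nn.functional as F

# Minimal sketch of length normalization for speaker embeddings:
# project each embedding onto the unit hypersphere, then rescale it to a
# fixed norm so that downstream softmax training stays well conditioned.
# The dimensionality (256) and scale (12.0) are illustrative, not taken
# from the cited paper.
def length_normalize(embedding: torch.Tensor, scale: float = 12.0) -> torch.Tensor:
    return scale * F.normalize(embedding, p=2, dim=-1)

embeddings = torch.randn(4, 256)                  # a batch of 4 speaker embeddings
print(length_normalize(embeddings).norm(dim=-1))  # every row now has norm == scale
```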
2

Nagrani, Arsha, Joon Son Chung, Weidi Xie, and Andrew Zisserman. "Voxceleb: Large-scale speaker verification in the wild." Computer Speech & Language 60 (March 2020): 101027. http://dx.doi.org/10.1016/j.csl.2019.101027.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Badr, Ameer A., and Alia K. Abdul Hassan. "Age Estimation in Short Speech Utterances Based on Bidirectional Gated-Recurrent Neural Networks." Engineering and Technology Journal 39, no. 1B (March 25, 2021): 129–40. http://dx.doi.org/10.30684/etj.v39i1b.1905.

Full text
Abstract:
Recently, age estimation from speech has received growing interest, as it is important for many applications such as custom call routing, targeted marketing, or user profiling. In this work, an automatic system to estimate age in short speech utterances without depending on the text is proposed. From each utterance frame, four groups of features are extracted, and then 10 statistical functionals are measured for each extracted dimension of the features, followed by dimensionality reduction using Linear Discriminant Analysis (LDA). Finally, bidirectional Gated-Recurrent Neural Networks (G-RNNs) are used to predict speaker age. Experiments are conducted on the VoxCeleb1 dataset to show the performance of the proposed system, which is the first attempt to do so for such a system. In the gender-dependent system, the Mean Absolute Error (MAE) of the proposed system is 9.25 and 10.33 years, and the Root Mean Square Error (RMSE) is 13.17 and 13.26, respectively, for female and male speakers. In the gender-independent system, the MAE of the proposed system is 10.96 years and the RMSE is 15.47. The results show that the proposed system performs well on short-duration utterances, taking into consideration the high noise ratio in the VoxCeleb1 dataset.
APA, Harvard, Vancouver, ISO, and other styles
4

Lei, Lei, and Kun She. "Identity Vector Extraction by Perceptual Wavelet Packet Entropy and Convolutional Neural Network for Voice Authentication." Entropy 20, no. 8 (August 13, 2018): 600. http://dx.doi.org/10.3390/e20080600.

Full text
Abstract:
Recently, the accuracy of voice authentication systems has increased significantly due to the successful application of the identity vector (i-vector) model. This paper proposes a new method for i-vector extraction. In the method, a perceptual wavelet packet transform (PWPT) is designed to convert speech utterances into wavelet entropy feature vectors, and a Convolutional Neural Network (CNN) is designed to estimate the frame posteriors of the wavelet entropy feature vectors. In the end, the i-vector is extracted based on those frame posteriors. The TIMIT and VoxCeleb speech corpora are used for the experiments, and the experimental results show that the proposed method can extract appropriate i-vectors, which reduce the equal error rate (EER) and improve the accuracy of the voice authentication system in clean and noisy environments.
APA, Harvard, Vancouver, ISO, and other styles
5

Mo, Jianye, and Li Xu. "Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition." Applied Sciences 10, no. 24 (December 16, 2020): 9004. http://dx.doi.org/10.3390/app10249004.

Full text
Abstract:
While traditional i-vector based methods are popular in the field of speaker recognition, deep learning has recently found more and more applications in end-to-end models due to its attractive performance. One effective practice is the integration of an attention mechanism into Convolutional Neural Networks (CNNs). In this work, a light-weight dual-path attention block is proposed by combining self-attention and the Convolutional Block Attention Module (CBAM), which helps to capture more multi-source features with negligible extra time expense. Additionally, a Weighted Cluster-Range Loss (WCRL) is proposed to enhance the identification performance of the Cluster-Range Loss (CRL) on indecisive samples. Besides, to address the low efficiency in the initial training stage of CRL, a novel Criticality-Enhancement Loss (CEL) is also presented. Both of the proposed loss functions significantly promote training efficiency and globally improve recognition performance. Experimental results are presented to show the effectiveness of the proposed scheme, which achieves a competitive top-1 accuracy of 92.0%, top-5 accuracy of 97.6%, and Equal Error Rate (EER) of 3.5% on the VoxCeleb1 dataset.
APA, Harvard, Vancouver, ISO, and other styles
6

Zeng, Xianfang, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. "Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 12757–64. http://dx.doi.org/10.1609/aaai.v34i07.6970.

Full text
Abstract:
Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for video-specific identity and the other for various poses. Inspired by that, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
APA, Harvard, Vancouver, ISO, and other styles
7

Zhong, Qinghua, Ruining Dai, Han Zhang, Yongsheng Zhu, and Guofu Zhou. "Text-independent speaker recognition based on adaptive course learning loss and deep residual network." EURASIP Journal on Advances in Signal Processing 2021, no. 1 (July 23, 2021). http://dx.doi.org/10.1186/s13634-021-00762-2.

Full text
Abstract:
Text-independent speaker recognition is widely used in identity recognition and has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. In order to improve the recognition ability of log filter bank feature vectors, a method of text-independent speaker recognition based on a deep residual network model was proposed in this paper. The deep residual network was composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer. The CASP layer could aggregate frame-level features from the ResNet into utterance-level features. Extracting speech features for each speaker using deep residual networks was a promising direction to explore, and a straightforward solution was to train the discriminative feature extraction network by using a margin-based loss function. However, a margin-based loss function often has certain limitations, such as the margins between different categories being set to the same fixed value. Thus, we used an adaptive curriculum learning loss (ACLL) to address the problem and introduced two different margin-based losses for this problem, i.e., AM-Softmax and AAM-Softmax. The proposed method was applied to the large-scale VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, and the average equal error rate (EER) reached 1.76% on the VoxCeleb1 test dataset, 1.91% on the VoxCeleb1-E test dataset, and 3.24% on the VoxCeleb1-H test dataset. Compared with related speaker recognition methods, the EER was improved by 1.11% on the VoxCeleb1 test dataset, 1.04% on the VoxCeleb1-E test dataset, and 1.69% on the VoxCeleb1-H test dataset.
APA, Harvard, Vancouver, ISO, and other styles
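For orientation, the AM-Softmax loss named in the abstract above subtracts a fixed margin from the target-class cosine similarity before the softmax. Below is a minimal, hypothetical sketch of such a classification head, assuming PyTorch; the embedding size, number of speakers, scale s, and margin m are illustrative, and the cited paper goes further by making the margin adaptive via ACLL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of an additive-margin softmax (AM-Softmax) head.
# All hyperparameters below are illustrative, not values from the cited paper.
class AMSoftmaxHead(nn.Module):
    def __init__(self, embed_dim=256, num_speakers=1000, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin m from the target-class cosine only, then scale.
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (cosine - self.m * one_hot)
        return F.cross_entropy(logits, labels)

head = AMSoftmaxHead()
loss = head(torch.randn(8, 256), torch.randint(0, 1000, (8,)))
```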
8

Liu, Yi, Liang He, Jia Liu, and Michael T. Johnson. "Introducing phonetic information to speaker embedding for speaker verification." EURASIP Journal on Audio, Speech, and Music Processing 2019, no. 1 (December 2019). http://dx.doi.org/10.1186/s13636-019-0166-8.

Full text
Abstract:
Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To improve this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs the best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves the performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
APA, Harvard, Vancouver, ISO, and other styles
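Several of the journal-article abstracts above report results as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER is commonly computed from verification scores, assuming NumPy and scikit-learn; it is not code from any of the cited works.

```python
import numpy as np
from sklearn.metrics import roc_curve

# EER: the point on the ROC curve where the false positive rate (false
# acceptance) crosses the false negative rate (false rejection, 1 - TPR).
def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))       # threshold where the two rates cross
    return float((fpr[idx] + fnr[idx]) / 2.0)   # average as a simple estimate

# Toy usage: 1 = same-speaker trial, 0 = different-speaker trial.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```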

Dissertations / Theses on the topic "VoxCeleb2"

1

Lukáč, Peter. "Verifikace osob podle hlasu bez extrakce příznaků." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2021. http://www.nusl.cz/ntk/nusl-445531.

Full text
Abstract:
Speaker verification is a field that keeps modernizing and improving, striving to meet the demands placed on it in application areas such as authorization systems, forensic analysis, etc. Improvements are driven by advances in deep learning, by the creation of new training and testing datasets, and by various speaker verification challenges and workshops. In this work, we examine models for speaker verification without feature extraction. Using raw audio as model input simplifies input processing, which lowers the computational and memory requirements and reduces the number of hyperparameters needed to produce features from the recordings, which affect the results. At present, models without feature extraction do not reach the results of models with feature extraction. We experiment with modern techniques on baseline models and try to improve their accuracy. Experiments with these techniques considerably improved the results of the baseline models, but we still did not reach the results of the improved model with feature extraction. The improvement is, however, sufficient to build a fusion with that model. Finally, we discuss the achieved results and propose improvements based on them.
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "VoxCeleb2"

1

Chung, Joon Son, Arsha Nagrani, and Andrew Zisserman. "VoxCeleb2: Deep Speaker Recognition." In Interspeech 2018. ISCA: ISCA, 2018. http://dx.doi.org/10.21437/interspeech.2018-1929.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Chung, Joon Son, Jaesung Huh, and Seongkyu Mun. "Delving into VoxCeleb: Environment Invariant Speaker Recognition." In Odyssey 2020 The Speaker and Language Recognition Workshop. ISCA: ISCA, 2020. http://dx.doi.org/10.21437/odyssey.2020-49.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Nagrani, Arsha, Joon Son Chung, and Andrew Zisserman. "VoxCeleb: A Large-Scale Speaker Identification Dataset." In Interspeech 2017. ISCA: ISCA, 2017. http://dx.doi.org/10.21437/interspeech.2017-950.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Chen, Zhengyang, Shuai Wang, and Yanmin Qian. "Multi-Modality Matters: A Performance Leap on VoxCeleb." In Interspeech 2020. ISCA: ISCA, 2020. http://dx.doi.org/10.21437/interspeech.2020-2229.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Xiao, Xiong, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, et al. "Microsoft Speaker Diarization System for the Voxceleb Speaker Recognition Challenge 2020." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413832.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Hautamäki, Rosa González, and Tomi Kinnunen. "Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data." In Interspeech 2020. ISCA: ISCA, 2020. http://dx.doi.org/10.21437/interspeech.2020-2715.

Full text
APA, Harvard, Vancouver, ISO, and other styles