Thesis: "Deepfake detection"

Tak, Hemlata. "End-to-End Modeling for Speech Spoofing and Deepfake Detection". Electronic Thesis or Diss., Sorbonne université, 2023. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2023SORUS104.pdf.

Abstract

Voice biometric systems are used in various applications for secure user authentication based on automatic speaker verification technology. However, these systems are vulnerable to spoofing attacks, which have become even more challenging with recent advances in artificial intelligence algorithms. There is hence a need for more robust and efficient detection techniques. This thesis proposes novel detection algorithms which are designed to perform reliably in the face of the highest-quality attacks. The first contribution is a non-linear ensemble of sub-band classifiers, each of which uses a Gaussian mixture model. Competitive results show that models which learn sub-band-specific discriminative information can substantially outperform models trained on full-band signals. Given that deep neural networks (DNNs) are more powerful and can perform both feature extraction and classification, the second contribution is a RawNet2 model. It is an end-to-end (E2E) model which learns features directly from the raw waveform. The third contribution includes the first use of graph neural networks (GNNs) with an attention mechanism to model the complex relationship between spoofing cues present in the spectral and temporal domains. We propose an E2E spectro-temporal graph attention network called RawGAT-ST. The RawGAT-ST model is further extended to an integrated spectro-temporal graph attention network, named AASIST, which exploits the relationship between heterogeneous spectral and temporal graphs. Finally, this thesis proposes a novel data augmentation technique called RawBoost and uses a self-supervised, pre-trained speech model as a front-end to improve generalisation under in-the-wild conditions.