In an article published in the journal Scientific Reports, researchers explored the complex interaction between speech pathology and the performance of deep learning-based automatic speaker verification (ASV) systems. The work focuses on understanding how different speech disorders affect the accuracy of ASV systems.
By examining these effects closely, the research identifies potential weaknesses in such systems and deepens our understanding of speaker identification under varied conditions. The researchers used a real-world dataset comprising approximately 200 hours of healthy and pathological recordings from both adults and children.
Background
ASV technology plays a crucial role in confirming the identity of speakers through voice analysis. Such systems operate in two phases: enrollment, in which a reference voiceprint is built from a speaker's recordings, and verification, in which a new utterance is compared against that voiceprint. ASV technology is commonly used in security applications and voice-controlled devices.
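To make the two phases concrete, here is a minimal Python sketch of how an embedding-based ASV system might enroll and verify a speaker. The `encode` function is a placeholder, not the paper's model: a real system would use a trained neural encoder whose embeddings cluster by speaker, which this stub does not do.

```python
import numpy as np

def encode(utterance: np.ndarray) -> np.ndarray:
    # Stand-in for a trained speaker encoder mapping raw audio to a
    # fixed-length embedding (a "d-vector"); illustrative only.
    rng = np.random.default_rng(int(np.abs(utterance).sum() * 1e6) % (2**32))
    emb = rng.standard_normal(256)
    return emb / np.linalg.norm(emb)

def enroll(utterances: list[np.ndarray]) -> np.ndarray:
    # Enrollment: average several utterance embeddings into a single
    # reference voiceprint for the speaker.
    centroid = np.stack([encode(u) for u in utterances]).mean(axis=0)
    return centroid / np.linalg.norm(centroid)

def verify(utterance: np.ndarray, voiceprint: np.ndarray,
           threshold: float = 0.7) -> bool:
    # Verification: accept the claimed identity if the cosine similarity
    # between the test embedding and the voiceprint exceeds a threshold.
    score = float(encode(utterance) @ voiceprint)
    return score >= threshold
```

The threshold trades off false acceptances against false rejections; the equal error rate discussed later is the point where the two error types balance.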
Deep learning plays a crucial role in ASV, enabling high-level feature learning from speech signals and improving accuracy and robustness. However, concerns arise about the efficacy of ASV for individuals with speech pathology, because these systems are trained primarily on datasets of healthy speakers. The present study addresses this gap by systematically examining how speech disorders influence the outcomes of deep learning-based ASV systems.
About the Research
Researchers set out to answer the question: does pathological speech, when examined as a biomarker, increase susceptibility to re-identification attacks compared to healthy speech? To that end, they used a comprehensive real-world pathological speech dataset of 3800 test subjects spanning different age groups and a variety of speech disorders. The dataset contains recordings of German speakers reading phonetically rich text or naming pictograms.
They analyzed healthy recordings alongside recordings covering a diverse range of speech disorders, including dysglossia, dysarthria, dysphonia, and cleft lip and palate (CLP), in both adults and children. Twenty repeated experiments were conducted to address potential biases, with matched age distributions and speaker numbers ensuring fair comparisons among groups; one plausible version of such a protocol is sketched below.
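The article does not spell out the exact balancing recipe, so the following Python sketch shows one plausible way to draw repeated, age-matched, equally sized speaker subsets from two groups. The function name, age bins, and stratified-sampling approach are assumptions for illustration, not the paper's protocol.

```python
import numpy as np

def age_matched_draws(ages_a: np.ndarray, ages_b: np.ndarray,
                      per_bin: int, n_repeats: int = 20, seed: int = 0):
    # Hypothetical protocol: bin speakers by age, then repeatedly draw the
    # same number of speakers per bin from both groups, yielding
    # age-matched subsets of equal size. Evaluating once per draw and
    # averaging the results reduces sampling bias.
    rng = np.random.default_rng(seed)
    bins = np.array([0, 12, 18, 40, 60, 120])  # hypothetical age bins
    for _ in range(n_repeats):
        subset_a, subset_b = [], []
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_a = np.flatnonzero((ages_a >= lo) & (ages_a < hi))
            in_b = np.flatnonzero((ages_b >= lo) & (ages_b < hi))
            k = min(per_bin, len(in_a), len(in_b))
            if k == 0:
                continue  # skip bins one group cannot fill
            subset_a.append(rng.choice(in_a, size=k, replace=False))
            subset_b.append(rng.choice(in_b, size=k, replace=False))
        yield np.concatenate(subset_a), np.concatenate(subset_b)
```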
To investigate the complex interplay between speech disorders and ASV accuracy, the researchers trained and evaluated recurrent neural network speaker verification models on the pathological and healthy speech data, using the generalized end-to-end (GE2E) loss for text-independent speaker verification (TISV).
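For intuition, here is a minimal NumPy sketch of the softmax variant of the GE2E loss (Wan et al., 2018), which the article says the model was trained with. In the actual method the encoder is a recurrent network and the scale `w` and offset `b` are learned parameters; this standalone function only illustrates the loss computation.

```python
import numpy as np

def ge2e_loss(embeddings: np.ndarray, w: float = 10.0, b: float = -5.0) -> float:
    # embeddings: (N speakers, M utterances per speaker, D dims).
    # Each utterance embedding is pulled toward its own speaker's centroid
    # and pushed away from every other speaker's centroid.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    N, M, _ = embeddings.shape
    centroids = embeddings.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    loss = 0.0
    for j in range(N):
        for i in range(M):
            e = embeddings[j, i]
            # Leave-one-out centroid for the utterance's own speaker,
            # avoiding the trivial solution of matching itself.
            own = (embeddings[j].sum(axis=0) - e) / (M - 1)
            own /= np.linalg.norm(own)
            sims = centroids @ e      # cosine similarity to every centroid
            sims[j] = own @ e
            scores = w * sims + b     # scaled, shifted similarities
            # Softmax cross-entropy against the speaker's own centroid.
            loss += -scores[j] + np.log(np.exp(scores).sum())
    return loss / (N * M)
```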
The study indicates that several factors, including the age of subjects, recording quality, microphone type, background noise, and speech intelligibility, can impact speaker verification accuracy in the ASV system.
Research Findings
The findings show that pathological speech significantly affects ASV speaker verification performance and that different speech pathologies have different effects. The study reports a low mean equal error rate (EER) of 0.89% for the entire pathological dataset, lower than values commonly found on non-pathological datasets. Speakers with pathological speech, such as adults with voice problems (dysphonia) and children with CLP, were considerably easier to identify than adults and children with regular speech.
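The EER is the operating point where the false rejection rate equals the false acceptance rate, so a lower EER means easier re-identification. The following self-contained Python sketch computes it from verification scores; the example score distributions are made up purely to show the mechanics.

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    # Sweep thresholds over all observed scores and find the point where
    # the false rejection rate (FRR) and false acceptance rate (FAR) meet.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected genuines
    far = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostors
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)

# Synthetic example: well-separated score distributions give a near-zero
# EER; a low EER like the reported 0.89% likewise means genuine and
# impostor scores barely overlap.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.05, 1000)   # same-speaker trial scores
impostor = rng.normal(0.3, 0.10, 1000)  # different-speaker trial scores
print(f"EER = {equal_error_rate(genuine, impostor):.2%}")
```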
The results also reveal that speech intelligibility does not influence ASV speaker verification performance, suggesting that such systems can operate effectively even when speech is hard to understand. Increasing the size of the training dataset further improves verification performance: with more data, the neural network learns more robust speaker representations, reducing the error rate.
Applications
This research has applications in fields such as healthcare, voice-controlled devices, access control, forensic investigation, and telecommunications. In healthcare, speech is increasingly used as a biomarker in the diagnosis, therapy, screening, and monitoring of speech and voice disorders. Beyond healthcare, the findings can help improve the security and reliability of biometric authentication applications that use voice, including access control, banking, e-commerce, voice-controlled devices, forensic investigation, and telecommunications.
Conclusion
In conclusion, this paper presents a comprehensive study of the effect of speech pathology on speaker verification with deep learning-based ASV systems, based on a large-scale dataset of pathological and healthy speech. According to the researchers, pathological speech influences speaker verification performance in different ways, depending on the type of pathology, the recording environment, the diversity of the recorded speech data, and the size of the dataset used.
The study findings show that speech intelligibility does not affect ASV speaker verification performance. The paper concludes by highlighting the importance and challenges of speech pathology in speaker verification and by suggesting directions for future work, such as extending the dataset, developing anonymization techniques, and examining individual-level differences.