Mobile communication delivers relatively clear voice output, but background noise can make that output hard to understand in loud environments. Speech intelligibility enhancement (IENH) technology was developed to solve this problem, yet traditional IENH methods introduce speech distortion and quality degradation. In a recent publication in the journal Electronics, researchers proposed an enhanced IENH framework built on a StarGAN generator with dual discriminators.
Background
The rapid progress in artificial intelligence (AI) technology and mobile communication has made telephone conversations routine, even in noisy surroundings. Noise affects both ends of a call: noise at the speaker's end is addressed by speech enhancement (SE), while noise at the listener's end is addressed by speech intelligibility enhancement (IENH).
Present research concentrates on near-end listening enhancement to improve speech clarity and quality, both of which are critical for effective communication. Traditional SE methods have relied heavily on signal processing techniques, and recent deep learning models show promising improvements over them. The transition from SE to IENH involves altering the acoustic features of the source speech signal so that it remains intelligible in the presence of noise. Early IENH research applied acoustic-masking principles but failed to preserve the naturalness of speech.
Enhancing Speech Intelligibility
Speech feature tuning in IENH algorithms falls broadly into two groups: rule-based and data-driven approaches. Rule-based methods, stemming from years of speech processing research, are fast and adapt readily to varying speech features. However, their predefined rules cannot comprehensively model complex interactions among speech features, which limits the intelligibility gains they can deliver and often yields unnatural, distorted speech.
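To make the rule-based category concrete, the following is a minimal sketch of one classic hand-crafted rule: a fixed high-frequency pre-emphasis filter that boosts consonant energy. The filter coefficient and the loudness-normalization step are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pre_emphasize(speech: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Boost high frequencies with a first-order filter: y[n] = x[n] - coeff * x[n-1]."""
    emphasized = np.append(speech[0], speech[1:] - coeff * speech[:-1])
    # Rescale so the fixed rule does not change the overall loudness.
    return emphasized * (np.abs(speech).max() / (np.abs(emphasized).max() + 1e-8))
```

Because the coefficient is fixed in advance, the same boost is applied regardless of the speech content or the noise, which illustrates why such rules can distort speech that does not match their assumptions.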
To address these concerns, contemporary speech enhancement leans toward data-driven methods, employing deep learning for improved feature modeling. Using abundant speech data, data-driven approaches build models that transform normal speech into Lombard speech, achieving speech style conversion (SSC). While conventional SSC methods require parallel corpora, non-parallel SSC methods based on techniques such as cycle-consistent generative adversarial networks (CycleGAN) and StarGAN have emerged; these improve intelligibility and naturalness and learn many-to-many mappings. StarGAN, for instance, can even account for the impact of gender differences on Lombard features, broadening its scope beyond one-to-one mappings.
Advancing speech conversion through SSC techniques
The latest non-parallel SSC technique employs a framework to convert standard input speech into Lombard-style output speech. The procedure involves a normal-to-Lombard speech conversion module comprising vocoder analysis, feature mapping, and vocoder synthesis. Initially, the input speech signal goes through vocoder analysis to extract features. These features, closely linked to Lombard-style attributes, are then transformed using a mapping model. Ultimately, the altered and unaltered features are combined within the vocoder for speech synthesis.
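The analysis-mapping-synthesis pipeline described above can be sketched in a few lines. The example below uses the WORLD vocoder (via the pyworld package) for analysis and synthesis; the `mapping_model` argument is a hypothetical placeholder for the learned normal-to-Lombard feature mapper, not the authors' network.

```python
import numpy as np
import pyworld

def convert(speech: np.ndarray, fs: int, mapping_model) -> np.ndarray:
    # 1. Vocoder analysis: decompose speech into pitch (f0),
    #    spectral envelope (sp), and aperiodicity (ap).
    f0, sp, ap = pyworld.wav2world(speech.astype(np.float64), fs)
    # 2. Feature mapping: transform the Lombard-related features
    #    (here, assumed to be pitch and spectral envelope).
    f0_out, sp_out = mapping_model(f0, sp)
    # 3. Vocoder synthesis: recombine mapped and unmapped features.
    return pyworld.synthesize(f0_out, sp_out, ap, fs)
```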
The current study builds on an earlier framework known as AdaSAStarGAN, designed to enhance speech quality, especially in noisy contexts. Following the AdaSAStarGAN architecture, a single unified generator learns the mappings between all domains. Training is driven by the interplay of adversarial, cycle-consistency, domain classification, and identity mapping losses, which together refine the transformed features. The adversarial loss improves the generator by making transformed features indistinguishable from authentic target attributes; the cycle-consistency and domain classification losses fine-tune the mapping function; and the identity mapping loss guards against the loss of essential information while preserving stylistic integrity.
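A hedged sketch of how these four losses might be combined for the generator is shown below. The loss weights and the generator/discriminator interfaces (a generator conditioned on a domain label, a discriminator returning a real/fake score and a domain classification) are assumptions for illustration in the usual StarGAN style, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, x_real, src_domain, trg_domain,
                   lambda_cyc=10.0, lambda_id=5.0, lambda_cls=1.0):
    x_fake = G(x_real, trg_domain)
    adv_out, cls_out = D(x_fake)
    # Adversarial loss: transformed features should look authentic.
    loss_adv = F.binary_cross_entropy_with_logits(adv_out, torch.ones_like(adv_out))
    # Domain classification loss: outputs should carry the target style.
    loss_cls = F.cross_entropy(cls_out, trg_domain)
    # Cycle-consistency loss: mapping back should recover the source.
    loss_cyc = F.l1_loss(G(x_fake, src_domain), x_real)
    # Identity mapping loss: mapping into the source domain changes nothing.
    loss_id = F.l1_loss(G(x_real, src_domain), x_real)
    return loss_adv + lambda_cls * loss_cls + lambda_cyc * loss_cyc + lambda_id * loss_id
```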
D2StarGAN for enhanced intelligibility and adaptability
The foundational framework, AdaSAStarGAN, seeks to improve speech quality, particularly in high-noise environments, by converting noisy speech into Lombard speech. This conversion is challenged by the assumptions it makes about the noise: in diverse, noisy settings, converting to Lombard speech can actually degrade quality. To solve these issues, the researchers developed D2StarGAN, a dual-discriminator technique that makes speech easier to understand.
They added a near-end noise module to the existing framework, which extracts features of the listener-side noise and uses them to adapt the enhanced speech, so the output adjusts itself to different kinds of noise. In situations with both far-field and near-field noise, the framework uses the measured noise levels to switch between Lombard-style and normal speech. In subjective listening assessments, the enhanced intelligibility and naturalness offered by D2StarGAN emerged as consistent themes, reinforcing its efficacy.
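One plausible way to realize this noise conditioning is sketched below, under assumed interfaces: a noise encoder summarizes the listener-side noise into an embedding that is fed to the generator alongside the speech features. `NoiseEncoder` and the `generator` interface are hypothetical modules for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NoiseEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, noise_mel):                 # (batch, frames, n_mels)
        # Average over time so the embedding describes the noise type/level.
        return self.net(noise_mel).mean(dim=1)    # (batch, emb_dim)

def enhance(generator, speech_feats, near_end_noise_mel, noise_encoder):
    noise_emb = noise_encoder(near_end_noise_mel)
    # The generator adapts its normal-to-Lombard mapping to this noise.
    return generator(speech_feats, noise_emb)
```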
The architecture proceeds through a sequence of stages. First, the far-end speech is separated into noisy and clean portions. Next, near- and far-field noise intensities are quantified and assigned corresponding labels. In parallel, key features such as temporal modulations and pitch inflections are extracted from the speech corpus. The careful coordination of these features and labels is essential for transforming conventional speech into its enhanced variant.
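The noise-intensity labeling step could be implemented as simply as quantizing a measured signal-to-noise ratio into discrete bins, as in the hedged sketch below. The bin edges are illustrative assumptions, not the thresholds used in the study.

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels from power estimates."""
    return 10.0 * np.log10(np.mean(speech ** 2) / (np.mean(noise ** 2) + 1e-12))

def noise_intensity_label(speech: np.ndarray, noise: np.ndarray,
                          edges=(0.0, 10.0, 20.0)) -> int:
    """Map SNR to a discrete label: 0 = heavy noise, ..., len(edges) = near-clean."""
    return int(np.searchsorted(edges, snr_db(speech, noise)))
```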
Regarding robustness, D2StarGAN performs well in cross-lingual and cross-speaker scenarios, highlighting its adaptability across varied languages and speakers and its potential to serve a wide range of demanding communication settings.
Conclusion
In summation, the D2StarGAN model marks a significant step forward in speech intelligibility enhancement. Its dual-discriminator, non-parallel SSC design within a robust data-driven framework holds promise for addressing real-world complexities. While considerable strides have been made, the study looks ahead to refining the comfort and stability of this technology in line with practical applications and user expectations.