In an article recently submitted to the arXiv* preprint server, researchers addressed the challenge that speech accents pose for automatic speech recognition (ASR) systems. They proposed a novel approach to accent adaptation in end-to-end ASR, using trainable codebooks and cross-attention mechanisms to capture accent-specific information.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Testing on the Mozilla Common Voice dataset showed significant performance gains on both seen and unseen English accents, and the method also proved advantageous in zero-shot transfer setups. The study highlights the potential to improve the inclusivity and performance of ASR across diverse accents.
Background
Accents in speech arise from many influences and pose challenges for training ASR systems. Existing solutions have made progress, but robustly handling diverse accents at both training and test time remains a complex problem. Prior work on accent handling for ASR falls into accent-agnostic and accent-aware approaches. Accent-agnostic methods minimize the influence of accents, for example through adversarial training or similarity losses. Accent-aware approaches supply additional accent-related information to the model during training, such as accent-specific auxiliary tasks, embeddings, or fusion methods.
Enhanced ASR Architecture and Accent Handling
The methodology builds on a base ASR architecture comprising an encoder, an attention-based decoder (DEC-ATT), and a Connectionist Temporal Classification module (DEC-CTC), which together process spoken-language input. The work introduces three key modifications to this core model to address the challenge of handling diverse speech accents.
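As context, hybrid CTC/attention models of this kind typically decode by interpolating the scores of the attention decoder and the CTC module. Below is a minimal sketch of that score combination; the function name and the 0.3 weight are illustrative assumptions, not values from the paper.

```python
import torch

def joint_score(log_p_att: torch.Tensor,
                log_p_ctc: torch.Tensor,
                ctc_weight: float = 0.3) -> torch.Tensor:
    # Weighted interpolation of the attention-decoder and CTC
    # log-probabilities for a hypothesis (ESPnet-style joint decoding).
    # `ctc_weight` is a hypothetical hyperparameter for illustration.
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att
```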
Codebook Construction: The accents present in the training data are denoted as the M seen accents. The researchers create M codebooks, one tailored to each seen accent, containing trainable vectors (codebook entries) designed to capture accent-specific information. During training, a deterministic gating mechanism selects the codebook matching the accent label of the training example, enabling the ASR model to learn accent-specific representations. For inference, where accent labels may be absent, they devise a codebook-selection strategy that lets the model adapt to varying accents even without labels.
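A minimal sketch of the codebook-and-gate idea in PyTorch; the dimensions and initialization scale are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class AccentCodebooks(nn.Module):
    def __init__(self, num_accents: int, num_entries: int, d_model: int):
        super().__init__()
        # One trainable codebook of `num_entries` vectors per seen accent.
        self.codebooks = nn.Parameter(
            torch.randn(num_accents, num_entries, d_model) * 0.02
        )

    def forward(self, accent_id: int) -> torch.Tensor:
        # Deterministic gating: return the codebook that matches the
        # accent label of the current training example.
        return self.codebooks[accent_id]  # (num_entries, d_model)
```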
Encoder with Accent Codebooks: The encoder is substantially modified to integrate accent-specific information into the ASR model. A new accent-aware encoder module, the Encoder with Accent Codebooks (ENCa), comprises a stack of L identical Conformer layers, each incorporating a cross-attention sub-layer that can access the accent codebooks. The codebooks are shared across all encoder layers, ensuring a consistent understanding of accent-specific nuances, and the cross-attention sub-layer lets the encoder extract relevant information from the codebook entries, enhancing its ability to capture the characteristics of different accents.
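A minimal sketch, not the paper's exact layer, of such a cross-attention sub-layer: encoder frames attend over the selected codebook's entries as keys and values. The residual-and-norm wiring is an assumption, and the remaining Conformer sub-layers (self-attention, convolution, feed-forward) are omitted for brevity.

```python
import torch
import torch.nn as nn

class CodebookAttentionLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frames: torch.Tensor,
                codebook: torch.Tensor) -> torch.Tensor:
        # frames:   (batch, time, d_model) encoder states
        # codebook: (num_entries, d_model) accent-specific entries,
        #           shared across all L encoder layers
        kv = codebook.unsqueeze(0).expand(frames.size(0), -1, -1)
        attended, _ = self.cross_attn(query=frames, key=kv, value=kv)
        return self.norm(frames + attended)  # residual connection
```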
Modified Beam-Search Algorithm: The standard beam-search algorithm is modified at inference time into a joint beam search over all seen accents, so that every seen accent is considered when expanding hypotheses. Each hypothesis is scored using the codebook of its associated accent, allowing the model to transcribe accented speech accurately even when no accent labels are available. Notably, the researchers discuss the limitations of instead using a classifier to predict accent labels at inference time, owing to the imbalanced accent distribution observed during training.
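A minimal sketch of the joint-beam-search idea; `decode_step` is a hypothetical callable standing in for one expansion step of the real decoder, and the default beam size and length limit are illustrative.

```python
def joint_beam_search(decode_step, accent_ids, beam_size=5, max_len=50):
    """Minimal sketch of joint beam search over seen accents.

    `decode_step(tokens, accent_id)` is a hypothetical helper returning
    (next_token, log_prob) expansions scored with the codebook of
    `accent_id`; it stands in for one step of the real decoder.
    """
    # Each beam item remembers which accent codebook produced it.
    beams = [((), 0.0, a) for a in accent_ids]  # (tokens, score, accent)
    for _ in range(max_len):
        candidates = []
        for tokens, score, accent in beams:
            for token, logp in decode_step(tokens, accent):
                candidates.append((tokens + (token,), score + logp, accent))
        # Prune jointly across accents: keep the globally best hypotheses.
        beams = sorted(candidates, key=lambda b: b[1],
                       reverse=True)[:beam_size]
    return beams[0]  # best hypothesis and the accent that scored it
```

Because each hypothesis carries its accent, pruning happens jointly across accents, which keeps the search cheaper than running a separate full beam search per seen accent.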
Experimental Setup and Results
In the experimental setup, the study used the Mozilla Common Voice Accent (MCV_ACCENT) dataset, derived from the Mozilla Common Voice English corpus and encompassing 14 English accents categorized as seen or unseen. The data splits ensured speaker-disjoint training, development, and test sets, with two training sets of differing sizes. All experiments were conducted with the End-to-End Speech Processing Toolkit (ESPnet) using a Conformer model with specific configurations. The results showed that the proposed codebook attend (CA) system consistently outperformed competing approaches, achieving the lowest word error rates (WERs) on both seen and unseen accents.
Additionally, zero-shot transfer evaluations on the L2Arctic dataset illustrated the CA system's adaptability to new datasets. The CA approach remained superior on both seen and unseen accents even when trained on the larger 600-hour MCV_ACCENT set. Ablation studies revealed that the number of codebook entries per accent and the encoder layers at which cross-attention is applied significantly affect the model's effectiveness.
Finally, a comparison of beam-search decoding variants showed the CA system's joint beam search to be a good compromise between performance and inference overhead, outperforming alternative decoding strategies. Overall, the experimental framework demonstrated the codebook-based approach's effectiveness in handling diverse accents, achieving strong results on both seen and unseen accents while maintaining reasonable computational efficiency.
Conclusion and Future Directions
This study introduces a novel end-to-end approach for accented ASR, leveraging accent-specific codebooks and cross-attention mechanisms. The extensive experiments on the Mozilla Common Voice corpus demonstrate significant performance gains for both seen and unseen accents. Furthermore, this approach paves the way for encoding various non-semantic cues in speech, such as noise types, dialects, and emotional speech styles, which can impact ASR performance.
While the method shows promise, several limitations remain. The codebook size requires careful tuning, and exploring a single large codebook with learnable gates for accent selection may offer better scalability and sharing of codebook entries. Future work should also reduce the additional inference time incurred by the joint beam search. The authors further acknowledge room for improvement in accommodating mix-and-match effects across different seen accents within a single utterance.