In an article recently submitted to the arXiv* preprint server, researchers highlighted the impressive advancements in speech generation. These strides have led to near-human speech synthesis and the potential to revolutionize various applications. However, current models like VALL-E and SoundStorm demand extensive data and compute resources.
In contrast, the production-friendly, highly efficient, and modularized encoding (PHEME) model series offered compact, high-quality speech generation that remains data-efficient on smaller training sets, bridging the gap between quality and scalability. Moreover, simple techniques like teacher-student distillation further enhanced its performance. Audio samples and pre-trained models were made available online.
Related Work
Recent advancements in neural text-to-speech (TTS) synthesis have been pivotal for natural speech generation, which is crucial in applications like conversational artificial intelligence (AI). Leveraging deep learning architectures such as transformer-based models, together with neural audio codecs, has accelerated this progress. Conversational TTS is fundamental for creating a lifelike user experience and enhancing user satisfaction in interactions with conversational systems. However, most transformer-based TTS work has focused on controlled environments, largely disregarding the conversational nature of human speech and limiting adaptability across diverse scenarios.
PHEME's Efficient TTS Methodology
The methodology for developing PHEME encompasses three core components: speech tokenization, the T2S (text-to-semantic) component, and the A2S component, which generates acoustic tokens from semantic tokens, with an emphasis on simplicity and efficiency. Speech tokenization relies on the SpeechTokenizer model, which operates on audio signals to produce semantic and acoustic tokens organized into hierarchical residual vector quantization (RVQ) layers, facilitating the disentanglement of speech information. The T2S component learns the mapping from raw text to semantic tokens using a T5-style encoder-decoder architecture; text is preprocessed into international phonetic alphabet (IPA) phones, and semantic tokens are generated for training and inference.
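The sketch below illustrates, under simplifying assumptions, what such a T2S stage could look like: IPA phone ids (produced by an external grapheme-to-phoneme step, represented here by random ids) are mapped to semantic token ids with a small encoder-decoder built from PyTorch's generic nn.Transformer rather than an actual T5. All vocabulary sizes, dimensions, and the greedy decoding loop are illustrative, not the released PHEME configuration.

```python
# Hypothetical T2S sketch: phone ids -> semantic token ids with a small seq2seq model.
import torch
import torch.nn as nn

PHONE_VOCAB = 256          # assumed IPA phone inventory size
SEM_VOCAB = 1024           # assumed semantic-token codebook size
BOS, EOS = SEM_VOCAB, SEM_VOCAB + 1

class TinyT2S(nn.Module):
    def __init__(self, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONE_VOCAB, d_model)
        self.sem_emb = nn.Embedding(SEM_VOCAB + 2, d_model)   # +BOS/EOS
        self.seq2seq = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, SEM_VOCAB + 2)

    @torch.no_grad()
    def generate(self, phone_ids, max_len=100):
        # Greedy autoregressive decoding of semantic tokens.
        src = self.phone_emb(phone_ids)
        out = torch.full((phone_ids.size(0), 1), BOS, dtype=torch.long)
        for _ in range(max_len):
            tgt = self.sem_emb(out)
            causal_mask = self.seq2seq.generate_square_subsequent_mask(out.size(1))
            dec = self.seq2seq(src, tgt, tgt_mask=causal_mask)
            next_tok = self.head(dec[:, -1]).argmax(-1, keepdim=True)
            out = torch.cat([out, next_tok], dim=1)
            if (next_tok == EOS).all():
                break
        return out[:, 1:]  # drop BOS

phones = torch.randint(0, PHONE_VOCAB, (1, 20))   # stand-in for IPA phone ids
semantic_tokens = TinyT2S().generate(phones, max_len=16)
print(semantic_tokens.shape)
```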
During inference, the PHEME model is prompted with a speech sample and its text transcription, which are converted into semantic tokens. The A2S component adopts non-autoregressive decoding from SoundStorm, incorporating SpeechTokenizer-derived acoustic tokens and speaker embeddings. A masking strategy based on a cosine schedule determines which tokens are masked and which remain as conditioning for the model, enabling the generation of high-fidelity speech. The A2S model, built on a conformer network, is trained with a cross-entropy loss over the masked tokens of specific RVQ levels.
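A minimal sketch of the cosine-schedule masking idea follows, assuming a single RVQ level and made-up vocabulary sizes; the model call and loss are only indicated in comments because the actual conformer is not reproduced here.

```python
# Illustrative cosine-schedule masking for non-autoregressive (SoundStorm-style) training.
import math
import random
import torch

def cosine_mask(num_positions, u=None):
    """Return a boolean mask; True marks positions to be masked and predicted."""
    if u is None:
        u = random.random()                        # masking 'time' drawn uniformly
    mask_ratio = math.cos(u * math.pi / 2)         # cosine schedule, ratio in (0, 1]
    num_masked = max(1, int(mask_ratio * num_positions))
    perm = torch.randperm(num_positions)
    mask = torch.zeros(num_positions, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

# Training-style usage: mask one RVQ level of an acoustic-token sequence.
T = 200                                            # frames in the utterance
acoustic_level = torch.randint(0, 1024, (T,))      # tokens at the RVQ level being trained
mask = cosine_mask(T)
inputs = acoustic_level.clone()
inputs[mask] = 1024                                # assumed id of a special [MASK] token
# logits = model(semantic_tokens, inputs, speaker_embedding)   # hypothetical model call
# loss = F.cross_entropy(logits[mask], acoustic_level[mask])   # loss only on masked slots
print(f"masked {mask.sum().item()} of {T} positions")
```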
Decoding occurs through an iterative, level-wise parallel decoding procedure similar to SoundStorm, significantly reducing the number of forward passes. This design, together with speaker embeddings, enables one-shot and zero-shot speech generation: the model can produce speech with or without an acoustic prompt, provided a speaker embedding is supplied. This methodology aims to streamline the TTS process, enhancing efficiency and adaptability in generating natural-sounding speech across various scenarios.
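The following sketch shows the general shape of iterative, confidence-based parallel decoding (in the MaskGIT/SoundStorm family) for one RVQ level. Here, predict_logits is a random stand-in for the A2S network, and the step count and shapes are assumptions for illustration only.

```python
# Sketch of iterative parallel decoding: fill all masked slots each step,
# keep the most confident predictions, and re-mask the rest per a cosine schedule.
import math
import torch

VOCAB, MASK_ID = 1024, 1024

def predict_logits(tokens):
    # Placeholder for the conformer-based A2S model; returns random logits (T, VOCAB).
    return torch.randn(tokens.shape[0], VOCAB)

def parallel_decode_level(T, num_steps=8):
    tokens = torch.full((T,), MASK_ID, dtype=torch.long)   # start fully masked
    for step in range(num_steps):
        still_masked = tokens == MASK_ID
        probs = predict_logits(tokens).softmax(-1)
        conf, pred = probs.max(-1)
        tokens[still_masked] = pred[still_masked]           # fill every masked slot
        # Number of positions to re-mask before the next iteration (cosine schedule).
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_remask = int(ratio * T)
        if num_remask == 0 or step == num_steps - 1:
            break
        conf[~still_masked] = float("inf")                  # never re-mask kept tokens
        remask_idx = conf.topk(num_remask, largest=False).indices
        tokens[remask_idx] = MASK_ID
    return tokens

level_tokens = parallel_decode_level(T=200)
print((level_tokens == MASK_ID).sum().item(), "positions left masked")
```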
Comparison and Evaluation Summary
PHEME's comparative evaluation against MQTTS initially scrutinizes the impact of fast parallel decoding on the overall quality of conversational TTS synthesis. Despite its significantly quicker non-autoregressive inference, the small PHEME variant, with 100M parameters, outperforms the similarly sized MQTTS model trained on a matching dataset in terms of word error rate (WER) and mel-cepstral distortion (MCD). Although both models exhibit high WER due to the modest data size, PHEME demonstrates better speech diversity, higher intelligibility, and improved Fréchet inception distance (FID) scores. Some WER errors stem from misspelled proper nouns and homonyms, suggesting potential improvements with language model-assisted decoding.
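For readers unfamiliar with the intelligibility metric, the snippet below computes WER from scratch as a word-level edit distance between an ASR transcription of the synthesized speech and the reference text; the example strings are invented purely to show how a misrecognized proper noun counts as a substitution error.

```python
# Word error rate: edit distance over words, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(1, len(ref))

# One substituted proper noun out of four reference words -> WER of 0.25.
print(wer("please call doctor kowalski", "please call doctor kovalsky"))
```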
Furthermore, leveraging additional training data, the 300M PHEME variant significantly improves WER, MCD, and FID scores compared to the 100M variant. The larger variant also outperforms MQTTS and the smaller PHEME model, although it benefits from more extensive training data beyond preprocessed GigaSpeech.
Exploring efficiency metrics showcases the significant advantages of non-autoregressive decoding in PHEME. The inference speed comparison between MQTTS and PHEME demonstrates substantial efficiency gains without compromising synthesis quality. For instance, PHEME achieves a 14.5x speed-up over MQTTS for a 10-second speech utterance. The larger 300M model maintains competitive inference speed while significantly enhancing synthesis quality.
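The arithmetic behind such speed-up figures can be expressed as a real-time factor (RTF), i.e., synthesis wall-clock time divided by the duration of the generated audio. The timings in this sketch are hypothetical and only illustrate the calculation, not measured numbers from the paper.

```python
# Hypothetical timings illustrating how a 14.5x speed-up translates into RTF.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

audio_len = 10.0          # a 10-second utterance, as in the comparison above
baseline_time = 14.5      # hypothetical autoregressive synthesis time (seconds)
pheme_time = 1.0          # hypothetical non-autoregressive synthesis time (seconds)

print("baseline RTF:", real_time_factor(baseline_time, audio_len))   # 1.45
print("PHEME RTF:   ", real_time_factor(pheme_time, audio_len))      # 0.10
print("speed-up:    ", baseline_time / pheme_time, "x")              # 14.5x
```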
The evaluation also extends to single-speaker specialization using the 300M multi-speaker PHEME model, aiming to create a distinct 'brand voice.' Fine-tuning with synthetic data shows slight WER reductions alongside improvements in speaker similarity and FID scores. Further investigation into using real human recordings of the target voice and scaling up the dataset is needed for better single-speaker specialization.
An ablation of the A2S component without speaker embeddings underscores their significant impact: across GigaSpeech evaluations and single-speaker specialization experiments, including speaker embeddings notably improves fidelity metrics and overall performance.
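As a rough illustration of what conditioning on a speaker embedding can look like, the sketch below projects a fixed-dimensional embedding to the model width and adds it to every frame of the token representation before the acoustic model's layers; the shapes and the additive wiring are assumptions for illustration, not PHEME's exact design.

```python
# Sketch: broadcast a projected speaker embedding across all time steps of the
# token-embedding sequence so every frame carries the same speaker information.
import torch
import torch.nn as nn

D_MODEL, D_SPK, T = 256, 192, 200

token_states = torch.randn(1, T, D_MODEL)     # embedded acoustic/semantic tokens
speaker_embedding = torch.randn(1, D_SPK)     # e.g., from an external speaker encoder

project = nn.Linear(D_SPK, D_MODEL)
conditioned = token_states + project(speaker_embedding).unsqueeze(1)  # broadcast over time
print(conditioned.shape)                      # torch.Size([1, 200, 256])
```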
Examining PHEME's pivotal components reveals that T2S represents the crucial bottleneck within the system. While A2S performs robustly even when scaled down, shrinking T2S adversely affects generation quality and introduces learning instability. Notably, T2S consumes most of the TTS processing time, signaling challenges in employing highly parameterized models for T2S in real-time systems. Future work should prioritize refining the T2S component to better balance performance and efficiency.
Conclusion
To sum up, the PHEME TTS models marked a significant step toward efficient and conversational text-to-speech systems. Achieving nearly 15 times faster inference without compromising speech quality, these models set a strong foundation and invite further exploration of architectural alternatives, improved text-to-semantic components, and innovative decoding strategies. Additionally, the study underscored the scarcity of high-quality conversational TTS data, signaling the need for focused data collection in future research.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.