In a recent publication in the journal Applied Sciences, researchers augmented their existing linguistic e-learning system for phonetic transcription with three artificial intelligence (AI)-driven enhancements.
Background
Phonetic transcription, a crucial component of linguistic education, employs International Phonetic Alphabet (IPA) characters to represent speech sounds. In prior work, the authors introduced Automated Phonetic Transcription—The Grading Tool (APTgt), an interactive e-learning system for phonetic transcription and pronunciation. Tailored for students learning the IPA, the system offers phonetic transcription exams for students and automated grading for teachers. To make it more intelligent and versatile, the authors propose three improvements based on machine learning and deep learning.
Empowering Linguistics Through E-Learning and AI Advancements
Amid the coronavirus disease 2019 (COVID-19) pandemic, e-learning saw a surge, affecting over one billion children globally, as per United Nations Educational, Scientific, and Cultural Organization (UNESCO) data. E-learning employs electronic devices such as smartphones and laptops to provide interactive, remote learning, offering flexibility and overcoming geographical constraints.
Linguistics, the study of human language, benefits from e-learning. APTgt, an interactive web-based system, focuses on phonetic transcription, which represents speech sounds with unique symbols. One of the new capabilities, disordered speech classification, involves two steps: feature extraction and classification. The researchers extracted Mel-frequency cepstral coefficient (MFCC) and linear predictive coding (LPC) features from speech recordings and used these features for the classification task, as sketched below.
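The snippet below illustrates how these two feature types can be extracted with Python's librosa library; the file path, sampling rate, and coefficient counts are illustrative assumptions rather than the authors' settings.

```python
# Illustrative MFCC and LPC feature extraction with librosa; the path and
# parameter values are assumptions, not the authors' exact configuration.
import librosa

# Load a speech sample (hypothetical path) at a 16 kHz sampling rate.
y, sr = librosa.load("speech_sample.wav", sr=16000)

# MFCCs: a (n_mfcc, n_frames) matrix summarizing the spectral envelope.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# LPC: coefficients of an order-16 all-pole model of the vocal tract.
lpc = librosa.lpc(y, order=16)

print(mfcc.shape, lpc.shape)  # e.g. (13, n_frames) and (17,)
```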
A phoneme is the smallest spoken sound unit, while a grapheme is the smallest written unit of a language. Converting words from graphemes to phonemes (G2P) is essential for natural language processing applications such as text-to-speech (TTS) and speech recognition. Both hidden Markov models and deep learning models have been investigated for this task, with noticeable performance gains over time. Neural network-based end-to-end models, including Tacotron 1 and 2, FastSpeech, and Deep Voice, have become prominent in TTS research; such systems comprise text analysis, acoustic modeling, and vocoding components.
Enhancements to Improve Linguistic E-Learning
MFCC and CNN-Based Disordered Speech Classification: The first enhancement is a speech classification module that differentiates between disordered and non-disordered speech. The task entails two fundamental subproblems: feature extraction and classification. The researchers represented the speech features as MFCC images, and a convolutional neural network (CNN) handled the classification. Feature extraction converted audio signals into MFCCs, using the Mel-frequency cepstrum as the underlying representation: the audio is pre-emphasized, segmented into overlapping windows, Fourier-transformed into the frequency domain, mapped onto the Mel spectrum, and finally passed through a discrete cosine transform to yield the MFCCs, as in the sketch below.
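A minimal sketch of that step-by-step pipeline, assuming librosa and illustrative window and filter-bank sizes (not the authors' exact parameters):

```python
# Step-by-step MFCC pipeline: pre-emphasis -> overlapping windows and Fourier
# transform -> Mel spectrum -> log compression -> discrete cosine transform.
# Parameter values are illustrative assumptions.
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical path

# 1. Pre-emphasize to boost high-frequency content.
y_pre = librosa.effects.preemphasis(y)

# 2-4. Overlapping windows, Fourier transform, and Mel-scale filter bank.
mel_spec = librosa.feature.melspectrogram(
    y=y_pre, sr=sr, n_fft=1024, hop_length=256, n_mels=40
)

# 5. Log-compress and apply the discrete cosine transform to obtain MFCCs.
log_mel = librosa.power_to_db(mel_spec)
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

print(mfcc.shape)  # (13, n_frames); rendered as an image, this feeds the CNN
```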
Data for model training were sourced from the Speech Exemplar and Evaluation Database (SEED), which comprises over 16,000 speech samples from speakers with and without speech disorders, categorized by age and speech health status.
Implementation involved selecting about 1,000 SEED samples, splitting them into training and validation sets, and computing MFCC values with Python's librosa library. The disordered speech classification problem was thus translated into an image classification task addressed by a CNN model of the kind sketched below. The model achieved an average classification accuracy of approximately 83 percent, indicating an efficient and versatile classifier that distinguishes disordered speech irrespective of the spoken content or the speaker who recorded it.
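The following is an illustrative PyTorch CNN for binary (disordered versus non-disordered) classification of MFCC images; the architecture and hyperparameters are assumptions for demonstration, not the authors' model.

```python
# A small CNN that classifies single-channel MFCC "images" as disordered or
# non-disordered speech. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MFCCClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x):  # x: (batch, 1, n_mfcc, n_frames)
        return self.classifier(self.features(x))

# Example: a batch of 8 MFCC images, 13 coefficients by 128 frames each.
model = MFCCClassifier()
logits = model(torch.randn(8, 1, 13, 128))
print(logits.shape)  # torch.Size([8, 2])
```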
Transformer-Based Multilingual G2P Converter: The e-learning system's primary function is interactive phonetic transcription exams, and producing the IPA characters for those exams from written language is a burden for teachers. Hence, the researchers devised a G2P converter. G2P conversion translates words into IPA format, which is essential for phonetic transcription. They chose IPA symbols to represent pronunciation in their system and explored French and Spanish variants for multilingual support. Using various datasets, they trained a Transformer model, whose self-attention processes the entire input sequence at once and avoids the long-range dependency issues of recurrent models; a sketch follows.
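Below is a minimal sketch of a Transformer encoder-decoder for G2P (graphemes in, IPA phoneme tokens out); vocabulary sizes, layer counts, and dimensions are illustrative assumptions rather than the authors' configuration, and positional encodings are omitted for brevity.

```python
# Transformer-based G2P: grapheme token IDs -> IPA phoneme token logits.
# All sizes are illustrative; positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class TransformerG2P(nn.Module):
    def __init__(self, n_graphemes=60, n_phonemes=80, d_model=256, n_layers=6):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, d_model)
        self.tgt_emb = nn.Embedding(n_phonemes, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, n_phonemes)

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each phoneme position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(
            self.src_emb(src_ids), self.tgt_emb(tgt_ids), tgt_mask=tgt_mask
        )
        return self.out(hidden)  # (batch, tgt_len, n_phonemes)

# Example with dummy token IDs: 2 words, 10 graphemes in, 12 phonemes out.
model = TransformerG2P()
logits = model(torch.randint(0, 60, (2, 10)), torch.randint(0, 80, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 80])
```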
The six-layer Transformer model achieved strong results: the English G2P converter yielded a 2.6 percent phoneme error rate (PER) and a 10.7 percent word error rate (WER), while the French and Spanish converters attained PERs of 2.1 percent and 1.7 percent and WERs of 12.3 percent and 12.7 percent, respectively, outperforming comparable G2P converters in accuracy. This Transformer-based G2P converter improves phonetic transcription and holds potential for multilingual e-learning by streamlining the generation of IPA characters for teachers.
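Both PER and WER are edit-distance rates; the helper below computes such a rate between a reference and a hypothesis token sequence using the standard Levenshtein definition (a generic illustration, not the authors' evaluation script).

```python
# Generic error rate: Levenshtein distance / reference length. Applied to
# phoneme sequences it gives PER; applied to word sequences it gives WER.
def error_rate(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[n][m] / max(n, 1)

# One substituted phoneme out of seven -> PER of about 0.14.
print(error_rate(list("fənɛtɪk"), list("fənɛtik")))
```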
Tacotron 2-Based IPA-to-Speech System: The existing e-learning system faces the challenge of acquiring high-quality speech audio and converting text into IPA format for phonetic transcription exams. To address this, the researchers designed a TTS system that generates speech directly from IPA-formatted text.
The IPA-to-speech workflow first converts English sentences from the LJSpeech dataset into IPA format using the G2P converter; Tacotron 2 then predicts Mel spectrograms from the IPA input, and WaveGlow serves as the vocoder, ensuring high-quality speech generation. Tacotron 2 uses simpler building blocks than the original Tacotron and predicts Mel spectrogram frames directly from the input symbol sequence; whereas the original Tacotron 2 design uses a modified WaveNet to generate waveform samples from the predicted Mel spectrograms, this system uses WaveGlow for that step. The resulting IPA-to-speech system, outlined below, spares teachers the effort of recording speech and helps students grasp IPA characters and word pronunciation.
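A conceptual sketch of that inference chain follows: G2P output is encoded, Tacotron 2 produces a Mel spectrogram, and WaveGlow produces audio. The model objects and the ipa_to_ids encoder are hypothetical placeholders, since the trained checkpoints are not described in the article; the .infer() calls mirror the common Tacotron 2/WaveGlow inference pattern.

```python
# Conceptual IPA-to-speech pipeline: IPA symbol IDs -> Tacotron 2 Mel
# spectrogram -> WaveGlow waveform. The model objects and the ipa_to_ids
# encoder are hypothetical placeholders, not a published API.
import torch

def synthesize(ipa_text, tacotron2, waveglow, ipa_to_ids):
    """Generate a waveform tensor from IPA-formatted text."""
    # Encode the IPA string as a batch of one symbol-ID sequence.
    ids = torch.tensor([ipa_to_ids(ipa_text)], dtype=torch.long)
    lengths = torch.tensor([ids.size(1)], dtype=torch.long)
    with torch.no_grad():
        # Tacotron 2 predicts a Mel spectrogram from the symbol sequence...
        mel, _, _ = tacotron2.infer(ids, lengths)
        # ...and WaveGlow vocodes the Mel spectrogram into audio samples.
        audio = waveglow.infer(mel)
    return audio.squeeze(0).cpu()
```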
Conclusion
In summary, the researchers introduced three AI enhancements for their existing linguistic e-learning system: improved speech classification, G2P conversion, and speech synthesis together make the system more comprehensive and intelligent. Future work includes an objective evaluation of the system and addressing potential privacy concerns related to recorded speech.