In a recent publication in the journal Scientific Reports, researchers evaluated three machine learning algorithms, linear discriminant analysis (LDA), decision trees (C5.0), and neural networks (NNET), against human speech perception. The algorithms were trained on first-language (L1) vowel formants and duration and then tested on second-language (L2) vowels, while adult L2 speakers completed a parallel perceptual classification task.
Background
In recent years, machine learning has been applied to predicting nonnative speech perception patterns. This work rests on an assumption made, directly or indirectly, by various speech perception models: that acoustic and phonetic similarity between first-language (L1) and second-language (L2) sounds can predict how L2 sounds are perceived. For instance, the author's previous study successfully used machine learning to predict how English /ɪ/ and /iː/ are classified in terms of Cypriot Greek /i/. Similarly, LDA has been employed to predict nonnative sound mappings: Gilichinskaya and Strange used it to estimate how American English vowels would be assimilated into Russian listeners' L1 vowel categories, and the LDA predictions proved effective.
Acoustic analysis and methodologies
Experimental protocols were approved by the Ethics Committee of the University of Nicosia, Department of Languages and Literature. All methods adhered to the ethical standards outlined in the Declaration of Helsinki and its subsequent amendments. Participation was entirely voluntary, and participants could withdraw at any time. Data were kept confidential, and participant identities were anonymized using codes. Every subject gave informed consent.
A formant is a concentration of acoustic energy around a particular frequency in the speech wave. For speech feature extraction, the training data comprised F1, F2, and F3 formant frequencies and duration measurements of the Cypriot Greek vowels /i e a o u/ produced by 22 adults. The equivalent test data comprised the same measurements for the Standard Southern British English vowels /ɪ iː e ɜː æ ɑː ʌ ɒ ɔː uː ʊ/ produced by 20 English-speaking adults (10 females). Speakers were encouraged to speak naturally, and recordings were made at a sampling rate of 44.1 kHz.
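As a rough illustration only, and not the authors' actual pipeline, the training and test data could be organized as two tables of acoustic measurements with one row per vowel token; the file and column names below are assumptions for the sake of the sketch.

    # Hypothetical organization of the acoustic data; file and column
    # names are illustrative, not taken from the study.
    train <- read.csv("cypriot_greek_vowels.csv")   # L1 Greek training tokens
    test  <- read.csv("ssbe_english_vowels.csv")    # L2 English test tokens
    # Expected columns in both: speaker, vowel, F1, F2, F3, duration
    train$vowel <- factor(train$vowel)              # class labels: /i e a o u/
    str(train)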
The study aimed to assess the classifiers' ability to generalize across phonetic contexts. Acoustic analysis was conducted in Praat with pre-emphasis from 50 Hz, a window length of 0.025 s, and a spectrogram view range of 5500 Hz. Formant frequencies were extracted at designated vowel analysis points, and vocalic duration was measured manually.
Machine learning employed three algorithms, LDA, C5.0, and NNET, to predict the classification of L2 sounds relative to L1 phonetic categories. The models were trained in R with cross-validation, and model optimizations were likewise cross-validated. LDA achieved 94 percent prediction accuracy, C5.0 reached 95 percent, and NNET attained the same accuracy.
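The article does not reproduce the analysis code, but a minimal sketch of this training step, assuming the caret package and the illustrative data frames above, could look as follows; the fold count and tuning settings are assumptions.

    library(caret)

    # Cross-validation setup (10 folds assumed for illustration)
    ctrl <- trainControl(method = "cv", number = 10)

    # Train each classifier on the L1 Greek formant and duration features
    fit_lda  <- train(vowel ~ F1 + F2 + F3 + duration, data = train,
                      method = "lda",  trControl = ctrl)
    fit_c50  <- train(vowel ~ F1 + F2 + F3 + duration, data = train,
                      method = "C5.0", trControl = ctrl)
    fit_nnet <- train(vowel ~ F1 + F2 + F3 + duration, data = train,
                      method = "nnet", trControl = ctrl,
                      preProcess = c("center", "scale"), trace = FALSE)

    # Classify each L2 English token in terms of the L1 Greek categories
    pred_lda  <- predict(fit_lda,  newdata = test)
    pred_c50  <- predict(fit_c50,  newdata = test)
    pred_nnet <- predict(fit_nnet, newdata = test)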
In the perception study, 20 Cypriot Greek speakers (10 females) participated. They reported daily English use, and the mean age at which they had begun learning English was 8.35 years. All knew English at B2/C1 level and had healthy sensory and cognitive functions. The test stimuli consisted of 11 English monophthongs embedded in /hVd/ words within the carrier phrase "They say < word > now," recorded by two adult female English speakers. Participants completed the classification test individually, clicking on the label that matched the vowel they heard. All gave written informed consent in accordance with the Declaration of Helsinki, and the University of Nicosia Ethics Committee approved the study.
Results and analysis
Machine learning algorithms classified L2 English vowels according to L1 vowel categories. For the responses with the highest proportion, C5.0 and LDA showed 100 percent agreement, whereas LDA and NNET showed 90 percent agreement, and NNET and C5.0 showed 90 percent agreement. For a broader range of above-chance responses, C5.0 and LDA exhibited 63.6 percent agreement, NNET and LDA had 72.7 percent agreement, and NNET and C5.0 showed 63.6 percent agreement.
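One plausible way to compute such between-classifier agreement, continuing the illustrative sketch above, is to find each classifier's modal (highest-proportion) Greek response per English vowel and compare these choices across classifiers.

    # Modal (highest-proportion) Greek response for each English vowel
    modal_response <- function(pred, vowel) {
      tapply(as.character(pred), vowel, function(x) names(which.max(table(x))))
    }

    m_lda  <- modal_response(pred_lda,  test$vowel)
    m_c50  <- modal_response(pred_c50,  test$vowel)
    m_nnet <- modal_response(pred_nnet, test$vowel)

    # Percent agreement across the 11 English vowels
    mean(m_c50 == m_lda)  * 100
    mean(m_lda == m_nnet) * 100
    mean(m_nnet == m_c50) * 100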
In the perceptual test, the L2 speakers likewise classified the English vowels in terms of their L1 Greek vowel categories. Comparing the machine learning predictions with the human responses, for the responses with the highest proportions, LDA and C5.0 each achieved 90.9 percent prediction accuracy, while NNET reached 100 percent, covering all English vowels. For the broader range of above-chance responses, LDA achieved 72.7 percent prediction accuracy, C5.0 45.5 percent, and NNET 81.8 percent.
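Scoring the models against the listeners can follow the same logic. In the sketch below, `human` is a hypothetical data frame of listener responses (columns english_vowel and greek_response), and a model counts as correct for a vowel when its modal Greek category matches the listeners' modal choice.

    # Listeners' modal Greek response per English vowel (hypothetical data)
    human_modal <- tapply(human$greek_response, human$english_vowel,
                          function(x) names(which.max(table(x))))

    # Prediction accuracy: share of English vowels on which the model and
    # the listeners converge on the same Greek category
    vowels <- intersect(names(human_modal), names(m_nnet))
    mean(m_lda[vowels]  == human_modal[vowels]) * 100
    mean(m_c50[vowels]  == human_modal[vowels]) * 100
    mean(m_nnet[vowels] == human_modal[vowels]) * 100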
Results indicated strong performance by LDA and NNET but poor performance by C5.0. This aligns with previous findings on LDA's efficacy in mapping L2 sounds to L1 categories, albeit with reduced accuracy for the wider range of above-chance responses. LDA, while slightly less accurate than NNET, still performed well, possibly because the relationship between L1 and L2 categories is not strongly nonlinear and the dataset was limited in size. C5.0 struggled, potentially because of overfitting and its difficulty in handling continuous variables.
Conclusion
In summary, the study assessed whether machine learning algorithms, specifically NNET, C5.0, and LDA, trained on crosslinguistic acoustic data could match the accuracy of human L2 listeners in classifying sounds. NNET and LDA classified L2 sounds in terms of L1 categories accurately, with potential implications for crosslinguistic speech studies. These findings can inform language learning and speech technology, and future research can explore larger samples and a more diverse set of classifiers.