In a paper published in the journal Scientific Reports, researchers conducted two experiments using computational language models (LMs). These LMs, ranging from simple n-gram models to advanced deep neural networks, were trained on written cross-linguistic corpus data encompassing 1293 languages. The researchers aimed to empirically test the hypothesis that languages become easier to learn as their number of speakers grows.
Using quantitative methods and machine learning techniques while considering factors like phylogenetic relatedness and geographical proximity of languages, the study provides robust evidence supporting a relationship between learning difficulty and speaker population size. Surprisingly, the results contradict prior expectations, indicating that languages with more speakers tend to be harder to learn.
Background
Historically, linguistics has operated under the assumption that there is no connection between a language's structure and the environment in which it is spoken. This belief has led to the long-standing idea that all languages are equally complex and challenging to learn. However, there are between 6000 and 8000 languages worldwide, exhibiting significant variation in their structural properties. Recent cross-linguistic research has emphasized the impact of natural and social environments on language diversity, suggesting that variables like the number of speakers can influence language structure. This challenges the idea of a "universal language complexity".
Methods
Language Models: In this study, the researchers employ general-purpose data compression algorithms as language models. Each such algorithm consists of two parts: a model and a coder. The focus is on lossless compressors, which estimate a model, essentially a conditional probability distribution derived from the training data. This model can then make predictions, and the coder uses those predictions to compress the data via arithmetic encoding.
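To make the model-plus-coder idea concrete, here is a minimal Python sketch that estimates a character-level unigram model from training text and computes the ideal code length (bits per character) that an arithmetic coder driven by that model would approach. The unigram model and add-one smoothing are illustrative simplifications, not the compressors used in the paper.

```python
import math
from collections import Counter

def bits_per_char(train: str, test: str) -> float:
    """Ideal code length (bits per character) of `test` under a unigram
    character model estimated from `train` with add-one smoothing.
    An arithmetic coder using this model would compress `test` to
    roughly this rate (illustrative toy model, not the paper's LMs)."""
    counts = Counter(train)
    vocab = set(train) | set(test)
    total = sum(counts.values()) + len(vocab)  # add-one smoothing
    bits = 0.0
    for ch in test:
        p = (counts[ch] + 1) / total
        bits += -math.log2(p)
    return bits / len(test)

print(bits_per_char("abracadabra " * 100, "abracadabra"))
```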
For instance, Prediction by Partial Matching (PPM) is a dynamic, adaptive, variable-order n-gram LM. It operates under the Markov assumption, using the last o symbols immediately preceding the symbol of interest to make predictions, where o is the model order. The Lempel-Ziv-Markov chain algorithm (LZMA), used in Study Two, employs a compression strategy that identifies repetitive segments within the data and replaces them with references pointing to a single earlier instance of that segment in the uncompressed data stream.
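As a rough illustration of LZMA's repetition-based strategy (not the paper's evaluation pipeline), the sketch below uses Python's standard-library lzma module to compare the compression rate of a repetitive text with a character-shuffled version of the same text, in which the repeated segments have been destroyed while the character frequencies stay the same.

```python
import lzma
import random

def lzma_bits_per_char(text: str) -> float:
    """Compressed size in bits per character when the text is encoded
    with LZMA (Python's standard-library binding)."""
    data = text.encode("utf-8")
    return 8 * len(lzma.compress(data, preset=9)) / len(text)

sample = "the cat sat on the mat. " * 200
shuffled = "".join(random.sample(sample, len(sample)))  # destroys repeated segments

print(lzma_bits_per_char(sample))    # low: LZMA exploits the repetitions
print(lzma_bits_per_char(shuffled))  # much higher: few exploitable repeats remain
```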
PAQ is a lossless compressor that combines predictions from numerous models using a gated linear network. The network has a single layer with many input nodes and input weights, and it is trained using backpropagation through time and Adam optimization.
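The sketch below shows, in highly simplified form, the kind of logistic mixing that PAQ-style context mixing performs: each model's probability is mapped to the logit domain, weighted, summed, and squashed back into a probability. The fixed weights here are purely illustrative; in a real mixer they are adapted online.

```python
import math

def mix_predictions(probs, weights):
    """Combine several models' probabilities for 'next bit = 1' by
    mixing in the logit (stretch) domain, as PAQ-style mixers do.
    Weights are fixed here for illustration only."""
    stretch = [math.log(p / (1 - p)) for p in probs]   # logit of each prediction
    z = sum(w * s for w, s in zip(weights, stretch))   # weighted sum
    return 1 / (1 + math.exp(-z))                      # squash back to a probability

print(mix_predictions([0.9, 0.6, 0.7], [0.5, 0.3, 0.2]))
```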
Similarly, LSTM-compression combines predictions from independent models and is based on a long short-term memory (LSTM) deep neural network. NNCP is a lossless data compressor based on the Transformer-XL model; two versions, NNCPsmall and NNCPlarge, with different configuration options, are used.
Data: In Study One, researchers utilized a large-scale database of written multilingual texts consisting of 3853 documents across 40 different multilingual corpora. These documents encompass various text types and vary in length. For Study Two, they relied on data from the Parallel Bible Corpus, containing 1568 unique translations of the Bible in 1166 different languages, structured in terms of book, chapter, and verse.
Information Encoding Units: In Study One, researchers computed the relevant quantities with both words and characters as information encoding units/symbols. In Study Two, they estimated quantities at the level of words and additionally applied byte pair encoding (BPE) to split words into one or several units for LM training.
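The following sketch shows a minimal byte pair encoding learner in the Sennrich et al. style: starting from individual characters, it repeatedly merges the most frequent adjacent symbol pair. The toy word list and the number of merges are illustrative assumptions; the paper's actual BPE configuration may differ.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal BPE learner: repeatedly merge the most frequent
    adjacent symbol pair across the (toy) corpus."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["lower", "lowest", "newer", "wider"], num_merges=10)
print(merges)
```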
Sociodemographic and Linguistic Variables: Researchers considered various sociodemographic and linguistic variables such as speaker population size, language family, language, macro area, country, writing script, longitude, latitude, EGIDS level, and more. These variables help control for potential translation effects and allow the researchers to investigate their influence on language complexity.
Estimating LM Learning Difficulty: Investigators used compression rates for sub-sequences of increasing length to measure language learning difficulty. By evaluating the shape of the resulting curve of compression lengths, they gained insight into how quickly and how well language learning succeeds.
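As a rough illustration of this idea (using LZMA as a stand-in for the paper's language models), the sketch below compresses prefixes of increasing length and reports the compression rate for each; a curve that drops quickly as more text is seen suggests that the regularities of the text are easy to pick up.

```python
import lzma

def learning_curve(text: str, num_points: int = 10):
    """Compression rate (bits per character) for prefixes of increasing
    length. Illustrative proxy for a learning curve, not the paper's
    exact procedure."""
    curve = []
    for k in range(1, num_points + 1):
        prefix = text[: k * len(text) // num_points]
        rate = 8 * len(lzma.compress(prefix.encode("utf-8"))) / len(prefix)
        curve.append((len(prefix), rate))
    return curve

for n, rate in learning_curve("the quick brown fox jumps over the lazy dog. " * 300):
    print(f"{n:6d} chars -> {rate:.3f} bits/char")
```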
Statistical Analyses: Researchers employed multilevel mixed-effects linear regression, frequentist model averaging, double-selection lasso linear regression, permutation testing, phylogenetic generalized least squares regression, and spatial autoregressive error regression to analyze the data and examine the relationship between speaker population size and language complexity. These analyses included various control variables and, in some cases, permutation tests to assess the significance of the findings, helping to disentangle the effects of different factors on language complexity and learning difficulty.
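For readers unfamiliar with multilevel mixed-effects regression, the following hedged sketch fits a random-intercept model with the statsmodels library on synthetic data. The column names, grouping variable, and data are hypothetical and do not reproduce the paper's analyses or covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "log_population": rng.normal(12, 3, n),                       # hypothetical log speaker counts
    "family": rng.choice(["fam_a", "fam_b", "fam_c", "fam_d"], n),  # hypothetical language families
})
# Synthetic outcome: a population effect plus a family-level offset and noise.
family_offset = df["family"].map({"fam_a": 0.2, "fam_b": -0.1, "fam_c": 0.0, "fam_d": 0.3})
df["difficulty"] = 0.05 * df["log_population"] + family_offset + rng.normal(0, 0.5, n)

# Fixed effect of population size, random intercept per language family.
model = smf.mixedlm("difficulty ~ log_population", data=df, groups=df["family"])
print(model.fit().summary())
```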
Experimental Results
In the first study, the researchers conducted a detailed analysis of language learning difficulty, mainly focusing on the relationship between learning difficulty and speaker population size. They used various language models and multilevel mixed-effects linear regression (LMER) to investigate the impact of population size on learning difficulty. The results indicated that, across different language models, larger speaker populations were associated with higher learning difficulty, suggesting that languages with more speakers tend to be harder to learn. These findings were consistent across various models, including PPM, PAQ, and LSTM-compression, and the results remained robust even when controlling for potentially confounding factors like translation effects and pluricentrism.
In the second study, the researchers aimed to confirm the results obtained in the first study using a different dataset, the Parallel Bible Corpus, which provided more balanced and parallel multilingual training data. They trained several language models and evaluated learning difficulty using cross-entropy measures. The results in the second study aligned with those in the first study, showing that languages with larger speaker populations tended to be harder to learn, as indicated by the positive relationship between learning difficulty and population size. These results were consistent across language models and symbolic levels.
The researchers further confirmed their findings by conducting Phylogenetic Generalized Least Squares (PGLS) regression and spatial autoregressive errors regression (SAR) analyses, which considered genealogical relatedness and spatial proximity as potential sources of influence on language learning difficulty. In both cases, the results supported the positive relationship between population size and learning difficulty, providing further evidence that languages with larger speaker populations are generally harder to learn.
Conclusion
In conclusion, the research leveraged computational language models to empirically investigate the relationship between speaker population size and language learning difficulty. Contrary to prior expectations, the results from two comprehensive experiments using a diverse range of language models showed that languages with larger speaker populations tend to be more challenging to learn. This novel insight challenges conventional assumptions and has significant implications for the understanding of language acquisition and the factors that influence linguistic diversity. It underscores the valuable role of computational models in advancing the understanding of human language and cognition.
Journal reference:
Koplenig, A., & Wolfer, S. (2023). Languages with more speakers tend to be harder to (machine-)learn. Scientific Reports, 13(1), 18521. https://doi.org/10.1038/s41598-023-45373-z