In an article published in the journal Scientific Reports, researchers from Mexico used several machine learning (ML) algorithms to design predictive models that identify students at risk of dropping out, so that those students can be given appropriate support. The models target school dropout at the secondary and higher education levels.
Background
ML is a branch of artificial intelligence that enables computers to learn from data and perform tasks that normally require human intelligence. It is commonly divided into two main categories: supervised and unsupervised learning. In supervised learning, the computer is given a set of input-output pairs and learns to map new inputs to the desired outputs. In unsupervised learning, the computer is given input data without explicit labels and learns to uncover patterns or structures within the data.
ML has been widely used in fields such as medicine, engineering, finance, and education. In education, it can help improve the quality and effectiveness of teaching and learning and address challenges such as student retention, performance, and satisfaction. School dropout is a complex phenomenon with multiple causes and consequences, influenced by individual, family, school, and social factors. ML is well suited to modeling and predicting it because it can handle large and heterogeneous datasets and capture nonlinear relationships among these factors.
About the Research
In the present paper, the authors aimed to develop a model for predicting school dropout with 90% reliability. They used data from the 2010 and 2020 housing and population censuses and the 2015 intercensal survey conducted by the National Institute of Statistics and Geography (INEGI). These datasets included information about the residents and households in Mexico's 32 states and 2,457 municipalities, covering factors such as ethnicity, birth, education, health services, economic issues, and other relevant characteristics.
The study selected 20 variables from the data sources based on their correlation with the target variable, which was the academic level of the individuals. The target variable indicated whether the individual had completed or dropped out of secondary or higher education. The selected variables included demographic, socioeconomic, and educational factors, such as age, gender, marital status, occupation, income, school attendance, school type, and school location. The researchers cleaned and homogenized the data, discarding incomplete, duplicate, and unspecified records and retaining only the records of people over 14 years old who entered secondary or higher education. The final dataset consisted of 1,080,782 records.
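The cleaning rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`person_id`, `age`, `school_attendance`, `academic_level`, `entered_secondary`) and the `"unspecified"` placeholder are hypothetical, since the article does not give the actual INEGI column names or the authors' code.

```python
# Hypothetical field names; the real census columns are not given in the article.
REQUIRED_FIELDS = ("person_id", "age", "school_attendance",
                   "academic_level", "entered_secondary")
# Assumed placeholder values that mark an answer as "unspecified".
UNSPECIFIED = {"unspecified", "", None}


def clean_records(records):
    """Drop incomplete, unspecified, and duplicate records; keep people
    over 14 who entered secondary or higher education."""
    seen = set()
    cleaned = []
    for rec in records:
        # Incomplete: a required field is missing from the record.
        if any(field not in rec for field in REQUIRED_FIELDS):
            continue
        # Unspecified: a required field holds a placeholder value.
        if any(rec[field] in UNSPECIFIED for field in REQUIRED_FIELDS):
            continue
        # Keep only people over 14 who entered secondary or higher education.
        if rec["age"] <= 14 or not rec["entered_secondary"]:
            continue
        # Duplicate: the same values were already seen.
        key = tuple(rec[field] for field in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


valid = {"person_id": 1, "age": 17, "school_attendance": "yes",
         "academic_level": "secondary", "entered_secondary": True}
sample = [
    valid,
    dict(valid),                                          # duplicate
    {**valid, "person_id": 2, "age": 13},                 # too young
    {**valid, "person_id": 3, "school_attendance": "unspecified"},
    {"person_id": 4, "age": 19},                          # incomplete
    {**valid, "person_id": 5, "age": 22, "academic_level": "higher"},
]
cleaned = clean_records(sample)
kept_ids = [rec["person_id"] for rec in cleaned]
```

In this toy sample only the records with `person_id` 1 and 5 survive. A production pipeline over roughly a million records would more likely use a dataframe library, but the filtering logic would be the same.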
The researchers applied artificial neural networks (ANN), support vector machines (SVM), Bayesian optimization, random forest (RF), and linear ridge and Lasso regression to create predictive models. These techniques were chosen because they have proven effective and competitive on regression problems. The performance of each technique was compared in terms of reliability and processing time, using evaluation metrics such as the coefficient of determination, the mean squared error, and the root mean squared error. The study used 80% of the data for training and 20% for testing.
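The evaluation setup named above (an 80/20 split scored with the coefficient of determination, the mean squared error, and the root mean squared error) can be illustrated with a small standard-library sketch. The function names, the shuffling seed, and the toy values are assumptions made for the example; the article does not describe how the authors implemented the split or the metrics.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction for testing (assumed procedure)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: ten rows split 80/20, and four predictions with one miss.
rows = list(range(10))
train, test = train_test_split(rows)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = {"mse": mse(y_true, y_pred),
          "rmse": rmse(y_true, y_pred),
          "r2": r2(y_true, y_pred)}
```

For the toy predictions, the single error of 1.0 gives an MSE of 0.25, an RMSE of 0.5, and a coefficient of determination of 0.8.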
Research Findings
The outcomes showed that all the ML techniques achieved reliability above 91%. However, the best technique in terms of reliability and processing time was the ANN, which obtained a reliability of 99%, followed by SVM and Bayesian optimization, which obtained a reliability of 99.5% and 99.4%, respectively. RF, linear ridge, and Lasso regression obtained a reliability of 91.3% and 91.1%, respectively. The error rates of the techniques were below 10%, which was the convergence criterion established by the authors. The ANN also had the shortest processing time, while random forest required the most computing power.
Several tests were also performed to optimize the parameters and structure of the ANN, such as the number of layers, the number of neurons, the activation function, and the optimization algorithm. The authors found that the best ANN configuration was a multilayer perceptron with four hidden layers of two neurons each, using the adaptive moment estimation (Adam) optimization algorithm and the rectified linear unit (ReLU) activation function. This network was able to learn from the data and predict the probability of school dropout for each individual based on the input variables.
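To make the reported architecture concrete, the sketch below builds a multilayer perceptron with four hidden layers of two neurons each and ReLU activations, then runs one forward pass. Several details are assumptions rather than facts from the article: that all 20 selected variables are fed in as inputs, that there is a single output neuron, that the output uses a sigmoid so it reads as a probability, and the random weight initialization. Training with Adam is omitted to keep the example short, so the weights here are untrained.

```python
import math
import random


def relu(x):
    """Rectified linear unit activation."""
    return x if x > 0.0 else 0.0


def init_layers(n_inputs, hidden_sizes, n_outputs, seed=0):
    """Create (weights, biases) per layer with small random weights (assumed init)."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the output (assumption)."""
    activation = list(x)
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if index == last:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activation = [relu(v) for v in z]
    return activation


def count_parameters(layers):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


# 20 inputs (assumed), four hidden layers of two neurons, one output (assumed).
layers = init_layers(n_inputs=20, hidden_sizes=[2, 2, 2, 2], n_outputs=1)
n_params = count_parameters(layers)
prediction = forward(layers, [0.1] * 20)
```

Under these assumptions the network has only 63 trainable parameters (42 in the first hidden layer, 6 in each of the next three, and 3 in the output layer), which is consistent with the short processing time the article reports for the ANN.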
The study also identified the most influential variables in predicting school dropout using the feature importance method. These were school attendance, school type, school location, occupation, income, and marital status. Together they reflect the economic, social, and educational factors that affect students' decisions to continue or abandon their studies.
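The article does not say which feature importance method the authors used. One common, model-agnostic option is permutation importance, sketched below as an illustration only: each feature column is shuffled in turn, and the resulting increase in prediction error is taken as that feature's importance. The toy model, the synthetic data, and all function names are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving the others untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            permuted_error = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a trained model: depends on feature 0, ignores feature 1."""
    return 3.0 * row[0]


# Synthetic data with two features; only the first one drives the target.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(50)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

Here the first feature receives a positive importance while the ignored second feature scores zero, which is the kind of ranking that would place variables such as school attendance above less informative ones.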
Conclusion
In summary, the paper comprehensively demonstrated the feasibility and usefulness of applying ML to predict school dropout. The authors indicated that the best ML approach was the ANN. They also highlighted the most influential variables in predicting school dropout, which can aid in understanding the causes and consequences of this issue.
The research has several applications and implications for the educational sector, including providing timely support to at-risk students and evaluating the impact of various policies and programs. Additionally, the researchers proposed developing an open platform for institutions to access and utilize the data and predictions, facilitating ongoing model improvement with new data.