In an article published in the journal Scientific Reports, researchers explored a novel approach to expedite poverty assessment in Indonesia using e-commerce data and machine learning (ML) algorithms. The authors employed statistical-based feature selection and compared three ML algorithms to predict poverty rates.
Background
Poverty has persisted as a significant challenge in developing countries in recent decades. In Indonesia, 9.82 percent of the population, or 25.95 million people, were identified as poor in March 2018. Traditional poverty assessment methods, such as the National Socio-economic Survey (SUSENAS), are time-consuming, costly, and conducted infrequently, which hinders timely and cost-effective policymaking. The study therefore investigated e-commerce data as a timely, fine-grained indicator of socio-economic conditions. This focus was particularly relevant in Indonesia, which has one of Southeast Asia's largest e-commerce markets.
While previous studies have utilized various data sources, such as satellite imagery and call detail records, for poverty estimation, assumptions and limitations persisted. E-commerce data, however, presented a promising alternative, offering direct insights into household expenditure without inherent assumptions. This paper addressed the scarcity of research in utilizing e-commerce data for poverty prediction, highlighting its novelty and potential significance.
Previous efforts have primarily employed limited feature selection algorithms, whereas this research employed three statistical-based feature selection methods along with three ML algorithms to enhance the accuracy of poverty estimation models. By doing so, the paper sought to bridge gaps in existing research, providing a comprehensive and original approach to poverty prediction using e-commerce data, which could have broader implications beyond Indonesia.
Method
The research used a sample of advertising data from a prominent Indonesian e-commerce company to address the challenge of timely and cost-effective poverty assessment. The dataset focused on Java Island and comprised eight types of items, such as motorbikes, cars, apartments, houses, and land, listed for sale or rent in 2016. Poverty levels were measured against a predefined poverty line representing the minimum expenditure needed for basic life needs. The dataset covered 118 cities and initially contained 96 features, including the number of items sold and their prices.
To improve computational efficiency, the study applied three statistical-based feature selection algorithms (f-score, chi-square, and correlation-based feature selection) to identify the most relevant features. The overall methodology proceeded in stages: data preprocessing, normalization, feature selection, model training, and evaluation.
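The normalization and feature selection stages can be illustrated with a minimal sketch. It assumes that the f-score is the univariate F-statistic of a linear fit between one feature and the target (the definition used by common ML libraries). The article does not confirm the study's exact formulation. The feature names, toy values, and helper functions below are hypothetical and do not come from the study's dataset.

```python
import math

def min_max_normalize(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # a constant feature carries no information
    return [(v - lo) / (hi - lo) for v in column]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def f_score(x, y):
    """Univariate F-statistic (assumed definition): F = r^2 / (1 - r^2) * (n - 2)."""
    r2 = pearson_r(x, y) ** 2
    if r2 >= 1.0:
        return math.inf  # perfectly linear relationship
    return r2 / (1.0 - r2) * (len(x) - 2)

def select_top_k(features, target, k):
    """features maps a name to a list of values; return the k highest-scoring names."""
    scores = {
        name: f_score(min_max_normalize(col), target)
        for name, col in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: three features observed in five cities.
features = {
    "car_ads": [10, 20, 30, 40, 50],
    "avg_house_price": [5, 3, 6, 2, 7],
    "constant": [1, 1, 1, 1, 1],
}
poverty_rate = [12.0, 10.1, 7.9, 6.2, 3.8]

top = select_top_k(features, poverty_rate, 1)
print(top)
```

In this toy example, the feature that moves almost linearly with the poverty rate receives the highest score and is the one retained.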
The authors employed ML algorithms, specifically support vector regression (SVR), k-nearest neighbor regression (k-NN), and linear regression (LR). SVR, chosen for its successful application in various domains, underwent a grid search for optimal parameters. The researchers aimed to predict poverty rates, employing leave-one-out cross-validation for evaluation.
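Leave-one-out cross-validation can be sketched as follows: each city is held out once, the model is fitted on the remaining cities, and the held-out city is predicted. The sketch uses k-NN regression only because it is the simplest of the three models to write without a library. It is not the study's implementation, and the function names and toy data are hypothetical. A grid search such as the one applied to SVR would, under the same assumptions, repeat this loop for each candidate parameter setting and keep the best-scoring one.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query` (Euclidean distance)."""
    dists = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(t for _, t in nearest) / len(nearest)

def leave_one_out_predictions(X, y, k):
    """Predict every sample using a model fitted on all the other samples."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]  # drop the held-out row
        train_y = y[:i] + y[i + 1:]
        preds.append(knn_predict(train_X, train_y, X[i], k))
    return preds

# Hypothetical toy data: one normalized feature per city and its poverty rate.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 4.0]

loo_preds = leave_one_out_predictions(X, y, k=2)
print(loo_preds)
```

With 118 cities, this procedure would fit the model 118 times, which is affordable for a dataset of that size and uses almost all of the data for training in every round.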
Performance metrics included root mean squared error (RMSE) to measure the difference between actual and predicted values and R-squared (R2) to assess the model's ability to predict actual data trends. The comprehensive methodology addressed the challenge of high-dimensional data in the e-commerce dataset, contributing to the novel application of e-commerce data and ML for poverty estimation in Indonesia.
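Both metrics can be computed in a few lines. The sketch below assumes the common definitions: RMSE as the square root of the mean squared residual, and R2 as the coefficient of determination, 1 - SS_res / SS_tot. The article does not state which R2 variant the authors used, and the values below are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted poverty rates for four cities.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.5, 2.0, 3.0, 2.5]

print(round(rmse(actual, predicted), 4), round(r_squared(actual, predicted), 4))
```

Under this definition, an R2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than the mean.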
Results and Discussion
The research employed f-score and chi-square feature selection algorithms to identify relevant features from e-commerce data, aiming to enhance the performance of ML models in predicting poverty levels. Correlation-based feature selection exhibited inconsistent results and was excluded from further analysis. The study conducted prediction experiments with SVR, k-NN, and LR, comparing results with and without feature selection. The best-performing model was SVR, achieving an R2 score of 0.42765 with f-score feature selection and 90 features.
Visualizations of SVR, k-NN, and LR models showcased SVR's superior performance, particularly in minimizing prediction errors. The choropleth maps displayed actual and predicted poverty rates in Java Island, revealing an overall underestimation in predicted rates compared to actual data. Detailed city-level comparisons provided a comprehensive analysis of actual versus predicted poverty percentages. The findings highlighted the effectiveness of feature selection in handling high-dimensional e-commerce data, with SVR emerging as the most reliable model for poverty prediction. However, the underestimation observed in predictions suggested potential areas for model refinement and improvement.
Conclusion
In conclusion, the researchers demonstrated the potential of utilizing e-commerce data for poverty prediction through ML. The f-score feature selection algorithm outperformed others, enhancing the performance of SVR in predicting poverty rates. However, challenges existed in predicting regions with higher poverty rates. Despite limitations, such as the use of only one year of data, the study suggested the viability of e-commerce datasets as proxies for socio-economic conditions.
The main drawback lay in the accessibility and confidentiality constraints associated with e-commerce data. Future research could explore larger datasets to improve model accuracy. Overall, the findings underscored the promise of integrating e-commerce data, feature selection, and ML for effective poverty estimation.
Journal reference:
- Wijaya, D. R., Ibadurrohman, R. I. F., Hernawati, E., & Wikusna, W. (2024). Poverty prediction using E-commerce dataset and filter-based feature selection approach. Scientific Reports, 14(1), 3088. https://doi.org/10.1038/s41598-024-52752-7, https://www.nature.com/articles/s41598-024-52752-7