In a paper published in the journal PLOS One, researchers tackled the growing challenge of customer churn within the banking sector. Using data from a bank available on Kaggle, the team focused on constructing a predictive model utilizing Genetic Algorithm eXtreme Gradient Boosting (GA-XGBoost) to identify churn. Their findings highlighted the effectiveness of techniques like the Synthetic Minority Over-sampling Technique using Edited Nearest Neighbors (SMOTEENN) in handling data imbalances.
The optimized model outperformed other machine-learning approaches across evaluation metrics. The GA-XGBoost classifier emerged as the optimal solution for predicting churn, highlighting influential factors such as transaction volumes, product diversity, and account balances. This research provides valuable insights for banks seeking to elevate service quality and retain customers, setting a precedent for churn models in other industries. Such strategies aim to reinforce market stability and minimize potential losses.
Background
The rapid evolution of the Internet and its integration with the financial sector has given rise to Internet finance, a burgeoning model that significantly impacts market dynamics. This shift has influenced traditional banking, leading to the migration of financial products online and a subsequent decline in profitability within conventional banking. Amid this transformation, customer retention and a customer-centric approach have emerged as pivotal elements for maintaining a competitive edge in the banking landscape. However, the advent of Internet finance has also diversified and personalized customer needs, resulting in a substantial loss of existing customers, a phenomenon termed customer churn.
Constructing the Bank Customer Churn Model
This study outlines the technical methodology and theoretical frameworks for constructing a bank card customer churn prediction model. The approach integrates the Shapley Additive Explanations (SHAP) interpretation framework with the GA-XGBoost model in a multi-step process to predict and interpret customer churn effectively. The steps involve dataset acquisition from the Kaggle platform, comprehensive data preprocessing, addressing dataset imbalances, evaluating multiple machine learning models, selecting the most efficient model, optimizing it using genetic algorithms, and interpreting the prediction results using the SHAP framework.
The GA-XGBoost model construction involves leveraging the XGBoost algorithm—recognized for its boosting capabilities—to create a customer churn model by optimizing a specific objective function. Introducing a genetic algorithm-based parameter tuning method aims to enhance model performance and tackle the challenges of large datasets and numerous XGBoost parameters. This approach facilitates efficient parameter optimization, aiming to achieve an optimal parameter combination for the XGBoost model, ultimately enhancing the predictive accuracy by utilizing the AUC index as a fitness function.
Moreover, the study introduces the SHAP interpretation framework to address the interpretability challenges of the GA-XGBoost model, which, despite its accuracy, offers limited interpretability due to its complex nature. SHAP provides a robust means of interpreting black-box models like GA-XGBoost by calculating Shapley values for each feature, revealing their contribution to prediction outcomes. The framework ranks feature importance and elucidates how individual features influence prediction results, providing insights that align with human intuition.
Regarding experimental data and preprocessing, the study utilizes a Credit Card Customer dataset from Kaggle, comprising personal, behavioral, and transactional information. Data preprocessing involves handling categorical variables using One-Hot encoding, normalizing features via Z-score normalization, and applying a chi-square test to discern which features significantly impact customer churn. Feature correlation analysis, visualized through correlation heat maps, then drives the removal of highly correlated features.
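The preprocessing steps can be sketched with pandas and scikit-learn as below. The tiny frame and its column names are illustrative stand-ins for the Kaggle data, not the dataset's actual schema:

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the Credit Card Customer dataset
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Total_Trans_Ct": [40, 80, 20, 65, 90, 30],
    "Churn": [1, 0, 1, 0, 0, 1],
})

# One-Hot encode categorical variables
df = pd.get_dummies(df, columns=["Gender"])

# Z-score normalization of numeric features (zero mean, unit variance)
df["Total_Trans_Ct"] = StandardScaler().fit_transform(df[["Total_Trans_Ct"]])

# Chi-square test of independence between a categorical feature and churn
contingency = pd.crosstab(df["Gender_M"], df["Churn"])
chi2, p, _, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p:.3f}")
```

Features whose chi-square p-value exceeds the chosen significance level, or that are highly correlated with retained features, would be dropped before modeling.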
The experimental environment uses the Python programming language, with the XGBoost, imbalanced-learn, and Scikit-learn libraries for algorithm implementation and evaluation. For assessing bank customer churn, evaluation metrics focus on the positive (churn) class, emphasizing accuracy, recall, precision, and F1-score derived from a confusion matrix. The Area Under the Curve (AUC) value, derived from the Receiver Operating Characteristic (ROC) curve, serves as a comprehensive metric for evaluating overall model performance.
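The evaluation metrics named above can be computed directly with scikit-learn; the labels and scores below are made-up values for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Illustrative ground-truth labels (1 = churn) and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

# Confusion matrix: true/false negatives and positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # of predicted churners, how many churned
    "recall": recall_score(y_true, y_pred),        # of actual churners, how many were caught
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),          # threshold-independent ranking quality
}
print(metrics)
```

Note that AUC is computed from the probabilities rather than the thresholded predictions, which is why it summarizes performance across all possible decision thresholds.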
Enhancing Customer Churn Prediction Method
The study examines bank customer churn prediction on an unbalanced dataset, employing various methodologies to enhance model accuracy. Researchers initially tackled the imbalance issue, noting XGBoost's 90% accuracy but flagging its low recall rate owing to this imbalance. To counter it, they evaluated resampling algorithms such as SMOTE, Adaptive Synthetic Sampling (ADASYN), and SMOTEENN. Among these, SMOTEENN proved the most effective, achieving 96% accuracy and a 92% recall rate, thus substantially boosting the model's overall efficacy.
Moving forward, hyperparameter optimization via genetic algorithms amplifies XGBoost's efficiency, culminating in optimal parameters—n_estimators = 265, learning_rate = 0.097, max_depth = 5. This optimization boosts the model's AUC to 99.02%, outshining other machine-learning models across various evaluation metrics. Ultimately, the GA-XGBoost approach demonstrates a marked advancement in predicting customer churn, particularly in the banking sector. This comprehensive methodology amalgamates data balancing, algorithmic fine-tuning, and interpretability frameworks to significantly heighten prediction accuracy and robustness, offering crucial insights for proactive customer retention strategies.
Conclusion
To sum up, this study pioneers a GA-XGBoost model for customer churn prediction, overcoming the limitations of traditional methods. By balancing the data with resampling techniques and tuning parameters with genetic algorithms, the model achieves exceptional accuracy and outperforms conventional approaches.
Leveraging the SHAP framework for interpretability sheds light on critical churn indicators like transaction volume and product engagement. Future research will prioritize enhancing feature selection methods and validating the model across diverse datasets, enabling domestic banks to implement custom-tailored retention strategies.