In a paper published in the journal Scientific Reports, researchers proposed a machine learning strategy to identify and classify organized retail crime (ORC) listings on a well-known online marketplace. ORC has become a significant challenge for retailers and consumers, particularly with the surge of online commerce and digital platforms, and detecting and responding to it swiftly is crucial to limiting its impact. Using supervised learning and a refined set of 45 features selected from an original 58, the method achieves a recall score of 0.97 on the holdout set and 0.94 on the testing dataset.
Background
Amid a rapidly expanding internet commerce landscape and the surge in online activity prompted by the COVID-19 pandemic, cybercrime and fraud have become more prevalent, posing severe economic and security challenges. Detecting and responding to such threats is imperative, but traditional prevention methods are not foolproof, and existing detection approaches have shown limitations.
The rapid growth of e-commerce platforms like Yahoo and eBay has been accompanied by a surge in online fraud cases, presenting a substantial challenge. Categorized by the Internet Fraud Complaint Center (IFCC) into various types, including non-delivery of goods, product misrepresentation, and multiple bidding, online fraud has spurred research into diverse detection strategies. Feedback anomaly detection methods, data mining schemes, and trust management solutions have been explored.
To address skewed data distribution, that is, the imbalance between fraudulent and legitimate instances, researchers have employed data-level and algorithm-level approaches. Data-level rebalancing involves techniques such as undersampling and oversampling, with the Synthetic Minority Oversampling Technique (SMOTE) emerging as a leading oversampling method. Algorithm-level solutions, such as cost-sensitive learning, handle class imbalance inside the learner itself, and data-level methods are generally reported to outperform them.
The proposed method tackles the issue by presenting a machine-learning solution to identify and combat organized retail crime (ORC) in online marketplaces. Through supervised learning and advanced methods, the approach achieves high recall scores on holdout and testing datasets.
Proposed method
The framework comprises four experiments designed to identify the optimal model for detecting organized retail fraud. In the first design, named individual classifiers, numeric features are extracted and preprocessed, and seven classifiers are trained without any class imbalance resolution technique. Grid search with stratified k-fold cross-validation is used for hyperparameter tuning.
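The tuning step described above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic, the random forest stands in for any one of the seven classifiers, and the parameter grid is hypothetical rather than taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 10% positives).
# This is NOT the marketplace dataset used in the study.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the fraud/legitimate ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid; the values are illustrative only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="recall",  # recall is the metric emphasized in the article
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_recall = search.best_score_  # mean cross-validated recall of the best setting
print(best_params, round(best_recall, 3))
```

Scoring on recall mirrors the article's emphasis on catching as many fraudulent sellers as possible; a different business objective would call for a different scoring argument.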
Using the same data, an ensemble is then constructed by stacking the seven classifiers. This approach, called stacked generalization, combines the base models' predictions through a meta-model trained on the out-of-fold predictions produced by k-fold cross-validation of those base models. A further stage addresses class imbalance, which arises because fraudulent cases are far rarer than non-fraudulent ones in fraud data. This phase identifies the best combination of class rebalancing technique and classifier for the context, which the authors elaborate on in their detailed class imbalance resolution approach.
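As a rough sketch of stacked generalization, scikit-learn's `StackingClassifier` fits a meta-model on out-of-fold predictions from the base models. The three base learners, the logistic regression meta-model, and the synthetic data below are assumptions chosen for brevity; the study stacks seven classifiers whose exact configuration is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the seller features.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=1
)

# Illustrative base learners (an assumption, not the paper's seven).
base_models = [
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
]

# cv=5 means the meta-model is trained on predictions each base model
# made for rows it did not see during fitting (out-of-fold predictions).
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models were fitted on, keeps it from simply learning to trust whichever base model overfits the most.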
The methodology section covers the classifiers used, the experimental configurations, and the data preprocessing steps. Historical data from a prominent worldwide online marketplace is used to detect ORC instances, focusing on 3,606 high-volume sellers based in the US. The dataset encompasses numeric, categorical, and text data types, with text features having a limited impact. The preprocessing stage addresses duplicates, missing data, and outliers. Feature engineering entails generating predictive attributes through encoding, dummy columns, and new features derived from titles and descriptions. Established and new classifiers are incorporated, guided by expert insights from ORC professionals with experience in fraud detection and mitigation.
Handling the "unbalanced data problem," which denotes a skewed distribution of data between classes, is crucial because many machine learning algorithms perform poorly in such scenarios. As a solution, adaptations of the Synthetic Minority Oversampling Technique (SMOTE) are applied in this research. Unlike conventional oversampling, which duplicates existing records, SMOTE generates synthetic instances for the minority class. Nearest neighbors are found using Euclidean distance, and each synthetic instance is produced in two steps: (1) compute the difference between a minority-class feature vector and one of its nearest neighbors; (2) multiply this difference by a random value between 0 and 1 and add the result to the feature vector.
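The two interpolation steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of the basic SMOTE idea, not the adapted variants used in the study; the function name `smote_sample`, its parameters, and the toy points are hypothetical.

```python
import numpy as np


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic minority points by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority-class points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                     # pick a minority point
        neighbor = neighbors[base, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                         # random value in [0, 1)
        # Step 1: difference to the neighbor. Step 2: scale and add.
        synthetic[i] = minority[base] + gap * (minority[neighbor] - minority[base])
    return synthetic


# Toy minority-class points (hypothetical values).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_sample(minority, k=2, n_new=6)
print(new_points.shape)
```

Because each synthetic point lies on the line segment between a real minority point and one of its neighbors, the new samples stay inside the region the minority class already occupies instead of being exact copies.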
Experimental analysis
On the imbalanced data, the evaluation uses repeated stratified k-fold cross-validation. Gaussian Naive Bayes shows high recall but lower accuracy and a lower share of correct positive predictions. Tree-based models, particularly the tuned random forest, achieve the best F1 score after hyperparameter tuning. On out-of-sample data, classifier performance degrades, which is attributed to evolving fraud behavior, although tree-based models remain the strongest.
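To make the trade-off between these metrics concrete, the following sketch computes recall, precision, and F1 from scratch on made-up labels. The two "classifiers" are hypothetical extremes invented for illustration and do not correspond to any model in the study.

```python
def recall_precision_f1(y_true, y_pred):
    """Return (recall, precision, F1) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Made-up ground truth: 2 fraudulent sellers out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical model that flags every seller: perfect recall, poor precision.
flag_everything = [1] * 10
# Hypothetical cautious model: flags only one seller, and gets it right.
selective = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

aggressive_scores = recall_precision_f1(y_true, flag_everything)
selective_scores = recall_precision_f1(y_true, selective)
print(aggressive_scores, selective_scores)
```

A model can reach very high recall simply by flagging many sellers, which is why the article reports F1 alongside recall when comparing classifiers.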
Turning to data balancing, data-level techniques such as Random Oversampling (ROS) outperform algorithm-level methods, while the balanced random forest algorithm excels at optimizing recall. The framework highlights the importance of feature selection, preprocessing, and class imbalance resolution, and underscores the need for regular retraining in a dynamic fraud detection landscape. It attains a leading recall score of 97.5% on in-sample data and 94.9% on out-of-sample data, compared to 92.8% and 81.9%, respectively.
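Random oversampling itself is simple enough to sketch directly in NumPy: minority rows are duplicated at random until every class is as large as the largest one. The helper name `random_oversample` and the toy data are hypothetical, and this is a bare-bones illustration rather than the implementation used in the study.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until all classes match."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement only the rows needed to reach the target size.
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))

    order = np.concatenate(selected)
    return X[order], y[order]


# Toy data: 8 legitimate rows (label 0) and 2 fraudulent rows (label 1).
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

In practice, resampling of this kind is applied only to the training portion of each split, so that duplicated rows never appear in both the training and evaluation data.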
Conclusion and future work
E-commerce platforms such as Meta's digital marketplace and eBay face ongoing cybersecurity challenges due to organized retail crime (ORC). Detecting fraudulent activity in this context is increasingly complex given the abundance of user data and transactions. The research presents an advanced fraud detection approach based on supervised machine learning, which surpasses traditional rule-based and unsupervised methods in accuracy and effectiveness.
The comprehensive framework integrates expert-derived feature discovery, customized data processing, imbalanced learning, careful feature and model selection, precise hyperparameter tuning, and business-relevant performance metrics to achieve superior results. The limitations of single-stage trials are addressed, setting the approach apart. While primarily utilizing numeric and categorical features, future research could investigate the efficacy of multimodal features to enhance ORC detection performance.
Journal reference:
- Mutemi, A., & Bacao, F. (2023). A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces. Scientific Reports, 13(1), 12499. DOI: 10.1038/s41598-023-38304-5, https://www.nature.com/articles/s41598-023-38304-5.