A study published in Scientific Data presents SynthAML, the first publicly available synthetic dataset to enable research on critical anti-money laundering (AML) challenges like efficiency, effectiveness, class imbalance, concept drift, and interpretability.
Because real bank transaction data is highly confidential, financial institutions develop their AML systems in isolation, without public benchmarks. This lack of shared data severely limits scientific progress on AML. To overcome this, the researchers synthesized a dataset based on genuine transactions from Denmark's Spar Nord Bank.
Banks worldwide must monitor transactions and report suspicious activity to authorities under the global AML framework. However, most regulators provide little guidance on how to build compliant AML systems. With no public data available due to privacy concerns, banks rely on in-house heuristics and rules, which obstructs standardized assessment.
The absence of public benchmarks also restricts academic research on pressing AML problems, such as handling extremely imbalanced data in which most clients are not laundering money. Machine learning models trained on such data can have very high false positive rates of up to 98%. Concept drift, wherein laundering typologies change over time, and model interpretability for trust and transparency are other key challenges.
However, publishing real data, even in anonymized form, remains infeasible because de-anonymization attacks have been shown to succeed. Simulated datasets like PaySim and AMLSim have been proposed as alternatives but are not grounded in real data. A synthesis approach built on real data is therefore needed to produce realistic AML data for public research.
Generating Synthetic Data
The study employs the Synthetic Data Vault (SDV) library to learn a probabilistic model from real data supplied by Denmark's Spar Nord Bank. The data comprises over 20,000 AML alerts on private clients from 2020-2021 and more than 16 million transactions.
SDV uses conditional parameter aggregation and Gaussian copulas to capture dependencies between the alerts and associated transaction sequences. It estimates multivariate distributions and covariances to simulate new samples mimicking the data.
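The Gaussian copula idea can be illustrated with a minimal two-variable sketch using only the Python standard library. This is not SDV's implementation or the authors' code: the function names, the two marginals, and the correlation value are all hypothetical and chosen purely for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to map draws to uniforms

def gaussian_copula_pairs(n, rho, inv_cdf_a, inv_cdf_b, seed=0):
    """Sample n pairs whose dependence follows a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    eps = 1e-12  # keeps uniforms strictly inside (0, 1) for the inverse CDFs
    pairs = []
    for _ in range(n):
        # Correlated standard normals via the 2x2 Cholesky factor of [[1, rho], [rho, 1]].
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map each normal draw to a uniform, then to its target marginal.
        u1 = min(max(STD_NORMAL.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(STD_NORMAL.cdf(z2), eps), 1.0 - eps)
        pairs.append((inv_cdf_a(u1), inv_cdf_b(u2)))
    return pairs

def exponential_inv_cdf(u, mean=3.0):
    """Inverse CDF of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - u)

# Hypothetical marginals: a standardized size (normal) and a waiting time (exponential).
pairs = gaussian_copula_pairs(2000, 0.8, NormalDist(0.0, 1.0).inv_cdf, exponential_inv_cdf)
```

The copula separates the two modeling steps: each feature keeps its own marginal distribution, while the correlation of the underlying normals carries the dependence between features.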
The synthetic SynthAML dataset contains alert and transaction tables with the same features as the original data. Dates and categorical variables are synthesized numerically and decoded post-simulation for added entropy and confidentiality. Small perturbations are also applied to transaction sizes and alert outcomes. The resulting SynthAML dataset promises to enable investigations into multiple open AML research problems.
Experimental Protocol and Data Records
The source paper details the experimental protocol the researchers followed and the structure of the generated data records.
The researchers used conditional parameter aggregation and Gaussian copulas in the SDV library to capture dependencies between the alert and transaction tables. The Kolmogorov-Smirnov test was used to select the best-fitting univariate distribution for each feature. A covariance matrix was estimated using Gaussian copulas to model correlations. Dates and categorical variables were encoded numerically for synthesis and decoded post-simulation.
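As a rough illustration of distribution selection with the Kolmogorov-Smirnov statistic, the sketch below fits a few candidate distributions to a sample and keeps the one with the smallest statistic. The candidate set, the moment-based parameter estimates, and the toy data are assumptions made here; SDV's own candidate families and fitting procedure may differ.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def select_marginal(sample):
    """Return (best_name, scores) over a small, hypothetical set of candidates.

    Assumes a non-constant sample; the exponential candidate is only
    considered when every value is non-negative.
    """
    mu, sigma = fmean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    candidates = {
        "normal": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: min(max((x - lo) / (hi - lo), 0.0), 1.0),
    }
    if lo >= 0.0:
        candidates["exponential"] = lambda x: 1.0 - math.exp(-x / mu)
    scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores

# Toy data generated here for illustration; not taken from the paper.
rng = random.Random(0)
sizes = [rng.expovariate(1.0 / 250.0) for _ in range(2000)]
best, scores = select_marginal(sizes)
```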
The SynthAML data records consist of two CSV files: one for synthetic alerts and one for transactions. The alerts file contains each alert's ID, date, and outcome, with dates given at quarterly resolution. The transaction file provides a timestamp, entry type, transaction type, and size for each transaction associated with an alert ID. Transaction sizes are log-transformed and standardized.
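A log-transform followed by standardization can be sketched as below. The article does not state the logarithm base, any offset, or which standard deviation estimator was used, so the natural log and population standard deviation here are assumptions, as are the function name and the toy values.

```python
import math
from statistics import fmean, pstdev

def log_standardize(sizes):
    """Natural-log transform, then rescale to zero mean and unit variance.

    Assumes every size is strictly positive and that the sizes are not all equal.
    """
    logs = [math.log(s) for s in sizes]
    mu, sigma = fmean(logs), pstdev(logs)
    return [(v - mu) / sigma for v in logs]

# Toy amounts spanning several orders of magnitude (illustrative only).
raw_sizes = [10.0, 100.0, 1000.0, 10000.0]
z = log_standardize(raw_sizes)
```

The log step compresses the heavy right tail typical of monetary amounts, and standardization removes the original scale, which also means the published sizes cannot be read directly as currency amounts.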
For technical validation, the distributions of synthetic transactions were compared to the original real data in terms of sizes, types, entries, and temporal alert patterns. The researchers also ran machine learning experiments comparing predictive models built on the synthetic dataset with models built on the real data. Multiple classification models were trained and tested on splits respecting the quarterly synthesized dates.
The comprehensive protocol and dataset details provide transparency on the rigorous generative process and data schema supporting SynthAML's utility for enabling robust and reproducible benchmarking of anti-money laundering systems on realistic synthetic data.
Technical Validation
The researchers validate that SynthAML approximately preserves the distributional characteristics of the original data based on transaction sizes, types, entries, and temporal alert patterns. Machine learning experiments demonstrate that model performance on SynthAML transfers to the real data. Classifiers trained on synthetic samples and tested on real samples show similar generalization, and a similar ranking of accuracies, to classifiers trained directly on real data.
The validation results confirm SynthAML's utility for elucidating long-standing AML research problems:
- Class imbalance can be analyzed by subsetting the data to vary proportions of alerts indicating money laundering from a low realistic rate to higher balances.
- Concept drift may manifest in decreased test accuracy when training classifiers on past synthetic alerts to predict outcomes in subsequent quarters.
- Interpretability techniques like SHapley Additive exPlanations (SHAP) and layer-wise relevance propagation can be evaluated for explaining model predictions and enhancing trust.
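The class imbalance experiment in the first item above could be set up along the following lines. This is only a sketch: the field names, the 0/1 coding of the outcome, and the toy alert list are assumptions, not the actual SynthAML schema or base rate.

```python
import random

def subsample_to_rate(alerts, target_rate, seed=0):
    """Keep every positive alert and downsample negatives so that positives
    make up roughly target_rate of the returned subset."""
    rng = random.Random(seed)
    positives = [a for a in alerts if a["outcome"] == 1]
    negatives = [a for a in alerts if a["outcome"] == 0]
    n_neg = round(len(positives) * (1.0 - target_rate) / target_rate)
    if n_neg > len(negatives):
        raise ValueError("target_rate is below the positive rate already in the data")
    subset = positives + rng.sample(negatives, n_neg)
    rng.shuffle(subset)
    return subset

# Toy alerts: 100 positives out of 1,000 (an illustrative rate, not the dataset's).
alerts = [{"alert_id": i, "outcome": 1 if i < 100 else 0} for i in range(1000)]
balanced = subsample_to_rate(alerts, 0.5)
quarter_positive = subsample_to_rate(alerts, 0.25)
```

Sweeping target_rate from the realistic base rate up to 0.5 gives a series of training sets on which the same classifier can be compared.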
The researchers suggest splitting the temporal data, respecting the quarterly synthesized dates for valid experiments. SynthAML advances AML research by enabling reproducible public benchmarking on a privacy-preserving synthetic clone of real-world data.
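A temporal split that respects quarterly dates might look like the sketch below. The "YYYY-Qn" date format, the field names, and the toy alerts are hypothetical; the actual encoding in the published CSV files may differ.

```python
def quarter_key(label):
    """Turn a label such as '2020-Q3' into a sortable (year, quarter) tuple."""
    year, quarter = label.split("-Q")
    return int(year), int(quarter)

def temporal_split(alerts, first_test_quarter):
    """Train on alerts before first_test_quarter, test on that quarter onward."""
    cutoff = quarter_key(first_test_quarter)
    train = [a for a in alerts if quarter_key(a["date"]) < cutoff]
    test = [a for a in alerts if quarter_key(a["date"]) >= cutoff]
    return train, test

# Toy alerts for illustration only.
alerts = [
    {"alert_id": 1, "date": "2020-Q1", "outcome": 0},
    {"alert_id": 2, "date": "2020-Q3", "outcome": 1},
    {"alert_id": 3, "date": "2020-Q4", "outcome": 0},
    {"alert_id": 4, "date": "2021-Q1", "outcome": 0},
    {"alert_id": 5, "date": "2021-Q2", "outcome": 1},
]
train, test = temporal_split(alerts, "2021-Q1")
```

Because no test alert predates any training alert, a drop in test accuracy relative to a random split is one possible sign of concept drift.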
Future Outlook
While promising, SynthAML has limitations typical of simulated data. It does not represent clients who were never under investigation, and data from a single Danish bank may not generalize across geographies and jurisdictions.
Extending SynthAML with diverse real-world sources and integrating additional typology factors could improve coverage. Future work must also establish frameworks for securely sharing the transformations used to synthesize banking data without revealing sensitive source information. With growing adoption, standardized synthetic data generation could profoundly accelerate public-private collaboration on urgent AML research challenges.
This research marks an important milestone in AML research by introducing SynthAML, the first open synthetic AML dataset based on genuine bank transactions. Extensive experiments demonstrate that models trained on SynthAML transfer reasonably to the original data, validating its utility for public benchmarking of AML systems. Despite limitations, SynthAML opens up crucial new avenues for collaborative innovation on long-standing research problems in anti-money laundering. With rigorous privacy safeguards, such public-private partnerships can help achieve the full potential of data science for combating financial crime and protecting the integrity of the global financial system.
Journal reference:
- Jensen, R. I. T., Ferwerda, J., Jørgensen, K. S., Jensen, E. R., Borg, M., Krogh, M. P., Jensen, J. B., & Iosifidis, A. (2023). A synthetic data set to benchmark anti-money laundering methods. Scientific Data, 10(1), 661. https://doi.org/10.1038/s41597-023-02569-2, https://www.nature.com/articles/s41597-023-02569-2