A breakthrough AI technique from Florida Atlantic University is revolutionizing fraud detection by swiftly and accurately labeling fraudulent activity in massive, unlabeled datasets—cutting down on false alarms and manual work for industries hit hardest by scams.
Research: Unsupervised label generation for severely imbalanced fraud data.
Fraud is widespread in the United States and increasingly driven by technology. For example, 93% of credit card fraud now involves remote account access, not physical theft. In 2023, fraud losses surpassed $10 billion for the first time. The financial toll is staggering: credit card fraud costs $5 billion annually, affecting 60% of U.S. cardholders, while identity theft resulted in $16.4 billion in losses in 2021. Medicare fraud costs $60 billion annually, and government losses range from $233 billion to $521 billion annually, with improper payments totaling $2.7 trillion since 2003.
Machine learning plays a critical role in fraud detection by identifying patterns and anomalies in real time. It analyzes large datasets to learn what normal behavior looks like and flags significant deviations, such as unusual transactions or account access. However, fraud detection is challenging because fraud cases are far rarer than legitimate ones, and the data is often messy or unlabeled.
New AI Method for Fraud Detection
To address these challenges, researchers from the College of Engineering and Computer Science at Florida Atlantic University have developed a novel method for generating binary class labels in highly imbalanced datasets. This offers a promising solution for fraud detection in industries like health care and finance. The technique works without relying on labeled data, a key advantage in sectors where privacy concerns and labeling costs are significant obstacles.
The team tested their method on two real-world, large-scale datasets with severe class imbalance, in which fraud makes up less than 0.2% of records: European credit card transactions (over 280,000 records from September 2013) and Medicare Part D claims (more than 5 million from 2013 to 2019). Both datasets contain known fraud and genuine labels, which makes them well suited to evaluating fraud detection methods.
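To see why an imbalance this severe is hard to work with, consider a rough arithmetic sketch. The counts below are hypothetical, chosen only to fall under the 0.2% figure; they are not the datasets' actual numbers. A "detector" that labels every record as genuine scores almost perfectly on accuracy while catching no fraud at all.

```python
# Hypothetical counts chosen to sit below the 0.2% imbalance mentioned
# in the article; they are NOT the real dataset figures.
total_records = 280_000
fraud_records = 500

fraud_rate = fraud_records / total_records

# A trivial "detector" that labels every record as genuine:
# every genuine record is right, every fraud record is missed.
true_negatives = total_records - fraud_records
accuracy = true_negatives / total_records
recall = 0 / fraud_records  # no fraud case is ever caught

print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of 'all genuine': {accuracy:.4%}")
print(f"fraud recall: {recall:.0%}")
```

This is why accuracy alone says little on such data, and why the number of false positives and missed fraud cases matters more.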
Study Results
Results, published in the Journal of Big Data, show that this new labeling method effectively addresses the challenge of labeling severely imbalanced data in an unsupervised framework. Unlike traditional methods, this approach directly evaluates newly generated fraud and non-fraud labels without requiring a supervised classifier.
“The use of machine learning in fraud detection brings many advantages,” said Taghi Khoshgoftaar, Ph.D., senior author and Motorola Professor in the FAU Department of Electrical Engineering and Computer Science. “Machine learning algorithms can label data much faster than human annotation, significantly improving efficiency. Our method represents a major advancement in fraud detection, especially in highly imbalanced datasets. It reduces the workload by minimizing cases that require further inspection, which is crucial in sectors like Medicare and credit card fraud.”
The study shows that the new method outperformed the widely used Isolation Forest algorithm, offering a more efficient way to identify fraud while minimizing the need for further investigation. It provides a scalable solution for fraud detection without relying on costly labeled data, which requires significant manual input.
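For readers unfamiliar with the baseline, here is a minimal sketch of how Isolation Forest is typically run with scikit-learn. The data is synthetic, and the planted outliers, the number of trees, and the choice to flag the ten lowest-scoring records are all assumptions for illustration. It does not reproduce the study's experimental setup.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): 500 "genuine" points
# around the origin plus 5 planted outliers far from the cluster.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array(
    [[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0], [-8.0, -9.0], [9.0, 0.0]]
)
X = np.vstack([normal, outliers])
planted = set(range(500, 505))  # row indices of the planted outliers

# Fit without any labels; the forest isolates unusual points in fewer splits.
model = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples returns lower values for more anomalous records.
scores = model.score_samples(X)

# Flag the 10 lowest-scoring records as suspected fraud.
flagged = set(np.argsort(scores)[:10].tolist())

print("planted outliers recovered:", len(flagged & planted), "of", len(planted))
```

Note that the cutoff (ten records here) has to be chosen without labels, so any flagged record beyond the true outliers becomes a false alarm that someone must review.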
Reducing False Positives
“Our method generates labels for both fraud (positive) and non-fraud (negative) instances, which are then refined to minimize the number of fraud labels,” said Mary Anne Walauskis, first author and Ph.D. candidate at FAU. “By applying our method, we minimize false positives — genuine instances marked as fraud — which is key to improving fraud detection.”
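The trade-off Walauskis describes can be illustrated with a small sketch. Because both test datasets include known fraud labels, generated labels can be compared against them directly. The function name `label_quality` and the toy label lists are hypothetical and are not taken from the paper.

```python
def label_quality(generated, truth):
    """Compare generated binary labels (1 = fraud) with known ground truth."""
    if len(generated) != len(truth):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(generated, truth))
    tp = sum(1 for g, t in pairs if g == 1 and t == 1)  # fraud correctly flagged
    fp = sum(1 for g, t in pairs if g == 1 and t == 0)  # genuine marked as fraud
    fn = sum(1 for g, t in pairs if g == 0 and t == 1)  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy example: 10 records, 3 of which are truly fraudulent.
truth   = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
broad   = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]  # many fraud labels, many false alarms
refined = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]  # fewer fraud labels, no false alarms

print("broad:  ", label_quality(broad, truth))
print("refined:", label_quality(refined, truth))
```

In this toy case the refined labels remove every false positive at the cost of missing one fraud case, which is the kind of trade-off a label-minimizing approach has to manage.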
The method retains only the most confidently identified fraud cases, which improves accuracy and reduces unnecessary alarms. It combines two strategies: an ensemble of three unsupervised learning techniques (implemented with the scikit-learn library) and a percentile-gradient approach. Together, these focus the final labels on the most confident fraud predictions.
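The general idea of combining several unsupervised detectors and keeping only the most confident cases can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the article does not name the three techniques, so the sketch takes three lists of anomaly scores as given, and it substitutes a plain top-percentile cutoff for the paper's percentile-gradient step. The helper names `rank_normalize` and `ensemble_labels` are hypothetical.

```python
def rank_normalize(scores):
    """Map raw anomaly scores to [0, 1] by rank (higher = more anomalous).

    Ties are broken by position in the list.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for position, index in enumerate(order):
        ranks[index] = position / (n - 1) if n > 1 else 0.0
    return ranks


def ensemble_labels(score_lists, top_percent):
    """Average rank-normalized scores and flag the top `top_percent` of records."""
    normalized = [rank_normalize(s) for s in score_lists]
    combined = [sum(col) / len(col) for col in zip(*normalized)]
    n_flag = max(1, round(len(combined) * top_percent / 100))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    labels = [0] * len(combined)
    for i in ranked[:n_flag]:
        labels[i] = 1  # 1 = labeled as fraud
    return labels, combined


# Made-up scores from three unnamed detectors for 10 records (assumption).
det_a = [0.10, 0.22, 0.91, 0.15, 0.30, 0.25, 0.95, 0.05, 0.20, 0.12]
det_b = [3.0, 1.0, 9.0, 2.0, 8.0, 1.5, 9.5, 0.5, 2.5, 1.2]
det_c = [12, 15, 80, 11, 20, 70, 95, 10, 14, 13]

labels, combined = ensemble_labels([det_a, det_b, det_c], top_percent=20)
strict_labels, _ = ensemble_labels([det_a, det_b, det_c], top_percent=10)
print(labels)
print(strict_labels)
```

Rank normalization puts detectors with different score scales on a common footing before averaging. Tightening the percentile keeps only the records that all three detectors rank as most anomalous.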
The refined labels form a subset that is highly likely to be accurate. These are used to create confidence intervals and finalize labeling, with minimal domain knowledge needed to estimate the number of positive cases.
Broader Implications
“This innovative approach holds great promise for industries plagued by fraud,” said Stella Batalama, Ph.D., dean of the College of Engineering and Computer Science. “Fraud’s impact goes beyond financial losses, including emotional distress, reputational damage, and reduced trust. Health care fraud undermines care quality and cost, while identity theft causes severe stress. Addressing fraud is key to mitigating these broader harms.”
The research team plans to enhance the method by automating the selection of the optimal number of positive instances, further improving efficiency and scalability for large-scale applications.
Publication and Recognition
The journal article, Unsupervised Label Generation for Severely Imbalanced Fraud Data, is an updated version of the team’s earlier paper, Confident Labels: A Novel Approach to New Class Labeling and Evaluation on Highly Imbalanced Data. The earlier version was presented at the IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI) in November 2024, where it won the Best Student Paper Award.