
Threat Domain Partitioning and Sorted Rejection Labeling

Benchmarking for Adversarial Environments

Charles Yeh, Daniel Lee, Hongkai Pan


Abstract

Several distinctions make fraud detection different from other domains such that conventional machine learning classification metrics become impractical: adversarial adaptation, expensive labels, class imbalance, and unseen classes. These distinctions make many conventional classification metrics not only difficult to compute but even misleading. We present a practical framework that offers cheaper and more consistent benchmarks for such models. Application of the framework on real-world fraud detection systems demonstrates significant reductions in labeling costs and much more consistent benchmarks while maintaining rigorous product standards, enabling rapid deployment cycles that match the pace of adversarial adaptation.


Introduction

We work on a wide range of fraud detection models, including selfie liveness and ID verification. As AI-driven fraud has grown more dynamic and sophisticated, manual labeling in these use cases has become more uncertain and costly to implement. At the same time, conventional machine learning classification metrics fail in adversarial environments where intelligent attackers rapidly adapt their strategies.

This paper introduces a framework that fundamentally changes how we evaluate and deploy models in these environments. Our framework centers on two key innovations: threat domain partitioning, which comprehensively organizes the space of possible attacks into manageable categories, and sorted rejection labeling, which efficiently measures system performance by focusing evaluation effort on the highest-risk cases.

We'll dive into the two innovations after discussing three general challenges in building fraud detection models.

1. Identifying metrics stable across performative drift

Fraud detection benchmarking in machine learning literature typically relies on conventional classification metrics for evaluation, such as precision, recall, F1-score, and area under the curve (AUC) score. While some studies acknowledge the limitations of these metrics and supplement their analysis with additional measures, nearly all evaluations are conducted on static test sets that fail to account for performative drift, in which a model's predictions influence the environment, causing adversaries to adapt and metrics to shift.

In the performative prediction framework, a model's predictions directly influence the environment: when deployed, adversaries probe the model, observe outcomes, and adapt their tactics accordingly. This adaptation causes conventional classification metrics to shift dramatically between pre-deployment and post-deployment measurements, rendering deployment decisions based on these metrics potentially misinformed.

The flickering attack vector

[Interactive figure: a slider simulates time passing after deployment, starting at Day 0 with 950 attacks caught (TP) and a measured precision of 95.0%.]

As the model successfully deters attackers, the number of attacks drops. Attacks decrease because fraudsters give up, which is a success. Yet measured precision plummets artificially: precision is TP / (TP + FP), and true positives (TP) shrink while false positives (FP) remain constant. Once precision drops below 50%, a conventional dashboard would suggest rolling back the model.

Precision and recall measured around both deployment and rollback tell completely contradictory narratives and fail to capture the model's true operational value. Conventional classification metrics measured on static test sets fail to capture performative drift. This paper addresses these challenges by providing metrics that remain stable in dynamic environments and can therefore serve as clear and efficient decision criteria for deployment.
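The arithmetic behind this effect can be sketched in a few lines. The false positive count of 50 is inferred from the Day 0 figures above (950 true positives at 95.0% precision); the later true positive counts are hypothetical values chosen only to illustrate the trend, not measurements.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# Inferred from Day 0: 950 / (950 + FP) = 0.95 gives FP = 50.
FALSE_POSITIVES = 50

# Hypothetical decline in caught attacks as fraudsters give up.
for tp in (950, 500, 200, 50, 25):
    print(f"TP={tp:>4}  FP={FALSE_POSITIVES}  precision={precision(tp, FALSE_POSITIVES):.1%}")
```

Nothing about the model changes between rows; only the volume of attacks does, which is why precision alone is a poor rollback signal here.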

2. Improving labeling efficiency under class imbalance

In fraud detection scenarios, ground-truth labels may be scarce and expensive to obtain, even though conventional classification metrics necessitate ground-truth labels for the entire test set. Furthermore, the rate of bad actors can range from 0.01% to 1%, depending on the use case, implying that one might need to label anywhere from 100 to 10,000 instances, on average, to obtain a single bad actor label.

Ensuring sufficiently large samples for all classes is impractical in these scenarios, which motivates the need for procedures that operate more efficiently in the presence of such class imbalance.

The rarity of fraud

At a 0.1% fraud rate, randomly sampling to calculate precision requires labeling about a thousand instances on average, almost all of them legitimate, just to find a single fraudulent one.
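As a rough check on these numbers, the expected labeling cost under uniform random sampling is the mean of a geometric distribution, 1 / fraud rate. The sketch below assumes independent draws from a population with a fixed fraud rate; the function name is illustrative.

```python
def expected_labels_per_fraud(fraud_rate: float) -> float:
    """Expected number of random draws until the first fraud instance (geometric mean)."""
    if not 0 < fraud_rate <= 1:
        raise ValueError("fraud_rate must be in (0, 1]")
    return 1 / fraud_rate


for rate in (0.0001, 0.001, 0.01):
    labels = expected_labels_per_fraud(rate)
    print(f"fraud rate {rate:.2%}: about {labels:,.0f} labels per fraud instance found")
```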

3. Monitoring coverage rate across all attack vectors

Live fraud data is inherently sparse because fraudsters identify the most vulnerable attack vectors and exploit them at scale, meaning observed fraud data never covers the full range of possibilities. This necessitates a comprehensive taxonomy of potential attack vectors beyond what has been observed in the wild.

Identifying which attack vectors are most vulnerable at any given time is crucial for proactive defense and system prioritization. However, conventional classification metrics cannot be partitioned or aggregated across models, impeding our ability to track attack vector coverage over time. We require metrics that operate at the model level but also enable holistic analysis across all possible attack vectors.

Identity fraud vectors rise and fall in popularity as defenses adapt and new vulnerabilities emerge.

Threat domain partitioning for bad actor coverage

The first key innovation in our framework is threat domain partitioning prior to model development, which comprehensively organizes the space of possible attacks into manageable categories.

Example: selfie liveness verification

The attack surface for selfie liveness verification is expansive and rapidly evolving, particularly with recent advances in deepfake and generative AI technologies. Our threat domain taxonomy was initially constructed from broad attack categories and iteratively refined through continuous observation of emerging fraud patterns. The top-level partitions comprise:

Generative AI: Synthetically generated images from generative models designed to mimic authentic selfies.
Digital tampering: Manipulation of genuine images through deepfakes, inpainting, or outpainting techniques to alter identity or appearance.
Digital render: Computer-generated avatars, 3D models, or video game screenshots submitted as identity proof.
Replay: Pre-recorded images or videos presented to circumvent liveness detection mechanisms.
Physical replica: Physical artifacts such as silicone masks or printed photographs used for impersonation.
Evasion: Deliberate modification of physical attributes through makeup, face paint, or occlusion to evade detection.

Generative AI represents the most dynamic partition within our threat domain taxonomy, reflecting both the rapid evolution of generative models and the increasing accessibility of AI-based attack tools. This partition is hierarchically subdivided by underlying model architecture, with unknown or unclassified instances temporarily assigned to a residual category pending further analysis. Representative subpartitions within the Generative AI domain include:

GAN-based models: Generative Adversarial Network architectures
- StyleGAN
- ProGAN
Diffusion-based models: Denoising diffusion probabilistic models
- Stable Diffusion
- Midjourney
- DALL-E
- Gemini
- Sora

Note that individual fraud instances may exhibit characteristics spanning multiple partitions simultaneously, as adversaries deliberately combine techniques to obfuscate their methods and evade detection across multiple defensive layers.

By establishing a comprehensive taxonomy, we can evaluate coverage rate across all attack vectors, including the ones that attackers are not exploiting actively.
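One way to picture such a taxonomy is as a nested mapping whose leaves are the finest partitions, with each fraud instance tagged by one or more partition paths. The sketch below is an illustration only: the data structure, the function names, the "Unclassified" label for the residual category, and the sample tags are all assumptions, not the authors' implementation.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Path = Tuple[str, ...]

# Partition names follow the lists above; "Unclassified" stands in for the residual category.
TAXONOMY: dict = {
    "Generative AI": {
        "GAN-based models": {"StyleGAN": {}, "ProGAN": {}},
        "Diffusion-based models": {
            "Stable Diffusion": {}, "Midjourney": {}, "DALL-E": {}, "Gemini": {}, "Sora": {},
        },
        "Unclassified": {},
    },
    "Digital tampering": {},
    "Digital render": {},
    "Replay": {},
    "Physical replica": {},
    "Evasion": {},
}


def leaf_paths(tree: dict, prefix: Path = ()) -> Iterator[Path]:
    """Yield the path to every leaf partition in the taxonomy."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def uncovered_partitions(tree: dict, samples: Iterable[Set[Path]]) -> List[Path]:
    """Return leaf partitions that no evaluation sample is tagged with."""
    seen: Set[Path] = set()
    for tags in samples:
        seen.update(tags)
    return [path for path in leaf_paths(tree) if path not in seen]


# Hypothetical evaluation samples; the second one spans two partitions at once.
samples = [
    {("Generative AI", "GAN-based models", "StyleGAN")},
    {("Generative AI", "Diffusion-based models", "Stable Diffusion"), ("Digital tampering",)},
    {("Replay",)},
]

for path in uncovered_partitions(TAXONOMY, samples):
    print("no samples yet:", " > ".join(path))
```

Listing the empty leaves makes gaps visible even for attack vectors that have not yet been observed in live traffic.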

Traffic router

Physical presentation (printed masks, 3D mannequins, screen replays): capture rate 98.2%
Digital manipulation (deepfakes, camera injections, image warping): capture rate 99.1%
Identity theft (stolen IDs, synthetic IDs, data breaches): capture rate 96.5%

All analysis and benchmarking centers on fraud capture rate (FCR), defined as the percentage of correctly identified bad actors. This metric is tracked both holistically across the threat domain and individually per threat domain partition. We use this to help design new models by prioritizing partitions based on vulnerability, training specialized models on vulnerable partitions, and updating the model's metrics.
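A minimal sketch of tracking FCR both ways might look like the following. It assumes each known bad actor sample carries a set of partition tags and a flag for whether the system caught it; the function name and the sample data are hypothetical. Because an instance can belong to several partitions, the holistic FCR counts each instance once, while each per-partition FCR counts it in every partition it is tagged with.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def fcr_report(samples: Iterable[Tuple[Set[str], bool]]) -> Tuple[float, Dict[str, float]]:
    """Return (overall FCR, per-partition FCR) for known bad actor samples."""
    caught_by: Dict[str, int] = defaultdict(int)
    total_by: Dict[str, int] = defaultdict(int)
    total = 0
    caught_total = 0
    for partitions, caught in samples:
        total += 1
        caught_total += caught  # True counts as 1
        for partition in partitions:
            total_by[partition] += 1
            caught_by[partition] += caught
    overall = caught_total / total if total else float("nan")
    per_partition = {p: caught_by[p] / total_by[p] for p in total_by}
    return overall, per_partition


# Hypothetical evaluation results: (partition tags, caught by the system?)
results = [
    ({"Replay"}, True),
    ({"Replay"}, True),
    ({"Generative AI"}, False),
    ({"Generative AI", "Digital tampering"}, True),
    ({"Physical replica"}, True),
]

overall, per_partition = fcr_report(results)
print(f"overall FCR: {overall:.0%}")
# Lowest FCR first, so the most vulnerable partitions surface at the top.
for name, rate in sorted(per_partition.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.0%}")
```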

Due to the dynamic and shifting nature of bad actors, we construct and use a representative set for evaluation rather than relying on sampling from the live population. This test set serves as a comprehensive taxonomy of observed attack vectors and is continuously curated and partitioned into threat domain partitions at multiple levels of granularity. We regularly update the taxonomy by classifying new fraud categories, refining partitions into more specific ones as needed, and publishing new partitions.

Threat domain partitioning and model development operate in two perpetual cycles that continuously inform and advance one another. This reframes the unseen classes problem into a problem space that can be solved by iteratively shipping models to systematically expand FCR over time, rather than every deployment being a careful balance between precision and recall.

Sorted rejection labeling for good actor protection

The second key innovation in our framework is sorted rejection labeling during deployment assessment. Unlike bad actors, good actors do not actively adapt to deployed defenses and are therefore less affected by performative drift, allowing us to center evaluation on a predetermined false rejection rate (FRR), defined as the percentage of legitimate users incorrectly flagged (false positives/total).

Choosing the right FRR target involves balancing security against user experience. A higher FRR target flags more sessions and so provides better fraud protection, but it frustrates more legitimate users. A lower FRR target improves the experience for legitimate users but may let more fraud through.

Organizations typically set FRR targets based on business context: high-value transactions may warrant stricter thresholds, while onboarding flows might prioritize conversion. Our framework makes these trade-offs explicit and measurable, allowing teams to make informed decisions about where to set boundaries.

Setting a threshold for sorted rejection labeling

Step 1. Using good actor samples from the live population, identify the score threshold that comes as close as possible to the target FRR while remaining within it.

Step 2. Using bad actor samples from the threat domain partitions, calculate the FCR at that same score threshold.

[Interactive figure: scores ranked from low risk to high risk, with the threshold set at 0.1% FRR, which results in an FCR over 60%.]
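The two steps can be sketched as follows, assuming higher scores mean higher risk, that a session is rejected when its score is strictly above the cutoff, and that FRR is measured over the good actor samples. The function names and the toy scores are hypothetical, and the 20% target is used only so the small example stays readable.

```python
import math
from typing import Sequence


def cutoff_at_frr(good_scores: Sequence[float], target_frr: float) -> float:
    """Step 1: lowest cutoff whose FRR on good actors stays within target_frr."""
    ranked = sorted(good_scores, reverse=True)
    # Number of good actors we may reject; floor keeps us within the target.
    budget = math.floor(target_frr * len(ranked))
    if budget >= len(ranked):
        return -math.inf  # the target allows rejecting every sample
    # Rejecting scores strictly above this value rejects at most `budget` good actors.
    return ranked[budget]


def rate_above(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of scores strictly above the cutoff."""
    return sum(score > cutoff for score in scores) / len(scores)


# Hypothetical scores for illustration only.
good_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.80, 0.90]
bad_scores = [0.95, 0.85, 0.60, 0.45, 0.30]

cutoff = cutoff_at_frr(good_scores, target_frr=0.20)
frr = rate_above(good_scores, cutoff)   # Step 1: realized FRR on good actors
fcr = rate_above(bad_scores, cutoff)    # Step 2: FCR at the same cutoff
print(f"cutoff={cutoff}  FRR={frr:.0%}  FCR={fcr:.0%}")
```

Tied scores at the cutoff are left unrejected, which can make the realized FRR fall slightly below the target but never above it.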

With this approach, we don't need to manually label every transaction. Instead, we can focus on labeling the highest-risk cases until we count enough false rejections to hit the FRR target. In practice, this means labeling a small fraction of the data rather than the whole dataset.
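One possible reading of this procedure, sketched under stated assumptions: rank the live population by score, label from the riskiest case downward, and stop as soon as one more false rejection would exceed the budget implied by the FRR target. The `is_good` callback stands in for a human labeler, FRR is taken as false positives over the total population, and all names and data are hypothetical rather than the authors' implementation.

```python
import math
from typing import Callable, Sequence, Tuple


def sorted_rejection_labeling(
    scores: Sequence[float],
    is_good: Callable[[int], bool],
    target_frr: float,
) -> Tuple[float, int]:
    """Return (cutoff, labels_used); sessions with score > cutoff are rejected."""
    budget = math.floor(target_frr * len(scores))  # allowed false rejections
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    false_rejections = 0
    labels_used = 0
    for index in order:
        labels_used += 1  # one manual label spent
        if is_good(index):
            false_rejections += 1
            if false_rejections > budget:
                # This good actor must not be rejected, so the cutoff sits at its score.
                return scores[index], labels_used
    return -math.inf, labels_used


# Hypothetical population: 990 good actors with low scores, 10 bad actors with high scores.
good = [i / 1000 for i in range(990)]
bad = [0.95 + j * 0.005 for j in range(10)]
scores = good + bad
truth = [True] * len(good) + [False] * len(bad)

cutoff, labels_used = sorted_rejection_labeling(scores, lambda i: truth[i], target_frr=0.001)
print(f"cutoff={cutoff}  labels used={labels_used} of {len(scores)}")
```

The labeling effort scales with the false rejection budget and the number of high-scoring cases above the cutoff, not with the size of the population.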

Conclusion

Application of the framework on real-world fraud detection systems demonstrates significant reductions in labeling costs and much more consistent benchmarks while maintaining rigorous product standards, enabling rapid deployment cycles that match the pace of adversarial adaptation.

Threat domain partitioning's core strength lies in its ability to identify and highlight vulnerable attack vectors across the entire possible attack space. This capability is essential for ensuring impactful model development and enables parallel model design, where separate teams may train and evaluate models in isolated environments. Crucially, the results and progress from each independent team fit together within a cohesive, overarching framework: the collective outputs from all teams combine to provide comprehensive coverage of the entire threat landscape.

Sorted rejection labeling's key advantage is its ability to target and control the false rejection rate (FRR), grounding the overall framework in a common business objective. It enables efficient model evaluation with metrics that remain stable in the face of performative drift. By explicitly setting thresholds that correspond to concrete FRR values such as 0.01% or 0.1%, organizations can predetermine the exact percentage of users who may experience friction or intervention due to the deployed system.

Together, they form our framework for consistent, repeatable, and scalable model development and deployment in the face of rapid adversarial adaptation.
