Causal Uplift Modeling · Criteo Dataset · 500K Records

Not who converts.
Who converts because of you.

Traditional ad targeting wastes spend on people who'd buy anyway. This causal ML system estimates individual treatment effects — isolating the users whose behavior actually changes when shown an ad — lifting ROAS from 2.44x → 9.60x while cutting spend by 75%.

CausalML · X-Learner LightGBM · XGBoost · 64% AUUC Improvement · SHAP Explainability · IPW Bias Correction · Fairness Validation
User segments · 75K validation set · X-Learner:
Sure Thing 50.6% · Persuadable 25.0% · Lost Cause 14.3% · Do-Not-Disturb 10.1%

9.60x Model ROAS · 75% Spend Saved
The Problem

Ads that go to the wrong people.

Standard conversion-rate targeting is blind to causality. It identifies users likely to convert — but that includes people who would have bought anyway, and misses people who only buy when prompted. The result is wasted budget on Sure Things and Do-Not-Disturbs, while Persuadables — the users where ads actually change behavior — go untargeted.

"At 0.294% conversion and a 339:1 class imbalance, finding the signal required IPW bias correction, propensity clipping, and scale_pos_weight=339 just to train a model that sees the minority class."

Training details — model_card.json
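A minimal sketch of that training recipe, assuming a pandas feature frame, a treatment flag, and a binary conversion label; the synthetic data and all names below are illustrative stand-ins, not the production pipeline:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Illustrative stand-in for the Criteo RCT sample.
rng = np.random.default_rng(0)
n = 100_000
X = pd.DataFrame(rng.normal(size=(n, 12)), columns=[f"f{i}" for i in range(12)])
treatment = rng.binomial(1, 0.85, n)      # 85/15 treatment split
conversion = rng.binomial(1, 0.00294, n)  # 0.294% conversion rate

# Propensity model P(treatment | X), clipped to keep IPW weights bounded.
prop = lgb.LGBMClassifier(n_estimators=200).fit(X, treatment)
p = np.clip(prop.predict_proba(X)[:, 1], 0.05, 0.95)

# Inverse-propensity weights: 1/p for treated rows, 1/(1-p) for control rows.
w = np.where(treatment == 1, 1.0 / p, 1.0 / (1.0 - p))

# Outcome model that actually sees the minority class:
# scale_pos_weight=339 matches the 339:1 imbalance.
clf = lgb.LGBMClassifier(n_estimators=500, scale_pos_weight=339)
clf.fit(X, conversion, sample_weight=w)
```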
500K — Records from the Criteo 14M RCT dataset · 85/15 treatment split
339:1 — Class imbalance · 0.294% conversion rate · scale_pos_weight=339
64% — AUUC improvement over S/T-Learner baseline · X-Learner + IPW
$42K — Net profit per 1.5M-user campaign cycle · Standard Rule

From correlation to causation.

The X-Learner with IPW bias correction is the core innovation. Unlike S- and T-Learners, which estimate uplift as a direct difference between outcome models, the X-Learner cross-estimates treatment effects — using the control model to impute counterfactual outcomes for treated users and vice versa — then blends the two resulting effect models with propensity weights. This dramatically improves ITE estimation on imbalanced RCT data.
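In code, the cross-estimation is two short stages. A condensed sketch, reusing X, treatment, conversion, and the clipped propensities p from the earlier training snippet (the LightGBM learners and the blending below are the textbook X-Learner recipe, not necessarily the exact production configuration):

```python
from lightgbm import LGBMClassifier, LGBMRegressor

t, y = treatment.astype(bool), conversion

# Stage 1: separate outcome models for control and treated users.
mu0 = LGBMClassifier().fit(X[~t], y[~t])  # control outcome model
mu1 = LGBMClassifier().fit(X[t], y[t])    # treated outcome model

# Stage 2: impute each user's counterfactual with the other group's model,
# then regress the imputed individual effects on X.
d1 = y[t] - mu0.predict_proba(X[t])[:, 1]    # treated: observed - imputed control
d0 = mu1.predict_proba(X[~t])[:, 1] - y[~t]  # control: imputed treated - observed
tau1 = LGBMRegressor().fit(X[t], d1)
tau0 = LGBMRegressor().fit(X[~t], d0)

# Blend the effect models with the propensity score g(x) = p:
# tau(x) = g(x) * tau0(x) + (1 - g(x)) * tau1(x)
tau = p * tau0.predict(X) + (1 - p) * tau1.predict(X)
```

CausalML wraps this same flow in its X-Learner meta-learners, which accept a propensity vector; that is presumably what the production pipeline uses.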

Qini Uplift Curve — X-Learner vs Random Targeting
Cumulative incremental conversions as % of population targeted · 75K validation set
Top 25% captures 98% of incremental conversions
Targeting the top 25% by predicted uplift (τ) captures nearly all incremental conversions while reaching only a quarter of users. The curve peaks around the 50th percentile — beyond that, adding more users starts including Do-Not-Disturbs who respond negatively.
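The curve itself reduces to a cumulative computation over users sorted by predicted uplift. A hedged sketch; the function name is illustrative:

```python
import numpy as np
import pandas as pd

def incremental_conversions_curve(tau, y, t):
    """Cumulative incremental conversions, highest predicted uplift first."""
    d = pd.DataFrame({"tau": tau, "y": y, "t": t}).sort_values("tau", ascending=False)
    n_t = d["t"].cumsum()                      # treated users seen so far
    n_c = (1 - d["t"]).cumsum()                # control users seen so far
    conv_t = (d["y"] * d["t"]).cumsum()        # treated conversions so far
    conv_c = (d["y"] * (1 - d["t"])).cumsum()  # control conversions so far
    # Treated conversions minus the control response scaled to treated volume.
    return (conv_t - conv_c * n_t / n_c.clip(lower=1)).to_numpy()
```

AUUC is the area under this curve relative to random targeting; CausalML ships ready-made Qini/AUUC metrics, which is presumably where the 64% figure comes from.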
Individual Treatment Effect (τ) Distribution
Predicted uplift scores across 75K validation users
mean τ = 0.052
Most users cluster near τ ≈ 0 — the ad doesn't change their behavior. The long right tail (τ > 0.3) holds high-confidence Persuadables; the left tail (τ < 0) holds Do-Not-Disturbs — users who convert less when shown ads.
Targeting Strategy

Four segments. One right answer each.

The model segments every user into one of four behavioral archetypes based on their estimated treatment effect τ. Each archetype demands a different action — and conflating them is exactly the mistake traditional targeting makes. A sketch of the assignment rule follows the four cards below.

🎯
Persuadable
25%
Convert because of the ad. High τ, low baseline. The only group where ad spend is causally justified.
→ TARGET
Sure Thing
50.6%
Would buy anyway. Showing them an ad wastes budget — they don't need convincing.
→ SKIP / EFFICIENT
Lost Cause
14.3%
Won't convert regardless of treatment. Low τ, low baseline. No ROI upside.
→ SKIP
🚫
Do-Not-Disturb
10.1%
Negative τ — ads actively reduce conversion probability. Targeting these users costs money and loses sales.
→ PROTECT
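As referenced above, the archetype assignment reduces to a small decision rule over predicted uplift and predicted baseline conversion. A minimal sketch; the thresholds are illustrative, not the model card's actual cutoffs:

```python
def archetype(tau, base, tau_min=0.01, base_min=0.005):
    """Map (predicted uplift, predicted baseline conversion) to an archetype."""
    if tau < 0:
        return "Do-Not-Disturb"  # negative uplift: PROTECT from ads
    if tau >= tau_min:
        return "Persuadable"     # the ad causally changes behavior: TARGET
    if base >= base_min:
        return "Sure Thing"      # converts anyway: SKIP
    return "Lost Cause"          # converts under neither condition: SKIP
```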
Before — Target All
2.44x
Targeting 1.5M users indiscriminately. Budget spent on Sure Things and Do-Not-Disturbs who don't need ads.
$19,500 spend · 1,588 incr. conversions
After — Top 25% τ
9.60x
Targeting only 374K users with the highest predicted uplift. 98% of incremental conversions retained.
$4,874 spend · 1,559 incr. conversions
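Both scenarios are consistent with roughly $30 of revenue per incremental conversion, a value backed out of the figures above rather than stated anywhere in the section:

```python
VALUE_PER_CONVERSION = 30.0  # implied by the figures above (assumption)

def roas(spend, incremental_conversions):
    return incremental_conversions * VALUE_PER_CONVERSION / spend

print(f"{roas(19_500, 1_588):.2f}x")  # ~2.44x, target all 1.5M users
print(f"{roas(4_874, 1_559):.2f}x")   # ~9.60x, top 25% by predicted uplift
```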
Budget Strategy Comparison
ROAS vs spend across four targeting strategies · 1.5M user campaign
Strict Rule hits 12.9x ROAS
The Strict Rule (top 15%) achieves 12.9x ROAS but captures fewer conversions. The Standard Rule (top 25%) is the optimal balance — strong ROAS with near-complete conversion capture. Budget Rule extends reach with diminishing returns.
Governance

~80% automated. The rest, reviewed.

Not every decision should be automated. The system flags edge cases for human review — borderline scores, anomalously high uplift predictions, and users at risk of being misclassified as Do-Not-Disturb. Only clearly confident decisions are auto-approved; a sketch of the flag logic follows the three review categories below.

BORDERLINE
81.4%
Near-Threshold Users
τ within 0.5 std of the targeting threshold. Small model uncertainty could flip the decision — sent for human review.
HIGH_UPLIFT_CHECK
3.2%
Anomalously High τ
Predicted uplift exceeds mean + 3σ. Could be genuine high-value users or model extrapolation errors — flagged for verification.
DND_RISK
6.7%
Do-Not-Disturb Risk
Negative τ detected. Protecting these users from ads prevents conversion loss — reviewed to confirm before suppression.
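A sketch of the three flag rules as described above; ordering, names, and the auto-approval default are illustrative, with DND protection taking precedence:

```python
import numpy as np

def review_flags(tau, threshold):
    """One flag per user, or None when the decision is auto-approved."""
    mu, sd = tau.mean(), tau.std()
    flags = np.full(tau.shape, None, dtype=object)
    flags[np.abs(tau - threshold) < 0.5 * sd] = "BORDERLINE"  # near-threshold
    flags[tau > mu + 3 * sd] = "HIGH_UPLIFT_CHECK"            # mean + 3 sigma
    flags[tau < 0] = "DND_RISK"                               # negative uplift
    return flags
```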
Fairness Validation — Disparate Impact Audit
80% DI rule applied across 4 behavioral segments · Lo et al. (2024) framework
1/4 segments pass · Audit recommended
Segment             | Disparate Impact | Statistical Parity | 80% Rule | Status
Behavioral Signal 1 | 0.237            | 0.308              | FAIL     | ⚠️ Bias detected
Behavioral Signal 2 | 1.000            | 0.000              | PASS     | ✅ Fair
Behavioral Signal 3 | 0.392            | 0.192              | FAIL     | ⚠️ Bias detected
Engagement Level    | 0.682            | 0.095              | FAIL     | ⚠️ Bias detected
Fairness validation is a first-class output, not an afterthought. The model card explicitly flags that 3 of 4 segments require a demographic audit before production deployment — a standard no real system should skip.
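The disparate-impact number is the standard four-fifths-rule ratio of selection rates. A minimal sketch, assuming a pandas frame df with a binary targeted column and one protected or proxy attribute per audit row (names illustrative):

```python
import pandas as pd

def disparate_impact(targeted: pd.Series, group: pd.Series) -> float:
    """Min targeting rate across groups divided by the max rate."""
    rates = targeted.groupby(group).mean()
    return rates.min() / rates.max()

di = disparate_impact(df["targeted"], df["behavioral_signal_1"])
status = "PASS" if di >= 0.8 else "FAIL"  # 80% (four-fifths) rule
```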
Key Takeaways

What the causal lens changes.

🧠
Correlation ≠ Causation in Targeting
A user likely to convert is not the same as a user who converts because of an ad. Conflating these two signals is the core mistake in standard ML-based ad targeting — and it's expensive.
⚖️
IPW Matters at Scale
The Criteo dataset has an 85/15 treatment split that biases naive ITE estimates. IPW with propensity clipping [0.05, 0.95] corrects this — without it, the X-Learner's 64% AUUC gain shrinks to near-zero.
🛡️
Do-Not-Disturb Is a Revenue Signal
The 10.1% DND population isn't just a waste — it's a loss. Targeting them actively reduces conversions. Protecting them is both ethical and financially optimal.
📋
Fairness Before Deployment
The model card explicitly scopes out credit scoring and employment decisions, and flags 3 of 4 segments for demographic audit. Responsible AI isn't a checkbox — it's an output you ship alongside the model.