Causal Uplift Modeling · Criteo Dataset · 500K Records

Not who converts.
Who converts because of you.

Traditional ad targeting wastes spend on people who'd buy anyway. This causal ML system estimates individual treatment effects — isolating the users whose behavior actually changes when shown an ad — lifting ROAS from 2.44x → 9.60x while cutting spend by 75%.

CausalML · X-Learner LightGBM · XGBoost · 64% AUUC Improvement · SHAP Explainability · IPW Bias Correction · Fairness Validation
User segments · 75K validation set · X-Learner:
Sure Thing 50.6% · Persuadable 25.0% · Lost Cause 14.3% · Do-Not-Disturb 10.1%

9.60x Model ROAS · 75% Spend Saved
The Problem

Ads that go to the wrong people.

Standard conversion-rate targeting is blind to causality. It identifies users likely to convert — but that includes people who would have bought anyway, and misses people who only buy when prompted. The result is wasted budget on Sure Things and Do-Not-Disturbs, while Persuadables — the users where ads actually change behavior — go untargeted.

"At 0.294% conversion and a 339:1 class imbalance, finding the signal required IPW bias correction, propensity clipping, and scale_pos_weight=339 just to train a model that sees the minority class."

Training details — model_card.json
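A minimal sketch of that training recipe, assuming a pandas feature frame, a treatment flag, and a binary conversion label; the synthetic data and all names below are illustrative stand-ins, not the production pipeline:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Illustrative stand-in for the Criteo RCT sample.
rng = np.random.default_rng(0)
n = 100_000
X = pd.DataFrame(rng.normal(size=(n, 12)), columns=[f"f{i}" for i in range(12)])
treatment = rng.binomial(1, 0.85, n)      # 85/15 treatment split
conversion = rng.binomial(1, 0.00294, n)  # 0.294% conversion rate

# Propensity model P(treatment | X), clipped to keep IPW weights bounded.
prop = lgb.LGBMClassifier(n_estimators=200).fit(X, treatment)
p = np.clip(prop.predict_proba(X)[:, 1], 0.05, 0.95)

# Inverse-propensity weights: 1/p for treated rows, 1/(1-p) for control rows.
w = np.where(treatment == 1, 1.0 / p, 1.0 / (1.0 - p))

# Outcome model that actually sees the minority class:
# scale_pos_weight=339 matches the 339:1 imbalance.
clf = lgb.LGBMClassifier(n_estimators=500, scale_pos_weight=339)
clf.fit(X, conversion, sample_weight=w)
```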
500K — Records from the Criteo 14M RCT dataset · 85/15 treatment split
339:1 — Class imbalance · 0.294% conversion rate · scale_pos_weight=339
64% — AUUC improvement over S/T-Learner baseline · X-Learner + IPW
$42K — Net profit per 1.5M-user campaign cycle · Standard Rule

From correlation to causation.

The X-Learner with IPW bias correction is the core innovation. Unlike S- and T-Learners, which estimate uplift as a direct difference between outcome models, the X-Learner cross-estimates treatment effects — using the control model to impute counterfactual outcomes for treated users and vice versa — then blends the two resulting effect models with propensity weights. This dramatically improves ITE estimation on imbalanced RCT data.
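In code, the cross-estimation is two short stages. A condensed sketch, reusing X, treatment, conversion, and the clipped propensities p from the earlier training snippet (the LightGBM learners and the blending below are the textbook X-Learner recipe, not necessarily the exact production configuration):

```python
from lightgbm import LGBMClassifier, LGBMRegressor

t, y = treatment.astype(bool), conversion

# Stage 1: separate outcome models for control and treated users.
mu0 = LGBMClassifier().fit(X[~t], y[~t])  # control outcome model
mu1 = LGBMClassifier().fit(X[t], y[t])    # treated outcome model

# Stage 2: impute each user's counterfactual with the other group's model,
# then regress the imputed individual effects on X.
d1 = y[t] - mu0.predict_proba(X[t])[:, 1]    # treated: observed - imputed control
d0 = mu1.predict_proba(X[~t])[:, 1] - y[~t]  # control: imputed treated - observed
tau1 = LGBMRegressor().fit(X[t], d1)
tau0 = LGBMRegressor().fit(X[~t], d0)

# Blend the effect models with the propensity score g(x) = p:
# tau(x) = g(x) * tau0(x) + (1 - g(x)) * tau1(x)
tau = p * tau0.predict(X) + (1 - p) * tau1.predict(X)
```

CausalML wraps this same flow in its X-Learner meta-learners, which accept a propensity vector; that is presumably what the production pipeline uses.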

Qini Uplift Curve — X-Learner vs Random Targeting
Cumulative incremental conversions as % of population targeted · 75K validation set
Top 25% captures 98% of incremental conversions
Targeting the top 25% by predicted uplift (τ) captures nearly all incremental conversions while reaching only a quarter of users. The curve peaks around the 50th percentile — beyond that, adding more users starts including Do-Not-Disturbs who respond negatively.
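The curve itself reduces to a cumulative computation over users sorted by predicted uplift. A hedged sketch; the function name is illustrative:

```python
import numpy as np
import pandas as pd

def incremental_conversions_curve(tau, y, t):
    """Cumulative incremental conversions, highest predicted uplift first."""
    d = pd.DataFrame({"tau": tau, "y": y, "t": t}).sort_values("tau", ascending=False)
    n_t = d["t"].cumsum()                      # treated users seen so far
    n_c = (1 - d["t"]).cumsum()                # control users seen so far
    conv_t = (d["y"] * d["t"]).cumsum()        # treated conversions so far
    conv_c = (d["y"] * (1 - d["t"])).cumsum()  # control conversions so far
    # Treated conversions minus the control response scaled to treated volume.
    return (conv_t - conv_c * n_t / n_c.clip(lower=1)).to_numpy()
```

AUUC is the area under this curve relative to random targeting; CausalML ships ready-made Qini/AUUC metrics, which is presumably where the 64% figure comes from.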
Individual Treatment Effect (τ) Distribution
Predicted uplift scores across 75K validation users
mean τ = 0.052
Most users cluster near τ ≈ 0 — the ad doesn't change their behavior. The long right tail (τ > 0.3) holds high-confidence Persuadables; the left tail (τ < 0) holds Do-Not-Disturbs — users who convert less when shown ads.
Targeting Strategy

Four segments. One right answer each.

The model segments every user into one of four behavioral archetypes based on their estimated treatment effect τ. Each archetype demands a different action — and conflating them is exactly the mistake traditional targeting makes. A sketch of the assignment rule follows the four cards below.

🎯
Persuadable
25%
Convert because of the ad. High τ, low baseline. The only group where ad spend is causally justified.
→ TARGET
Sure Thing
50.6%
Would buy anyway. Showing them an ad wastes budget — they don't need convincing.
→ SKIP / EFFICIENT
Lost Cause
14.3%
Won't convert regardless of treatment. Low τ, low baseline. No ROI upside.
→ SKIP
🚫
Do-Not-Disturb
10.1%
Negative τ — ads actively reduce conversion probability. Targeting these users costs money and loses sales.
→ PROTECT
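As referenced above, the archetype assignment reduces to a small decision rule over predicted uplift and predicted baseline conversion. A minimal sketch; the thresholds are illustrative, not the model card's actual cutoffs:

```python
def archetype(tau, base, tau_min=0.01, base_min=0.005):
    """Map (predicted uplift, predicted baseline conversion) to an archetype."""
    if tau < 0:
        return "Do-Not-Disturb"  # negative uplift: PROTECT from ads
    if tau >= tau_min:
        return "Persuadable"     # the ad causally changes behavior: TARGET
    if base >= base_min:
        return "Sure Thing"      # converts anyway: SKIP
    return "Lost Cause"          # converts under neither condition: SKIP
```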
Before — Target All
2.44x
Targeting 1.5M users indiscriminately. Budget spent on Sure Things and Do-Not-Disturbs who don't need ads.
$19,500 spend · 1,588 incr. conversions
After — Top 25% τ
9.60x
Targeting only 374K users with the highest predicted uplift. 98% of incremental conversions retained.
$4,874 spend · 1,559 incr. conversions
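Both scenarios are consistent with roughly $30 of revenue per incremental conversion, a value backed out of the figures above rather than stated anywhere in the section:

```python
VALUE_PER_CONVERSION = 30.0  # implied by the figures above (assumption)

def roas(spend, incremental_conversions):
    return incremental_conversions * VALUE_PER_CONVERSION / spend

print(f"{roas(19_500, 1_588):.2f}x")  # ~2.44x, target all 1.5M users
print(f"{roas(4_874, 1_559):.2f}x")   # ~9.60x, top 25% by predicted uplift
```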
Budget Strategy Comparison
ROAS vs spend across four targeting strategies · 1.5M user campaign
Strict Rule hits 12.9x ROAS
The Strict Rule (top 15%) achieves 12.9x ROAS but captures fewer conversions. The Standard Rule (top 25%) is the optimal balance — strong ROAS with near-complete conversion capture. Budget Rule extends reach with diminishing returns.
Governance

~80% automated. The rest, reviewed.

Not every decision should be automated. The system flags edge cases for human review — borderline scores, anomalously high uplift predictions, and users at risk of being misclassified as Do-Not-Disturb. Only clearly confident decisions are auto-approved; a sketch of the flag logic follows the three review categories below.

BORDERLINE
81.4%
Near-Threshold Users
τ within 0.5 std of the targeting threshold. Small model uncertainty could flip the decision — sent for human review.
HIGH_UPLIFT_CHECK
3.2%
Anomalously High τ
Predicted uplift exceeds mean + 3σ. Could be genuine high-value users or model extrapolation errors — flagged for verification.
DND_RISK
6.7%
Do-Not-Disturb Risk
Negative τ detected. Protecting these users from ads prevents conversion loss — reviewed to confirm before suppression.
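A sketch of the three flag rules as described above; ordering, names, and the auto-approval default are illustrative, with DND protection taking precedence:

```python
import numpy as np

def review_flags(tau, threshold):
    """One flag per user, or None when the decision is auto-approved."""
    mu, sd = tau.mean(), tau.std()
    flags = np.full(tau.shape, None, dtype=object)
    flags[np.abs(tau - threshold) < 0.5 * sd] = "BORDERLINE"  # near-threshold
    flags[tau > mu + 3 * sd] = "HIGH_UPLIFT_CHECK"            # mean + 3 sigma
    flags[tau < 0] = "DND_RISK"                               # negative uplift
    return flags
```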
Fairness Validation — Disparate Impact Audit
80% DI rule applied across 4 behavioral segments · Lo et al. (2024) framework
1/4 segments pass · Audit recommended
Segment             | Disparate Impact | Statistical Parity | 80% Rule | Status
Behavioral Signal 1 | 0.237            | 0.308              | FAIL     | ⚠️ Bias detected
Behavioral Signal 2 | 1.000            | 0.000              | PASS     | ✅ Fair
Behavioral Signal 3 | 0.392            | 0.192              | FAIL     | ⚠️ Bias detected
Engagement Level    | 0.682            | 0.095              | FAIL     | ⚠️ Bias detected
Fairness validation is a first-class output, not an afterthought. The model card explicitly flags that 3 of 4 segments require a demographic audit before production deployment — a standard no real system should skip.
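The disparate-impact number is the standard four-fifths-rule ratio of selection rates. A minimal sketch, assuming a pandas frame df with a binary targeted column and one protected or proxy attribute per audit row (names illustrative):

```python
import pandas as pd

def disparate_impact(targeted: pd.Series, group: pd.Series) -> float:
    """Min targeting rate across groups divided by the max rate."""
    rates = targeted.groupby(group).mean()
    return rates.min() / rates.max()

di = disparate_impact(df["targeted"], df["behavioral_signal_1"])
status = "PASS" if di >= 0.8 else "FAIL"  # 80% (four-fifths) rule
```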
Key Takeaways

What the causal lens changes.

🧠
Correlation ≠ Causation in Targeting
A user likely to convert is not the same as a user who converts because of an ad. Conflating these two signals is the core mistake in standard ML-based ad targeting — and it's expensive.
⚖️
IPW Matters at Scale
The Criteo dataset has an 85/15 treatment split that biases naive ITE estimates. IPW with propensity clipping [0.05, 0.95] corrects this — without it, the X-Learner's 64% AUUC gain shrinks to near-zero.
🛡️
Do-Not-Disturb Is a Revenue Signal
The 10.1% DND population isn't just a waste — it's a loss. Targeting them actively reduces conversions. Protecting them is both ethical and financially optimal.
📋
Fairness Before Deployment
The model card explicitly scopes out credit scoring and employment decisions, and flags 3 of 4 segments for demographic audit. Responsible AI isn't a checkbox — it's an output you ship alongside the model.