A production-grade streaming pipeline that catches promo abuse — same device, multiple "new" accounts — before it drains marketing budget. Events flow from Kafka → Spark → Postgres with sub-second alert latency, fully dockerized and fault-tolerant.
Promo abuse is low-effort, high-reward fraud. A single actor creates dozens of "new" accounts from the same device — or swaps payment tokens — to repeatedly claim free trials, referral bonuses, and first-time discounts. Traditional batch detection catches it days later, after the damage is done. By then, the promo budget is gone and the data is noisy.
Each component solves exactly one problem. Kafka absorbs burst traffic and guarantees per-device ordering. Spark handles stateful windowed aggregation with late-event tolerance. Postgres surfaces actionable, explainable alerts ready for dashboards or downstream automation.
device_id for per-device ordering. Exposes localhost:9094 to host, kafka:9092 inside Docker network. acks=all, 10 retries — no event dropped.withWatermark("ts","24h"), groups by device_id + window, counts. Emits alert when count ≥ THRESHOLD. Checkpoint on both sinks.JSONB for full audit trail. Alerts deduplicated via UNIQUE(device_id, window_start, window_end). ON CONFLICT DO UPDATE handles micro-batch re-delivery.Every signup event carries a device_id — a stable fingerprint that survives account deletion. Spark maintains a 24-hour tumbling window per device. The moment that window's count crosses the threshold, an alert record is written with full forensic context: window bounds, first/last seen timestamps, signup count.
# 24-hour tumbling window aggregation per device w = window(col("ts"), "24 hours") agg = (df .withWatermark("ts", "24 hours") # tolerate late events .groupBy(col("device_id"), w) .agg( count(lit(1)).alias("signup_count"), smin("ts").alias("first_seen_ts"), smax("ts").alias("last_seen_ts"), ) .where(col("signup_count") >= lit(THRESHOLD)) # THRESHOLD = 10 )
UNIQUE(device_id, window_start, window_end) constraint with ON CONFLICT DO UPDATE makes every write idempotent — no duplicate alerts.