Real-Time Fraud Detection · Kafka + Spark + Postgres

Free trials
don't stay free when abuse is caught
at the source.

A production-grade streaming pipeline that catches promo abuse — same device, multiple "new" accounts — before it drains marketing budget. Events flow from Kafka → Spark → Postgres with sub-second alert latency, fully dockerized and fault-tolerant.

Apache Kafka · Spark Streaming · PostgreSQL · Docker · 24h Tumbling Window · Exactly-Once Semantics
The Problem

One person.
Infinite free trials.

Promo abuse is low-effort, high-reward fraud. A single actor creates dozens of "new" accounts from the same device — or swaps payment tokens — to repeatedly claim free trials, referral bonuses, and first-time discounts. Traditional batch detection catches it days later, after the damage is done. By then, the promo budget is gone and the data is noisy.

10 · signups per device in a 24-hour window (default threshold)
<1s · alert latency from first qualifying event to Postgres write (micro-batch)
24h · tumbling window with a 24h watermark (window size)
Exactly-once · delivery guarantee via Spark checkpointing
Architecture

Three layers.
One clean story.

Each component solves exactly one problem. Kafka absorbs burst traffic and guarantees per-device ordering. Spark handles stateful windowed aggregation with late-event tolerance. Postgres surfaces actionable, explainable alerts ready for dashboards or downstream automation.

Kafka · signup-events topic · keyed by device_id · ports 9092/9094
🔥 Spark · Structured Streaming · 24h window + watermark · groupBy device_id
🗄️ Postgres · signups (raw + JSONB) · alerts (deduped) · ON CONFLICT upsert
Ingest Layer
Kafka Producer
Events keyed by device_id for per-device ordering. Exposes localhost:9094 to host, kafka:9092 inside Docker network. acks=all, 10 retries — no event dropped.
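The keying scheme is the important part of this layer. As a sketch, the serialization can be expressed as a small helper (hypothetical, not a function from the repo) that turns a signup event into the (key, value) byte pair a Kafka producer would send:

```python
import json

def to_kafka_record(event: dict) -> tuple[bytes, bytes]:
    """Serialize a signup event into a Kafka (key, value) pair.

    Keying by device_id (not user_id) routes every event from one
    device to the same partition, which is what gives Kafka its
    per-device ordering guarantee downstream.
    """
    key = event["device_id"].encode("utf-8")
    value = json.dumps(event, separators=(",", ":")).encode("utf-8")
    return key, value

key, value = to_kafka_record(
    {"user_id": "u_42", "device_id": "dev_003", "ts": "2024-05-01T12:00:00Z"}
)
# producer.send("signup-events", key=key, value=value)  # with acks="all"
```

With acks=all and retries enabled, the broker acknowledges only once all in-sync replicas have the record, so a producer crash mid-send does not silently drop events.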
Processing Layer
Spark Structured Streaming
Parses JSON, applies withWatermark("ts", "24 hours"), groups by device_id + window, and counts. Emits an alert when the count ≥ THRESHOLD. Checkpointing is enabled on both sinks.
Storage Layer
PostgreSQL
Raw signups with JSONB for full audit trail. Alerts deduplicated via UNIQUE(device_id, window_start, window_end). ON CONFLICT DO UPDATE handles micro-batch re-delivery.
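The upsert pattern is simple enough to demo end to end. A minimal sketch, using SQLite (version 3.24+) as a stand-in for Postgres since both accept the same ON CONFLICT ... DO UPDATE syntax for this statement; column names are taken from the constraint described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alerts (
        device_id    TEXT,
        window_start TEXT,
        window_end   TEXT,
        signup_count INTEGER,
        UNIQUE (device_id, window_start, window_end)
    )
""")

UPSERT = """
    INSERT INTO alerts (device_id, window_start, window_end, signup_count)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (device_id, window_start, window_end)
    DO UPDATE SET signup_count = excluded.signup_count
"""

row = ("dev_003", "2024-05-01T00:00:00", "2024-05-02T00:00:00", 12)
conn.execute(UPSERT, row)               # first delivery
conn.execute(UPSERT, row[:3] + (14,))   # micro-batch replay, newer count
conn.commit()

rows = conn.execute("SELECT device_id, signup_count FROM alerts").fetchall()
# one row per (device, window), count refreshed: [("dev_003", 14)]
```

A replayed micro-batch simply refreshes the existing row instead of inserting a duplicate, which is what makes the sink safe to restart.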
Detection Logic

The window
remembers everything.

Every signup event carries a device_id — a stable fingerprint that survives account deletion. Spark maintains a 24-hour tumbling window per device. The moment that window's count crosses the threshold, an alert record is written with full forensic context: window bounds, first/last seen timestamps, signup count.

Alert Lifecycle · from raw event to actionable alert

📨 Signup Event (Producer: user_id + device_id + timestamp) → 🔑 Kafka Topic (keyed by device_id, ordered delivery) → ⏱ 24h Window (Spark: groupBy device, count events) → 🔢 Threshold Check (Filter: count ≥ 10?) → 🚨 Alert Written (Postgres: device + window + evidence)
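Stripped of Kafka and Spark, the lifecycle reduces to a per-device count over non-overlapping time buckets. A toy simulation in plain Python (epoch-aligned buckets are an assumption; Spark aligns its windows similarly by default):

```python
from collections import Counter
from datetime import datetime, timezone

THRESHOLD = 10
WINDOW_S = 24 * 3600

def window_bounds(ts: datetime) -> tuple[int, int]:
    # Tumbling 24h windows aligned to the epoch: every event falls in
    # exactly one non-overlapping bucket.
    start = int(ts.timestamp()) // WINDOW_S * WINDOW_S
    return start, start + WINDOW_S

def detect(events):
    """Count signups per (device_id, window); return buckets over threshold."""
    counts = Counter(
        (e["device_id"], window_bounds(e["ts"])) for e in events
    )
    return {k: n for k, n in counts.items() if n >= THRESHOLD}

base = datetime(2024, 5, 1, tzinfo=timezone.utc)
events = [{"device_id": "dev_003", "ts": base} for _ in range(14)]
events += [{"device_id": "dev_021", "ts": base} for _ in range(2)]

alerts = detect(events)
# dev_003 crosses the threshold (14 >= 10); dev_021 does not.
```

The real pipeline does the same arithmetic incrementally and fault-tolerantly; the batch version above is just the ground truth it must converge to.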
[Chart] 24-hour tumbling window, T+0h to T+24h: dev_003 (abusive, 14 events) vs dev_021 (normal, 2 events)
stream/main_stream.py — core detection logic PySpark
# 24-hour tumbling window aggregation per device
from pyspark.sql.functions import col, count, lit, window
from pyspark.sql.functions import min as smin, max as smax

THRESHOLD = 10

# df: streaming DataFrame parsed from the Kafka topic
w = window(col("ts"), "24 hours")

agg = (df
    .withWatermark("ts", "24 hours")   # tolerate events up to 24h late
    .groupBy(col("device_id"), w)
    .agg(
        count(lit(1)).alias("signup_count"),
        smin("ts").alias("first_seen_ts"),
        smax("ts").alias("last_seen_ts"),
    )
    .where(col("signup_count") >= lit(THRESHOLD))
)
Key Design Decisions

Why this stack
earns its complexity.

🔑
Keyed by device_id, not user_id
Abusers create new accounts — that's the whole point. Keying by device fingerprint means the fraud signal survives account deletion. Per-key ordering in Kafka ensures correct windowed counts.
⏱
Watermark = Window Length (24h)
Setting watermark equal to window length maximizes late-event tolerance without holding state indefinitely. Events up to 24h late are still correctly counted — important for mobile clients with unreliable connectivity.
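The acceptance rule can be sketched in plain Python. This is a simplification of Spark's actual bookkeeping (the real watermark advances per trigger from the max event time seen), but it captures the trade-off:

```python
from datetime import datetime, timedelta, timezone

WATERMARK = timedelta(hours=24)

def accepts_late_event(event_ts: datetime, max_event_ts: datetime) -> bool:
    """Simplified watermark rule: an event is still counted if it is no
    older than (max event time seen so far) minus the watermark delay.
    With watermark == window length, an event can arrive a full day
    after the newest event and still land in its window."""
    return event_ts >= max_event_ts - WATERMARK

now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
accepts_late_event(now - timedelta(hours=23), now)  # within tolerance
accepts_late_event(now - timedelta(hours=25), now)  # state already dropped
```

A longer watermark means more state held in Spark; a shorter one means mobile clients that sync hours late get undercounted. Tying it to the window length is the balanced default.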
🔁
ON CONFLICT for idempotent writes
Spark micro-batches can replay on restart. The UNIQUE(device_id, window_start, window_end) constraint with ON CONFLICT DO UPDATE makes every write idempotent — no duplicate alerts.
📋
JSONB raw capture for ML
Every signup event is stored as raw JSONB alongside structured columns. This preserves the full audit trail for retrospective ML feature engineering — no reprocessing Kafka needed to add new signals later.
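As a hypothetical illustration of what "no reprocessing needed" buys: a feature nobody extracted at ingest time, such as the signup email's domain, can be derived straight from the stored raw payloads. The event shape and field names below are illustrative, not the repo's schema:

```python
import json

# Raw payloads as they would sit in the signups JSONB column.
raw_rows = [
    '{"user_id": "u_1", "device_id": "dev_003", "email": "a@mailinator.com"}',
    '{"user_id": "u_2", "device_id": "dev_003", "email": "b@mailinator.com"}',
    '{"user_id": "u_3", "device_id": "dev_021", "email": "c@gmail.com"}',
]

def email_domain(raw: str) -> str:
    # Derive a brand-new ML feature retroactively, from the audit trail,
    # without replaying a single Kafka message.
    return json.loads(raw)["email"].split("@", 1)[1]

domains = [email_domain(r) for r in raw_rows]
# ["mailinator.com", "mailinator.com", "gmail.com"]
```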