Real-Time Fraud Detection · Kafka + Spark + Postgres

Free trials
don't stay free when abuse is caught
at the source.

A production-grade streaming pipeline that catches promo abuse — same device, multiple "new" accounts — before it drains marketing budget. Events flow from Kafka → Spark → Postgres with sub-second alert latency, fully dockerized and fault-tolerant.

Apache Kafka · Spark Streaming · PostgreSQL · Docker · 24h Tumbling Window · Exactly-Once Semantics
The Problem

One person.
Infinite free trials.

Promo abuse is low-effort, high-reward fraud. A single actor creates dozens of "new" accounts from the same device — or swaps payment tokens — to repeatedly claim free trials, referral bonuses, and first-time discounts. Traditional batch detection catches it days later, after the damage is done. By then, the promo budget is gone and the data is noisy.

10 · signups per device in a 24-hour window (default threshold)
<1s · alert latency from first qualifying event to Postgres write (micro-batch)
24h · tumbling window with a 24h watermark (window size)
Exactly-once · delivery guarantee via Spark checkpointing
Architecture

Three layers.
One clean story.

Each component solves exactly one problem. Kafka absorbs burst traffic and guarantees per-device ordering. Spark handles stateful windowed aggregation with late-event tolerance. Postgres surfaces actionable, explainable alerts ready for dashboards or downstream automation.

Kafka · signup-events topic · keyed by device_id · ports 9092/9094
🔥 Spark · Structured Streaming · 24h window + watermark · groupBy device_id
🗄️ Postgres · signups (raw + JSONB) · alerts (deduped) · ON CONFLICT upsert
Ingest Layer
Kafka Producer
Events keyed by device_id for per-device ordering. Exposes localhost:9094 to host, kafka:9092 inside Docker network. acks=all, 10 retries — no event dropped.
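The keying scheme is the important part of this layer. As a sketch, the serialization can be expressed as a small helper (hypothetical, not a function from the repo) that turns a signup event into the (key, value) byte pair a Kafka producer would send:

```python
import json

def to_kafka_record(event: dict) -> tuple[bytes, bytes]:
    """Serialize a signup event into a Kafka (key, value) pair.

    Keying by device_id (not user_id) routes every event from one
    device to the same partition, which is what gives Kafka its
    per-device ordering guarantee downstream.
    """
    key = event["device_id"].encode("utf-8")
    value = json.dumps(event, separators=(",", ":")).encode("utf-8")
    return key, value

key, value = to_kafka_record(
    {"user_id": "u_42", "device_id": "dev_003", "ts": "2024-05-01T12:00:00Z"}
)
# producer.send("signup-events", key=key, value=value)  # with acks="all"
```

With acks=all and retries enabled, the broker acknowledges only once all in-sync replicas have the record, so a producer crash mid-send does not silently drop events.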
Processing Layer
Spark Structured Streaming
Parses JSON, applies withWatermark("ts", "24 hours"), groups by device_id + window, and counts. Emits an alert when the count ≥ THRESHOLD. Checkpointing is enabled on both sinks.
Storage Layer
PostgreSQL
Raw signups with JSONB for full audit trail. Alerts deduplicated via UNIQUE(device_id, window_start, window_end). ON CONFLICT DO UPDATE handles micro-batch re-delivery.
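The upsert pattern is simple enough to demo end to end. A minimal sketch, using SQLite (version 3.24+) as a stand-in for Postgres since both accept the same ON CONFLICT ... DO UPDATE syntax for this statement; column names are taken from the constraint described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alerts (
        device_id    TEXT,
        window_start TEXT,
        window_end   TEXT,
        signup_count INTEGER,
        UNIQUE (device_id, window_start, window_end)
    )
""")

UPSERT = """
    INSERT INTO alerts (device_id, window_start, window_end, signup_count)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (device_id, window_start, window_end)
    DO UPDATE SET signup_count = excluded.signup_count
"""

row = ("dev_003", "2024-05-01T00:00:00", "2024-05-02T00:00:00", 12)
conn.execute(UPSERT, row)               # first delivery
conn.execute(UPSERT, row[:3] + (14,))   # micro-batch replay, newer count
conn.commit()

rows = conn.execute("SELECT device_id, signup_count FROM alerts").fetchall()
# one row per (device, window), count refreshed: [("dev_003", 14)]
```

A replayed micro-batch simply refreshes the existing row instead of inserting a duplicate, which is what makes the sink safe to restart.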
Detection Logic

The window
remembers everything.

Every signup event carries a device_id — a stable fingerprint that survives account deletion. Spark maintains a 24-hour tumbling window per device. The moment that window's count crosses the threshold, an alert record is written with full forensic context: window bounds, first/last seen timestamps, signup count.

Alert Lifecycle · from raw event to actionable alert

📨 Signup Event (Producer: user_id + device_id + timestamp) → 🔑 Kafka Topic (keyed by device_id, ordered delivery) → ⏱ 24h Window (Spark: groupBy device, count events) → 🔢 Threshold Check (Filter: count ≥ 10?) → 🚨 Alert Written (Postgres: device + window + evidence)
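Stripped of Kafka and Spark, the lifecycle reduces to a per-device count over non-overlapping time buckets. A toy simulation in plain Python (epoch-aligned buckets are an assumption; Spark aligns its windows similarly by default):

```python
from collections import Counter
from datetime import datetime, timezone

THRESHOLD = 10
WINDOW_S = 24 * 3600

def window_bounds(ts: datetime) -> tuple[int, int]:
    # Tumbling 24h windows aligned to the epoch: every event falls in
    # exactly one non-overlapping bucket.
    start = int(ts.timestamp()) // WINDOW_S * WINDOW_S
    return start, start + WINDOW_S

def detect(events):
    """Count signups per (device_id, window); return buckets over threshold."""
    counts = Counter(
        (e["device_id"], window_bounds(e["ts"])) for e in events
    )
    return {k: n for k, n in counts.items() if n >= THRESHOLD}

base = datetime(2024, 5, 1, tzinfo=timezone.utc)
events = [{"device_id": "dev_003", "ts": base} for _ in range(14)]
events += [{"device_id": "dev_021", "ts": base} for _ in range(2)]

alerts = detect(events)
# dev_003 crosses the threshold (14 >= 10); dev_021 does not.
```

The real pipeline does the same arithmetic incrementally and fault-tolerantly; the batch version above is just the ground truth it must converge to.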
[Chart] 24-hour tumbling window, T+0h to T+24h: dev_003 (abusive, 14 events) vs dev_021 (normal, 2 events)
stream/main_stream.py — core detection logic PySpark
# 24-hour tumbling window aggregation per device
from pyspark.sql.functions import col, count, lit, window
from pyspark.sql.functions import min as smin, max as smax

THRESHOLD = 10

# df: streaming DataFrame parsed from the Kafka topic
w = window(col("ts"), "24 hours")

agg = (df
    .withWatermark("ts", "24 hours")   # tolerate events up to 24h late
    .groupBy(col("device_id"), w)
    .agg(
        count(lit(1)).alias("signup_count"),
        smin("ts").alias("first_seen_ts"),
        smax("ts").alias("last_seen_ts"),
    )
    .where(col("signup_count") >= lit(THRESHOLD))
)
Key Design Decisions

Why this stack
earns its complexity.

🔑
Keyed by device_id, not user_id
Abusers create new accounts — that's the whole point. Keying by device fingerprint means the fraud signal survives account deletion. Per-key ordering in Kafka ensures correct windowed counts.
⏱
Watermark = Window Length (24h)
Setting watermark equal to window length maximizes late-event tolerance without holding state indefinitely. Events up to 24h late are still correctly counted — important for mobile clients with unreliable connectivity.
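The acceptance rule can be sketched in plain Python. This is a simplification of Spark's actual bookkeeping (the real watermark advances per trigger from the max event time seen), but it captures the trade-off:

```python
from datetime import datetime, timedelta, timezone

WATERMARK = timedelta(hours=24)

def accepts_late_event(event_ts: datetime, max_event_ts: datetime) -> bool:
    """Simplified watermark rule: an event is still counted if it is no
    older than (max event time seen so far) minus the watermark delay.
    With watermark == window length, an event can arrive a full day
    after the newest event and still land in its window."""
    return event_ts >= max_event_ts - WATERMARK

now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
accepts_late_event(now - timedelta(hours=23), now)  # within tolerance
accepts_late_event(now - timedelta(hours=25), now)  # state already dropped
```

A longer watermark means more state held in Spark; a shorter one means mobile clients that sync hours late get undercounted. Tying it to the window length is the balanced default.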
🔁
ON CONFLICT for idempotent writes
Spark micro-batches can replay on restart. The UNIQUE(device_id, window_start, window_end) constraint with ON CONFLICT DO UPDATE makes every write idempotent — no duplicate alerts.
📋
JSONB raw capture for ML
Every signup event is stored as raw JSONB alongside structured columns. This preserves the full audit trail for retrospective ML feature engineering — no reprocessing Kafka needed to add new signals later.
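As a hypothetical illustration of what "no reprocessing needed" buys: a feature nobody extracted at ingest time, such as the signup email's domain, can be derived straight from the stored raw payloads. The event shape and field names below are illustrative, not the repo's schema:

```python
import json

# Raw payloads as they would sit in the signups JSONB column.
raw_rows = [
    '{"user_id": "u_1", "device_id": "dev_003", "email": "a@mailinator.com"}',
    '{"user_id": "u_2", "device_id": "dev_003", "email": "b@mailinator.com"}',
    '{"user_id": "u_3", "device_id": "dev_021", "email": "c@gmail.com"}',
]

def email_domain(raw: str) -> str:
    # Derive a brand-new ML feature retroactively, from the audit trail,
    # without replaying a single Kafka message.
    return json.loads(raw)["email"].split("@", 1)[1]

domains = [email_domain(r) for r in raw_rows]
# ["mailinator.com", "mailinator.com", "gmail.com"]
```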