Crime Hotspot Prediction · 980K+ Records · LA Open Data

Where crime will
happen next.

A full ML pipeline that classifies geographic areas as crime hotspots using 980,000+ LAPD records. Four models benchmarked head-to-head — Random Forest wins with 87% accuracy and 0.93 ROC-AUC. Automated retraining via Airflow, data warehouse on BigQuery.

Random Forest · XGBoost · KNN · Logistic Regression
87% Accuracy · 0.84 F1
Apache Airflow · BigQuery
K-Means Geo Clustering
SMOTE Class Balancing
Model comparison · 200K test set · RF wins
1. Random Forest 👑 · F1 0.84 · ROC-AUC 0.93
2. XGBoost · F1 0.79
3. KNN · F1 0.70
4. Logistic Regression · F1 0.68
The Problem

Police resources are
finite. Crime isn't.

Reactive policing — dispatching officers after a crime is reported — is too slow and too expensive. The question this project answers is predictive: given the historical pattern of crime in a location and time, will this area become a hotspot? Getting that classification right means patrol resources go where they're needed, before incidents occur rather than after.

The dataset is 980K+ LAPD incident records spanning 2020 to present — real crime reports with spatial coordinates, timestamps, crime type, victim demographics, and premise codes. The hotspot label is defined as the top 10% of locations by crime density.

980K+ · LAPD crime records, 2020–present · 836K after cleaning
Top 10% · location density threshold defines a hotspot · spatial labeling
SMOTE · synthetic oversampling to balance the 60/40 split (499K vs 336K → 499K vs 499K) · class balancing
K-Means · geo clustering converts lat/lon into location features · spatial encoding
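The top-10%-by-density labeling can be sketched as a threshold over grid-cell incident counts. The rounding-based grid, column names, and cell precision below are assumptions for illustration, not the project's exact scheme:

```python
import pandas as pd

def label_hotspots(df, lat_col="LAT", lon_col="LON", top_pct=0.10, precision=2):
    """Grid-bin coordinates, count incidents per cell, and label the densest
    top_pct of cells as hotspots. The rounding-based grid is an assumption;
    the project defines a hotspot as the top 10% of locations by density."""
    out = df.copy()
    out["cell"] = (out[lat_col].round(precision).astype(str) + ","
                   + out[lon_col].round(precision).astype(str))
    counts = out["cell"].value_counts()
    threshold = counts.quantile(1 - top_pct)   # density cutoff for top 10%
    hot_cells = set(counts[counts >= threshold].index)
    out["hotspot"] = out["cell"].isin(hot_cells).astype(int)
    return out
```

Every record in a dense cell inherits the label, which is why the top 10% of locations can account for roughly 60% of incidents.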
ML Pipeline

From raw incident
reports to predictions.

📥
Ingest & Clean
Load 982K LAPD records. Drop columns >50% null, rows >50% null, and rows missing critical fields. Remove duplicates. 836K records survive.
pandas · 836K rows
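A minimal pandas sketch of this cleaning pass; the null thresholds come from the step above, while the column names and the critical-field list are assumptions:

```python
import pandas as pd

def clean_records(df, critical=("LAT", "LON", "DATE OCC"), null_frac=0.5):
    """Drop mostly-null columns and rows, require critical fields,
    and deduplicate (column names are assumed, not the exact schema)."""
    df = df.loc[:, df.isna().mean() <= null_frac]     # columns >50% null
    df = df[df.isna().mean(axis=1) <= null_frac]      # rows >50% null
    df = df.dropna(subset=[c for c in critical if c in df.columns])
    return df.drop_duplicates().reset_index(drop=True)
```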
🛠
Feature Engineering
Extract year, month, day_of_week, hour from timestamps. Label-encode high-cardinality columns (Area Name, Crime Desc, Premises). One-hot encode Victim Sex.
temporal + categorical features
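The feature step could look roughly like this; the LAPD column names ('DATE OCC', 'AREA NAME', 'Crm Cd Desc', 'Premis Desc', 'Vict Sex') are guesses at the schema rather than confirmed by the text:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def engineer_features(df):
    """Temporal features from the timestamp, label-encoded high-cardinality
    columns, and one-hot victim sex, per the pipeline description."""
    df = df.copy()
    ts = pd.to_datetime(df["DATE OCC"])
    df["year"], df["month"] = ts.dt.year, ts.dt.month
    df["day_of_week"], df["hour"] = ts.dt.dayofweek, ts.dt.hour
    for col in ("AREA NAME", "Crm Cd Desc", "Premis Desc"):
        if col in df.columns:                       # label-encode if present
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    if "Vict Sex" in df.columns:                    # one-hot, low cardinality
        df = pd.get_dummies(df, columns=["Vict Sex"], prefix="sex")
    return df
```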
📍
Spatial Clustering
K-Means on lat/lon coordinates creates a location_cluster feature — converting raw GPS into meaningful geographic zones the model can reason about.
KMeans · StandardScaler
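A sketch of the clustering step with the tools the card names (KMeans, StandardScaler); the cluster count and the synthetic coordinates are assumptions, since the text does not state k:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale lat/lon, fit K-Means, attach the cluster id as a location feature.
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(33.7, 34.3, 5000),      # synthetic LA-area latitudes
    rng.uniform(-118.7, -118.1, 5000),  # synthetic longitudes
])

scaled = StandardScaler().fit_transform(coords)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(scaled)
location_cluster = km.labels_  # one discrete geographic zone per record
```

The `location_cluster` column then replaces raw GPS as the model's spatial input.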
⚖️
SMOTE Balancing
Hotspot class (60%) vs non-hotspot (40%) is imbalanced. SMOTE generates synthetic minority samples — producing a balanced 499K / 499K split so training isn't skewed toward the majority class.
imblearn SMOTE · 999K balanced
🤖
Train 4 Models + Tune
RandomizedSearchCV with 3-fold CV on Logistic Regression, KNN, Random Forest, and XGBoost. 80/20 stratified train/test split. 800K training samples.
RandomizedSearchCV · 3-fold CV
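The tuning loop for one of the four models might be set up like this; the data is a toy stand-in for the 999K balanced set, and the parameter grid is an assumption, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Toy data standing in for the balanced feature matrix.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80/20 stratified

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 200],
                         "max_depth": [None, 10, 20]},  # assumed grid
    n_iter=4, cv=3, scoring="f1", random_state=0, n_jobs=-1)
search.fit(X_tr, y_tr)

# Evaluate the tuned model on the held-out 20%.
auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
```

The same pattern repeats for Logistic Regression, KNN, and XGBoost with their own parameter distributions.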
🔄
Airflow + BigQuery
Scheduled retraining DAG on Apache Airflow pulls fresh data from BigQuery, retrains the winning model, and outputs updated hotspot predictions for deployment.
Airflow DAG · BigQuery warehouse
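The retraining DAG might be wired roughly like this (Airflow 2.x syntax); the `dag_id`, schedule, and task bodies are assumptions, and only the pull → retrain → publish shape comes from the description:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_from_bigquery(**_):
    ...  # query fresh incident records from the BigQuery warehouse

def retrain_model(**_):
    ...  # re-run the feature / SMOTE / Random Forest pipeline on new data

def publish_predictions(**_):
    ...  # write updated hotspot scores back for deployment

with DAG(
    dag_id="hotspot_retrain",          # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",                # assumed cadence
    catchup=False,
) as dag:
    pull = PythonOperator(task_id="pull", python_callable=pull_from_bigquery)
    train = PythonOperator(task_id="retrain", python_callable=retrain_model)
    publish = PythonOperator(task_id="publish",
                             python_callable=publish_predictions)
    pull >> train >> publish
```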
Crime Intensity by Hour × Day of Week (simulated from dataset patterns)
Crime by Day of Week
Results

Why Random Forest
won the benchmark.

Four models competed. The selection criteria weren't arbitrary — crime hotspot prediction requires high recall (missing a real hotspot is worse than a false alarm) and balanced precision (over-policing has real costs). Random Forest delivered the best combination on both fronts, plus the highest ROC-AUC at 0.926.

ROC-AUC — All Four Models
Area under curve on 200K held-out test set
RF: 0.926 AUC
Random Forest's 0.926 ROC-AUC means it correctly ranks a true hotspot above a non-hotspot 92.6% of the time across all probability thresholds — the strongest discrimination of any model tested. XGBoost's high recall (0.922) comes at a precision cost (0.696), making it prone to over-deployment of resources.
Precision · Recall · F1 — Model Comparison
Tuned models on 200K stratified test set
RF leads on F1 + Precision
XGBoost's recall advantage (0.922) flags nearly every real hotspot, but at 0.696 precision roughly 30% of the areas it flags are not actual hotspots. For law enforcement resource allocation, that false-positive cost matters. Random Forest's 0.848 precision keeps over-deployment in check.
Key Decisions

What made this
pipeline work.

📍
Spatial Features via K-Means
Raw lat/lon is a weak signal on its own: too granular and too noisy for the model to generalize from. K-Means clustering converts coordinates into discrete geographic zones that carry meaningful crime-density signal without overfitting to exact GPS coordinates.
⚖️
SMOTE Over Downsampling
The 60/40 class imbalance is mild enough that downsampling would throw away useful data. SMOTE generates synthetic non-hotspot samples to balance the classes while preserving the full 836K record dataset for training.
🌲
Random Forest Over XGBoost
XGBoost's higher recall sounds appealing but its 0.696 precision creates unacceptable false-positive rates for deployment. Random Forest's balanced 0.848 / 0.833 precision-recall makes it the operationally responsible choice.
🔄
Automated Retraining via Airflow
Crime patterns shift with seasons, events, and enforcement changes. A static model degrades. The Airflow DAG pulls fresh data from BigQuery on a schedule, retrains the Random Forest, and deploys updated predictions — no manual intervention needed.