Crime Hotspot Prediction · 980K+ Records · LA Open Data

Where crime will
happen next.

A full ML pipeline that classifies geographic areas as crime hotspots using 980,000+ LAPD records. Four models benchmarked head-to-head — Random Forest wins with 87% accuracy and 0.93 ROC-AUC. Automated retraining via Airflow, data warehouse on BigQuery.

Random Forest · XGBoost · KNN · Logistic Regression
87% Accuracy · 0.84 F1
Apache Airflow · BigQuery
K-Means Geo Clustering
SMOTE Class Balancing
Model comparison · 200K test set · RF wins
1. Random Forest 👑 · F1 0.84 · ROC-AUC 0.93
2. XGBoost · F1 0.79
3. KNN · F1 0.70
4. Logistic Regression · F1 0.68
The Problem

Police resources are
finite. Crime isn't.

Reactive policing — dispatching officers after a crime is reported — is too slow and too expensive. The question this project answers is predictive: given the historical pattern of crime in a location and time, will this area become a hotspot? Getting that classification right means patrol resources go where they're needed, before incidents occur rather than after.

The dataset is 980K+ LAPD incident records spanning 2020 to present — real crime reports with spatial coordinates, timestamps, crime type, victim demographics, and premise codes. The hotspot label is defined as the top 10% of locations by crime density.

980K+ · LAPD crime records, 2020–present · 836K after cleaning
Top 10% · location density threshold defines a hotspot · spatial labeling
SMOTE · synthetic oversampling to balance the 60/40 split (499K vs 336K → 499K vs 499K) · class balancing
K-Means · geo clustering converts lat/lon into location features · spatial encoding
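The top-10%-by-density labeling can be sketched as a threshold over grid-cell incident counts. The rounding-based grid, column names, and cell precision below are assumptions for illustration, not the project's exact scheme:

```python
import pandas as pd

def label_hotspots(df, lat_col="LAT", lon_col="LON", top_pct=0.10, precision=2):
    """Grid-bin coordinates, count incidents per cell, and label the densest
    top_pct of cells as hotspots. The rounding-based grid is an assumption;
    the project defines a hotspot as the top 10% of locations by density."""
    out = df.copy()
    out["cell"] = (out[lat_col].round(precision).astype(str) + ","
                   + out[lon_col].round(precision).astype(str))
    counts = out["cell"].value_counts()
    threshold = counts.quantile(1 - top_pct)   # density cutoff for top 10%
    hot_cells = set(counts[counts >= threshold].index)
    out["hotspot"] = out["cell"].isin(hot_cells).astype(int)
    return out
```

Every record in a dense cell inherits the label, which is why the top 10% of locations can account for roughly 60% of incidents.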
ML Pipeline

From raw incident
reports to predictions.

📥
Ingest & Clean
Load 982K LAPD records. Drop columns >50% null, rows >50% null, and rows missing critical fields. Remove duplicates. 836K records survive.
pandas · 836K rows
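A minimal pandas sketch of this cleaning pass; the null thresholds come from the step above, while the column names and the critical-field list are assumptions:

```python
import pandas as pd

def clean_records(df, critical=("LAT", "LON", "DATE OCC"), null_frac=0.5):
    """Drop mostly-null columns and rows, require critical fields,
    and deduplicate (column names are assumed, not the exact schema)."""
    df = df.loc[:, df.isna().mean() <= null_frac]     # columns >50% null
    df = df[df.isna().mean(axis=1) <= null_frac]      # rows >50% null
    df = df.dropna(subset=[c for c in critical if c in df.columns])
    return df.drop_duplicates().reset_index(drop=True)
```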
🛠
Feature Engineering
Extract year, month, day_of_week, hour from timestamps. Label-encode high-cardinality columns (Area Name, Crime Desc, Premises). One-hot encode Victim Sex.
temporal + categorical features
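The feature step could look roughly like this; the LAPD column names ('DATE OCC', 'AREA NAME', 'Crm Cd Desc', 'Premis Desc', 'Vict Sex') are guesses at the schema rather than confirmed by the text:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def engineer_features(df):
    """Temporal features from the timestamp, label-encoded high-cardinality
    columns, and one-hot victim sex, per the pipeline description."""
    df = df.copy()
    ts = pd.to_datetime(df["DATE OCC"])
    df["year"], df["month"] = ts.dt.year, ts.dt.month
    df["day_of_week"], df["hour"] = ts.dt.dayofweek, ts.dt.hour
    for col in ("AREA NAME", "Crm Cd Desc", "Premis Desc"):
        if col in df.columns:                       # label-encode if present
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    if "Vict Sex" in df.columns:                    # one-hot, low cardinality
        df = pd.get_dummies(df, columns=["Vict Sex"], prefix="sex")
    return df
```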
📍
Spatial Clustering
K-Means on lat/lon coordinates creates a location_cluster feature — converting raw GPS into meaningful geographic zones the model can reason about.
KMeans · StandardScaler
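A sketch of the clustering step with the tools the card names (KMeans, StandardScaler); the cluster count and the synthetic coordinates are assumptions, since the text does not state k:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale lat/lon, fit K-Means, attach the cluster id as a location feature.
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(33.7, 34.3, 5000),      # synthetic LA-area latitudes
    rng.uniform(-118.7, -118.1, 5000),  # synthetic longitudes
])

scaled = StandardScaler().fit_transform(coords)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(scaled)
location_cluster = km.labels_  # one discrete geographic zone per record
```

The `location_cluster` column then replaces raw GPS as the model's spatial input.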
⚖️
SMOTE Balancing
Hotspot class (60%) vs non-hotspot (40%) is imbalanced. SMOTE generates synthetic minority samples — producing a balanced 499K / 499K split so training isn't skewed toward the majority class.
imblearn SMOTE · 999K balanced
🤖
Train 4 Models + Tune
RandomizedSearchCV with 3-fold CV on Logistic Regression, KNN, Random Forest, and XGBoost. 80/20 stratified train/test split. 800K training samples.
RandomizedSearchCV · 3-fold CV
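The tuning loop for one of the four models might be set up like this; the data is a toy stand-in for the 999K balanced set, and the parameter grid is an assumption, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Toy data standing in for the balanced feature matrix.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80/20 stratified

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 200],
                         "max_depth": [None, 10, 20]},  # assumed grid
    n_iter=4, cv=3, scoring="f1", random_state=0, n_jobs=-1)
search.fit(X_tr, y_tr)

# Evaluate the tuned model on the held-out 20%.
auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
```

The same pattern repeats for Logistic Regression, KNN, and XGBoost with their own parameter distributions.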
🔄
Airflow + BigQuery
Scheduled retraining DAG on Apache Airflow pulls fresh data from BigQuery, retrains the winning model, and outputs updated hotspot predictions for deployment.
Airflow DAG · BigQuery warehouse
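The retraining DAG might be wired roughly like this (Airflow 2.x syntax); the `dag_id`, schedule, and task bodies are assumptions, and only the pull → retrain → publish shape comes from the description:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_from_bigquery(**_):
    ...  # query fresh incident records from the BigQuery warehouse

def retrain_model(**_):
    ...  # re-run the feature / SMOTE / Random Forest pipeline on new data

def publish_predictions(**_):
    ...  # write updated hotspot scores back for deployment

with DAG(
    dag_id="hotspot_retrain",          # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",                # assumed cadence
    catchup=False,
) as dag:
    pull = PythonOperator(task_id="pull", python_callable=pull_from_bigquery)
    train = PythonOperator(task_id="retrain", python_callable=retrain_model)
    publish = PythonOperator(task_id="publish",
                             python_callable=publish_predictions)
    pull >> train >> publish
```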
Crime Intensity by Hour × Day of Week (simulated from dataset patterns)
Crime by Day of Week
Results

Why Random Forest
won the benchmark.

Four models competed. The selection criteria weren't arbitrary — crime hotspot prediction requires high recall (missing a real hotspot is worse than a false alarm) and balanced precision (over-policing has real costs). Random Forest delivered the best combination on both fronts, plus the highest ROC-AUC at 0.926.

ROC-AUC — All Four Models
Area under curve on 200K held-out test set
RF: 0.926 AUC
Random Forest's 0.926 ROC-AUC means it correctly ranks a true hotspot above a non-hotspot 92.6% of the time across all probability thresholds — the strongest discrimination of any model tested. XGBoost's high recall (0.922) comes at a precision cost (0.696), making it prone to over-deployment of resources.
Precision · Recall · F1 — Model Comparison
Tuned models on 200K stratified test set
RF leads on F1 + Precision
XGBoost's recall advantage (0.922) flags nearly every real hotspot, but at 0.696 precision roughly 30% of the areas it flags are not actual hotspots. For law enforcement resource allocation, that false-positive cost matters. Random Forest's 0.848 precision keeps over-deployment in check.
Key Decisions

What made this
pipeline work.

📍
Spatial Features via K-Means
Raw lat/lon is a weak signal on its own: too granular and too noisy for the model to generalize from. K-Means clustering converts coordinates into discrete geographic zones that carry meaningful crime-density signal without overfitting to exact GPS coordinates.
⚖️
SMOTE Over Downsampling
The 60/40 class imbalance is mild enough that downsampling would throw away useful data. SMOTE generates synthetic non-hotspot samples to balance the classes while preserving the full 836K record dataset for training.
🌲
Random Forest Over XGBoost
XGBoost's higher recall sounds appealing but its 0.696 precision creates unacceptable false-positive rates for deployment. Random Forest's balanced 0.848 / 0.833 precision-recall makes it the operationally responsible choice.
🔄
Automated Retraining via Airflow
Crime patterns shift with seasons, events, and enforcement changes. A static model degrades. The Airflow DAG pulls fresh data from BigQuery on a schedule, retrains the Random Forest, and deploys updated predictions — no manual intervention needed.