A full ML pipeline that classifies geographic areas as crime hotspots using 980,000+ LAPD records. Four models benchmarked head-to-head — Random Forest wins with 87% accuracy and 0.93 ROC-AUC. Automated retraining via Airflow, data warehouse on BigQuery.
Reactive policing — dispatching officers after a crime is reported — is too slow and too expensive. The question this project answers is predictive: given the historical pattern of crime in a location and time, will this area become a hotspot? Getting that classification right means patrol resources go where they're needed, before incidents occur rather than after.
The dataset is 980K+ LAPD incident records spanning 2020 to present — real crime reports with spatial coordinates, timestamps, crime type, victim demographics, and premise codes. The hotspot label is defined as the top 10% of locations by crime density.
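The top-10% labeling rule can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes a "location" is a fixed-size grid cell derived from latitude and longitude, and that density is a raw incident count per cell. The names `grid_cell`, `label_hotspots`, `cell_deg`, and `top_fraction` are hypothetical.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a hypothetical grid-cell key.

    Assumption: 0.01 degrees (roughly 1 km in Los Angeles) stands in for
    whatever location unit the real pipeline uses.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def label_hotspots(incidents, top_fraction=0.10, cell_deg=0.01):
    """Return {cell: is_hotspot} for an iterable of (lat, lon) incidents.

    Cells are ranked by incident count; the top `top_fraction` of cells
    (rounded to the nearest whole cell, at least one) are labeled True.
    """
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in incidents)
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_hot = max(1, round(len(ranked) * top_fraction))
    hot = set(ranked[:n_hot])
    return {cell: cell in hot for cell in counts}


# Toy data: ten cells with one incident each, plus four extra in one cell.
toy_incidents = [(34.001 + i * 0.01, -118.2437) for i in range(10)]
toy_incidents += [(34.0315, -118.2437)] * 4
toy_labels = label_hotspots(toy_incidents)
```

With ten cells and `top_fraction=0.10`, exactly one cell is labeled a hotspot: the one holding five incidents. Ties at the cutoff are broken by first appearance here; a real pipeline would need an explicit tie rule.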
location_cluster feature: converting raw GPS into meaningful geographic zones the model can reason about.
Four models competed. The selection criteria weren't arbitrary: crime hotspot prediction requires high recall (missing a real hotspot is worse than a false alarm) and balanced precision (over-policing has real costs). Random Forest delivered the best combination on both fronts, plus the highest ROC-AUC at 0.926.
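For readers who want the three selection metrics spelled out, here is a small dependency-free sketch. It uses toy labels and scores invented for illustration, not the project's results, and the function names are hypothetical. ROC-AUC is computed as the share of (hotspot, non-hotspot) pairs in which the hotspot receives the higher score, with ties counted as half.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = hotspot)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # hotspots caught
    fp = sum(1 for t, p in pairs if not t and p)  # false alarms
    fn = sum(1 for t, p in pairs if t and not p)  # missed hotspots
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 3 hotspots, 5 non-hotspots, threshold at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Recall falls when a true hotspot is scored below the threshold, and precision falls when a quiet area is flagged. ROC-AUC is independent of the threshold, which is why it works as a summary figure alongside the other two.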