Big Data: Stanford Policing Project

Project Description



Technologies: PySpark, Databricks, Spark SQL


Traffic violation datasets are often too large to process comfortably on a single machine, which makes them a good fit for distributed processing. I used PySpark on Databricks to work with a dataset of over 10 million traffic violation records. The raw data required extensive preprocessing, including handling missing values, normalizing categorical variables (such as vehicle type), and correcting inconsistent time formats.
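The cleaning rules described above can be sketched roughly as follows. This is a minimal illustration in plain Python so the logic runs without a cluster; the alias table, the list of time formats, and the function names are all hypothetical and are not taken from the project notebooks. In PySpark, rules like these would more likely be expressed as column expressions or UDFs.

```python
from datetime import datetime

# Hypothetical mapping from raw labels to canonical vehicle types.
VEHICLE_ALIASES = {
    "car": "car",
    "sedan": "car",
    "suv": "suv",
    "sport utility": "suv",
    "truck": "truck",
    "pickup": "truck",
}

# Hypothetical set of time formats that might appear in the raw data.
TIME_FORMATS = ("%H:%M:%S", "%H:%M", "%I:%M %p")


def normalize_vehicle_type(raw):
    """Map a raw vehicle label to a canonical category.

    Missing or blank values become "unknown"; unrecognized labels become "other".
    """
    if raw is None or not raw.strip():
        return "unknown"
    return VEHICLE_ALIASES.get(raw.strip().lower(), "other")


def parse_stop_time(raw):
    """Return the time as 24-hour "HH:MM", or None if it cannot be parsed."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%H:%M")
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning an explicit "unknown" category and None for unparseable times keeps missing values visible in later aggregations instead of silently dropping rows.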

Once the data was cleaned, I used Spark SQL to aggregate violations by time of day, day of week, vehicle type, and location. The analysis revealed patterns such as peak violation hours, geographic hot spots, and common violation types. Because PySpark distributes the work across a cluster, I was able to handle the large dataset efficiently and reduce processing time compared to non-distributed, single-machine approaches.

The final result was a set of Databricks notebooks and Spark SQL outputs that could be extended into dashboards. This project highlights my ability to work with big data workflows, build scalable ETL pipelines, and extract actionable patterns from large-scale structured data.


Skills Showcased:

  • Distributed data processing using Apache Spark

  • Data transformation & aggregation with Spark SQL

  • Scalable ETL development and performance tuning

  • Time-series analysis & visualization with Python and Databricks notebooks

  • Data handling and visualization using PySpark, Pandas, and Matplotlib


Data Source

Stanford Open Policing Project: https://openpolicing.stanford.edu/

YouTube Presentation Link: https://www.youtube.com/watch?v=Mco2p-wFxJA


Key Insights:

  • Black and Hispanic drivers were searched more often than White drivers, yet had lower contraband discovery rates.

  • White drivers were more likely to receive warnings, while Black drivers faced higher rates of arrests and citations.

  • The “veil of darkness” hypothesis was supported: racial disparities in search rates narrowed after sunset.

  • Disparities were consistent across several high-volume states including California, Texas, and Florida.
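The comparisons in the insights above rest on two per-group measures: the search rate (searches divided by stops) and the contraband discovery or "hit" rate (searches that found contraband divided by searches). The sketch below shows one way to compute them in plain Python. The field names `driver_race`, `search_conducted`, and `contraband_found`, the function name, and the sample records are assumptions for illustration and do not reproduce the project's data or results.

```python
from collections import defaultdict


def search_and_hit_rates(stops):
    """Compute search rate and contraband hit rate per driver group.

    search_rate = searches / stops
    hit_rate    = searches with contraband / searches (None if no searches)
    """
    totals = defaultdict(lambda: {"stops": 0, "searches": 0, "hits": 0})
    for stop in stops:
        group = totals[stop["driver_race"]]
        group["stops"] += 1
        if stop["search_conducted"]:
            group["searches"] += 1
            if stop["contraband_found"]:
                group["hits"] += 1

    rates = {}
    for race, t in totals.items():
        rates[race] = {
            "search_rate": t["searches"] / t["stops"],
            "hit_rate": t["hits"] / t["searches"] if t["searches"] else None,
        }
    return rates


# Synthetic records with placeholder group labels, for illustration only.
sample = [
    {"driver_race": "group_a", "search_conducted": True, "contraband_found": True},
    {"driver_race": "group_a", "search_conducted": True, "contraband_found": False},
    {"driver_race": "group_a", "search_conducted": False, "contraband_found": False},
    {"driver_race": "group_a", "search_conducted": False, "contraband_found": False},
    {"driver_race": "group_b", "search_conducted": False, "contraband_found": False},
    {"driver_race": "group_b", "search_conducted": False, "contraband_found": False},
]
rates = search_and_hit_rates(sample)
```

A group that is searched more often but has a lower hit rate is the pattern described in the first insight: more searches that turn up contraband less frequently.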


MS Business Analytics and Information Systems

University of South Florida
