The Mansueto StopWatch is centered on the bus stop. We picked this unit of analysis because it is the main point of contact between users and bus service: the stop is where users wait for the bus, and it determines how close the service is to their trip origin and destination. Since no public data precisely shows when a bus stopped at a bus stop, our first and major technical challenge was to build such a data set, which then became the building block for the aggregated metrics. To produce it, we processed and merged several data sources: community area, bus stop, and route shapefiles; historic bus location pings; and historic bus schedule data.
For the breakdown of how we process these inputs, we follow the terminology used by the CTA API documentation:
We began with over 110 million real-time bus locations from June 1st, 2022 to July 28th, 2024, as provided by the Chi Hack Night Ghost Bus project. These bus locations come from the CTA bus tracker API get vehicles feed, queried every five minutes. Each bus location ping includes metadata such as vehicle number, route, pattern ID, and a trip ID. Because the provided trip ID is not unique, we created a unique trip ID to group a collection of bus locations together. Each trip represents a specific bus on a specific pattern and route at a certain time of the day (for example, the bus with vehicle ID 4654 traveling northbound on pattern 1456 on route 6 on June 30th, 2023). While this method is not perfect, the new trip ID allows us to group bus pings into one trip, which facilitates analysis of service reliability. Using this method, we identified 10,235,984 unique trips in the original data set. For our analysis, it was necessary to transform the raw bus location data into the desired bus stop view.
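The trip grouping described above can be illustrated with a minimal sketch. It is not the project's implementation: the field names (`vid`, `rt`, `pid`, `tmstmp`), the 30-minute gap threshold, and the format of the synthetic `trip_uid` are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Assumed maximum gap between consecutive pings of one trip (hypothetical value).
MAX_GAP = timedelta(minutes=30)


def assign_trip_ids(pings):
    """Return a copy of each ping with a synthetic 'trip_uid' key added.

    Each ping is a dict with 'vid' (vehicle), 'rt' (route), 'pid' (pattern)
    and 'tmstmp' (datetime). Pings on the same vehicle, route and pattern
    share a trip until the gap between consecutive pings exceeds MAX_GAP.
    """
    ordered = sorted(pings, key=lambda p: (p["vid"], p["rt"], p["pid"], p["tmstmp"]))
    out = []
    prev_key, prev_time, trip_start = None, None, None
    for p in ordered:
        key = (p["vid"], p["rt"], p["pid"])
        if key != prev_key or p["tmstmp"] - prev_time > MAX_GAP:
            trip_start = p["tmstmp"]  # first ping of a new trip
        uid = f"{p['vid']}_{p['rt']}_{p['pid']}_{trip_start:%Y%m%dT%H%M}"
        out.append({**p, "trip_uid": uid})
        prev_key, prev_time = key, p["tmstmp"]
    return out
```

Keying the identifier on the first ping's timestamp keeps it stable for every ping in the trip. The gap threshold is the main judgment call: too short splits one trip in two, too long merges consecutive runs of the same bus on the same pattern.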
The original data represents 5-minute snapshots of every bus in the CTA system, which we converted, using imputation, into the times at which each bus passes a bus stop. For example, if two buses pass a stop within a 5-minute snapshot, there will be two rows, each listing the estimated time that bus passed the stop and the estimated time since the previous bus. This transformation is necessary because it lets us derive performance metrics that are more interpretable and easier to localize than bus positions.
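One way to picture the imputation is linear interpolation along the route pattern. The sketch below assumes each ping carries a distance travelled along its pattern (called `pdist` here), that each stop has a known position on the same pattern, and that the bus moves at constant speed between two pings. These are illustrative assumptions, not a description of the project's actual procedure.

```python
from datetime import datetime, timedelta


def interpolate_stop_times(pings, stops):
    """Estimate when one trip passed each stop by linear interpolation.

    pings: list of (timestamp, pdist) sorted by time, where pdist is the
           distance travelled along the pattern.
    stops: list of (stop_id, pdist) giving each stop's position on the pattern.
    Returns {stop_id: estimated datetime} for stops bracketed by two pings.
    """
    estimates = {}
    for (t0, d0), (t1, d1) in zip(pings, pings[1:]):
        if d1 <= d0:
            continue  # bus did not advance; nothing to interpolate
        for stop_id, d_stop in stops:
            if d0 <= d_stop < d1:
                frac = (d_stop - d0) / (d1 - d0)
                estimates[stop_id] = t0 + (t1 - t0) * frac
    return estimates


def headways(arrivals):
    """Given arrival datetimes at one stop, return the time since the previous bus."""
    ordered = sorted(arrivals)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

For instance, a bus seen at distance 0 at 8:00 and at distance 3000 at 8:05 would be estimated to pass a stop located at distance 1500 at 8:02:30. Stops before the first ping or after the last one get no estimate in this sketch.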
To do this, we
We then processed historic bus service schedules to contrast them with the actual service provided. For this purpose, we used General Transit Feed Specification (GTFS) data. The CTA only allows downloading the current schedule, which was an obstacle because we planned to evaluate bus service going back to June 2022. However, Transitland, an open data platform that collects GTFS data, maintains a historic archive of all feeds. We downloaded historic feeds going back to May 2022 and recreated the schedules from them using GTFS Kit, an open-source Python library for working with GTFS data.
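To give a sense of what a stop-level schedule view looks like, the sketch below counts scheduled arrivals per stop and hour from the contents of a GTFS `stop_times.txt` file. It uses only the Python standard library rather than GTFS Kit, so it is an illustration under that substitution and not the project's code. The function name is hypothetical; the column names come from the GTFS specification.

```python
import csv
import io
from collections import Counter


def scheduled_arrivals_per_hour(stop_times_csv):
    """Count scheduled arrivals per (stop_id, hour) from stop_times.txt text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        if not row["arrival_time"]:
            continue  # GTFS allows blank times at non-timepoint stops
        # GTFS times may exceed 24:00:00 for trips that run past midnight.
        hour = int(row["arrival_time"].split(":")[0]) % 24
        counts[(row["stop_id"], hour)] += 1
    return counts
```

A full reconstruction would also need `trips.txt`, `calendar.txt`, and `calendar_dates.txt` to know which trips run on which dates. This sketch ignores service days entirely.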
In addition to bus pings and schedules, the analysis relies on shapefiles of three main units of analysis: community areas, bus stops, and routes. These shapefiles are mainly used for visualizations and for spatial operations. More specifically, we performed point-in-polygon operations to aggregate service performance metrics at the community level by identifying the bus stops that serve each of the 77 community areas. Up-to-date shapefiles are available at the Chicago Data Portal for the following spatial units:
Using both the historic real-time bus location data and the historic schedule data at the bus stop level, we calculated the following metrics for different time periods (including hour of the day, day of the week, week of the year, month of the year, year, week for each given year, and month for each given year).
Metrics include:
To calculate the metrics, we:
For further details on the project data and methodology, consult the full report or the code repo.
Access these comprehensive metrics, updated daily and covering all stops, here.