Source: Wikimedia, user -stk.
In addition to the tables described above, tidytransit attempts to calculate the following tables when one uses read_gtfs():
tidytransit prints a message regarding these tables on reading any GTFS file.
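# Load tidytransit for read_gtfs() and dplyr for the joins and filters used below
library(tidytransit)
library(dplyr)
# Read in GTFS feed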
# here we use a feed included in the package, but note that you can read directly from the New York City Metropolitan Transit Authority using the following URL:
# nyc <- read_gtfs("http://web.mta.info/developers/data/nyct/subway/google_transit.zip")
local_gtfs_path <- system.file("extdata",
"google_transit_nyc_subway.zip",
package = "tidytransit")
nyc <- read_gtfs(local_gtfs_path,
local=TRUE,
geometry=TRUE,
frequency=TRUE)
#> Calculating route and stop headways.
For example, we can join the standard routes table, which carries the route names (such as route_long_name), to the calculated routes_frequency table.
routes_df_frequencies <- nyc$routes %>%
inner_join(nyc$routes_frequency, by = "route_id") %>%
select(route_long_name,
median_headways,
mean_headways,
st_dev_headways,
stop_count)
head(routes_df_frequencies)
#> # A tibble: 6 x 5
#> route_long_name median_headways mean_headways st_dev_headways stop_count
#> <chr> <int> <int> <dbl> <int>
#> 1 Broadway - 7 Av… 5 5 0.15 76
#> 2 7 Avenue Express 7 51 135. 120
#> 3 7 Avenue Express 8 8 0.08 68
#> 4 Lexington Avenu… 6 115 205. 77
#> 5 Lexington Avenu… 9 110 271. 102
#> 6 Lexington Avenu… 48 48 0 29
You can do the same with ‘simple features tables’.
For example, under the hood, plot(gtfs_obj) is doing this:
A more complex example of cross-table joins is to pull the stops and their headways for a given route.
This simple question is a great way to begin to understand a lot about the GTFS data model.
First, we’ll need to find a ‘service_id’, which will tell us which stops a route passes through on a given day of the week and year.
When calculating frequencies, tidytransit tries to guess which service_id is representative of a standard weekday by walking through a set of steps. Below we’ll just do some of this manually.
First, let's look at the calendar.
head(sample_n(nyc$calendar,10))
#> # A tibble: 6 x 10
#> service_id monday tuesday wednesday thursday friday saturday sunday
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 BSP18GEN-… 0 0 0 0 0 1 0
#> 2 BSP18GEN-… 1 1 1 1 1 0 0
#> 3 BSP18GEN-… 1 1 1 1 1 0 0
#> 4 BSP18GEN-… 1 1 1 1 1 0 0
#> 5 ASP18GEN-… 0 0 0 0 0 1 0
#> 6 BSP18GEN-… 1 1 1 1 1 0 0
#> # … with 2 more variables: start_date <date>, end_date <date>
Then we'll pull the service_ids that run on Mondays, along with the route_id for the C train.
select_service_id <- filter(nyc$calendar, monday==1) %>% pull(service_id)
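# pull() returns the route_id as a character vector rather than a one-row data frame
select_route_id <- filter(nyc$routes, route_id == "C") %>% pull(route_id)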
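The selection above keeps every service_id that runs on Mondays. A narrower choice of one "representative weekday" service can be sketched as follows. This is only an illustrative heuristic and not necessarily the set of steps tidytransit walks through: the toy `calendar` and `trips` tables, the service names, and the "most trips wins" rule are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy tables shaped like GTFS calendar.txt and trips.txt
calendar <- tibble(
  service_id = c("WKD", "SAT", "SUN", "WKD_ALT"),
  monday     = c(1L, 0L, 0L, 1L),
  tuesday    = c(1L, 0L, 0L, 1L),
  wednesday  = c(1L, 0L, 0L, 1L),
  thursday   = c(1L, 0L, 0L, 1L),
  friday     = c(1L, 0L, 0L, 1L),
  saturday   = c(0L, 1L, 0L, 0L),
  sunday     = c(0L, 0L, 1L, 0L)
)
trips <- tibble(
  trip_id    = paste0("t", 1:7),
  service_id = c("WKD", "WKD", "WKD", "WKD_ALT", "SAT", "SAT", "SUN")
)

# Assumed heuristic: among services that run Monday through Friday,
# keep the one that has the most trips
weekday_service_id <- calendar %>%
  filter(monday == 1, tuesday == 1, wednesday == 1,
         thursday == 1, friday == 1) %>%
  inner_join(count(trips, service_id, name = "n_trips"), by = "service_id") %>%
  arrange(desc(n_trips)) %>%
  slice(1) %>%
  pull(service_id)
```

On a real feed, exceptions listed in calendar_dates and the start_date and end_date columns would also matter, so treat this as a starting point only.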
Now we’ll filter down through the data model to just stops for that route and service_ids.
some_trips <- nyc$trips %>%
filter(route_id %in% select_route_id &
service_id %in% select_service_id)
some_stop_times <- nyc$stop_times %>%
filter(trip_id %in% some_trips$trip_id)
some_stops <- nyc$stops_sf %>%
filter(stop_id %in% some_stop_times$stop_id)
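To make the chain of tables easier to follow, here is the same walk (trips, then stop_times, then stops) on a miniature feed. All table contents are hypothetical, and `semi_join()` is used here as an alternative to `%in%`; it keeps the rows of the left table that have a match in the right table without adding columns.

```r
library(dplyr)

# Hypothetical miniature feed: two routes, three trips, four stops
trips <- tibble(
  trip_id    = c("t1", "t2", "t3"),
  route_id   = c("C", "C", "A"),
  service_id = c("WKD", "SAT", "WKD")
)
stop_times <- tibble(
  trip_id = c("t1", "t1", "t2", "t3", "t3"),
  stop_id = c("s1", "s2", "s3", "s2", "s4")
)
stops <- tibble(stop_id = c("s1", "s2", "s3", "s4"))

# Step 1: keep only weekday trips on route C
toy_trips <- filter(trips, route_id == "C", service_id == "WKD")

# Step 2: keep the stop_times that belong to those trips
toy_stop_times <- semi_join(stop_times, toy_trips, by = "trip_id")

# Step 3: keep the stops that appear in those stop_times
toy_stops <- semi_join(stops, toy_stop_times, by = "stop_id")
```

Only trip t1 survives the first step, so `toy_stops` ends up holding the two stops that t1 visits.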
Before we plot them, let's pull the frequency calculations from the calculated table onto their geometries.
Due to the way that schedules
some_stops_freq_sf <- some_stops %>%
left_join(nyc$stops_frequency, by="stop_id") %>%
select(headway)
plot(some_stops_freq_sf)
We may see some surprising outliers for headway calculations in this plot; in fact, we probably will.
Calculating headways at stops is tricky for a number of reasons.
One of the main reasons is that GTFS wasn't meant for this kind of analytical work, so the headway calculations in this package aren't robust against every edge case of every service and stop that might be listed in a GTFS feed.
However, I have found that the methods in here are OK at describing transit service headways on routes and stops if you understand that GTFS data can be messy for analytical work.
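One way such outliers can arise is sketched below. This is an illustration on made-up departure times and not the package's actual calculation: a single long gap in the schedule (here, one late-night trip) leaves the median headway untouched but inflates the mean, which resembles the gap between median_headways and mean_headways in some rows of the routes table earlier.

```r
library(dplyr)

# Hypothetical departures at one stop, in minutes after midnight:
# a 10-minute service from 06:00 to 06:40, then a single trip at 23:00
departures <- tibble(departure_min = c(360, 370, 380, 390, 400, 1380))

headway_summary <- departures %>%
  arrange(departure_min) %>%
  # headway = time since the previous departure at this stop
  mutate(headway = departure_min - lag(departure_min)) %>%
  # the first departure has no predecessor, so drop its NA headway
  filter(!is.na(headway)) %>%
  summarise(median_headway = median(headway),
            mean_headway   = mean(headway))
```

Here the median headway is 10 minutes while the mean is 204 minutes, because the 980-minute gap before the last trip dominates the average.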
One quick solution to the outlier stops in the plot above is to throw out stops whose headways exceed a threshold you consider unreasonable. For example, we can keep only stops with headways under 100 minutes.
some_stops_freq_sf <- some_stops %>%
left_join(nyc$stops_frequency, by="stop_id") %>%
select(headway) %>%
filter(headway<100)
plot(some_stops_freq_sf)
Of course, what solution works for you will depend on what you’re trying to accomplish.
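As one alternative to a fixed threshold, the cut-off can be derived from the data itself. The sketch below is an assumption-laden example and not a recommendation from the package: the toy `stop_headways` table is hypothetical, and the "three times the median" rule is an arbitrary choice that you would tune to your own feed.

```r
library(dplyr)

# Hypothetical stop headways in minutes; column names mirror the example above
stop_headways <- tibble(
  stop_id = c("s1", "s2", "s3", "s4", "s5"),
  headway = c(8, 10, 9, 12, 240)
)

# Assumed rule: drop stops whose headway exceeds three times the median
cutoff <- 3 * median(stop_headways$headway)
typical_stops <- filter(stop_headways, headway <= cutoff)
```

With these values the median is 10 minutes, so the cut-off is 30 minutes and only the 240-minute stop is dropped.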