Working with the National Transit Database

This post is the start of a series I’ll write on the National Transit Database (NTD).

The NTD is a series of Excel spreadsheets released annually, around early November, by the Federal Transit Administration (FTA). For now, I’m interested specifically in breakdowns data. In this post, I’ll describe the breakdowns data set’s structure and demonstrate how I clean it using R.

In later posts, I’ll use NTD data to talk about transit in the US.

Getting Started

I download¹ the 2021 breakdowns dataset and import it into an R session:

library(dplyr)   # To clean data.
library(knitr)   # For pretty data tables.
library(readxl)  # Guess.

url = "https://www.transit.dot.gov/sites/fta.dot.gov/files/2022-10/2021%20Breakdowns_static.xlsx"
tmp = tempfile()
download.file(url, tmp, mode = "wb")
ntd = readxl::read_excel(tmp, sheet = "Breakdowns")

dplyr::glimpse(ntd)

Rows: 1,249
Columns: 30
$ Agency                                             <chr> "MTA New York City …
$ City                                               <chr> "Brooklyn", "Brookl…
$ State                                              <chr> "NY", "NY", "NY", "…
$ `Legacy NTD ID`                                    <chr> "2008", "2008", "20…
$ `NTD ID`                                           <dbl> 20008, 20008, 20008…
$ `Organization Type`                                <chr> "Subsidiary Unit of…
$ `Reporter Type`                                    <chr> "Full Reporter", "F…
$ `Primary UZA Population`                           <dbl> 18351295, 18351295,…
$ `Agency VOMS`                                      <dbl> 10075, 10075, 10075…
$ Mode                                               <chr> "DR", "HR", "CB", "…
$ `Type of Service`                                  <chr> "PT", "DO", "DO", "…
$ `Mode VOMS`                                        <dbl> 878, 5410, 433, 322…
$ `Major Mechanical Failures`                        <dbl> 935, 2257, 440, 979…
$ `Major Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Other Mechanical Failures`                        <dbl> 691, 12732, 94, 230…
$ `Other Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Total Mechanical Failures`                        <dbl> 1626, 14989, 534, 1…
$ `Total Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Vehicle/Passenger Car Miles`                      <dbl> 29546888, 341593245…
$ `Vehicle/Passenger Car Miles Questionable`         <chr> NA, NA, NA, NA, NA,…
$ `Vehicle/Passenger Car Revenue Miles`              <dbl> 23722953, 331253516…
$ `Vehicle/Passenger Car Revenue Miles Questionable` <chr> NA, NA, NA, NA, NA,…
$ `Train Miles`                                      <dbl> 0, 37477972, 0, 0, …
$ `Train Miles Questionable`                         <lgl> NA, NA, NA, NA, NA,…
$ `Train Revenue Miles`                              <dbl> 0, 36328789, 0, 0, …
$ `Train Revenue Miles Questionable`                 <lgl> NA, NA, NA, NA, NA,…
$ `Any Data Questionable?`                           <chr> "No", "No", "No", "…
$ ...28                                              <dbl> NA, NA, 1, NA, NA, …
$ `0`                                                <chr> "Hide Questionable …
$ ...30                                              <dbl> NA, NA, 1, NA, NA, …

My purposes need just certain fields and records:

Primary keys: It’s better if I demonstrate what each record represents in the next section.
- NTD ID: Unique agency ID number.
- Mode: Transit type. Train, plane, or automobile? I only have energy to cover a few:
  - Bus (MB)
  - Commuter Rail (CR)
  - Heavy Rail (HR)
  - Light Rail (LR)
- Type of Service: Whether the transit was Purchased Transit (PT) (contracted out to a private firm) or Directly Operated (DO) (in-house).
Identifying info:
- Agency: Agency name and nickname/acronym, if it has one (e.g., Metra, WMATA).
- City: Agency HQ municipality.
- State: Agency HQ state.
Metrics:
- Mean Miles Between Breakdowns [calculated]: Breakdown frequency. \[\frac{VRM}{Total Mechanical Failures}\]
- Total Mechanical Failures: Breakdowns. \[{Major Mechanical Failures} + {Other Mechanical Failures}\]
  - No 0 values, since this is the denominator in my calculated metric.
- Vehicle/Passenger Car Revenues Miles (VRM): Total miles run in revenue service.
  - No 0 values. I only want transit that actually runs.
Quality flags: Indicate data that don’t meet the FTA’s quality standards. "Q" if data quality is questionable. "W" if the agency didn’t report this data to the FTA’s standards, but the FTA gave them a waiver.
- Total Mechanical Failures Questionable
  - No "Q" values.
- Vehicle/Passenger Car Revenue Miles Questionable
  - No "Q" values.

ntd = ntd %>%
  dplyr::select(
    `NTD ID`,
    Mode,
    `Type of Service`,
    Agency,
    `HQ City` = City,
    State,
    dplyr::starts_with("Vehicle/Passenger Car Revenue Miles"),
    dplyr::starts_with("Total Mechanical Failures")
  ) %>%
  dplyr::filter(
    Mode %in% c("CR", "HR", "LR", "MB"),
    `Vehicle/Passenger Car Revenue Miles Questionable` %in% c("W", NA_character_),
    `Vehicle/Passenger Car Revenue Miles` > 0,
    `Total Mechanical Failures Questionable` %in% c("W", NA_character_),
    `Total Mechanical Failures` > 0
  ) %>%
  dplyr::mutate(
    # Get agency nickname only.
    Agency = gsub(".*dba: ", "", Agency),
    `Mean Miles Between Breakdowns` = `Vehicle/Passenger Car Revenue Miles` / `Total Mechanical Failures`
  ) %>%
  dplyr::group_by(Mode) %>%
  dplyr::mutate(Rank = dplyr::min_rank(dplyr::desc(`Mean Miles Between Breakdowns`))) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(Mode, Rank)

ntd %>%
  dplyr::select(Mode, Rank, Agency, `HQ City`, State, `Type of Service`, `Mean Miles Between Breakdowns`) %>%
  dplyr::filter(Mode == "CR", Rank <= 10) %>%
  dplyr::mutate(
    `Mean Miles Between Breakdowns` = scales::comma(`Mean Miles Between Breakdowns`)
  ) %>%
  knitr::kable()

Mode	Rank	Agency	HQ City	State	Type of Service	Mean Miles Between Breakdowns
CR	1	Northern New England Passenger Rail Authority	Portland	ME	PT	2,105,053
CR	2	Pennsylvania Department of Transportation	Harrisburg	PA	PT	1,619,575
CR	3	MTA Metro-North Railroad	New York	NY	DO	910,686
CR	4	Northern Indiana Commuter Transportation District	Chesterton	IN	DO	764,272
CR	5	MTA Long Island Rail Road	Jamaica	NY	DO	748,705
CR	6	Metra	Chicago	IL	DO	536,836
CR	7	Trinity Metro	Fort Worth	TX	PT	488,962
CR	8	Metra	Chicago	IL	PT	478,091
CR	9	Caltrain	San Carlos	CA	PT	223,990
CR	10	North County Transit District	Oceanside	CA	PT	195,749

What Each Record Represents

Note that each record is a mode-agency-service. Here’s a smaller data set to demonstrate:

ntd %>%
  dplyr::filter(
    Mode %in% c("CR", "LR"),
    Agency %in% c("New Jersey Transit Corporation", "Metra")
  ) %>%
  dplyr::select(Mode, Agency, `HQ City`, State, `Type of Service`, `Total Mechanical Failures`) %>%
  knitr::kable()

Mode	Agency	HQ City	State	Type of Service	Total Mechanical Failures
CR	Metra	Chicago	IL	DO	30
CR	Metra	Chicago	IL	PT	35
CR	New Jersey Transit Corporation	Newark	NJ	DO	409
LR	New Jersey Transit Corporation	Newark	NJ	PT	257
LR	New Jersey Transit Corporation	Newark	NJ	DO	124

Metra runs 2 Commuter Rail (CR) services, one in-house (Directly Operated [DO]) and one contracted out (Purchased Transit [PT]).

NJ Transit runs just 1 CR service directly, but 2 Light Rail (LR) services. The DO service is the River Line and Newark LR combined; the PT service is the Hudson-Bergen LR.

Hopefully this clarifies what a mode-agency-service record is.

The Big Guys

I pay special attention to the largest agency-services, those running the most VRM. For each mode, I rank the 10 largest agency-services by breakdown frequency.

biggest = ntd %>%
  dplyr::group_by(Mode) %>%
  dplyr::top_n(10, `Vehicle/Passenger Car Revenue Miles`) %>%
  dplyr::mutate(`Rank, 10 Biggest` = dplyr::min_rank(dplyr::desc(`Mean Miles Between Breakdowns`))) %>%
  dplyr::ungroup() %>%
  dplyr::select(`NTD ID`, Mode, `Type of Service`, `Rank, 10 Biggest`)

ntd = dplyr::left_join(ntd, biggest, by = c("NTD ID", "Mode", "Type of Service"))

ntd %>%
  dplyr::filter(Mode == "CR", !is.na(`Rank, 10 Biggest`)) %>%
  dplyr::select(
    Mode,
    `Rank, 10 Biggest`,
    Agency,
    `HQ City`,
    State,
    `Type of Service`,
    `Mean Miles Between Breakdowns`
  ) %>%
  dplyr::mutate(`Mean Miles Between Breakdowns` = scales::comma(`Mean Miles Between Breakdowns`)) %>%
  knitr::kable()

Mode	Rank, 10 Biggest	Agency	HQ City	State	Type of Service	Mean Miles Between Breakdowns
CR	1	MTA Metro-North Railroad	New York	NY	DO	910,686
CR	2	MTA Long Island Rail Road	Jamaica	NY	DO	748,705
CR	3	Metra	Chicago	IL	DO	536,836
CR	4	Metra	Chicago	IL	PT	478,091
CR	5	Caltrain	San Carlos	CA	PT	223,990
CR	6	Metrolink	Los Angeles	CA	PT	175,324
CR	7	New Jersey Transit Corporation	Newark	NJ	DO	138,067
CR	8	Southeastern Pennsylvania Transportation Authority	Philadelphia	PA	DO	134,939
CR	9	Massachusetts Bay Transportation Authority	Boston	MA	PT	77,850
CR	10	Denver Regional Transportation District	Denver	CO	PT	19,564

And that’s it. I’ll post my full rankings soon. In another upcoming post I’ll cover why I’m doing this in the first place.

Footnotes

An enterprising young programmer could build an R/Python/etc package that automatically downloads NTD data, cleans it, and loads as a DataFrame. The tidycensus R package does this well for ACS and decennial Census data.↩︎