Working with the National Transit Database

Transit
R
Published

January 11, 2023

Modified

August 27, 2023

This post is the start of a series I’ll write on the National Transit Database (NTD).

The NTD is a series of Excel spreadsheets released annually, around early November, by the Federal Transit Administration (FTA). For now, I’m interested specifically in breakdowns data. In this post, I’ll describe the breakdowns data set’s structure and demonstrate how I clean it using R.

In later posts, I’ll use NTD data to talk about transit in the US.

Getting Started

I download1 the 2021 breakdowns dataset and import it into an R session:

library(dplyr)   # To clean data.
library(knitr)   # For pretty data tables.
library(readxl)  # Guess.

url = "https://www.transit.dot.gov/sites/fta.dot.gov/files/2022-10/2021%20Breakdowns_static.xlsx"
tmp = tempfile()
download.file(url, tmp, mode = "wb")
ntd = readxl::read_excel(tmp, sheet = "Breakdowns")

dplyr::glimpse(ntd)
Rows: 1,249
Columns: 30
$ Agency                                             <chr> "MTA New York City …
$ City                                               <chr> "Brooklyn", "Brookl…
$ State                                              <chr> "NY", "NY", "NY", "…
$ `Legacy NTD ID`                                    <chr> "2008", "2008", "20…
$ `NTD ID`                                           <dbl> 20008, 20008, 20008…
$ `Organization Type`                                <chr> "Subsidiary Unit of…
$ `Reporter Type`                                    <chr> "Full Reporter", "F…
$ `Primary UZA Population`                           <dbl> 18351295, 18351295,…
$ `Agency VOMS`                                      <dbl> 10075, 10075, 10075…
$ Mode                                               <chr> "DR", "HR", "CB", "…
$ `Type of Service`                                  <chr> "PT", "DO", "DO", "…
$ `Mode VOMS`                                        <dbl> 878, 5410, 433, 322…
$ `Major Mechanical Failures`                        <dbl> 935, 2257, 440, 979…
$ `Major Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Other Mechanical Failures`                        <dbl> 691, 12732, 94, 230…
$ `Other Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Total Mechanical Failures`                        <dbl> 1626, 14989, 534, 1…
$ `Total Mechanical Failures Questionable`           <lgl> NA, NA, NA, NA, NA,…
$ `Vehicle/Passenger Car Miles`                      <dbl> 29546888, 341593245…
$ `Vehicle/Passenger Car Miles Questionable`         <chr> NA, NA, NA, NA, NA,…
$ `Vehicle/Passenger Car Revenue Miles`              <dbl> 23722953, 331253516…
$ `Vehicle/Passenger Car Revenue Miles Questionable` <chr> NA, NA, NA, NA, NA,…
$ `Train Miles`                                      <dbl> 0, 37477972, 0, 0, …
$ `Train Miles Questionable`                         <lgl> NA, NA, NA, NA, NA,…
$ `Train Revenue Miles`                              <dbl> 0, 36328789, 0, 0, …
$ `Train Revenue Miles Questionable`                 <lgl> NA, NA, NA, NA, NA,…
$ `Any Data Questionable?`                           <chr> "No", "No", "No", "…
$ ...28                                              <dbl> NA, NA, 1, NA, NA, …
$ `0`                                                <chr> "Hide Questionable …
$ ...30                                              <dbl> NA, NA, 1, NA, NA, …

My purposes need just certain fields and records:

  • Primary keys: It’s better if I demonstrate what each record represents in the next section.

    • NTD ID: Unique agency ID number.

    • Mode: Transit type. Train, plane, or automobile? I only have energy to cover a few:

      • Bus (MB)

      • Commuter Rail (CR)

      • Heavy Rail (HR)

      • Light Rail (LR)

    • Type of Service: Whether the transit was Purchased Transit (PT) (contracted out to a private firm) or Directly Operated (DO) (in-house).

  • Identifying info:

    • Agency: Agency name and nickname/acronym, if it has one (e.g., Metra, WMATA).

    • City: Agency HQ municipality.

    • State: Agency HQ state.

  • Metrics:

    • Mean Miles Between Breakdowns [calculated]: Breakdown frequency. \[\frac{VRM}{Total Mechanical Failures}\]

    • Total Mechanical Failures: Breakdowns. \[{Major Mechanical Failures} + {Other Mechanical Failures}\]

      • No 0 values, since this is the denominator in my calculated metric.
    • Vehicle/Passenger Car Revenues Miles (VRM): Total miles run in revenue service.

      • No 0 values. I only want transit that actually runs.
  • Quality flags: Indicate data that don’t meet the FTA’s quality standards. "Q" if data quality is questionable. "W" if the agency didn’t report this data to the FTA’s standards, but the FTA gave them a waiver.

    • Total Mechanical Failures Questionable

      • No "Q" values.
    • Vehicle/Passenger Car Revenue Miles Questionable

      • No "Q" values.
ntd = ntd %>%
  dplyr::select(
    `NTD ID`,
    Mode,
    `Type of Service`,
    Agency,
    `HQ City` = City,
    State,
    dplyr::starts_with("Vehicle/Passenger Car Revenue Miles"),
    dplyr::starts_with("Total Mechanical Failures")
  ) %>%
  dplyr::filter(
    Mode %in% c("CR", "HR", "LR", "MB"),
    `Vehicle/Passenger Car Revenue Miles Questionable` %in% c("W", NA_character_),
    `Vehicle/Passenger Car Revenue Miles` > 0,
    `Total Mechanical Failures Questionable` %in% c("W", NA_character_),
    `Total Mechanical Failures` > 0
  ) %>%
  dplyr::mutate(
    # Get agency nickname only.
    Agency = gsub(".*dba: ", "", Agency),
    `Mean Miles Between Breakdowns` = `Vehicle/Passenger Car Revenue Miles` / `Total Mechanical Failures`
  ) %>%
  dplyr::group_by(Mode) %>%
  dplyr::mutate(Rank = dplyr::min_rank(dplyr::desc(`Mean Miles Between Breakdowns`))) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(Mode, Rank)

ntd %>%
  dplyr::select(Mode, Rank, Agency, `HQ City`, State, `Type of Service`, `Mean Miles Between Breakdowns`) %>%
  dplyr::filter(Mode == "CR", Rank <= 10) %>%
  dplyr::mutate(
    `Mean Miles Between Breakdowns` = scales::comma(`Mean Miles Between Breakdowns`)
  ) %>%
  knitr::kable()
Mode Rank Agency HQ City State Type of Service Mean Miles Between Breakdowns
CR 1 Northern New England Passenger Rail Authority Portland ME PT 2,105,053
CR 2 Pennsylvania Department of Transportation Harrisburg PA PT 1,619,575
CR 3 MTA Metro-North Railroad New York NY DO 910,686
CR 4 Northern Indiana Commuter Transportation District Chesterton IN DO 764,272
CR 5 MTA Long Island Rail Road Jamaica NY DO 748,705
CR 6 Metra Chicago IL DO 536,836
CR 7 Trinity Metro Fort Worth TX PT 488,962
CR 8 Metra Chicago IL PT 478,091
CR 9 Caltrain San Carlos CA PT 223,990
CR 10 North County Transit District Oceanside CA PT 195,749

What Each Record Represents

Note that each record is a mode-agency-service. Here’s a smaller data set to demonstrate:

ntd %>%
  dplyr::filter(
    Mode %in% c("CR", "LR"),
    Agency %in% c("New Jersey Transit Corporation", "Metra")
  ) %>%
  dplyr::select(Mode, Agency, `HQ City`, State, `Type of Service`, `Total Mechanical Failures`) %>%
  knitr::kable()
Mode Agency HQ City State Type of Service Total Mechanical Failures
CR Metra Chicago IL DO 30
CR Metra Chicago IL PT 35
CR New Jersey Transit Corporation Newark NJ DO 409
LR New Jersey Transit Corporation Newark NJ PT 257
LR New Jersey Transit Corporation Newark NJ DO 124

Metra runs 2 Commuter Rail (CR) services, one in-house (Directly Operated [DO]) and one contracted out (Purchased Transit [PT]).

NJ Transit runs just 1 CR service directly, but 2 Light Rail (LR) services. The DO service is the River Line and Newark LR combined; the PT service is the Hudson-Bergen LR.

Hopefully this clarifies what a mode-agency-service record is.

The Big Guys

I pay special attention to the largest agency-services, those running the most VRM. For each mode, I rank the 10 largest agency-services by breakdown frequency.

biggest = ntd %>%
  dplyr::group_by(Mode) %>%
  dplyr::top_n(10, `Vehicle/Passenger Car Revenue Miles`) %>%
  dplyr::mutate(`Rank, 10 Biggest` = dplyr::min_rank(dplyr::desc(`Mean Miles Between Breakdowns`))) %>%
  dplyr::ungroup() %>%
  dplyr::select(`NTD ID`, Mode, `Type of Service`, `Rank, 10 Biggest`)

ntd = dplyr::left_join(ntd, biggest, by = c("NTD ID", "Mode", "Type of Service"))

ntd %>%
  dplyr::filter(Mode == "CR", !is.na(`Rank, 10 Biggest`)) %>%
  dplyr::select(
    Mode,
    `Rank, 10 Biggest`,
    Agency,
    `HQ City`,
    State,
    `Type of Service`,
    `Mean Miles Between Breakdowns`
  ) %>%
  dplyr::mutate(`Mean Miles Between Breakdowns` = scales::comma(`Mean Miles Between Breakdowns`)) %>%
  knitr::kable()
Mode Rank, 10 Biggest Agency HQ City State Type of Service Mean Miles Between Breakdowns
CR 1 MTA Metro-North Railroad New York NY DO 910,686
CR 2 MTA Long Island Rail Road Jamaica NY DO 748,705
CR 3 Metra Chicago IL DO 536,836
CR 4 Metra Chicago IL PT 478,091
CR 5 Caltrain San Carlos CA PT 223,990
CR 6 Metrolink Los Angeles CA PT 175,324
CR 7 New Jersey Transit Corporation Newark NJ DO 138,067
CR 8 Southeastern Pennsylvania Transportation Authority Philadelphia PA DO 134,939
CR 9 Massachusetts Bay Transportation Authority Boston MA PT 77,850
CR 10 Denver Regional Transportation District Denver CO PT 19,564

And that’s it. I’ll post my full rankings soon. In another upcoming post I’ll cover why I’m doing this in the first place.

Footnotes

  1. An enterprising young programmer could build an R/Python/etc package that automatically downloads NTD data, cleans it, and loads as a DataFrame. The tidycensus R package does this well for ACS and decennial Census data.↩︎