Examining how total USA airline passenger changes following the Covid pandemic
Clearly air traffic plummeted during Covid. Bob Rudis tweeted a great figure showing how passenger numbers changed between 2019 and 2020. I wanted to recreate the figure here. It’s a good exercise in web scraping data, data cleaning, and how to plan and actually make appealing and informative time-series line graphs with customized ggplot2
elements.
The data are available at the TSA website here. They appear to be updated daily, so hopefully the code below should work no matter what day the data are downloaded.
We will use the rvest
package to read in the url and scrape the HTML table. The extract2()
function comes from magrittr
and helps us get the first (and only) element from the list produced by html_table()
.
library(rvest)
library(tidyverse)
library(magrittr)
url <- "https://www.tsa.gov/coronavirus/passenger-throughput"
df <- read_html(url) %>%
html_table() %>%
extract2(1)
head(df)
Date Total Traveler Throughput
1 10/19/2020 921,031
2 10/18/2020 1,031,505
3 10/17/2020 788,743
4 10/16/2020 973,046
5 10/15/2020 950,024
6 10/14/2020 717,940
Total Traveler Throughput (1 Year Ago - Same Weekday)
1 2,514,673
2 2,606,266
3 2,049,855
4 2,637,667
5 2,581,007
6 2,317,763
We have three columns: date, total passengers in 2020, and total passengers on the corresponding weekday of 2019. So, the final column is not the exact same calendar data - but the corresponding closest Monday, Tuesday, etc. of 2019.
First, let’s clean up the column names:
colnames(df) <- c("date", "pass2020", "pass2019")
head(df)
date pass2020 pass2019
1 10/19/2020 921,031 2,514,673
2 10/18/2020 1,031,505 2,606,266
3 10/17/2020 788,743 2,049,855
4 10/16/2020 973,046 2,637,667
5 10/15/2020 950,024 2,581,007
6 10/14/2020 717,940 2,317,763
We still have some cleaning steps. We need to fix the date, the numbers and to reshape the dataframe.
Let’s make the date a date object:
df$date <- as.Date(df$date, format="%m/%d/%Y")
str(df) #it's now a 'date' column
'data.frame': 233 obs. of 3 variables:
$ date : Date, format: "2020-10-19" ...
$ pass2020: chr "921,031" "1,031,505" "788,743" "973,046" ...
$ pass2019: chr "2,514,673" "2,606,266" "2,049,855" "2,637,667" ...
Next, we’ll remove commas from the numbers and ensure the columns are numeric - you can see from the output above that R still thinks they are characters:
df$pass2020 <- as.numeric(gsub(",", "", df$pass2020))
df$pass2019 <- as.numeric(gsub(",", "", df$pass2019))
head(df)
date pass2020 pass2019
1 2020-10-19 921031 2514673
2 2020-10-18 1031505 2606266
3 2020-10-17 788743 2049855
4 2020-10-16 973046 2637667
5 2020-10-15 950024 2581007
6 2020-10-14 717940 2317763
To be able to plot these data with ggplot2
, we need to convert this dataframe from a wide dataframe to a long dataframe:
df <- df %>%
pivot_longer(cols=2:3)
df
# A tibble: 466 x 3
date name value
<date> <chr> <dbl>
1 2020-10-19 pass2020 921031
2 2020-10-19 pass2019 2514673
3 2020-10-18 pass2020 1031505
4 2020-10-18 pass2019 2606266
5 2020-10-17 pass2020 788743
6 2020-10-17 pass2019 2049855
7 2020-10-16 pass2020 973046
8 2020-10-16 pass2019 2637667
9 2020-10-15 pass2020 950024
10 2020-10-15 pass2019 2581007
# ... with 456 more rows
Now we can graph. We’ll first make the basic graph, and then we can add the customization:
ggplot(df, aes(x=date, y=value, color=name)) +
geom_line()
There are lots of interesting patterns in that graph. Clearly, number one is that the passenger numbers in mid March 2020 showed a dramatic fall, and a very slow recovery. The rate of recovery seems to have slowed down between July 2020 and October 2020. Starting in October 2020 we’re perhaps seeing a slightly increase again in the rate of passenger increase. The other patterns of interest are the weekly cycle of passenger numbers - an obvious pattern, but always notable to observe, as well as the season variation in traffic that usually occur in normal years.
To make this chart more readable, we want to change the passenger numbers on the y-axis from ‘e’ numbers to understandable numbers, add a title, remove the legend, change the theme and the line colors. I will also make the lines thicker and change the font size of the text on the figure. We should also source the data and original author.
library(scales) # needed to reformat the y-axis with commas in the numbers
ggplot(df, aes(x=date, y=value, color=name)) +
geom_line(lwd=1) +
scale_color_manual(values=c("#f59a46", "#123abc"))+
scale_y_continuous(breaks = seq(0,3000000, 500000), labels = comma) +
ylab("") +
xlab("") +
ggtitle("TSA checkpoint travel numbers for 2020 and 2019",
subtitle = "Total Traveler Throughput; Same Weekday") +
theme_minimal() +
labs(caption = "Adapated from Figure by Bob Rudis; Data: https://www.tsa.gov/coronavirus/passenger-throughput") +
theme(
legend.position = 'none',
axis.text = element_text(size=12),
plot.title = element_text(size=18)
)
This looks a lot better but it still has one major issue, which is you don’t know which line is 2019 and which line is 2020. Well, you could take a guess, but it should be keyed in the figure. I want to change colors in the title of the plot to indicate this.
This is a little trickier and requires a package called ggtext
which is downloadable from here. The code you need to include requires you to use a little bit of HTML tagging, but the overall effect is worth it. More examples of using ggtext
can be found on it’s GitHub page. Here is a useful example also on stackoverflow.
library(ggtext) #remotes::install_github("wilkelab/ggtext")
ggplot(df, aes(x=date, y=value, color=name)) +
geom_line(lwd = 1) +
scale_color_manual(
name = NULL,
values = c(pass2019 = "#f59a46", pass2020 = "#123abc"),
labels = c(
pass2019 = "<i style='color:#f59a46'>I. pass2019</i>",
pass2020 = "<i style='color:#123abc'>I. pass2020</i>")
) +
labs(
title = "<span style='font-size:18pt'>**TSA checkpoint travel numbers for <span style='color:#123abc;'>2020</span>
and <span style='color:#f59a46;'>2019</span>**</span>
<span style='font-size:12pt'>Total Traveler Throughput; Same Weekday</span>
",
x = "Sepal length (cm)", y = "Sepal width (cm)"
) +
scale_y_continuous(breaks = seq(0,3000000, 500000), labels = comma) +
ylab("") +
xlab("")+
theme_minimal() +
labs(caption = "Adapated from Figure by Bob Rudis; Data: https://www.tsa.gov/coronavirus/passenger-throughput")+
theme(
plot.title = element_markdown(lineheight = 1.1),
legend.position = 'none',
axis.text = element_text(size=12)
)
For attribution, please cite this work as
Curley (2020, Oct. 20). James' R Blog: TSA passenger screenings during Covid. Retrieved from https://jamescurley.blog/posts/2020-10-20-tsa-passenger-screenings-during-covid/
BibTeX citation
@misc{curley2020tsa, author = {Curley, James}, title = {James' R Blog: TSA passenger screenings during Covid}, url = {https://jamescurley.blog/posts/2020-10-20-tsa-passenger-screenings-during-covid/}, year = {2020} }