The sfo_stats dataset provides monthly statistics on San Francisco International Airport’s air traffic landings between July 2005 and September 2020. The following vignette demonstrates some approaches for exploring the dataset. As the structure of sfo_stats is similar to that of the sfo_passengers dataset, we will repeat the same data prep steps shown in the previous vignette. We will use the dplyr and plotly packages for data manipulation and visualization, respectively.
For simplicity, let’s use a shorter name, d, for the dataset:
library(sfo)
library(dplyr)
library(plotly)
d <- sfo_stats
head(d)
#> activity_period operating_airline operating_airline_iata_code
#> 1 202009 United Airlines UA
#> 2 202009 United Airlines UA
#> 3 202009 United Airlines UA
#> 4 202009 United Airlines UA
#> 5 202009 United Airlines UA
#> 6 202009 United Airlines UA
#> published_airline published_airline_iata_code geo_summary geo_region
#> 1 United Airlines UA International Mexico
#> 2 United Airlines UA Domestic US
#> 3 United Airlines UA International Canada
#> 4 United Airlines UA Domestic US
#> 5 United Airlines UA International Mexico
#> 6 United Airlines UA Domestic US
#> landing_aircraft_type aircraft_body_type aircraft_manufacturer aircraft_model
#> 1 Passenger Narrow Body Airbus A320
#> 2 Passenger Narrow Body Boeing B738
#> 3 Passenger Narrow Body Boeing B738
#> 4 Passenger Narrow Body Boeing B738
#> 5 Passenger Narrow Body Boeing B738
#> 6 Passenger Narrow Body Boeing B739
#> aircraft_version landing_count total_landed_weight
#> 1 - 37 5261326
#> 2 - 14 2048200
#> 3 - 1 146300
#> 4 - 251 36721300
#> 5 - 3 438900
#> 6 - 553 86986900
Next, let’s reformat the period indicator, activity_period, to a Date object, setting the first day of the month as the default day:
d$date <- as.Date(paste(substr(d$activity_period, 1, 4),
                        substr(d$activity_period, 5, 6),
                        "01", sep = "/"))
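As a quick sanity check of the conversion above, the following sketch applies the same logic to a few hypothetical YYYYMM values (the `period` vector is made up for illustration and is not taken from the dataset). It also shows an alternative that parses the string with an explicit format, which should give the same dates:

```r
# Hypothetical sample of YYYYMM integers, mimicking activity_period
period <- c(200507L, 201912L, 202009L)

# Same approach as above: split into year and month, then append day "01"
date_a <- as.Date(paste(substr(period, 1, 4),
                        substr(period, 5, 6),
                        "01", sep = "/"))

# Alternative: append "01" and parse with an explicit format string
date_b <- as.Date(paste0(period, "01"), format = "%Y%m%d")

print(date_a)

# Check that both approaches agree
all(date_a == date_b)
```

Either form works here; the explicit format string makes the expected layout of the input visible in the code.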
We can see the structure of the dataset with the str command:
str(d)
#> 'data.frame': 25429 obs. of 15 variables:
#> $ activity_period : int 202009 202009 202009 202009 202009 202009 202009 202009 202009 202009 ...
#> $ operating_airline : chr "United Airlines" "United Airlines" "United Airlines" "United Airlines" ...
#> $ operating_airline_iata_code: chr "UA" "UA" "UA" "UA" ...
#> $ published_airline : chr "United Airlines" "United Airlines" "United Airlines" "United Airlines" ...
#> $ published_airline_iata_code: chr "UA" "UA" "UA" "UA" ...
#> $ geo_summary : chr "International" "Domestic" "International" "Domestic" ...
#> $ geo_region : chr "Mexico" "US" "Canada" "US" ...
#> $ landing_aircraft_type : chr "Passenger" "Passenger" "Passenger" "Passenger" ...
#> $ aircraft_body_type : chr "Narrow Body" "Narrow Body" "Narrow Body" "Narrow Body" ...
#> $ aircraft_manufacturer : chr "Airbus" "Boeing" "Boeing" "Boeing" ...
#> $ aircraft_model : chr "A320" "B738" "B738" "B738" ...
#> $ aircraft_version : chr "-" "-" "-" "-" ...
#> $ landing_count : int 37 14 1 251 3 553 1 13 52 102 ...
#> $ total_landed_weight : int 5261326 2048200 146300 36721300 438900 86986900 157300 2044900 10296000 22848000 ...
#> $ date : Date, format: "2020-09-01" "2020-09-01" ...
The dataset has 11 categorical variables and two numeric variables: landing_count and total_landed_weight.
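One way to tally column types without reading the str output line by line is to tabulate the class of each column. The sketch below uses a small toy data frame with hypothetical values (only the column names follow sfo_stats), so it runs without the package:

```r
# Toy data frame with hypothetical values; column names follow sfo_stats
toy <- data.frame(
  activity_period = c(202009L, 202009L),
  geo_region = c("Mexico", "US"),
  aircraft_model = c("A320", "B738"),
  landing_count = c(37L, 14L),
  stringsAsFactors = FALSE
)

# Class of each column, then a count of columns per class
col_class <- sapply(toy, class)
table(col_class)
```

Applying the same two lines to d should reproduce the counts visible in the str output above.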
Let’s start by viewing the total monthly number of landings at SFO:
d %>%
  group_by(date) %>%
  summarise(landing_count = sum(landing_count)) %>%
  plot_ly(x = ~ date, y = ~ landing_count,
          type = "scatter", mode = "lines") %>%
  layout(title = "Monthly Landings at SFO Airport",
         yaxis = list(title = "Number of Landings"),
         xaxis = list(title = "Source: San Francisco data portal (DataSF)"))
As can be seen in the aggregate plot above, the data has:
We can use plotly’s fill plot to review the distribution of landings at SFO by geo region:
d %>%
  group_by(date, geo_region) %>%
  summarise(landing_count = sum(landing_count)) %>%
  as.data.frame() %>%
  plot_ly(x = ~ date,
          y = ~ landing_count,
          type = 'scatter',
          mode = 'none',
          stackgroup = 'one',
          groupnorm = 'percent',
          fillcolor = ~ geo_region) %>%
  layout(title = "Dist. of Landings at SFO by Region",
         yaxis = list(title = "Percentage",
                      ticksuffix = "%"))
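The groupnorm = 'percent' argument rescales the stacked values so that they sum to 100 at each date. To make that normalization concrete, the following base R sketch computes the same shares by hand on a toy data frame (the values are hypothetical and chosen only for easy arithmetic):

```r
# Toy monthly landings by region (hypothetical numbers)
toy <- data.frame(
  date = as.Date(c("2019-01-01", "2019-01-01", "2019-02-01", "2019-02-01")),
  geo_region = c("US", "Europe", "US", "Europe"),
  landing_count = c(300L, 100L, 150L, 50L)
)

# Total landings per month, repeated on every row of that month
month_total <- ave(toy$landing_count, format(toy$date), FUN = sum)

# Share of each region within its month, in percent
toy$share <- 100 * toy$landing_count / month_total
toy
```

Within each month the shares add up to 100, which is what the stacked percentage plot displays.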
As expected, the distribution of landings by geo region has changed since March 2020 due to the COVID-19 pandemic.
The aircraft_manufacturer column, as the name implies, provides the aircraft manufacturer. Let’s summarize the total landings during 2019, the most recent full calendar year, by manufacturer:
d %>%
  filter(activity_period >= 201901 & activity_period < 202001,
         aircraft_manufacturer != "") %>%
  group_by(aircraft_manufacturer) %>%
  summarise(total_landing = sum(landing_count),
            `.groups` = "drop") %>%
  arrange(-total_landing) %>%
  plot_ly(labels = ~ aircraft_manufacturer,
          values = ~ total_landing) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Landing Distribution by Aircraft Manufacturer During 2019")
Similarly, we can add the aircraft_body_type column and get the distribution of landing airplanes during 2019 by manufacturer and body type (e.g., wide, narrow, etc.):
d %>%
  filter(activity_period >= 201901 & activity_period < 202001,
         aircraft_manufacturer != "") %>%
  group_by(aircraft_manufacturer, aircraft_body_type) %>%
  summarise(total_landing = sum(landing_count),
            `.groups` = "drop") %>%
  arrange(-total_landing)
#> # A tibble: 9 x 3
#> aircraft_manufacturer aircraft_body_type total_landing
#> <chr> <chr> <int>
#> 1 Boeing Narrow Body 78143
#> 2 Airbus Narrow Body 56148
#> 3 Boeing Wide Body 25950
#> 4 Embraer Regional Jet 24324
#> 5 Bombardier Regional Jet 20862
#> 6 Airbus Wide Body 5753
#> 7 Bombardier Narrow Body 1014
#> 8 McDonnell Douglas Narrow Body 3
#> 9 McDonnell Douglas Wide Body 1
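The 2019 filter used in the examples above relies on activity_period being a YYYYMM integer, so a range comparison selects exactly one calendar year. The sketch below checks this on a hypothetical vector of periods and compares it with an equivalent test that extracts the year by integer division:

```r
# Hypothetical YYYYMM values around the 2019 boundaries
period <- c(201812L, 201901L, 201906L, 201912L, 202001L)

# Range comparison, as used in the filter() calls above
in_2019_range <- period >= 201901 & period < 202001

# Alternative: drop the two month digits and compare the year
in_2019_year <- period %/% 100 == 2019

period[in_2019_range]

# Check that both conditions select the same rows
all(in_2019_range == in_2019_year)
```

The integer-division form may be easier to generalize, for example when grouping by year rather than filtering a single one.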
A Sankey plot shows how a numeric value flows across multiple categorical variables. In the following example, we will use the sankey_ly function to plot the distribution of landings during 2019 by geo region, flight type, and aircraft details:
d %>%
  filter(activity_period >= 201901 & activity_period < 202001,
         aircraft_manufacturer != "") %>%
  group_by(geo_region, landing_aircraft_type,
           aircraft_manufacturer, aircraft_model,
           aircraft_body_type) %>%
  summarise(total_landing = sum(landing_count),
            .groups = "drop") %>%
  sankey_ly(cat_cols = c("geo_region",
                         "landing_aircraft_type",
                         "aircraft_manufacturer",
                         "aircraft_model",
                         "aircraft_body_type"),
            num_col = "total_landing",
            title = "SFO Landing Summary by Geo Region and Aircraft Type During 2019")