The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This is practically the same code you can find on this blog post of
mine: https://brodrigues.co/posts/2018-11-14-luxairport.html
with some minor updates to reflect the current state of the
{tidyverse}
packages as well as logging using
{chronicler}
.
Let’s first load the required packages, and the avia
dataset included in the {chronicler}
package:
library(chronicler)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
data("avia")
Now I need to define the needed functions for the analysis. To
improve logging, I add the dim()
function as the
.g
argument of each function below. This will make it
possible to see how the dimensions of the data change inside the
pipeline:
# Define required functions
# You can use `record_many()` to avoid having to write everything
r_select <- record(select, .g = dim)
r_pivot_longer <- record(pivot_longer, .g = dim)
r_filter <- record(filter, .g = dim)
r_separate <- record(separate, .g = dim)
r_group_by <- record(group_by, .g = dim)
r_summarise <- record(summarise, .g = dim)
We can now start by preparing the data:
avia_clean <- avia %>%
r_select(1, contains("20")) %>% # select the first column and every column starting with 20
bind_record(r_pivot_longer,
-starts_with("freq"),
names_to = "date",
values_to = "passengers") %>%
bind_record(r_separate,
col = 1,
into = c("freq", "unit", "tra_meas", "air_pr\\time"),
sep = ",")
Let’s take a look at the data:
avia_clean
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 464,616 × 6
#> freq unit tra_meas `air_pr\\time` date passengers
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2000 :
#> 2 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2001 :
#> 3 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2002 :
#> 4 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2003 :
#> 5 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2004 :
#> 6 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005 :
#> 7 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-01 143
#> 8 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-02 134
#> 9 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-03 154
#> 10 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-04 142
#> # ℹ 464,606 more rows
#>
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with unveil(.c, "value").
#> To read the log of this object, call read_log(.c).
The passengers column contains ":"
characters instead of
NA
s, and it’s a character column. Let’s convert this column
to numbers:
r_mutate <- record(mutate, .g = dim)
avia_clean2 <- avia_clean %>%
bind_record(r_mutate,
passengers = as.numeric(passengers))
Let’s look at the data:
avia_clean2
#> NOK! Value computed unsuccessfully:
#> ---------------
#> Nothing
#>
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with unveil(.c, "value").
#> To read the log of this object, call read_log(.c).
What happened? Let’s read the log to find out!
read_log(avia_clean2)
#> [1] "OK `select` at 10:57:42 (0.003s)"
#> [2] "OK `pivot_longer` at 10:57:42 (0.015s)"
#> [3] "OK `separate` at 10:57:42 (3.077s)"
#> [4] "NOK `mutate` at 10:57:45 (0.230s)"
#> [5] "Total: 3.324 secs"
So what happened is that as.numeric()
introduced
NA
s by coercion. This is what happens when trying to
convert a character to a number, for example
as.numeric(":")
will result in an NA
. Because
mutate()
was recorded with the default value for its
strict
argument (which is 2
), warnings get
promoted to errors. This can be quite useful to avoid problems with
silent conversions. But in this case, we want to ignore the warning:
let’s record mutate()
with strict = 1
, so that
only errors can stop the pipeline:
r_mutate_lenient <- record(mutate, .g = dim, strict = 1)
avia_clean2 <- avia_clean %>%
bind_record(r_mutate_lenient,
passengers = as.numeric(passengers)
)
#> Warning: There was 1 warning in `.f()`.
#> ℹ In argument: `passengers = as.numeric(passengers)`.
#> Caused by warning:
#> ! NAs introduced by coercion
As you can see, the warnings get printed, they’re not captured. We
can now take a look at the data and see that ":"
characters
where successfully replaced by NA
s:
avia_clean2
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 464,616 × 6
#> freq unit tra_meas `air_pr\\time` date passengers
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2000 NA
#> 2 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2001 NA
#> 3 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2002 NA
#> 4 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2003 NA
#> 5 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2004 NA
#> 6 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005 NA
#> 7 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-01 143
#> 8 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-02 134
#> 9 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-03 154
#> 10 M FLIGHT CAF_PAS LU_ELLX_AT_LOWW 2005-04 142
#> # ℹ 464,606 more rows
#>
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with unveil(.c, "value").
#> To read the log of this object, call read_log(.c).
Let’s continue and focus on monthly data:
avia_monthly <- avia_clean2 %>%
bind_record(r_filter,
freq == "M",
tra_meas == "PAS_BRD_ARR",
!is.na(passengers)) %>%
bind_record(r_mutate,
date = paste0(date, "01"),
date = ymd(date)) %>%
bind_record(r_select,
destination = "air_pr\\time", date, passengers)
To make sure I only have monthly data, I can count the values of the
date
column using dplyr::count()
. But because
avia_monthly
is not a data frame, but a
chronicle
I need to record()
the
dplyr::count()
function. But because I only need it this
once, I could instead use fmap_record()
, which makes it
possible to apply an undecorated function to a chronicle
object:
fmap_record(avia_monthly, count, date)
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 226 × 2
#> date n
#> <date> <int>
#> 1 2005-01-01 23
#> 2 2005-02-01 23
#> 3 2005-03-01 23
#> 4 2005-04-01 24
#> 5 2005-05-01 24
#> 6 2005-06-01 24
#> 7 2005-07-01 24
#> 8 2005-08-01 24
#> 9 2005-09-01 24
#> 10 2005-10-01 24
#> # ℹ 216 more rows
#>
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with unveil(.c, "value").
#> To read the log of this object, call read_log(.c).
avia_monthly
is an object of class
chronicle
, but in essence, it is just a list, with its own
print method:
avia_monthly
#> OK! Value computed successfully:
#> ---------------
#> Just
#> # A tibble: 6,643 × 3
#> destination date passengers
#> <chr> <date> <dbl>
#> 1 LU_ELLX_AT_LOWW 2005-01-01 1758
#> 2 LU_ELLX_AT_LOWW 2005-02-01 1843
#> 3 LU_ELLX_AT_LOWW 2005-03-01 2129
#> 4 LU_ELLX_AT_LOWW 2005-04-01 2332
#> 5 LU_ELLX_AT_LOWW 2005-05-01 2402
#> 6 LU_ELLX_AT_LOWW 2005-06-01 2475
#> 7 LU_ELLX_AT_LOWW 2005-07-01 2082
#> 8 LU_ELLX_AT_LOWW 2005-08-01 2175
#> 9 LU_ELLX_AT_LOWW 2005-09-01 2288
#> 10 LU_ELLX_AT_LOWW 2005-10-01 2296
#> # ℹ 6,633 more rows
#>
#> ---------------
#> This is an object of type `chronicle`.
#> Retrieve the value of this object with unveil(.c, "value").
#> To read the log of this object, call read_log(.c).
Now that the data is clean, we can read the log:
read_log(avia_monthly)
#> [1] "OK `select` at 10:57:42 (0.003s)"
#> [2] "OK `pivot_longer` at 10:57:42 (0.015s)"
#> [3] "OK `separate` at 10:57:42 (3.077s)"
#> [4] "OK `mutate` at 10:57:46 (0.034s)"
#> [5] "OK `filter` at 10:57:46 (0.005s)"
#> [6] "OK `mutate` at 10:57:46 (0.005s)"
#> [7] "OK `select` at 10:57:46 (0.001s)"
#> [8] "Total: 3.140 secs"
This is especially useful if the object avia_monthly
gets saved using saveRDS()
. People can then read this
object, can read the log to know what happened and reproduce the steps
if necessary.
Let’s take a look at the final data set:
avia_monthly %>%
unveil("value")
#> # A tibble: 6,643 × 3
#> destination date passengers
#> <chr> <date> <dbl>
#> 1 LU_ELLX_AT_LOWW 2005-01-01 1758
#> 2 LU_ELLX_AT_LOWW 2005-02-01 1843
#> 3 LU_ELLX_AT_LOWW 2005-03-01 2129
#> 4 LU_ELLX_AT_LOWW 2005-04-01 2332
#> 5 LU_ELLX_AT_LOWW 2005-05-01 2402
#> 6 LU_ELLX_AT_LOWW 2005-06-01 2475
#> 7 LU_ELLX_AT_LOWW 2005-07-01 2082
#> 8 LU_ELLX_AT_LOWW 2005-08-01 2175
#> 9 LU_ELLX_AT_LOWW 2005-09-01 2288
#> 10 LU_ELLX_AT_LOWW 2005-10-01 2296
#> # ℹ 6,633 more rows
It is also possible to take a look at the underlying
.log_df
object that contains more details, and see the
output of the .g
argument (which was defined in the
beginning as the dim()
function):
check_g(avia_monthly)
#> ops_number function g
#> 1 1 select 1434, 325
#> 2 2 pivot_longer 464616, 3
#> 3 3 separate 464616, 6
#> 4 4 mutate 464616, 6
#> 5 5 filter 6643, 6
#> 6 6 mutate 6643, 6
#> 7 7 select 6643, 3
After select()
the data has hu[[1]][1]
rows
and hu[[1]][2]
columns, after the call to
pivot_longer()
, hu[[2]][1]
rows and
hu[[2]][2]
columns, separate()
adds three
columns, after filter()
only hu[[5]][1]
rows
remain (mutate()
does not change the dimensions) and then
select()
is used to remove three columns.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.