Let’s convert the politicians dataset into a tibble. It is a list with one element per politician:
library(tibblify)
str(politicians[1])
#> List of 1
#> $ :List of 8
#> ..$ id : int 1
#> ..$ name : chr "Barack"
#> ..$ surname : chr "Obama"
#> ..$ dob : chr "1961-08-04"
#> ..$ n_children: num 2
#> ..$ parents :List of 2
#> .. ..$ mother: chr "Ann Dunham"
#> .. ..$ father: chr "Barack Obama Sr."
#> ..$ spouses :List of 1
#> .. ..$ : chr "Michelle Robinson"
#> ..$ offices :List of 2
#> .. ..$ :List of 2
#> .. .. ..$ name : chr "President of the United States"
#> .. .. ..$ start: chr "2009-01-20"
#> .. ..$ :List of 2
#> .. .. ..$ name : chr "United States Senator from Illinois"
#> .. .. ..$ start: chr "2005-01-03"
We can let tibblify() automatically recognize the structure of the list and find an appropriate presentation as a tibble:
politicians_tibble <- tibblify(politicians)
politicians_tibble
#> # A tibble: 2 x 8
#> id name surname dob n_children parents$mother $father spouses offices
#> <int> <chr> <chr> <chr> <dbl> <chr> <chr> <list<> <list<t>
#> 1 1 Barack Obama 1961-… 2 Ann Dunham Barack… [1] [2 × 2]
#> 2 2 Boris Johnson 1964-… NA <NA> Stanle… [2] [3 × 2]
The parents column is a tibble with the columns mother and father because, in each element of the original list politicians, the field parents is a named list.
politicians_tibble$parents
#> # A tibble: 2 x 2
#> mother father
#> <chr> <chr>
#> 1 Ann Dunham Barack Obama Sr.
#> 2 <NA> Stanley Johnson
The spouses column is a list_of character because the spouses field is a list whose elements are all character vectors:
politicians_tibble$spouses
#> <list_of<character>[2]>
#> [[1]]
#> [1] "Michelle Robinson"
#>
#> [[2]]
#> [1] "Allegra Mostyn-Owen" "Marina Wheeler"
In the above example we used tibblify() without any further specification of how to convert the list into a tibble. This is quite useful in an interactive session, but often you want to provide a specification yourself.
First, we use get_spec() to view the specification used to convert our list to a tibble:
get_spec(politicians_tibble)
#> lcols(
#> id = lcol_int("id"),
#> name = lcol_chr("name"),
#> surname = lcol_chr("surname"),
#> dob = lcol_chr("dob"),
#> n_children = lcol_dbl("n_children", .default = NA),
#> parents = lcol_df(
#> "parents",
#> mother = lcol_chr("mother", .default = NA),
#> father = lcol_chr("father")
#> ),
#> spouses = lcol_lst_of(
#> "spouses",
#> .ptype = character(0),
#> .parser = ~vec_c(!!!.x, .ptype = character()),
#> .default = NULL
#> ),
#> offices = lcol_df_lst(
#> "offices",
#> name = lcol_chr("name"),
#> start = lcol_chr("start")
#> )
#> )
A specification always starts with a call to lcols() (similar to readr::cols()). Then you specify the columns you want with name-value pairs. The name is the name of the resulting column and the value is a specification created with one of the lcol_*() functions.
The first argument to lcol_*() is always a path which describes where to find the element. The syntax is the same as the one purrr::map() uses to extract fields. Some examples:
leader <- politicians[[1]]
# get the element `id`
path <- c("id")
leader[["id"]]
#> [1] 1
# get the element `mother` in the element `parents`
path <- c("parents", "mother")
leader[["parents"]][["mother"]]
#> [1] "Ann Dunham"
# get the first element in the element `spouses`
path <- list("spouses", 1)
leader[["spouses"]][[1]]
#> [1] "Michelle Robinson"
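To make the idea of a path concrete, here is a minimal base R sketch of how a path can be followed one step at a time. The helper follow_path() is hypothetical and only illustrates the semantics; it is not how tibblify or purrr implement extraction, and the small leader list is a hand-built stand-in for politicians[[1]].

```r
# Hand-built stand-in for politicians[[1]] (only the fields used below).
leader <- list(
  id = 1L,
  parents = list(mother = "Ann Dunham", father = "Barack Obama Sr."),
  spouses = list("Michelle Robinson")
)

# Hypothetical helper: apply `[[` once per path step, starting at `x`.
follow_path <- function(x, path) {
  Reduce(function(node, key) node[[key]], path, x)
}

follow_path(leader, c("id"))
#> [1] 1
follow_path(leader, c("parents", "father"))
#> [1] "Barack Obama Sr."
follow_path(leader, list("spouses", 1))
#> [1] "Michelle Robinson"
```

A character vector is enough when every step is a name; a list is needed as soon as names and positions are mixed, as in the last call.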
A couple of typical vector types have a predefined extractor:
lcol_chr(): create a character column.
lcol_lgl(): create a logical column.
lcol_int(): create an integer column.
lcol_dbl(): create a double column.
lcol_dat(): create a date column.
lcol_dtt(): create a datetime column.
See parsing other types to create a column of your own prototype.
tibblify(
  politicians,
  lcols(
    lcol_int("id"),
    lcol_chr("name"),
    `family name` = lcol_chr("surname")
  )
)
#> # A tibble: 2 x 3
#> id name `family name`
#> <int> <chr> <chr>
#> 1 1 Barack Obama
#> 2 2 Boris Johnson
If an element doesn’t exist, an error is thrown as in purrr::chuck(). To use a default value instead of throwing an error, use the .default argument. The .default value is also used when the element at the path is empty:
list_default <- list(
  list(a = 1),
  list(a = NULL),
  list(a = integer()),
  list()
)
tibblify(
  list_default,
  lcols(lcol_int("a"))
)
#> Error: empty or absent element at path a
tibblify(
  list_default,
  lcols(lcol_int("a", .default = 0))
)
#> # A tibble: 4 x 1
#> a
#> <int>
#> 1 1
#> 2 0
#> 3 0
#> 4 0
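The fallback rule described above (absent, NULL, and zero-length elements all receive the default) can be sketched in base R. The helper extract_or_default() is hypothetical and only mirrors the behaviour shown in the output; it makes no claim about tibblify's internals.

```r
# Hypothetical helper: return the field, or `default` when the field is
# absent, NULL, or has length zero.
extract_or_default <- function(x, name, default) {
  value <- x[[name]]
  if (is.null(value) || length(value) == 0) default else value
}

list_default <- list(
  list(a = 1),
  list(a = NULL),
  list(a = integer()),
  list()
)

# Extract `a` from every element, then cast to integer as lcol_int() would.
a <- vapply(list_default, extract_or_default, numeric(1), name = "a", default = 0)
as.integer(a)
#> [1] 1 0 0 0
```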
When the cast is not possible with vctrs::vec_cast(), you can use the .parser argument to supply a custom parser. It is passed to rlang::as_function(), so you can use a function or a formula. A typical use case is dates stored as strings.
tibblify(
  politicians,
  lcols(
    lcol_chr("surname"),
    lcol_dat("dob", .parser = ~ as.Date(.x, format = "%Y-%m-%d"))
  )
)
#> # A tibble: 2 x 2
#> surname dob
#> <chr> <date>
#> 1 Obama 1961-08-04
#> 2 Johnson 1964-06-19
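A parser is simply a one-argument function applied to the extracted values. The base R sketch below writes the formula from the example as an ordinary function; parse_dob is a hypothetical name and the dob vector is typed in by hand from the data shown above.

```r
# Dates as they appear in the `dob` field of the politicians list.
dob <- c("1961-08-04", "1964-06-19")

# Ordinary-function form of the formula `~ as.Date(.x, format = "%Y-%m-%d")`.
parse_dob <- function(x) as.Date(x, format = "%Y-%m-%d")

parsed <- parse_dob(dob)
class(parsed)
#> [1] "Date"
format(parsed)
#> [1] "1961-08-04" "1964-06-19"
```

Passing a named function instead of a formula can be easier to test and reuse when the same parsing rule applies to several fields.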
A list_of is a list where each element has the same prototype. It is useful when you have fields with more than one element, as in the spouses field.
spouses_tbl <- tibblify(
  politicians,
  lcols(
    lcol_chr("surname"),
    lcol_lst_of("spouses", .ptype = character())
  )
)
spouses_tbl$spouses
#> <list_of<character>[2]>
#> [[1]]
#> [[1]][[1]]
#> [1] "Michelle Robinson"
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] "Allegra Mostyn-Owen"
#>
#> [[2]][[2]]
#> [1] "Marina Wheeler"
You can use tidyr::unnest() or tidyr::unnest_longer() to flatten these columns to regular columns.
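To illustrate what flattening a list column means, here is a base R sketch that turns one row per politician into one row per spouse. It only approximates the long shape that the tidyr functions produce and uses a hand-built data frame rather than the tibblify result.

```r
# Hand-built stand-in for spouses_tbl: one row per politician,
# with `spouses` stored as a list column.
spouses_df <- data.frame(surname = c("Obama", "Johnson"), stringsAsFactors = FALSE)
spouses_df$spouses <- list(
  "Michelle Robinson",
  c("Allegra Mostyn-Owen", "Marina Wheeler")
)

# Repeat each surname once per spouse and flatten the list column.
n_spouses <- lengths(spouses_df$spouses)
spouses_long <- data.frame(
  surname = rep(spouses_df$surname, times = n_spouses),
  spouse = unlist(spouses_df$spouses),
  stringsAsFactors = FALSE
)
spouses_long
#>   surname              spouse
#> 1   Obama   Michelle Robinson
#> 2 Johnson Allegra Mostyn-Owen
#> 3 Johnson      Marina Wheeler
```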
A list column is used when you have a field with mixed elements.
Analogous to readr::col_guess() and readr::col_skip(), you can guess the column type with lcol_guess() or skip a field with lcol_skip(). Skipping a column can be useful when you set a default column type or when you want to make clear that you know about the field and intentionally skip it.
Guessing a column is useful in interactive sessions, but you shouldn’t rely on it in automated scripts.
If a field is a named list where each element has length 1 or 0, the field is converted to a tibble column. This is, for example, the case for the parents field:
leaders_tibble <- tibblify(
  politicians,
  lcols(
    lcol_chr("surname"),
    lcol_guess("parents")
  )
)
leaders_tibble
#> # A tibble: 2 x 2
#> surname parents$mother $father
#> <chr> <chr> <chr>
#> 1 Obama Ann Dunham Barack Obama Sr.
#> 2 Johnson <NA> Stanley Johnson
Tibble columns are a relatively new concept in the tidyverse. You can unpack a tibble column into regular columns with tidyr::unpack().
tibblify provides shortcuts for a couple of common types. To parse a vector or record type without a parser, use lcol_vec(). Let’s say you have a list with difftimes:
now <- Sys.time()
past <- now - c(100, 200)
x <- list(
  list(timediff = now - past[1]),
  list(timediff = now - past[2])
)
x
#> [[1]]
#> [[1]]$timediff
#> Time difference of 1.666667 mins
#>
#>
#> [[2]]
#> [[2]]$timediff
#> Time difference of 3.333333 mins
You need to define a prototype
ptype <- as.difftime(0, units = "secs")
ptype
#> Time difference of 0 secs
and then use it in lcol_vec():
tibblify(
  x,
  lcols(
    lcol_vec("timediff", ptype = ptype)
  )
)
#> # A tibble: 2 x 1
#> timediff
#> <drtn>
#> 1 100 secs
#> 2 200 secs
You can use the .default argument of lcols() to define a parser used for all unspecified fields.
tibblify(
  politicians,
  lcols(
    lcol_chr("name"),
    lcol_chr("surname"),
    .default = lcol_lst(path = zap(), .default = NULL)
  )
)
#> # A tibble: 2 x 8
#> name surname id dob n_children parents spouses offices
#> <chr> <chr> <list> <list> <list> <list> <list> <list>
#> 1 Barack Obama <int [1… <chr [1… <dbl [1]> <named list [… <list [1… <list [2…
#> 2 Boris Johnson <int [1… <chr [1… <NULL> <named list [… <list [2… <list [3…