The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. There are three functions from tidyr that are particularly useful for rectangling:
unnest_longer() takes each element of a list-column and
makes a new row.unnest_wider() takes each element of a list-column and
makes a new column.hoist() is similar to unnest_wider() but
only plucks out selected components, and can reach down multiple
levels.(Alternative, for complex inputs where you need to rectangle a nested list according to a specification, see the tibblify package.)
A very large number of data rectangling problems can be solved by
combining jsonlite::read_json() with these functions and a
splash of dplyr (largely eliminating prior approaches that combined
mutate() with multiple purrr::map()s). Note
that jsonlite has another important function called
fromJSON(). We don’t recommend it here because it performs
its own automatic simplification (simplifyVector = TRUE).
This often works well, particularly in simple cases, but we think you’re
better off doing the rectangling yourself so you know exactly what’s
happening and can more easily handle the most complicated nested
structures.
To illustrate these techniques, we’ll use the repurrrsive package, which provides a number deeply nested lists originally mostly captured from web APIs.
We’ll start with gh_users, a list which contains
information about six GitHub users. To begin, we put the
gh_users list into a data frame:
This seems a bit counter-intuitive: why is the first step in making a list simpler to make it more complicated? But a data frame has a big advantage: it bundles together multiple vectors so that everything is tracked together in a single object.
Each user is a named list, where each element represents
a column.
names(users$user[[1]])
#>  [1] "login"               "id"                  "avatar_url"         
#>  [4] "gravatar_id"         "url"                 "html_url"           
#>  [7] "followers_url"       "following_url"       "gists_url"          
#> [10] "starred_url"         "subscriptions_url"   "organizations_url"  
#> [13] "repos_url"           "events_url"          "received_events_url"
#> [16] "type"                "site_admin"          "name"               
#> [19] "company"             "blog"                "location"           
#> [22] "email"               "hireable"            "bio"                
#> [25] "public_repos"        "public_gists"        "followers"          
#> [28] "following"           "created_at"          "updated_at"There are two ways to turn the list components into columns.
unnest_wider() takes every component and makes a new
column:
users %>% unnest_wider(user)
#> # A tibble: 6 × 30
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
#> # ℹ 22 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> #   hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
#> #   followers <int>, following <int>, created_at <chr>, updated_at <chr>But in this case, there are many components and we don’t need most of
them so we can instead use hoist(). hoist()
allows us to pull out selected components using the same syntax as
purrr::pluck():
users %>% hoist(user, 
  followers = "followers", 
  login = "login", 
  url = "html_url"
)
#> # A tibble: 6 × 4
#>   followers login       url                            user             
#>       <int> <chr>       <chr>                          <list>           
#> 1       303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2       780 jennybc     https://github.com/jennybc     <named list [27]>
#> 3      3958 jtleek      https://github.com/jtleek      <named list [27]>
#> 4       115 juliasilge  https://github.com/juliasilge  <named list [27]>
#> 5       213 leeper      https://github.com/leeper      <named list [27]>
#> 6        34 masalmon    https://github.com/masalmon    <named list [27]>hoist() removes the named components from the
user list-column, so you can think of it as moving
components out of the inner list into the top-level data frame.
We start off gh_repos similarly, by putting it in a
tibble:
repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 × 1
#>   repo       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>This time the elements of repos are a list of
repositories that belong to that user. These are observations, so should
become new rows, so we use unnest_longer() rather than
unnest_wider():
repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 × 1
#>   repo             
#>   <list>           
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # ℹ 170 more rowsThen we can use unnest_wider() or
hoist():
repos %>% hoist(repo, 
  login = c("owner", "login"), 
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 × 5
#>   login       name        homepage watchers repo             
#>   <chr>       <chr>       <chr>       <int> <list>           
#> 1 gaborcsardi after       <NA>            5 <named list [65]>
#> 2 gaborcsardi argufy      <NA>           19 <named list [65]>
#> 3 gaborcsardi ask         <NA>            5 <named list [65]>
#> 4 gaborcsardi baseimports <NA>            0 <named list [65]>
#> 5 gaborcsardi citest      <NA>            0 <named list [65]>
#> 6 gaborcsardi clisymbols  ""             18 <named list [65]>
#> # ℹ 170 more rowsNote the use of c("owner", "login"): this allows us to
reach two levels deep inside of a list. An alternative approach would be
to pull out just owner and then put each element of it in a
column:
repos %>% 
  hoist(repo, owner = "owner") %>% 
  unnest_wider(owner)
#> # A tibble: 176 × 18
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> # ℹ 170 more rows
#> # ℹ 10 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   repo <list>got_chars has a similar structure to
gh_users: it’s a list of named lists, where each element of
the inner list describes some attribute of a GoT character. We start in
the same way, first by creating a data frame and then by unnesting each
component into a column:
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 × 1
#>   char             
#>   <list>           
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # ℹ 24 more rows
chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 × 18
#>   url            id name  gender culture born  died  alive titles aliases father
#>   <chr>       <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
#> 1 https://ww…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
#> 2 https://ww…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
#> 3 https://ww…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
#> 4 https://ww…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
#> 5 https://ww…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
#> 6 https://ww…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
#> # ℹ 24 more rows
#> # ℹ 7 more variables: mother <chr>, spouse <chr>, allegiances <list>,
#> #   books <list>, povBooks <list>, tvSeries <list>, playedBy <list>This is more complex than gh_users because some
component of char are themselves a list, giving us a
collection of list-columns:
chars2 %>% select_if(is.list)
#> # A tibble: 30 × 7
#>   titles    aliases    allegiances books     povBooks  tvSeries  playedBy 
#>   <list>    <list>     <list>      <list>    <list>    <list>    <list>   
#> 1 <chr [2]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#> # ℹ 24 more rowsWhat you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:
chars2 %>% 
  select(name, books, tvSeries) %>% 
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>% 
  unnest_longer(value)
#> # A tibble: 179 × 3
#>   name          media    value            
#>   <chr>         <chr>    <chr>            
#> 1 Theon Greyjoy books    A Game of Thrones
#> 2 Theon Greyjoy books    A Storm of Swords
#> 3 Theon Greyjoy books    A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1         
#> 5 Theon Greyjoy tvSeries Season 2         
#> 6 Theon Greyjoy tvSeries Season 3         
#> # ℹ 173 more rowsOr maybe you want to build a table that lets you match title to name:
chars2 %>% 
  select(name, title = titles) %>% 
  unnest_longer(title)
#> # A tibble: 59 × 2
#>   name              title                                               
#>   <chr>             <chr>                                               
#> 1 Theon Greyjoy     Prince of Winterfell                                
#> 2 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
#> 3 Tyrion Lannister  Acting Hand of the King (former)                    
#> 4 Tyrion Lannister  Master of Coin (former)                             
#> 5 Victarion Greyjoy Lord Captain of the Iron Fleet                      
#> 6 Victarion Greyjoy Master of the Iron Victory                          
#> # ℹ 53 more rows(Note that the empty titles ("") are due to an
infelicity in the input got_chars: ideally people without
titles would have a title vector of length 0, not a title vector of
length 1 containing an empty string.)
Next we’ll tackle a more complex form of data that comes from Google’s geocoding service, stored in the repurssive package
repurrrsive::gmaps_cities
#> # A tibble: 5 × 2
#>   city       json            
#>   <chr>      <list>          
#> 1 Houston    <named list [2]>
#> 2 Washington <named list [2]>
#> 3 New York   <named list [2]>
#> 4 Chicago    <named list [2]>
#> 5 Arlington  <named list [2]>json is a list-column of named lists, so it makes sense
to start with unnest_wider():
repurrrsive::gmaps_cities %>%
  unnest_wider(json)
#> # A tibble: 5 × 3
#>   city       results    status
#>   <chr>      <list>     <chr> 
#> 1 Houston    <list [1]> OK    
#> 2 Washington <list [2]> OK    
#> 3 New York   <list [1]> OK    
#> 4 Chicago    <list [1]> OK    
#> 5 Arlington  <list [2]> OKNotice that results is a list of lists. Most of the
cities have 1 element (representing a unique match from the geocoding
API), but Washington and Arlington have two. We can pull these out into
separate rows with unnest_longer():
repurrrsive::gmaps_cities %>%
  unnest_wider(json) %>% 
  unnest_longer(results)
#> # A tibble: 7 × 3
#>   city       results          status
#>   <chr>      <list>           <chr> 
#> 1 Houston    <named list [5]> OK    
#> 2 Washington <named list [5]> OK    
#> 3 Washington <named list [5]> OK    
#> 4 New York   <named list [5]> OK    
#> 5 Chicago    <named list [5]> OK    
#> 6 Arlington  <named list [5]> OK    
#> # ℹ 1 more rowNow these all have the same components, as revealed by
unnest_wider():
repurrrsive::gmaps_cities %>%
  unnest_wider(json) %>% 
  unnest_longer(results) %>% 
  unnest_wider(results)
#> # A tibble: 7 × 7
#>   city  address_components formatted_address geometry     place_id types  status
#>   <chr> <list>             <chr>             <list>       <chr>    <list> <chr> 
#> 1 Hous… <list [4]>         Houston, TX, USA  <named list> ChIJAYW… <list> OK    
#> 2 Wash… <list [2]>         Washington, USA   <named list> ChIJ-bD… <list> OK    
#> 3 Wash… <list [4]>         Washington, DC, … <named list> ChIJW-T… <list> OK    
#> 4 New … <list [3]>         New York, NY, USA <named list> ChIJOwg… <list> OK    
#> 5 Chic… <list [4]>         Chicago, IL, USA  <named list> ChIJ7cv… <list> OK    
#> 6 Arli… <list [4]>         Arlington, TX, U… <named list> ChIJ05g… <list> OK    
#> # ℹ 1 more rowWe can find the latitude and longitude by unnesting
geometry:
repurrrsive::gmaps_cities %>%
  unnest_wider(json) %>% 
  unnest_longer(results) %>% 
  unnest_wider(results) %>% 
  unnest_wider(geometry)
#> # A tibble: 7 × 10
#>   city       address_components formatted_address   bounds       location    
#>   <chr>      <list>             <chr>               <list>       <list>      
#> 1 Houston    <list [4]>         Houston, TX, USA    <named list> <named list>
#> 2 Washington <list [2]>         Washington, USA     <named list> <named list>
#> 3 Washington <list [4]>         Washington, DC, USA <named list> <named list>
#> 4 New York   <list [3]>         New York, NY, USA   <named list> <named list>
#> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list> <named list>
#> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list> <named list>
#> # ℹ 1 more row
#> # ℹ 5 more variables: location_type <chr>, viewport <list>, place_id <chr>,
#> #   types <list>, status <chr>And then location:
repurrrsive::gmaps_cities %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 7 × 11
#>   city       address_components formatted_address   bounds         lat    lng
#>   <chr>      <list>             <chr>               <list>       <dbl>  <dbl>
#> 1 Houston    <list [4]>         Houston, TX, USA    <named list>  29.8  -95.4
#> 2 Washington <list [2]>         Washington, USA     <named list>  47.8 -121. 
#> 3 Washington <list [4]>         Washington, DC, USA <named list>  38.9  -77.0
#> 4 New York   <list [3]>         New York, NY, USA   <named list>  40.7  -74.0
#> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list>  41.9  -87.6
#> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list>  32.7  -97.1
#> # ℹ 1 more row
#> # ℹ 5 more variables: location_type <chr>, viewport <list>, place_id <chr>,
#> #   types <list>, status <chr>We could also just look at the first address for each city:
repurrrsive::gmaps_cities %>%
  unnest_wider(json) %>%
  hoist(results, first_result = 1) %>%
  unnest_wider(first_result) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 5 × 12
#>   city       address_components formatted_address  bounds             lat    lng
#>   <chr>      <list>             <chr>              <list>           <dbl>  <dbl>
#> 1 Houston    <list [4]>         Houston, TX, USA   <named list [2]>  29.8  -95.4
#> 2 Washington <list [2]>         Washington, USA    <named list [2]>  47.8 -121. 
#> 3 New York   <list [3]>         New York, NY, USA  <named list [2]>  40.7  -74.0
#> 4 Chicago    <list [4]>         Chicago, IL, USA   <named list [2]>  41.9  -87.6
#> 5 Arlington  <list [4]>         Arlington, TX, USA <named list [2]>  32.7  -97.1
#> # ℹ 6 more variables: location_type <chr>, viewport <list>, place_id <chr>,
#> #   types <list>, results <list>, status <chr>Or use hoist() to dive deeply to get directly to
lat and lng:
repurrrsive::gmaps_cities %>%
  hoist(json,
    lat = list("results", 1, "geometry", "location", "lat"),
    lng = list("results", 1, "geometry", "location", "lng")
  )
#> # A tibble: 5 × 4
#>   city         lat    lng json            
#>   <chr>      <dbl>  <dbl> <list>          
#> 1 Houston     29.8  -95.4 <named list [2]>
#> 2 Washington  47.8 -121.  <named list [2]>
#> 3 New York    40.7  -74.0 <named list [2]>
#> 4 Chicago     41.9  -87.6 <named list [2]>
#> 5 Arlington   32.7  -97.1 <named list [2]>I’d normally use readr::parse_datetime() or
lubridate::ymd_hms(), but I can’t here because it’s a
vignette and I don’t want to add a dependency to tidyr just to simplify
one example.↩︎
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.