Getting started with dm

2020-05-04

The goal of the package {dm} and its dm class is to facilitate working with multiple related tables.

An object of the dm class contains the data in the tables, and metadata about the tables, such as the table names, the column names, and the primary and foreign keys that link the tables.

This package augments {dplyr}/{dbplyr} workflows.

In addition, a battery of utilities is provided that helps with creating a tidy data model.

This package follows several of the “tidyverse” rules.

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a great collaboration!

We will now demonstrate some of the features of {dm}:

  1. Creation of dm objects
  2. Setting keys and drawing
  3. Filtering
  4. Copying and discovery

Let’s first have a brief look at how to create a dm-class object.

library(tidyverse)
library(dm)

Creating dm objects

The {nycflights13} package offers a nice example of interconnected tables. The most straightforward way of squeezing those tables into a dm object is the dm() function:

library(nycflights13)

flights_dm <- dm(
  flights,
  airlines,
  airports,
  planes,
  weather
)
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `flights`, `airlines`, `airports`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

This fairly verbose output shows the data and metadata of a dm object. The various components can be accessed with functions of the type dm_get_...(), e.g.:

names(dm_get_tables(flights_dm))
#> [1] "flights"  "airlines" "airports" "planes"   "weather"
dm_get_all_pks(flights_dm)
#> # A tibble: 0 x 2
#> # … with 2 variables: table <chr>, pk_col <keys>
dm_get_all_fks(flights_dm)
#> # A tibble: 0 x 3
#> # … with 3 variables: child_table <chr>, child_fk_cols <keys>,
#> #   parent_table <chr>

The function dm_get_tables() returns a named list containing the individual tables. The dm object also behaves like a named list of tables:

names(flights_dm)
#> [1] "flights"  "airlines" "airports" "planes"   "weather"
flights_dm$airports
#> # A tibble: 1,458 x 8
#>    faa   name                    lat    lon   alt    tz dst   tzone        
#>    <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
#>  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows

Keys and visualization

As you can see in the output above, no keys have been set so far. We will use dm_add_pk() and dm_add_fk() to add primary keys (pk) and foreign keys (fk):

flights_dm_with_one_key <- 
  flights_dm %>% 
  dm_add_pk(airlines, carrier) %>% 
  dm_add_fk(flights, carrier, airlines)
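Before declaring a key, it can be worth checking that the column actually qualifies. The following is a minimal sketch in plain {dplyr}, not {dm}'s own machinery; the tables airlines_toy and flights_toy and the helper is_pk_candidate() are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables, not the nycflights13 data
airlines_toy <- tibble(
  carrier = c("AA", "UA", "US"),
  name    = c("American", "United", "US Airways")
)
flights_toy <- tibble(
  flight  = 1:4,
  carrier = c("AA", "US", "US", "UA")
)

# Hypothetical helper: a primary key column must be unique and free of NA
is_pk_candidate <- function(tbl, col) {
  values <- tbl[[col]]
  !anyNA(values) && anyDuplicated(values) == 0
}

pk_ok <- is_pk_candidate(airlines_toy, "carrier")

# A foreign key is consistent if no child row lacks a matching parent row
fk_violations <- anti_join(flights_toy, airlines_toy, by = "carrier")
fk_ok <- nrow(fk_violations) == 0
```

Here carrier qualifies as a primary key of airlines_toy but not of flights_toy, where it repeats, which is why it serves as the foreign key on that side.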

After you set the keys and establish relations, you can create a graphical representation of your data model with dm_draw():

flights_dm_with_one_key %>% 
  dm_draw()
[Figure: data model diagram showing the tables airlines, airports, flights, planes and weather, with one foreign key: flights:carrier -> airlines:carrier]

The dm_nycflights13() function provides a shortcut: the dm object it returns contains all tables (by default with a reduced version of the flights table), defines all primary and foreign keys, and even assigns colors to the different types of tables. We will be using the dm object created by this function from now on.

flights_dm_with_keys <- dm_nycflights13(cycle = TRUE)
flights_dm_with_keys %>% 
  dm_draw()
[Figure: data model diagram showing the tables airlines, airports, flights, planes and weather, with foreign keys flights:carrier -> airlines:carrier, flights:origin -> airports:faa, flights:dest -> airports:faa and flights:tailnum -> planes:tailnum]

Filtering a table of a dm object

The idea of a filter on a dm object:

  1. You can filter one or more of dm’s tables, just like with normal dplyr::filter() calls
  2. Filtering conditions are immediately executed for the table in question and additionally stored in the dm object
  3. If you access a table via dm_apply_filters_to_tbl(), a sequence of semi_join() calls retrieves the requested table, keeping only the rows whose key values still match rows remaining in the filtered tables

The function dm_apply_filters() essentially calls dm_apply_filters_to_tbl() for each table of the dm and creates a new dm object from the result.
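The cascade of semi_join() calls described above can be sketched by hand in plain {dplyr}. This is an illustration of the idea on made-up toy tables (all *_toy names are hypothetical), not the code {dm} runs internally.

```r
library(dplyr)

# Hypothetical toy tables with the same key layout as the nycflights13 model
airports_toy <- tibble(
  faa  = c("JFK", "LGA"),
  name = c("John F Kennedy Intl", "La Guardia")
)
airlines_toy <- tibble(
  carrier = c("US", "AA"),
  name    = c("US Airways Inc.", "American Airlines Inc.")
)
planes_toy <- tibble(
  tailnum = c("N1", "N2", "N3"),
  model   = c("A319", "A320", "B737")
)
flights_toy <- tibble(
  carrier = c("US", "US", "AA", "US"),
  origin  = c("JFK", "LGA", "JFK", "JFK"),
  tailnum = c("N1", "N2", "N3", "N1")
)

# Step 1: filter the two tables that carry a condition
airports_f <- filter(airports_toy, name == "John F Kennedy Intl")
airlines_f <- filter(airlines_toy, name == "US Airways Inc.")

# Step 2: keep only flights whose key values survive in both filtered tables
flights_f <- flights_toy %>%
  semi_join(airports_f, by = c("origin" = "faa")) %>%
  semi_join(airlines_f, by = "carrier")

# Step 3: keep only planes referenced by the remaining flights
planes_f <- semi_join(planes_toy, flights_f, by = "tailnum")
```

Each step needs the right by argument spelled out; with a dm object that information comes from the stored key relations instead.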

Currently, filtering only works if the graph induced by the fk relations is cycle-free, which is the default for dm_nycflights13():

flights_dm_acyclic <- dm_nycflights13()
flights_dm_acyclic %>% 
  dm_draw()
[Figure: data model diagram showing the tables airlines, airports, flights, planes and weather, with foreign keys flights:carrier -> airlines:carrier, flights:origin -> airports:faa and flights:tailnum -> planes:tailnum]
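One way to see why the variant with cycle = TRUE does not qualify: treat each foreign key as an undirected edge between two tables and look for an edge that joins tables which are already connected. The sketch below does this with a small union-find in base R. The function has_cycle() and the edge vectors are hypothetical and not part of {dm}, and the undirected reading of the graph is an assumption.

```r
# Hypothetical helper: TRUE if the undirected graph given by the
# foreign key edges (child[i] -- parent[i]) contains a cycle
has_cycle <- function(child, parent) {
  nodes <- unique(c(child, parent))
  root <- setNames(nodes, nodes)  # every table starts as its own component

  # Follow the links until the representative of the component is reached
  find <- function(x) {
    while (root[[x]] != x) x <- root[[x]]
    x
  }

  for (i in seq_along(child)) {
    a <- find(child[i])
    b <- find(parent[i])
    if (a == b) return(TRUE)  # both ends already connected: this edge closes a loop
    root[[a]] <- b            # otherwise merge the two components
  }
  FALSE
}

# Relations of the default (acyclic) model
acyclic <- has_cycle(
  child  = c("flights", "flights", "flights"),
  parent = c("airlines", "airports", "planes")
)

# Adding flights:dest -> airports:faa links flights and airports a second time
cyclic <- has_cycle(
  child  = c("flights", "flights", "flights", "flights"),
  parent = c("airlines", "airports", "planes", "airports")
)
```

Under this reading, two foreign keys between the same pair of tables already count as a cycle, which matches the origin and dest relations both pointing at airports.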

Let’s set two filters:

us_flights_from_jfk_prepared <- 
  flights_dm_acyclic %>%
  dm_filter(airports, name == "John F Kennedy Intl") %>% 
  dm_filter(airlines, name == "US Airways Inc.")
us_flights_from_jfk_prepared
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> airlines: name == "US Airways Inc."
#> airports: name == "John F Kennedy Intl"

With dm_apply_filters(), you can apply all filters and therefore update all tables in the dm, so that they contain only the rows that are relevant to the filters. The effect of the filters can be seen when counting the rows:

us_flights_from_jfk <- dm_apply_filters(us_flights_from_jfk_prepared)
us_flights_from_jfk %>% 
  dm_get_tables() %>% 
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>        1        1       95       67      861

Alternatively, you can pull just one of the tables out of the dm to answer the question of which planes were used to service the US Airways flights that departed from JFK airport:

dm_apply_filters_to_tbl(us_flights_from_jfk, "planes")
#> # A tibble: 67 x 9
#>    tailnum  year type       manufacturer  model  engines seats speed engine
#>    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
#>  1 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  2 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  3 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  4 N111US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  5 N112US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  6 N113UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
#>  7 N126UW   2009 Fixed win… AIRBUS        A320-…       2   182    NA Turbo…
#>  8 N152UW   2013 Fixed win… AIRBUS        A321-…       2   199    NA Turbo…
#>  9 N154UW   2013 Fixed win… AIRBUS        A321-…       2   199    NA Turbo…
#> 10 N167US   2001 Fixed win… AIRBUS INDUS… A321-…       2   199    NA Turbo…
#> # … with 57 more rows

Each of the planes in the result set above was used on at least one US Airways flight departing from JFK. Do they have any common characteristics?

dm_apply_filters_to_tbl(us_flights_from_jfk, "planes") %>% 
  count(model)
#> # A tibble: 6 x 2
#>   model               n
#>   <chr>           <int>
#> 1 A319-112           16
#> 2 A320-214            7
#> 3 A320-232           11
#> 4 A321-211            8
#> 5 A321-231           24
#> 6 ERJ 190-100 IGW     1

For comparison, let’s look at the equivalent manual query in {dplyr} syntax:

flights %>% 
  left_join(airports, by = c("origin" = "faa")) %>% 
  filter(name == "John F Kennedy Intl") %>%
  left_join(airlines, by = "carrier") %>% 
  filter(name.y == "US Airways Inc.") %>%
  semi_join(planes, ., by = "tailnum") %>% 
  count(model)

The {dm} code is leaner because the foreign key relations are encoded in the object.

Note that if you access a table via tbl.dm(), $.dm() or [[.dm(), filter conditions set for other tables are ignored.

Joining two tables

The dm_join_to_tbl() function joins two immediately related tables in a data model. The definition of the primary and foreign key constraints is used to define the relationship.

flights_dm_with_keys %>%
  dm_join_to_tbl(airlines, flights, join = left_join)
#> # A tibble: 11,227 x 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1    10        3           2359         4      426
#>  2  2013     1    10       16           2359        17      447
#>  3  2013     1    10      450            500       -10      634
#>  4  2013     1    10      520            525        -5      813
#>  5  2013     1    10      530            530         0      824
#>  6  2013     1    10      531            540        -9      832
#>  7  2013     1    10      535            540        -5     1015
#>  8  2013     1    10      546            600       -14      645
#>  9  2013     1    10      549            600       -11      652
#> 10  2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 13 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>

The same operation using {dplyr} syntax looks like this:

library(nycflights13)
airlines %>% 
  left_join(flights, by = "carrier")

Omitting the by argument makes {dplyr} join on all columns the two tables have in common and emit a message.
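To illustrate how stored key relations can stand in for the by argument, here is a minimal sketch in plain {dplyr} and base R. The metadata table fks, the helper join_parent_to_child() and the toy tables are all hypothetical; this is not how dm_join_to_tbl() is implemented.

```r
library(dplyr)

# Hypothetical toy tables
airlines_toy <- tibble(
  carrier = c("AA", "DL", "US"),
  name    = c("American", "Delta", "US Airways")
)
flights_toy <- tibble(
  flight  = c(1L, 2L, 3L),
  carrier = c("AA", "US", "US")
)
tables <- list(airlines = airlines_toy, flights = flights_toy)

# Hypothetical key metadata: one row per foreign key relation
fks <- tibble(
  child_table  = "flights",
  child_col    = "carrier",
  parent_table = "airlines",
  parent_col   = "carrier"
)

# Hypothetical helper: look up the relation, then build `by` from it
join_parent_to_child <- function(tables, fks, parent, child, join = left_join) {
  fk <- fks[fks$parent_table == parent & fks$child_table == child, ]
  stopifnot(nrow(fk) == 1)  # exactly one relation between the two tables expected
  by <- setNames(fk$child_col, fk$parent_col)  # names: parent column, values: child column
  join(tables[[parent]], tables[[child]], by = by)
}

joined <- join_parent_to_child(tables, fks, "airlines", "flights")
```

The caller names only the two tables and the join type; the column pairing is derived from the metadata.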

Copy a dm object to a DB or learn from it

dm objects can be transferred from one src to another. The relevant verb is copy_dm_to(), which will copy both data and key constraints.

src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src:  sqlite 3.31.1 [:memory:]
#> tbls:
flights_dm_with_keys_remote <- copy_dm_to(src_sqlite, flights_dm_with_keys)

As a result, the tables are transferred to the target data source, and all keys will be contained in the returned data model.

src_sqlite
#> src:  sqlite 3.31.1 [:memory:]
#> tbls: airlines, airports, flights, planes, sqlite_stat1, sqlite_stat4,
#>   weather
flights_dm_with_keys_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.31.1 [:memory:]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 4
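For a single table, the same kind of transfer can be sketched with dplyr::copy_to(), which uploads the data but carries no key constraints. The example assumes that {DBI}, {RSQLite} and {dbplyr} are installed and uses a DBI connection rather than src_sqlite(); the table airlines_toy is made up.

```r
library(dplyr)

# Assumption: {DBI}, {RSQLite} and {dbplyr} are installed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table
airlines_toy <- tibble(
  carrier = c("AA", "US"),
  name    = c("American", "US Airways")
)

# copy_to() uploads one data frame and returns a lazy reference to the remote table
airlines_remote <- copy_to(con, airlines_toy, name = "airlines")

# Verbs on the reference are translated to SQL; collect() fetches the result
us_only <- airlines_remote %>%
  filter(carrier == "US") %>%
  collect()

DBI::dbDisconnect(con)
```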

In the opposite direction, dm objects can also be “learned” from a DB, including the key constraints, by utilizing the DB’s meta-information tables. Unfortunately, this currently only works for MSSQL and Postgres, so we cannot show the results here just yet:

flights_dm_from_remote <- dm_learn_from_db(src_sqlite)

Further reading