Create a dm object from data frames

2020-05-04

This document describes how you can get your data into a dm object.

The example used here ships with dm, and you can see it by running dm::dm_nycflights13(). Here we rebuild that built-in example step by step.

The five tables that we are working with (airlines, airports, flights, planes, and weather) contain information about all flights that departed from the airports of New York to other destinations in the United States in 2013. They are available through the nycflights13 package.

Once we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.

library(nycflights13)

airports
#> # A tibble: 1,458 x 8
#>    faa   name                    lat    lon   alt    tz dst   tzone        
#>    <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
#>  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows

Your own data probably will not be available in an R package. It is sufficient if you can load it as data frames into your R session. If the data is too large, consider connecting to the database instead. See vignette("dm-db") for details.
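As a minimal sketch of loading your own data as a data frame, the snippet below writes a tiny CSV file to a temporary location and reads it back with base R. The file contents and the name my_airlines are made up for illustration; in practice you would point read.csv() (or a similar reader) at your own files.

```r
# Hypothetical example data, written to a temporary file so the snippet is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("carrier,name", "AA,American Airlines", "UA,United Air Lines"),
  csv_path
)

# Read the file into an ordinary data frame
my_airlines <- read.csv(csv_path, stringsAsFactors = FALSE)
str(my_airlines)
```

Any object that behaves like a data frame, including tibbles, can then be passed on to dm.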

Adding Tables

First, we will tell dm which tables we want to work with. How they are connected is defined later through keys. To collect the tables, we can use dm().

library(dm)

flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)
flights_dm_no_keys
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

The as_dm() function is an alternative that works if you already have a list of tables.
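A short sketch of the as_dm() route, assuming the tables are collected in a named list. The names table_list and flights_dm_from_list are chosen here for illustration only.

```r
library(dm)
library(nycflights13)

# A named list of data frames; the list names are used as table names
table_list <- list(
  airlines = airlines,
  airports = airports,
  flights = flights,
  planes = planes,
  weather = weather
)

flights_dm_from_list <- as_dm(table_list)
flights_dm_from_list
```

The result should describe the same five tables as the dm() call above, again without any keys.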

Defining Keys

Even though you now have a new dm object that contains all your data, some key details are still missing that determine how your five tables are connected (the foreign keys), and which column(s) uniquely identify the observations (the primary keys).

Primary Keys

dm offers dm_enum_pk_candidates() to identify primary keys and dm_add_pk() to add them.

dm_enum_pk_candidates(
  dm = flights_dm_no_keys,
  table = planes
)
#> # A tibble: 9 x 3
#>   columns     candidate why                                                
#>   <keys>      <lgl>     <chr>                                              
#> 1 tailnum     TRUE      ""                                                 
#> 2 engine      FALSE     "has duplicate values: 4 Cycle, Reciprocating, Tur…
#> 3 engines     FALSE     "has duplicate values: 1, 2, 3, 4"                 
#> 4 manufactur… FALSE     "has duplicate values: AIRBUS, AIRBUS INDUSTRIE, A…
#> 5 model       FALSE     "has duplicate values: 717-200, 737-301, 737-3G7, …
#> 6 seats       FALSE     "has duplicate values: 2, 4, 5, 6, 7, …"           
#> 7 speed       FALSE     "has duplicate values: 90, 105, 162, 432, NA"      
#> 8 type        FALSE     "has duplicate values: Fixed wing multi engine, Fi…
#> 9 year        FALSE     "has duplicate values: 1959, 1963, 1975, 1976, 197…
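To make the idea of a primary key candidate concrete, here is a rough base R equivalent of the check. It assumes that a candidate column must have no missing and no duplicated values. The helper is_unique_key() is hypothetical and is not part of dm, which may perform the check differently.

```r
library(nycflights13)

# Hypothetical helper (not part of dm):
# TRUE if every value is present and no value occurs twice
is_unique_key <- function(x) {
  !anyNA(x) && anyDuplicated(x) == 0
}

is_unique_key(planes$tailnum)  # listed as a candidate in the table above
is_unique_key(planes$engines)  # listed with duplicate values in the table above
```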

Now, add the primary keys that you have identified:

flights_dm_only_pks <- 
  flights_dm_no_keys %>%
  dm_add_pk(table = airlines, columns = carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum)
flights_dm_only_pks
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 0

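Foreign Keys

Use dm_enum_fk_candidates() to find out which columns of a table could serve as foreign keys into a reference table: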

dm_enum_fk_candidates(
  dm = flights_dm_only_pks,
  table = flights,
  ref_table = airlines
)
#> # A tibble: 19 x 3
#>    columns      candidate why                                              
#>    <keys>       <lgl>     <chr>                                            
#>  1 carrier      TRUE      ""                                               
#>  2 tailnum      FALSE     "334264 entries (99.3%) of `flights$tailnum` not…
#>  3 dest         FALSE     "336776 entries (100%) of `flights$dest` not in …
#>  4 origin       FALSE     "336776 entries (100%) of `flights$origin` not i…
#>  5 air_time     FALSE     "Can't join on 'value' x 'value' because of inco…
#>  6 arr_delay    FALSE     "Can't join on 'value' x 'value' because of inco…
#>  7 arr_time     FALSE     "Can't join on 'value' x 'value' because of inco…
#>  8 day          FALSE     "Can't join on 'value' x 'value' because of inco…
#>  9 dep_delay    FALSE     "Can't join on 'value' x 'value' because of inco…
#> 10 dep_time     FALSE     "Can't join on 'value' x 'value' because of inco…
#> 11 distance     FALSE     "Can't join on 'value' x 'value' because of inco…
#> 12 flight       FALSE     "Can't join on 'value' x 'value' because of inco…
#> 13 hour         FALSE     "Can't join on 'value' x 'value' because of inco…
#> 14 minute       FALSE     "Can't join on 'value' x 'value' because of inco…
#> 15 month        FALSE     "Can't join on 'value' x 'value' because of inco…
#> 16 sched_arr_t… FALSE     "Can't join on 'value' x 'value' because of inco…
#> 17 sched_dep_t… FALSE     "Can't join on 'value' x 'value' because of inco…
#> 18 time_hour    FALSE     "cannot join a POSIXct object with an object tha…
#> 19 year         FALSE     "Can't join on 'value' x 'value' because of inco…
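The candidate check can be approximated in base R by counting the values of a column that do not occur in the referenced key column. This is only a sketch under the assumption that missing values in the referencing column are ignored. The helper count_unmatched() is hypothetical and not part of dm.

```r
library(nycflights13)

# Hypothetical helper (not part of dm): number of non-missing child values
# that do not appear in the parent key column
count_unmatched <- function(child, parent) {
  sum(!is.na(child) & !(child %in% parent))
}

# A count of zero means every value has a match in the reference table
count_unmatched(flights$carrier, airlines$carrier)

# A positive count means some values have no match
count_unmatched(flights$tailnum, planes$tailnum)
```

A column with a count of zero behaves like flights$carrier in the table above, which is the only column listed as a candidate.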

To define how your tables are related, use dm_add_fk() to add foreign keys. First, name the two tables that you wish to connect through the table and ref_table arguments of dm_add_fk().

Then indicate in columns which column of table refers to ref_table’s primary key, which you’ve defined above. Voilà, here’s your dm object that you can work with:

flights_dm_all_keys <-
  flights_dm_only_pks %>%
  dm_add_fk(table = flights, columns = tailnum, ref_table = planes) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, origin, airports)
flights_dm_all_keys
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3

Visualization

Use dm_draw() at any stage of the process to get a visual representation:

flights_dm_no_keys %>%
  dm_draw(rankdir = "TB", view_type = "all")
[Diagram: the five tables airlines, airports, flights, planes, and weather, each listed with all of its columns; no keys or relationships are shown.]

flights_dm_no_keys %>%
  dm_add_pk(airlines, carrier) %>% 
  dm_draw()
[Diagram: the five tables, with carrier shown as the key column of airlines; no relationships are shown.]

flights_dm_only_pks %>%
  dm_add_fk(flights, tailnum, planes) %>% 
  dm_draw()
[Diagram: the five tables, with key columns carrier (airlines), faa (airports), and tailnum (planes), and one arrow from flights:tailnum to planes:tailnum.]

flights_dm_all_keys %>% 
  dm_draw()
[Diagram: the five tables, with arrows from flights:carrier to airlines:carrier, from flights:origin to airports:faa, and from flights:tailnum to planes:tailnum; weather is not connected.]

Integrity Checks

Check the constraints for your new data model or for intermediate steps:

flights_dm_no_keys %>%
  dm_examine_constraints()
#> ℹ No constraints defined.

flights_dm_only_pks %>%
  dm_examine_constraints()
#> ℹ All constraints satisfied.

flights_dm_all_keys %>% 
  dm_examine_constraints()
#> ! Unsatisfied constraints:
#>  Table `flights`: foreign key tailnum into table `planes`: 50094 entries (14.9%) of `flights$tailnum` not in `planes$tailnum`: N725MQ (575), N722MQ (513), N723MQ (507), N713MQ (483), N735MQ (396), …

The results are presented in a human-readable form, and stored internally as a tibble for programmatic inspection.

Programming

Helper functions are available to access details on keys and check results.

Call dm_get_all_pks() to retrieve a data frame with your primary keys:

dm_get_all_pks(flights_dm_only_pks)
#> # A tibble: 3 x 2
#>   table    pk_col 
#>   <chr>    <keys> 
#> 1 airlines carrier
#> 2 airports faa    
#> 3 planes   tailnum

A data frame of foreign keys is retrieved with dm_get_all_fks():

flights_dm_all_keys %>% 
  dm_get_all_fks()

Use tibble::as_tibble() on the result of dm_examine_constraints() to programmatically inspect which constraints are not satisfied:

flights_dm_all_keys %>% 
  dm_examine_constraints() %>% 
  tibble::as_tibble()
#> # A tibble: 6 x 6
#>   table   kind  columns ref_table is_key problem                           
#>   <chr>   <chr> <keys>  <chr>     <lgl>  <chr>                             
#> 1 flights FK    tailnum planes    FALSE  "50094 entries (14.9%) of `flight…
#> 2 airlin… PK    carrier NA        TRUE   ""                                
#> 3 airpor… PK    faa     NA        TRUE   ""                                
#> 4 planes  PK    tailnum NA        TRUE   ""                                
#> 5 flights FK    carrier airlines  TRUE   ""                                
#> 6 flights FK    origin  airports  TRUE   ""
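Building on the tibble shown above, a possible way to keep only the unsatisfied constraints is to filter on the is_key column. The sketch below rebuilds a reduced dm with a single primary key and a single foreign key so that it runs on its own. The object names flights_dm, constraints, and failed are chosen for illustration.

```r
library(dm)
library(nycflights13)

# Reduced data model: one primary key and one foreign key
flights_dm <- dm(airlines, airports, flights, planes, weather) %>%
  dm_add_pk(planes, tailnum) %>%
  dm_add_fk(flights, tailnum, planes)

# Convert the check results to a plain tibble
constraints <- tibble::as_tibble(dm_examine_constraints(flights_dm))

# Keep only the constraints that are not satisfied
failed <- constraints[!constraints$is_key, ]
failed[, c("table", "kind", "ref_table", "problem")]
```

The problem column of the remaining rows carries the same message that is printed by dm_examine_constraints().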