This document describes how you can get your data into a dm object.
The example used here ships with dm, and you can inspect the finished result by running dm::dm_nycflights13(). Here we rebuild it from scratch, step by step.
The five tables that we are working with contain information about all flights that departed from the airports of New York to other destinations in the United States in 2013, and are available through the nycflights13 package:
flights represents the trips taken by planes
airlines includes the names of transport organizations (name) and their abbreviated codes (carrier)
airports indicates the ports of departure (origin) and of destination (dest)
weather contains meteorological information at each hour
planes describes characteristics of the aircraft
Once we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.
library(nycflights13)
airports
#> # A tibble: 1,458 x 8
#>    faa   name                     lat    lon   alt    tz dst   tzone
#>    <chr> <chr>                  <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
#>  1 04G   Lansdowne Airport       41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…   32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional     42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport         41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport   31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…   36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…   41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…   42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …   39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl   48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows
Your own data probably will not be available in an R package. It is sufficient if you can load it as data frames into your R session. If the data is too large, consider connecting to the database instead. See vignette("dm-db") for details.
Adding Tables
First, we will tell dm which tables we want to work with and how they are connected. For that, we can use dm().
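As a minimal sketch, assuming the dm and nycflights13 packages are installed, the five tables can be passed straight to dm(). The object name flights_dm_no_keys is only an illustrative choice.

```r
library(dm)
library(nycflights13)

# Combine the five data frames into a single dm object.
# Each table is named after the argument it was passed as.
flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)

# Printing gives an overview of the tables; no keys are defined yet.
flights_dm_no_keys
```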
The as_dm() function is an alternative that works if you already have a list of tables.
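The list-based route might look like the sketch below. It assumes the list is named, so that each table keeps a recognizable name inside the dm; the names table_list and flights_dm_from_list are hypothetical.

```r
library(dm)
library(nycflights13)

# A named list of data frames, e.g. as produced by reading several files.
table_list <- list(
  airlines = airlines,
  airports = airports,
  flights = flights,
  planes = planes,
  weather = weather
)

# Convert the list into a dm object; list names become table names.
flights_dm_from_list <- as_dm(table_list)
```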
Defining Keys
Even though you now have a new dm object that contains all your data, some key details are still missing that determine how your five tables are connected (the foreign keys), and which column(s) uniquely identify the observations (the primary keys).
Primary Keys
dm offers dm_enum_pk_candidates() to identify primary keys and dm_add_pk() to add them.
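A sketch of both steps follows. The choice of key columns (carrier, faa, tailnum) is an assumption made for this illustration, as are the object names. Arguments are passed by position because the name of the column argument has differed between dm versions.

```r
library(dm)
library(nycflights13)

flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)

# Check each column of planes for suitability as a primary key.
dm_enum_pk_candidates(flights_dm_no_keys, planes)

# Register one primary key per lookup table (assumed choices for this sketch).
flights_dm_only_pks <- dm_add_pk(flights_dm_no_keys, airlines, carrier)
flights_dm_only_pks <- dm_add_pk(flights_dm_only_pks, airports, faa)
flights_dm_only_pks <- dm_add_pk(flights_dm_only_pks, planes, tailnum)

# List the primary keys now stored in the dm.
dm_get_all_pks(flights_dm_only_pks)
```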
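Foreign Keys
dm_enum_fk_candidates() plays the same role for foreign keys: it checks every column of one table against the primary key of a reference table. In the output below, which evaluates the columns of flights, only carrier qualifies.
#> # A tibble: 19 x 3
#> columns candidate why
#> <keys> <lgl> <chr>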
#> 1 carrier TRUE ""
#> 2 tailnum FALSE "334264 entries (99.3%) of `flights$tailnum` not…
#> 3 dest FALSE "336776 entries (100%) of `flights$dest` not in …
#> 4 origin FALSE "336776 entries (100%) of `flights$origin` not i…
#> 5 air_time FALSE "Can't join on 'value' x 'value' because of inco…
#> 6 arr_delay FALSE "Can't join on 'value' x 'value' because of inco…
#> 7 arr_time FALSE "Can't join on 'value' x 'value' because of inco…
#> 8 day FALSE "Can't join on 'value' x 'value' because of inco…
#> 9 dep_delay FALSE "Can't join on 'value' x 'value' because of inco…
#> 10 dep_time FALSE "Can't join on 'value' x 'value' because of inco…
#> 11 distance FALSE "Can't join on 'value' x 'value' because of inco…
#> 12 flight FALSE "Can't join on 'value' x 'value' because of inco…
#> 13 hour FALSE "Can't join on 'value' x 'value' because of inco…
#> 14 minute FALSE "Can't join on 'value' x 'value' because of inco…
#> 15 month FALSE "Can't join on 'value' x 'value' because of inco…
#> 16 sched_arr_t… FALSE "Can't join on 'value' x 'value' because of inco…
#> 17 sched_dep_t… FALSE "Can't join on 'value' x 'value' because of inco…
#> 18 time_hour FALSE "cannot join a POSIXct object with an object tha…
#> 19 year FALSE "Can't join on 'value' x 'value' because of inco…
To define how your tables are related, use dm_add_fk() to add foreign keys. First, name the two tables that you wish to connect by passing the table and ref_table arguments to dm_add_fk().
Then indicate in column which column of table refers to ref_table's primary key, which you've defined above. Voilà, you now have a dm object that you can work with.
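Put together, the foreign-key step could be sketched as follows. The three relationships shown (flights to planes, airlines, and airports) and all object names are assumptions for illustration. Arguments are again passed by position, in the order table, column, ref_table.

```r
library(dm)
library(nycflights13)

# Rebuild the dm with primary keys so this block runs on its own.
flights_dm_only_pks <- dm(airlines, airports, flights, planes, weather)
flights_dm_only_pks <- dm_add_pk(flights_dm_only_pks, airlines, carrier)
flights_dm_only_pks <- dm_add_pk(flights_dm_only_pks, airports, faa)
flights_dm_only_pks <- dm_add_pk(flights_dm_only_pks, planes, tailnum)

# Which columns of flights could reference the primary key of airlines?
dm_enum_fk_candidates(flights_dm_only_pks, flights, airlines)

# Link flights to each lookup table through the matching column.
flights_dm_all_keys <- dm_add_fk(flights_dm_only_pks, flights, tailnum, planes)
flights_dm_all_keys <- dm_add_fk(flights_dm_all_keys, flights, carrier, airlines)
flights_dm_all_keys <- dm_add_fk(flights_dm_all_keys, flights, origin, airports)

# List the foreign keys now stored in the dm.
dm_get_all_fks(flights_dm_all_keys)
```

weather is left without keys in this sketch.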