Quickstart

A quickstart example for drake

William Michael Landau

2017-09-29

Quick examples

library(drake)
load_basic_example() # Also (over)writes report.Rmd.
plot_graph(my_plan) # Hover, click, drag, zoom, pan.
make(my_plan) # Run the workflow.
make(my_plan) # Check that everything is already up to date.

Dive deeper into the built-in examples.

example_drake("basic") # Write the code files.
examples_drake() # List the other examples.
vignette("quickstart") # This vignette

Useful functions

Besides make(), here are some useful functions to learn about drake,

load_basic_example()
drake_tip()
examples_drake()
example_drake()

set up your workflow plan,

plan()
analyses()
summaries()
evaluate()
expand()
gather()
wildcard() # from the wildcard package

explore the dependency network,

outdated()
missed()
plot_graph()
dataframes_graph()
render_graph()
read_graph()
deps()
tracked()

interact with the cache,

clean()
cached()
imported()
built()
readd()
loadd()
find_project()
find_cache()

make use of recorded build times,

build_times()
predict_runtime()
rate_limiting_times()

speed up your project with parallel computing,

make() # with jobs > 2
max_useful_jobs()
parallelism_choices()
shell_file()

finely tune the caching and hashing,

available_hash_algos()
cache_path()
cache_types()
configure_cache()
default_long_hash_algo()
default_short_hash_algo()
long_hash()
short_hash()
new_cache()
recover_cache()
this_cache()
type_of_cache()

and debug your work.

check()
session()
in_progress()
progress()
config()
read_config()

Setting up the basic example

Let’s establish the building blocks of a data analysis workflow.

library(knitr)
library(drake)

First, we will generate a few datasets.

simulate <- function(n){
  data.frame(
    x = stats::rnorm(n),
    y = rpois(n, 1)
  )
}

Then, we will analyze each dataset with multiple analysis methods.

reg1 <- function(d){
  lm(y ~ + x, data = d)
}

reg2 <- function(d){
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d)
}
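As a quick standalone sanity check (a sketch, not part of the workflow plan — the functions above are redefined here so the chunk runs on its own), each method returns an lm fit, and reg2() introduces the x2 coefficient:

```r
simulate <- function(n){
  data.frame(x = stats::rnorm(n), y = rpois(n, 1))
}
reg1 <- function(d){
  lm(y ~ + x, data = d)   # linear term only
}
reg2 <- function(d){
  d$x2 <- d$x ^ 2         # add a quadratic term
  lm(y ~ x2, data = d)
}
d <- simulate(50)
stopifnot(inherits(reg1(d), "lm"))
stopifnot(identical(names(coef(reg2(d))), c("(Intercept)", "x2")))
```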

Finally, we will generate a dynamic report to display results.

my_knit <- function(file, ...){
  # The ... argument simply absorbs dependencies (such as report_dependencies)
  # so that drake reruns the report when they change.
  knit(file)
}

We need the source file report.Rmd.

lines <- c(
  "---",
  "title: Example Report",
  "author: You",
  "output: html_document",
  "---",
  "",
  "Look how I read outputs from the drake cache.",
  "",
  "```{r example_chunk}",
  "library(drake)",
  "readd(small)",
  "readd(coef_regression2_small)",
  "loadd(large)",
  "head(large)",
  "```")
writeLines(lines, "report.Rmd")

Workflow plan

The workflow plan lists the intermediate steps of your project.

load_basic_example()
my_plan
##                    target                                      command
## 1             'report.md'   my_knit('report.Rmd', report_dependencies)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4     report_dependencies      c(small, large, coef_regression2_small)
## 5       regression1_small                                  reg1(small)
## 6       regression1_large                                  reg1(large)
## 7       regression2_small                                  reg2(small)
## 8       regression2_large                                  reg2(large)
## 9  summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small                      coef(regression1_small)
## 14 coef_regression1_large                      coef(regression1_large)
## 15 coef_regression2_small                      coef(regression2_small)
## 16 coef_regression2_large                      coef(regression2_large)

Each row is an intermediate step, and each command generates a target. A target is an output R object (cached when generated) or an output file (specified with single quotes), and a command is just an ordinary piece of R code (not necessarily a single function call). As input, commands may take objects imported from your workspace, targets generated by other commands, or initial input files. These dependencies give your project an underlying network.
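For instance, here is a minimal sketch of a plan that mixes one cached object target with one single-quoted file target. The target names and commands are hypothetical, but the plan()/rbind() pattern and the file_targets/strings_in_dots arguments are the same ones used later in this vignette:

```r
library(drake)
objects <- plan(dat = simulate(20))           # `dat`: an R object target, cached
files <- plan(
  out.csv = write.csv(dat, "out.csv"),        # 'out.csv': an output file target
  file_targets = TRUE, strings_in_dots = "filenames")
rbind(objects, files)                         # row order does not matter
```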

# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")

See also dataframes_graph(), render_graph(), and config() for faster and more customized graphing.

You can also check the dependencies of individual targets.

deps(reg2)
## [1] "lm"
deps(my_plan$command[1]) # Files like report.Rmd are single-quoted.
## [1] "'report.Rmd'"        "my_knit"             "report_dependencies"
deps(my_plan$command[16])
## [1] "coef"              "regression2_large"

List all the reproducibly-tracked objects and files, including imports and targets.

tracked(my_plan, targets = "small")
## Unloading targets from environment:
##   report_dependencies
## [1] "small"        "simulate"     "data.frame"   "rpois"       
## [5] "stats::rnorm"
tracked(my_plan)
##  [1] "'report.md'"            "small"                 
##  [3] "large"                  "report_dependencies"   
##  [5] "regression1_small"      "regression1_large"     
##  [7] "regression2_small"      "regression2_large"     
##  [9] "summ_regression1_small" "summ_regression1_large"
## [11] "summ_regression2_small" "summ_regression2_large"
## [13] "coef_regression1_small" "coef_regression1_large"
## [15] "coef_regression2_small" "coef_regression2_large"
## [17] "reg1"                   "reg2"                  
## [19] "simulate"               "my_knit"               
## [21] "'report.Rmd'"           "c"                     
## [23] "summary"                "suppressWarnings"      
## [25] "coef"                   "lm"                    
## [27] "data.frame"             "rpois"                 
## [29] "stats::rnorm"           "knit"

Check for cycles, missing input files, and other pitfalls.

check(my_plan)

Generate the workflow plan

The data frame my_plan would be a pain to write by hand, so drake has functions to help you.

my_datasets <- plan(
  small = simulate(5),
  large = simulate(50))
my_datasets
##   target      command
## 1  small  simulate(5)
## 2  large simulate(50)

For multiple replicates:

expand(my_datasets, values = c("rep1", "rep2"))
##       target      command
## 1 small_rep1  simulate(5)
## 2 small_rep2  simulate(5)
## 3 large_rep1 simulate(50)
## 4 large_rep2 simulate(50)

Each dataset is analyzed multiple ways.

methods <- plan(
  regression1 = reg1(..dataset..),
  regression2 = reg2(..dataset..))
methods
##        target           command
## 1 regression1 reg1(..dataset..)
## 2 regression2 reg2(..dataset..)

We evaluate the ..dataset.. wildcard.

my_analyses <- analyses(methods, data = my_datasets)
my_analyses
##              target     command
## 1 regression1_small reg1(small)
## 2 regression1_large reg1(large)
## 3 regression2_small reg2(small)
## 4 regression2_large reg2(large)

Next, we summarize each analysis of each dataset using summary statistics and regression coefficients.

summary_types <- plan(
  summ = suppressWarnings(summary(..analysis..)),
  coef = coef(..analysis..))
summary_types
##   target                                 command
## 1   summ suppressWarnings(summary(..analysis..))
## 2   coef                      coef(..analysis..)
results <- summaries(summary_types, analyses = my_analyses,
  datasets = my_datasets, gather = NULL)
results
##                   target                                      command
## 1 summ_regression1_small suppressWarnings(summary(regression1_small))
## 2 summ_regression1_large suppressWarnings(summary(regression1_large))
## 3 summ_regression2_small suppressWarnings(summary(regression2_small))
## 4 summ_regression2_large suppressWarnings(summary(regression2_large))
## 5 coef_regression1_small                      coef(regression1_small)
## 6 coef_regression1_large                      coef(regression1_large)
## 7 coef_regression2_small                      coef(regression2_small)
## 8 coef_regression2_large                      coef(regression2_large)

The gather feature groups summaries into a smaller number of more manageable targets. I disabled it here (gather = NULL) to keep the data frames readable.

For the dynamic report, we have to declare the dependencies manually.

load_in_report <- plan(
  report_dependencies = c(small, large, coef_regression2_small))
load_in_report
##                target                                 command
## 1 report_dependencies c(small, large, coef_regression2_small)

Remember: use single quotes for file dependencies. The functions quotes(), unquote(), and strings() from the eply package may help. Also, please be aware that drake cannot track entire directories/folders.

report <- plan(
  report.md = my_knit('report.Rmd', report_dependencies), # nolint
  file_targets = TRUE, strings_in_dots = "filenames")
report
##        target                                    command
## 1 'report.md' my_knit('report.Rmd', report_dependencies)

Finally, gather your workflow together with rbind(). Row order does not matter.

my_plan <- rbind(report, my_datasets, load_in_report, my_analyses, results)
my_plan
##                    target                                      command
## 1             'report.md'   my_knit('report.Rmd', report_dependencies)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4     report_dependencies      c(small, large, coef_regression2_small)
## 5       regression1_small                                  reg1(small)
## 6       regression1_large                                  reg1(large)
## 7       regression2_small                                  reg2(small)
## 8       regression2_large                                  reg2(large)
## 9  summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small                      coef(regression1_small)
## 14 coef_regression1_large                      coef(regression1_large)
## 15 coef_regression2_small                      coef(regression2_small)
## 16 coef_regression2_large                      coef(regression2_large)

Flexible helpers to make workflow plans

If your workflow does not fit the rigid datasets/analyses/summaries framework, check out functions expand(), evaluate(), and gather().

df <- plan(data = simulate(center = MU, scale = SIGMA))
df
##   target                              command
## 1   data simulate(center = MU, scale = SIGMA)
df <- expand(df, values = c("rep1", "rep2"))
df
##      target                              command
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2)
##        target                             command
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
##      target                             command
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
##      target                           command
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2   simulate(center = 2, scale = 1)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
##             target                           command
## 1  data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2    data_rep1_1_1   simulate(center = 1, scale = 1)
## 3   data_rep1_1_10  simulate(center = 1, scale = 10)
## 4  data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5    data_rep1_2_1   simulate(center = 2, scale = 1)
## 6   data_rep1_2_10  simulate(center = 2, scale = 10)
## 7  data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8    data_rep2_1_1   simulate(center = 1, scale = 1)
## 9   data_rep2_1_10  simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11   data_rep2_2_1   simulate(center = 2, scale = 1)
## 12  data_rep2_2_10  simulate(center = 2, scale = 10)
gather(df)
##   target                                            command
## 1 target list(data_rep1 = data_rep1, data_rep2 = data_rep2)
gather(df, target = "my_summaries", gather = "rbind")
##         target                                             command
## 1 my_summaries rbind(data_rep1 = data_rep1, data_rep2 = data_rep2)

Run the workflow

You may want to check for outdated or missing targets/imports first.

outdated(my_plan, verbose = FALSE) # Targets that need to be (re)built.
##  [1] "'report.md'"            "coef_regression1_large"
##  [3] "coef_regression1_small" "coef_regression2_large"
##  [5] "coef_regression2_small" "large"                 
##  [7] "regression1_large"      "regression1_small"     
##  [9] "regression2_large"      "regression2_small"     
## [11] "report_dependencies"    "small"                 
## [13] "summ_regression1_large" "summ_regression1_small"
## [15] "summ_regression2_large" "summ_regression2_small"
missed(my_plan, verbose = FALSE) # Checks your workspace.

Then just make(my_plan).

make(my_plan)
## check 10 items: 'report.Rmd', c, summary, suppressWarnings, coef, lm, data.fr...
## import 'report.Rmd'
## import c
## import summary
## import suppressWarnings
## import coef
## import lm
## import data.frame
## import rpois
## import stats::rnorm
## import knit
## check 4 items: reg1, reg2, simulate, my_knit
## import reg1
## import reg2
## import simulate
## import my_knit
## check 2 items: small, large
## target small
## target large
## check 4 items: regression1_small, regression1_large, regression2_small, regre...
## target regression1_small
## target regression1_large
## target regression2_small
## target regression2_large
## check 8 items: summ_regression1_small, summ_regression1_large, summ_regressio...
## target summ_regression1_small
## target summ_regression1_large
## target summ_regression2_small
## target summ_regression2_large
## target coef_regression1_small
## target coef_regression1_large
## target coef_regression2_small
## target coef_regression2_large
## check 1 item: report_dependencies
## unload 11 items: regression1_small, regression1_large, regression2_small, reg...
## target report_dependencies
## check 1 item: 'report.md'
## unload 3 items: small, large, coef_regression2_small
## target 'report.md'

The non-file dependencies of your last target are already loaded in your workspace.

"report_dependencies" %in% ls() # Should be TRUE.
## [1] TRUE
outdated(my_plan, verbose = FALSE) # Everything is up to date.
build_times(digits = 4) # How long did it take to make each target?
##                      item   type elapsed   user system
## 1            'report.Rmd' import  0.001s 0.004s     0s
## 2             'report.md' target  0.022s  0.02s 0.004s
## 3                       c import  0.002s     0s     0s
## 4                    coef import  0.002s     0s     0s
## 5  coef_regression1_large target  0.002s     0s     0s
## 6  coef_regression1_small target  0.002s     0s 0.004s
## 7  coef_regression2_large target  0.002s     0s     0s
## 8  coef_regression2_small target  0.004s 0.004s     0s
## 9              data.frame import  0.004s 0.004s     0s
## 10                   knit import  0.003s     0s     0s
## 11                  large target  0.002s 0.004s     0s
## 12                     lm import  0.002s     0s     0s
## 13                my_knit import  0.002s 0.004s     0s
## 14                   reg1 import  0.004s 0.004s     0s
## 15                   reg2 import  0.002s     0s     0s
## 16      regression1_large target  0.003s 0.004s     0s
## 17      regression1_small target  0.004s 0.004s     0s
## 18      regression2_large target  0.003s 0.004s     0s
## 19      regression2_small target  0.003s 0.004s     0s
## 20    report_dependencies target  0.002s     0s     0s
## 21                  rpois import  0.002s     0s     0s
## 22               simulate import  0.002s     0s     0s
## 23                  small target  0.003s     0s     0s
## 24           stats::rnorm import  0.002s 0.004s     0s
## 25 summ_regression1_large target  0.002s 0.004s     0s
## 26 summ_regression1_small target  0.003s 0.004s     0s
## 27 summ_regression2_large target  0.003s     0s     0s
## 28 summ_regression2_small target  0.002s     0s     0s
## 29                summary import  0.004s     0s     0s
## 30       suppressWarnings import  0.004s 0.004s     0s

See also predict_runtime() and rate_limiting_times().

In the new graph, the red nodes from before are now green.

# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")

Optionally, get visNetwork nodes and edges so you can make your own plot with visNetwork or render_graph().

dataframes_graph(my_plan)

Use readd() and loadd() to load more targets. (They are cached in the hidden .drake/ folder using storr.) Other functions let you interact with and view the cache.

readd(coef_regression2_large)
## (Intercept)          x2 
##  1.09974766  0.04536352
loadd(small)
head(small)
##            x y
## 1 -1.4318013 2
## 2 -1.5429977 2
## 3  0.3550202 1
## 4 -0.9434383 1
## 5 -0.4181989 1
rm(small)
cached(small, large)
## small large 
##  TRUE  TRUE
cached()
##  [1] "'report.Rmd'"           "'report.md'"           
##  [3] "c"                      "coef"                  
##  [5] "coef_regression1_large" "coef_regression1_small"
##  [7] "coef_regression2_large" "coef_regression2_small"
##  [9] "data.frame"             "knit"                  
## [11] "large"                  "lm"                    
## [13] "my_knit"                "reg1"                  
## [15] "reg2"                   "regression1_large"     
## [17] "regression1_small"      "regression2_large"     
## [19] "regression2_small"      "report_dependencies"   
## [21] "rpois"                  "simulate"              
## [23] "small"                  "stats::rnorm"          
## [25] "summ_regression1_large" "summ_regression1_small"
## [27] "summ_regression2_large" "summ_regression2_small"
## [29] "summary"                "suppressWarnings"
built()
##  [1] "'report.md'"            "coef_regression1_large"
##  [3] "coef_regression1_small" "coef_regression2_large"
##  [5] "coef_regression2_small" "large"                 
##  [7] "regression1_large"      "regression1_small"     
##  [9] "regression2_large"      "regression2_small"     
## [11] "report_dependencies"    "small"                 
## [13] "summ_regression1_large" "summ_regression1_small"
## [15] "summ_regression2_large" "summ_regression2_small"
imported()
##  [1] "'report.Rmd'"     "c"                "coef"            
##  [4] "data.frame"       "knit"             "lm"              
##  [7] "my_knit"          "reg1"             "reg2"            
## [10] "rpois"            "simulate"         "stats::rnorm"    
## [13] "summary"          "suppressWarnings"
head(read_plan())
##                target                                    command
## 1         'report.md' my_knit('report.Rmd', report_dependencies)
## 2               small                                simulate(5)
## 3               large                               simulate(50)
## 4 report_dependencies    c(small, large, coef_regression2_small)
## 5   regression1_small                                reg1(small)
## 6   regression1_large                                reg1(large)
head(progress()) # See also in_progress()
##           'report.Rmd'            'report.md'                      c 
##             "finished"             "finished"             "finished" 
##                   coef coef_regression1_large coef_regression1_small 
##             "finished"             "finished"             "finished"
progress(large)
##      large 
## "finished"
session() # of the last call to make()
## R Under development (unstable) (2017-09-16 r73293)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.04
## 
## Matrix products: default
## BLAS: /home/landau/packages/R/R-devel/lib/R/lib/libRblas.so
## LAPACK: /home/landau/packages/R/R-devel/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.17  drake_4.2.0
## 
## loaded via a namespace (and not attached):
##  [1] igraph_1.1.2      Rcpp_0.12.12      magrittr_1.5     
##  [4] R6_2.2.2          stringr_1.2.0     storr_1.1.2      
##  [7] plyr_1.8.4        visNetwork_2.0.1  tools_3.5.0      
## [10] parallel_3.5.0    R.oo_1.21.0       eply_0.1.0       
## [13] withr_2.0.0       htmltools_0.3.6   yaml_2.1.14      
## [16] rprojroot_1.2     digest_0.6.12     crayon_1.3.4     
## [19] htmlwidgets_0.9   R.utils_2.5.0     codetools_0.2-15 
## [22] testthat_1.0.2    evaluate_0.10.1   rmarkdown_1.6    
## [25] stringi_1.1.5     compiler_3.5.0    backports_1.1.0  
## [28] R.methodsS3_1.7.1 jsonlite_1.5      lubridate_1.6.0  
## [31] pkgconfig_2.0.1

The next time you run make(my_plan), nothing will be built because drake knows everything is up to date.

make(my_plan)
## check 10 items: 'report.Rmd', c, summary, suppressWarnings, coef, data.frame,...
## import 'report.Rmd'
## import c
## import summary
## import suppressWarnings
## import coef
## import data.frame
## import rpois
## import stats::rnorm
## import lm
## import knit
## check 4 items: simulate, reg1, reg2, my_knit
## import simulate
## import reg1
## import reg2
## import my_knit
## check 2 items: small, large
## check 4 items: regression1_small, regression1_large, regression2_small, regre...
## check 8 items: summ_regression1_small, summ_regression1_large, summ_regressio...
## check 1 item: report_dependencies
## check 1 item: 'report.md'

But if you change one of your functions, commands, or other dependencies, drake will update the affected parts of the workflow. Let’s say we want to change the quadratic term to a cubic term in our reg2() function.

reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

Only the targets depending on reg2() need to be rebuilt; everything else is left alone.

outdated(my_plan, verbose = FALSE)
## [1] "'report.md'"            "coef_regression2_large"
## [3] "coef_regression2_small" "regression2_large"     
## [5] "regression2_small"      "report_dependencies"   
## [7] "summ_regression2_large" "summ_regression2_small"
# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")
make(my_plan)
## check 10 items: 'report.Rmd', c, summary, suppressWarnings, coef, data.frame,...
## import 'report.Rmd'
## import c
## import summary
## import suppressWarnings
## import coef
## import data.frame
## import rpois
## import stats::rnorm
## import lm
## import knit
## check 4 items: simulate, reg1, reg2, my_knit
## import simulate
## import reg1
## import reg2
## import my_knit
## check 2 items: small, large
## check 4 items: regression1_small, regression1_large, regression2_small, regre...
## load 2 items: large, small
## target regression2_small
## target regression2_large
## check 8 items: summ_regression1_small, summ_regression1_large, summ_regressio...
## target summ_regression2_small
## target summ_regression2_large
## target coef_regression2_small
## target coef_regression2_large
## check 1 item: report_dependencies
## unload 5 items: regression2_small, regression2_large, summ_regression2_small,...
## target report_dependencies
## check 1 item: 'report.md'
## unload 3 items: small, large, coef_regression2_small
## target 'report.md'

Trivial changes to whitespace and comments, however, are ignored, both in your functions and in my_plan$command.

reg2 <- function(d) {
  d$x3 <- d$x ^ 3
    lm(y ~ x3, data = d) # I indented here.
}
outdated(my_plan, verbose = FALSE) # Everything is up to date.

Need to add new work on the fly? Just append rows to the workflow plan. If the rest of your workflow is up to date, only the new work is run.

new_simulation <- function(n){
  data.frame(x = rnorm(n), y = rnorm(n))
}

additions <- plan(
  new_data = new_simulation(36) + sqrt(10))
additions
##     target                       command
## 1 new_data new_simulation(36) + sqrt(10)
my_plan <- rbind(my_plan, additions)
my_plan
##                    target                                      command
## 1             'report.md'   my_knit('report.Rmd', report_dependencies)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4     report_dependencies      c(small, large, coef_regression2_small)
## 5       regression1_small                                  reg1(small)
## 6       regression1_large                                  reg1(large)
## 7       regression2_small                                  reg2(small)
## 8       regression2_large                                  reg2(large)
## 9  summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small                      coef(regression1_small)
## 14 coef_regression1_large                      coef(regression1_large)
## 15 coef_regression2_small                      coef(regression2_small)
## 16 coef_regression2_large                      coef(regression2_large)
## 17               new_data                new_simulation(36) + sqrt(10)
make(my_plan)
## check 12 items: 'report.Rmd', c, summary, suppressWarnings, coef, sqrt, data....
## import 'report.Rmd'
## import c
## import summary
## import suppressWarnings
## import coef
## import sqrt
## import data.frame
## import rnorm
## import rpois
## import stats::rnorm
## import lm
## import knit
## check 5 items: new_simulation, simulate, reg1, reg2, my_knit
## import new_simulation
## import simulate
## import reg1
## import reg2
## import my_knit
## check 3 items: small, large, new_data
## target new_data
## check 4 items: regression1_small, regression1_large, regression2_small, regre...
## check 8 items: summ_regression1_small, summ_regression1_large, summ_regressio...
## check 1 item: report_dependencies
## check 1 item: 'report.md'

If you ever need to erase your work, use clean(). Any targets removed from the cache will have to be rebuilt on the next call to make(), so be careful.

clean(small, reg1) # uncaches individual targets and imported objects
clean() # cleans all targets out of the cache
clean(destroy = TRUE) # removes the cache entirely

High-performance computing

The network graph is the key to drake’s parallel computing.

clean()
load_basic_example()
make(my_plan, jobs = 2, verbose = FALSE) # Parallelize over 2 jobs.
# Change a dependency.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")

When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs.
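The column-by-column scheduling above can be sketched in plain R. This is a conceptual illustration only, not drake's actual implementation, and the mini-workflow below uses hypothetical target names: assign each target a depth equal to its longest chain of dependencies, then group targets by depth. Targets within a group are mutually independent and could run in parallel.

```r
# Conceptual sketch only (not drake internals): group targets into
# dependency "columns" and run each column's targets in parallel.
deps <- list(                        # hypothetical mini-workflow
  small = character(0),
  large = character(0),
  regression1_small = "small",
  regression1_large = "large",
  summ_small = "regression1_small",
  summ_large = "regression1_large"
)
depth <- function(target){           # 1 + longest chain of dependencies
  if (!length(deps[[target]])) return(1)
  1 + max(vapply(deps[[target]], depth, numeric(1)))
}
columns <- split(names(deps), vapply(names(deps), depth, numeric(1)))
# Each element of `columns` is a set of mutually independent targets,
# eligible for parallel execution (e.g. parallel::mclapply) within the column.
columns
```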

See max_useful_jobs() for a suggested number of jobs that takes into account which targets are already up to date. Try out the following in a fresh R session.

library(drake)
load_basic_example()
plot_graph(my_plan) # Set targets_only to TRUE for smaller graphs.
max_useful_jobs(my_plan) # 8
max_useful_jobs(my_plan, imports = "files") # 8
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 8
make(my_plan, jobs = 4)
plot_graph(my_plan)
# Ignore the targets already built.
max_useful_jobs(my_plan) # 1
max_useful_jobs(my_plan, imports = "files") # 1
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 0
# Change a function so some targets are now out of date.
reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
plot_graph(my_plan)
max_useful_jobs(my_plan) # 4
max_useful_jobs(my_plan, from_scratch = TRUE) # 8
max_useful_jobs(my_plan, imports = "files") # 4
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 4

As for how the parallelism is implemented, you can choose from multiple built-in backends.

  1. mclapply: low-overhead, light-weight. drake::make(my_plan, parallelism = "mclapply", jobs = 2) invokes parallel::mclapply() under the hood, distributing the work over at most two independent processes (set with jobs). The mclapply backend is an ideal choice for low-overhead single-node parallelism, but it does not work on Windows.
  2. parLapply: medium-overhead, light-weight. make(my_plan, parallelism = "parLapply", jobs = 2) invokes parallel::parLapply() under the hood. This option is similar to mclapply except that it works on Windows and costs a little extra time up front.
  3. Makefile: high-overhead, heavy-duty. For this one, Windows users need to download and install Rtools. For everyone else, just make sure Make is installed. The build order may be different for Makefile parallelism because all the imports are processed before any of the targets are built with the Makefile. That means plot_graph(), dataframes_graph(), and max_useful_jobs() behave differently for "Makefile" parallelism.

On a cluster, you can route each target's Makefile recipe through a shell script such as shell.sh (see shell_file()), which submits the work as a cluster job. Here is an example for SGE:

#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
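To summarize the choices, here is a hedged sketch of the same workflow run under each backend (the jobs counts are arbitrary; all of these calls are drawn from the descriptions above and require a working drake setup):

```r
library(drake)
load_basic_example()
make(my_plan, parallelism = "mclapply", jobs = 2)   # Unix-alikes only
make(my_plan, parallelism = "parLapply", jobs = 2)  # also works on Windows
make(my_plan, parallelism = "Makefile", jobs = 4)   # requires Make installed
```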

You may need to replace module load R with a command to load a specific version of R. SLURM users can point the Makefile's SHELL directly at srun and dispense with shell.sh altogether.

make(my_plan, parallelism = "Makefile", jobs = 4,
  prepend = "SHELL=srun")

For long projects, put your call to make() in an R script (say, script.R) and run it from the Linux terminal.

nohup nice -19 R CMD BATCH script.R &

Even after you log out, a background process will keep running on the login node and submit new jobs at the appropriate time. Jobs are only submitted if the targets need to be (re)built.

Important notes on Makefile-level parallelism

Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best parallel backend for your system and uses the number of jobs you give to the jobs argument to make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run

make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")

The Makefile generated by make(plan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). Drake uses dummy timestamp files to tell the Makefile which targets can be skipped and which need to be (re)built, so running make in the terminal will most likely give incorrect results.