drake

data frames in R for Make

William Michael Landau

2017-02-27

Drake is a workflow manager for R. When it runs a project, it automatically builds missing and outdated results while skipping over all the up-to-date output. This automation and reproducibility are important for data analysis workflows, especially large projects under heavy development.

Acknowledgements and history

The original idea of a time-saving reproducible build system extends back decades to GNU Make, which today helps data scientists as well as its original user base of compiled-language programmers. More recently, Rich FitzJohn created remake, a breakthrough reimagining of Make for R and the most important inspiration for drake. Drake is a fresh, minimalist reinterpretation of some of remake’s pioneering fundamental concepts, scaled up for computationally demanding workflows. Relative to remake, drake has several prominent distinguishing features at the time of writing this document.

Thanks also to Kirill Müller and Daniel Falster. They contributed code patches and enhancement ideas to my parallelRemake and remakeGenerator packages, which I have now subsumed into drake.

Rtools for Windows users

Windows users need Rtools to run make(..., makefile = TRUE), because the call system2("make") must be able to succeed.
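If you are not sure whether make is available, here is a quick check from R (it simply asks the system's make for its version):

system2("make", "--version") # should print the version of make rather than throw an error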

Help and troubleshooting

Use the help_drake() function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page.

Basic usage

Drake was built to keep track of large and complicated statistical analysis workflows, but for now, let’s start small. Here is a baby workflow plan to produce a variable named a.

  1. Evaluate 1 + 1 and assign the result to variable d.
  2. Evaluate 2 + 2 and assign the result to variable e.
  3. Evaluate 3 + 3 and assign the result to variable f.
  4. After we have d and f, evaluate d + f and assign the result to variable c.
  5. After we have d and e, evaluate d + e and assign the result to variable b.
  6. After we have b and c, evaluate b + c and assign the result to variable a.

We represent this plan as a data frame with an output column and a code column.

library(drake)
x = example_plan("small")
x
##   output  code
## 1      a b + c
## 2      b d + e
## 3      c d + f
## 4      d 1 + 1
## 5      e 2 + 2
## 6      f 3 + 3

Use make() to run the six steps in the correct order. Keep in mind that steps 1 through 3 can be interchanged and steps 4 and 5 can be interchanged.

check(x) # check for errors first 
make(x)
## build  f
## build  d
## build  e
## build  c
## build  b
## build  a
readd(a) # see also loadd() and cached()
## [1] 14

The whole point of drake is to reproducibly track your output. If an object is already up to date, drake will skip it next time. You can also use the output argument to restrict make() to specific targets and their dependencies.

make(x, output = c("c", "f"))
## skip   f
## skip   d
## skip   c

When you change your code, drake brings your results up to date, doing the minimum amount of work necessary.

x$code[3] = "sqrt(d) + 2*f + 1" # new code for variable c
make(x)
## skip   f
## skip   d
## skip   e
## build  c
## skip   b
## build  a
readd(a)
## [1] 20.41421
make(x)
## skip   f
## skip   d
## skip   e
## skip   c
## skip   b
## skip   a
x$code[5] = "2*2*1" # variable e: previously 2 + 2, so the output value doesn't change
make(x)
## skip   f
## skip   d
## build  e
## skip   c
## skip   b
## skip   a
readd(a)
## [1] 20.41421
x$code[5] = "7/2" # new code for variable e
make(x)
## skip   f
## skip   d
## build  e
## skip   c
## build  b
## build  a
readd(a)
## [1] 19.91421
x$code[5] = "7 /2 # changes to comments and whitespace are ignored"
make(x)
## skip   f
## skip   d
## skip   e
## skip   c
## skip   b
## skip   a

Interacting with the cache

Try the following functions in an interactive R session from the root directory of your project (or from a subdirectory, using the search argument).
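For example, here is a sketch of the search argument (I am assuming it is a logical flag, as its name suggests; see each function's help file for the exact interface):

cached(search = TRUE)   # from a subdirectory, search upward for the project's cache
readd(a, search = TRUE) # read a cached object the same way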

Cleaning and pruning

Prune your workflow to remove cached objects that are no longer in the plan.

cached()
## [1] "a" "b" "c" "d" "e" "f"
x = x[1:5,]
prune(x)
cached()
## [1] "a" "b" "c" "d" "e"

Use clean() to completely remove everything generated and tracked by make(). This is a nuclear option, so only use it if you are totally sure you want to start over from scratch.

clean() # removes the cached objects but keeps the hidden ".drake/" folder
cached()
## character(0)
clean(destroy = TRUE) # removes ".drake/"

To use prune() and clean(), you must be in the project’s root directory, which you can find with find_project().
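For example, a minimal sketch:

root = find_project() # path to the root directory of the nearest drake project
setwd(root)           # prune() and clean() should be called from here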

Imported objects

The code in your workflow plan may depend on functions you write yourself, data that you download from a website before every runthrough, etc. Drake just pulls these objects from your workspace and reproducibly tracks them. Try make(..., envir = my_workspace) if you want to use a custom R environment instead. WARNING: packages, global options, and other parts of the global environment are available, but NOT reproducibly tracked. (Use packrat to reproducibly manage your packages.)

x = data.frame(output = c("out", "my_input"), code = c("my_input - 1", "f(2)"))
f = function(x) g(x) + 1
g = function(x) h(x) + 2
h = function(x) x^2 + my_var
# make(x) # quits in error because "my_var" is undefined
my_var = 1
make(x)
## import my_var
## import h
## import g
## import f
## build  my_input
## build  out
readd(out)
## [1] 7
make(x)
## import my_var
## import h
## import g
## import f
## skip   my_input
## skip   out
my_var = 2 # drake knows you changed "my_var"
make(x)
## import my_var
## import h
## import g
## import f
## build  my_input
## build  out
readd(out)
## [1] 8

Drake knows when your functions change, and it respects how functions are nested. Here, f() calls g(), and g() calls h(). So if h() changes, then everything depending on f() will be rebuilt.

h = function(x){ x - 10 + my_var}
make(x)
## import my_var
## import h
## import g
## import f
## build  my_input
## build  out
readd(out)
## [1] -4

But changes to comments and whitespace in functions are ignored.

h = function(x){
  x-10+my_var
}
make(x)
## import my_var
## import h
## import g
## import f
## skip   my_input
## skip   out
readd(out)
## [1] -4

Caution about imported objects

I repeat: only your workspace (or the envir argument of make()) is reproducibly tracked.

global = 10000
run = function(x) make(x)
run(x)
## build  my_input
## skip   out
readd(out)
## [1] -4

In addition, beware of automatically loaded '.RData' files, which could wreck your workspace. If you have an '.RData' file in your working directory, drake warns you when the package loads.

save.image()
drake:::.onLoad()
## Warning in drake:::.onLoad(): Auto-saved workspace file '.RData' detected.
## This is bad for reproducible code. Drake says you should remove it with
## unlink('.RData').
unlink('.RData')

External files

Sometimes character strings are just plain strings, but other times they are names of reproducibly-tracked files that your workflow depends on. Drake tells the difference with quoting. Double-quoted strings are ordinary strings, and strings wrapped in single quotes stand for file dependencies. Functions as_file(), quotes(), strings(), and unquote() ease some of the string-manipulation burden. (These last three are from the eply package.)

saveRDS("imported data", file = "imported_file")
x = data.frame(
  output = c("'first'", "message", "'second'", "contents_of_imported_file"),
  code = c(
    "saveRDS(\"hello world\", \"first\")",
    "readRDS('first')",
    "saveRDS(message, \"second\")",
    "readRDS('imported_file')"))
x
##                      output                            code
## 1                   'first' saveRDS("hello world", "first")
## 2                   message                readRDS('first')
## 3                  'second'      saveRDS(message, "second")
## 4 contents_of_imported_file        readRDS('imported_file')

In the first row, notice that first is single-quoted on the left and double-quoted on the right. This tells drake to expect an output file named first that does not depend on the character string "first". If double quotes appeared on the right side as well, the file first would depend on itself, and make() would quit in error. On the other hand, message uses a single-quoted 'first' in its code, telling drake to treat the file first as a file dependency.

Before you run your workflow, use check() to screen for circular dependencies, missing files, and possible mistakes in quoting. (make() checks the first two of these.)

check(x)
## Double-quoted strings were found in plan$code.
## Should these be single-quoted instead?
## Remember: single-quoted strings are file dependencies/outputs
## and double-quoted strings are just ordinary strings.
## 
## output: 'first' 
## strings in code: "hello world" "first" 
## 
## output: 'second' 
## strings in code: "second"
make(x, output = "'second'") # Use single quotes here too.
## build  'first'
## build  message
## build  'second'
make(x)
## skip   'first'
## import 'imported_file'
## skip   message
## build  contents_of_imported_file
## skip   'second'
readRDS("second")
## [1] "hello world"
readd(contents_of_imported_file)
## [1] "imported data"
readd("'second'") # Only the fingerprints of external files are cached.
## $hash
## [1] "0544f5e936e8320dbca44450c4550074"
## 
## $mtime
## [1] "2017-02-27 09:06:39 EST"

Both imported and output files are reproducibly tracked.

make(x)
## skip   'first'
## import 'imported_file'
## skip   message
## skip   contents_of_imported_file
## skip   'second'
cached()
## [1] "'first'"                   "'imported_file'"          
## [3] "'second'"                  "contents_of_imported_file"
## [5] "message"
list.files()
## [1] "drake.R"       "drake.Rmd"     "drake.html"    "first"        
## [5] "imported_file" "second"

Cleaning and pruning remove output files, but not imported input files.

clean()
cached()
## character(0)
list.files()
## [1] "drake.R"       "drake.Rmd"     "drake.html"    "imported_file"
unlink("imported_file")

If you damage or delete any output files, drake will recover them for you.

file_plan = plan(list = c(
  "'a'" = "saveRDS(17, \"a\")",
  "'b'" = "saveRDS(1 + readRDS('a'), \"b\")",
  "c" = "readRDS('b')"))
file_plan
##   output                           code
## 1    'a'               saveRDS(17, "a")
## 2    'b' saveRDS(1 + readRDS('a'), "b")
## 3      c                   readRDS('b')
make(file_plan, verbose = FALSE) # first runthrough
readRDS('b')
## [1] 18
saveRDS(5, 'b') # damage the file 'b'
make(file_plan)
## skip   'a'
## build  'b'
## skip   c
readRDS('b')
## [1] 18
clean()

WARNING: drake does not look for file dependencies inside the bodies of imported functions. Inside imported functions, single quotes are not given any special treatment. Single quotes are unavoidably turned into double quotes when a function is parsed or tidied, so if you import f <- function(x) read.csv('my_file.csv') from your calling environment, then 'my_file.csv' will not necessarily be treated as a file dependency.
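For illustration, here is a minimal sketch of the pitfall and one workaround (the function and file names are hypothetical):

f = function(file) read.csv(file)                # a file name hard-coded inside f() would NOT be tracked
x = plan(list = c(dataset = "f('my_file.csv')")) # a single-quoted name in the command itself IS a file dependency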

The plan() function errs on the side of single-quoting to make sure file dependencies are not forgotten. Still, you should always run check() before make(). Read on for more about plan().

More on workflow plans

Drake has a few built-in workflow plans.

example_plans()
## [1] "small" "debug"
example_plan("small")
##   output  code
## 1      a b + c
## 2      b d + e
## 3      c d + f
## 4      d 1 + 1
## 5      e 2 + 2
## 6      f 3 + 3
example_plan("debug")
##   output                                     code
## 1      a as.numeric(as.matrix(read.csv('input')))
## 2      b                                    a + 1
## 3      c                                    a + 2
## 4      d                                 f(b + c)
## 5      e                                 g(b + d)
## 6    'd'                          saveRDS(d, "d")
## 7    'e'                          saveRDS(e, "e")
## 8  final                             readRDS('e')

The “debug” plan relies on external functions and files that you should load with debug_setup() before calling make(). When you’re done, call debug_cleanup() to remove the files for the example.
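For example, a quick sketch of that workflow:

library(drake)
plan = example_plan("debug")
debug_setup()   # create the input file and functions that the plan relies on
make(plan)
readd(final)    # the last output in the plan
debug_cleanup() # remove the example's files when you are done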

Drake’s plan() function helps create workflow plan data frames, and the as_file() function wraps strings in single quotes so drake can recognize them as file names. The eply package has functions quotes(), strings(), and unquote() to help with character manipulation and quoting. I stress: single-quoted strings denote file dependencies, and double-quoted strings are just ordinary character strings. The plan() function errs on the side of single-quoting to make sure file dependencies are not forgotten. For more control over quoting, either use the strings_in_dots argument or disregard the freeform dots '...' in favor of the list argument.

plan(x = a, y = readRDS(2, 'input.rds'))
##   output                    code
## 1      x                       a
## 2      y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, 'input.rds'), 
     strings_in_dots = "file_deps") # default
##   output                    code
## 1      x                       a
## 2      y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, 'input.rds'), 
     strings_in_dots = "not_deps")
##   output                    code
## 1      x                       a
## 2      y readRDS(2, "input.rds")
plan(x = a, y = readRDS(2, "input.rds"))
##   output                    code
## 1      x                       a
## 2      y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, "input.rds"), 
     strings_in_dots = "not_deps")
##   output                    code
## 1      x                       a
## 2      y readRDS(2, "input.rds")
plan(list = c(x = "a", y = "readRDS(\"some_string\", 'input.rds')"))
##   output                                code
## 1      x                                   a
## 2      y readRDS("some_string", 'input.rds')
plan('a' = 1)
##   output code
## 1      a    1
plan("'a'" = 1)
##   output code
## 1    'a'    1
plan("'a'" = 1, strings_in_dots = "not_deps") # does not affect output names
##   output code
## 1    'a'    1
# plan('"a"' = 1) # error: output names can't be double-quoted

The following demonstrates a common mistake. If you do not enforce double-quoting below (with strings_in_dots = "not_deps"), the file 'x' will depend on itself, creating a circular dependency. (Quoting is tricky because R unavoidably converts single quotes to double quotes when it parses and deparses expressions.)

p = plan(x = saveRDS(1, "x"), y = saveRDS(2, "y"), file_outputs = TRUE)
p
##   output            code
## 1    'x' saveRDS(1, 'x')
## 2    'y' saveRDS(2, 'y')
# check(p) # quits in error
# make(p)  # quits in error
p = plan(x = saveRDS(1, "x"), y = saveRDS(2, "y"), 
         file_outputs = TRUE, strings_in_dots = "not_deps")
p
##   output            code
## 1    'x' saveRDS(1, "x")
## 2    'y' saveRDS(2, "y")
check(p)
## Double-quoted strings were found in plan$code.
## Should these be single-quoted instead?
## Remember: single-quoted strings are file dependencies/outputs
## and double-quoted strings are just ordinary strings.
## 
## output: 'x' 
## strings in code: "x" 
## 
## output: 'y' 
## strings in code: "y"

More examples:

as_file(letters[1:4])
## [1] "'a'" "'b'" "'c'" "'d'"
a = 4
plan(list = c(x = a, "'file'" = "readRDS(2, 'input.rds')"))
##   output                    code
## 1      x                       4
## 2 'file' readRDS(2, 'input.rds')
library(eply) # for quotes(), strings(), and unquote()
quotes(1:5, single = TRUE)
## [1] "'1'" "'2'" "'3'" "'4'" "'5'"
unquote("'not_a_file'")
## [1] "not_a_file"
strings(these, are, strings)
## [1] "these"   "are"     "strings"

Expanding a workflow plan

A realistic example

Let’s turn to a more realistic workflow, the kind you might use for a statistical analysis. First, we plan to generate a couple of datasets with the user-defined functions my_large() and my_small().

data = plan(large = my_large(), small = my_small())

We’ll use these methods of analysis.

methods = plan(reg = regression(..dataset..),
               rf = random_forest(..dataset..))

To apply each method to each dataset, expand the methods data frame, substituting each dataset name for ..dataset.. in turn.

myanalyses = analyses(methods, data)
myanalyses
##      output                 code
## 1 reg_large    regression(large)
## 2 reg_small    regression(small)
## 3  rf_large random_forest(large)
## 4  rf_small random_forest(small)

You can generate multiple summaries of each analysis.

summary_types = plan(
  stats = summary_statistics(..analysis..),
  error = mean_squared_error(..analysis.., ..dataset..))
mysummaries = summaries(summary_types, analyses = myanalyses, datasets = data)
mysummaries[3:10,]
##             output                                 code
## 3  stats_reg_large        summary_statistics(reg_large)
## 4  stats_reg_small        summary_statistics(reg_small)
## 5   stats_rf_large         summary_statistics(rf_large)
## 6   stats_rf_small         summary_statistics(rf_small)
## 7  error_reg_large mean_squared_error(reg_large, large)
## 8  error_reg_small mean_squared_error(reg_small, small)
## 9   error_rf_large  mean_squared_error(rf_large, large)
## 10  error_rf_small  mean_squared_error(rf_small, small)

Summaries are grouped together, which is convenient for post-processing. (See the gather argument.)

mysummaries[1:2,]
##   output
## 1  error
## 2  stats
##                                                                                                                                                 code
## 1 list(error_reg_large = error_reg_large, error_reg_small = error_reg_small, error_rf_large = error_rf_large, \n    error_rf_small = error_rf_small)
## 2 list(stats_reg_large = stats_reg_large, stats_reg_small = stats_reg_small, stats_rf_large = stats_rf_large, \n    stats_rf_small = stats_rf_small)

Some external file outputs may follow.

out = plan(my_table.csv = save_summaries(stats),
           my_plot.pdf = plot_errors(error), file_outputs = TRUE)

If you have dynamic knitr reports at the very end (say, my_report.Rmd), you will need to manually declare any dependencies loaded into code chunks with loadd() or readd(). This ensures that the reports are rebuilt when their dependencies change.

report_depends = plan(deps = c(stats, error))

reports = plan(
  my_report.md = my_knit('my_report.Rmd', deps),
  my_report.html = my_render('my_report.md', deps),
  file_outputs = TRUE)
reports
##             output                            code
## 1   'my_report.md'  my_knit('my_report.Rmd', deps)
## 2 'my_report.html' my_render('my_report.md', deps)

where

my_knit = function(file, ...) knitr::knit(file)
my_render = function(file, ...) rmarkdown::render(file)
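For reference, here is a sketch of what a code chunk inside my_report.Rmd might contain (illustrative only; you would write the report by hand), so that declaring deps as a dependency is actually meaningful:

stats = readd(stats) # pull the gathered summaries from drake's cache
error = readd(error)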

Finally, gather all your commands into a single data frame and run the project.

my_plan = rbind(data, myanalyses, mysummaries, out, 
  report_depends, reports)
tmp = file.create("my_report.Rmd") # You would write this by hand.
check(my_plan)
# make(my_plan)
tmp = file.remove("my_report.Rmd")

More flexibility for generating workflow plans

If your workflow does not fit the rigid datasets/analyses/summaries framework, check out functions expand(), evaluate(), and gather().

df = plan(data = simulate(center = MU, scale = SIGMA))
df
##   output                                 code
## 1   data simulate(center = MU, scale = SIGMA)
df = expand(df, values = c("rep1", "rep2"))
df
##      output                                 code
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2)
##        output                                code
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
##      output                                code
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
##      output                              code
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2   simulate(center = 2, scale = 1)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
##             output                              code
## 1  data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2    data_rep1_1_1   simulate(center = 1, scale = 1)
## 3   data_rep1_1_10  simulate(center = 1, scale = 10)
## 4  data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5    data_rep1_2_1   simulate(center = 2, scale = 1)
## 6   data_rep1_2_10  simulate(center = 2, scale = 10)
## 7  data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8    data_rep2_1_1   simulate(center = 1, scale = 1)
## 9   data_rep2_1_10  simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11   data_rep2_2_1   simulate(center = 2, scale = 1)
## 12  data_rep2_2_10  simulate(center = 2, scale = 10)
gather(df)
##   output                                               code
## 1 output list(data_rep1 = data_rep1, data_rep2 = data_rep2)
gather(df, output = "my_summaries", gather = "rbind")
##         output                                                code
## 1 my_summaries rbind(data_rep1 = data_rep1, data_rep2 = data_rep2)

High-performance computing

Parallel processes with Makefiles

As advertised, drake seamlessly integrates with Makefiles for high-performance computing. If Rtools (or another source of make) is installed on your system, make(..., makefile = TRUE, command = "make", args = "--jobs=4") will create a Makefile and then use it to distribute the build steps over four parallel R sessions. (Use run = FALSE to just write the Makefile without building any outputs.) Try the following yourself.

library(drake)
plan = example_plan("debug")
debug_setup()
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")
g = function(x){
  h(x) + i(x) + 1
}
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")

As the above code demonstrates, Make acts as more than just a job scheduler. Just like regular drake::make(), the Makefile knows what can safely be skipped. The detection of the required build steps happens in an initialization step in R that must be repeated at the beginning of every runthrough, which means the Makefile is not standalone. Only run the Makefile through make(..., makefile = TRUE), whether from an interactive R session, Rscript, R CMD BATCH, or something similar. Do not invoke make directly from the Linux command line.

The command argument lets you micromanage how the Makefile is invoked, keeping in mind that the Makefile will be in your working directory. For example, make(..., makefile = TRUE, command = "make", args = c("--jobs=8", "-s")) distributes the work over 8 parallel jobs and suppresses verbose console output from Make. On LSF systems, you could even replace "make" with "lsmake" in your command.
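For instance, a sketch for an LSF system (check your site's lsmake documentation for the exact flags to pass through args):

make(my_plan, makefile = TRUE, command = "lsmake")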

To use packages and global options in your Makefile-accelerated workflow, use the packages and global arguments of make(). The packages argument defaults to loadedNamespaces(), so calling library() before make() should be sufficient most of the time. If you need to control the order in which your packages load in the individual build steps, use packages to list them from first to last. The code in global can also load packages, but its main purpose is to set up anything else that needs to be in the global environment, such as global options. All the code chunks in global run in the global environment before the individual build steps. As before, the effects of packages and global are not reproducibly tracked.
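Here is a sketch of what that might look like (the package names and option are placeholders, and I am assuming global accepts character strings of R code, as described above):

make(my_plan, makefile = TRUE, command = "make", args = "--jobs=4",
  packages = c("MASS", "ggplot2"),              # attached first to last in every build step
  global = "options(stringsAsFactors = FALSE)") # run in the global environment before the build steps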

To prepend lines of code to your Makefile, use the prepend argument of make(). You can use it to write comments, define variables, and connect your work to a formal job scheduler, cluster, or supercomputer, which leads us to…

Distributed computing on a cluster or supercomputer

If you want to distribute your work over multiple nodes of a Slurm cluster, you can create a Makefile using the solution in this post. The following command submits jobs to Slurm to build individual outputs, with at most 8 jobs running simultaneously.

make(..., makefile = TRUE, command = "make", args = "--jobs=8",
  prepend = c(
    "SHELL = srun",
    ".SHELLFLAGS = <ARGS> bash -c"))

To make sure your work keeps running after you log out, save your R code to a file (say, my_file.R) and then run the following in the Linux command line.

nohup nice -19 R CMD BATCH my_file.R &

For job schedulers other than Slurm, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. Your my_file.R should end with the following.

make(..., makefile = TRUE, args = "--jobs=8", prepend = "SHELL = ./shell.sh")

where the file shell.sh contains

#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y

Now, in the Linux command line, enable execution with

chmod +x shell.sh

and then run as before with

nohup nice -19 R CMD BATCH my_file.R &

Regardless of the system, be sure that all nodes point to the same working directory so that they share the same hidden .drake/ storr cache. Do this with your shell.sh. For the Univa Grid Engine, for example, use the -cwd flag for qsub.