Drake is a workflow manager for R. When it runs a project, it automatically builds missing and outdated results while skipping over all the up-to-date output. This automation and reproducibility is important for data analysis workflows, especially large projects under heavy development.
The original idea of a time-saving reproducible build system extends back decades to GNU Make, which today helps data scientists as well as its original user base of compiled-language programmers. More recently, Rich FitzJohn created remake, a breakthrough reimagining of Make for R and the most important inspiration for drake. Drake is a fresh, minimalist reinterpretation of some of remake's pioneering fundamental concepts, scaled up for computationally demanding workflows, and it has several distinguishing features relative to remake at the time of writing.
Thanks also to Kirill Müller and Daniel Falster. They contributed code patches and enhancement ideas to my parallelRemake and remakeGenerator packages, which I have now subsumed into drake.
Windows users need Rtools to run make(..., makefile = TRUE) (calling system2("make") needs to be possible).
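To check that R can actually see a Make program, here is a quick sanity check using base R only (not part of drake):

Sys.which("make") # should return a non-empty path
system2("make", "--version") # should print the version of GNU Make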
Use the help_drake() function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page.
Drake was built to keep track of large and complicated statistical analysis workflows, but for now, let's start small. Here is a baby workflow plan to produce a variable named a.

- Evaluate 1 + 1 and assign the result to variable d.
- Evaluate 2 + 2 and assign the result to variable e.
- Evaluate 3 + 3 and assign the result to variable f.
- Using d and f, evaluate d + f and assign the result to variable c.
- Using d and e, evaluate d + e and assign the result to variable b.
- Using b and c, evaluate b + c and assign the result to variable a.

We represent this plan in a data frame with code and output.
library(drake)
x = example_plan("small")
x
## output code
## 1 a b + c
## 2 b d + e
## 3 c d + f
## 4 d 1 + 1
## 5 e 2 + 2
## 6 f 3 + 3
Use make() to run the six steps in the correct order. Keep in mind that the order is not unique: the first three build steps below (f, d, and e) could happen in any order, and steps 4 and 5 (c and b) could be interchanged as well.
check(x) # check for errors first
make(x)
## build f
## build d
## build e
## build c
## build b
## build a
readd(a) # see also loadd() and cached()
## [1] 14
The whole point of drake is to reproducibly track your output. If an object is already up to date, drake will skip it next time. (The output argument restricts make() to specific targets and their dependencies.)
make(x, output = c("c", "f"))
## skip f
## skip d
## skip c
When you change your code, drake brings your results up to date, doing the minimum amount of work necessary.
x$code[3] = "sqrt(d) + 2*f + 1" # new code for variable c
make(x)
## skip f
## skip d
## skip e
## build c
## skip b
## build a
readd(a)
## [1] 20.41421
make(x)
## skip f
## skip d
## skip e
## skip c
## skip b
## skip a
x$code[5] = "2*2*1" # variable e: previously 2 + 2, so the output value doesn't change
make(x)
## skip f
## skip d
## build e
## skip c
## skip b
## skip a
readd(a)
## [1] 20.41421
x$code[5] = "7/2" # new code for variable e
make(x)
## skip f
## skip d
## build e
## skip c
## build b
## build a
readd(a)
## [1] 19.91421
x$code[5] = "7 /2 # changes to comments and whitespace are ignored"
make(x)
## skip f
## skip d
## skip e
## skip c
## skip b
## skip a
Try these functions in an interactive R session at the root directory of your project (or a subdirectory, with the search argument). A quick sketch of their use follows the list.

- status() returns the running build status of each object so far ("skipped", "built", "imported", or "IN PROGRESS"). Unlisted objects have not yet been reached by make(). If make() is interrupted or quits in error, any output being built at the time will still be labeled "IN PROGRESS".
- session() returns the sessionInfo() from the last call to make().
- cached() lists the objects in the cache.
- built() lists the built and cached objects from the workflow plan.
- imported() lists the cached dependencies that were imported from envir or the calling environment (stay tuned).
- readd() returns an object from the cache.
- loadd() loads one or more objects from the cache into your workspace.
- find_project() returns the root directory of your drake project. Your current working directory needs to be inside your project.
- find_cache() shows the location of your project's hidden drake cache, which is a folder called .drake/
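For instance, after the calls to make() above, you could inspect the project like this (a quick sketch; the exact output depends on what you have built so far):

status() # "skipped", "built", "imported", or "IN PROGRESS" for each object
built() # objects built from the workflow plan
imported() # dependencies pulled in from your workspace
session() # sessionInfo() from the last make()
find_cache() # path to the hidden .drake/ folder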
at the root of your project. The cache was generated using storr. Prune your workflow to remove objects no longer in the plan.
cached()
## [1] "a" "b" "c" "d" "e" "f"
x = x[1:5,]
prune(x)
cached()
## [1] "a" "b" "c" "d" "e"
Use clean() to completely remove everything generated and tracked by make(). This is a nuclear option, so only use it if you are totally sure you want to start over from scratch.
clean() # removes the cached objects but keeps the hidden ".drake/" folder
cached()
## character(0)
clean(destroy = TRUE) # removes ".drake/"
To use prune() and clean(), you must be in the project's root directory, which you can find with find_project().
The code in your workflow plan may depend on functions you write yourself, data that you download from a website before every runthrough, etc. Drake just pulls these objects from your workspace and reproducibly tracks them. Try make(..., envir = my_workspace) if you want to use a custom R environment instead. WARNING: packages, global options, and other parts of the global environment are available, but NOT reproducibly tracked. (Use packrat to reproducibly manage your packages.)
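If you do want an isolated environment, a minimal sketch might look like the following (my_workspace and the toy plan y are made up for illustration; the demo below uses the global workspace instead):

my_workspace = new.env()
local({
  my_var = 1
  f = function(x) x + my_var
}, envir = my_workspace)
y = data.frame(output = "out", code = "f(2)")
make(y, envir = my_workspace) # imports come from my_workspace, not the global environment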
x = data.frame(output = c("out", "my_input"), code = c("my_input - 1", "f(2)"))
f = function(x) g(x) + 1
g = function(x) h(x) + 2
h = function(x) x^2 + my_var
# make(x) # quits in error because "my_var" is undefined
my_var = 1
make(x)
## import my_var
## import h
## import g
## import f
## build my_input
## build out
readd(out)
## [1] 7
make(x)
## import my_var
## import h
## import g
## import f
## skip my_input
## skip out
my_var = 2 # drake knows you changed "my_var"
make(x)
## import my_var
## import h
## import g
## import f
## build my_input
## build out
readd(out)
## [1] 8
Drake knows when your functions change, and it respects how functions are nested. Here, f() calls g(), and g() calls h(). So if h() changes, then everything depending on f() will be rebuilt.
h = function(x){ x - 10 + my_var}
make(x)
## import my_var
## import h
## import g
## import f
## build my_input
## build out
readd(out)
## [1] -4
But changes to comments and whitespace in functions are ignored.
h = function(x){
x-10+my_var
}
make(x)
## import my_var
## import h
## import g
## import f
## skip my_input
## skip out
readd(out)
## [1] -4
I repeat: only your workspace is reproducibly tracked (or envir in make()).
global = 10000
run = function(x) make(x)
run(x)
## build my_input
## skip out
readd(out)
## [1] -4
In addition, beware of automatically-loaded '.RData' files that could wreck your workspace. If you have an '.RData' file in your working directory, drake warns you on load.
save.image()
drake:::.onLoad()
## Warning in drake:::.onLoad(): Auto-saved workspace file '.RData' detected.
## This is bad for reproducible code. Drake says you should remove it with
## unlink('.RData').
unlink('.RData')
Sometimes character strings are just plain strings, but other times they are names of reproducibly-tracked files that your workflow depends on. Drake tells the difference with quoting: double-quoted strings are ordinary strings, and strings wrapped in single quotes stand for file dependencies. The functions as_file(), quotes(), strings(), and unquote() ease some of the string-manipulation burden. (The last three are from the eply package.)
saveRDS("imported data", file = "imported_file")
x = data.frame(
output = c("'first'", "message", "'second'", "contents_of_imported_file"),
code = c(
"saveRDS(\"hello world\", \"first\")",
"readRDS('first')",
"saveRDS(message, \"second\")",
"readRDS('imported_file')"))
x
## output code
## 1 'first' saveRDS("hello world", "first")
## 2 message readRDS('first')
## 3 'second' saveRDS(message, "second")
## 4 contents_of_imported_file readRDS('imported_file')
In the first row, notice that first is single-quoted on the left and double-quoted on the right. This tells drake to expect an output file named first, which does not depend on the character string "first". If double quotes were on the right side as well, the file first would depend on itself, and make() would quit in error. On the other hand, message uses a single-quoted 'first' in its code, telling drake to treat the file first as a file dependency.
Before you run your workflow, use check() to screen for circular dependencies, missing files, and possible mistakes in quoting. (make() checks the first two of these.)
check(x)
## Double-quoted strings were found in plan$code.
## Should these be single-quoted instead?
## Remember: single-quoted strings are file dependencies/outputs
## and double-quoted strings are just ordinary strings.
##
## output: 'first'
## strings in code: "hello world" "first"
##
## output: 'second'
## strings in code: "second"
make(x, output = "'second'") # Use single quotes here too.
## build 'first'
## build message
## build 'second'
make(x)
## skip 'first'
## import 'imported_file'
## skip message
## build contents_of_imported_file
## skip 'second'
readRDS("second")
## [1] "hello world"
readd(contents_of_imported_file)
## [1] "imported data"
readd("'second'") # Only the fingerprints of external files are cached.
## $hash
## [1] "0544f5e936e8320dbca44450c4550074"
##
## $mtime
## [1] "2017-02-27 09:06:39 EST"
Both imported and output files are reproducibly tracked.
make(x)
## skip 'first'
## import 'imported_file'
## skip message
## skip contents_of_imported_file
## skip 'second'
cached()
## [1] "'first'" "'imported_file'"
## [3] "'second'" "contents_of_imported_file"
## [5] "message"
list.files()
## [1] "drake.R" "drake.Rmd" "drake.html" "first"
## [5] "imported_file" "second"
Cleaning and pruning remove output files, but not imported input files.
clean()
cached()
## character(0)
list.files()
## [1] "drake.R" "drake.Rmd" "drake.html" "imported_file"
unlink("imported_file")
If you damage or delete any output files, drake will recover them for you.
file_plan = plan(list = c(
"'a'" = "saveRDS(17, \"a\")",
"'b'" = "saveRDS(1 + readRDS('a'), \"b\")",
"c" = "readRDS('b')"))
file_plan
## output code
## 1 'a' saveRDS(17, "a")
## 2 'b' saveRDS(1 + readRDS('a'), "b")
## 3 c readRDS('b')
make(file_plan, verbose = FALSE) # first runthrough
readRDS('b')
## [1] 18
saveRDS(5, 'b') # damage the file 'b'
make(file_plan)
## skip 'a'
## build 'b'
## skip c
readRDS('b')
## [1] 18
clean()
WARNING: drake does not look for file dependencies inside the bodies of imported functions. Inside imported functions, single quotes are not given any special treatment, and single quotes are unavoidably turned into double quotes when a function is parsed or tidied. So if you import f <- function(x) read.csv('my_file.csv') from your calling environment, then 'my_file.csv' will not necessarily be treated as a file dependency.
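A workaround consistent with the quoting rules above is to keep the single-quoted file name in the plan code, where drake does track it, and pass it to your function as an argument. A rough sketch (read_my_data() is a hypothetical helper):

read_my_data = function(path) read.csv(path) # no special quoting needed inside the function
x = plan(dataset = read_my_data('my_file.csv')) # the single-quoted file name is tracked here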
The plan() function errs on the side of single-quoting to make sure file dependencies are not forgotten. Still, you should always run check() before make(). Read on for more about plan().
Drake has a few built-in workflow plans.
example_plans()
## [1] "small" "debug"
example_plan("small")
## output code
## 1 a b + c
## 2 b d + e
## 3 c d + f
## 4 d 1 + 1
## 5 e 2 + 2
## 6 f 3 + 3
example_plan("debug")
## output code
## 1 a as.numeric(as.matrix(read.csv('input')))
## 2 b a + 1
## 3 c a + 2
## 4 d f(b + c)
## 5 e g(b + d)
## 6 'd' saveRDS(d, "d")
## 7 'e' saveRDS(e, "e")
## 8 final readRDS('e')
The "debug" plan relies on external functions and files that you should load with debug_setup() before calling make(). When you're done, call debug_cleanup() to remove the files for the example.
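A minimal sketch of running this example from start to finish (output omitted):

library(drake)
plan = example_plan("debug")
debug_setup() # provides the 'input' file and the functions the plan needs
make(plan)
readd(final)
debug_cleanup() # removes the example's files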
Drake's plan() function helps create workflow plan data frames, and the as_file() function wraps strings in single quotes so drake can recognize them as file names. The eply package has functions quotes(), strings(), and unquote() to help with character manipulation and quoting. I stress: single-quoted strings denote file dependencies, and double-quoted strings are just ordinary character strings. The plan() function errs on the side of single-quoting to make sure file dependencies are not forgotten. For more control over quoting, either use the strings_in_dots argument or disregard the freeform dots '...' in favor of the list argument.
plan(x = a, y = readRDS(2, 'input.rds'))
## output code
## 1 x a
## 2 y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, 'input.rds'),
strings_in_dots = "file_deps") # default
## output code
## 1 x a
## 2 y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, 'input.rds'),
strings_in_dots = "not_deps")
## output code
## 1 x a
## 2 y readRDS(2, "input.rds")
plan(x = a, y = readRDS(2, "input.rds"))
## output code
## 1 x a
## 2 y readRDS(2, 'input.rds')
plan(x = a, y = readRDS(2, "input.rds"),
strings_in_dots = "not_deps")
## output code
## 1 x a
## 2 y readRDS(2, "input.rds")
plan(list = c(x = "a", y = "readRDS(\"some_string\", 'input.rds')"))
## output code
## 1 x a
## 2 y readRDS("some_string", 'input.rds')
plan('a' = 1)
## output code
## 1 a 1
plan("'a'" = 1)
## output code
## 1 'a' 1
plan("'a'" = 1, strings_in_dots = "not_deps") # does not affect output names
## output code
## 1 'a' 1
# plan('"a"' = 1) # error: output names can't be double-quoted
The following demonstrates a common mistake. If you fail to enforce double-quoting below, then the file 'x' will depend on itself, creating a vicious circularity. (Quoting is tricky because R unavoidably converts single quotes to double quotes when it parses/deparses expressions.)
p = plan(x = saveRDS(1, "x"), y = saveRDS(2, "y"), file_outputs = TRUE)
p
## output code
## 1 'x' saveRDS(1, 'x')
## 2 'y' saveRDS(2, 'y')
# check(p) # quits in error
# make(p) # quits in error
p = plan(x = saveRDS(1, "x"), y = saveRDS(2, "y"),
file_outputs = TRUE, strings_in_dots = "not_deps")
p
## output code
## 1 'x' saveRDS(1, "x")
## 2 'y' saveRDS(2, "y")
check(p)
## Double-quoted strings were found in plan$code.
## Should these be single-quoted instead?
## Remember: single-quoted strings are file dependencies/outputs
## and double-quoted strings are just ordinary strings.
##
## output: 'x'
## strings in code: "x"
##
## output: 'y'
## strings in code: "y"
More examples:
as_file(letters[1:4])
## [1] "'a'" "'b'" "'c'" "'d'"
a = 4
plan(list = c(x = a, "'file'" = "readRDS(2, 'input.rds')"))
## output code
## 1 x 4
## 2 'file' readRDS(2, 'input.rds')
library(eply) # for quotes(), strings(), and unquote()
quotes(1:5, single = TRUE)
## [1] "'1'" "'2'" "'3'" "'4'" "'5'"
unquote("'not_a_file'")
## [1] "not_a_file"
strings(these, are, strings)
## [1] "these" "are" "strings"
Let's turn to a more realistic workflow, the kind you might use for a statistical analysis. First we'll plan to generate a couple of datasets with user-defined functions my_large() and my_small().
data = plan(large = my_large(), small = my_small())
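These are functions you would write yourself. Purely hypothetical stand-ins might look like the following; regression(), random_forest(), and the summary functions later on would be defined similarly by the user.

my_large = function() data.frame(x = rnorm(1000), y = rnorm(1000))
my_small = function() data.frame(x = rnorm(48), y = rnorm(48))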
We’ll use these methods of analysis.
methods = plan(reg = regression(..dataset..),
rf = random_forest(..dataset..))
To apply each method to each dataset, expand out the methods data frame, substituting each dataset name for the ..dataset.. wildcard in turn.
myanalyses = analyses(methods, data)
myanalyses
## output code
## 1 reg_large regression(large)
## 2 reg_small regression(small)
## 3 rf_large random_forest(large)
## 4 rf_small random_forest(small)
You can generate multiple summaries of each analysis.
summary_types = plan(
stats = summary_statistics(..analysis..),
error = mean_squared_error(..analysis.., ..dataset..))
mysummaries = summaries(summary_types, analyses = myanalyses, datasets = data)
mysummaries[3:10,]
## output code
## 3 stats_reg_large summary_statistics(reg_large)
## 4 stats_reg_small summary_statistics(reg_small)
## 5 stats_rf_large summary_statistics(rf_large)
## 6 stats_rf_small summary_statistics(rf_small)
## 7 error_reg_large mean_squared_error(reg_large, large)
## 8 error_reg_small mean_squared_error(reg_small, small)
## 9 error_rf_large mean_squared_error(rf_large, large)
## 10 error_rf_small mean_squared_error(rf_small, small)
Summaries are grouped together, which is convenient for post-processing. (See the gather argument.)
mysummaries[1:2,]
## output
## 1 error
## 2 stats
## code
## 1 list(error_reg_large = error_reg_large, error_reg_small = error_reg_small, error_rf_large = error_rf_large, \n error_rf_small = error_rf_small)
## 2 list(stats_reg_large = stats_reg_large, stats_reg_small = stats_reg_small, stats_rf_large = stats_rf_large, \n stats_rf_small = stats_rf_small)
Some external file outputs may follow.
out = plan(my_table.csv = save_summaries(stats),
my_plot.pdf = plot_errors(error), file_outputs = TRUE)
If you have dynamic knitr reports at the very end (say, my_report.Rmd), you will need to manually declare any dependencies loaded into code chunks with loadd() or readd(). This ensures that the reports are rebuilt when their dependencies change.
report_depends = plan(deps = c(stats, error))
reports = plan(
my_report.md = my_knit('my_report.Rmd', deps),
my_report.html = my_render('my_report.md', deps),
file_outputs = TRUE)
reports
## output code
## 1 'my_report.md' my_knit('my_report.Rmd', deps)
## 2 'my_report.html' my_render('my_report.md', deps)
where
my_knit = function(file, ...) knitr::knit(file)
my_render = function(file, ...) rmarkdown::render(file)
Finally, gather all your commands into a single data frame and run the project.
my_plan = rbind(data, myanalyses, mysummaries, out,
report_depends, reports)
tmp = file.create("my_report.Rmd") # You would write this by hand.
check(my_plan)
# make(my_plan)
tmp = file.remove("my_report.Rmd")
If your workflow does not fit the rigid datasets/analyses/summaries framework, check out the functions expand(), evaluate(), and gather().
df = plan(data = simulate(center = MU, scale = SIGMA))
df
## output code
## 1 data simulate(center = MU, scale = SIGMA)
df = expand(df, values = c("rep1", "rep2"))
df
## output code
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2)
## output code
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
## output code
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
## output code
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2 simulate(center = 2, scale = 1)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
## output code
## 1 data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2 data_rep1_1_1 simulate(center = 1, scale = 1)
## 3 data_rep1_1_10 simulate(center = 1, scale = 10)
## 4 data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5 data_rep1_2_1 simulate(center = 2, scale = 1)
## 6 data_rep1_2_10 simulate(center = 2, scale = 10)
## 7 data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8 data_rep2_1_1 simulate(center = 1, scale = 1)
## 9 data_rep2_1_10 simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11 data_rep2_2_1 simulate(center = 2, scale = 1)
## 12 data_rep2_2_10 simulate(center = 2, scale = 10)
gather(df)
## output code
## 1 output list(data_rep1 = data_rep1, data_rep2 = data_rep2)
gather(df, output = "my_summaries", gather = "rbind")
## output code
## 1 my_summaries rbind(data_rep1 = data_rep1, data_rep2 = data_rep2)
As advertised, drake seamlessly integrates with Makefiles for high-performance computing. If command-line Make is available on your system (via Rtools on Windows), make(..., makefile = TRUE, command = "make", args = "--jobs=4") will create a Makefile and then use it to distribute the build steps over four parallel R sessions. (Use run = FALSE to just write the Makefile and not build any outputs.) Try the following yourself.
library(drake)
plan = example_plan("debug")
debug_setup()
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")
g = function(x){
h(x) + i(x) + 1
}
make(plan, makefile = TRUE, command = "make", args = "--jobs=4")
As the above code demonstrates, Make acts as more than just a job scheduler: just like regular drake::make(), the Makefile knows what can safely be skipped. The detection of the required build steps happens in an initialization step in R that must be repeated at the beginning of every runthrough, which means the Makefile is not standalone. Only run the Makefile through make(..., makefile = TRUE) in an interactive R session, Rscript, R CMD BATCH, or something similar. Do not invoke make directly from the Linux command line.
The command argument lets you micromanage how the Makefile is called, keeping in mind that the Makefile will be in your working directory. For example, make(..., makefile = TRUE, command = "make", args = c("--jobs=8", "-s")) distributes the work over 8 parallel jobs and suppresses verbose console output from Make. On LSF systems, you could even replace "make" with "lsmake" in your command.
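For example, on an LSF system something like the following might work (hypothetical; check your site's lsmake options):

make(my_plan, makefile = TRUE, command = "lsmake", args = "-j 8")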
To use packages and global options in your Makefile-accelerated workflow, you need the packages and global arguments of make(). The packages argument defaults to loadedNamespaces(), so calling library() before make() should be sufficient most of the time. If you need to control the order in which your packages load in the individual build steps, use packages to list your packages from first to last. The code in global can also be used to load packages, but its main purpose is to set up anything else that needs to be in the global environment, such as global options. All the code chunks in global are run in the global environment before the individual build steps. As before, the effects of packages and global are not reproducibly tracked.
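A sketch of what that might look like, assuming (as described above) that packages takes a character vector of package names and global takes lines of R code to run first:

make(my_plan, makefile = TRUE, command = "make", args = "--jobs=4",
  packages = c("MASS", "ggplot2"), # loaded in this order inside each build step
  global = c("options(warn = 1)", "my_setting <- TRUE")) # run in the global environment first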
To prepend lines of code to your Makefile, use the prepend argument of make(). This can be used to write comments, define variables, and connect your work to a formal job scheduler, cluster, or supercomputer. Which leads us to…
If you want to distribute your work over multiple nodes of a Slurm cluster, you can create a Makefile using the solution in this post. The following command submits jobs to Slurm to build individual outputs, with at most 8 jobs running simultaneously.
make(..., makefile = TRUE, command = "make", args = "--jobs=8",
prepend = c(
"SHELL = srun",
".SHELLFLAGS = <ARGS> bash -c"))
To make sure your work keeps running after you log out, save your R code to a file (say, my_file.R) and then run the following in the Linux command line.
nohup nice -19 R CMD BATCH my_file.R &
For job schedulers other than Slurm, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. Your my_file.R should end with the following.
make(..., makefile = TRUE, args = "--jobs=8", prepend = "SHELL = ./shell.sh")
where the file shell.sh contains
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
Now, in the Linux command line, enable execution with
chmod +x shell.sh
and then run as before with
nohup nice -19 R CMD BATCH my_file.R &
Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .drake/ storr cache. Do this with your shell.sh. For the Univa Grid Engine, for example, use the -cwd flag for qsub.