Drake is a workflow manager and build system for R. Organize your work in a data frame. Then make() it.
library(drake)
load_basic_example() # Also (over)writes report.Rmd. `example_drake("basic")`, `vignette("quickstart")`.
my_plan
## target command
## 1 'report.md' my_knit('report.Rmd', report_dependencies)
## 2 small simulate(5)
## 3 large simulate(50)
## 4 report_dependencies c(small, large, coef_regression2_small)
## 5 regression1_small reg1(small)
## 6 regression1_large reg1(large)
## 7 regression2_small reg2(small)
## 8 regression2_large reg2(large)
## 9 summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small coef(regression1_small)
## 14 coef_regression1_large coef(regression1_large)
## 15 coef_regression2_small coef(regression2_small)
## 16 coef_regression2_large coef(regression2_large)
make(my_plan)
install.packages("drake") # latest CRAN release
devtools::install_github("wlandau-lilly/drake@v3.1.0", build = TRUE) # latest GitHub release
devtools::install_github("wlandau-lilly/drake", build = TRUE) # development version
For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.
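As an informal check (not part of drake's API), you can ask R whether a make program is already on your PATH before trying Makefile parallelism:

```r
# Check whether a `make` utility is available on the system PATH.
# Sys.which() returns the full path to the program, or "" if it is not found.
Sys.which("make")
```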
library(drake)
load_basic_example() # Also (over)writes report.Rmd. `example_drake("basic")`, `vignette("quickstart")`.
plot_graph(my_plan) # Hover, click, drag, zoom, pan. Try file = "graph.html" and targets_only = TRUE.
outdated(my_plan) # Which targets need to be (re)built?
missed(my_plan) # Are you missing anything from your workspace?
check(my_plan) # Are you missing files? Is your workflow plan okay?
make(my_plan) # Run the workflow.
outdated(my_plan) # Everything is up to date.
plot_graph(my_plan) # The graph also shows what is up to date.
Dive deeper into the built-in examples.
example_drake("basic") # Write the code files of the canonical tutorial.
examples_drake() # List the other examples.
vignette("quickstart") # Same as https://cran.r-project.org/package=drake/vignettes/quickstart.html
Besides make(), here are some useful functions to learn about drake,
load_basic_example()
drake_tip()
examples_drake()
example_drake()
set up your workflow plan,
plan()
analyses()
summaries()
evaluate()
expand()
gather()
wildcard() # from the wildcard package
explore the dependency network,
outdated()
missed()
plot_graph()
dataframes_graph()
render_graph()
read_graph()
deps()
tracked()
interact with the cache,
clean()
cached()
imported()
built()
build_times()
readd()
loadd()
find_project()
find_cache()
debug your work,
check()
session()
in_progress()
progress()
config()
read_config()
and speed up your project with parallel computing.
make() # with jobs > 2
max_useful_jobs()
parallelism_choices()
shell_file()
The CRAN page links to multiple rendered vignettes.
vignette(package = "drake") # List the vignettes.
vignette("drake") # High-level intro.
vignette("quickstart") # Walk through a simple example.
vignette("caution") # Avoid common pitfalls.
Please refer to TROUBLESHOOTING.md on the GitHub page for instructions.
There is room to improve the conversation and the landscape of reproducibility in the R and Statistics communities. At a more basic level than scientific replicability, literate programming, and version control, reproducibility carries an implicit promise that the alleged results of an analysis really do match the code. Drake helps keep this promise by tracking the relationships among the components of the analysis, a rare and effective approach that also saves time.
library(drake)
load_basic_example()
outdated(my_plan) # Which targets need to be (re)built?
make(my_plan) # Build what needs to be built.
outdated(my_plan) # Everything is up to date.
reg2 = function(d){ # Change one of your functions.
d$x3 = d$x^3
lm(y ~ x3, data = d)
}
outdated(my_plan) # Some targets depend on reg2().
plot_graph(my_plan) # Set targets_only to TRUE for smaller graphs.
make(my_plan) # Rebuild just the outdated targets.
outdated(my_plan) # Everything is up to date again.
plot_graph(my_plan) # The colors changed in the graph.
Like GNU Make, drake arranges the intermediate steps of your workflow in a dependency network. This network is the key to drake's parallel computing. For example, consider the network graph of the basic example.
library(drake)
load_basic_example()
make(my_plan, jobs = 2, verbose = FALSE) # Parallelize over 2 jobs.
reg2 = function(d){ # Change a dependency.
d$x3 = d$x^3
lm(y ~ x3, data = d)
}
# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")
When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs.
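As a rough illustration of the column-by-column idea (this is not drake's actual internal code), one "column" of mutually independent targets could be dispatched with parallel::mclapply(), since each target depends only on results from earlier columns:

```r
library(parallel)

# Hypothetical "column" of independent targets: each build function depends
# only on results from previous columns, so the targets can run concurrently.
column <- list(
  small = function() data.frame(x = rnorm(5),  y = rnorm(5)),
  large = function() data.frame(x = rnorm(50), y = rnorm(50))
)

# Dispatch the whole column over 2 workers.
# mclapply() uses forked processes, so this does not parallelize on Windows.
results <- mclapply(column, function(build) build(), mc.cores = 2)
names(results) # "small" "large"
```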
The function max_useful_jobs() suggests an appropriate number of jobs, taking into account which targets are already up to date. Try out the following in a fresh R session.
library(drake)
load_basic_example()
plot_graph(my_plan) # Look at the graph to make sense of the output.
max_useful_jobs(my_plan) # 8
max_useful_jobs(my_plan, imports = "files") # 8
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 8
make(my_plan)
plot_graph(my_plan)
# Ignore the targets already built.
max_useful_jobs(my_plan) # 1
max_useful_jobs(my_plan, imports = "files") # 1
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 0
# Change a function so some targets are now out of date.
reg2 = function(d){
d$x3 = d$x^3
lm(y ~ x3, data = d)
}
plot_graph(my_plan)
max_useful_jobs(my_plan) # 4
max_useful_jobs(my_plan, imports = "files") # 4
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 4
As for how the parallelism is implemented, you can choose from multiple built-in backends.
make(..., parallelism = "mclapply", jobs = 2) invokes parallel::mclapply() under the hood and distributes the work over at most two independent processes (set with jobs). This is an ideal choice for low-overhead single-node parallelism, but it does not work on Windows.
make(..., parallelism = "parLapply", jobs = 2) invokes parallel::parLapply() under the hood. This option is similar to mclapply except that it works on Windows and costs a little extra time up front.
make(..., parallelism = "Makefile", jobs = 2) creates a proper Makefile to distribute the work over multiple independent R sessions. With custom settings, you can distribute the R sessions over different jobs/nodes on a cluster. The build order may be different here because all the imports are processed before any of the targets are built with the Makefile. That means plot_graph(), dataframes_graph(), and max_useful_jobs() behave differently for Makefile parallelism. For more details, see the quickstart vignette.
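As a sketch of the cluster use case, the quickstart vignette describes prepending lines to the generated Makefile so that each target's R session launches through your scheduler. The prepend argument below is based on this release of drake, and "srun" assumes a SLURM cluster; check ?make on your installed version before relying on it.

```r
library(drake)
load_basic_example()

# Sketch: run each Makefile recipe through SLURM's srun so targets build
# on cluster nodes. `prepend` adds lines to the top of the generated Makefile.
make(
  my_plan,
  parallelism = "Makefile",
  jobs = 4,
  prepend = "SHELL=srun"
)
```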