With drake, there is room for error with respect to tracking dependencies, managing environments and workspaces, etc. For example, in some edge cases, it is possible to trick drake into ignoring dependencies. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can submit your own bug reports as well. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version. In this vignette, I will try to address some of the main issues to keep in mind for writing reproducible workflows safely.
For example, consider a workflow plan built with wildcard templating:
template <- plan(x = process(..setting..))
processed <- evaluate(template, wildcard = "..setting..",
  values = c("\"option1\"", "\"option2\""))
gathered <- gather(processed, target = "bad_target")
my_plan <- rbind(processed, gathered)
my_plan
##        target                                                     command
## 1 x_"option1"                                          process("option1")
## 2 x_"option2"                                          process("option2")
## 3  bad_target list(x_"option1" = x_"option1", x_"option2" = x_"option2")
Here, make(my_plan) would generate an error because the command for bad_target contains illegal symbols. To avoid this sort of problem, please keep literal quotes out of your wildcards. Instead, quote the wildcard inside the template command and set strings_in_dots = "literals":
template <- plan(x = process("..setting.."), strings_in_dots = "literals")
processed <- evaluate(template, wildcard = "..setting..",
  values = c("option1", "option2"))
gathered <- gather(processed, target = "bad_target")
my_plan <- rbind(processed, gathered)
my_plan
##       target                                             command
## 1  x_option1                                  process("option1")
## 2  x_option2                                  process("option2")
## 3 bad_target list(x_option1 = x_option1, x_option2 = x_option2)
To be safe, use check(my_plan) to screen for problems like this one.
As of version 3.0.0, drake's execution environment is the user's workspace by default. As a result, the workspace is vulnerable to side effects of make(). To protect your workspace, you may want to create a custom evaluation environment containing all your imported objects and then pass it to the envir argument of make(). Here is how.
library(drake)
envir <- new.env(parent = globalenv())
eval(expression({
  f <- function(x){
    g(x) + 1
  }
  g <- function(x){
    x + 1
  }
}), envir = envir)
myplan <- plan(out = f(1:3))
make(myplan, envir = envir)
## check 1 item
## import g
## check 1 item
## import f
## check 1 item
## target out
ls() # Check that your workspace did not change.
## [1] "envir" "gathered" "my_plan" "myplan" "processed" "template"
ls(envir) # Check your evaluation environment.
## [1] "f" "g" "out"
envir$out
## [1] 3 4 5
readd(out)
## [1] 3 4 5
In your workflow plan data frame (produced by plan() and accepted by make()), your commands can usually be flexible R expressions.
plan(target1 = 1 + 1 - sqrt(sqrt(3)),
  target2 = my_function(web_scraped_data) %>% my_tidy)
##    target                                    command
## 1 target1                      1 + 1 - sqrt(sqrt(3))
## 2 target2 my_function(web_scraped_data) %>% my_tidy
However, please try to avoid formulas and function definitions in your commands. You may be able to get away with plan(f = function(x){x + 1}) or plan(f = y ~ x) in some use cases, but be careful. Rather than using commands for this, it is better to define functions and formulas in your workspace before calling make(). (Alternatively, use the envir argument of make() to tightly control which imported functions are available.) Use check() to screen and quality-control your workflow plan data frame, tracked() to see which items are reproducibly tracked, and plot_graph() and build_graph() to see the dependency structure of your project.
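For example, instead of plan(f = function(x){x + 1}), a safer sketch defines the function up front and calls it from a command (the target name y here is arbitrary):
f <- function(x){ # an ordinary import, reproducibly tracked by drake
  x + 1
}
my_plan <- plan(y = f(1))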
Consider the workflow plan data frame below.
my_plan <- plan(list = c(a = "x <- 1; return(x)"))
my_plan
##   target           command
## 1      a x <- 1; return(x)
deps(my_plan$command[1])
## [1] "return"
Here, x is a mere side effect of the command, and it will not be reproducibly tracked. Worse, if you add a proper target called x to the workflow plan data frame, the results of your analysis may not be correct. Side effects of commands can be unpredictable, so please try to minimize them. It is good practice to write your commands as function calls. Nested function calls are okay.
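For instance, a cleaner version of the plan above moves the logic into a workspace function (make_x() is a hypothetical name) so the command itself has no side effects:
make_x <- function(){
  x <- 1 # local to the function, not a stray side effect on the workspace
  x
}
my_plan <- plan(a = make_x())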
During the execution of a drake project, please do not change your working directory (with setwd(), for example). At the very least, if a command in your workflow plan does change the working directory, make sure it returns to the original working directory before the command completes. Drake relies on a hidden cache (the .drake/ folder) at the root of your project, so navigating to a different folder may confuse it.
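If a command absolutely must change directories, a defensive sketch (with a hypothetical subdir/ folder and data.rds file) restores the original directory on exit:
read_from_subdir <- function(){
  old <- setwd("subdir") # hypothetical folder inside the project
  on.exit(setwd(old)) # restore the original directory, even on error
  readRDS("data.rds") # hypothetical file
}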
Yes, you can declare a file target or input file by enclosing it in single quotes in your workflow plan data frame. But entire directories (i.e. folders) cannot yet be tracked this way. Tracking directories is a tricky problem, and lots of individual edge cases need to be ironed out before I can deliver a clean, reliable solution. Please see issue 12 for updates and a discussion.
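As a sketch of the single-quote convention (input.rds and process() are hypothetical names), note how deps() picks up the quoted file:
my_plan <- plan(list = c(processed = "process(readRDS('input.rds'))"))
deps(my_plan$command[1]) # should include "'input.rds'" as a file dependency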
As the user, you should take responsibility for how the steps of your workflow are interconnected. This affects which targets are built and which ones are skipped. There are several ways to explore these dependency relationships.
load_basic_example()
my_plan
##                    target                                       command
## 1             'report.md'   my_knit('report.Rmd', report_dependencies)
## 2                   small                                   simulate(5)
## 3                   large                                  simulate(50)
## 4     report_dependencies      c(small, large, coef_regression2_small)
## 5       regression1_small                                   reg1(small)
## 6       regression1_large                                   reg1(large)
## 7       regression2_small                                   reg2(small)
## 8       regression2_large                                   reg2(large)
## 9  summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small                      coef(regression1_small)
## 14 coef_regression1_large                      coef(regression1_large)
## 15 coef_regression2_small                      coef(regression2_small)
## 16 coef_regression2_large                      coef(regression2_large)
# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")
You can also check the dependencies of individual targets.
deps(reg2)
## [1] "lm"
deps(my_plan$command[1]) # report.Rmd is single-quoted because it is a file dependency.
## [1] "'report.Rmd'" "my_knit" "report_dependencies"
deps(my_plan$command[16])
## [1] "coef" "regression2_large"
List all the reproducibly-tracked objects and files, including imports and targets.
tracked(my_plan, targets = "small")
## [1] "small" "simulate" "data.frame" "rpois"
## [5] "stats::rnorm"
tracked(my_plan)
## [1] "'report.md'" "small"
## [3] "large" "report_dependencies"
## [5] "regression1_small" "regression1_large"
## [7] "regression2_small" "regression2_large"
## [9] "summ_regression1_small" "summ_regression1_large"
## [11] "summ_regression2_small" "summ_regression2_large"
## [13] "coef_regression1_small" "coef_regression1_large"
## [15] "coef_regression2_small" "coef_regression2_large"
## [17] "my_knit" "simulate"
## [19] "reg1" "reg2"
## [21] "'report.Rmd'" "c"
## [23] "summary" "suppressWarnings"
## [25] "coef" "knit"
## [27] "data.frame" "rpois"
## [29] "stats::rnorm" "lm"
If you are ever unsure about what exactly is reproducibly tracked, consult the examples in the following documentation.
?deps
?tracked
?plot_graph
Drake can be fooled into skipping objects that should be treated as dependencies. For example:
f <- function(){
  b <- get("x", envir = globalenv()) # x is incorrectly ignored
  file_dependency <- readRDS('input_file.rds') # 'input_file.rds' is incorrectly ignored
  digest::digest(file_dependency)
}
deps(f)
## [1] "digest::digest" "get" "globalenv" "readRDS"
command = "x <- digest::digest('input_file.rds'); assign(\"x\", 1); x"
deps(command)
## [1] "'input_file.rds'" "assign" "digest::digest"
With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is
{
    args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
    names <- if (is.null(names(args)))
        character(length(args))
    else names(args)
    dovec <- names %in% vectorize.args
    do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
        SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
}
Thus, if f <- Vectorize(g, ...) is such a function, drake searches g() for dependencies, not f(). Specifically, if drake sees that environment(f)[["FUN"]] exists and is a function, then environment(f)[["FUN"]] is searched instead of f().
In addition, if f() is the output of Vectorize(), then drake reacts to changes in environment(f)[["FUN"]], not f(). So if the configuration settings of vectorization change (such as which arguments are vectorized) but the core element-wise functionality remains the same, make() still thinks everything is up to date.
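A quick base-R check illustrates where the tracked function lives (g() below is a toy element-wise function):
g <- function(x, y){
  x + y
}
f <- Vectorize(g, vectorize.args = "x")
identical(environment(f)[["FUN"]], g) # TRUE: drake watches g(), not f()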
Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled code called with .Call().
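For example, in recent versions of R, stats::fft() is a thin R wrapper around compiled C code, so drake only tracks the wrapper:
body(stats::fft) # .Call(C_fft, z, inverse): the C code itself is invisible to drake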
On Windows, do not use make(..., parallelism = "mclapply"). Replace "mclapply" with one of the other parallelism_choices(), or let drake choose the parallelism for you. For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.
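For example, a Windows-safe invocation could look like this:
parallelism_choices() # list the supported backends
make(my_plan, jobs = 2) # omit the parallelism argument to let drake choose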
The Makefile generated by make(myplan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). Drake uses dummy timestamp files to tell the Makefile what to do, so running make in the terminal by itself will most likely give incorrect results.
Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best parallel backend for your system and uses the number of jobs you give to the jobs argument of make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run
make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")