This vignette describes general best practices for creating, configuring, and running drake projects. It answers frequently asked questions, clears up common misconceptions, and will continuously develop in response to community feedback.
For examples of how to structure your code files, see the beginner-oriented example projects. Write the code files to your workspace with the drake_example() function.
drake_example("basic")
drake_example("gsp")
drake_example("packages")
In practice, you do not need to organize your files the way the examples do, but it does happen to be a reasonable way of doing things.
It is best to write your code as a collection of functions. You can save those functions in R scripts and then source() them before doing anything else.
# Load the functions get_data(), analyze_data(), and summarize_results().
source("my_functions.R")
Then, set up your workflow plan data frame.
good_plan <- drake_plan(
my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint
my_analysis = analyze_data(my_data),
my_summaries = summarize_results(my_data, my_analysis)
)
## Warning: Converting double-quotes to single-quotes because the
## `strings_in_dots` argument is missing. Use the file_in(), file_out(), and
## knitr_in() functions to work with files in your commands. To remove this
## warning, either call `drake_plan()` with `strings_in_dots = "literals"` or
## use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`.
good_plan
## # A tibble: 3 x 2
## target command
## <chr> <chr>
## 1 my_data get_data(file_in('data.csv'))
## 2 my_analysis analyze_data(my_data)
## 3 my_summaries summarize_results(my_data, my_analysis)
Drake knows that my_analysis depends on my_data because my_data is an argument to analyze_data(), which is part of the command for my_analysis.
config <- drake_config(good_plan)
vis_drake_graph(config)
Now, you can call make() to build the targets.
make(good_plan)
If your commands are really long, just put them in larger functions. Drake analyzes imported functions for non-file dependencies.
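For example, a single long command can collapse into one wrapper function. Here is a minimal sketch (run_full_analysis() and its body are hypothetical):
run_full_analysis <- function(data){
  cleaned <- na.omit(data) # Preprocessing.
  model <- analyze_data(cleaned) # The main computation.
  list(model = model, rows_used = nrow(cleaned)) # Everything downstream targets need.
}
# In the plan, the command shrinks to: my_analysis = run_full_analysis(my_data)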
Some people are accustomed to dividing their work into R scripts and then calling source() to run each step of the analysis. For example, you might have the following files.
get_data.R
analyze_data.R
summarize_results.R
If you migrate to drake, you may be tempted to set up a workflow plan like this.
bad_plan <- drake_plan(
my_data = source(file_in("get_data.R")),
my_analysis = source(file_in("analyze_data.R")),
my_summaries = source(file_in("summarize_results.R"))
)
## Warning: Converting double-quotes to single-quotes because the
## `strings_in_dots` argument is missing. Use the file_in(), file_out(), and
## knitr_in() functions to work with files in your commands. To remove this
## warning, either call `drake_plan()` with `strings_in_dots = "literals"` or
## use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`.
bad_plan
## # A tibble: 3 x 2
## target command
## <chr> <chr>
## 1 my_data source(file_in('get_data.R'))
## 2 my_analysis source(file_in('analyze_data.R'))
## 3 my_summaries source(file_in('summarize_results.R'))
But now, the dependency structure of your work is broken. Your R script files are dependencies, but since my_data is not mentioned in a function or command, drake does not know that my_analysis depends on it.
config <- drake_config(bad_plan)
vis_drake_graph(config)
Dangers:
1. With make(bad_plan, jobs = 2), drake will try to build my_data and my_analysis at the same time even though my_data must finish before my_analysis begins.
2. Drake is oblivious to data.csv since it is not explicitly mentioned in a workflow plan command. So when data.csv changes, make(bad_plan) will not rebuild my_data.
3. my_analysis will not update when my_data changes.
4. The return value of source() is formatted counter-intuitively. If source(file_in("get_data.R")) is the command for my_data, then my_data will always be a list with elements "value" and "visible". In other words, source(file_in("get_data.R"))$value is really what you would want (demonstrated below).
In addition, this source()-based approach is simply inconvenient. Drake rebuilds my_data every time get_data.R changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses my_data = get_data(), drake does not trigger rebuilds when comments or whitespace in get_data() are modified. Drake is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.
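As a quick demonstration of the source() return value from danger 4 (a minimal sketch, assuming get_data.R exists in your working directory):
out <- source("get_data.R")
names(out)
## [1] "value"   "visible"
# The dataset itself is out$value, not out.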
For a serious project, you should use drake's make() function outside knitr. In other words, you should treat R Markdown reports and other knitr documents as targets and imports, not as a way to run make(). Viewed as targets, drake makes special exceptions for R Markdown reports and other knitr reports such as *.Rmd and *.Rnw files. Not every drake project needs them, but it is good practice to use them to summarize the final results of a project once all the other targets have been built. The basic example, for instance, has an R Markdown report: report.Rmd is knitted to build report.md, which summarizes the final results.
# Load all the functions and the workflow plan data frame, my_plan.
load_basic_example() # Get the code with drake_example("basic").
To see where report.md will be built, look to the right of the dependency graph.
config <- drake_config(my_plan)
vis_drake_graph(config)
Drake treats knitr reports as special cases. Whenever drake sees knit() or render() (from the rmarkdown package) mentioned in a command, it dives into the source file to look for dependencies. Consider report.Rmd, which you can view here. When drake sees readd(small) in an active code chunk, it knows report.Rmd depends on the target called small, and it draws the appropriate arrow in the dependency graph above. And if small ever changes, make(my_plan) will re-process report.Rmd to produce the target file report.md.
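In a workflow plan, such a report target is typically declared with knitr_in() and file_out() so that drake watches both the source and output files. A sketch in the spirit of the basic example:
report_plan <- drake_plan(
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE), # knit() is from the knitr package.
  strings_in_dots = "literals"
)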
knitr reports are the only kind of file that drake analyzes for dependencies. It does not give R scripts the same special treatment.
The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to
1. use expose_imports() to properly account for all your nested function dependencies, and
2. if you load the package with devtools::load_all(), set the prework argument of make(): e.g. make(prework = "devtools::load_all()").
Both steps are sketched together below. Thanks to Jasper Clarkberg for the workaround behind expose_imports().
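Put together, the two steps might look like this (a sketch; yourWorkflowPackage and your_plan are hypothetical):
library(yourWorkflowPackage) # Hypothetical package holding your workflow's functions.
expose_imports(yourWorkflowPackage) # Step 1: make nested package functions trackable.
make(your_plan, prework = "devtools::load_all()") # Step 2: only needed if you rely on load_all().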
For drake, there is one problem: nested functions. Drake always looks for imported functions nested in other imported functions, but only in your environment. When it sees a function from a package, it does not look in its body for other imports.
To see this, consider the digest() function from the digest package. The digest package is a utility for computing hashes, not a data science workflow, but I will use it to demonstrate how drake treats imports from packages.
library(digest)
g <- function(x){
digest(x)
}
f <- function(x){
g(x)
}
plan <- drake_plan(x = f(1))
# Here are the reproducibly tracked objects in the workflow.
tracked(plan)
## [1] "g" "digest" "f" "x"
# But the `digest()` function has dependencies too.
# Because `drake` knows `digest()` is from a package,
# it ignores these dependencies by default.
head(deps(digest), 10)
## [1] ".Call" ".errorhandler" "any"
## [4] "as.integer" "as.raw" "base::serialize"
## [7] "digest_impl" "file.access" "file.exists"
## [10] "file.info"
To force drake to dive deeper into the nested functions in a package, you must use expose_imports(). Again, I demonstrate with the digest package, but you should really only do this with a package you write yourself to contain your workflow. For external packages, packrat is a much better solution for package reproducibility.
expose_imports(digest)
## <environment: R_GlobalEnv>
new_objects <- tracked(plan)
head(new_objects, 10)
## [1] "digest" "warning" "as.raw"
## [4] ".Call" ".errorhandler" "any"
## [7] "as.integer" "base::serialize" "digest_impl"
## [10] "file.access"
length(new_objects)
## [1] 32
# Now when you call `make()`, `drake` will dive into `digest`
# to import dependencies.
cache <- storr::storr_environment() # just for examples
make(plan, cache = cache)
## target x
head(cached(cache = cache), 10)
## [1] "any" "as.integer" "as.raw"
## [4] "base::serialize" "digest" "digest_impl"
## [7] "f" "file.access" "file.exists"
## [10] "file.info"
length(cached(cache = cache))
## [1] 30
Drake has the following functions to generate workflow plan data frames (the plan argument of make(), where you list your targets and commands). A quick sketch of two of them follows the list.
drake_plan()
evaluate_plan()
expand_plan()
gather_plan()
reduce_plan()
plan_analyses()
plan_summaries()
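As promised above: expand_plan() replicates targets over a set of suffixes, and gather_plan() combines targets into one (a sketch; sim_data() is a hypothetical function):
small_plan <- drake_plan(data = sim_data())
big_plan <- expand_plan(small_plan, values = c("rep1", "rep2")) # Targets data_rep1 and data_rep2.
gather_plan(big_plan, target = "all_data", gather = "rbind") # A single target that combines both.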
Except for drake_plan(), they are all templating devices. evaluate_plan(), for example, uses wildcards: suppose your workflow checks several metrics of several schools. The idea is to write one wildcard-filled command per metric and let evaluate_plan() expand the plan over the available schools.
hard_plan <- drake_plan(
credits = check_credit_hours(school__),
students = check_students(school__),
grads = check_graduations(school__),
public_funds = check_public_funding(school__)
)
evaluate_plan(
hard_plan,
rules = list(school__ = c("schoolA", "schoolB", "schoolC"))
)
## # A tibble: 12 x 2
## target command
## <chr> <chr>
## 1 credits_schoolA check_credit_hours(schoolA)
## 2 credits_schoolB check_credit_hours(schoolB)
## 3 credits_schoolC check_credit_hours(schoolC)
## 4 students_schoolA check_students(schoolA)
## 5 students_schoolB check_students(schoolB)
## 6 students_schoolC check_students(schoolC)
## 7 grads_schoolA check_graduations(schoolA)
## 8 grads_schoolB check_graduations(schoolB)
## 9 grads_schoolC check_graduations(schoolC)
## 10 public_funds_schoolA check_public_funding(schoolA)
## 11 public_funds_schoolB check_public_funding(schoolB)
## 12 public_funds_schoolC check_public_funding(schoolC)
But what if some metrics do not make sense? For example, what if schoolC is a completely privately-funded school? With no public funds, check_public_funding(schoolC) may quit in error if we are not careful. This is where setting up workflow plans gets tricky. You may need to use multiple wildcards and make sure some combinations of values are left out.
library(magrittr)
rules_grid <- tibble::tibble(
school_ = c("schoolA", "schoolB", "schoolC"),
funding_ = c("public", "public", "private")
) %>%
tidyr::crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
dplyr::filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013"))) %>%
print()
## # A tibble: 10 x 3
## school_ funding_ cohort_
## <chr> <chr> <chr>
## 1 schoolA public 2012
## 2 schoolA public 2013
## 3 schoolA public 2014
## 4 schoolA public 2015
## 5 schoolB public 2014
## 6 schoolB public 2015
## 7 schoolC private 2012
## 8 schoolC private 2013
## 9 schoolC private 2014
## 10 schoolC private 2015
Then, alternately choose expand = TRUE and expand = FALSE when evaluating the wildcards.
drake_plan(
credits = check_credit_hours("school_", "funding_", "cohort_"),
students = check_students("school_", "funding_", "cohort_"),
grads = check_graduations("school_", "funding_", "cohort_"),
public_funds = check_public_funding("school_", "funding_", "cohort_"),
strings_in_dots = "literals"
) %>% evaluate_plan(
wildcard = "school_",
values = rules_grid$school_,
expand = TRUE
) %>%
evaluate_plan(
wildcard = "funding_",
rules = rules_grid,
expand = FALSE
) %>%
DT::datatable()
Thanks to Alex Axthelm for this example in issue 235.
Some workflows rely on remote data from the internet, and the workflow needs to refresh when the datasets change. As an example, let us consider the download logs from RStudio's CRAN mirror.
library(drake)
library(R.utils) # For unzipping the files we download.
library(curl) # For downloading data.
library(httr) # For querying websites.
url <- "http://cran-logs.rstudio.com/2018/2018-02-09-r.csv.gz"
How do we know when the data at the URL changed? We get the time that the file was last modified. (Alternatively, we could use an HTTP ETag.)
query <- HEAD(url)
timestamp <- query$headers[["last-modified"]]
timestamp
## [1] "Mon, 12 Feb 2018 16:34:48 GMT"
In our workflow plan, the timestamp is a target and a dependency. When the timestamp changes, so does everything downstream.
cranlogs_plan <- drake_plan(
timestamp = HEAD(url)$headers[["last-modified"]],
logs = get_logs(url, timestamp),
strings_in_dots = "literals"
)
cranlogs_plan
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 timestamp "HEAD(url)$headers[[\"last-modified\"]]"
## 2 logs get_logs(url, timestamp)
To make sure we always have the latest timestamp, we use the "always" trigger. (See the debugging vignette for more on triggers.)
cranlogs_plan$trigger <- c("always", "any")
cranlogs_plan
## # A tibble: 2 x 3
## target command trigger
## <chr> <chr> <chr>
## 1 timestamp "HEAD(url)$headers[[\"last-modified\"]]" always
## 2 logs get_logs(url, timestamp) any
Lastly, we define the get_logs() function, which actually downloads the data.
# The ... is just so we can write dependencies as function arguments
# in the workflow plan.
get_logs <- function(url, ...){
curl_download(url, "logs.csv.gz") # Get a big file.
gunzip("logs.csv.gz", overwrite = TRUE) # Unzip it.
out <- read.csv("logs.csv", nrows = 4) # Extract the data you need.
unlink(c("logs.csv.gz", "logs.csv")) # Remove the big files
out # Value of the target.
}
When we are ready, we run the workflow.
make(cranlogs_plan)
## Unloading targets from environment:
## timestamp
## target timestamp: trigger "always"
## target logs
## Used non-default triggers. Some targets may not be up to date.
readd(logs)
## date time size version os country ip_id
## 1 2018-02-09 13:01:13 82375220 3.4.3 win RO 1
## 2 2018-02-09 13:02:06 74286541 3.3.3 win US 2
## 3 2018-02-09 13:02:10 82375216 3.4.3 win US 3
## 4 2018-02-09 13:03:30 82375220 3.4.3 win IS 4