This tutorial is a recommended starting point for learning how to use drake. It is an abridged version of the basic example vignette. See this section of the README for a high-level overview of the available documentation.
Write the code files to your workspace.
drake_example("basic")
The new basic folder now contains the file structure of a serious drake project, plus an interactive-tutorial.R script that narrates the example. The code is also online here.
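If you want to check exactly what was written, list the contents of the new folder (the precise file names may vary across drake versions):
list.files("basic") # should include interactive-tutorial.R and the project scripts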
Is there an association between the weight and the fuel efficiency of cars? To find out, we use the mtcars dataset from the datasets package. The mtcars dataset originally came from the 1974 Motor Trend US magazine, and it contains design and performance data on 32 models of automobile.
# ?mtcars # more info
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here, wt is the weight of the car in units of 1000 pounds, and mpg is fuel efficiency in miles per gallon. We want to figure out whether there is an association between wt and mpg. The mtcars dataset itself only has 32 rows, so we generate two larger bootstrapped datasets and then analyze them with regression models. Finally, we summarize the regression models to see whether there is an association.
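To make this plan of attack concrete, here is a rough sketch of what the bootstrapping and modeling functions might look like. It is illustrative only: the real definitions of random_rows(), simulate(), reg1(), and reg2() come from load_basic_example() below and may differ in detail.
# Sketch only: the real functions are loaded by load_basic_example().
random_rows <- function(data, n) {
  # Resample n rows of a data frame with replacement.
  data[sample.int(n = nrow(data), size = n, replace = TRUE), ]
}
simulate <- function(n) {
  # Bootstrap n rows from mtcars and keep weight and fuel efficiency.
  data <- random_rows(data = mtcars, n = n)
  data.frame(x = data$wt, y = data$mpg)
}
reg1 <- function(d) {
  lm(y ~ x, data = d) # straight-line model
}
reg2 <- function(d) {
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d) # quadratic term instead of a linear one
}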
Your workspace begins with a bunch of imports: functions, pre-loaded data objects, and saved files that are all available before the real work starts.
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").
# Drake looks for data objects and functions in your R session environment
ls()
## [1] "AES" "AESdecryptECB" "AESencryptECB"
## [4] "AESinit" "attr_sha1" "avoid_this"
## [7] "b" "bad_plan" "cache"
## [10] "command" "config" "cranlogs_plan"
## [13] "debug_plan" "digest" "digest_impl"
## [16] "envir" "error" "example_class"
## [19] "example_object" "f" "g"
## [22] "get_logs" "good_plan" "hard_plan"
## [25] "hmac" "little_b" "logs"
## [28] "makeRaw" "makeRaw.character" "makeRaw.default"
## [31] "makeRaw.digest" "makeRaw.raw" "modes"
## [34] "my_plan" "my_variable" "myplan"
## [37] "new_objects" "num2hex" "padWithZeros"
## [40] "plan" "print.AES" "query"
## [43] "random_rows" "reg1" "reg2"
## [46] "rules_grid" "sha1" "sha1.Date"
## [49] "sha1.NULL" "sha1.POSIXct" "sha1.POSIXlt"
## [52] "sha1.anova" "sha1.array" "sha1.call"
## [55] "sha1.character" "sha1.complex" "sha1.data.frame"
## [58] "sha1.default" "sha1.factor" "sha1.function"
## [61] "sha1.integer" "sha1.list" "sha1.logical"
## [64] "sha1.matrix" "sha1.name" "sha1.numeric"
## [67] "sha1.pairlist" "sha1.raw" "simulate"
## [70] "timestamp" "tmp" "totally_okay"
## [73] "url" "x"
# and saved files in your file system.
list.files()
## [1] "best-practices.R" "best-practices.Rmd" "best-practices.html"
## [4] "best-practices.md" "caution.R" "caution.Rmd"
## [7] "caution.html" "caution.md" "debug.R"
## [10] "debug.Rmd" "debug.html" "debug.md"
## [13] "drake.R" "drake.Rmd" "example-basic.Rmd"
## [16] "example-gsp.Rmd" "example-packages.Rmd" "faq.Rmd"
## [19] "graph.Rmd" "parallelism.Rmd" "report.R"
## [22] "report.Rmd" "storage.Rmd" "timing.Rmd"
Your real work is outlined in a data frame of data analysis steps called “targets”. The targets depend on the imports, and drake will figure out how they are all connected.
my_plan
## # A tibble: 15 x 2
## target command
## <chr> <chr>
## 1 "" "knit(knitr_in(\"report.Rmd\"), file_out(\"repo…
## 2 small simulate(48)
## 3 large simulate(64)
## 4 regression1_small reg1(small)
## 5 regression1_large reg1(large)
## 6 regression2_small reg2(small)
## 7 regression2_large reg2(large)
## 8 summ_regression1_small suppressWarnings(summary(regression1_small$resi…
## 9 summ_regression1_large suppressWarnings(summary(regression1_large$resi…
## 10 summ_regression2_small suppressWarnings(summary(regression2_small$resi…
## 11 summ_regression2_large suppressWarnings(summary(regression2_large$resi…
## 12 coef_regression1_small suppressWarnings(summary(regression1_small))$co…
## 13 coef_regression1_large suppressWarnings(summary(regression1_large))$co…
## 14 coef_regression2_small suppressWarnings(summary(regression2_small))$co…
## 15 coef_regression2_large suppressWarnings(summary(regression2_large))$co…
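Notice the first command: knitr_in() marks report.Rmd as a knitr source file whose code chunks drake scans for dependencies, and file_out() marks the rendered report as an output file, so the report is rebuilt whenever the code or its upstream targets change. Here is a minimal sketch of declaring files this way (the file names below are made up for illustration and are not part of the basic example):
library(knitr) # for knit()
file_plan <- drake_plan(
  cleaned = read.csv(file_in("raw-data.csv")), # an input file dependency
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)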
Wildcard templating makes it easy to generate data frames like this at scale.
library(magrittr)
dataset_plan <- drake_plan(
small = simulate(5),
large = simulate(50)
)
dataset_plan
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 small simulate(5)
## 2 large simulate(50)
analysis_methods <- drake_plan(
regression = regNUMBER(dataset__) # nolint
) %>%
evaluate_plan(wildcard = "NUMBER", values = 1:2)
analysis_methods
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 regression_1 reg1(dataset__)
## 2 regression_2 reg2(dataset__)
analysis_plan <- plan_analyses(
plan = analysis_methods,
datasets = dataset_plan
)
analysis_plan
## # A tibble: 4 x 2
## target command
## <chr> <chr>
## 1 regression_1_small reg1(small)
## 2 regression_1_large reg1(large)
## 3 regression_2_small reg2(small)
## 4 regression_2_large reg2(large)
whole_plan <- rbind(dataset_plan, analysis_plan)
whole_plan
## # A tibble: 6 x 2
## target command
## <chr> <chr>
## 1 small simulate(5)
## 2 large simulate(50)
## 3 regression_1_small reg1(small)
## 4 regression_1_large reg1(large)
## 5 regression_2_small reg2(small)
## 6 regression_2_large reg2(large)
For the commands you pass in through the free-form ... argument, drake_plan() uses tidy evaluation. For example, it supports quasiquotation with the !! operator. Use tidy_evaluation = FALSE or the list argument to suppress this behavior.
my_variable <- 5
drake_plan(
a = !!my_variable,
b = !!my_variable + 1,
list = c(d = "!!my_variable")
)
## # A tibble: 3 x 2
## target command
## <chr> <chr>
## 1 a 5
## 2 b 5 + 1
## 3 d !!my_variable
drake_plan(
a = !!my_variable,
b = !!my_variable + 1,
list = c(d = "!!my_variable"),
tidy_evaluation = FALSE
)
## # A tibble: 3 x 2
## target command
## <chr> <chr>
## 1 a !(!my_variable)
## 2 b !(!my_variable + 1)
## 3 d !!my_variable
For instances of !! that remain in the workflow plan, make() will run these commands in tidy fashion, evaluating the !! operator using the environment you provided.
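For instance (a hypothetical sketch, separate from the basic example), a command kept verbatim through the list argument is resolved when the target is built:
my_variable <- 5
tiny_plan <- drake_plan(list = c(d = "!!my_variable")) # command survives as text
make(tiny_plan, verbose = FALSE)
readd(d) # expected to be 5 under the behavior described above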
Using static code analysis, drake detects the dependencies of all your targets. The result is an interactive network diagram.
vis_drake_graph(my_plan)
At this point, all your targets are out of date because the project is new.
config <- drake_config(my_plan, verbose = FALSE) # Master configuration list
outdated(config)
## [1] "\"report.md\"" "coef_regression1_large"
## [3] "coef_regression1_small" "coef_regression2_large"
## [5] "coef_regression2_small" "large"
## [7] "regression1_large" "regression1_small"
## [9] "regression2_large" "regression2_small"
## [11] "small" "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"
The make() function traverses the network and builds the targets that require updates.
make(my_plan)
## target large
## target small
## target regression1_large
## target regression1_small
## target regression2_large
## target regression2_small
## target coef_regression1_large
## target coef_regression1_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression1_large
## target summ_regression1_small
## target summ_regression2_large
## target summ_regression2_small
## target file "report.md"
For the reg2() model on the small dataset, the p-value on x2 is so small that there may be an association between weight and fuel efficiency after all.
readd(coef_regression2_small)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.504915 1.02496426 26.835000 9.676340e-30
## x2 -0.708536 0.08285938 -8.551066 4.617125e-11
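As an aside, readd() returns a single cached target. Its companion loadd() loads one or more targets directly into your environment, which can be convenient for interactive work (a brief sketch):
loadd(coef_regression2_large) # now available as an ordinary object in the session
coef_regression2_large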
The project is currently up to date, so the next make() does nothing.
make(my_plan)
## Unloading targets from environment:
## coef_regression2_small
## large
## small
## All targets are already up to date.
But a nontrivial change to reg2() triggers updates to all the affected downstream targets.
reg2 <- function(d){
d$x3 <- d$x ^ 3
lm(y ~ x3, data = d)
}
make(my_plan)
## target regression2_large
## target regression2_small
## target coef_regression2_large
## target coef_regression2_small
## target summ_regression2_large
## target summ_regression2_small
## target file "report.md"
Drake has built-in example projects. You can generate the code files for an example with drake_example(), and you can list the available examples with drake_examples(). For instance, drake_example("gsp") generates the R script and R Markdown report for the built-in econometrics data analysis project. The examples currently supported in drake are listed below.

- basic: A tiny, minimal example with the mtcars dataset to demonstrate how to use drake. Use load_basic_example() to set up the project in your workspace. The basic example vignette is a parallel walkthrough of the same example.
- gsp: A more concrete, practical example using real econometrics data. It explores the relationships between gross state product and other quantities, and it shows off drake's ability to generate lots of reproducibly-tracked tasks with ease.
- packages: A concrete, practical example using data on R package downloads. It demonstrates how drake can refresh a project based on new incoming data without restarting everything from scratch.
- Docker-psock: Demonstrates how to deploy targets to a Docker container using a specialized PSOCK cluster.
- Makefile-cluster: Uses Makefiles to deploy targets to a generic cluster (configurable).
- sge: Uses "future_lapply" parallelism to deploy targets to a Sun/Univa Grid Engine cluster. Other clusters are similar. See batchtools/inst/templates and future.batchtools/inst/templates for more example *.tmpl template files.
- slurm: Similar to sge, but for SLURM.
- torque: Similar to sge, but for TORQUE.

Regarding the high-performance computing examples, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the examples above will work for you out of the box. To learn how to configure the files to suit your needs, make sure you understand how to use your job scheduler and batchtools.
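As a starting point, you can generate one of the scheduler examples and adapt its template file to your system (a sketch; the exact folder contents depend on your version of drake):
drake_examples() # list the available examples
drake_example("slurm") # write the SLURM example to the working directory
# Then edit the *.tmpl file it provides to match your cluster's queues,
# resource limits, and module setup before running make().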