Drake has extensive high-performance computing support, from local multicore computing on your laptop to serious supercomputing across multiple nodes of a large cluster. In make(), just set the jobs argument to something greater than 1 to unlock local multicore parallelism. For large-scale distributed parallelism, set parallelism to "Makefile" and stay tuned for an explanation.
Drake's approach to parallelism relies on the network graph of the targets and imports.
library(drake)
clean() # Start with a clean cache.
load_basic_example()
make(my_plan, jobs = 2, verbose = FALSE) # Parallelize over 2 jobs.
# Change a dependency.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
# Hover, click, drag, zoom, and pan.
plot_graph(my_plan, width = "100%", height = "500px")
When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs.
Use the max_useful_jobs() function to suggest an appropriate number of jobs, taking into account which targets are already up to date. Try out the following in a fresh R session.
library(drake)
load_basic_example()
plot_graph(my_plan) # Set targets_only to TRUE for smaller graphs.
max_useful_jobs(my_plan) # 8
max_useful_jobs(my_plan, imports = "files") # 8
max_useful_jobs(my_plan, imports = "all") # 8
max_useful_jobs(my_plan, imports = "none") # 8
make(my_plan, jobs = 4)
plot_graph(my_plan)
# Ignore the targets already built.
max_useful_jobs(my_plan) # 1
max_useful_jobs(my_plan, imports = "files") # 1
max_useful_jobs(my_plan, imports = "all") # 8
max_useful_jobs(my_plan, imports = "none") # 0
# Change a function so some targets are now out of date.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
plot_graph(my_plan)
max_useful_jobs(my_plan) # 4
max_useful_jobs(my_plan, from_scratch = TRUE) # 8
max_useful_jobs(my_plan, imports = "files") # 4
max_useful_jobs(my_plan, imports = "all") # 8
max_useful_jobs(my_plan, imports = "none") # 4
Drake has multiple parallel backends, i.e. separate mechanisms for achieving parallelism. Some are low-overhead but limited, and others are high-overhead but scalable. Just set the parallelism argument of make() to choose a backend. The best choice usually depends on your project's scale and stage of deployment.
parallelism_choices() # List the parallel backends.
?parallelism_choices # Read an explanation of each backend.
default_parallelism() # "parLapply" on Windows, "mclapply" everywhere else
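For example, you can spell out the default backend explicitly. The following call behaves the same as make(my_plan, jobs = 2).
make(my_plan, parallelism = default_parallelism(), jobs = 2)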
The mclapply backend is powered by the mclapply() function from the parallel package. It forks multiple processes on your local machine to take advantage of multicore computing. It spins up quickly, but it lacks scalability, and it does not work on Windows. If you try to call make(my_plan, parallelism = "mclapply", jobs = 2) on a Windows machine, drake will warn you and then demote the number of jobs to 1.
make(my_plan, parallelism = "mclapply", jobs = 2)
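Since mclapply cannot fork processes on Windows, a portable script can select the backend at runtime. Below is a minimal sketch that branches on base R's .Platform; in practice, default_parallelism() makes this same choice for you.
# Fall back to parLapply on Windows, where mclapply cannot fork processes.
backend <- if (.Platform$OS.type == "windows") "parLapply" else "mclapply"
make(my_plan, parallelism = backend, jobs = 2)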
The parLapply backend is powered by the parLapply() function from the parallel package. Like the mclapply backend, parLapply only scales up to a handful of jobs on your local machine. However, it works on all platforms. The tradeoff is overhead: parLapply is fast once it gets going, but it takes a long time to set up because each call to make() creates a new parallel socket cluster and transfers all your data and session info to each parallel worker individually. So if jobs is less than 2, make() does not bother setting up a cluster and uses lapply() instead. Accordingly, the default parallel backend is parLapply on Windows machines (where mclapply is unavailable) and mclapply everywhere else.
make(my_plan, parallelism = "parLapply", jobs = 2)
default_parallelism() # "parLapply" on Windows, "mclapply" everywhere else
The Makefile backend uses proper Makefiles to distribute targets across different R sessions. After processing all the imports in parallel using the default backend, make(..., parallelism = "Makefile") spins up a whole new R session for each individual target. The Makefile acts as a job scheduler, waiting until the dependencies are finished before initiating the next targets at each parallelizable stage. Thanks to a clever idea by Kirill Müller, drake communicates with the Makefile by writing hidden dummy files in the cache whose only job is to hold a timestamp. The Makefile sees these timestamps and knows which jobs to run and which to skip.
Unlike the other backends, which sometimes build targets before or simultaneously with independent imports, the Makefile backend processes all the imports before beginning the first target. In addition, during import processing, make() uses the system's default parallelism (mclapply or parLapply) and the number of jobs you supplied to the jobs argument. Stay tuned for how to use different numbers of jobs for imports versus targets.
Before running Makefile parallelism, Windows users need to download and install Rtools. For everyone else, just make sure Make is installed. Then, in the next make(), simply set the parallelism and jobs arguments as before.
make(my_plan, parallelism = "Makefile", jobs = 2)
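If you are not sure whether a Make program is available, a quick check with base R (not a drake function) is the following.
Sys.which("make") # Returns the path to the Make executable, or "" if none is found.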
You will see a Makefile written to your working directory. Do not run this Makefile by itself. It will not work correctly on its own because it depends on the transient dummy timestamp files created by make().
Makefile parallelism is just a bit richer. You can now use the args argument to send custom arguments to the Makefile. For example, you could use 4 parallel jobs for the imports and 6 parallel jobs for the targets.
make(my_plan, parallelism = "Makefile", jobs = 4, args = "--jobs=6 --silent")
In addition, you can use a program other than GNU Make to run the Makefile. You may be interested in lsmake as an alternative, for example.
make(my_plan, parallelism = "Makefile", jobs = 4, command = "lsmake")
default_Makefile_command()
## [1] "make"
For finer control over the build process, use the recipe_command argument. By default, the recipe_command is "Rscript -e 'R_RECIPE'".
default_recipe_command()
## [1] "Rscript -e 'R_RECIPE'"
r_recipe_wildcard()
## [1] "R_RECIPE"
The R_RECIPE wildcard is replaced by drake::mk("your_target", "path_to_cache") in the Makefile. That way, a target named your_target is built with the Makefile recipe:
Rscript -e 'drake::mk("your_target", "path_to_cache")'
You can change the recipe with the recipe_command argument. For example, to save some time and skip loading the methods package, you might use "R -e 'R_RECIPE' -q".
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "R -e 'R_RECIPE' -q")
The Makefile recipe for your_target becomes
R -e 'drake::mk("your_target", "path_to_cache")' -q
That particular recipe fails on Windows, but you have flexibility.
Use the Makefile_recipe() function to show and tweak Makefile recipes in advance.
Makefile_recipe()
## Rscript -e 'drake::mk(target = "your_target", cache_path = "/tmp/Rtmp3fe9pW/Rbuild24bf574dbaea/drake/vignettes/.drake")'
Makefile_recipe(
  recipe_command = "R -e 'R_RECIPE' -q",
  target = "this_target",
  cache_path = "custom_cache"
)
## R -e 'drake::mk(target = "this_target", cache_path = "custom_cache")' -q
If recipe_command contains no mention of R_RECIPE, then R_RECIPE is single-quoted and appended automatically.
Makefile_recipe(recipe_command = "R -q -e")
## R -q -e 'drake::mk(target = "your_target", cache_path = "/tmp/Rtmp3fe9pW/Rbuild24bf574dbaea/drake/vignettes/.drake")'
Try each of the following and look at the generated Makefile after each call to make(). To see the recipes printed to the console, run clean() between each make() and leave verbose equal to TRUE (the default).
make(my_plan, parallelism = "Makefile", jobs = 4)
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "Rscript -e")
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "Rscript -e 'R_RECIPE'")
But do not try the following on Windows.
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "R -e 'R_RECIPE' -q")
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "R -q -e 'R_RECIPE'")
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "R -q -e")
For the recommended approach to supercomputing with drake, you need a new configuration file to tell the Makefile how to talk to the cluster. The shell_file() function writes a starter.
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
This file acts as the "shell" of the Makefile instead of, say, the Unix shell alone. It is a mechanism for tricking the Makefile into submitting each target as a job on your cluster rather than running it as a new R session on your local machine. You may need to configure shell.sh for your system, such as changing module load R to reference the version of R installed on the compute nodes of the cluster.
To tell the Makefile to use shell.sh, you will need to add the line SHELL=./shell.sh to the top of the Makefile. This should not be done manually. Instead, use the prepend argument of make().
make(my_plan, parallelism = "Makefile", jobs = 2, prepend = "SHELL=./shell.sh")
SLURM users may be able to invoke srun and dispense with shell.sh altogether, although this has been known to fail on some SLURM systems.
make(my_plan, parallelism = "Makefile", jobs = 4,
  prepend = "SHELL=srun")
And you may be able to use recipe_command to talk to the cluster rather than prepend (though most job schedulers require a script file).
make(my_plan, parallelism = "Makefile", jobs = 4,
  recipe_command = "tell_cluster_to_submit Rscript -e")
If you are interested in Makefile parallelism on a cluster, then you likely have a project that takes several hours or more to run. In that case, we recommend that you submit a master job on the login node that runs persistently until your work is complete. To do so, just save your call to make() in an R script, say my_script.R, and then deploy your work from the Linux terminal with the following.
nohup nice -19 R CMD BATCH my_script.R &
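For reference, my_script.R might contain little more than your project setup and the make() call. Here is a sketch based on the earlier examples:
# my_script.R: the master job that runs persistently on the login node.
library(drake)
load_basic_example() # Replace with your own project setup code.
make(my_plan, parallelism = "Makefile", jobs = 2,
  prepend = "SHELL=./shell.sh")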
See the timing vignette for explanations of the functions rate_limiting_times() and predict_runtime(), which can help predict the possible speed gains of using multiple independent jobs. If you suspect drake itself is slowing down your project, you may want to read the storage vignette to learn how to set the hashing algorithms of your project.
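As a starting point, the sketch below calls both functions with just a plan; see the timing vignette and the help files for the full argument lists, including options that model additional parallel jobs.
rate_limiting_times(my_plan) # Which targets dominate the total runtime?
predict_runtime(my_plan) # Estimate the duration of the next make().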