Drake's make() function generates your project's output, and drake takes storing this output seriously. This guide explains how drake caches and hashes its data, and describes customization options that can increase convenience and speed.
When you run make(), drake stores your imports and output targets in a hidden cache.
library(drake)
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").
config <- make(my_plan, verbose = FALSE)
You can explore your cached data using functions like loadd(), readd(), and cached().
head(cached())
## [1] "\"report.Rmd\"" "\"report.md\""
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
head(readd(small))
## x y
## 1 3.730 17.3
## 2 5.250 10.4
## 3 3.730 17.3
## 4 5.345 14.7
## 5 3.190 24.4
## 6 3.440 17.8
loadd(large)
head(large)
## x y
## 1 3.780 15.2
## 2 3.170 15.8
## 3 5.424 10.4
## 4 5.250 10.4
## 5 3.780 15.2
## 6 3.570 14.3
rm(large) # Does not remove `large` from the cache.
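Removing an object from your R session does not touch the copy in the cache, so you can always load it back. A quick check, assuming the cache created by the make() call above:
"large" %in% cached() # Still listed among the cached targets.
large_again <- readd(large) # Read a fresh copy back from storage.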
The storr package does the heavy lifting. A storr is an object in R that serves as an abstraction for a storage backend, usually a file system. See the main storr vignette for a thorough walkthrough.
class(config$cache) # from `config <- make(...)`
## [1] "storr" "R6"
cache <- get_cache() # Get the default cache from the last build.
class(cache)
## [1] "storr" "R6"
cache$list() # Functionality from storr
## [1] "\"report.Rmd\"" "\"report.md\""
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
## [7] "data.frame" "knit"
## [9] "large" "lm"
## [11] "mtcars" "nrow"
## [13] "random_rows" "reg1"
## [15] "reg2" "regression1_large"
## [17] "regression1_small" "regression2_large"
## [19] "regression2_small" "sample.int"
## [21] "simulate" "small"
## [23] "summ_regression1_large" "summ_regression1_small"
## [25] "summ_regression2_large" "summ_regression2_small"
## [27] "summary" "suppressWarnings"
head(cache$get("small")) # Functionality from storr
## x y
## 1 3.730 17.3
## 2 5.250 10.4
## 3 3.730 17.3
## 4 5.345 14.7
## 5 3.190 24.4
## 6 3.440 17.8
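Other storr methods are available through the same object. A small sketch using the existing cache (these are storr's own methods, not drake functions):
cache$exists("small") # Check whether a key is present in the cache.
cache$list_namespaces() # A storr cache can hold several namespaces of keys.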
The concept of hashing is central to storr's internals. Storr uses hashes to label stored objects, and drake leverages these hashes to figure out which targets are up to date and which ones are outdated. A hash is like a target's fingerprint, so the hash changes when the target changes. Regardless of the target's size, the hash is always the same number of characters.
library(digest) # package for hashing objects and files
smaller_data <- 12
larger_data <- rnorm(1000)
digest(smaller_data) # compute the hash
## [1] "23c80a31c0713176016e6e18d76a5f31"
digest(larger_data)
## [1] "d667cb4e86bc358992bdcd486c29e976"
However, different hash algorithms vary in length.
digest(larger_data, algo = "sha512")
## [1] "5125636b48d17ec4ad59ce27d42b1d33b9b1bdf22c23054a5ce91aa684ffe143e478fe2cfa07bb949d0e1274851bdbe4d365a31c3bb98e015e160aa471984eb6"
digest(larger_data, algo = "md5")
## [1] "d667cb4e86bc358992bdcd486c29e976"
digest(larger_data, algo = "xxhash64")
## [1] "ae4a84cb926c7bfc"
digest(larger_data, algo = "murmur32")
## [1] "deddff38"
Hashing is expensive, and unsurprisingly, shorter hashes are usually faster to compute. So why not always use murmur32? One reason is the risk of collisions: that is, when two different objects have the same hash. In general, shorter hashes have more frequent collisions. On the other hand, a longer hash is not always the answer. Besides the loss of speed, drake and storr sometimes use hash keys as file names, and long hashes could violate the 260-character cap on Windows file paths. That is why drake uses a shorter hash algorithm for internal cache-related file names and a longer hash algorithm for everything else.
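To get a rough sense of the speed difference, you can time the algorithms yourself. This is only a sketch, and the timings depend on your machine:
big <- runif(1e7) # A larger object makes the difference easier to see.
system.time(digest(big, algo = "sha512")) # Longer hash, slower.
system.time(digest(big, algo = "murmur32")) # Shorter hash, faster.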
default_short_hash_algo()
## [1] "xxhash64"
default_long_hash_algo()
## [1] "sha256"
short_hash(cache)
## [1] "xxhash64"
long_hash(cache)
## [1] "sha256"
For new projects, use new_cache() to set the hash algorithms of the default cache.
# cache_path(cache) # Default cache from before. # nolint
# Start from scratch to reset both hash algorithms.
clean(destroy = TRUE)
tmp <- new_cache(
path = default_cache_path(), # The `.drake/` folder.
short_hash_algo = "crc32",
long_hash_algo = "sha1"
)
The cache at default_cache_path() (equivalently, the .drake/ folder) is the default cache used for make().
config <- make(my_plan, verbose = FALSE)
short_hash(config$cache) # xxhash64 is the default_short_hash_algo()
## [1] "crc32"
long_hash(config$cache) # sha256 is the default_long_hash_algo()
## [1] "sha1"
You can change the long hash algorithm without throwing away the cache, but your project will rebuild from scratch. As for the short hash, you are committed until you delete the cache and all its supporting files.
outdated(config) # empty
## character(0)
config$cache <- configure_cache(
config$cache,
long_hash_algo = "murmur32",
overwrite_hash_algos = TRUE
)
Below, the targets become outdated because the existing hash keys do not match the new hash algorithm.
config <- drake_config(my_plan, verbose = FALSE, cache = config$cache)
outdated(config)
## [1] "\"report.md\"" "coef_regression1_large"
## [3] "coef_regression1_small" "coef_regression2_large"
## [5] "coef_regression2_small" "large"
## [7] "regression1_large" "regression1_small"
## [9] "regression2_large" "regression2_small"
## [11] "small" "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"
config <- make(my_plan, verbose = FALSE)
short_hash(config$cache) # same as before
## [1] "crc32"
long_hash(config$cache) # different from before
## [1] "murmur32"
You do not need to use the default cache at default_cache_path() (the .drake/ folder). However, if you use a different location, such as the custom faster_cache/ folder below, you will need to manually supply the cache to all functions that require one.
faster_cache <- new_cache(
path = "faster_cache",
short_hash_algo = "murmur32",
long_hash_algo = "murmur32"
)
# cache_path(faster_cache) # nolint
# cache_path(cache) # location of the previous cache # nolint
short_hash(faster_cache)
## [1] "murmur32"
long_hash(faster_cache)
## [1] "murmur32"
new_plan <- drake_plan(
simple = 1 + 1
)
make(new_plan, cache = faster_cache)
## target simple
cached(cache = faster_cache)
## [1] "simple"
readd(simple, cache = faster_cache)
## [1] 2
You can recover an old cache from the file system. You could use storr::storr_rds() directly if you know the short hash algorithm, but this_cache() and recover_cache() are safer for drake. get_cache() is similar, but it has a slightly different interface.
old_cache <- this_cache("faster_cache") # Get a cache you know exists...
recovered <- recover_cache("faster_cache") # or create a new one if missing.
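For comparison, a sketch of the get_cache() interface, assuming the default cache from the make() call above still exists:
default_cache <- get_cache() # The default `.drake/` cache from the last make().
short_hash(default_cache)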
If you want to bypass drake and generate a cache directly from storr, it is best to do so right from the beginning.
library(storr)
my_storr <- storr_rds("my_storr", mangle_key = TRUE)
make(new_plan, cache = my_storr)
## Unloading targets from environment:
## simple
## target simple
cached(cache = my_storr)
## [1] "simple"
readd(simple, cache = my_storr)
## [1] 2
In addition to storr_rds(), drake supports in-memory caches created with storr_environment(). However, parallel computing is not supported for these caches. The jobs argument must be 1, and the parallelism argument must be either "mclapply" or "parLapply". (It is sufficient to leave the default values alone.)
memory_cache <- storr_environment()
other_plan <- drake_plan(
some_data = rnorm(50),
more_data = rpois(75, lambda = 10),
result = mean(c(some_data, more_data))
)
make(other_plan, cache = memory_cache)
## target more_data
## target some_data
## target result
cached(cache = memory_cache)
## [1] "c" "mean" "more_data" "result" "rnorm" "rpois"
## [7] "some_data"
readd(result, cache = memory_cache)
## [1] 6.232917
In theory, it should be possible to leverage serious databases using storr_dbi(). However, if you use such caches, please heed the following: a storr::storr_dbi() cache is not thread-safe. Either use no parallel computing at all, or set parallelism = "future" with caching = "master". The "future" backend is currently experimental, but it lets the master process do all the caching in order to avoid race conditions. The following example requires the DBI and RSQLite packages.
mydb <- DBI::dbConnect(RSQLite::SQLite(), "my-db.sqlite")
cache <- storr::storr_dbi(
tbl_data = "data",
tbl_keys = "keys",
con = mydb
)
load_basic_example() # Get the code with drake_example("basic").
unlink(".drake", recursive = TRUE)
make(my_plan, cache = cache)
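If you do want parallelism with a DBI cache, the approach described above would look roughly like the commented sketch below. It assumes the future package is installed and that your version of drake supports the caching argument.
# future::plan(future::multisession) # Choose a future backend first.
# make(
#   my_plan,
#   cache = cache,
#   parallelism = "future",
#   jobs = 2,
#   caching = "master" # Let the master process do all the caching.
# )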
If you want to start from scratch, you can clean() the cache. Use the destroy argument to remove it completely. cache$del() and cache$destroy() are also options, but they leave output file targets dangling. By contrast, clean(destroy = TRUE) removes file targets generated by drake::make(). drake_gc() and clean(..., garbage_collection = TRUE) do garbage collection, and clean(purge = TRUE) removes all target-level data, not just the final output values.
clean(small, large)
cached() # 'small' and 'large' are gone
## [1] "\"report.Rmd\"" "\"report.md\""
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
## [7] "data.frame" "knit"
## [9] "lm" "mtcars"
## [11] "nrow" "random_rows"
## [13] "reg1" "reg2"
## [15] "regression1_large" "regression1_small"
## [17] "regression2_large" "regression2_small"
## [19] "sample.int" "simulate"
## [21] "summ_regression1_large" "summ_regression1_small"
## [23] "summ_regression2_large" "summ_regression2_small"
## [25] "summary" "suppressWarnings"
clean(destroy = TRUE)
clean(destroy = TRUE, cache = faster_cache)
clean(destroy = TRUE, cache = my_storr)