library(mikropml)
By default, preprocess_data()
and run_ml()
use only one process in series. If you’d like to parallelize various steps of the pipeline to make them run faster, install foreach
, future
, future.apply
, and doFuture
. Then, register a future plan prior to calling preprocess_data()
and run_ml()
:
::registerDoFuture()
doFuture::plan(future::multicore, workers = 2) future
Above, we used the multicore
plan to split the work across 2 cores. See the future
documentation for more about picking the best plan for your use case. Notably, multicore
does not work inside RStudio or on Windows; you will need to use multisession
instead in those cases.
After registering a future plan, you can call preprocess_data()
and run_ml()
as usual, and they will run certain tasks in parallel.
preprocess_data(otu_small, 'dx')$dat_transformed
otu_data_preproc <- run_ml(otu_data_preproc, 'glmnet') result1 <-
run_ml()
multiple times in parallel in RYou can use functions from the future.apply
package to call run_ml()
multiple times in parallel with different parameters. You will first need to run future::plan()
as above if you haven’t already. Then, call run_ml()
with multiple seeds using future_lapply()
:
future.apply::future_lapply(seq(100, 102), function(seed) {
results_multi <-# NOTE: use more seeds for real-world data
run_ml(otu_data_preproc, 'glmnet', seed = seed)
future.seed = TRUE) },
Each call to run_ml()
with a different seed uses a different random split of the data into training and testing sets. Since we are using seeds, we must set future.seed
to TRUE
(see the future.apply
documentation and this blog post for details on parallel-safe random seeds). This example uses only a few seeds for speed and simplicity, but for real data we recommend using many more seeds to get a better estimate of model performance.
Extract the performance results and combine into one dataframe for all seeds:
future.apply::future_lapply(results_multi,
perf_df <-function(result) {result[['performance']]},
future.seed = TRUE) %>%
dplyr::bind_rows()
perf_df
You may also wish to compare performance for different ML methods. mapply()
can iterate over multiple lists or vectors, and future_mapply()
works the same way:
# NOTE: use more seeds for real-world data
expand.grid(seeds = seq(100, 102),
param_grid <-methods = c('glmnet', 'rf'))
future.apply::future_mapply(
results_mtx <-function(seed, method) {
run_ml(otu_data_preproc, method, seed = seed)
},$seeds,
param_grid$methods,
param_gridfuture.seed = TRUE
)
Extract and combine the performance results for all seeds and methods:
dplyr::bind_rows(results_mtx['performance',])
perf_df2 <- perf_df2
Visualize the performance results (ggplot2
is required):
plot_model_performance(perf_df2)
perf_boxplot <- perf_boxplot
plot_model_performance()
returns a ggplot2 object. You can add layers to customize the plot:
+
perf_boxplot theme_classic() +
scale_color_brewer(palette = "Dark2") +
coord_flip()
You can also create your own plots however you like using the performance results.
When parallelizing multiple calls to run_ml()
in R as in the examples above, all of the results objects are held in memory. This isn’t a big deal for a small dataset run with only a few seeds. However, for large datasets run in parallel with, say, 100 seeds (recommended), you may run into problems trying to store all of those objects in memory at once. One solution is to write the results files of each run_ml()
call, then concatenate them at the end. We show one way to accomplish this with Snakemake in an example Snakemake workflow here.