staRburst makes it trivial to scale your parallel R code from your laptop to 100+ AWS workers. This vignette walks through setup and common usage patterns.
Before using staRburst, you need to configure AWS resources. This only needs to be done once.
This will:
- Validate your AWS credentials
- Create an S3 bucket for data transfer
- Create an ECR repository for Docker images
- Set up ECS cluster and VPC resources
- Check Fargate quotas and offer to request increases
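In code, this one-time provisioning is a single call. The function name below, `starburst_setup()`, is an assumption for illustration; check the package reference for the exact entry point in your version:

```r
library(starburst)

# Hypothetical one-time setup call (name assumed for illustration).
# Validates credentials, then provisions the S3 bucket, ECR repository,
# and ECS/VPC resources described above.
starburst_setup()
```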
The simplest way to use staRburst is with the furrr package:
library(furrr)
library(starburst)
# Define your work
expensive_simulation <- function(i) {
# Some computation that takes a few minutes
results <- replicate(1000, {
x <- rnorm(10000)
mean(x^2)
})
mean(results)
}
# Local execution (single core)
plan(sequential)
system.time({
results_local <- future_map(1:100, expensive_simulation, .options = furrr_options(seed = TRUE))
})
#> ~16 minutes on typical laptop
# Cloud execution (50 workers)
plan(future_starburst, workers = 50)
system.time({
results_cloud <- future_map(1:100, expensive_simulation, .options = furrr_options(seed = TRUE))
})
#> ~2 minutes (including 45s startup)
#> Cost: ~$0.85
# Results are identical
identical(results_local, results_cloud)
#> [1] TRUE

library(starburst)
library(furrr)
# Simulate portfolio returns
simulate_portfolio <- function(seed) {
set.seed(seed)
# Random walk for 252 trading days
returns <- rnorm(252, mean = 0.0003, sd = 0.02)
prices <- cumprod(1 + returns)
list(
final_value = prices[252],
max_drawdown = max((cummax(prices) - prices) / cummax(prices)),
sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
)
}
# Run 10,000 simulations on 100 workers
plan(future_starburst, workers = 100)
results <- future_map(1:10000, simulate_portfolio, .options = furrr_options(seed = TRUE))
# Analyze results
final_values <- sapply(results, `[[`, "final_value")
hist(final_values, breaks = 50, main = "Distribution of Portfolio Final Values")
# 95% confidence interval
quantile(final_values, c(0.025, 0.975))

Performance:
- Local (single core): ~4 hours
- Cloud (100 workers): ~3 minutes
- Cost: ~$1.80
library(starburst)
library(furrr)
# Your data
data <- read.csv("my_data.csv")
# Bootstrap function
bootstrap_regression <- function(i, data) {
# Resample with replacement
boot_indices <- sample(nrow(data), replace = TRUE)
boot_data <- data[boot_indices, ]
# Fit model
model <- lm(y ~ x1 + x2 + x3, data = boot_data)
# Return coefficients
coef(model)
}
# Run 10,000 bootstrap samples
plan(future_starburst, workers = 50)
boot_results <- future_map(1:10000, bootstrap_regression, data = data, .options = furrr_options(seed = TRUE))
# Convert to matrix
boot_coefs <- do.call(rbind, boot_results)
# 95% confidence intervals for each coefficient
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))

library(starburst)
library(furrr)
# Process one sample
process_sample <- function(sample_id) {
# Read from S3 (data already in cloud)
fastq_path <- sprintf("s3://my-genomics-data/samples/%s.fastq", sample_id)
data <- read_fastq(fastq_path)
# Align reads
aligned <- align_reads(data, reference = "hg38")
# Call variants
variants <- call_variants(aligned)
# Return summary
list(
sample_id = sample_id,
num_variants = nrow(variants),
variants = variants
)
}
# Process 1000 samples on 100 workers
sample_ids <- tools::file_path_sans_ext(
  list.files("s3://my-genomics-data/samples/", pattern = "\\.fastq$")
)
plan(future_starburst, workers = 100)
results <- future_map(sample_ids, process_sample, .progress = TRUE)
# Combine results
all_variants <- do.call(rbind, lapply(results, `[[`, "variants"))

Performance:
- Local (sequential): ~208 hours (8.7 days)
- Cloud (100 workers): ~2 hours
- Cost: ~$47
If your data is already in S3, workers can read it directly:
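A minimal sketch of that pattern (bucket and object names are placeholders; it assumes the worker image can resolve `s3://` paths, as in the `starburst_upload()` example below):

```r
library(furrr)
library(starburst)

plan(future_starburst, workers = 50)

# Each worker fetches its own input object directly from S3,
# so nothing large is serialized from your laptop.
object_paths <- sprintf("s3://my-bucket/inputs/chunk_%03d.rds", 1:50)
results <- future_map(object_paths, function(path) {
  chunk <- readRDS(path)  # read happens inside the worker
  mean(chunk)
})
```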
For smaller datasets, you can pass data as arguments:
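For example (a sketch with made-up data; the data frame travels to the workers as an ordinary function argument):

```r
library(furrr)
library(starburst)

small_df <- data.frame(x = rnorm(500), y = rnorm(500))

plan(future_starburst, workers = 10)
# `small_df` is serialized with each task and shipped to the workers.
results <- future_map(1:100, function(i, df) {
  boot <- df[sample(nrow(df), replace = TRUE), ]
  coef(lm(y ~ x, data = boot))
}, df = small_df, .options = furrr_options(seed = TRUE))
```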
For very large objects, pre-upload to S3:
# Upload once
large_data <- read.csv("huge_file.csv")
s3_path <- starburst_upload(large_data, "s3://my-bucket/large_data.rds")
# Workers read from S3
plan(future_starburst, workers = 100)
results <- future_map(1:1000, function(i) {
# Read from S3 inside worker
data <- readRDS(s3_path)
process(data, i)
})

# Set maximum cost per job
starburst_config(
max_cost_per_job = 10, # Don't start jobs that would cost >$10
cost_alert_threshold = 5 # Warn when approaching $5
)
# Now jobs exceeding limit will error before starting
plan(future_starburst, workers = 1000) # Would cost ~$35/hour
#> Error: Estimated cost ($35/hr) exceeds limit ($10/hr)

If you request more workers than your quota allows, staRburst automatically uses wave-based execution:
# Quota allows 25 workers, but you request 100
plan(future_starburst, workers = 100, cpu = 4)
#> ⚠ Requested: 100 workers (400 vCPUs)
#> ⚠ Current quota: 100 vCPUs (allows 25 workers max)
#>
#> 📋 Execution plan:
#>   • Running in 4 waves of 25 workers each
#>
#> 💡 Request quota increase to 500 vCPUs? [y/n]: y
#>
#> ✅ Quota increase requested
#> ⚡ Starting wave 1 (25 workers)...
results <- future_map(1:1000, expensive_function)
#> ⚡ Wave 1: 100% complete (250 tasks)
#> ⚡ Wave 2: 100% complete (500 tasks)
#> ⚡ Wave 3: 100% complete (750 tasks)
#> ⚡ Wave 4: 100% complete (1000 tasks)

Environment mismatch: Packages not found on workers
Task failures: Some tasks failing
# Check logs
starburst_logs(task_id = "failed-task-id")
# Often due to memory limits - increase worker memory
plan(future_starburst, workers = 50, memory = "16GB")  # Default is 8GB

Slow data transfer: Large objects taking too long
✅ Good: Each task takes >5 minutes
❌ Bad: Each task takes <1 minute
Instead of:
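For example (a sketch; `quick_function` is a placeholder for a sub-second computation):

```r
# 10,000 tiny tasks: scheduling and transfer overhead dominates the runtime
results <- future_map(1:10000, quick_function)
```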
Do:
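Batch the work so each task runs for minutes rather than seconds (a sketch; `quick_function` is the same placeholder):

```r
# 100 chunks of 100 items each: one longer-running task per chunk
chunks <- split(1:10000, rep(1:100, each = 100))
results <- future_map(chunks, function(idx) lapply(idx, quick_function))
```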
Donβt:
big_data <- read.csv("10GB_file.csv") # Upload for every task
results <- future_map(1:1000, function(i) process(big_data, i))

Do:
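The fix uploads once and reads inside each worker, mirroring the `starburst_upload()` pattern shown earlier (`process()` is the same placeholder as above):

```r
# Upload the large object to S3 once, up front
big_data <- read.csv("10GB_file.csv")
s3_path <- starburst_upload(big_data, "s3://my-bucket/big_data.rds")

# Each worker reads the object back from S3 instead of receiving
# a fresh copy serialized from your laptop for every task.
results <- future_map(1:1000, function(i) {
  data <- readRDS(s3_path)
  process(data, i)
})
```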