BRM on the bike dataset (regression)

Overview

This vignette demonstrates Blockwise Reduced Modeling (BRM) on the Capital Bikeshare dataset, a regression problem where we predict hourly ride counts (cnt) from weather and calendar features. Because bike has no missing values of its own, we induce a realistic blockwise missing pattern with simulate_blockwise_missing(), which gives BRM the kind of structured missingness it is designed for.

The method, and this dataset’s role as a benchmark, are described in Srinivasan, Currim, and Ram (2025), A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns, INFORMS Journal on Data Science.

library(blockwise)
data(bike)
str(bike, list.len = 20)
#> 'data.frame':    17379 obs. of  9 variables:
#>  $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 7 ...
#>  $ weathersit: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...
#>  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#>  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#>  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
#>  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...

Inducing a blockwise missing pattern

We mask two groups of columns jointly, mimicking the situation where two independent data sources feeding your pipeline fail on different subsets of rows. On top of that we add a small per-column noise rate.

bike_miss <- simulate_blockwise_missing(
  bike,
  blocks = list(
    c("windspeed", "hum", "weekday"),
    c("hr", "temp", "weathersit")
  ),
  prop_missing = 0.30,
  noise        = 0.05
)

round(colMeans(is.na(bike_miss)) * 100, 1)  # percent missing per column
#>     season       mnth         hr    weekday weathersit       temp        hum 
#>        0.0        0.0       33.5       33.4       33.4       33.5       33.5 
#>  windspeed        cnt 
#>       33.5        0.0
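
To make the idea of joint masking concrete, here is a minimal base-R sketch on synthetic data. The helper mask_blocks() is hypothetical and is not the implementation of simulate_blockwise_missing(); it assumes that each block is blanked on its own random subset of rows and that the noise is applied independently per cell. Under that assumption the expected per-column rate is 0.30 + 0.70 * 0.05 = 0.335, which is consistent with the roughly 33.5% shown above.

```r
# Hypothetical helper (not part of blockwise): mask each block of columns
# jointly on a random subset of rows, then add independent per-cell noise.
mask_blocks <- function(df, blocks, prop_missing = 0.30, noise = 0.05) {
  n <- nrow(df)
  for (cols in blocks) {
    rows <- sample(n, size = round(prop_missing * n))  # rows where this block fails
    df[rows, cols] <- NA
  }
  for (col in unique(unlist(blocks))) {
    extra <- runif(n) < noise                          # independent per-cell noise
    df[extra, col] <- NA
  }
  df
}

set.seed(1)
toy <- data.frame(a = rnorm(1000), b = rnorm(1000),
                  c = rnorm(1000), d = rnorm(1000), y = rnorm(1000))
toy_miss <- mask_blocks(toy, blocks = list(c("a", "b"), c("c", "d")))

# Percent missing per column; y is never masked.
round(colMeans(is.na(toy_miss)) * 100, 1)

# Columns in the same block share most of their missing rows.
mean(is.na(toy_miss$a) == is.na(toy_miss$b))
```

Columns within a block disagree only where the per-cell noise hits one of them and not the other, which is what separates a blockwise pattern from missingness that is independent across columns.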

Train / test split

set.seed(1234)
idx <- sample(nrow(bike_miss), size = floor(0.75 * nrow(bike_miss)))
train <- bike_miss[idx, ]
test  <- bike_miss[-idx, ]

X_train <- train[, setdiff(names(train), "cnt")]
y_train <- train$cnt

X_test  <- test[, setdiff(names(test), "cnt")]
y_test  <- test$cnt

Fit BRM

brm() is learner-agnostic: it accepts any learner specification. Here we try a linear model (learner_lm()) and a gradient-boosted tree ensemble (learner_gbm()). When n_blocks is not supplied, the number of blocks is chosen automatically by the elbow heuristic (choose_num_blocks).

set.seed(1234)
fit_lm <- brm(X_train, y_train, learner = learner_lm())
fit_lm
#> Blockwise Reduced Model (BRM)
#>   blocks        : 3 
#>   overlap       : TRUE 
#>   learner type  : regression 
#>   features      : 8 
#>   cols / block  : 5, 8, 5

fit_gbm <- brm(
  X_train, y_train,
  learner  = learner_gbm(distribution = "poisson", n.trees = 300),
  n_blocks = fit_lm$n_blocks     # reuse so models are comparable
)
fit_gbm

Predict and score

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

pred_lm <- predict(fit_lm, X_test)
cat("BRM (lm)   RMSE:", round(rmse(y_test, pred_lm), 2), "\n")
#> BRM (lm)   RMSE: 149.6

pred_gbm <- predict(fit_gbm, X_test)
cat("BRM (gbm)  RMSE:", round(rmse(y_test, pred_gbm), 2), "\n")
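
An RMSE is easier to interpret next to a trivial reference. The sketch below uses made-up numbers (not the bike data) to work through the arithmetic and to compare against a constant predictor that always returns the mean of the observed values.

```r
# Same definition as in the vignette.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

y_obs  <- c(10, 20, 30, 40)                  # made-up counts
y_hat  <- c(12, 18, 33, 35)                  # made-up model predictions
y_mean <- rep(mean(y_obs), length(y_obs))    # constant baseline

rmse(y_obs, y_hat)    # errors -2, 2, -3, 5, so sqrt(42 / 4)
rmse(y_obs, y_mean)   # errors -15, -5, 5, 15, so sqrt(500 / 4)
```

In practice the baseline would use the mean of y_train and be scored on y_test; a model whose RMSE is not clearly below that reference has learned little from the features.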

Comparison to a listwise-deletion baseline

The conventional alternative is to drop rows that have any missing value and fit a single model on what remains. This discards every partially observed row, so the usable training set shrinks as the missing rate grows.

complete_train <- na.omit(train)
fit_lw <- lm(cnt ~ ., data = complete_train)

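# A single lm cannot predict rows with missing predictors, so to score the
# baseline on the same test rows we impute the test set's NAs with the mean
# (numeric columns) or mode (factors) of the complete training rows.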
X_test_imp <- X_test
for (j in names(X_test_imp)) {
  na_idx <- is.na(X_test_imp[[j]])
  if (!any(na_idx)) next
  ref <- complete_train[[j]]
  if (is.factor(ref)) {
    X_test_imp[[j]][na_idx] <- names(sort(table(ref), decreasing = TRUE))[1]
  } else {
    X_test_imp[[j]][na_idx] <- mean(ref, na.rm = TRUE)
  }
}
pred_lw <- predict(fit_lw, newdata = X_test_imp)

cat("Listwise-deletion lm  RMSE:", round(rmse(y_test, pred_lw), 2), "\n")
#> Listwise-deletion lm  RMSE: 151.9
cat("Training rows used    : BRM =", nrow(train), " listwise =",
    nrow(complete_train), "\n")
#> Training rows used    : BRM = 13034  listwise = 3320

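BRM keeps all 13,034 training rows (splitting them into per-pattern subsets); the listwise baseline discards any row with at least one NA, which leaves 3,320 rows, about 25% of the training set. On this split the BRM linear model also has a slightly lower test RMSE (149.6 versus 151.9).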
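
The idea of per-pattern subsets can be sketched in base R on synthetic data. This is a deliberately simplified illustration and not the algorithm in brm(): it fits one linear model per missing pattern seen in the test set and does not group patterns into blocks. The names sim, pattern_of() and fit_pattern() are hypothetical.

```r
set.seed(42)
n <- 600
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 * sim$x1 - sim$x2 + 0.5 * sim$x3 + rnorm(n, sd = 0.5)

# Blockwise masking: x2 is missing on one third of rows, x3 on another third.
sim$x2[1:200]   <- NA
sim$x3[201:400] <- NA

features <- c("x1", "x2", "x3")

# Encode each row's missing pattern as a string such as "010" (1 = missing).
pattern_of <- function(df) {
  apply(is.na(df[, features, drop = FALSE]), 1,
        function(r) paste(as.integer(r), collapse = ""))
}

train_id  <- sample(n, 450)
sim_train <- sim[train_id, ]
sim_test  <- sim[-train_id, ]

# One reduced model per pattern: use only the features observed under that
# pattern, and train on every row where those features are all present.
fit_pattern <- function(pat) {
  obs  <- features[strsplit(pat, "")[[1]] == "0"]
  rows <- complete.cases(sim_train[, obs, drop = FALSE])
  lm(reformulate(obs, response = "y"), data = sim_train[rows, ])
}

test_pat <- pattern_of(sim_test)
models   <- lapply(setNames(nm = unique(test_pat)), fit_pattern)

# Route each test row to the model that matches its missing pattern.
pred <- numeric(nrow(sim_test))
for (pat in names(models)) {
  idx <- test_pat == pat
  pred[idx] <- predict(models[[pat]], newdata = sim_test[idx, ])
}

table(pattern_of(sim_train))        # training rows per missing pattern
sqrt(mean((sim_test$y - pred)^2))   # RMSE with no imputation, no dropped rows
```

Every test row receives a prediction from a model that uses only columns it actually has, so nothing is imputed, and fully observed training rows still contribute to the models for the sparser patterns.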

Citation

If you use BRM in your work, please cite:

Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.

Run citation("blockwise") to get a ready-to-paste BibTeX entry.
