The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This vignette demonstrates Blockwise Reduced Modeling (BRM) on the
Capital Bikeshare dataset, a regression problem where
we predict hourly ride counts (cnt) from weather and
calendar features. Because bike is otherwise complete, we
induce a realistic blockwise missing pattern with
simulate_blockwise_missing() to show BRM in its
element.
The method, and this dataset’s role as a benchmark, are described in Srinivasan, Currim, and Ram (2025), A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns, INFORMS Journal on Data Science.
library(blockwise)
data(bike)
str(bike, list.len = 20)
#> 'data.frame': 17379 obs. of 9 variables:
#> $ season : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#> $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ weekday : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 7 ...
#> $ weathersit: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...
#> $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#> $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#> $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
#> $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...We mask two groups of columns jointly — mimicking the situation where two independent data sources feeding your pipeline fail on different subsets of rows — plus a small per-column noise rate.
bike_miss <- simulate_blockwise_missing(
bike,
blocks = list(
c("windspeed", "hum", "weekday"),
c("hr", "temp", "weathersit")
),
prop_missing = 0.30,
noise = 0.05
)
round(colMeans(is.na(bike_miss)) * 100, 1) # percent missing per column
#> season mnth hr weekday weathersit temp hum
#> 0.0 0.0 33.5 33.4 33.4 33.5 33.5
#> windspeed cnt
#> 33.5 0.0brm() is learner-agnostic: pass any
learner() specification. Here we try a linear model and a
gradient-boosted tree ensemble. The number of blocks is chosen
automatically by the elbow heuristic
(choose_num_blocks).
The conventional alternative is to drop rows that have any missing value and fit a single model on what remains. This wastes data and gets progressively worse as the missing rate grows.
complete_train <- na.omit(train)
fit_lw <- lm(cnt ~ ., data = complete_train)
# For a fair comparison we need to impute the test set's NAs somehow; use
# mean/mode from the complete training rows.
X_test_imp <- X_test
for (j in names(X_test_imp)) {
na_idx <- is.na(X_test_imp[[j]])
if (!any(na_idx)) next
ref <- complete_train[[j]]
if (is.factor(ref)) {
X_test_imp[[j]][na_idx] <- names(sort(table(ref), decreasing = TRUE))[1]
} else {
X_test_imp[[j]][na_idx] <- mean(ref, na.rm = TRUE)
}
}
pred_lw <- predict(fit_lw, newdata = X_test_imp)
cat("Listwise-deletion lm RMSE:", round(rmse(y_test, pred_lw), 2), "\n")
#> Listwise-deletion lm RMSE: 151.9
cat("Training rows used : BRM =", nrow(train), " listwise =",
nrow(complete_train), "\n")
#> Training rows used : BRM = 13034 listwise = 3320BRM keeps all training rows (splitting them into per-pattern
subsets); the listwise baseline throws away any row with at least one
NA.
If you use BRM in your work, please cite:
Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.
Run citation("blockwise") to get a ready-to-paste BibTeX
entry.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.