Real Data Examples with Different Distributions

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Frédéric Bertrand

2025-11-21

This vignette demonstrates SelectBoost.gamlss on real datasets using different distributions. We include a growth example (Dutch boys) and further count/continuous/binary cases.

What you’ll learn

How to configure scopes and engines for Box–Cox t growth models, insurance claims, and quine attendance.
How to use selection_table() and effect_plot() to interpret the resulting fits.
How to adjust bootstrap settings (B, pi_thr, parallel) for heavier real-world analyses.

You may want to install optional engines for best results: install.packages(c("glmnet","grpreg","SGL","future.apply"))

library(SelectBoost.gamlss)
library(gamlss)
library(MASS)           # Insurance, quine
set.seed(1)

1. Growth data (BCT) — Dutch boys

We model height vs age with BCT (Box–Cox t). Grouped penalties select among smoothers and interactions with pubertal stage.

Increase B to 50 for real use.

set.seed(123)
# Load and CLEAN the Dutch boys growth data
utils::data("boys7482", package = "SelectBoost.gamlss")
# Keep only rows complete
boys <- boys7482[stats::complete.cases(boys7482),]
boys$gen <- as.factor(boys$gen)
# (Optional) keep levels actually present
boys <- droplevels(boys)

# Fit SelectBoost.gamlss on CLEANED data (no parallel for vignette stability)
fit_growth <- sb_gamlss(
  hgt ~ 1, data = boys, family = gamlss.dist::BCT(),
  mu_scope    = ~ pb(age) + pb(age):gen,
  sigma_scope = ~ pb(age),
  nu_scope    = ~ pb(age),
  engine = "grpreg", engine_sigma = "grpreg", engine_nu = "grpreg",
  grpreg_penalty = "grLasso",
  B = 1, pi_thr = 0.6, pre_standardize = TRUE,
  parallel = "none", trace = FALSE
)

# Peek at selection (first rows)
print(utils::head(selection_table(fit_growth), 12))
#>   parameter        term count prop
#> 1        mu     pb(age)     0    0
#> 2        mu pb(age):gen     0    0
#> 3     sigma     pb(age)     0    0
#> 4        nu     pb(age)     0    0

# Effect plot for age on mu
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(effect_plot(fit_growth, "age", boys, what = "mu"))
}

Increase B to 50 for real use.

set.seed(123)
# Stability curves over a small c0 grid (still on CLEANED data)
  g <- sb_gamlss_c0_grid(
    hgt ~ 1, data = boys, family = gamlss.dist::BCT(),
    mu_scope = ~ pb(age) + pb(age):gen,
    sigma_scope = ~ pb(age),
    nu_scope = ~ pb(age),
    c0_grid = seq(0.2, 0.8, by = 0.1),
    B = 1, pi_thr = 0.6, pre_standardize = TRUE,
    parallel = "none", trace = FALSE
  )
  plot_stability_curves(g, terms = c("pb(age)", "pb(age):gen"), parameter = "mu")

Interpretation. Expect pb(age) to be stably selected for μ and σ; interactions like pb(age):gen often appear around puberty.

2. Count data (PO) — Insurance claims

Fit Poisson with log-offset for exposure.

data(Insurance, package = "MASS")
ins <- transform(Insurance, logH = log(Holders))

fit_po <- sb_gamlss(
  Claims ~ offset(logH), data = ins, family = gamlss.dist::PO(),
  mu_scope = ~ Age + District + Group,
  engine = "glmnet", glmnet_alpha = 1,    # lasso
  B = 100, pi_thr = 0.6, pre_standardize = FALSE,
  parallel = "auto", trace = FALSE
)

selection_table(fit_po)
#>   parameter     term count prop
#> 1        mu      Age   100    1
#> 2        mu District   100    1
#> 3        mu    Group   100    1

3. Positive continuous (GA) — Old Faithful

Use Gamma for positive waiting times vs eruption length.

data(faithful)
faith <- transform(faithful, eru2 = eruptions^2)

fit_ga <- sb_gamlss(
  waiting ~ 1, data = faith, family = gamlss.dist::GA(),
  mu_scope    = ~ pb(eruptions) + eru2,
  sigma_scope = ~ pb(eruptions),
  engine = "glmnet", glmnet_alpha = 0.5,   # elastic-net
  B = 60, pi_thr = 0.6, pre_standardize = TRUE,
  parallel = "auto", trace = FALSE
)

selection_table(fit_ga)
#>   parameter          term count prop
#> 1        mu pb(eruptions)    60    1
#> 2        mu          eru2    60    1
#> 3     sigma pb(eruptions)    60    1

# Effect plot for eruptions on mu
if (requireNamespace('ggplot2', quietly = TRUE)) {
  print(effect_plot(fit_ga, 'eruptions', faith, what = 'mu'))
}

4. Binary (BI) — `mtcars` transmission

Treat am (0/1) as Bernoulli.

mt <- transform(mtcars,
  am = as.integer(am),
  cyl = factor(cyl), gear = factor(gear), carb = factor(carb), vs = factor(vs))

fit_bin <- sb_gamlss(
  am ~ 1, data = mt, family = gamlss.dist::BI(),
  mu_scope = ~ wt + hp + qsec + cyl + gear + carb + vs,
  engine = "grpreg", grpreg_penalty = "grLasso",
  B = 80, pi_thr = 0.6, pre_standardize = TRUE,
  parallel = "auto", trace = FALSE
)

selection_table(fit_bin)
#>   parameter term count prop
#> 1        mu   wt     0    0
#> 2        mu   hp     0    0
#> 3        mu qsec     0    0
#> 4        mu  cyl     0    0
#> 5        mu gear     0    0
#> 6        mu carb     0    0
#> 7        mu   vs     0    0

5. Overdispersed counts (NBII) — `quine` absences

Negative Binomial II handles extra-Poisson variation.

data(quine, package = "MASS")
fit_nb2 <- sb_gamlss(
  Days ~ 1, data = quine, family = gamlss.dist::NBII(),
  mu_scope = ~ Eth + Sex + Lrn + Age,
  engine = "grpreg", grpreg_penalty = "grLasso",
  B = 80, pi_thr = 0.6, pre_standardize = FALSE,
  parallel = "auto", trace = FALSE
)

selection_table(fit_nb2)
#>   parameter term count prop
#> 1        mu  Eth     0    0
#> 2        mu  Sex     0    0
#> 3        mu  Lrn     0    0
#> 4        mu  Age     0    0

Tips

If you have many factor levels or smoothers, prefer engine="grpreg"/"sgl" (group-aware).
For very wide numeric designs, try engine="glmnet" with glmnet_alpha in \[0,1\].
Use pre_standardize=TRUE (fast C++ scaler) when predictors are numeric and on different scales.
For speed, enable parallel workers: future::plan(multisession, workers = 4); parallel="auto".

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Real Data Examples with Different Distributions

Frédéric Bertrand

2025-11-21

What you’ll learn

1. Growth data (BCT) — Dutch boys

2. Count data (PO) — Insurance claims

3. Positive continuous (GA) — Old Faithful

4. Binary (BI) — mtcars transmission

5. Overdispersed counts (NBII) — quine absences

Tips

4. Binary (BI) — `mtcars` transmission

5. Overdispersed counts (NBII) — `quine` absences