
Comparing modeling engines

Overview

lineagefreq provides multiple modeling engines through the unified fit_model() interface. This vignette shows how to compare them using the built-in backtesting framework.

Three engines are available in v0.1.0:

- mlr: multinomial logistic regression (the default)
- piantham: MLR with a translation to relative effective reproduction numbers
- hier_mlr: hierarchical MLR for multi-location data

A fourth engine, garw, is planned for v0.2.

Setup

library(lineagefreq)

Engine 1: MLR

The default engine fits a multinomial logistic regression with one growth rate parameter per non-reference lineage.

data(sarscov2_us_2022)
x <- lfq_data(sarscov2_us_2022,
              lineage = variant, date = date,
              count = count, total = total)

fit_mlr <- fit_model(x, engine = "mlr")
growth_advantage(fit_mlr, type = "growth_rate")
#> # A tibble: 5 × 6
#>   lineage estimate lower upper type        pivot
#>   <chr>      <dbl> <dbl> <dbl> <chr>       <chr>
#> 1 BA.1       0     0     0     growth_rate BA.1 
#> 2 BA.2       0.231 0.229 0.232 growth_rate BA.1 
#> 3 BA.4/5     0.400 0.398 0.402 growth_rate BA.1 
#> 4 BQ.1       0.352 0.346 0.357 growth_rate BA.1 
#> 5 Other      0.151 0.149 0.152 growth_rate BA.1
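Under the hood, the fitted frequencies follow a multinomial-logit (softmax) curve in time. A toy sketch of that functional form in base R, using made-up intercepts and the BA.2 growth rate from above (illustration only, not the package's internals):

```r
# Multinomial-logit functional form (illustrative, not package code):
#   p_l(t) = exp(a_l + r_l * t) / sum_k exp(a_k + r_k * t),
# with the pivot lineage's growth rate fixed at 0.
a <- c(BA.1 = 0, BA.2 = -2)      # hypothetical intercepts
r <- c(BA.1 = 0, BA.2 = 0.231)   # growth rates; pivot (BA.1) fixed at 0
t <- 0:3
num <- exp(a["BA.2"] + r["BA.2"] * t)
den <- exp(a["BA.1"] + r["BA.1"] * t) + num
round(num / den, 3)
```

With a positive growth rate, the BA.2 frequency rises monotonically toward 1 as t grows.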

Engine 2: Piantham

The Piantham engine wraps MLR and translates growth rates to relative effective reproduction numbers using a specified mean generation time.

fit_pian <- fit_model(x, engine = "piantham",
                      generation_time = 5)
growth_advantage(fit_pian, type = "relative_Rt",
                 generation_time = 5)
#> # A tibble: 5 × 6
#>   lineage estimate lower upper type        pivot
#>   <chr>      <dbl> <dbl> <dbl> <chr>       <chr>
#> 1 BA.1        1     1     1    relative_Rt BA.1 
#> 2 BA.2        1.18  1.18  1.18 relative_Rt BA.1 
#> 3 BA.4/5      1.33  1.33  1.33 relative_Rt BA.1 
#> 4 BQ.1        1.29  1.28  1.29 relative_Rt BA.1 
#> 5 Other       1.11  1.11  1.11 relative_Rt BA.1
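As a back-of-envelope check of the translation, the relative Rt values appear to follow the exponential-growth approximation exp(r_daily × g). The per-week scaling below is an assumption inferred from the numbers above, not documented behavior:

```r
# Assumed mapping (inferred, not documented): growth rates are per week,
# so the per-day rate is r / 7, and relative Rt = exp(r_daily * g).
r_weekly <- 0.231   # BA.2 growth rate from the MLR fit above
g <- 5              # mean generation time in days
round(exp(r_weekly / 7 * g), 2)
#> [1] 1.18
```

This matches the BA.2 entry in the table above.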

Comparing fit statistics

glance() returns a one-row summary for each model. Since Piantham is a wrapper around MLR, the log-likelihood and AIC are identical.

dplyr::bind_rows(
  glance(fit_mlr),
  glance(fit_pian)
)
#> # A tibble: 2 × 10
#>   engine   n_lineages n_timepoints   nobs    df   logLik     AIC     BIC pivot
#>   <chr>         <int>        <int>  <int> <int>    <dbl>   <dbl>   <dbl> <chr>
#> 1 mlr               5           40 461424     8 -465911. 931838. 931852. BA.1 
#> 2 piantham          5           40 461424     8 -465911. 931838. 931852. BA.1 
#> # ℹ 1 more variable: convergence <int>

Backtesting

The backtest() function implements rolling-origin evaluation. At each origin date, the model is fit on past data and forecasts are compared to held-out future observations.

bt <- backtest(x,
  engines = c("mlr", "piantham"),
  horizons = c(7, 14, 21),
  min_train = 56,
  generation_time = 5
)
bt
#> 
#> ── Backtest results
#> 900 predictions across 31 origins
#> Engines: "mlr, piantham"
#> Horizons: 7, 14, 21 days
#> 
#> # A tibble: 900 × 9
#>    origin_date target_date horizon engine lineage predicted lower upper observed
#>  * <date>      <date>        <int> <chr>  <chr>       <dbl> <dbl> <dbl>    <dbl>
#>  1 2022-03-05  2022-03-12        7 mlr    BA.1         0.44 0.35   0.53 0.439   
#>  2 2022-03-05  2022-03-12        7 mlr    BA.2         0.18 0.115  0.26 0.172   
#>  3 2022-03-05  2022-03-12        7 mlr    BA.4/5       0    0      0.02 0.00577 
#>  4 2022-03-05  2022-03-12        7 mlr    BQ.1         0    0      0.01 0.000406
#>  5 2022-03-05  2022-03-12        7 mlr    Other        0.37 0.28   0.46 0.382   
#>  6 2022-03-05  2022-03-19       14 mlr    BA.1         0.39 0.31   0.49 0.399   
#>  7 2022-03-05  2022-03-19       14 mlr    BA.2         0.21 0.13   0.28 0.199   
#>  8 2022-03-05  2022-03-19       14 mlr    BA.4/5       0.01 0      0.03 0.00799 
#>  9 2022-03-05  2022-03-19       14 mlr    BQ.1         0    0      0.01 0.000543
#> 10 2022-03-05  2022-03-19       14 mlr    Other        0.39 0.29   0.48 0.394   
#> # ℹ 890 more rows
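The rolling-origin scheme can be sketched in a few lines of base R. The date range below is hypothetical, and this is an illustration of the splitting logic, not the package's implementation:

```r
# Rolling-origin evaluation: every weekly date with at least `min_train`
# days of history becomes an origin, paired with each forecast horizon.
dates     <- seq(as.Date("2022-01-08"), as.Date("2022-10-08"), by = "week")
min_train <- 56
horizons  <- c(7, 14, 21)

origins <- dates[dates >= min(dates) + min_train]
splits  <- expand.grid(origin_date = origins, horizon = horizons)
splits$target_date <- splits$origin_date + splits$horizon
head(splits, 3)
```

Each (origin, horizon) pair yields one forecast target, which is why the results table has one row per origin, horizon, engine, and lineage.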

Scoring

score_forecasts() computes standardized accuracy metrics.

sc <- score_forecasts(bt,
  metrics = c("mae", "coverage"))
sc
#> # A tibble: 12 × 4
#>    engine   horizon metric     value
#>    <chr>      <int> <chr>      <dbl>
#>  1 mlr            7 mae      0.00432
#>  2 mlr            7 coverage 1      
#>  3 mlr           14 mae      0.00398
#>  4 mlr           14 coverage 1      
#>  5 mlr           21 mae      0.00452
#>  6 mlr           21 coverage 1      
#>  7 piantham       7 mae      0.00418
#>  8 piantham       7 coverage 1      
#>  9 piantham      14 mae      0.00421
#> 10 piantham      14 coverage 1      
#> 11 piantham      21 mae      0.00432
#> 12 piantham      21 coverage 1
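For intuition, both metrics can be computed by hand from the first three predictions in the backtest output above (illustrative definitions; the package may group or weight differently):

```r
# MAE: mean absolute error between predicted and observed frequencies.
# Coverage: share of observed values inside the [lower, upper] interval.
predicted <- c(0.44, 0.18, 0)
observed  <- c(0.439, 0.172, 0.00577)
lower     <- c(0.35, 0.115, 0)
upper     <- c(0.53, 0.26, 0.02)

mae      <- mean(abs(predicted - observed))
coverage <- mean(observed >= lower & observed <= upper)
```

All three observations fall inside their intervals, so coverage is 1, consistent with the scored results above.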

Model ranking

compare_models() summarizes scores for each engine-horizon combination, sorted by MAE.

compare_models(sc, by = c("engine", "horizon"))
#> # A tibble: 6 × 4
#>   engine   horizon     mae coverage
#>   <chr>      <int>   <dbl>    <dbl>
#> 1 mlr           14 0.00398        1
#> 2 piantham       7 0.00418        1
#> 3 piantham      14 0.00421        1
#> 4 mlr            7 0.00432        1
#> 5 piantham      21 0.00432        1
#> 6 mlr           21 0.00452        1

Visualization

plot_backtest(sc)

When to use which engine

Scenario                          Recommended engine
Single location, quick estimate   mlr
Need relative Rt interpretation   piantham
Multiple locations, sparse data   hier_mlr
Time-varying fitness (v0.2)       garw

Hierarchical MLR

When data spans multiple locations with unequal sequencing depth, hier_mlr shrinks location-specific estimates toward the global mean. This stabilizes estimates for low-data locations.

A demonstration requires multi-location data, which the built-in single-location dataset does not provide. See ?fit_model for an example with simulated multi-location data.
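The effect of shrinkage can be illustrated with a toy partial-pooling calculation in base R (all numbers are made up; this is not the package's estimator):

```r
# Partial pooling: each location's estimate is pulled toward the global
# mean, most strongly where data are sparse. All values are hypothetical.
est <- c(CA = 0.40, WY = 0.10)   # per-location growth rate estimates
n   <- c(CA = 5000, WY = 40)     # sequence counts per location
k   <- 200                       # prior strength (assumed)

global <- weighted.mean(est, n)
shrunk <- (n * est + k * global) / (n + k)
round(shrunk, 3)
```

The low-data location (WY) moves substantially toward the global mean, while the high-data location (CA) barely moves, which is the stabilizing behavior described above.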
