Getting started with lineagefreq

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Overview

lineagefreq models pathogen lineage frequency dynamics from genomic surveillance count data. Given a table of lineage-resolved sequence counts over time, the package estimates relative growth advantages, generates short-term frequency forecasts, and provides tools for evaluating model accuracy.

This vignette demonstrates the core workflow using simulated SARS-CoV-2 surveillance data.

Preparing data

The entry point is lfq_data(), which validates and standardizes a count table. The minimum input is a data frame with columns for date, lineage name, and sequence count.

library(lineagefreq)

data(sarscov2_us_2022)
head(sarscov2_us_2022)
#>         date variant count total
#> 1 2022-01-08    BA.1 11802 14929
#> 2 2022-01-08    BA.2   565 14929
#> 3 2022-01-08  BA.4/5     4 14929
#> 4 2022-01-08    BQ.1     0 14929
#> 5 2022-01-08   Other  2558 14929
#> 6 2022-01-15    BA.1 10342 13672

x <- lfq_data(sarscov2_us_2022,
              lineage = variant,
              date    = date,
              count   = count,
              total   = total)
x
#> 
#> ── Lineage frequency data
#> 5 lineages, 40 time points
#> Date range: 2022-01-08 to 2022-10-08
#> Lineages: "BA.1, BA.2, BA.4/5, BQ.1, Other"
#> 
#> # A tibble: 200 × 7
#>    .date      .lineage .count total .total    .freq .reliable
#>  * <date>     <chr>     <int> <int>  <int>    <dbl> <lgl>    
#>  1 2022-01-08 BA.1      11802 14929  14929 0.791    TRUE     
#>  2 2022-01-08 BA.2        565 14929  14929 0.0378   TRUE     
#>  3 2022-01-08 BA.4/5        4 14929  14929 0.000268 TRUE     
#>  4 2022-01-08 BQ.1          0 14929  14929 0        TRUE     
#>  5 2022-01-08 Other      2558 14929  14929 0.171    TRUE     
#>  6 2022-01-15 BA.1      10342 13672  13672 0.756    TRUE     
#>  7 2022-01-15 BA.2        608 13672  13672 0.0445   TRUE     
#>  8 2022-01-15 BA.4/5        3 13672  13672 0.000219 TRUE     
#>  9 2022-01-15 BQ.1          0 13672  13672 0        TRUE     
#> 10 2022-01-15 Other      2719 13672  13672 0.199    TRUE     
#> # ℹ 190 more rows

The function computes frequencies, flags low-count time points, and returns a validated lfq_data object.

Fitting a model

fit_model() provides a unified interface. The default engine is multinomial logistic regression (MLR).

fit <- fit_model(x, engine = "mlr")
fit
#> Lineage frequency model (mlr)
#> 5 lineages, 40 time points
#> Date range: 2022-01-08 to 2022-10-08
#> Pivot: "BA.1"
#> 
#> Growth rates (per 7-day unit):
#>   ↑ BA.2: 0.2308
#>   ↑ BA.4/5: 0.4002
#>   ↑ BQ.1: 0.3516
#>   ↑ Other: 0.1506
#> 
#> AIC: 9e+05; BIC: 9e+05

The print output shows each lineage’s estimated growth rate relative to the pivot (reference) lineage, which is auto-selected as the most prevalent lineage early in the time series.

Extracting growth advantages

growth_advantage() converts growth rates into interpretable metrics. Four output types are available.

ga <- growth_advantage(fit,
                       type = "relative_Rt",
                       generation_time = 5)
ga
#> # A tibble: 5 × 6
#>   lineage estimate lower upper type        pivot
#>   <chr>      <dbl> <dbl> <dbl> <chr>       <chr>
#> 1 BA.1        1     1     1    relative_Rt BA.1 
#> 2 BA.2        1.18  1.18  1.18 relative_Rt BA.1 
#> 3 BA.4/5      1.33  1.33  1.33 relative_Rt BA.1 
#> 4 BQ.1        1.29  1.28  1.29 relative_Rt BA.1 
#> 5 Other       1.11  1.11  1.11 relative_Rt BA.1

A relative Rt above 1 indicates a lineage growing faster than the reference. The confidence intervals are derived from the Fisher information matrix.

Visualizing the fit

autoplot() supports four plot types for fitted models.

autoplot(fit, type = "frequency")

autoplot(fit, type = "advantage", generation_time = 5)

Forecasting

forecast() projects frequencies forward with uncertainty quantified by parametric simulation.

fc <- forecast(fit, horizon = 28)
autoplot(fc)
#> Warning in ggplot2::scale_x_date(date_labels = "%Y-%m-%d"): A <numeric> value was passed to a Date scale.
#> ℹ The value was converted to a <Date> object.

Detecting emerging lineages

summarize_emerging() tests each lineage for statistically significant frequency increases.

summarize_emerging(x)
#> # A tibble: 4 × 10
#>   lineage first_seen last_seen  n_timepoints current_freq growth_rate p_value
#>   <chr>   <date>     <date>            <int>        <dbl>       <dbl>   <dbl>
#> 1 BA.2    2022-01-08 2022-10-08           40       0.156      0.00476       0
#> 2 BA.4/5  2022-01-08 2022-10-08           40       0.797      0.0293        0
#> 3 BQ.1    2022-01-08 2022-10-08           40       0.0176     0.0143        0
#> 4 Other   2022-01-08 2022-10-08           40       0.0293    -0.00489       0
#> # ℹ 3 more variables: p_adjusted <dbl>, significant <lgl>, direction <chr>

Next steps

Compare multiple engines with backtest() — see vignette("model-comparison").
Run a full surveillance workflow — see vignette("surveillance-workflow").

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.