The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

All model runner options

This article breaks down all the options available when running mbg models. For a summary of these terms, see the documentation for the mbg::MBGModelRunner$new() method


1: MBG model basics

The model always requires two terms: input_data, which includes all point observations of the outcome to be estimated, and id_raster, which lays out the study area.

1.A: Input data

Formatted as a data.frame or data.table::data.table. Should contain at least the following fields:

1.B: ID raster

A terra::SpatRaster object meeting the following requirements:

An ID raster can be created using the mbg::build_id_raster function.

Before running a model, you could use the terra::extract function to ensure that all points in your input data overlap a non-NA pixel in the ID raster.


3: Model effects

The model currently has four effect types which can be toggled and controlled via settings passed to mbg::MbgModelRunner.

3.A: Covariate effects

Relevant settings:

A covariate effect will only be included if use_covariates is TRUE (the default) and covariate_rasters are passed. The covariate_rasters are an optional list of terra::SpatRaster pixel-level predictive covariates. They can be incorporated into the model in two different ways depending on the value of use_stacking:

3.A.i: Standard covariate effect

Only applied if a covariate effect is included and use_stacking is FALSE (the default).

The covariate effect at observation \(i\) is \(\gamma^{covariates}_i = \vec{\beta}X_{s_i}\), where \(\vec{\beta}\) are linear effects on the matrix of covariate values \(X\) evaluated at the location of observation \(i\) (\(s_i\)).

Note that an intercept is not included by default. If you want a model with no covariate effects other than an intercept, pass a covariate_rasters with an intercept raster containing all 1s.

A prior is applied to the variance of effects on all covariates other than the intercept: prior_covariate_effect (default list(threshold = 3, prob_above = 0.05)) is a penalized complexity prior that can be expressed as a level of certainty about the standard deviation on each fixed effect \(\beta\). For example, the default prior corresponds to \(P(\sigma_{\beta} > 3) = 0.05\).

3.A.ii: Stacked ensemble model

Only applied if a covariate effect is included and use_stacking is TRUE.

For a stacked ensemble model, the covariate effect for observation \(i\) is: \[ \gamma^{covariates}_i = \sum_{j=1}^{J}\left[ w_{j} f_j(X_{s_i}) \right] \\ Constraints: w_j > 0 \ \forall \ j, \ \textstyle \sum_{j=1}^{J}(w_j) = 1 \] Where:

  • \(\vec{f}\): Predictions from a set of J regression models fit to the raw covariate data \(X\)
  • \(\vec{w}\): A weighting vector corresponding to each regression model \(f_j\). The weights are constrained to be strictly greater than zero and to sum to one.
  • \(X_{s_i}\): The raw covariate values at the location of observation \(i\), \(s_i\)

Relevant model settings:

  • stacking_model_settings (default list(gbm = NULL, treebag = NULL, rf = NULL)): Defines the list of component models \(f_j(X)\) to be fitted to the covariates. A named list—each name corresponds to a regression model in the caret package, and each value stores optional settings that can be passed to that model.
  • stacking_cv_settings (default list(method = 'repeatedcv', number = 5, repeats = 5)): These are used by caret::traincontrol to cross-validate each regression model
  • stacking_use_admin_bounds (default FALSE), admin_bounds (default NULL), admin_bounds_id (default NULL): If stacking_use_admin_bounds is TRUE and the other two values are set, adds administrative fixed effects to each of the component models.
  • stacking_prediction_range (default NULL): Can be used to restrict the prediction range of each component regression model. For binomial data, a reasonable limit is to not predict outside of c(0, 1).

3.B: Gaussian process

If the setting use_gp is TRUE (the default), adds a spatially correlated effect:

\[ Z \sim GP(0, \Sigma_s) \] Where \(Z\) is a Gaussian process with mean zero and stationary isotropic Matern covariance over space (\(\Sigma_s\)).

The Gaussian process is informed by priors on the range and variance:

To simplify estimation, the R-INLA package represents the continuous Gaussian process on a 2D spatial mesh. Three more settings control the mesh:

For more details about the INLA approach to approximate Gaussian process regression, see the papers at the bottom of this page.


3.C: Administrative-level effect

This effect is a random intercept grouped by administrative unit. The administrative level (polygon boundaries) of interest can be set by the user. If the effect is on, then the following term is added:

\[ \gamma^{admin}_{a_i} \sim N(0, \sigma^2_{admin}) \] In other words, \(\vec\gamma^{admin}\) is an vector of random intercepts with length equal to the total number of administrative units, IID normal with mean 0 and variance \(\sigma^2_{admin}\). All observations \(i\) in the same administrative division \(a\) share the same intercept \(\gamma^{admin}_{a_i}\).

Relevant settings:


3.D: Nugget

The nugget is an independently and identically distributed (IID) normal effect applied to each observation. It corresponds to “irreducible variation” not captured by any other model effect:

\[ \gamma^{Nugget}_i \sim N(0, \sigma^2_{nugget}) \]

Relevant settings:


4: Aggregation to polygon boundaries

As shown in the introductory tutorial, the mbg::MbgModelRunner object can automatically aggregate predictions to administrative boundaries. The following three objects are required to perform aggregation:


5: Logging

Finally, the setting verbose (default TRUE) governs whether the model will perform detailed logging. You can access model logs afterwards by running mbg::logging_get_timer_log.


6: Further reading

Bakka, H., et al. (2018). Spatial modeling with R‐INLA: A review. Wiley Interdisciplinary Reviews: Computational Statistics, 10(6), e1443. https://doi.org/10.1002/wics.1443

Bhatt, S., Cameron, E., Flaxman, S. R., Weiss, D. J., Smith, D. L., & Gething, P. W. (2017). Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization. Journal of The Royal Society Interface, 14(134), 20170520. https://doi.org/10.1098/rsif.2017.0520

Freeman, M. (2017). An introduction to hierarchical modeling. http://mfviz.com/hierarchical-models/

Moraga, Paula. (2019). Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. ISBN 9780367357955. https://www.paulamoraga.com/book-geospatial/index.html

Opitz, T. (2017). Latent Gaussian modeling and INLA: A review with focus on space-time applications. Journal de la société française de statistique, 158(3), 62-85. https://www.numdam.org/article/JSFS_2017__158_3_62_0.pdf

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.