Title: | Machine Learning and Mapping for Spatial Epidemiology |
Version: | 0.1.0 |
Description: | Provides tools for the integration, visualisation, and modelling of spatial epidemiological data using the method described in Azeez, A., & Noel, C. (2025). 'Predictive Modelling and Spatial Distribution of Pancreatic Cancer in Africa Using Machine Learning-Based Spatial Model' <doi:10.5281/zenodo.16529986> and <doi:10.5281/zenodo.16529016>. It facilitates the analysis of geographic health data by combining modern spatial mapping tools with advanced machine learning (ML) algorithms. 'mlspatial' enables users to import and pre-process shapefile and associated demographic or disease incidence data, generate richly annotated thematic maps, and apply predictive models, including Random Forest, 'XGBoost', and Support Vector Regression, to identify spatial patterns and risk factors. It is suited for spatial epidemiologists, public health researchers, and GIS analysts aiming to uncover hidden geographic patterns in health-related outcomes and inform evidence-based interventions. |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, tidyr, kernlab, writexl, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
Depends: | R (≥ 4.1) |
Imports: | sf, readxl, dplyr, ggplot2, randomForest, xgboost, e1071, caret, tmap, spdep, ggpubr, stats, methods |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2025-08-21 07:19:53 UTC; azeez |
Author: | Adeboye Azeez [aut, cre], Colin Noel [aut] |
Maintainer: | Adeboye Azeez <azizadeboye@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-08-26 19:40:02 UTC |
Africa shapefile data
Description
A dataset containing spatial polygons of Africa.
Usage
africa_shp
Format
An sf object with spatial features.
Source
Your data source
Africa shapefile data 2
Description
A dataset containing spatial polygons of Africa.
Usage
africa_shps
Format
An sf object with spatial features.
Source
Your data source
Compute Moran's I & LISA, classify clusters
Description
Computes global and local Moran’s I to assess spatial autocorrelation and classifies observations into spatial cluster types (e.g., High-High).
Usage
compute_spatial_autocorr(sf_data, values, signif = 0.05)
Arguments
sf_data |
An sf object. |
values |
A numeric vector or column name with the variable to test. |
signif |
Numeric significance level threshold for clusters (default 0.05). |
Value
A named list with elements:
- data: An sf object with added columns for standardized values, spatial lag, local Moran's I values, z-scores, p-values, and cluster classification.
- moran: An object of class htest with global Moran's I test results.
Examples
library(sf)
library(spdep)
library(dplyr)
#Load and prepare spatial data
mapdata <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
mapdata <- st_make_valid(mapdata)
#Variable to analyze
values <- rnorm(nrow(mapdata))
#Run function
result <- compute_spatial_autocorr(mapdata, values, signif = 0.05)
#Inspect results
head(result$data)
result$moran
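The steps named in the description (standardise, lag, test, classify) can be sketched with 'spdep' as follows. This is an illustration only, not the package's implementation: the helper name sketch_lisa, the queen-contiguity neighbours, the row-standardised weights, and the added column names are all assumptions.

```r
library(sf)
library(spdep)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_lisa <- function(sf_data, values, signif = 0.05) {
  nb <- poly2nb(sf_data, queen = TRUE)                 # queen contiguity (assumed)
  lw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # row-standardised weights (assumed)

  z <- as.numeric(scale(values))                       # standardised values
  lag_z <- lag.listw(lw, z, zero.policy = TRUE)        # spatial lag of z

  global <- moran.test(values, lw, zero.policy = TRUE) # global test, class "htest"
  local <- localmoran(values, lw, zero.policy = TRUE)  # one row per observation
  p_val <- as.numeric(local[, 5])                      # fifth column holds the p-value

  # Quadrant of the Moran scatterplot, kept only where locally significant
  sig <- !is.na(p_val) & p_val <= signif
  cluster <- rep("Not significant", length(z))
  cluster[sig & z > 0 & lag_z > 0] <- "High-High"
  cluster[sig & z < 0 & lag_z < 0] <- "Low-Low"
  cluster[sig & z > 0 & lag_z < 0] <- "High-Low"
  cluster[sig & z < 0 & lag_z > 0] <- "Low-High"

  sf_data$local_i <- as.numeric(local[, 1])            # local Moran's I
  sf_data$p_value <- p_val
  sf_data$cluster <- cluster
  list(data = sf_data, moran = global)
}

nc <- st_make_valid(st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE))
set.seed(42)
res <- sketch_lisa(nc, rnorm(nrow(nc)))
table(res$data$cluster)
```

With random input such as rnorm(), most observations would be expected to fall in the "Not significant" class.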
Get RMSE/MAE/R² metrics on training data
Description
Evaluates model performance by calculating RMSE, MAE, and R² metrics.
Usage
eval_model(model, data, formula, model_type = c("rf", "xgb", "svr"))
Arguments
model |
A trained model |
data |
A data frame |
formula |
A formula object |
model_type |
Character string: one of "rf", "xgb", or "svr" |
Value
A numeric value representing the model's accuracy
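For reference, the three metrics named in the description can be computed from observed and predicted vectors in base R. The helper name sketch_metrics is hypothetical, and R² is taken here as the coefficient of determination (1 - SSE/SST), which is an assumption; caret::postResample, used elsewhere in this manual, reports the squared correlation instead, so the two can differ.

```r
# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_metrics <- function(observed, predicted) {
  res <- observed - predicted
  rmse <- sqrt(mean(res^2))                 # root mean squared error
  mae <- mean(abs(res))                     # mean absolute error
  r2 <- 1 - sum(res^2) / sum((observed - mean(observed))^2)  # 1 - SSE/SST
  c(RMSE = rmse, MAE = mae, R2 = r2)
}

obs <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)
m <- sketch_metrics(obs, pred)
print(m)  # RMSE = sqrt(4.5), MAE = 2, R2 = 0.964
```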
Declare known global variables to suppress R CMD check NOTE Global variables used in evaluation functions
Description
This is to suppress R CMD check notes about undefined global variables.
Join spatial and incidence datasets
Description
Join spatial and incidence datasets
Usage
join_data(sf_data, tbl_data, by)
Arguments
sf_data |
sf object |
tbl_data |
tibble of incidence |
by |
Column name to join on |
Value
sf object with joined attributes
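A join of this kind can be sketched with dplyr::left_join, which keeps the sf class and geometry of the left-hand object. Whether join_data uses a left join internally is an assumption, and the toy objects pts and tbl below are made up for illustration.

```r
library(sf)
library(dplyr)

# Toy spatial layer: three points keyed by NAME (illustrative data)
pts <- st_sf(
  NAME = c("A", "B", "C"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)),
                    crs = 4326)
)

# Toy attribute table with one key missing
tbl <- tibble(NAME = c("A", "B"), incidence = c(1.5, 2.5))

# Left join: every feature is kept, unmatched attributes become NA
joined <- left_join(pts, tbl, by = "NAME")
print(joined)
```

Features without a match in the table keep their geometry and receive NA attributes, so it is worth checking key spelling (for example country names) before modelling.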
Load incidence data from Excel
Description
Load incidence data from Excel
Usage
load_incidence_data(xlsx_path)
Arguments
xlsx_path |
Path to Excel file |
Value
tibble of data
Load shapefile as sf + optionally convert to sp
Description
Load shapefile as sf + optionally convert to sp
Usage
load_shapefile(shp_path, to_sp = FALSE)
Arguments
shp_path |
Path to shapefile (.shp) |
to_sp |
logical: also return Spatial object? |
Value
list with sf and optionally sp object
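A rough equivalent of the documented behaviour, assuming the file is read with sf::st_read and the optional conversion uses sf::as_Spatial (which needs the 'sp' package installed). The helper name sketch_load_shapefile and the list element names sf and sp are assumptions.

```r
library(sf)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_load_shapefile <- function(shp_path, to_sp = FALSE) {
  sf_obj <- st_read(shp_path, quiet = TRUE)  # read as sf
  out <- list(sf = sf_obj)
  if (to_sp) {
    out$sp <- as_Spatial(sf_obj)             # optional Spatial* copy (requires 'sp')
  }
  out
}

res_shp <- sketch_load_shapefile(system.file("shape/nc.shp", package = "sf"))
class(res_shp$sf)
```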
Examples for model evaluation functions
Description
Examples for model evaluation functions
Examples
library(randomForest)
library(caret)
data(panc_incidence)
mapdata <- join_data(africa_shp, panc_incidence, by = "NAME")
rf_model <- randomForest(incidence ~ female + male + agea + ageb + agec + fagea + fageb + fagec +
magea + mageb + magec + yrb + yrc + yrd + yre, data = mapdata, ntree = 500,
importance = TRUE)
rf_preds <- predict(rf_model, newdata = mapdata)
rf_metrics <- postResample(pred = rf_preds, obs = mapdata$incidence)
print(rf_metrics)
Pancreatic Cancer Incidence Data
Description
This dataset contains pancreatic cancer incidence rates across African countries.
Usage
data(panc_incidence)
Format
A data frame with the following variables:
- NAME
Character. Name of the country.
- incidence
Double. Incidence rate per 100,000 population.
- female
Double. Female pancreatic cancer patients.
- male
Double. Male pancreatic cancer patients.
- ageb
Double. Patients aged 20-54 years.
- agec
Double. Patients aged above 55 years.
- agea
Double. Patients aged below 20 years.
- fageb
Double. Female patients aged 20-54 years.
- fagec
Double. Female patients aged above 55 years.
- fagea
Double. Female patients aged below 20 years.
- mageb
Double. Male patients aged 20-54 years.
- magec
Double. Male patients aged above 55 years.
- magea
Double. Male patients aged below 20 years.
- yra
Double. Incidence rate in year 2017.
- yrb
Double. Incidence rate in year 2018.
- yrc
Double. Incidence rate in year 2019.
- yrd
Double. Incidence rate in year 2020.
- yre
Double. Incidence rate in year 2021.
Source
Global Burden of Disease (GBD) 2021 estimates, Seattle, United States https://vizhub.healthdata.org/gbd-results/
Pancreatic Cancer Prevalence Data
Description
This dataset contains pancreatic cancer prevalence rates across African countries.
Usage
data(panc_prevalence)
Format
A data frame with the following variables:
- NAME
Character. Name of the country.
- prevalence
Numeric. Prevalence rate per 100,000 population.
- female
Numeric. Female pancreatic cancer patients.
- male
Numeric. Male pancreatic cancer patients.
- ageb
Numeric. Patients aged 20-54 years.
- agec
Numeric. Patients aged above 55 years.
- agea
Numeric. Patients aged below 20 years.
- fageb
Numeric. Female patients aged 20-54 years.
- fagec
Numeric. Female patients aged above 55 years.
- fagea
Numeric. Female patients aged below 20 years.
- mageb
Numeric. Male patients aged 20-54 years.
- magec
Numeric. Male patients aged above 55 years.
- magea
Numeric. Male patients aged below 20 years.
- yra
Numeric. Incidence rate in year 2017.
- yrb
Numeric. Incidence rate in year 2018.
- yrc
Numeric. Incidence rate in year 2019.
- yrd
Numeric. Incidence rate in year 2020.
- yre
Numeric. Incidence rate in year 2021.
Source
Global Burden of Disease (GBD) 2021 estimates, Seattle, United States https://vizhub.healthdata.org/gbd-results/
Pancreatic Cancer Mortality Data
Description
This dataset contains pancreatic cancer mortality rates across African countries.
Usage
data(pancre_mort)
Format
A data frame with the following variables:
- NAME
Character. Name of the country.
- mortality
Numeric. Mortality rate per 100,000 population.
- female
Numeric. Female pancreatic cancer patients.
- male
Numeric. Male pancreatic cancer patients.
- ageb
Numeric. Patients aged 20-54 years.
- agec
Numeric. Patients aged above 55 years.
- agea
Numeric. Patients aged below 20 years.
- fageb
Numeric. Female patients aged 20-54 years.
- fagec
Numeric. Female patients aged above 55 years.
- fagea
Numeric. Female patients aged below 20 years.
- mageb
Numeric. Male patients aged 20-54 years.
- magec
Numeric. Male patients aged above 55 years.
- magea
Numeric. Male patients aged below 20 years.
- yra
Numeric. Incidence rate in year 2017.
- yrb
Numeric. Incidence rate in year 2018.
- yrc
Numeric. Incidence rate in year 2019.
- yrd
Numeric. Incidence rate in year 2020.
- yre
Numeric. Incidence rate in year 2021.
Source
Global Burden of Disease (GBD) 2021 estimates, https://vizhub.healthdata.org/gbd-results/
Arrange Multiple tmap Plots in a Grid
Description
Arrange a list of tmap objects into a grid layout.
Usage
plot_map_grid(maps, ncol = 2)
Arguments
maps |
A list of tmap objects. |
ncol |
Number of columns in the grid (default is 2). |
Value
A tmap object representing arranged maps.
Examples
library(sf)
library(tmap)
# Load sample spatial data
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
# Add mock variables to map
nc$var1 <- runif(nrow(nc), 0, 100)
nc$var2 <- runif(nrow(nc), 10, 200)
# Create individual maps
map1 <- tm_shape(nc) + tm_fill("var1", title = "Variable 1")
map2 <- tm_shape(nc) + tm_fill("var2", title = "Variable 2")
# Arrange the maps in a grid using your function
plot_map_grid(list(map1, map2), ncol = 2)
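For comparison, 'tmap' itself provides tmap_arrange, which accepts a list of tmap objects and an ncol argument. Whether plot_map_grid wraps this function is an assumption; the sketch below only shows the direct call on two maps built from existing columns of the nc sample data.

```r
library(sf)
library(tmap)

nc_grid <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Two simple choropleths built from columns already in the sample data
m_area <- tm_shape(nc_grid) + tm_polygons("AREA")
m_bir <- tm_shape(nc_grid) + tm_polygons("BIR74")

# Arrange side by side; the result is drawn when printed
grid_maps <- tmap_arrange(list(m_area, m_bir), ncol = 2)
```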
Plot observed vs predicted values with correlation
Description
Creates a scatterplot of observed vs predicted values, with a 1:1 reference line and Pearson's R².
Usage
plot_obs_vs_pred(observed, predicted, title = "")
Arguments
observed |
Numeric vector of observed values. |
predicted |
Numeric vector of predicted values. |
title |
String for the plot title (default: ""). |
Value
No return value; called for side effect of displaying a plot.
Examples
observed <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)
plot_obs_vs_pred(observed, predicted, title = "Observed vs Predicted")
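A plot of this kind can be approximated with 'ggplot2' alone. The sketch below is an assumption about layout rather than the package's code: the helper name sketch_obs_vs_pred is hypothetical, the squared Pearson correlation is shown in the subtitle, and unlike plot_obs_vs_pred the helper returns the plot object instead of drawing it.

```r
library(ggplot2)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_obs_vs_pred <- function(observed, predicted, title = "") {
  df <- data.frame(observed = observed, predicted = predicted)
  r2 <- cor(observed, predicted, method = "pearson")^2  # squared Pearson correlation
  ggplot(df, aes(x = observed, y = predicted)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # 1:1 reference line
    labs(title = title,
         subtitle = sprintf("Squared Pearson correlation: %.3f", r2),
         x = "Observed", y = "Predicted")
}

p_ovp <- sketch_obs_vs_pred(c(10, 20, 30, 40), c(12, 18, 33, 39),
                            title = "Observed vs Predicted")
print(p_ovp)
```

Points lying close to the dashed line indicate predictions close to the observed values; systematic departures above or below it suggest bias.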
Build a tmap for a single variable
Description
Creates a thematic map using the tmap package for a single variable in an sf object.
Usage
plot_single_map(sf_data, var, title, palette = "reds")
Arguments
sf_data |
An sf object containing spatial data. |
var |
Variable name as a string to map. |
title |
Legend title for the fill legend. |
palette |
Color palette for the map (default is "reds"). |
Value
A tmap object representing the thematic map.
Examples
library(sf)
# Create example sf object
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc$incidence <- runif(nrow(nc), 0, 100)
# Plot
p1 <- plot_single_map(nc, "incidence", "Incidence")
Train Random Forest model
Description
Trains a Random Forest regression model.
Usage
train_rf(data, formula, ntree = 500, seed = 123)
Arguments
data |
A data frame containing the training data. |
formula |
A formula describing the model structure. |
ntree |
Number of trees to grow (default 500). |
seed |
Random seed for reproducibility (default 123). |
Value
A trained randomForest model object.
Examples
library(randomForest)
data(mtcars)
rf_model <- train_rf(mtcars, mpg ~ cyl + hp + wt, ntree = 100)
print(rf_model)
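The seed argument matters because Random Forest training draws bootstrap samples and candidate split variables at random. A minimal sketch of a seeded wrapper, assuming the seed is set immediately before calling randomForest (the helper name sketch_train_rf is hypothetical):

```r
library(randomForest)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_train_rf <- function(data, formula, ntree = 500, seed = 123) {
  set.seed(seed)  # fix the RNG state so repeated fits are identical
  randomForest(formula, data = data, ntree = ntree, importance = TRUE)
}

fit_a <- sketch_train_rf(mtcars, mpg ~ cyl + hp + wt, ntree = 100)
fit_b <- sketch_train_rf(mtcars, mpg ~ cyl + hp + wt, ntree = 100)

# Same seed and data should give the same out-of-bag predictions
all.equal(predict(fit_a), predict(fit_b))
```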
Train Support Vector Regression (SVR) model
Description
Train Support Vector Regression (SVR) model
Usage
train_svr(data, formula)
Arguments
data |
A data frame containing the training data. |
formula |
A formula specifying the model. |
Details
Trains an SVR model using the radial kernel.
Value
A trained svm model object from the e1071 package.
Examples
# Load required package
library(e1071)
# Use built-in dataset
data(mtcars)
# Define regression formula
svr_formula <- mpg ~ cyl + disp + hp + wt
# Train SVR model
svr_model <- train_svr(data = mtcars, formula = svr_formula)
# Print model summary
print(svr_model)
# Predict on the same data (for illustration)
preds <- predict(svr_model, newdata = mtcars)
head(preds)
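The Details section mentions a radial kernel. With 'e1071' this corresponds to a call along the following lines; the explicit type = "eps-regression" and the default cost and gamma values are assumptions about how train_svr is configured, and sketch_train_svr is a hypothetical name.

```r
library(e1071)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_train_svr <- function(data, formula) {
  svm(formula, data = data,
      type = "eps-regression",  # regression rather than classification (assumed)
      kernel = "radial")        # radial basis function kernel
}

svr_fit <- sketch_train_svr(mtcars, mpg ~ cyl + disp + hp + wt)
svr_pred <- predict(svr_fit, newdata = mtcars)
length(svr_pred)
```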
Train XGBoost model
Description
Train XGBoost model
Usage
train_xgb(data, formula, nrounds = 100, max_depth = 4, eta = 0.1)
Arguments
data |
A data frame with the training data. |
formula |
A formula defining the model structure. |
nrounds |
Number of boosting iterations. |
max_depth |
Maximum tree depth. |
eta |
Learning rate. |
Details
Trains an XGBoost regression model.
Value
A trained xgboost model object.
Examples
# Load required package
library(xgboost)
# Use built-in dataset
data(mtcars)
# Define regression formula
xgb_formula <- mpg ~ cyl + disp + hp + wt
# Train XGBoost model
xgb_model <- train_xgb(data = mtcars, formula = xgb_formula, nrounds = 50)
# Print model summary
print(xgb_model)
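Because 'xgboost' works on numeric matrices rather than formulas, a formula interface needs a conversion step. One way to sketch it, assuming the design matrix is built with model.matrix and the objective is squared-error regression (neither is confirmed by this manual; sketch_train_xgb is a hypothetical name):

```r
library(xgboost)

# Hypothetical helper for illustration; not part of 'mlspatial'.
sketch_train_xgb <- function(data, formula, nrounds = 100, max_depth = 4, eta = 0.1) {
  mf <- model.frame(formula, data)
  y <- model.response(mf)                             # numeric response
  X <- model.matrix(formula, mf)[, -1, drop = FALSE]  # predictors, intercept column dropped
  dtrain <- xgb.DMatrix(data = X, label = y)
  xgb.train(
    params = list(objective = "reg:squarederror",    # regression objective (assumed)
                  max_depth = max_depth, eta = eta),
    data = dtrain,
    nrounds = nrounds,
    verbose = 0
  )
}

xgb_fit <- sketch_train_xgb(mtcars, mpg ~ cyl + disp + hp + wt, nrounds = 50)
class(xgb_fit)
```

Predictions on new data need the same matrix construction, so any factor levels and column order should match those used in training.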