mopa and the climate4R bundle for climate data access

M. Iturbide & J. Bedia & J.M. Gutierrez

2018-02-12

1 Introduction

A popular application of Species Distribution Models (SDMs) is the projection of future species distributions using climate change data from global and regional climate models (GCMs and RCMs). For this purpose, SDMs are first calibrated over a historical (reference) period using specific climatic variables as predictors, typically in the form of bioclimatic variables (Busby, 1991; Nix, 1986). An important barrier to SDM development is therefore climate data retrieval and preparation, both for the baseline climate and for the future projections. The numerous climate databases available are scattered across many repositories with different file formats, variable naming conventions, etc., and often require relatively complex, time-consuming downloads and error-prone processing steps before SDM development can begin. As a result, predefined bioclimatic datasets, such as WorldClim, are widely used by the SDM community. This reliance is also a major barrier to research reproducibility and to updating results in the light of new climate change projections.

To bridge this gap, mopa has been developed within the climate4R ecosystem, a bundle of R packages that provides transparent access to, and post-processing of, data from standard climate repositories, such as CMIP5.

Unlike mopa, the rest of the packages in the climate4R bundle are only available on GitHub. We therefore recommend installing them with function install_github from package devtools (installation instructions and further details can be found in the corresponding GitHub repositories).

This tutorial illustrates a full worked example of climate data retrieval and preparation for modeling two phylogenies of Oak species in Europe using mopa and climate4R.

2 Climate data preparation with climate4R

2.1 Climate data loading

loadeR is the central building block of the climate4R bundle. It provides transparent access to local and remote (OPeNDAP) climate datasets (NetCDF, GRIB, HDF, etc.), building on NetCDF-Java (through the rJava package). The package was conceived to work in the framework of climate change studies and therefore treats ensemble members as a basic dimension of its two main data structures (grid and station). Moreover, loadeR is enhanced by the User Data Gateway (UDG), a climate data service maintained by the University of Cantabria that provides remote access to harmonized data from several state-of-the-art climate datasets (e.g. CMIP5 or CORDEX models). This service requires (free) registration to accept the data policies of the data providers prior to accessing the data (a username and password will be provided automatically).

The function loginUDG sets up the UDG credentials for data access in the current R session:

devtools::install_github(c("SantanderMetGroup/loadeR.java", "SantanderMetGroup/loadeR"))
library(loadeR)
loginUDG("userUDG", "pswrdUDG")

The function for loading gridded data in loadeR is loadGridData, where at least the dataset and the variable are specified. The dataset is the path to a local file or a URL pointing to a netCDF or NcML file.

If the data are to be loaded from the UDG, we can use function UDG.datasets to print the inventory of available UDG datasets, where the name, type and URL of each dataset are specified. In this case, we can pass the “name” of the desired dataset to loadGridData instead of the complete URL. Here we are interested in loading observations from the E-OBS dataset (Haylock et al., 2008) and projection data from the CMIP5 (Taylor et al., 2011) MPI model. We therefore filter the names returned by UDG.datasets using an appropriate pattern:

ds <- UDG.datasets()$name
ds[grepl("E-OBS|CMIP5_MPI-ESM-MR", ds)]
## [1] E-OBS_v14_0.50regular       E-OBS_0.44rotated          
## [3] E-OBS_0.25regular           E-OBS_v14_0.22rotated      
## [5] CMIP5_MPI-ESM-MR_historical CMIP5_MPI-ESM-MR_rcp45     
## [7] CMIP5_MPI-ESM-MR_rcp85     
## 31 Levels: CMIP5_CCCMA-CANESM2_historical ... WATCH_WFDEI

From the available names, next we use the name “E-OBS_0.25regular” to load precipitation and temperature data (needed to build bioclimatic predictors) from the E-OBS observational dataset in order to prepare the baseline climate for a reference period (1971-2000).

di <- dataInventory("E-OBS_0.25regular")   # inventory of the available data

lon <- c(-5,20)   # reference region and period
lat <- c(35,70)
period <- 1971:2000

tmin_obs <- loadGridData("E-OBS_0.25regular", var = "tn", years = period, lonLim = lon, latLim = lat, aggr.m = "mean")
tmax_obs <- loadGridData("E-OBS_0.25regular", var = "tx", years = period, lonLim = lon, latLim = lat, aggr.m = "mean")
precip_obs <- loadGridData("E-OBS_0.25regular", var = "rr", years = period, lonLim = lon, latLim = lat, aggr.m = "sum")

Similarly, we use the names “CMIP5_MPI-ESM-MR_historical” and “CMIP5_MPI-ESM-MR_rcp45”, corresponding respectively to the historical run and the future projection (RCP4.5) of the CMIP5 MPI-ESM-MR model. In this case, temperature data are originally provided in Kelvin and precipitation data in kg m⁻² s⁻¹. Unit conversion can be done after data loading; however, loadGridData accepts a “dictionary” that performs the unit conversion at loading time (visit https://github.com/SantanderMetGroup/loadeR/wiki/Harmonization for more info). We can create a dictionary defining the conversion factors to degrees Celsius and millimeters (per day) as follows:

dictionary <- tempfile(pattern = "cmip5", fileext = ".dic")
writeLines(c("identifier,short_name,time_step,lower_time_bound,upper_time_bound,cell_method,offset,scale,deaccum,derived,interface",
             "tasmax,tasmax,24h,0,24,max,-273.15,1,0,0,",
             "tasmin,tasmin,24h,0,24,min,-273.15,1,0,0,",
             "tp,pr,24h,0,24,sum,0,86400,0,0,"), dictionary)

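To make the role of the offset and scale columns concrete, the following base R sketch reproduces the arithmetic they encode. It assumes the converted value is computed as raw * scale + offset (for the three rows above the order of the two operations makes no difference, because either the scale is 1 or the offset is 0). The helper apply_dictionary and the sample values are hypothetical and are not part of loadeR.

```r
# Hypothetical helper (not a loadeR function): applies a dictionary-style
# linear conversion. Assumption: converted = raw * scale + offset.
apply_dictionary <- function(raw, scale, offset) {
  raw * scale + offset
}

# Temperature: Kelvin to degrees Celsius (scale = 1, offset = -273.15)
tas_kelvin <- c(273.15, 288.15, 300.65)
tas_celsius <- apply_dictionary(tas_kelvin, scale = 1, offset = -273.15)

# Precipitation flux: kg m-2 s-1 to mm/day (scale = 86400 s/day, offset = 0),
# since 1 kg of water spread over 1 m2 forms a layer 1 mm deep
pr_flux <- c(0, 1e-5, 5e-5)
pr_mm_day <- apply_dictionary(pr_flux, scale = 86400, offset = 0)

print(tas_celsius)
print(pr_mm_day)
```

The factor 86400 is simply the number of seconds in a day, which is why the dictionary row for precipitation pairs it with a 24h time step.
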
Next, historical data (reference period) and climate change data (period 2071-2100) are loaded using the previously created dictionary.

##historical

tmin_hist <- loadGridData("CMIP5_MPI-ESM-MR_historical", var = "tasmin", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "mean", dictionary = dictionary)
tmax_hist <- loadGridData("CMIP5_MPI-ESM-MR_historical", var = "tasmax", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "mean", dictionary = dictionary)
precip_hist <- loadGridData("CMIP5_MPI-ESM-MR_historical", var = "tp", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "sum", dictionary = dictionary)

##rcp45
period <- 2071:2100 #future period

tmin_rcp45 <- loadGridData("CMIP5_MPI-ESM-MR_rcp45", var = "tasmin", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "mean", dictionary = dictionary)
tmax_rcp45 <- loadGridData("CMIP5_MPI-ESM-MR_rcp45", var = "tasmax", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "mean", dictionary = dictionary)
precip_rcp45 <- loadGridData("CMIP5_MPI-ESM-MR_rcp45", var = "tp", years = period, lonLim = lon, latLim = lat, time = "DD", aggr.m = "sum", dictionary = dictionary)

2.2 Bias Correction: The Delta Method

The outputs of GCMs (and/or coupled RCMs) cannot be used directly for impact studies because they may contain important biases (e.g. Brands et al., 2011). A validation/calibration process is therefore needed before using these data in real applications. For this purpose, climate4R includes downscaleR, a package for statistical downscaling and bias correction (see the Wiki for worked examples).

Here, we apply the delta method using function biasCorrection. This method operates through the extraction of the climate change signal relative to the control run of the same model, so that the problem of model biases is alleviated to a great extent (e.g. Räisänen, 2007).

devtools::install_github("SantanderMetGroup/downscaleR")
library(downscaleR)

#BIAS CORRECTION WITH THE DELTA METHOD (package downscaleR)
tmin <- biasCorrection(y = tmin_obs, x = tmin_hist, newdata = tmin_rcp45, method = "delta")
tmax <- biasCorrection(y = tmax_obs, x = tmax_hist, newdata = tmax_rcp45, method = "delta")
precip <- biasCorrection(y = precip_obs, x = precip_hist, newdata = precip_rcp45, method = "delta", precipitation = TRUE)
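
The idea behind the delta method can be illustrated with a few lines of base R. The sketch below uses synthetic series and the additive form of the method, in which the mean change between the model's future and control runs is added to the observations. It is a conceptual illustration under these assumptions and not the code that biasCorrection runs; in particular, any special handling of precipitation is not covered here. All object names are hypothetical.

```r
# Conceptual sketch of an additive delta correction (synthetic data, base R only)
set.seed(1)
obs_ref   <- rnorm(360, mean = 10, sd = 3)  # observations, reference period
model_ref <- rnorm(360, mean = 13, sd = 3)  # model control run (warm bias of about 3)
model_fut <- rnorm(360, mean = 16, sd = 3)  # model future run

# Climate change signal: difference between the future and control means
delta <- mean(model_fut) - mean(model_ref)

# Projected series: observations shifted by the signal, so the constant
# model bias cancels out and the observed variability is preserved
projected <- obs_ref + delta

print(delta)
print(mean(projected) - mean(obs_ref))
```

Because only the difference between two runs of the same model is used, a systematic offset shared by both runs has no effect on the result.
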

2.3 Data Transformation

We are interested in obtaining monthly climatologies for periods of 30 years. This is easily done with the functionalities provided by package transformeR (see the Wiki for worked examples), which is also part of the climate4R bundle. In this case, we extract the data corresponding to each calendar month with function subsetGrid and then compute the 30-year means with function climatology.

2.3.1 Compute Climatic Means

devtools::install_github("SantanderMetGroup/transformeR")
library(transformeR)

#COMPUTE CLIMATOLOGIES (package transformeR)
monthly <- function(x){
  month <- list()
  for(i in 1:12){
    month[[i]] <- climatology(subsetGrid(x, season = i))
  }
  x1 <- bindGrid.time(month)
  return(x1)
}

tmin.clim <- monthly(tmin)
tmax.clim <- monthly(tmax)
precip.clim <- monthly(precip)
tmin.obs.clim <- monthly(tmin_obs)
tmax.obs.clim <- monthly(tmax_obs)
precip.obs.clim <- monthly(precip_obs)
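
For readers unfamiliar with the grid structures, the same "subset each calendar month, then average over the years" logic can be sketched in base R on a plain numeric vector. The series below is synthetic and stands for a single grid cell; the object names are hypothetical and no transformeR function is used.

```r
# Synthetic monthly series for one grid cell over 30 years (360 values)
set.seed(42)
month_index <- rep(1:12, times = 30)
tas <- 10 + 8 * sin(2 * pi * (month_index - 4) / 12) + rnorm(360, sd = 1)

# Subset each calendar month and average over the 30 years
monthly_clim <- vapply(1:12, function(m) mean(tas[month_index == m]), numeric(1))

# The same result in a single call
monthly_clim2 <- as.numeric(tapply(tas, month_index, mean))

print(round(monthly_clim, 2))
```

The result has one value per calendar month, which mirrors the 12 time steps obtained for each grid after binding the monthly climatologies.
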

2.3.2 Convert Data to mopa

grid2mopa is a specific function included in transformeR to convert the data structure built with loadeR to the format that is compatible with mopa (Raster* class objects from package raster (Hijmans, 2015)). Using function makeMultiGrid (package transformeR) we can do this operation for a collection of variables at once. Next, we run both functions for the observed (clim.obs) and future (clim) climates.

#TRANSFORM GRIDS
mg1 <- makeMultiGrid(tmin.clim, tmax.clim, precip.clim)
mg2 <- makeMultiGrid(tmin.obs.clim, tmax.obs.clim, precip.obs.clim)
clim <- grid2mopa(transformeR::redim(mg1, drop = TRUE))
clim.obs <- grid2mopa(transformeR::redim(mg2, drop = TRUE))

2.4 Calculate Bioclimatic Variables

A standard set of predictors often used to calibrate SDMs consists of bioclimatic variables (Busby, 1991; Nix, 1986). The main advantage of using Raster* class objects in mopa is the compatibility with other SDM-oriented packages. For instance, we can calculate bioclimatic variables using function biovars from package dismo (Hijmans et al., 2017) as follows:

#CALCULATE BIOCLIMATIC VARIABLES (package dismo)
library(dismo)
ind <- c(1,2,5,12,18,19) #subset of bioclimatic variables used in this experiment
bioclim.obs <- biovars(prec = clim.obs$rr, tmin = clim.obs$tn, tmax = clim.obs$tx)[[ind]]
bioclim <- biovars(prec = clim$rr, tmin = clim$tn, tmax = clim$tx)[[ind]]
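
To show what some of these variables represent, the sketch below computes four of them by hand for a single grid cell from 12 monthly values, following the standard BIOCLIM definitions. The input numbers are invented for illustration and the object names are hypothetical. The quarterly variables (18 and 19) require rolling three-month windows and are omitted.

```r
# Invented monthly climatology for one grid cell (12 values each)
tn_cell <- c(1, 2, 4, 6, 10, 14, 16, 16, 13, 9, 5, 2)         # tmin, degrees C
tx_cell <- c(8, 10, 13, 16, 20, 25, 28, 28, 24, 18, 12, 9)    # tmax, degrees C
pr_cell <- c(80, 70, 60, 55, 50, 30, 15, 20, 45, 75, 90, 85)  # precipitation, mm

tavg_cell <- (tn_cell + tx_cell) / 2   # monthly mean temperature

bio1  <- mean(tavg_cell)          # BIO1: annual mean temperature
bio2  <- mean(tx_cell - tn_cell)  # BIO2: mean diurnal range
bio5  <- max(tx_cell)             # BIO5: max temperature of the warmest month
bio12 <- sum(pr_cell)             # BIO12: annual precipitation

print(c(bio1 = bio1, bio2 = bio2, bio5 = bio5, bio12 = bio12))
```
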

Additionally, we can take advantage of the functionalities provided by the raster package to further transform our data, for example, to standardize the set of bioclimatic variables:

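#STANDARDIZE (package raster)
bio.min <- cellStats(bioclim.obs, stat = "min")   # per-layer minimum (baseline climate)
bio.max <- cellStats(bioclim.obs, stat = "max")   # per-layer maximum (baseline climate)
bioclim.obs.st <- (bioclim.obs - bio.min)/(bio.max - bio.min)
spplot(bioclim.obs.st, layout = c(6,1))

bioclim.st <- (bioclim - bio.min)/(bio.max - bio.min)   # future climate, rescaled with baseline limits
spplot(bioclim.st, layout = c(6,1))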

Note that here we use function spplot from package sp (Pebesma and Bivand, 2005) for graphical visualization of Raster* class objects.
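
The standardization above rescales both climates with the minimum and maximum of the baseline. A small base R sketch with invented values for one predictor makes the consequence visible: the baseline spans exactly [0, 1], while future values can fall outside that range wherever the projected climate exceeds the observed extremes. Object names are hypothetical.

```r
# Invented values of one predictor over a few grid cells
baseline <- c(2, 5, 8, 11, 14)
future   <- c(4, 7, 10, 13, 17)

# Rescaling limits are taken from the baseline only
lo <- min(baseline)
hi <- max(baseline)

baseline_st <- (baseline - lo) / (hi - lo)  # spans exactly [0, 1]
future_st   <- (future - lo) / (hi - lo)    # can exceed 1 where the climate shifts

print(range(baseline_st))
print(range(future_st))
```

Using the same limits for both periods keeps the two sets of predictors on a common scale, so a model fitted on the baseline can be applied to the future layers.
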

3 Species distribution modeling with mopa

At this point, we are ready to use mopa. Next we define the background of the study area (function backgroundGrid), generate pseudo-absences (function pseudoAbsences), perform model calibration (function mopaTrain) and project the fitted models to reference and future climatic conditions (function mopaPredict).

NOTE: See the mopa Wiki for further worked examples and details.

3.1 Preparing occurrence data

As an example, we use a dataset of species presence data included in the mopa package. This dataset is a list of two data frames with distribution data of two Oak phylogenies in Europe, namely “H11” and “H01”.

library(mopa)
data("Oak_phylo2")
plot(Oak_phylo2$H11, xlim = c(-10, 30), ylim = c(35,65))
points(Oak_phylo2$H01, col = "red")

Next we define the background of the study area with function backgroundGrid using one of the generated climate layers as spatial reference (raster = bioclim.obs.st$bio1). This background constitutes the baseline grid to generate pseudo-absences with function pseudoAbsences. In this case, we perform 10 realizations of randomly generated pseudo-absences for both phylogenies.

bg <- backgroundGrid(raster = bioclim.obs.st$bio1)
PA <- pseudoAbsences(xy = Oak_phylo2, realizations = 10, background = bg$xy)
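
As a rough illustration of what "10 realizations of randomly generated pseudo-absences" means, the base R sketch below draws repeated random samples from background cells that contain no presence record. It is a simplified, hypothetical stand-in: the grid, the presences and the choice of drawing as many pseudo-absences as presences are all assumptions, and the actual pseudoAbsences function may sample differently.

```r
set.seed(123)

# Hypothetical background: cell centres of a regular 0.25-degree grid
background <- expand.grid(x = seq(-5, 20, by = 0.25), y = seq(35, 70, by = 0.25))

# Hypothetical presences: 50 background cells taken as occupied
pres_idx <- sample(nrow(background), 50)
presences <- background[pres_idx, ]

# Candidate cells for pseudo-absences: background minus presence cells
candidates <- background[-pres_idx, ]

# 10 independent realizations, each with as many pseudo-absences as presences
realizations <- lapply(1:10, function(i) {
  candidates[sample(nrow(candidates), nrow(presences)), ]
})

print(length(realizations))
print(head(realizations[[1]], 3))
```

Repeating the random draw gives a set of alternative training samples, so the sensitivity of the models to the pseudo-absence sample can be examined later.
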

3.2 Calibrating SDMs

Models are calibrated with function mopaTrain, which can fit multiple algorithms (random forest in this example) and perform k-fold cross-validation (10 folds in this example).

train <- mopaTrain(y = PA, x = bioclim.obs.st, algorithm = "rf", k = 10, weighting = TRUE)
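
The k-fold scheme itself is independent of the modeling algorithm and can be sketched in base R. The example below only builds the train/test index sets for k = 10 on a hypothetical number of records. It does not fit any model and is not taken from the mopaTrain source.

```r
set.seed(7)
n <- 95   # hypothetical number of presence/pseudo-absence records
k <- 10

# Shuffle fold labels 1..k so that fold sizes differ by at most one record
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold, hold it out for testing and keep the rest for training
splits <- lapply(seq_len(k), function(i) {
  list(train = which(fold != i), test = which(fold == i))
})

# Every record is used for testing exactly once
print(table(fold))
```

Each of the k models is evaluated on records it has not seen during fitting, which is what makes the resulting skill scores cross-validated.
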

Function extractFromModel is applied to extract the desired elements generated with mopaTrain. Next we extract the fitted models.

model <- extractFromModel(models = train, value = "model")

3.3 SDM Projections

Next we apply function mopaPredict to project the fitted models into reference and future climate conditions and thus generate predictions of the potential distribution of the two Oak phylogenies. We can also use extractFromPrediction to extract a subset of predictions. For instance, here we retain those corresponding to phylogeny “H01”, so 10 predictions are extracted, each corresponding to a different pseudo-absence sample.

#MODEL PROJECTION IN REFERENCE CONDITIONS
predict <- mopaPredict(models = model, newClim = bioclim.obs.st)
prediction <- extractFromPrediction(predictions = predict, value = c("H01"))
spplot(prediction, at = seq(0,1,0.1), layout = c(5,2))

#MODEL PROJECTION IN FUTURE CONDITIONS
predict.fut <- mopaPredict(models = model, newClim = bioclim.st)
prediction.fut <- extractFromPrediction(predictions = predict.fut, value = c("H01"))
spplot(prediction.fut, at = seq(0,1,0.1), layout = c(5,2))

3.4 Analysis of SDM projections

Ideally, when modeling species distributions, several choices should be considered for each of the components involved in the modeling and projection process. In this worked example, we used 10 different samples of pseudo-absence data, but only a single choice for the modeling algorithm (random forest) and for the baseline and future climate data (E-OBS, CMIP5_MPI-ESM-MR). When several options are available for more than one component, mopa handles them by accepting the data as list class objects (e.g. a list of raster stack objects to project fitted models into climate change conditions given by multiple GCMs and/or RCMs) and a character vector of multiple algorithms in mopaTrain. This makes it possible to account for the uncertainty in SDM projection ensembles, measured as the variability that comes from the use of different datasets, training samples and SDM techniques. These sources of variability are the uncertainty components considered in mopa.

In this regard, mopa provides tools for uncertainty analysis: functions varianceAnalysis and varianceSummary make it possible to analyze different uncertainty components in the ensemble of projections (check the “help” documentation of the functions for more detailed information).
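
To give an intuition for splitting the spread of an ensemble into components, the following base R sketch decomposes the variability of a small, invented set of projections for one grid cell into a pseudo-absence part, an algorithm part and their interaction, using the sums of squares of a balanced two-way layout. This is a generic illustration under those assumptions. It is not the procedure implemented in varianceAnalysis, and the algorithm labels are only placeholders.

```r
set.seed(99)

# Invented ensemble for one grid cell: rows = 10 pseudo-absence samples,
# columns = 3 algorithms (labels are illustrative only)
pred <- matrix(runif(30), nrow = 10,
               dimnames = list(paste0("PA", 1:10), c("rf", "glm", "mars")))

grand   <- mean(pred)
row_eff <- rowMeans(pred) - grand   # effect of the pseudo-absence sample
col_eff <- colMeans(pred) - grand   # effect of the algorithm

# Sums of squares of a balanced two-way layout with one value per combination
ss_total <- sum((pred - grand)^2)
ss_pa    <- ncol(pred) * sum(row_eff^2)
ss_sdm   <- nrow(pred) * sum(col_eff^2)
ss_inter <- ss_total - ss_pa - ss_sdm   # interaction (residual) term

# Share of the total spread attributable to each component
shares <- c(PA = ss_pa, SDM = ss_sdm, interaction = ss_inter) / ss_total
print(round(shares, 3))
```

The three shares add up to one, so they can be read as the fraction of the ensemble spread associated with each source.
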

References

Brands, S., Herrera, S., San-Martín, D., Gutiérrez, J., 2011. Validation of the ENSEMBLES Global Climate Models over southwestern Europe using probability density functions: A downscaler’s perspective. Climate Research 48, 145–161. https://doi.org/10.3354/cr00995

Busby, J., 1991. BIOCLIM - a bioclimatic analysis and prediction system, in: Nature Conservation: Cost Effective Biological Surveys and Data Analysis. CSIRO.

Haylock, M.R., Hofstra, N., Klein Tank, A.M.G., Klok, E.J., Jones, P.D., New, M., 2008. A European daily high-resolution gridded data set of surface temperature and precipitation for 1950–2006. Journal of Geophysical Research 113, D20119. https://doi.org/10.1029/2008JD010201

Hijmans, R.J., 2015. raster: Geographic data analysis and modeling.

Hijmans, R.J., Phillips, S., Leathwick, J., Elith, J., 2017. dismo: Species distribution modeling.

Nix, H.A., 1986. Atlas of Elapid Snakes of Australia. Australian Government Publishing Service, Canberra, Australia.

Pebesma, E.J., Bivand, R.S., 2005. Classes and methods for spatial data in R. R News 5, 9–13.

Räisänen, J., 2007. How reliable are climate models? Tellus A 59, 2–29.

Taylor, K.E., Stouffer, R.J., Meehl, G.A., 2011. An Overview of CMIP5 and the Experiment Design. Bulletin of the American Meteorological Society 93, 485–498. https://doi.org/10.1175/BAMS-D-11-00094.1