Prepping datasets for CRF models

Running CRFs with categorical covariates requires expansion to model matrix format

The mosquito occurrence data from Golding et al. 2015 (published in Parasites & Vectors), available from figshare, is useful for exploring how datasets need to be prepped for running Conditional Random Fields (CRF) models. Here, we download the raw data from figshare (note that an internet connection is needed for this step).

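# Download the raw data to a temporary file (requires an internet connection)
temp <- tempfile()
download.file('https://ndownloader.figshare.com/files/2075362',
              temp)
dataset <- read.csv(temp, as.is = TRUE)
# Delete the temporary file once the data are in memory
unlink(temp)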

We can now change the categorical dipping_round and field_site variables to factors and remove some unneeded variables.

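# Convert the two categorical covariates to factors
dataset$dipping_round <- as.factor(dataset$dipping_round)
dataset$field_site <- as.factor(dataset$field_site)
# Drop columns 1, 2, 5 and 6, which are not needed for the analysis
dataset[,c(1,2,5,6)] <- NULL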

It is important here to examine the level names of factor variables, as the first level (i.e. the reference level) will be dropped from the dataset during conversion to model matrix format (as in standard lme4 analysis of factor covariates).

levels(dataset$dipping_round)[1]
levels(dataset$field_site)[1]
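
To illustrate why the first level matters, here is a minimal sketch using a small made-up factor (round_toy is a hypothetical name and its values are not taken from the mosquito dataset). It shows that model.matrix() omits the first level under R's default treatment contrasts, and that relevel() can be used to choose a different reference level if the default is not the one you want.

```r
# Toy factor for illustration only (hypothetical values)
round_toy <- factor(c("2", "3", "5", "6", "3", "2"))

# The first level is the reference level under the default treatment contrasts
levels(round_toy)[1]

# model.matrix() creates an intercept plus one indicator column
# for each non-reference level
mm <- model.matrix(~ round_toy)
colnames(mm)

# relevel() moves a chosen level to the front so it becomes the reference
round_releveled <- relevel(round_toy, ref = "6")
levels(round_releveled)[1]
```

If a different reference level is preferred, releveling the factor before the conversion step changes which level is dropped.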

The next step is to convert any factor variables into model matrix format. As mentioned above, this step drops the first level of each factor and creates an indicator column for each remaining level (i.e. dipping_round levels "3", "5" and "6" will each be assigned their own column, while dipping_round level "2" will be dropped and treated as the reference level). It is also convenient to change the names of the new covariate columns so they are easier to view and interpret (done here using dplyr::rename_all).

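library(dplyr)
analysis.data <- dataset %>% 
  # Add indicator columns for field_site ([,-1] drops the intercept column)
  cbind(.,data.frame(model.matrix(~.[,'field_site'],
                                  .)[,-1])) %>%
  # Add indicator columns for dipping_round
  cbind(.,data.frame(model.matrix(~.[,'dipping_round'],
                                  .)[,-1])) %>%
  # Remove the original factor columns
  dplyr::select(-field_site,-dipping_round) %>%
  # Tidy the column names; a formula lambda replaces the deprecated funs()
  dplyr::rename_all(~ gsub("\\.|model.matrix", "", .x))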
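
The same expansion can be sketched in base R on a small made-up data frame, which makes it easier to see what the conversion produces. The data frame toy and its columns (species_a, depth, site) are hypothetical and are used only for illustration; this is not the package's own procedure, just one way to reach the same layout.

```r
# Hypothetical toy data: one species column, one continuous covariate, one factor
toy <- data.frame(
  species_a = c(0, 3, 1, 0),
  depth     = c(1.2, 0.4, 2.5, 0.9),
  site      = factor(c("A", "B", "C", "B"))
)

# Expand the factor into indicator columns; [, -1] removes the intercept
# column, and the reference level "A" gets no column of its own
site_dummies <- model.matrix(~ site, data = toy)[, -1, drop = FALSE]

# Replace the original factor column with its indicator columns
toy_expanded <- cbind(toy[, setdiff(names(toy), "site")], site_dummies)
names(toy_expanded)
```

Passing the factor by name in the formula gives readable column names (the variable name followed by the level), so no renaming step is needed in this simplified version.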

Finally, we need to convert species abundances to binary presence-absence format (as we are only estimating co-occurrences, not co-abundances). It is also highly advisable to scale any continuous variables so they all have mean = 0 and standard deviation = 1.

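# Columns 1 to 16 hold species abundances: recode as presence (1) or absence (0)
analysis.data[, 1:16] <- ifelse(analysis.data[, 1:16] > 0, 1, 0)
# Columns 17 to 20 hold continuous covariates: centre and scale to unit variance
analysis.data[, 17:20] <- scale(analysis.data[, 17:20], center = TRUE, scale = TRUE)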
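
As a quick check of what these two transformations do, the following minimal sketch applies them to a made-up data frame and selects columns by name rather than by position. All names here (toy, species_a, species_b, depth, species_cols) are hypothetical.

```r
# Hypothetical toy data: two abundance columns and one continuous covariate
toy <- data.frame(
  species_a = c(0, 3, 1, 0),
  species_b = c(2, 0, 0, 5),
  depth     = c(1.2, 0.4, 2.5, 0.9)
)

# Recode abundances as presence (1) or absence (0)
species_cols <- c("species_a", "species_b")
toy[species_cols] <- lapply(toy[species_cols], function(x) as.numeric(x > 0))

# Centre and scale the continuous covariate; as.numeric() drops the
# matrix attributes that scale() attaches to its result
toy$depth <- as.numeric(scale(toy$depth))

# After scaling, the mean should be approximately 0 and the sd approximately 1
mean(toy$depth)
sd(toy$depth)
```

Selecting columns by name avoids silent mistakes if the column order changes, for example when a factor has a different number of levels than expected.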
