The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
In this vignette, we describe how prevalence mapping using DHS data can be carried out. Prevalence mapping is Small Area Estimation (SAE) for proportions of some indicator of interest. The vignette is organized as follows. In Section 2 we first describe the data preparation steps to obtain and process the relevant DHS survey data in R. We then describe three classes of models
Then in Section 6 we show several tools for visualization of different estimates, and in Section 7 we discuss how to aggregate the result for a given model to produce estimates at higher admin levels.
The following pieces of data are needed to produce SAE for DHS surveys:
The latest surveyPrev package can be installed from CRAN with
The development version of the surveyPrev package can be installed from GitHub with
After installation, the package can be loaded with
One of the key required packages, INLA is not available on CRAN or GitHub, and needs to be installed with the following code (Rue, Martino, and Chopin 2009).
install.packages("INLA", repos=c(getOption("repos"), 
                  INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(INLA)The following packages are needed to run the example analysis described in this vignette.
Currently, the surveyPrev package supports 22 indicators. The list of indicators and their IDs can be found in the surveyPrevIndicators dataset. The ID column correspond to the indicator IDs in the DHS API and are more standardized. The alternative ID column provides convenient and more readable indicator names for only a subset of indicators. Both can be used to retrieve the indicator in the functions described later. The full list of indicators and their IDs are summarized in the Table below.
| ID | alternative.ID | Description | Topic | 
|---|---|---|---|
| AN_ANEM_W_ANY | womananemia | Percentage of women aged 15-49 classified as having any anemia | Nutrition | 
| AN_NUTS_W_THN | Underweight (Prevalence of underweight (BMI < 18.5) women of reproductive age) | Nutrition | |
| CH_DIAT_C_ORT | Diarrhea treatment (Children under five with diarrhea treated with either ORS or RHF) | Child Health | |
| CH_VACC_C_BAS | Children age 12-23 months with all 8 basic vaccinations | Child Health | |
| CH_VACC_C_DP1 | Percentage of children 12-23 months who had received DPT 1 vaccination | Child Health | |
| CH_VACC_C_DP3 | DPT3 | Percentage of children 12-23 months who had received DPT 3 vaccination | Child Health | 
| CH_VACC_C_MSL | Percentage of children 12-23 months who had received MCV 1 (Measles containing vaccine) | Child Health | |
| CH_VACC_C_NON | Children 12-23 months with no vaccinations | Child Health | |
| CM_ECMR_C_NNR | nmr | Probability of dying in the first month of life in the five or ten years preceding the survey | Infant and Child Mortality | 
| CN_ANMC_C_ANY | Children under five with any anemia | Nutrition | |
| CN_BRFS_C_EXB | Children exclusively breastfed (Prevalence of exclusive breastfeeding of children under six months of age) | Nutrition | |
| CN_NUTS_C_HA2 | stunting | Stunting rate (Prevalence of stunted (HAZ < -2) children under five (0-59 months)) | Nutrition | 
| CN_NUTS_C_WH2 | wasting | Wasting rate (Prevalence of wasted (HAZ < -2) children under five (0-59 months)) | Nutrition | 
| FP_CUSA_W_MOD | Modern contraceptive prevalence rate (Married women currently using any modern method of contraception) | Family Planning | |
| FP_NADA_W_UNT | unmet_family | Percentage of women with an unmet need for family planning | Family Planning | 
| HA_HIVP_B_HIV | HIV prevalence | HIV prevalence | |
| ML_NETP_H_IT2 | Households with at least one insecticide-treated mosquito net (ITN) for every two persons who stayed in the household the previous night | Malaria | |
| RH_ANCN_W_N4P | ancvisit4+ | Antenatal visits for pregnancy: 4+ visits | Maternal Healthcare | 
| RH_DELA_C_SKP | Assistance during delivery from a skilled provide | Maternal Healthcare | |
| WS_SRCE_P_BAS | Population using a basic water source | Water, Sanitation and Hygiene | |
| WS_TLET_H_IMP | sanitation | Percentage of households using an improved sanitation facility | Water, Sanitation and Hygiene | 
| WS_TLET_P_BAS | Population with access to a basic sanitation service | Water, Sanitation and Hygiene | 
In this vignette, we consider estimating the percentage of women who had a live birth in the five years before the survey who had four or more antenatal care visits. This indicator can be extracted with indicator = "ancvisit4+" or indicator = "RH_ANCN_W_N4P".
Details on more common DHS indicators can be found in the Guide to DHS Statistics at https://dhsprogram.com/data/Guide-to-DHS-Statistics/. User-specified function can also be used to process raw DHS data into new indicators. Please refer to the vignette on Creating Customized Indicators for surveyPrev for more details.
Processing the binary indicator from the raw DHS data consist of two steps:
1. download the relevant DHS data recode using getDHSdata(), or manually from the DHS website,
2. create a data frame with the binary indicator and relevant information using getDHSindicator().
The getDHSdata() function downloads the relevant DHS data directly from the DHS website using the rdhs package. This step requires users to
rdhs::set_rdhs_config().After setting up the DHS account information, getDHSdata() can be used to download the relevant survey files directly into R. If data download using the API fails, or if it is preferred to work with downloaded data files offline, you may also manually download the file from the DHS website and read into R. The getDHSdata() function returns a message specifying which file is used (e.g., Individual Record file for the ANC visit example).
indicator <- "ancvisit4+"
year <- 2018
country <- "Zambia"
dhsData <- getDHSdata(country = country, indicator = indicator, year = year)We then use getDHSindicator() to process the raw survey data into a data frame, where the column value is the indicator of interest. The data frame also contains information specifying survey designs in the survey package, including cluster ID, household ID, survey weight and strata information.
##   cluster householdID    v024  weight strata value
## 1       1           1 eastern 1892890  rural     1
## 2       1           2 eastern 1892890  rural     1
## 3       1           3 eastern 1892890  rural     1
## 4       1           4 eastern 1892890  rural     0
## 5       1           7 eastern 1892890  rural     1
## 6       1           9 eastern 1892890  rural    NAThe getDHSgeo()function downloads the GPS data for cluster locations, also through the rdhs package. This step loads in the GPS location as a SpatialPointsDataFrame, but it can also be performed manually by downloading the GPS file from the DHS website. The GPS location of the clusters are used to match clusters to regions at different admin levels.
In the case of Zambia, the maps are already included in the package as example datasets (data(ZambiaAdm1) and data(ZambiaAdm2)) so we load them directly within R. For other countries, the Admin 1 and Admin 2 boundary shapefiles can be downloaded from the GADM site3 and read into R manually. Another alternative is to use the geodata and sf package to download the maps from the GADM site and load them in R directly, in the following steps:
poly.adm1 <- geodata::gadm(country="ZMB", level=1, path=tempdir())
poly.adm1 <- sf::st_as_sf(poly.adm1)
poly.adm2 <- geodata::gadm(country="ZMB", level=2, path=tempdir())
poly.adm2 <- sf::st_as_sf(poly.adm2)The clusterInfo() function extracts Admin 1 and Admin 2 area information for each cluster. To avoid confusion in countries that have duplicated region names, we create new labels for Admin 2 regions using the variable admin2.name.full. The clusters with invalid GPS locations are saved in the wrong.points object and not used in any models.
cluster.info <- clusterInfo(geo=geo, poly.adm1=poly.adm1, poly.adm2=poly.adm2, 
                            by.adm1 = "NAME_1",by.adm2 = "NAME_2")
head(cluster.info$data)##   cluster  LONGNUM     LATNUM                   geometry admin1.name
## 1       1 32.06631 -13.895685 POINT (32.06631 -13.89568)     Eastern
## 2       2 28.73625 -13.419209 POINT (28.73625 -13.41921)  Copperbelt
## 3       3 31.13254  -8.754055 POINT (31.13254 -8.754055)    Northern
## 4       4 28.44930 -14.419112  POINT (28.4493 -14.41911)     Central
## 5       5 29.96905 -12.677275 POINT (29.96905 -12.67727)     Central
## 6       6 32.74745  -9.319054 POINT (32.74745 -9.319054)    Muchinga
##   admin2.name   admin2.name.full
## 1      Katete     Eastern_Katete
## 2     Masaiti Copperbelt_Masaiti
## 3    Mpulungu  Northern_Mpulungu
## 4       Kabwe      Central_Kabwe
## 5    Chitambo   Central_Chitambo
## 6     Nakonde   Muchinga_Nakonde##  [1]  19  33  65 179 236 281 293 360 361 469The adminInfo() function computes the spatial the adjacency matrix, and if supplied with additional information on population, combines needed information for each administrative area, including the name, population, and urban population fractions of each administrative area. The admin.mat object stores the adjacency matrix for spatial models. We will ignore the population information for now and focus on only information from the DHS data, and will return to the population information in Section 7.
admin.info1 <- adminInfo(poly.adm = poly.adm1, admin = 1,by.adm = "NAME_1")
admin.info2 <- adminInfo(poly.adm = poly.adm2, admin = 2,by.adm = "NAME_2",by.adm.upper = "NAME_1")
head(admin.info2$data)##   admin1.name   admin2.name      admin2.name.full population urban
## 1     Central      Chibombo      Central_Chibombo         NA    NA
## 2     Central      Chisamba      Central_Chisamba         NA    NA
## 3     Central      Chitambo      Central_Chitambo         NA    NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi         NA    NA
## 5     Central         Kabwe         Central_Kabwe         NA    NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi         NA    NAFor \(y_j\) be the binary outcome of interest for the \(j^{\text{th}}\) individual in the survey and \(w_j\) be the design weight associated with this individual. For a given area denoted as \(i\), we have the weighted estimator:
\[ \hat p^{W}_{i}=\frac {\sum_{j \in S_i} y_j w_j}{\sum_{j \in S_i} w_j}\]
where \(S_i\) is the set of individual index within the \(i\)-th region. The directEST() function calculates direct estimates at different Admin levels using the SUMMER::smoothSurvey() function in the SUMMER package internally (Li et al. 2020). The output of the function includes the direct estimates and their variance, standard error, and confidence intervals based on the specified confidence level. The confidence intervals are computed on the logit scale, i.e., if we use \(D_i\) to denote the design-based variance of \(\hat p^{W}_i\), then the asymptotic design-based variance of \(\text{logit}(\hat p^{W}_i)\) is
\[
V_i  = \frac{D_i}{(\hat p^{W}_i)^2(1-\hat p^{W}_i)^2}
\]
and we compute the confidence interval on the probability scale to be
\[(\quad \text{expit}[\text{logit}(\hat p^{W}_i) - z_{\alpha/2}V_i^{1/2}], \;\;
\text{expit}[\text{logit}(\hat p^{W}_i) + z_{\alpha/2}V_i^{1/2}]\quad).
\]
Currently the package defaults to a two-stage stratified cluster sampling design, with the sampling clusters (enumeration areas) being stratified by Admin 1 areas and urban/rural strata, which is the most common sampling design in the DHS.
##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5845443 0.0007002450        0.3414564      0.011873143
## 2  Copperbelt  0.6023371 0.0006960677        0.4152124      0.012132270
## 3     Eastern  0.6505890 0.0003224625        0.6216294      0.006240117
## 4     Luapula  0.6453174 0.0007216008        0.5985190      0.013774331
## 5      Lusaka  0.5817949 0.0004319779        0.3301459      0.007296978
## 6    Muchinga  0.6884903 0.0004137610        0.7930706      0.008995197
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          84.22370 0.02646214    0.5319292    0.6352999 0.04526970
## 2          82.42480 0.02638310    0.5496679    0.6527379 0.04380122
## 3         160.25341 0.01795724    0.6146268    0.6849157 0.02760151
## 4          72.59881 0.02686263    0.5910940    0.6960479 0.04162700
## 5         137.04303 0.02078408    0.5405908    0.6218883 0.03572406
## 6         111.17044 0.02034112    0.6472976    0.7269017 0.02954452For national level (admin = 0), the function can additionally produce urban- and rural-specific direct estimates by specifying argument strata.
res_ad0 <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="all")
res_ad0_urban <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="urban")
res_ad0_rural <- directEST(data=data,
                   cluster.info= cluster.info,
                   admin=0,
                   strata="rural")
list(all = res_ad0$res.admin0, 
      urban = res_ad0_urban$res.admin0, 
      rural = res_ad0_rural$res.admin0)## $all
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6344011 7.614786e-05        0.5511447      0.001415533          706.4475
##     direct.se direct.lower direct.upper
## 1 0.008726274    0.6171347    0.6513289
## 
## $urban
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6066641 0.0002399573        0.4333115      0.004214152          237.2957
##    direct.se direct.lower direct.upper
## 1 0.01549055    0.5759275    0.6365788
## 
## $rural
##   direct.est   direct.var direct.logit.est direct.logit.var direct.logit.prec
## 1  0.6519019 0.0001103517        0.6274099      0.002142946          466.6473
##    direct.se direct.lower direct.upper
## 1 0.01050484    0.6310396    0.6721974In general, direct estimates at Admin 2 are unstable. For countries with a large number of Admin 2 areas, some of the direct estimates can be undefined if there is no cluster in the area.
Fay-Herriot models provides smoothed estimates at the areal level using direct estimate \(\hat p^{W}_{i}\) as input. The direct estimates are modeled as a noisy observation of the true prevalence, with the variance of noise determined by the design-based variance. We consider the spatial Fay-Herriot model for the logit transformed direct estimates, which is defined as follows:
\[\text{logit}(\hat p^{W}_{i})|\lambda_{i} \sim\textrm{Normal}(\lambda_{i}, V_{i}^{HT}),\] \[\lambda_{i}= \alpha + e_i+S_i.\] Here \(\text{expit}(\lambda_{i})\) is the latent true prevalence, and \(e_i\) and \(S_i\) are unstructured and structured spatial random effects. Inference is carried out using Bayesian methods and so the model specific is completed by priors on \(\alpha\), \(e\) and \(S\), and their hyperpriors. More details of the Bayesian model setup can be found in Wakefield, Okonek, and Pedersen (2020). Area level Fay-Herriot model are viewed as the most reliable model choice, since they acknowledge the design through the sue of a weighted estimate and its associated variance. See chapters 4 to 6 of Rao and Molina (2015).
As of now the package allows only an overall intercept \(\alpha\), but future versions of the package will allow area level covariates to be included. The default prior for the intercept is \(N(0, 1000)\). The structured and non-structured random effects are implemented used Besag-York-Mollié (BYM) via BYM2 parameterization, with default PC priors such that the marginal standard deviation has a prior such that \(Prob(\sigma > 1) = 0.01\) and the proportion of variation explained by the spatial effect, \(\phi\) has a prior such that \(Prob(\phi > 0.5) = 2/3\) (Riebler et al. 2016; Simpson et al. 2017).
The fhModel() calculates spatial Fay-Herriot estimates at different administrative levels using SUMMER::smoothSurvey(). Users can specify either only i.i.d. random effects (i.e., no \(S_i\) term) or the BYM2 model by setting model = "iid" or model = "bym2". The function returns model results at the specified admin level with aggregation = FALSE. Aggregated to higher admin levels is possible with aggregation = TRUE when additional population information is provided. The aggregation steps are described in more details in Section 7.
smth_res_ad1_bym2 <- fhModel(data,
                        cluster.info = cluster.info,
                        admin.info = admin.info1,
                        admin = 1,
                        model = "bym2",
                        aggregation =FALSE)
smth_res_ad1_iid <- fhModel(data, 
                        cluster.info = cluster.info, 
                        admin.info = admin.info1,
                        admin = 1, 
                        model = "iid",
                        aggregation =FALSE)
head(smth_res_ad1_bym2$res.admin1) ##   admin1.name      mean          var    median     lower     upper logit.mean
## 1     Central 0.6028279 0.0005262850 0.6035275 0.5559112 0.6455631  0.4181968
## 2  Copperbelt 0.6130262 0.0005303788 0.6135925 0.5663871 0.6567400  0.4611124
## 3     Eastern 0.6479354 0.0002699419 0.6480306 0.6153815 0.6799110  0.6107456
## 4     Luapula 0.6468364 0.0005282296 0.6471917 0.6007430 0.6910090  0.6066520
## 5      Lusaka 0.5941311 0.0003830601 0.5944710 0.5549919 0.6312606  0.3816880
## 6    Muchinga 0.6768713 0.0003558819 0.6769527 0.6397106 0.7137660  0.7407531
##     logit.var logit.median logit.lower logit.upper         sd         cv
## 1 0.009195237    0.4201849   0.2245841   0.2245841 0.02294090 0.03805548
## 2 0.009451857    0.4624387   0.2671258   0.2671258 0.02302995 0.03756765
## 3 0.005201276    0.6103938   0.4699907   0.4699907 0.01642991 0.02535732
## 4 0.010167730    0.6067175   0.4085620   0.4085620 0.02298325 0.03553177
## 5 0.006601776    0.3824795   0.2208608   0.2208608 0.01957192 0.03294209
## 6 0.007476746    0.7398026   0.5741084   0.5741084 0.01886483 0.02787064Columns 2-6 represent the posterior mean, posterior variance, posterior median and 2.5%, and 97.5% posterior quantiles, respectively. The posterior variance can be viewed as an analogue of the mean squared error (MSE) summary.
A Fay-Herriot model at Admin 2 level can be fitted by treating areas without direct estimates as missing data, though it is usually not recommended due to data sparsity. In addition, numerical issue arise when design-based variance of direct estimate is close to zero and logit precision become too large. A potential fix for this issue is to identify regions with too small design variance(< 1e-30), and delete the clusters in these regions, before fitting the model. However, this step creates additional bias in the results if the number of clusters deleted is large.
bad_admin2 <- subset(res_ad2$res.admin2, direct.var < 1e-30)$admin2.name.full
# identify clusters in these regions
bad_clusters <- subset(cluster.info$data, admin2.name.full %in% bad_admin2)$cluster
# Run FH model without these clusters
smth_res_ad2 <- fhModel(subset(data, !cluster %in% bad_clusters),
                    cluster.info = cluster.info,
                    admin.info = admin.info2,
                    admin = 2,
                    model = "bym2",
                    aggregation = FALSE)
head(smth_res_ad2$res.admin2)##        admin2.name.full      mean          var    median     lower     upper
## 1      Central_Chibombo 0.6535171 0.0026360727 0.6550209 0.5480457 0.7496094
## 2      Central_Chisamba 0.5579803 0.0006107532 0.5581483 0.5089874 0.6060565
## 3      Central_Chitambo 0.6217739 0.0101383288 0.6276832 0.4110805 0.8000109
## 4  Central_Itezhi-tezhi 0.6436590 0.0101398292 0.6505298 0.4284150 0.8197984
## 5         Central_Kabwe 0.5213792 0.0011020175 0.5213857 0.4560794 0.5861891
## 6 Central_Kapiri Mposhi 0.5106414 0.0015870676 0.5106247 0.4323244 0.5880890
##   logit.mean  logit.var logit.median logit.lower logit.upper         sd
## 1 0.64252506 0.05257552   0.64118322  0.19277749  0.19277749 0.05134270
## 2 0.23355308 0.01008966   0.23365025  0.03595353  0.03595353 0.02471342
## 3 0.52003748 0.19864247   0.52229030 -0.35950027 -0.35950027 0.10068927
## 4 0.61982125 0.20921818   0.62136870 -0.28832074 -0.28832074 0.10069672
## 5 0.08594926 0.01785317   0.08559503 -0.17613618 -0.17613618 0.03319665
## 6 0.04284320 0.02574167   0.04250537 -0.27237407 -0.27237407 0.03983802
##           cv
## 1 0.07856367
## 2 0.04429085
## 3 0.16193872
## 4 0.15644419
## 5 0.06367084
## 6 0.07801564Cluster-level models assume smoothing models for counts of events in each cluster (Wakefield, Okonek, and Pedersen 2020; Li et al. 2020). In terms of traditional SAE literature, cluster-level models are a type of unit-level model. We start with describing an unstratified model without taking into account the urban/rural stratification in the sampling design.
Let \(Y_c\) be the number of events in cluster \(c\), and \(n_c\) be the number of individuals at risk, where \(c= 1,\dots,C\). The unstratified model assumes the hierarchical structure:
\[Y_c \mid p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\]
where \(\alpha\) is the intercept, and \(i[s_c]\) indexes the area within which the cluster \(s_c\) resides. Similar to the area-level model, \(e_i\) and \(S_i\) are unstructured and structured spatial random effects with the same prior as before. The Beta-binomial distribution arise from a hierarchical model where the probability follows a \(\text{Beta}(a, b)\) prior. The overdispersion parameter, \(d=\frac{1}{\alpha+\beta+1}\), is between 0 and 1 and represent the the intracluster correlation between Bernoulli draws within cluster. The default prior for \(d\) is \(\text{logit}(d) \sim \text{Normal}(0,0.4)\).
The clusterModel() function fits the cluster-level model above. Unstratified model can be specified by setting stratification = FALSE, and either i.i.d. or the BYM2 model by setting model = "bym2" or model = "iid". For the the overdispersion parameter \(d\), users can change the prior mean and precision through overdisp.mean and overdisp.prec.
We fit the unstratified model in both Admin 1 and Admin 2 below. These models differ only in the admin level at which the spatial random effects enter, with the cluster-level model being the same in each case.
cl_res_ad1 <- clusterModel(data=data,
                   cluster.info=cluster.info,
                   admin.info  = admin.info1,
                   stratification = FALSE,
                   model = "bym2",
                   admin = 1,
                   aggregation =FALSE,
                   CI = 0.95)
cl_res_ad2 <- clusterModel(data=data,
                   cluster.info= cluster.info,
                   admin.info = admin.info2,
                   model = "bym2",
                   stratification =FALSE,
                   admin = 2,
                   aggregation =FALSE,
                   CI = 0.95)
head(cl_res_ad2$res.admin2)##        admin2.name.full      mean    median         sd         var     lower
## 1      Central_Chibombo 0.6448760 0.6447917 0.04581087 0.002098636 0.5518984
## 2      Central_Chisamba 0.5963255 0.5973862 0.05510070 0.003036088 0.4839398
## 3      Central_Chitambo 0.5798794 0.5841081 0.06852435 0.004695587 0.4385194
## 4  Central_Itezhi-tezhi 0.6458915 0.6466465 0.07320677 0.005359231 0.4920623
## 5         Central_Kabwe 0.5445200 0.5434148 0.04622923 0.002137142 0.4522349
## 6 Central_Kapiri Mposhi 0.5598836 0.5622164 0.04545403 0.002066069 0.4685788
##       upper         cv admin1.name   admin2.name population urban
## 1 0.7313468 0.07103827     Central      Chibombo         NA    NA
## 2 0.6996858 0.09240039     Central      Chisamba         NA    NA
## 3 0.7063556 0.11817001     Central      Chitambo         NA    NA
## 4 0.7798355 0.11334222     Central  Itezhi-tezhi         NA    NA
## 5 0.6331086 0.08489905     Central         Kabwe         NA    NA
## 6 0.6465233 0.08118479     Central Kapiri Mposhi         NA    NACluster-level models will produce biased estimates if the specified model does not hold for all units in the population. In the DHS, urban clusters are often oversampled. If such oversampling occurs and the outcome associated with urban/rural, then bias will result when the model does not allow for differences between urban/rural. To emphasize, bias will only arise when oversampling of urban (or rural) areas occurs, and the indicator has an association with urban/rural. One may check whether one needs to include urban/rural terms in the model, and then aggregate by urban/rural.
A fully adjusted model would include interaction terms representing the crossing of Admin 1 areas and urban/rural status. However, for reasons of parsimony we include area-level random effects and a single urban/rural fixed effect. The sampling model becomes
\[Y_c|p_c,d\sim \textrm{BetaBinomial}(n_c,p_c,d),\] \[p_c=\textrm{expit}(\alpha+\gamma\times I(s_c \in \textrm{urban})+x_c\beta+e_{i[s_c]}+S_{i[s_c]}).\]
The area-level risk is defined as \[p_i=q_i\times \text{expit}(\alpha+\gamma+e_{i[s_c]}+S_{i[s_c]})+(1-q_i)\times \text{expit}(\alpha+e_{i[s_c]}+S_{i[s_c]}),\] where \(q_i\) is the proportion of urban population in area \(i\) where “urban” is defined by the survey design sampling frame.
Stratified cluster-level model can be fit by setting stratification = TRUE. The returned results then consist of urban-specific, rural-specific, and the overall area-level prevalences. In order to facilitate aggregation for cluster-level model with urban/rural effects, we need to know the fraction of urban population by each administrative region. This computation can be performed with the getUR() function (see example for details), but it requires external information (i.e., the urban/rural definition by census), so we omit this step from this document for now and assume such information is already obtained from a difference source. he urban proportions for women of age 15 to 49 in Zambia in 2018 has been pre-computed and saved in ZambiaPopWomen$admin1_urban and ZambiaPopWomen$admin2_urban.
##     admin1.name admin2.name       urban
## 1       Eastern     Chadiza 0.049373174
## 2      Muchinga       Chama 0.054972234
## 3       Eastern     Chasefu 0.001718483
## 4 North-Western     Chavuma 0.325830536
## 5       Luapula      Chembe 0.000000000
## 6       Central    Chibombo 0.577152577Given the urban/rural proportions, we first add the information into admin.info2 using adminInfo().T
admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,
                         proportion = ZambiaPopWomen$admin2_urban,
                         by.adm="NAME_2",by.adm.upper="NAME_1")
head(admin.info2$data)##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258         NA
## 2     Central      Chisamba      Central_Chisamba 0.03825222         NA
## 3     Central      Chitambo      Central_Chitambo 0.00000000         NA
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000         NA
## 5     Central         Kabwe         Central_Kabwe 0.68604463         NA
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326         NAWe can then fit the stratified cluster-level model by specifying stratification = TRUE.
We can use mapPlot() function from the SUMMER package to visualize estimates and standard errors.
# Arrange all estimates into a long-format data frame
out1 <- res_ad1$res.admin1[, c("admin1.name", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad1_bym2$res.admin1[, c("admin1.name", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad1$res.admin1[, c("admin1.name", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
g1 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "mean", legend.label = "Mean")
g2 <- mapPlot(data = rbind(out1, out2, out3), geo = poly.adm1,
              by.data = "admin1.name",  by.geo = "NAME_1", is.long = TRUE,
              variable = "model", value = "cv", legend.label = "CV")
g1 / g2Figure 6.1: Comparing Admin 1 direct estimates, Fay–Herriot model, and unstratified cluster-level posterior mean estimates
We note that for Admin 2 estimates, surveyPrev uses a recreated region name in the form of [admin1_name]_[admin2_name] to avoid duplicated Admin 2 names. This new region identifier is assigned a column name of admin2.name.full in the output. Thus when plotting, we need to manually create the admin2.name.full in the spatial polygon object.
poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)
# Arrange all estimates into a long-format data frame
out1<- res_ad2$res.admin2[, c("admin2.name.full", "direct.est", "cv")]
colnames(out1)[2] <- "mean"
out1$model <- "Direct Estimates"
out2 <- smth_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out2$model <- "Fay-Herriot Model"
out3 <- cl_res_ad2$res.admin2[, c("admin2.name.full", "mean", "cv")]
out3$model <- "Unstratified Cluster-level Model"
out4 <- cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full", c("admin2.name.full", "mean", "cv")]
out4$model <- "Stratified Cluster-level Model"
g1 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "mean", 
              legend.label = "Mean", ncol = 4)
g2 <- mapPlot(data = rbind(out1, out2, out3, out4), geo = poly.adm2,
              by.data = "admin2.name.full",  by.geo = "admin2.name.full", 
              is.long = TRUE, variable = "model", value = "cv", 
              legend.label = "CV", ncol = 4)
g1 / g2 Figure 6.2: Comparing Admin 2 direct estimates, Fay–Herriot model, unstratified and stratified cluster-level posterior mean estimates
The scatterPlot() function takes results from two model outputs, and produce a scatter plot of the selected variable. For example, we can compare the direct estimates with the Fay-Herriot model (top row) and Cluster-level model (bottom row) in terms of their point estimates and standard deviations. In the two plots on the left, we see the expected shrinkage (attenuation) of the spatial Fay–Herriot posterior mean estimates as compared to the direct estimates. In the plots on the right we see the reduction in uncertainty which arises from using all of the data in a single Model.
s1 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=smth_res_ad1_bym2$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.est",
          value2="mean",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct estimate",
          label1="Direct Est",
          label2="Cluster-level model")
s4 <- scatterPlot(
          res1=res_ad1$res.admin1,
          res2=cl_res_ad1$res.admin1,
          value1="direct.se",
          value2="sd",
          by.res1="admin1.name",
          by.res2="admin1.name",
          title="Cluster-level model vs Direct SE",
          label1="Direct Est",
          label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)Figure 6.3: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row).
Similar smoothing patterns can be observed when comparing estimates at Admin 2 level.
s1 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.est",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay-Herriot vs Direct estimate",
          label1="Direct Est",
          label2="Spatial Fay-Herriot")
s2 <- scatterPlot(
          res1=res_ad2$res.admin2,
          res2=smth_res_ad2$res.admin2,
          value1="direct.se",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Fay–Herriot vs Direct SE",
          label1="Direct Est",
          label2="Spatial Fay–Herriot")
s3 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.est",
        value2="mean",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct estimate",
        label1="Direct Est",
        label2="Cluster-level model")
s4 <- scatterPlot(
        res1=res_ad2$res.admin2,
        res2=cl_res_ad2$res.admin2,
        value1="direct.se",
        value2="sd",
        by.res1="admin2.name.full",
        by.res2="admin2.name.full",
        title="Cluster-level model vs Direct SE",
        label1="Direct Est",
        label2="Cluster-level model")
(s1 + s2 ) / (s3 + s4)Figure 6.4: Comparing direct estimates with Fay-Herriot model (top row) and Unstratified cluster-level model (bottom row). Red triangles corresponding to areas where direct estimates are not available.
Finally, we compare the unstratified and stratified cluster-level model at Admin 2 level. We can observe mild differences but the estimates are mostly similar between the two models.
s1 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="mean",
          value2="mean",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model estimate",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
s2 <- scatterPlot(
          res1=cl_res_ad2$res.admin2,
          res2=cl_res_ad2_T$res.admin2[cl_res_ad2_T$res.admin2$type=="full",],
          value1="sd",
          value2="sd",
          by.res1="admin2.name.full",
          by.res2="admin2.name.full",
          title="Stratified vs Unstratified model SD",
          label1="Unstratified model estimates",
          label2="Stratified model estimates")
(s1 + s2 )  Figure 6.5: Comparing unstratified and stratified cluster-level model at Admin 2 level.
Another visualization function intervalPlot() can provide interval plots at different admin levels for model comparison with compare=T. Input should be a list of model outputs by surveyPrev with self defined model name, (e.g.model=list("name 1"= fit1,"name 2"= fit2)).
intervalPlot(admin = 1, compare = TRUE, model = list(
      "Direct estimate model"= res_ad1,
      "Fay-Herriot model"= smth_res_ad1_bym2,
      "Unstratified Cluster-level model"= cl_res_ad1))Figure 6.6: Comparing different Admin 1 estimates.
For stratified cluster-level models, intervalPlot() visualizes the urban, rural, and overall Admin 2 estimates within each Admin 1 region, when setting compare = FALSE.
plots <- intervalPlot(compare = FALSE, model = list("model"=cl_res_ad2_T))
patchwork::wrap_plots(plots, ncol = 2)Figure 6.7: Comparing Admin 2 estimates arranged by the corresponding Admin 1 regions.
Population fractions are necessary to aggregate estimates from finer administrative regions to higher levels. The script below demonstrates the creation of a pixel-level population raster at the 100m by 100m level for women of age 15 to 49 in Zambia in 2018. The corresponding age-sex-specific population estimates can be found at https://hub.worldpop.org/geodata/summary?id=16429. The path to the downloaded .tiff files are specified in the pop_dir object. The aggPopulation() aggregates population data from the population raster into admin levels based on the input shapefiles we read in previous step. Here we aggregate pixel-level population for women of age 15 to 49 in Zambia in 2018 into both Admin 1 and Admin 2 levels.
library(raster)
pop.abbrev <- "ZMB"
year <- 2018
pop.dir <- "../data/Zambia/worldpop"
## first sum up four rasters female  15-49
f_15_name <- paste0(pop.dir, "/", pop.abbrev, '_f_15_', year, '.tiff')
f_20_name <- paste0(pop.dir, "/", pop.abbrev, '_f_20_', year, '.tiff')
f_25_name <- paste0(pop.dir, "/", pop.abbrev, '_f_25_', year, '.tiff')
f_30_name <- paste0(pop.dir, "/", pop.abbrev, '_f_30_', year, '.tiff')
f_35_name <- paste0(pop.dir, "/", pop.abbrev, '_f_35_', year, '.tiff')
f_40_name <- paste0(pop.dir, "/", pop.abbrev, '_f_40_', year, '.tiff')
f_45_name <- paste0(pop.dir, "/", pop.abbrev, '_f_45_', year, '.tiff')
pop_f_15 <- raster(f_15_name)
pop_f_20 <- raster(f_20_name)
pop_f_25 <- raster(f_25_name)
pop_f_30 <- raster(f_30_name)
pop_f_35 <- raster(f_35_name)
pop_f_40 <- raster(f_40_name)
pop_f_45 <- raster(f_45_name)
raster <- pop_f_45 + pop_f_40 + pop_f_35 + pop_f_30 + pop_f_25 + pop_f_20 + pop_f_15
agg.pop1 <- aggPopulation(
   tiff = pop_raster,
   poly.adm  = ZambiaAdm1,
   by.adm  = "NAME_1")
colnames(agg.pop1)[1] <- "admin1.name"
agg.pop2 <- aggPopulation(
   tiff = pop_raster,
   poly.adm = ZambiaAdm2,
   by.adm  = "NAME_2",
   by.adm.upper= "NAME_1")The computed population estimates in this example is built into the package and available in the ZambiaPopWomen dataset.
##   admin1.name population
## 1     Central   430358.5
## 2  Copperbelt   598032.5
## 3     Eastern   452361.1
## 4     Luapula   342075.2
## 5      Lusaka   808640.2
## 6    Muchinga   214810.4##        admin2.name.full population admin1.name   admin2.name
## 1      Central_Chibombo  108561.22     Central      Chibombo
## 2      Central_Chisamba   28307.61     Central      Chisamba
## 3      Central_Chitambo   20727.69     Central      Chitambo
## 4  Central_Itezhi-tezhi   22814.54     Central  Itezhi-tezhi
## 5         Central_Kabwe   56875.20     Central         Kabwe
## 6 Central_Kapiri Mposhi   55995.96     Central Kapiri MposhiAn alternative approach to obtain population fractions is to simply use survey weights to estimate the population size within each area. This approach, however, can lead to noisy estimates at fine spatial resolutions. Nevertheless, similar to aggPopulation(), the aggSurveyWeight() function produces the estimated population size at the specified admin level. The output can be used as weights for aggregation model results to higher admin levels.
agg.survey1 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 1)
agg.survey2 <- aggSurveyWeight(data = data, cluster.info = cluster.info, admin = 2,
    poly.adm = poly.adm2, by.adm = "NAME_2", by.adm.upper = "NAME_1")As mentioned in Section 2, the adminInfo() combine information for each administrative area into a single data frame. This includes the name, population, and fractions of the urban population for each area. Population and urban population fractions can be added through the agg.pop and proportion objects, respectively. Additionally, the survey weight (output of aggSurveyWeight()) may serve as an alternative to population data, and users can add it via agg.pop.
admin.info1 <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =ZambiaPopWomen$admin1_pop,
                         proportion = ZambiaPopWomen$admin1_urban )
admin.info2 <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =ZambiaPopWomen$admin2_pop,
                         proportion = ZambiaPopWomen$admin2_urban)
admin.info1.survey <- adminInfo(poly.adm = poly.adm1, 
                         admin = 1,
                         by.adm="NAME_1",
                         agg.pop  =agg.survey1,
                         proportion = ZambiaPopWomen$admin1_urban )
admin.info2.survey  <- adminInfo(poly.adm = poly.adm2, 
                         admin = 2,by.adm="NAME_2",by.adm.upper = "NAME_1",
                         agg.pop =agg.survey2,
                         proportion = ZambiaPopWomen$admin2_urban)
head(admin.info2$data)##   admin1.name   admin2.name      admin2.name.full      urban population
## 1     Central      Chibombo      Central_Chibombo 0.57715258  108561.22
## 2     Central      Chisamba      Central_Chisamba 0.03825222   28307.61
## 3     Central      Chitambo      Central_Chitambo 0.00000000   20727.69
## 4     Central  Itezhi-tezhi  Central_Itezhi-tezhi 0.00000000   22814.54
## 5     Central         Kabwe         Central_Kabwe 0.68604463   56875.20
## 6     Central Kapiri Mposhi Central_Kapiri Mposhi 0.13246326   55995.96
##   population.admin1
## 1          430358.5
## 2          430358.5
## 3          430358.5
## 4          430358.5
## 5          430358.5
## 6          430358.5The following maps display the 15-49 female population and aggregated DHS survey weights ratio within upper admin levels.
out1 <- rbind(cbind(admin.info1$data, type = "WorldPop"), 
             cbind(admin.info1.survey$data, type = "SurveyWeight"))
out1 <- out1 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
g1 <- mapPlot(data = out1, geo = poly.adm1, size = 0.1, 
              by.data = "admin1.name", by.geo = "NAME_1", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")
out2 <- rbind(cbind(admin.info2$data, type = "WorldPop"), 
             cbind(admin.info2.survey$data, type = "SurveyWeight"))
out2 <- out2 %>% group_by(type) %>% mutate(pop.frac = population / sum(population)) 
poly.adm2$admin2.name.full=paste0(poly.adm2$NAME_1,"_",poly.adm2$NAME_2)
g2 <- mapPlot(data = out2, geo = poly.adm2, size = 0.1, 
              by.data = "admin2.name.full", by.geo = "admin2.name.full", 
              variable = "type",  value = "pop.frac", is.long = TRUE,
              legend.label = "Population Fraction")
g1 / g2Figure 7.1: survey weight ratio and population ratio at Admin 1 and Admin 2
The scatter plot below examines the fraction of population residing in each Admin 1 areas, computed by both WorldPop raster and survey weights. Noticed that DHS data may have missing Admin 2 areas. For example, Zambia has 112 out of 115 Admin 2 areas for this survey (red points in the scatter plots). Users have to be aware of that this is an incomplete set of weights for aggregation to from Admin 2 to Admin 1.
out2$pop.frac.within <- out2$population / out2$population.admin1
out3 <- tidyr::spread(out2[, c("admin2.name.full", "pop.frac.within", "type")], 
                      type, pop.frac.within)
out3 <- left_join(out3, admin.info2$data[, c("admin2.name.full", "population")])
ggplot(out3, aes(x = SurveyWeight, y = WorldPop, size = population)) +
    geom_point(alpha = 0.5, color = "red")+
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")Figure 7.2: Comparing two sets of Admin 2 15-49 female population fractions within Admin 1 area.
For direct estimates, we caution that we almost always prefer directly computing direct estimates at the desired level, rather than aggregating direct estimates at finer levels. Nevertheless, here we describe how the aggregated direct estimates are computed.
Let \(i\) denote the index for the lower admin level and \(k[i]\) be the index for the corresponding upper admin level, then aggregated estimates for the \(k\)-th upper admin area is
\[ \hat p^{agg}_{k}=\frac {\sum_{k[i] = k} \hat p^{W}_{i}E_i}{\sum_{k[i] = k} \text{E}_i},\] where \(E_i=\widehat{\text{pop}_{i}} \times \textbf{1}(\hat p^{W}_{i} \neq \text{NA})\), i.e., the point estimates are obtained by weighted average of non-missing direct estimates in the area, where the weights are either given by external population information, or estimated by survey weights. The standard error and confidence intervals are computed by simulation using samples from the design-based asymptotic sampling distributions of \(\text{logit}(\hat p^{W}_i)\).
The directEST() function allows aggregation by survey weight directly by specifying weight = "survey", without going through the steps discussed in Section 7.2. We compute the direct estimates at admin1 level and aggregate them to national estimates. When using survey weight, the aggregated point estimate at national level is the same as the national level direct estimate, but the uncertainty calculation are different as discussed above.
res_ad1agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "population", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)
head(res_ad1agg$res.admin1)##   admin1.name direct.est   direct.var direct.logit.est direct.logit.var
## 1     Central  0.5845443 0.0007002450        0.3414564      0.011873143
## 2  Copperbelt  0.6023371 0.0006960677        0.4152124      0.012132270
## 3     Eastern  0.6505890 0.0003224625        0.6216294      0.006240117
## 4     Luapula  0.6453174 0.0007216008        0.5985190      0.013774331
## 5      Lusaka  0.5817949 0.0004319779        0.3301459      0.007296978
## 6    Muchinga  0.6884903 0.0004137610        0.7930706      0.008995197
##   direct.logit.prec  direct.se direct.lower direct.upper         cv
## 1          84.22370 0.02646214    0.5319292    0.6352999 0.04526970
## 2          82.42480 0.02638310    0.5496679    0.6527379 0.04380122
## 3         160.25341 0.01795724    0.6146268    0.6849157 0.02760151
## 4          72.59881 0.02686263    0.5910940    0.6960479 0.04162700
## 5         137.04303 0.02078408    0.5405908    0.6218883 0.03572406
## 6         111.17044 0.02034112    0.6472976    0.7269017 0.02954452##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%   0.628894 0.008586162 7.372218e-05    0.6117665    0.6452747res_ad1agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 1, 
                   weight = "survey", 
                   admin.info = admin.info1, 
                   aggregation = TRUE)
head(res_ad1agg_bysurvey$agg.natl)##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%  0.6343419 0.008295346 6.881276e-05    0.6176332    0.6495225Similarly, if we set admin = 2, the output includes aggregated results at both the Admin1 and national level with aggregation = TRUE, computed by weighting the Admin 2 level estimates by their population (as specified by weight = "population").
res_ad2agg <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2,
                   admin.info = admin.info2,
                   weight = "population",
                   aggregation = TRUE)
res_ad2agg_bysurvey <- directEST(data = data,
                   cluster.info = cluster.info,
                   admin = 2, 
                   weight = "survey", 
                   admin.info = admin.info2, 
                   aggregation = TRUE)
head(res_ad2agg$res.admin2)##        admin2.name.full direct.est   direct.var direct.logit.est
## 1      Central_Chibombo  0.6657436 3.575578e-03      0.688996052
## 2      Central_Chisamba  0.5531800 6.467530e-04      0.213527778
## 3      Central_Chitambo  0.5005565 5.721326e-02      0.002225886
## 4         Central_Kabwe  0.5129186 1.198784e-03      0.051686066
## 5 Central_Kapiri Mposhi  0.4903518 1.841140e-03     -0.038597405
## 6         Central_Luano  0.6666667 4.814825e-35      0.693147181
##   direct.logit.var direct.logit.prec    direct.se direct.lower direct.upper
## 1     7.220591e-02      1.384928e+01 5.979614e-02    0.5404939    0.7712991
## 2     1.058620e-02      9.446256e+01 2.543134e-02    0.5029671    0.6023310
## 3     9.154144e-01      1.092401e+00 2.391929e-01    0.1331933    0.8673198
## 4     1.920618e-02      5.206657e+01 3.462346e-02    0.4452360    0.5801308
## 5     2.948019e-02      3.392108e+01 4.290851e-02    0.4073046    0.5739350
## 6     9.750020e-34      1.025639e+33 6.938894e-18    0.6666667    0.6666667
##             cv   admin2.name admin1.name
## 1 8.981857e-02      Chibombo     Central
## 2 4.597299e-02      Chisamba     Central
## 3 4.778540e-01      Chitambo     Central
## 4 6.750284e-02         Kabwe     Central
## 5 8.750555e-02 Kapiri Mposhi     Central
## 6 1.040834e-17         Luano     Central##   admin1.name direct.est  direct.se direct.lower direct.upper
## 1     Central  0.6072971 0.02240626    0.5614618    0.6488131
## 2  Copperbelt  0.6122046 0.02495829    0.5586943    0.6560586
## 3     Eastern  0.6611554 0.01534567    0.6290711    0.6888775
## 4     Luapula  0.6531875 0.02380226    0.6032167    0.6959296
## 5      Lusaka  0.5799671 0.01995427    0.5401101    0.6180307
## 6    Muchinga  0.6785380 0.01604594    0.6443338    0.7068600##      direct.est   direct.se   direct.var direct.lower direct.upper
## 2.5%    0.63164 0.007168317 5.138477e-05    0.6158581    0.6436442We can compare these aggregated results at the national level using intervalPlot().
intervalPlot(admin = 0, group = FALSE, compare = TRUE, 
  model = list(
  "Admin 0 direct estimate" = res_ad0,
  "Admin 2 direct estimate, aggregated by survey" =res_ad2agg_bysurvey,
  "Admin 1 direct estimate, aggregated by survey" =res_ad1agg_bysurvey,
  "Admin 2 direct estimate, aggregated by population" = res_ad2agg,
  "Admin 1 direct estimate, aggregated by population" = res_ad1agg))Figure 7.3: Comparing different aggregated direct estimates at national level
Aggregation of the Fay-Herriot model can be similarly carried out as model based results. We first re-fit the Admin 1 model with the two updated admin.info objects.
smth_res_ad1_spatial <- fhModel(data=data,
                                cluster.info = cluster.info,
                                admin.info = admin.info1,
                                admin = 1,
                                model = "bym2",
                                aggregation = TRUE)
smth_res_ad1_spatial_survey <- fhModel(data=data,
                                      cluster.info = cluster.info,
                                      admin.info = admin.info1.survey,
                                      admin = 1,
                                      model = "bym2",
                                      aggregation = TRUE)We also re-fit the Admin 2 model with the two updated admin.info objects.
# Run FH model without problematic clusters
smth_res_ad2_agg_pop <- fhModel(data=subset(data, !cluster %in% bad_clusters),
                                cluster.info = cluster.info,
                                admin.info = admin.info2,
                                admin = 2,
                                model = "bym2",
                                aggregation = TRUE)
smth_res_ad2_agg_survey <- fhModel(data=subset(data, !cluster %in% bad_clusters),
                                  cluster.info = cluster.info,
                                  admin.info = admin.info2.survey,
                                  admin = 2,
                                  model = "bym2",
                                  aggregation = TRUE)And we can compare all four models when aggregating to national level, with the national direct estimate as an additional comparison.
intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 1 FH model, aggregated by survey" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
            "Admin 1 FH model, aggregated by pop" = smth_res_ad1_spatial))Figure 7.4: Comparing different aggregated Fay-Herriot estimates at national level
We can also compare all four set of results at the Admin 1 level, with the Admin 1 direct estimate as an additional comparison.
intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
            "Admin 1 direct estimates" = res_ad1,
            "Admin 1 FH model" = smth_res_ad1_spatial_survey,
            "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
            "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop))Figure 7.5: Comparing different aggregated Fay-Herriot estimates at national level
In the same manner, we fit the unstratified and stratified cluster-level model at both Admin 1 and Admin 2 levels. For simplicity of presentation, here we use only the population fractions derived from WorldPop. Results aggregating with survey weights produce very similar results and can be obtained by replacing admin.info argument with the corresponding object computed survey weights, i.e., admin.info1.survey and admin.info2.survey.
cl_res_ad1_pop <- clusterModel(data=data,
                               cluster.info=cluster.info,
                               admin.info = admin.info1,
                               stratification = FALSE,
                               model = "bym2",
                               admin = 1, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad2_pop <- clusterModel(data=data,
                               cluster.info= cluster.info,
                               admin.info = admin.info2,
                               model = "bym2",
                               stratification = FALSE,
                               admin = 2, 
                               aggregation = TRUE,
                               CI = 0.95)
cl_res_ad1_T_pop <- clusterModel(data=data,
                                 cluster.info=cluster.info,
                                 admin.info = admin.info1,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 1, 
                                 aggregation = TRUE,
                                 CI = 0.95)
cl_res_ad2_T_pop <- clusterModel(data=data,
                                 cluster.info = cluster.info,
                                 admin.info = admin.info2,
                                 model = "bym2",
                                 stratification = TRUE,
                                 admin = 2, 
                                 aggregation = TRUE,
                                 CI = 0.95)Finally we compare the different aggregated national estimates with the national direct estimates
intervalPlot(admin = 0, group = FALSE, compare = TRUE, model = list(
            "Admin 0 direct estimates" = res_ad0,
             "Admin 1 unstratified model, aggregated by pop" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model, aggregated by pop" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))Figure 7.6: Comparing different aggregated cluster-level estimates at national level
And we can also compare the different model estimates at the Admin 1 level, together with the Admin 1 direct estimates
intervalPlot(admin = 1, group = FALSE, compare = TRUE, model = list(
             "Admin 1 direct estimates" = res_ad1,
             "Admin 1 unstratified model" = cl_res_ad1_pop,
             "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
             "Admin 1 stratified model" = cl_res_ad1_T_pop,
             "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))Figure 7.7: Comparing different cluster-level estimates at Admin 1 level
Finally, we put together all the different Admin 1 level results based on both Fay-Herriot and cluster-level models. When comparing many estimates, it is usually useful to highlight groups of similar models. This can be done by specifying a group object in the fitted models, and set group = TRUE in intervalPlot().
res_ad1$group <- "Direct Estimate"
smth_res_ad1_spatial_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_survey$group <- "Fay-Herriot"
smth_res_ad2_agg_pop$group <- "Fay-Herriot"
cl_res_ad1_pop$group <- "Cluster-Level"
cl_res_ad2_pop$group <- "Cluster-Level"
cl_res_ad1_T_pop$group <- "Cluster-Level"
cl_res_ad2_T_pop$group <- "Cluster-Level"
intervalPlot(admin = 1, group = TRUE, compare = TRUE, model = list(
                 "Admin 1 direct estimates" = res_ad1,
                 "Admin 1 FH model" = smth_res_ad1_spatial_survey,
                 "Admin 2 FH model, aggregated by survey" = smth_res_ad2_agg_survey,
                 "Admin 2 FH model, aggregated by pop" = smth_res_ad2_agg_pop,
                 "Admin 1 unstratified model" = cl_res_ad1_pop,
                 "Admin 2 unstratified model, aggregated by pop" = cl_res_ad2_pop,
                 "Admin 1 stratified model" = cl_res_ad1_T_pop,
                 "Admin 2 stratified model, aggregated by pop" = cl_res_ad2_T_pop))Figure 7.8: Comparing different model estimates at national level
We thank Ben Mayala and Trevor Croft from the Demographic and Health Surveys (DHS) program for useful discussions in creating this R package.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.