This vignette describes a general procedure that uses methods implemented in two R packages developed at the Italian National Institute of Statistics (Istat):

  1. R2BEAT (“Multivariate optimal allocation for different domains in one and two stages stratified sample design”)
  2. FS4 (“First Stage Stratification and Selection in Sampling”)

The first package determines the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs). The second selects the PSUs in such a way that the final sample of SSUs is self-weighting, i.e. the total inclusion probabilities (the product of the inclusion probabilities of the PSUs and those of the SSUs within them) are nearly equal for all SSUs.
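To see why a design of this kind can be self-weighting, here is a minimal base R sketch. It assumes, purely for illustration, that PSUs are selected with probability proportional to size and that the same number of SSUs is then drawn in every selected PSU. The sizes, the sample numbers and all variable names are made up and are not part of R2BEAT or FS4.

```r
# Hypothetical stratum: five PSUs containing different numbers of SSUs
psu_size <- c(1200, 800, 500, 300, 200)
n_psu    <- 2    # number of PSUs to select (assumed)
m_ssu    <- 50   # number of SSUs to select in each sampled PSU (assumed)

# First stage: inclusion probability proportional to PSU size
pi_psu <- n_psu * psu_size / sum(psu_size)

# Second stage: equal-probability selection of m_ssu units inside the PSU
pi_ssu_given_psu <- m_ssu / psu_size

# Total inclusion probability of an SSU: product of the two stages
pi_total <- pi_psu * pi_ssu_given_psu
print(pi_total)
```

The PSU size cancels out in the product, so every SSU ends up with the same total inclusion probability, n_psu * m_ssu / sum(psu_size), whatever PSU it belongs to.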

The workflow assumes that data from at least one previous round of the survey whose sampling design is to be optimized are available. It consists of the following steps:

1 Use of ReGenesees

Define the sample design, and optionally the calibration step, outside this procedure using the R package ReGenesees (also developed at Istat), and make the design object and the calibrated object available:

load("R2BEAT_ReGenesees.RData")   # ReGenesees design object

This is the ‘design’ object:

des
## Stratified 2 - Stage Cluster Sampling Design (with replacement)
## - [49] strata (collapsed)
## - [789, 2236] clusters
## 
## Call:
## e.svydesign(sample_2st, ids = ~municipality + id_hh, strata = ~stratum_sub, 
##     weights = ~d, self.rep.str = ~SR, check.data = TRUE)

and this is the calibrated object:

cal
## Calibrated, Stratified 2 - Stage Cluster Sampling Design (with replacement)
## - [49] strata (collapsed)
## - [789, 2236] clusters
## 
## Call:
## e.calibrate(design = des, df.population = pop, calmodel = ~clage:sex - 
##     1, partition = ~region, calfun = "logit", bounds = c(0.7, 
##     1.7), aggregate.stage = 2, force = FALSE)

It is advisable to check for the presence of lonely strata:

# Check for strata with fewer than two PSUs
ls <- find.lon.strata(des)
## # No lonely PSUs found!

If any lonely strata are found, collapse them and repeat the calibration.
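The idea behind this check can be illustrated without ReGenesees. The sketch below uses a small made-up data frame (all names and values are hypothetical) and flags the strata that contain fewer than two distinct PSUs. It is only meant to show the concept and is not the algorithm used by find.lon.strata.

```r
# Hypothetical sample: one row per sampled unit, with its stratum and PSU
toy <- data.frame(
  stratum = c("A", "A", "A", "B", "B", "C"),
  psu     = c(1, 1, 2, 3, 3, 4)
)

# Number of distinct PSUs observed in each stratum
n_psu_by_stratum <- tapply(toy$psu, toy$stratum,
                           function(x) length(unique(x)))

# Strata with a single PSU cannot provide a within-stratum variance estimate
lonely <- names(n_psu_by_stratum)[n_psu_by_stratum < 2]
print(lonely)
```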

In this example, the ReGenesees objects contain the following variables:

str(des$variables)
## 'data.frame':    2244 obs. of  15 variables:
##  $ region               : Factor w/ 3 levels "north","center",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ municipality         : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ stratum              : Factor w/ 24 levels "1000","2000",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ stratum_sub          : Factor w/ 81 levels "100001","100002",..: 81 81 81 81 81 81 81 81 81 81 ...
##  $ SR                   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ id_hh                : Factor w/ 2236 levels "H100070","H100410",..: 69 43 64 49 367 27 372 373 374 368 ...
##  $ sex                  : Factor w/ 2 levels "1","2": 1 1 2 2 1 2 1 2 1 1 ...
##  $ clage                : Factor w/ 5 levels "cl0_17","cl18_34",..: 3 1 2 1 5 2 2 2 3 1 ...
##  $ income_hh            : num  43741 23284 23450 22171 19904 ...
##  $ work                 : num  1 1 1 2 0 1 1 1 1 2 ...
##  $ unemployed           : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ d                    : num  1238 1238 1238 1238 1238 ...
##  $ progr_str            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ var.PSU              : chr  "8.H12425" "8.H10738" "8.H12157" "8.H11208" ...
##  $ stratum_sub.collapsed: Factor w/ 49 levels "0.center.clps.1",..: 49 49 49 49 49 49 49 49 49 49 ...

where there are three potential target variables:

  1. income_hh (numeric);
  2. work (categorical, with values 0, 1, 2);
  3. unemployed (categorical, with values 0, 1).

summary(des$variables$income_hh)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   11463   18516   21661   26763  532331
table(des$variables$work)
## 
##    0    1    2 
##  306 1487  451
table(des$variables$unemployed)
## 
##    0    1 
## 1938  306

Particular attention must be paid to the nature of the target variables, especially the categorical ones. The procedure illustrated here is suitable only when categorical variables are binary, with values 0 and 1, the aim being to estimate the proportion of ‘1’ in the population. If a factor variable has any other form, an error message is printed.

Therefore, the ‘work’ variable has to be handled as follows: since the values 0, 1 and 2 indicate, respectively, non labour force, active and inactive people, we can derive two binary variables from ‘work’, namely ‘active’ and ‘inactive’:

# Add the binary (0/1) indicators to both the design and the calibrated object
des <- des.addvars(des, active   = factor(ifelse(work == 1, 1, 0)))
des <- des.addvars(des, inactive = factor(ifelse(work == 2, 1, 0)))
cal <- des.addvars(cal, active   = factor(ifelse(work == 1, 1, 0)))
cal <- des.addvars(cal, inactive = factor(ifelse(work == 2, 1, 0)))

Now all the categorical target variables comply with the binary constraint:

table(cal$variables$active)
## 
##    0    1 
##  757 1487
table(cal$variables$inactive)
## 
##    0    1 
## 1793  451
table(cal$variables$unemployed)
## 
##    0    1 
## 1938  306
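The binary constraint can also be verified programmatically before going on. The following base R sketch uses a hypothetical helper, is_binary01, which is not part of R2BEAT or ReGenesees, and toy vectors rebuilt from the unweighted frequency tables shown above.

```r
# Hypothetical helper: TRUE when a variable only takes the values 0 and 1
is_binary01 <- function(x) {
  vals <- unique(as.character(x[!is.na(x)]))
  length(vals) > 0 && all(vals %in% c("0", "1"))
}

# Toy vectors rebuilt from the (unweighted) frequency tables printed above
unemployed <- factor(rep(c(0, 1), times = c(1938, 306)))
work       <- rep(c(0, 1, 2), times = c(306, 1487, 451))

print(is_binary01(unemployed))   # two categories coded 0/1
print(is_binary01(work))         # three categories, so not compliant

# For a 0/1 variable the mean is the proportion of '1'
p <- mean(as.numeric(as.character(unemployed)))
print(p)
```

Note that p is the unweighted sample proportion, not the design-based estimate that would be obtained from the ReGenesees objects.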

2 Build ‘strata’, ‘deff’, ‘effst’ and ‘rho’ dataframes

Using the ReGenesees objects as input, the function ‘input_to_beat.2st_1’ produces the following dataframes:

  1. the ‘strata’ dataframe (called ‘stratif’ when passed to ‘beat.2st’), containing:
  • STRATUM: identifier of the single stratum
  • N: total population in terms of final sampling units
  • Mi, Si: mean and standard deviation of target variable i (i=1,2,…,P)
  • DOMk: domain(s) to which the stratum belongs
  2. the ‘deff’ (design effect) dataframe, containing the following information:
  • STRATUM: the stratum identifier
  • DEFFi: the design effect for each target variable i (i=1,2,…,P)
  3. the ‘effst’ (estimator effect) dataframe, containing the following information:
  • STRATUM: the stratum identifier
  • EFFSTi: the estimator effect for each target variable i (i=1,2,…,P)
  4. the ‘rho’ (intraclass correlation coefficient) dataframe, containing the following information:
  • STRATUM: the stratum identifier
  • RHO_ARi: the intraclass correlation coefficient in self-representative PSUs for each target variable i (i=1,2,…,P)
  • RHO_NARi: the intraclass correlation coefficient in non self-representative PSUs for each target variable i (i=1,2,…,P)

The ‘deff’ dataframe is not used in the following steps; it is kept for documentation purposes only.

Here is the way we can produce the above items:

RGdes <- des                           # ReGenesees design object
RGcal <- cal                           # ReGenesees calibrated object

strata_vars <- c("stratum")            # variables of stratification
target_vars <- c("income_hh",
                 "active",
                 "inactive",
                 "unemployed")         # target variables
deff_vars <- "stratum"                 # stratification variables to be used when calculating deff and effst 
                                       #    (n.b: must coincide or be a subset of variables of stratification)
id_PSU <- c("municipality")            # identification variable of PSUs
id_SSU <- c("id_hh")                   # identification variable of SSUs
domain_vars <- c("region")             # domain variables
inp1 <- input_to_beat.2st_1(RGdes,
                            RGcal,
                            id_PSU,
                            id_SSU,
                            strata_vars,
                            target_vars,
                            deff_vars,
                            domain_vars)

and these are the results:

head(inp1$strata)
## stratum STRATUM      N       M1        M2        M3        M4       S1        S2        S3        S4 COST CENS DOM1 DOM2
##    1000    1000 197451 22266.58 0.6404431 0.2323140 0.1272429 14554.88 0.4798705 0.4223082 0.3332449    1    0    1 center
##   10000   10000 106106 27985.40 0.7679285 0.2114187 0.0206528 24367.97 0.4221544 0.4083146 0.1422189    1    0    1 north
##   11000   11000 202700 29173.85 0.8029080 0.1730880 0.0240040 39232.92 0.3978024 0.3783234 0.1530613    1    0    1 north
##   12000   12000  57420 26937.42 0.7764955 0.2075926 0.0159119 15743.78 0.4165936 0.4055834 0.1251347    1    0    1 north
##   13000   13000 103089 26357.25 0.7185271 0.2814729 0.0000000 14592.50 0.4497176 0.4497176 0.0000000    1    0    1 north
##   14000   14000  84653 20538.42 0.7518236 0.2131042 0.0350721 14285.81 0.4319547 0.4095007 0.1839621    1    0    1 north
head(inp1$deff)
## stratum STRATUM    DEFF1    DEFF2    DEFF3    DEFF4    b_nar
##    1000    1000 0.951705 0.991140 1.006731 0.954024 56.50000
##   10000   10000 0.856598 1.687606 1.404308 0.819854 26.75000
##   11000   11000 1.811807 1.261816 1.346654 1.339464 23.77778
##   12000   12000 1.086363 0.502458 0.483954 0.700691 21.00000
##   13000   13000 1.000924 1.000924 1.000924 1.000000 95.00000
##   14000   14000 0.633543 0.856820 0.845580 0.677276 33.66667
head(inp1$effst)
## stratum STRATUM    EFFST1 EFFST2 EFFST3    EFFST4
##    1000    1000 0.9689494      1      1 0.9420958
##   10000   10000 0.9500011      1      1 1.1915475
##   11000   11000 0.9544521      1      1 1.0546196
##   12000   12000 1.0429461      1      1 0.9732493
##   13000   13000 0.9914219      1      1 1.0000000
##   14000   14000 0.9829167      1      1 1.0974521
head(inp1$rho)
## STRATUM RHO_AR1   RHO_NAR1 RHO_AR2   RHO_NAR2 RHO_AR3   RHO_NAR3 RHO_AR4   RHO_NAR4
##    1000       1 -0.0008702       1 -0.0001596       1  0.0001213       1 -0.0008284
##   10000       1 -0.0055690       1  0.0267031       1  0.0157013       1 -0.0069960
##   11000       1  0.0356403       1  0.0114944       1  0.0152190       1  0.0149033
##   12000       1  0.0043181       1 -0.0248771       1 -0.0258023       1 -0.0149655
##   13000       1  0.0000098       1  0.0000098       1  0.0000098       1  0.0000000
##   14000       1 -0.0112181       1 -0.0043831       1 -0.0047271       1 -0.0098793
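The printed values appear to be linked by the usual approximation deff = 1 + rho (b - 1), where b is the average number of SSUs per PSU (column b_nar of the ‘deff’ dataframe). The sketch below only reuses numbers shown above, for the first target variable in three strata. That the package derives rho in exactly this way is an assumption, made here to help read the output.

```r
# Values copied from the outputs above (strata 1000, 10000 and 11000)
deff1    <- c(0.951705, 0.856598, 1.811807)       # DEFF1
b_nar    <- c(56.5, 26.75, 23.77778)              # average SSUs per PSU
rho_nar1 <- c(-0.0008702, -0.0055690, 0.0356403)  # printed RHO_NAR1

# Intraclass correlation implied by the design effect
rho_check <- (deff1 - 1) / (b_nar - 1)

print(cbind(rho_nar1, rho_check = round(rho_check, 7)))
```

A design effect below 1 therefore corresponds to a slightly negative intraclass correlation, as in strata 1000 and 10000.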

3 Build ‘PSU’ and ‘design’ dataframes

Prepare the inputs related to the PSUs (function ‘input_to_beat.2st_2’), that is:

  1. the ‘des_file’ dataframe, containing the following information:
  • STRATUM: stratum identifier
  • STRAT_MOS: measure of size of the stratum (number of selection units it contains)
  • DELTA: average number of SSUs for each selection unit
  • MINIMUM: minimum number of SSUs to be selected in each PSU
  2. the ‘psu_file’ dataframe, containing the following information:
  • PSU_ID: PSU identifier
  • STRATUM: stratum identifier
  • PSU_MOS: number of selection units contained in a given PSU

# psu <- read.csv2("psu.csv") # Read the external file containing PSU information
head(psu)
## municipality stratum   ind   hh
##            1   12000  1546  609
##            2   12000   936  402
##            3   12000   367  178
##            4   10000 13032 5788
##            5   12000   678  281
##            6   11000  3193 1194
psu_id      <- "municipality"   # Identifier of the PSU
stratum_var <- "stratum"        # Identifier of the stratum
mos_var     <- "ind"            # Variable to be used as 'measure of size'
delta       <- 1                # Average number of SSUs for each selection unit
minimum     <- 50               # Minimum number of SSUs to be selected in each PSU
inp2 <- input_to_beat.2st_2(psu,
                            psu_id,
                            stratum_var,
                            mos_var,
                            delta,
                            minimum)
head(inp2$psu_file)
## PSU_ID STRATUM PSU_MOS
##      1   12000    1546
##      2   12000     936
##      3   12000     367
##      4   10000   13032
##      5   12000     678
##      6   11000    3193
head(inp2$des_file)
## STRATUM STRAT_MOS DELTA MINIMUM
##    1000    197007     1      50
##    2000    261456     1      50
##    3000    115813     1      50
##    4000     17241     1      50
##    5000    101067     1      50
##    6000     47218     1      50
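STRAT_MOS appears to be the stratum total of the PSU measures of size. The following base R sketch reproduces that kind of aggregation on the six PSUs printed by head(psu). It is an illustration of the structure of ‘des_file’ under that assumption, not the code of input_to_beat.2st_2, and the totals it prints refer to these six PSUs only.

```r
# First six PSUs, as printed by head(psu) above
psu_toy <- data.frame(
  municipality = 1:6,
  stratum      = c(12000, 12000, 12000, 10000, 12000, 11000),
  ind          = c(1546, 936, 367, 13032, 678, 3193)
)

# Stratum measure of size: sum of the PSU measures of size
strat_mos <- aggregate(ind ~ stratum, data = psu_toy, FUN = sum)
names(strat_mos) <- c("STRATUM", "STRAT_MOS")

# Same constants for every stratum, as in the example above
strat_mos$DELTA   <- 1
strat_mos$MINIMUM <- 50

print(strat_mos)
```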

4 Optimal allocation of units in each stratum

Use the function ‘beat.2st’ of the ‘R2BEAT’ package to optimize the allocation of PSUs and SSUs in the strata:

cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.03,0.04),
                         CV2=c(0.06,0.08),
                         CV3=c(0.06,0.08),
                         CV4=c(0.06,0.08)))
cv
##  DOM  CV1  CV2  CV3  CV4
## DOM1 0.03 0.06 0.06 0.06
## DOM2 0.04 0.08 0.08 0.08
stratif = inp1$strata 
errors = cv 
des_file = inp2$des_file 
psu_file = inp2$psu_file 
rho = inp1$rho 
effst = inp1$effst

alloc <- beat.2st(stratif, 
                  errors, 
                  des_file, 
                  psu_file, 
                  rho, 
                  deft_start = NULL, 
                  effst,
                  epsilon1 = 5, 
                  mmdiff_deft = 1,
                  maxi = 15, 
                  epsilon = 10^(-11),
                  minnumstrat = 2,
                  maxiter = 200,
                  maxiter1 = 25)
##   iteraction PSU_SR PSU NSR PSU Total  SSU
## 1          0      0       0         0 6534
## 2          1     38      62       100 8990
## 3          2     22     124       146 9447
## 4          3     23     125       148 9424

This is the sensitivity of the solution:

alloc$sensitivity
##    Type Dom  V1 V2 V3   V4
##  1 DOM1   1   1  0  1    1
##  5 DOM2   1   1  0  3 1641
##  9 DOM2   2  25  1 14   68
## 13 DOM2   3 128  1  8    1

i.e., for each domain value and for each variable, the table reports the reduction in sample size that would be gained if the corresponding precision constraint were relaxed by 10%.

These are the expected values of the coefficients of variation:

alloc$expected
##    Type Dom     V1     V2     V3     V4
##  1 DOM1   1 0.0188 0.0159 0.0465 0.0444
##  5 DOM2   1 0.0230 0.0202 0.0770 0.0800
##  9 DOM2   2 0.0399 0.0306 0.0788 0.0798
## 13 DOM2   3 0.0400 0.0402 0.0792 0.0605
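A quick way to confirm that the allocation satisfies the precision constraints is to compare each expected CV with its target. The sketch below rebuilds both tables from the values printed above; the object names (target, expected, limits) are illustrative, and the target columns are renamed V1 to V4 only to line up with the expected CVs.

```r
# Target CVs by domain type (same values as the 'cv' dataframe above)
target <- data.frame(Type = c("DOM1", "DOM2"),
                     V1 = c(0.03, 0.04), V2 = c(0.06, 0.08),
                     V3 = c(0.06, 0.08), V4 = c(0.06, 0.08))

# Expected CVs, copied from alloc$expected as printed above
expected <- data.frame(Type = c("DOM1", "DOM2", "DOM2", "DOM2"),
                       Dom  = c(1, 1, 2, 3),
                       V1 = c(0.0188, 0.0230, 0.0399, 0.0400),
                       V2 = c(0.0159, 0.0202, 0.0306, 0.0402),
                       V3 = c(0.0465, 0.0770, 0.0788, 0.0792),
                       V4 = c(0.0444, 0.0800, 0.0798, 0.0605))

vars   <- c("V1", "V2", "V3", "V4")
# Attach to every expected row the targets of its domain type
limits <- target[match(expected$Type, target$Type), vars]

# TRUE where the expected CV does not exceed its target (small tolerance)
ok <- as.matrix(expected[, vars]) <= as.matrix(limits) + 1e-9
print(all(ok))
```

In this output the expected CVs that sit at, or very close to, their limit (V4 in region 1 and V1 in region 3 of DOM2) are also the ones with the largest values in the sensitivity table.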

5 Selection of PSUs

Use the function ‘StratSel’ of the ‘FS4’ package to select the PSUs in the strata:

# Drop the last row of the allocation table (the totals) before the selection
allocat <- alloc$alloc[-nrow(alloc$alloc),]
sample_2st <- StratSel(dataPop= inp2$psu_file,
                       idpsu= ~ PSU_ID, 
                       dom= ~ STRATUM, 
                       final_pop= ~ PSU_MOS, 
                       size= ~ PSU_MOS, 
                       PSUsamplestratum= 1, 
                       min_sample= minimum, 
                       min_sample_index= FALSE, 
                       dataAll=allocat,
                       domAll= ~ factor(STRATUM), 
                       f_sample= ~ ALLOC, 
                       planned_min_sample= NULL, 
                       launch= FALSE)

This is the overall sample design:

sample_2st[[2]]
## Domain SRdom nSRdom SRdom+nSRdom SR_PSU_final_sample_unit NSR_PSU_final_sample_unit
##   1000     2      0            2                      161                         0
##   2000     0      3            3                        0                       169
##   3000     0      1            1                        0                        47
##   4000     0      1            1                        0                         4
##   5000     2      0            2                       95                         0
##   6000     0      1            1                        0                        42
##   7000     0      1            1                        0                         8
##   8000     0      1            1                        0                         7
##   9000     1      0            1                      967                         0
##  10000     6      0            6                      765                         0
##  11000    15     20           35                      690                      1074
##  12000     0      3            3                        0                       154
##  13000     1      0            1                       11                         0
##  14000     4      0            4                      726                         0
##  15000     8     16           24                      367                       845
##  16000    14     38           52                      585                      2033
##  17000     1      0            1                       50                         0
##  18000     0      2            2                        0                        80
##  19000     0      6            6                        0                       319
##  20000     0      1            1                        0                        47
##  21000     1      0            1                       52                         0
##  22000     0      1            1                        0                        46
##  23000     0      2            2                        0                        77
##  24000     0      1            1                        0                         5
##  Total    55     98          153                     4469                      4957
##   Mean                                                186                       207
des <- sample_2st[[2]]                 # note: this reuses (and overwrites) the name 'des'
des <- des[1:(nrow(des)-1),]           # drop the last row ("Mean")
# Stratum labels for the bars; the last bar refers to the "Total" row
strat <- c(as.character(as.numeric(des$Domain[1:(nrow(des)-1)])),"Tot")

# PSUs allocated in each stratum, self representative vs non self representative
barplot(t(des[1:(nrow(des)),2:3]), names.arg=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated PSUs by domain")

# SSUs allocated in each stratum, by type of PSU
barplot(t(des[1:(nrow(des)),5:6]), names.arg=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, 
        cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated SSUs by domain")

and these are the selected PSUs:

selected_PSU <- sample_2st[[4]]
selected_PSU <- selected_PSU[selected_PSU$PSU_final_sample_unit > 0,]
write.table(sample_2st[[4]],"Selected_PSUs.csv",sep=";",row.names=FALSE,quote=FALSE)
head(selected_PSU)
##      Sampled_PSU       Pik Size_Stratum STRATUM PSU_ID PSU_MOS PSU_MOS.1 ALLOC threshold final_populationdom sampling_fraction SR SizeSR stratum N_PSU_Stratum PSU_final_sample_unit nSR
##    1           1 1.0000000       146162    1000    330  146162    146162   161  61182.30              197007         0.0008172  1 146162   10001             1                   119   0
##    2           1 1.0000000        50845    1000    309   50845     50845   161  61182.30              197007         0.0008172  1      0   10002             1                    42   0
## 1100           1 0.3629408        99727    2000    304   36195     36195   169  77353.85              261456         0.0006464  0      0   20001             3                    64   1
##    7           1 0.2126547        80318    2000    318   17080     17080   169  77353.85              261456         0.0006464  0      0   20002             4                    52   1
##   12           1 0.1648303        81411    2000    317   13419     13419   169  77353.85              261456         0.0006464  0      0   20003             6                    53   1
##   24           1 0.0396847       115813    3000    295    4596      4596    47 123205.32              115813         0.0004058  0      0   30001            26                    47   1
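Some columns of this table can be read more easily by recomputing them from the others. The sketch below copies the six rows printed above into a small data frame (the names sel, stratum_pop and n_ssu are illustrative) and checks three relations that the numbers appear to follow: the sampling fraction is ALLOC divided by the stratum population, the threshold is the minimum number of SSUs per PSU divided by the sampling fraction, and Pik is PSU_MOS divided by Size_Stratum. These relations are inferred from the output and are not taken from the FS4 documentation.

```r
# Values copied from the rows printed above (strata 1000, 2000 and 3000)
sel <- data.frame(
  PSU_MOS      = c(146162, 50845, 36195, 17080, 13419, 4596),
  Size_Stratum = c(146162, 50845, 99727, 80318, 81411, 115813),
  ALLOC        = c(161, 161, 169, 169, 169, 47),
  stratum_pop  = c(197007, 197007, 261456, 261456, 261456, 115813),  # final_populationdom
  Pik          = c(1, 1, 0.3629408, 0.2126547, 0.1648303, 0.0396847),
  n_ssu        = c(119, 42, 64, 52, 53, 47)                          # PSU_final_sample_unit
)
minimum <- 50   # minimum number of SSUs per PSU, as set earlier

f         <- sel$ALLOC / sel$stratum_pop     # compare with 'sampling_fraction'
threshold <- minimum / f                     # compare with 'threshold'
pik_check <- sel$PSU_MOS / sel$Size_Stratum  # compare with 'Pik'

# Total inclusion probability of an SSU: first stage times second stage
pi_total <- sel$Pik * sel$n_ssu / sel$PSU_MOS

print(round(cbind(f, threshold, pik_check, pi_total), 7))
```

In these rows the total inclusion probability stays within about 1% of the stratum sampling fraction, which is what the self-weighting property requires.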