This vignette describes a generalized procedure that makes use of methods implemented in two R packages developed at the Italian National Institute of Statistics (Istat), R2BEAT and FS4.
The first package determines the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs), while the second one selects the PSUs in such a way that the final sample of SSUs is of the self-weighting type, i.e. the total inclusion probabilities (resulting from the product of the inclusion probabilities of the PSUs and those of the SSUs) are nearly equal for all SSUs.
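To make the self-weighting property concrete, here is a minimal base R sketch on made-up numbers (one stratum, five PSUs). It assumes PSUs are selected with probability proportional to size and a fixed number of SSUs is drawn in each selected PSU; the names `psu_size`, `m` and `n_bar` are purely illustrative and do not belong to either package.

```r
# Toy illustration (all numbers are made up): one stratum with 5 PSUs
psu_size <- c(1200, 800, 500, 300, 200)  # number of SSUs in each PSU (measure of size)
m        <- 2                            # PSUs to be selected in the stratum
n_bar    <- 50                           # SSUs to be selected in each sampled PSU

# First stage: inclusion probability proportional to the PSU size
pi_psu <- m * psu_size / sum(psu_size)

# Second stage: n_bar SSUs drawn with equal probability inside a selected PSU
pi_ssu_given_psu <- n_bar / psu_size

# Total inclusion probability of an SSU: product of the two stages
pi_total <- pi_psu * pi_ssu_given_psu
pi_total
```

Under these assumptions the PSU size cancels out in the product, so every SSU in the stratum has the same total inclusion probability `m * n_bar / sum(psu_size)`.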
This general flow assumes that at least one previous round of the survey whose sampling design has to be optimized is available. It consists of the following steps:
Define the sample design externally and, possibly, carry out the calibration step, using the R package ReGenesees (also developed at Istat), and make the design object and the calibrated object available. Moreover, check for the presence of lonely strata.
This is the ‘design’ object:
## Stratified 2 - Stage Cluster Sampling Design (with replacement)
## - [49] strata (collapsed)
## - [789, 2236] clusters
##
## Call:
## e.svydesign(sample_2st, ids = ~municipality + id_hh, strata = ~stratum_sub,
## weights = ~d, self.rep.str = ~SR, check.data = TRUE)
and this is the calibrated object:
## Calibrated, Stratified 2 - Stage Cluster Sampling Design (with replacement)
## - [49] strata (collapsed)
## - [789, 2236] clusters
##
## Call:
## e.calibrate(design = des, df.population = pop, calmodel = ~clage:sex -
## 1, partition = ~region, calfun = "logit", bounds = c(0.7,
## 1.7), aggregate.stage = 2, force = FALSE)
It is advisable to check for the presence of lonely strata:
## # No lonely PSUs found!
If any are found, collapse the strata and redo the calibration.
In this example, the ReGenesees objects contain the following variables:
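The idea behind this check can be sketched in base R on a toy sample: a stratum is "lonely" when only one PSU has been sampled in it. This is only an illustration on hypothetical data (the data frame `smp` is made up) and it is not the ReGenesees implementation.

```r
# Toy sample (hypothetical data): one row per sampled SSU
smp <- data.frame(
  stratum      = c("A", "A", "A", "B", "B", "C", "C", "C"),
  municipality = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# Number of distinct PSUs sampled in each stratum
psu_per_stratum <- tapply(smp$municipality, smp$stratum,
                          function(x) length(unique(x)))

# Strata with a single sampled PSU are "lonely"
lonely <- names(psu_per_stratum)[psu_per_stratum < 2]
lonely
```

In this toy sample only stratum "B" is lonely, so it would be a candidate for collapsing with another stratum.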
## 'data.frame': 2244 obs. of 15 variables:
## $ region : Factor w/ 3 levels "north","center",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ municipality : num 8 8 8 8 8 8 8 8 8 8 ...
## $ stratum : Factor w/ 24 levels "1000","2000",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ stratum_sub : Factor w/ 81 levels "100001","100002",..: 81 81 81 81 81 81 81 81 81 81 ...
## $ SR : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ id_hh : Factor w/ 2236 levels "H100070","H100410",..: 69 43 64 49 367 27 372 373 374 368 ...
## $ sex : Factor w/ 2 levels "1","2": 1 1 2 2 1 2 1 2 1 1 ...
## $ clage : Factor w/ 5 levels "cl0_17","cl18_34",..: 3 1 2 1 5 2 2 2 3 1 ...
## $ income_hh : num 43741 23284 23450 22171 19904 ...
## $ work : num 1 1 1 2 0 1 1 1 1 2 ...
## $ unemployed : num 0 0 0 0 1 0 0 0 0 0 ...
## $ d : num 1238 1238 1238 1238 1238 ...
## $ progr_str : num 1 1 1 1 1 1 1 1 1 1 ...
## $ var.PSU : chr "8.H12425" "8.H10738" "8.H12157" "8.H11208" ...
## $ stratum_sub.collapsed: Factor w/ 49 levels "0.center.clps.1",..: 49 49 49 49 49 49 49 49 49 49 ...
where there are three potential target variables, ‘income_hh’, ‘work’ and ‘unemployed’, whose distributions are reported below in this order:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 11463 18516 21661 26763 532331
##
## 0 1 2
## 306 1487 451
##
## 0 1
## 1938 306
Great attention must be paid to the nature of the target variables, especially those of the ‘factor’ type. The procedure illustrated here is suitable only when categorical variables are binary with values 0 and 1, under the assumption that we want to estimate the proportion of ‘1’ in the population. If a factor variable is of a different nature, an error message is printed.
Therefore, we have to handle the ‘work’ variable in this way: as the values 0, 1 and 2 indicate respectively non labour force, active and inactive people, we can derive two binary variables from ‘work’, namely ‘active’ and ‘inactive’:
des <- des.addvars(des, active = factor(ifelse(work == 1, 1, 0)))
des <- des.addvars(des, inactive = factor(ifelse(work == 2, 1, 0)))
cal <- des.addvars(cal, active = factor(ifelse(work == 1, 1, 0)))
cal <- des.addvars(cal, inactive = factor(ifelse(work == 2, 1, 0)))
Now all the categorical target variables (‘active’, ‘inactive’ and ‘unemployed’, in this order) comply with the binary constraint:
##
## 0 1
## 757 1487
##
## 0 1
## 1793 451
##
## 0 1
## 1938 306
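The same recoding can be tried outside ReGenesees on a plain vector. The sketch below uses the first ten values of ‘work’ shown in the structure listing above; the helper `is_binary01` is a hypothetical function written here only to illustrate the 0/1 constraint, and is not part of any package.

```r
# First ten values of 'work' as displayed in the structure listing
work <- c(1, 1, 1, 2, 0, 1, 1, 1, 1, 2)

# Derive the two binary indicators
active   <- factor(ifelse(work == 1, 1, 0))
inactive <- factor(ifelse(work == 2, 1, 0))

# Hypothetical helper: TRUE if a variable only takes the values 0 and 1
is_binary01 <- function(x) all(as.character(x) %in% c("0", "1"))

is_binary01(work)      # the original three-level variable fails the check
is_binary01(active)    # the derived indicators pass it
is_binary01(inactive)
table(active)
table(inactive)
```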
Using the ReGenesees objects as input, produce the following dataframes (function ‘input_to_beat.2st_1’): ‘strata’, ‘deff’, ‘effst’ and ‘rho’.
Actually, the ‘deff’ dataframe is not used in the following steps; it is kept for documentation purposes only.
Here is how we can produce the above items:
RGdes <- des # ReGenesees design object
RGcal <- cal # ReGenesees calibrated object
strata_vars <- c("stratum") # variables of stratification
target_vars <- c("income_hh",
"active",
"inactive",
"unemployed") # target variables
deff_vars <- "stratum" # stratification variables to be used when calculating deff and effst
# (n.b: must coincide or be a subset of variables of stratification)
id_PSU <- c("municipality") # identification variable of PSUs
id_SSU <- c("id_hh") # identification variable of SSUs
domain_vars <- c("region") # domain variables
inp1 <- input_to_beat.2st_1(RGdes,
RGcal,
id_PSU,
id_SSU,
strata_vars,
target_vars,
deff_vars,
domain_vars)
and these are the results (first rows of the ‘strata’, ‘deff’, ‘effst’ and ‘rho’ dataframes, in this order):
stratum | STRATUM | N | M1 | M2 | M3 | M4 | S1 | S2 | S3 | S4 | COST | CENS | DOM1 | DOM2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | 1000 | 197451 | 22266.58 | 0.6404431 | 0.2323140 | 0.1272429 | 14554.88 | 0.4798705 | 0.4223082 | 0.3332449 | 1 | 0 | 1 | center |
10000 | 10000 | 106106 | 27985.40 | 0.7679285 | 0.2114187 | 0.0206528 | 24367.97 | 0.4221544 | 0.4083146 | 0.1422189 | 1 | 0 | 1 | north |
11000 | 11000 | 202700 | 29173.85 | 0.8029080 | 0.1730880 | 0.0240040 | 39232.92 | 0.3978024 | 0.3783234 | 0.1530613 | 1 | 0 | 1 | north |
12000 | 12000 | 57420 | 26937.42 | 0.7764955 | 0.2075926 | 0.0159119 | 15743.78 | 0.4165936 | 0.4055834 | 0.1251347 | 1 | 0 | 1 | north |
13000 | 13000 | 103089 | 26357.25 | 0.7185271 | 0.2814729 | 0.0000000 | 14592.50 | 0.4497176 | 0.4497176 | 0.0000000 | 1 | 0 | 1 | north |
14000 | 14000 | 84653 | 20538.42 | 0.7518236 | 0.2131042 | 0.0350721 | 14285.81 | 0.4319547 | 0.4095007 | 0.1839621 | 1 | 0 | 1 | north |
stratum | STRATUM | DEFF1 | DEFF2 | DEFF3 | DEFF4 | b_nar |
---|---|---|---|---|---|---|
1000 | 1000 | 0.951705 | 0.991140 | 1.006731 | 0.954024 | 56.50000 |
10000 | 10000 | 0.856598 | 1.687606 | 1.404308 | 0.819854 | 26.75000 |
11000 | 11000 | 1.811807 | 1.261816 | 1.346654 | 1.339464 | 23.77778 |
12000 | 12000 | 1.086363 | 0.502458 | 0.483954 | 0.700691 | 21.00000 |
13000 | 13000 | 1.000924 | 1.000924 | 1.000924 | 1.000000 | 95.00000 |
14000 | 14000 | 0.633543 | 0.856820 | 0.845580 | 0.677276 | 33.66667 |
stratum | STRATUM | EFFST1 | EFFST2 | EFFST3 | EFFST4 |
---|---|---|---|---|---|
1000 | 1000 | 0.9689494 | 1 | 1 | 0.9420958 |
10000 | 10000 | 0.9500011 | 1 | 1 | 1.1915475 |
11000 | 11000 | 0.9544521 | 1 | 1 | 1.0546196 |
12000 | 12000 | 1.0429461 | 1 | 1 | 0.9732493 |
13000 | 13000 | 0.9914219 | 1 | 1 | 1.0000000 |
14000 | 14000 | 0.9829167 | 1 | 1 | 1.0974521 |
STRATUM | RHO_AR1 | RHO_NAR1 | RHO_AR2 | RHO_NAR2 | RHO_AR3 | RHO_NAR3 | RHO_AR4 | RHO_NAR4 |
---|---|---|---|---|---|---|---|---|
1000 | 1 | -0.0008702 | 1 | -0.0001596 | 1 | 0.0001213 | 1 | -0.0008284 |
10000 | 1 | -0.0055690 | 1 | 0.0267031 | 1 | 0.0157013 | 1 | -0.0069960 |
11000 | 1 | 0.0356403 | 1 | 0.0114944 | 1 | 0.0152190 | 1 | 0.0149033 |
12000 | 1 | 0.0043181 | 1 | -0.0248771 | 1 | -0.0258023 | 1 | -0.0149655 |
13000 | 1 | 0.0000098 | 1 | 0.0000098 | 1 | 0.0000098 | 1 | 0.0000000 |
14000 | 1 | -0.0112181 | 1 | -0.0043831 | 1 | -0.0047271 | 1 | -0.0098793 |
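The values in the tables above appear to be consistent with the usual relation between the design effect, the average number of SSUs per PSU and the intra-class correlation, deff = 1 + rho (b - 1). The following base R sketch checks this for the first target variable, using only the figures printed above; whether the function computes rho exactly in this way is an assumption.

```r
# Values copied from the 'deff' and 'rho' tables above (first target variable)
chk <- data.frame(
  STRATUM  = c(1000, 10000, 11000, 12000, 13000, 14000),
  DEFF1    = c(0.951705, 0.856598, 1.811807, 1.086363, 1.000924, 0.633543),
  b_nar    = c(56.5, 26.75, 23.77778, 21, 95, 33.66667),
  RHO_NAR1 = c(-0.0008702, -0.0055690, 0.0356403, 0.0043181, 0.0000098, -0.0112181)
)

# Intra-class correlation implied by deff = 1 + rho * (b - 1)
chk$rho_check <- (chk$DEFF1 - 1) / (chk$b_nar - 1)

# Differences from the printed RHO_NAR1 are at rounding level
chk$diff <- chk$rho_check - chk$RHO_NAR1
chk
```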
Prepare the inputs related to the PSUs (function ‘input_to_beat.2st_2’), starting from a dataframe (‘psu’) with one row per PSU, like the following:
municipality | stratum | ind | hh |
---|---|---|---|
1 | 12000 | 1546 | 609 |
2 | 12000 | 936 | 402 |
3 | 12000 | 367 | 178 |
4 | 10000 | 13032 | 5788 |
5 | 12000 | 678 | 281 |
6 | 11000 | 3193 | 1194 |
psu_id="municipality" # Identifier of the PSU
stratum_var="stratum" # Identifier of the stratum
mos_var="ind" # Variable to be used as 'measure of size'
delta=1 # Average number of SSUs for each selection unit
minimum <- 50 # Minimum number of SSUs to be selected in each PSU
inp2 <- input_to_beat.2st_2(psu,
psu_id,
stratum_var,
mos_var,
delta,
minimum)
head(inp2$psu_file)
PSU_ID | STRATUM | PSU_MOS |
---|---|---|
1 | 12000 | 1546 |
2 | 12000 | 936 |
3 | 12000 | 367 |
4 | 10000 | 13032 |
5 | 12000 | 678 |
6 | 11000 | 3193 |
head(inp2$des_file)
STRATUM | STRAT_MOS | DELTA | MINIMUM |
---|---|---|---|
1000 | 197007 | 1 | 50 |
2000 | 261456 | 1 | 50 |
3000 | 115813 | 1 | 50 |
4000 | 17241 | 1 | 50 |
5000 | 101067 | 1 | 50 |
6000 | 47218 | 1 | 50 |
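The second table above appears to contain, for each stratum, the total measure of size of its PSUs together with the ‘delta’ and ‘minimum’ parameters. A base R sketch of this aggregation is given below; it is an assumption about what the function does, and it uses only the six PSUs displayed above, so the totals differ from the real ones.

```r
# The six PSUs displayed above
psu <- data.frame(
  municipality = 1:6,
  stratum      = c(12000, 12000, 12000, 10000, 12000, 11000),
  ind          = c(1546, 936, 367, 13032, 678, 3193)
)

# Total measure of size by stratum
strat_mos <- aggregate(ind ~ stratum, data = psu, FUN = sum)
names(strat_mos) <- c("STRATUM", "STRAT_MOS")

# Same parameters used in the call above
strat_mos$DELTA   <- 1
strat_mos$MINIMUM <- 50
strat_mos
```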
Using the function ‘beat.2st’ in the ‘R2BEAT’ package, optimize the allocation of PSUs and SSUs in the strata:
cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
CV1=c(0.03,0.04),
CV2=c(0.06,0.08),
CV3=c(0.06,0.08),
CV4=c(0.06,0.08)))
cv
DOM | CV1 | CV2 | CV3 | CV4 |
---|---|---|---|---|
DOM1 | 0.03 | 0.06 | 0.06 | 0.06 |
DOM2 | 0.04 | 0.08 | 0.08 | 0.08 |
stratif = inp1$strata
errors = cv
des_file = inp2$des_file
psu_file = inp2$psu_file
rho = inp1$rho
effst = inp1$effst
alloc <- beat.2st(stratif,
errors,
des_file,
psu_file,
rho,
deft_start = NULL,
effst,
epsilon1 = 5,
mmdiff_deft = 1,
maxi = 15,
epsilon = 10^(-11),
minnumstrat = 2,
maxiter = 200,
maxiter1 = 25)
## iteraction PSU_SR PSU NSR PSU Total SSU
## 1 0 0 0 0 6534
## 2 1 38 62 100 8990
## 3 2 22 124 146 9447
## 4 3 23 125 148 9424
This is the sensitivity of the solution:
Row | Type | Dom | V1 | V2 | V3 | V4 |
---|---|---|---|---|---|---|
1 | DOM1 | 1 | 1 | 0 | 1 | 1 |
5 | DOM2 | 1 | 1 | 0 | 3 | 1641 |
9 | DOM2 | 2 | 25 | 1 | 14 | 68 |
13 | DOM2 | 3 | 128 | 1 | 8 | 1 |
i.e., for each domain value and for each variable, the table reports the gain in terms of reduction in the sample size if the corresponding precision constraint is relaxed by 10%.
These are the expected values of the coefficients of variation:
Row | Type | Dom | V1 | V2 | V3 | V4 |
---|---|---|---|---|---|---|
1 | DOM1 | 1 | 0.0188 | 0.0159 | 0.0465 | 0.0444 |
5 | DOM2 | 1 | 0.0230 | 0.0202 | 0.0770 | 0.0800 |
9 | DOM2 | 2 | 0.0399 | 0.0306 | 0.0788 | 0.0798 |
13 | DOM2 | 3 | 0.0400 | 0.0402 | 0.0792 | 0.0605 |
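As a quick check, the expected CVs can be compared with the precision constraints set in the ‘cv’ dataframe. The sketch below copies the values printed above into two small data frames (the column names `level` and `CV1`...`CV4` on the expected side are chosen here for convenience) and verifies that no constraint is exceeded.

```r
# Precision constraints, as defined in the 'cv' dataframe
target <- data.frame(DOM = c("DOM1", "DOM2"),
                     CV1 = c(0.03, 0.04), CV2 = c(0.06, 0.08),
                     CV3 = c(0.06, 0.08), CV4 = c(0.06, 0.08))

# Expected CVs copied from the table above (one row per domain value)
expected <- data.frame(DOM   = c("DOM1", "DOM2", "DOM2", "DOM2"),
                       level = c(1, 1, 2, 3),
                       CV1 = c(0.0188, 0.0230, 0.0399, 0.0400),
                       CV2 = c(0.0159, 0.0202, 0.0306, 0.0402),
                       CV3 = c(0.0465, 0.0770, 0.0788, 0.0792),
                       CV4 = c(0.0444, 0.0800, 0.0798, 0.0605))

cv_cols <- paste0("CV", 1:4)

# Constraint that applies to each row of 'expected'
limits <- as.matrix(target[match(expected$DOM, target$DOM), cv_cols])

# TRUE where the expected CV does not exceed its constraint
# (a tiny tolerance avoids spurious failures on exact ties)
ok <- as.matrix(expected[, cv_cols]) <= limits + 1e-9
all(ok)
```

Several expected CVs sit exactly on, or very close to, their constraint (for instance 0.0400 and 0.0800 in DOM2), which is what one would expect from an allocation that minimizes the sample size.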
Using the function ‘StratSel’ in the ‘FS4’ package, select the PSUs in the strata:
allocat <- alloc$alloc[-nrow(alloc$alloc),]
sample_2st <- StratSel(dataPop= inp2$psu_file,
idpsu= ~ PSU_ID,
dom= ~ STRATUM,
final_pop= ~ PSU_MOS,
size= ~ PSU_MOS,
PSUsamplestratum= 1,
min_sample= minimum,
min_sample_index= FALSE,
dataAll=allocat,
domAll= ~ factor(STRATUM),
f_sample= ~ ALLOC,
planned_min_sample= NULL,
launch= FALSE)
This is the overall sample design:
Domain | SRdom | nSRdom | SRdom+nSRdom | SR_PSU_final_sample_unit | NSR_PSU_final_sample_unit |
---|---|---|---|---|---|
1000 | 2 | 0 | 2 | 161 | 0 |
2000 | 0 | 3 | 3 | 0 | 169 |
3000 | 0 | 1 | 1 | 0 | 47 |
4000 | 0 | 1 | 1 | 0 | 4 |
5000 | 2 | 0 | 2 | 95 | 0 |
6000 | 0 | 1 | 1 | 0 | 42 |
7000 | 0 | 1 | 1 | 0 | 8 |
8000 | 0 | 1 | 1 | 0 | 7 |
9000 | 1 | 0 | 1 | 967 | 0 |
10000 | 6 | 0 | 6 | 765 | 0 |
11000 | 15 | 20 | 35 | 690 | 1074 |
12000 | 0 | 3 | 3 | 0 | 154 |
13000 | 1 | 0 | 1 | 11 | 0 |
14000 | 4 | 0 | 4 | 726 | 0 |
15000 | 8 | 16 | 24 | 367 | 845 |
16000 | 14 | 38 | 52 | 585 | 2033 |
17000 | 1 | 0 | 1 | 50 | 0 |
18000 | 0 | 2 | 2 | 0 | 80 |
19000 | 0 | 6 | 6 | 0 | 319 |
20000 | 0 | 1 | 1 | 0 | 47 |
21000 | 1 | 0 | 1 | 52 | 0 |
22000 | 0 | 1 | 1 | 0 | 46 |
23000 | 0 | 2 | 2 | 0 | 77 |
24000 | 0 | 1 | 1 | 0 | 5 |
Total | 55 | 98 | 153 | 4469 | 4957 |
Mean | | | | 186 | 207 |
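The ‘Total’ and ‘Mean’ rows for the two SSU columns can be reproduced from the 24 domain rows of the table. The sketch below simply copies those values into two vectors (the names `sr_ssu` and `nsr_ssu` are chosen here for illustration).

```r
# SSUs allocated in self-representative (SR) PSUs, domains 1000 to 24000
sr_ssu  <- c(161, 0, 0, 0, 95, 0, 0, 0, 967, 765, 690, 0,
             11, 726, 367, 585, 50, 0, 0, 0, 52, 0, 0, 0)
# SSUs allocated in non self-representative (NSR) PSUs, same order
nsr_ssu <- c(0, 169, 47, 4, 0, 42, 8, 7, 0, 0, 1074, 154,
             0, 0, 845, 2033, 0, 80, 319, 47, 0, 46, 77, 5)

# Column totals and means over the 24 domains
c(total_SR = sum(sr_ssu), total_NSR = sum(nsr_ssu))
c(mean_SR = round(mean(sr_ssu)), mean_NSR = round(mean(nsr_ssu)))

# Overall number of SSUs in the sample
sum(sr_ssu) + sum(nsr_ssu)
```

The overall number of SSUs obtained in this way (9426) is close to the 9424 reported at the last iteration of the allocation step.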
des <- sample_2st[[2]]
des <- des[1:(nrow(des)-1),]
strat <- c(as.character(as.numeric(des$Domain[1:(nrow(des)-1)])),"Tot")
barplot(t(des[1:(nrow(des)),2:3]), names=strat,
col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, cex.names=0.7)
legend("topleft",
legend = c("Self Representative","Non Self Representative"),
fill = c("darkblue", "red"))
title("Distribution of allocated PSUs by domain")
barplot(t(des[1:(nrow(des)),5:6]), names=strat,
col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7,
cex.names=0.7)
legend("topleft",
legend = c("Self Representative","Non Self Representative"),
fill = c("darkblue", "red"))
title("Distribution of allocated SSUs by domain")
and these are the selected PSUs:
selected_PSU <- sample_2st[[4]]
selected_PSU <- selected_PSU[selected_PSU$PSU_final_sample_unit > 0,]
write.table(sample_2st[[4]],"Selected_PSUs.csv",sep=";",row.names=FALSE,quote=FALSE)
head(selected_PSU)
Row | Sampled_PSU | Pik | Size_Stratum | STRATUM | PSU_ID | PSU_MOS | PSU_MOS.1 | ALLOC | threshold | final_populationdom | sampling_fraction | SR | SizeSR | stratum | N_PSU_Stratum | PSU_final_sample_unit | nSR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1.0000000 | 146162 | 1000 | 330 | 146162 | 146162 | 161 | 61182.30 | 197007 | 0.0008172 | 1 | 146162 | 10001 | 1 | 119 | 0 |
2 | 1 | 1.0000000 | 50845 | 1000 | 309 | 50845 | 50845 | 161 | 61182.30 | 197007 | 0.0008172 | 1 | 0 | 10002 | 1 | 42 | 0 |
1100 | 1 | 0.3629408 | 99727 | 2000 | 304 | 36195 | 36195 | 169 | 77353.85 | 261456 | 0.0006464 | 0 | 0 | 20001 | 3 | 64 | 1 |
7 | 1 | 0.2126547 | 80318 | 2000 | 318 | 17080 | 17080 | 169 | 77353.85 | 261456 | 0.0006464 | 0 | 0 | 20002 | 4 | 52 | 1 |
12 | 1 | 0.1648303 | 81411 | 2000 | 317 | 13419 | 13419 | 169 | 77353.85 | 261456 | 0.0006464 | 0 | 0 | 20003 | 6 | 53 | 1 |
24 | 1 | 0.0396847 | 115813 | 3000 | 295 | 4596 | 4596 | 47 | 123205.32 | 115813 | 0.0004058 | 0 | 0 | 30001 | 26 | 47 | 1 |
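The rows above allow a check of the self-weighting property on the actual selection. In the sketch below, a few columns of the six displayed PSUs are copied into a data frame, and the total inclusion probability of an SSU is computed as the product of ‘Pik’ and the within-PSU sampling rate. Reading the within-PSU rate as `PSU_final_sample_unit / PSU_MOS` is an interpretation of the output, not something stated by the package.

```r
# Selected columns of the six PSUs displayed above
sel <- data.frame(
  STRATUM               = c(1000, 1000, 2000, 2000, 2000, 3000),
  Pik                   = c(1, 1, 0.3629408, 0.2126547, 0.1648303, 0.0396847),
  Size_Stratum          = c(146162, 50845, 99727, 80318, 81411, 115813),
  PSU_MOS               = c(146162, 50845, 36195, 17080, 13419, 4596),
  ALLOC                 = c(161, 161, 169, 169, 169, 47),
  final_populationdom   = c(197007, 197007, 261456, 261456, 261456, 115813),
  PSU_final_sample_unit = c(119, 42, 64, 52, 53, 47)
)

# Planned sampling fraction in the domain (matches 'sampling_fraction' above)
f <- sel$ALLOC / sel$final_populationdom

# First-stage probability implied by the sizes (matches 'Pik' above)
pik_check <- sel$PSU_MOS / sel$Size_Stratum

# Total inclusion probability: first stage times within-PSU sampling rate
pi_total <- sel$Pik * sel$PSU_final_sample_unit / sel$PSU_MOS

round(cbind(f, pi_total, ratio = pi_total / f), 7)
```

For these six PSUs the total inclusion probability stays within about 1% of the planned sampling fraction of the domain, the small deviations being due to the rounding of the number of SSUs to an integer.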