This vignette describes a generalized procedure that makes use of the methods implemented in R2BEAT (“Multistage Sampling Allocation and PSU selection”), an R package developed at the Italian National Institute of Statistics (Istat).

This package makes it possible to determine the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs), and also to select the PSUs in such a way that the final sample of SSUs is self-weighting, i.e. the total inclusion probabilities (resulting from the product of the inclusion probabilities of the PSUs and those of the SSUs) are nearly equal for all SSUs, or at least have minimum variability.
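
To make the idea of a self-weighting sample concrete, here is a minimal sketch in base R with made-up PSU sizes. It does not call any R2BEAT function. It assumes that PSUs are selected with probability proportional to their size and that the same number of SSUs is drawn in each selected PSU; the names `psu_size`, `n_psu` and `m` are purely illustrative.

```r
# Toy stratum: number of SSUs in each of five PSUs (illustrative values only)
psu_size <- c(1200, 800, 500, 300, 200)
n_psu <- 2    # PSUs to be selected in the stratum (assumption)
m     <- 50   # SSUs to be selected in each sampled PSU (assumption)

# First stage: inclusion probability proportional to PSU size
pi_psu <- n_psu * psu_size / sum(psu_size)
# Second stage: simple random sample of m SSUs inside a selected PSU
pi_ssu <- m / psu_size
# Total inclusion probability of an SSU: product of the two stages
pi_tot <- pi_psu * pi_ssu

stopifnot(all(pi_psu <= 1))   # no PSU is large enough to be taken with certainty
print(pi_tot)                 # the same value for every PSU
```

Under these assumptions the PSU size cancels out in the product, so every SSU has the same total inclusion probability `n_psu * m / sum(psu_size)`, whichever PSU it belongs to.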

This general flow assumes that a sampling frame is available, containing, among others, the following variables:

  • identifier of the Primary Sampling Units;
  • identifier of the Secondary Sampling Units;
  • variables identifying the sampling strata;
  • target variables, i.e. the variables from which sampling estimates will be produced.

As for the target variables, they are of course not directly available in the frame: instead, the sampling frame will contain proxy variables, or the same variables with predicted values.

Having this sampling frame, the workflow is based on the following steps:

  1. Loading data and pre-processing
  2. Producing the inputs for next steps
  3. Fine tuning of the parameters
  4. Optimal allocation of SSUs in PSUs
  5. Selection of PSUs
  6. Selection of SSUs

1 Loading data and pre-processing

We make use of a synthetic population data frame (pop), which is available at the following link:

https://github.com/barcaroli/R2BEAT/tree/master/data

load("pop.RData")   
str(pop)
## 'data.frame':    2258507 obs. of  21 variables:
##  $ region       : Factor w/ 3 levels "north","center",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ province     : Factor w/ 6 levels "north_1","north_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ municipality : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pop_m        : num  1546 1546 1546 1546 1546 ...
##  $ type_m       : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ id_hh        : Factor w/ 963018 levels "H1","H10","H100",..: 1 1 1 2 3 3 3 3 1114 1114 ...
##  $ id_ind       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex          : int  1 2 1 2 1 1 2 2 1 1 ...
##  $ age          : int  33 69 81 46 38 64 63 35 37 6 ...
##  $ edu          : int  5 3 3 4 5 3 4 12 5 2 ...
##  $ marital      : num  1 2 2 1 1 2 2 2 1 1 ...
##  $ foreigner    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ earner       : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ income_hh    : num  30488 30488 30488 21756 29871 ...
##  $ work         : num  1 1 2 1 1 1 1 1 1 2 ...
##  $ unemployed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ one          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ stratum      : Factor w/ 24 levels "1000","2000",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ stratum_n    : num  12000 12000 12000 12000 12000 12000 12000 12000 12000 12000 ...
##  $ stratum_label: chr  "north_1_6" "north_1_6" "north_1_6" "north_1_6" ...
##  $ prog_hh      : int  1 2 3 1 1 2 3 4 1 2 ...

In this phase we may have to derive new variables, corresponding to the parameters required by the following steps. In this case very little is needed, as almost all the variables are already available. We just have to manipulate the ‘work’ variable in order to derive two binary target variables, and add a working variable (‘one’):

pop$active <- ifelse(pop$work==1,1,0)
pop$inactive <- ifelse(pop$work==2,1,0)
pop$one <- 1

Great attention must be paid to the nature of the target variables, especially those of the ‘factor’ type. The procedure illustrated here is suitable only when categorical variables are binary with values 0 and 1, on the assumption that we want to estimate the proportion of ‘1’ in the population. If factor variables are of a different nature, an error message is printed.

This is why we handled the ‘work’ variable as shown above: its values 0, 1 and 2 indicate respectively people outside the labour force, active people and inactive people, so we derived from it the two binary variables ‘active’ and ‘inactive’.
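
Before going on, it may be useful to verify that each categorical target variable really is coded 0/1. The following is a minimal sketch: `is_binary01` is a hypothetical helper written for this example, not a function of R2BEAT, and the `work` vector is a toy one.

```r
# Hypothetical helper (not part of R2BEAT): TRUE if x contains only 0/1 values
is_binary01 <- function(x) {
  is.numeric(x) && all(x[!is.na(x)] %in% c(0, 1))
}

# Toy 'work' variable, coded 0, 1, 2 as described in the text
work <- c(1, 1, 2, 0, 1, 2)
active   <- ifelse(work == 1, 1, 0)
inactive <- ifelse(work == 2, 1, 0)

is_binary01(work)      # FALSE: three categories, not suitable as a target
is_binary01(active)    # TRUE
is_binary01(inactive)  # TRUE
mean(active)           # proportion of '1', i.e. the quantity to be estimated
```

With a 0/1 coding the mean of the variable is exactly the proportion of ‘1’, which is what the allocation step treats as the parameter of interest.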

We are now able to populate the required parameters:

samp_frame <- pop
id_PSU <- "municipality"  # only one
id_SSU <- "id_ind"        # only one
strata_var <- "stratum"   # only one
target_vars <- c("income_hh","active","inactive","unemployed")   # more than one
deff_var <- "stratum"     # only one
domain_var <- "region"    # only one
minimum <- 50     # minimum number of SSUs to be interviewed in each selected PSU
delta <- 1        # average dimension of the SSU in terms of elementary survey units
f <- 0.05         # suggestion for the sampling fraction
deff_sugg <- 1.5  # suggestion for the deff value

2 Producing the inputs for next steps

With the parameters already assigned, we can execute the ‘prepareInputToAllocation’ function:

inp <- prepareInputToAllocation(samp_frame,
                                id_PSU,
                                id_SSU,
                                strata_var,
                                target_vars,
                                deff_var,
                                domain_var,
                                minimum,
                                delta,
                                f,
                                deff_sugg)

The function ‘prepareInputToAllocation’ produces a list composed of six elements, stored in the ‘inp’ object:

  1. the ‘strata’ dataframe, containing the following information:
     • STRATUM: identifier of the single stratum
     • N: total population in terms of final sampling units
     • Mi, Si: mean and standard deviation of each target variable i (i=1,2,…,P)
     • DOMk: domain(s) to which the stratum belongs
  2. the ‘deff’ (design effect) dataframe, containing the following information:
     • STRATUM: the stratum identifier
     • DEFFi: the design effect for each target variable i (i=1,2,…,P)
  3. the ‘effst’ (estimator effect) dataframe, containing the following information:
     • STRATUM: the stratum identifier
     • EFFSTi: the estimator effect for each target variable i (i=1,2,…,P)
  4. the ‘rho’ (intraclass coefficient of correlation) dataframe, containing the following information:
     • STRATUM: the stratum identifier
     • RHO_ARi: the intraclass coefficient of correlation in self-representative PSUs for each target variable i (i=1,2,…,P)
     • RHO_NARi: the intraclass coefficient of correlation in non self-representative PSUs for each target variable i (i=1,2,…,P)
  5. the ‘des_file’ dataframe, containing the following information:
     • STRATUM: the stratum identifier
     • STRAT_MOS: measure of size of the stratum (in terms of number of contained selection units)
     • DELTA: factor that reports the average number of SSUs for each selection unit
     • MINIMUM: minimum number of units to be selected in each PSU
  6. the ‘psu_file’ dataframe, containing the following information:
     • STRATUM: the stratum identifier
     • PSU_ID: the PSU identifier
     • PSU_MOS: number of final selection units contained in a given PSU

(Actually, the ‘deff’ dataframe is not used in the following steps; it is kept for documentation purposes only.)

Let us see the content of these objects:

head(inp$strata)
##           N       M1        M2        M3        M4       S1        S2        S3        S4 COST CENS DOM1 DOM2 STRATUM
## 1000 197007 23959.87 0.6650322 0.2285807 0.1063871 22179.08 0.4719792 0.4199185 0.3083324    1    0    1    2    1000
## 2000 261456 20966.65 0.6709886 0.2297519 0.0992595 19624.65 0.4698541 0.4206732 0.2990102    1    0    1    2    2000
## 3000 115813 19814.73 0.6644591 0.2315975 0.1039434 14754.88 0.4721792 0.4218532 0.3051871    1    0    1    2    3000
## 4000  17241 18732.72 0.6273418 0.2499275 0.1227307 13462.74 0.4835122 0.4329708 0.3281278    1    0    1    2    4000
## 5000 101067 22070.31 0.6134445 0.2338845 0.1526710 17187.98 0.4869603 0.4232996 0.3596701    1    0    1    2    5000
## 6000  47218 21069.07 0.6135796 0.2348469 0.1515736 17342.74 0.4869288 0.4239031 0.3586070    1    0    1    2    6000
inp$deff
STRATUM DEFF1 DEFF2 DEFF3 DEFF4 b_nar
1 1000 1.5 1.5 1.5 1.5 4925.17500
12 2000 1.5 1.5 1.5 1.5 1005.60000
18 3000 1.5 1.5 1.5 1.5 222.71731
19 4000 1.5 1.5 1.5 1.5 47.89167
20 5000 1.5 1.5 1.5 1.5 2526.67500
21 6000 1.5 1.5 1.5 1.5 786.96667
22 7000 1.5 1.5 1.5 1.5 168.72222
23 8000 1.5 1.5 1.5 1.5 69.78421
24 9000 1.5 1.5 1.5 1.5 4641.65000
2 10000 1.5 1.5 1.5 1.5 883.58333
3 11000 1.5 1.5 1.5 1.5 174.49153
4 12000 1.5 1.5 1.5 1.5 57.65700
5 13000 1.5 1.5 1.5 1.5 5146.65000
6 14000 1.5 1.5 1.5 1.5 1049.78750
7 15000 1.5 1.5 1.5 1.5 194.15625
8 16000 1.5 1.5 1.5 1.5 44.59672
9 17000 1.5 1.5 1.5 1.5 3055.85000
10 18000 1.5 1.5 1.5 1.5 618.79167
11 19000 1.5 1.5 1.5 1.5 189.70676
13 20000 1.5 1.5 1.5 1.5 55.32091
14 21000 1.5 1.5 1.5 1.5 2757.20000
15 22000 1.5 1.5 1.5 1.5 696.51667
16 23000 1.5 1.5 1.5 1.5 240.55000
17 24000 1.5 1.5 1.5 1.5 48.19583
inp$effst
STRATUM EFFST1 EFFST2 EFFST3 EFFST4
1000 1 1 1 1
2000 1 1 1 1
3000 1 1 1 1
4000 1 1 1 1
5000 1 1 1 1
6000 1 1 1 1
7000 1 1 1 1
8000 1 1 1 1
9000 1 1 1 1
10000 1 1 1 1
11000 1 1 1 1
12000 1 1 1 1
13000 1 1 1 1
14000 1 1 1 1
15000 1 1 1 1
16000 1 1 1 1
17000 1 1 1 1
18000 1 1 1 1
19000 1 1 1 1
20000 1 1 1 1
21000 1 1 1 1
22000 1 1 1 1
23000 1 1 1 1
24000 1 1 1 1
inp$rho
STRATUM RHO_AR1 RHO_NAR1 RHO_AR2 RHO_NAR2 RHO_AR3 RHO_NAR3 RHO_AR4 RHO_NAR4
1 1000 1 0.0001015 1 0.0001015 1 0.0001015 1 0.0001015
12 2000 1 0.0004977 1 0.0004977 1 0.0004977 1 0.0004977
18 3000 1 0.0022551 1 0.0022551 1 0.0022551 1 0.0022551
19 4000 1 0.0106629 1 0.0106629 1 0.0106629 1 0.0106629
20 5000 1 0.0001980 1 0.0001980 1 0.0001980 1 0.0001980
21 6000 1 0.0006362 1 0.0006362 1 0.0006362 1 0.0006362
22 7000 1 0.0029811 1 0.0029811 1 0.0029811 1 0.0029811
23 8000 1 0.0072691 1 0.0072691 1 0.0072691 1 0.0072691
24 9000 1 0.0001077 1 0.0001077 1 0.0001077 1 0.0001077
2 10000 1 0.0005665 1 0.0005665 1 0.0005665 1 0.0005665
3 11000 1 0.0028820 1 0.0028820 1 0.0028820 1 0.0028820
4 12000 1 0.0088250 1 0.0088250 1 0.0088250 1 0.0088250
5 13000 1 0.0000972 1 0.0000972 1 0.0000972 1 0.0000972
6 14000 1 0.0004767 1 0.0004767 1 0.0004767 1 0.0004767
7 15000 1 0.0025886 1 0.0025886 1 0.0025886 1 0.0025886
8 16000 1 0.0114688 1 0.0114688 1 0.0114688 1 0.0114688
9 17000 1 0.0001637 1 0.0001637 1 0.0001637 1 0.0001637
10 18000 1 0.0008093 1 0.0008093 1 0.0008093 1 0.0008093
11 19000 1 0.0026496 1 0.0026496 1 0.0026496 1 0.0026496
13 20000 1 0.0092046 1 0.0092046 1 0.0092046 1 0.0092046
14 21000 1 0.0001814 1 0.0001814 1 0.0001814 1 0.0001814
15 22000 1 0.0007189 1 0.0007189 1 0.0007189 1 0.0007189
16 23000 1 0.0020872 1 0.0020872 1 0.0020872 1 0.0020872
17 24000 1 0.0105942 1 0.0105942 1 0.0105942 1 0.0105942
inp$des_file
STRATUM STRAT_MOS DELTA MINIMUM
1000 197007 1 50
2000 261456 1 50
3000 115813 1 50
4000 17241 1 50
5000 101067 1 50
6000 47218 1 50
7000 30370 1 50
8000 26518 1 50
9000 92833 1 50
10000 106030 1 50
11000 205900 1 50
12000 57657 1 50
13000 102933 1 50
14000 83983 1 50
15000 186390 1 50
16000 108816 1 50
17000 61117 1 50
18000 74255 1 50
19000 140383 1 50
20000 60853 1 50
21000 55144 1 50
22000 41791 1 50
23000 72165 1 50
24000 11567 1 50
head(inp$psu_file)   # the element is named 'psu_file' (lower case): inp$PSU_file does not exist and returns NULL
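
The ‘RHO_NAR’ values shown above appear to be consistent with the usual relation between design effect and intraclass correlation, deff = 1 + rho (b − 1), where b is the average number of sampled units per PSU. The following sketch checks this on the first four strata, assuming that the ‘b_nar’ column of ‘inp$deff’ plays the role of b. This is our reading of the printed output, not a statement about the internals of the package.

```r
# Values copied from the inp$deff and inp$rho output shown above
chk <- data.frame(
  STRATUM = c(1000, 2000, 3000, 4000),
  DEFF    = c(1.5, 1.5, 1.5, 1.5),
  b_nar   = c(4925.17500, 1005.60000, 222.71731, 47.89167),
  RHO_NAR = c(0.0001015, 0.0004977, 0.0022551, 0.0106629)
)

# Assumed relation: deff = 1 + rho * (b - 1)  =>  rho = (deff - 1) / (b - 1)
chk$rho_check <- (chk$DEFF - 1) / (chk$b_nar - 1)

# Side-by-side comparison with the printed values
print(cbind(chk["STRATUM"],
            RHO_NAR   = chk$RHO_NAR,
            rho_check = round(chk$rho_check, 7)))
```

If this reading is correct, the suggested ‘deff_sugg’ value is what determines the starting values of rho in the non self-representative PSUs, which is one reason why its choice deserves some care.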

It may happen that the population in strata (variable ‘N’ in the ‘inp$strata’ dataset) and the one derived from the PSU dataset (variable ‘STRAT_MOS’ in the ‘inp$des_file’ dataset) are not the same.

We can check it by applying the function ‘check_input’ in this way:

newstrata <- check_input(strata=inp$strata,
                         des=inp$des_file,
                         strata_var_strata="STRATUM",
                         strata_var_des="STRATUM")
## 
## --------------------------------------------------
##  Differences between population in strata and PSUs  
## --------------------------------------------------
##    STRATUM N_in_strata N_in_PSUs relative_difference
## 1     1000      197007    197007                   0
## 2    10000      106030    106030                   0
## 3    11000      205900    205900                   0
## 4    12000       57657     57657                   0
## 5    13000      102933    102933                   0
## 6    14000       83983     83983                   0
## 7    15000      186390    186390                   0
## 8    16000      108816    108816                   0
## 9    17000       61117     61117                   0
## 10   18000       74255     74255                   0
## 11   19000      140383    140383                   0
## 12    2000      261456    261456                   0
## 13   20000       60853     60853                   0
## 14   21000       55144     55144                   0
## 15   22000       41791     41791                   0
## 16   23000       72165     72165                   0
## 17   24000       11567     11567                   0
## 18    3000      115813    115813                   0
## 19    4000       17241     17241                   0
## 20    5000      101067    101067                   0
## 21    6000       47218     47218                   0
## 22    7000       30370     30370                   0
## 23    8000       26518     26518                   0
## 24    9000       92833     92833                   0
## 
## --------------------------------------------------
## Population of PSUs has been attributed to strata
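
The same kind of comparison can be reproduced by hand. The sketch below uses two small toy data frames with values copied from the output above, and computes the relative difference as (N − STRAT_MOS) / N. This definition is an assumption made for illustration and may differ from the one used inside ‘check_input’.

```r
# Toy versions of the two inputs (first three strata, values from the output above)
strata_toy <- data.frame(STRATUM = c("1000", "2000", "3000"),
                         N = c(197007, 261456, 115813))
des_toy <- data.frame(STRATUM = c("1000", "2000", "3000"),
                      STRAT_MOS = c(197007, 261456, 115813))

# Join the two sources by stratum and compare the population counts
cmp <- merge(strata_toy, des_toy, by = "STRATUM")
cmp$relative_difference <- (cmp$N - cmp$STRAT_MOS) / cmp$N
print(cmp)

# Strata in which the two sources disagree (none in this example)
cmp$STRATUM[cmp$relative_difference != 0]
```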

3 Fine tuning of the parameters

For the execution of the function ‘prepareInputToAllocation’ it is necessary to assign values to the different parameters. Some of them can be directly derived from the available data, but for others, namely:

  • ‘minimum’ (minimum number of SSUs to be interviewed in each selected PSU)
  • ‘f’ (suggestion for the sampling fraction)
  • ‘deff_sugg’ (suggestion for the deff value)

choosing the values is more difficult without any reference.

To guide the choice of these values, the function ‘sensitivity’ performs a sensitivity analysis for each of these parameters.

To execute this function, the name of the parameter has to be given, together with its minimum and maximum values. On the basis of these minimum and maximum values, 10 different values will be used for carrying out the allocation. The output is graphical.

This function also requires the definition of the precision constraints on the target variables:

cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.03,0.04),
                         CV2=c(0.06,0.08),
                         CV3=c(0.06,0.08),
                         CV4=c(0.06,0.08)))
cv
##  DOM  CV1  CV2  CV3  CV4
## DOM1 0.03 0.06 0.06 0.06
## DOM2 0.04 0.08 0.08 0.08

The meaning of these constraints is that, once we select a sample and produce estimates, we expect a maximum coefficient of variation for the first variable (‘income_hh’) equal to 3% at national level (‘DOM1’) and to 4% at regional level (‘DOM2’), and equal to 6% and 8% respectively for the other three variables.
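
To give a feeling for what such a constraint implies, the following sketch computes the sample size needed to reach a given CV for a mean in a single stratum, using the textbook approximation CV = sqrt(deff) · S / (sqrt(n) · M) without finite population correction. This is a deliberately simplified, single-variable and single-stratum calculation, not the multivariate allocation algorithm applied later in the workflow. The mean and standard deviation are those of ‘income_hh’ in stratum 1000 as printed above, and the stratum is treated as if it were a whole domain only for the sake of the example.

```r
# Stratum 1000, variable 1 (income_hh): values from head(inp$strata) above
M    <- 23959.87   # mean
S    <- 22179.08   # standard deviation
deff <- 1.5        # design effect suggested in the text
cv_target <- 0.03  # precision constraint used here for illustration

# Simplified formula (no finite population correction, one stratum only):
# CV = sqrt(deff) * S / (sqrt(n) * M)  =>  n = deff * (S / (cv_target * M))^2
n_approx <- ceiling(deff * (S / (cv_target * M))^2)

# CV actually obtained with that sample size
cv_check <- sqrt(deff) * S / (sqrt(n_approx) * M)

print(n_approx)
print(cv_check)
```

The formula shows why tighter constraints are expensive: halving the target CV multiplies the required sample size by four.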

For instance, we can analyze the impact of the ‘deff_sugg’ parameter on the final sample design by executing the following code:

sensitivity (samp_frame=pop,
            id_PSU="municipality",
            id_SSU="id_ind",
            strata_var="stratum",
            target_vars=c("income_hh","active","inactive","unemployed"),
            deff_var="stratum",
            domain_var="region",
            minimum=50,
            delta=1,
            f=0.05,
            search="deff",
            min=1,
            max=2) 

The same for the ‘minimum’ parameter:

sensitivity (samp_frame=pop,
            id_PSU="municipality",
            id_SSU="id_ind",
            strata_var="stratum",
            target_vars=c("income_hh","active","inactive","unemployed"),
            deff_var="stratum",
            domain_var="region",
            delta=1,
            f=0.05,
            deff_sugg=1.5,
            search="min_SSU",
            min=30,
            max=80) 

And, finally, for initial sampling rate:

sensitivity (samp_frame=pop,
            id_PSU="municipality",
            id_SSU="id_ind",
            strata_var="stratum",
            target_vars=c("income_hh","active","inactive","unemployed"),
            deff_var="stratum",
            domain_var="region",
            delta=1,
            minimum=50,
            deff_sugg=1.5,
            search="sample_fraction",
            min=0.01,
            max=0.10) 

By analysing the above graphs we can decide which values are the most suitable for the sample design.

4 Optimal allocation

Using the function ‘beat.2st’ in the ‘R2BEAT’ package we can perform the optimization of PSU and SSU allocation in strata:

alloc <- beat.2st(inp$strata, 
                  cv, 
                  inp$des_file, 
                  inp$psu_file, 
                  inp$rho, 
                  deft_start = NULL, 
                  inp$effst,
                  epsilon1 = 5, 
                  mmdiff_deft = 1,
                  maxi = 15, 
                  epsilon = 10^(-11), 
                  minnumstrat = 2, 
                  maxiter = 200, 
                  maxiter1 = 25)
##   iterations PSU_SR PSU NSR PSU Total  SSU
## 1          0      0       0         0 7391
## 2          1     20      86       106 8302
## 3          2     21      99       120 8300

This is the sensitivity of the solution:

alloc$sensitivity
##    Type Dom V1 V2  V3   V4
## 1  DOM1   1  1  0   1    1
## 5  DOM2   1  0  0   0 1283
## 9  DOM2   2  1  0   1  247
## 13 DOM2   3  1  1 129    1

i.e., for each domain value and for each variable, the table reports the gain, in terms of reduction in the sample size, that would be obtained if the corresponding precision constraint were relaxed by 10%.

These are the expected values of the coefficients of variation:

alloc$expected
##    Type Dom     V1     V2     V3     V4
## 1  DOM1   1 0.0123 0.0109 0.0288 0.0447
## 5  DOM2   1 0.0117 0.0074 0.0260 0.0800
## 9  DOM2   2 0.0256 0.0213 0.0533 0.0799
## 13 DOM2   3 0.0387 0.0438 0.0799 0.0609
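
A quick way to read this table is to compare it, cell by cell, with the precision constraints. The sketch below does so with values copied by hand from the ‘cv’ data frame and from ‘alloc$expected’; the row labels ‘DOM2_1’, ‘DOM2_2’ and ‘DOM2_3’ are names introduced here to identify the three regions.

```r
# Precision constraints, as defined in the 'cv' data frame
cv_max <- rbind(DOM1   = c(0.03, 0.06, 0.06, 0.06),
                DOM2_1 = c(0.04, 0.08, 0.08, 0.08),
                DOM2_2 = c(0.04, 0.08, 0.08, 0.08),
                DOM2_3 = c(0.04, 0.08, 0.08, 0.08))
# Expected CVs copied from alloc$expected
cv_exp <- rbind(DOM1   = c(0.0123, 0.0109, 0.0288, 0.0447),
                DOM2_1 = c(0.0117, 0.0074, 0.0260, 0.0800),
                DOM2_2 = c(0.0256, 0.0213, 0.0533, 0.0799),
                DOM2_3 = c(0.0387, 0.0438, 0.0799, 0.0609))
colnames(cv_max) <- paste0("V", 1:4)
colnames(cv_exp) <- paste0("V", 1:4)

# TRUE where the expected CV does not exceed the constraint
# (a small tolerance absorbs the rounding of the printed values)
ok <- cv_exp <= cv_max + 1e-8
print(ok)
print(all(ok))

# Ratio between expected CV and constraint: values close to 1 are binding
print(round(cv_exp / cv_max, 2))
```

All the constraints are satisfied. The cells whose ratio is practically 1 (V4 in the first two regions, V3 in the third) are the same cells that show the largest values in ‘alloc$sensitivity’, i.e. they are the constraints that actually drive the sample size.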

5 Selection of PSUs

Using the function ‘StratSel’ we execute the selection of PSUs in strata:

set.seed(1234)
allocat <- alloc$alloc[-nrow(alloc$alloc),]
sample_2st <- StratSel(dataPop= inp$psu_file,
                       idpsu= ~ PSU_ID, 
                       dom= ~ STRATUM, 
                       final_pop= ~ PSU_MOS, 
                       size= ~ PSU_MOS, 
                       PSUsamplestratum= 1, 
                       min_sample= minimum, 
                       min_sample_index= FALSE, 
                       dataAll=allocat,
                       domAll= ~ factor(STRATUM), 
                       f_sample= ~ ALLOC, 
                       planned_min_sample= NULL, 
                       launch= FALSE)

This is the overall sample design:

sample_2st[[2]]
## Domain SRdom nSRdom SRdom+nSRdom SR_PSU_final_sample_unit NSR_PSU_final_sample_unit
##   1000     2      0            2                      287                         0
##   2000     3      4            7                      143                       232
##   3000     0      4            4                        0                       176
##   4000     0      1            1                        0                        33
##   5000     2      0            2                      172                         0
##   6000     1      1            2                       33                        49
##   7000     0      1            1                        0                        55
##   8000     0      1            1                        0                        56
##   9000     1      0            1                      581                         0
##  10000     6      0            6                      611                         0
##  11000     4     20           24                      155                      1053
##  12000     0      8            8                        0                       396
##  13000     1      0            1                      733                         0
##  14000     4      0            4                      601                         0
##  15000     9     17           26                      434                       883
##  16000     0     19           19                        0                       975
##  17000     1      0            1                       70                         0
##  18000     0      2            2                        0                        88
##  19000     0      4            4                        0                       175
##  20000     0      2            2                        0                        91
##  21000     1      0            1                       64                         0
##  22000     0      1            1                        0                        49
##  23000     0      2            2                        0                        89
##  24000     0      1            1                        0                        18
##  Total    35     88          123                     3884                      4418
##   Mean                                                162                       184
des <- sample_2st[[2]]
des <- des[1:(nrow(des)-1),]
strat <- c(as.character(as.numeric(des$Domain[1:(nrow(des)-1)])),"Tot")
barplot(t(des[1:(nrow(des)),2:3]), names=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated PSUs by domain")

barplot(t(des[1:(nrow(des)),5:6]), names=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, 
        cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated SSUs by domain")

and these are the selected PSUs:

selected_PSU <- sample_2st[[4]]
selected_PSU <- selected_PSU[selected_PSU$PSU_final_sample_unit > 0,]
write.table(sample_2st[[4]],"Selected_PSUs.csv",sep=";",row.names=FALSE,quote=FALSE)
head(selected_PSU)
##      Sampled_PSU       Pik Size_Stratum STRATUM PSU_ID PSU_MOS PSU_MOS.1 ALLOC threshold final_populationdom sampling_fraction SR SizeSR stratum N_PSU_Stratum PSU_final_sample_unit nSR
## 1              1 1.0000000       146162    1000    330  146162    146162   287  34321.78              197007         0.0014568  1 146162   10001             1                   213   0
## 2              1 1.0000000        50845    1000    309   50845     50845   287  34321.78              197007         0.0014568  1  50845   10002             1                    74   0
## 3              1 1.0000000        36195    2000    304   36195     36195   374  34954.01              261456         0.0014305  1  36195   20001             1                    52   0
## 4              1 1.0000000        34156    2000    342   34156     34156   374  34954.01              261456         0.0014305  1      0   20002             1                    49   0
## 5              1 1.0000000        29376    2000    315   29376     29376   374  34954.01              261456         0.0014305  1      0   20003             1                    42   0
## 1100           1 0.5583857        44403    2000    292   24794     24794   374  34954.01              261456         0.0014305  0      0   20004             2                    64   1
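
Some of the columns of this table can be related to each other with simple arithmetic. The sketch below rebuilds them from the six rows printed above. The relations it uses (sampling fraction = ALLOC / final_populationdom, first-stage probability = PSU_MOS / Size_Stratum, SSUs per PSU = Size_Stratum × sampling fraction) are inferred from the printed values and are not taken from the ‘StratSel’ source code.

```r
# Values copied from the first rows of 'selected_PSU' shown above
psu <- data.frame(
  PSU_ID       = c(330, 309, 304, 342, 315, 292),
  PSU_MOS      = c(146162, 50845, 36195, 34156, 29376, 24794),
  Size_Stratum = c(146162, 50845, 36195, 34156, 29376, 44403),
  ALLOC        = c(287, 287, 374, 374, 374, 374),
  final_populationdom   = c(197007, 197007, 261456, 261456, 261456, 261456),
  Pik                   = c(1, 1, 1, 1, 1, 0.5583857),
  PSU_final_sample_unit = c(213, 74, 52, 49, 42, 64)
)

# Overall sampling fraction of the stratum
psu$f <- psu$ALLOC / psu$final_populationdom
# First-stage probability, proportional to size within the selection stratum
psu$pik_check <- psu$PSU_MOS / psu$Size_Stratum
# SSUs to be selected in the PSU: size of the selection stratum times f
psu$take_check <- round(psu$Size_Stratum * psu$f)
# Implied weight of each SSU: inverse of the product of the two stage probabilities
psu$weight <- psu$PSU_MOS / (psu$Pik * psu$PSU_final_sample_unit)

print(psu[, c("PSU_ID", "f", "pik_check", "take_check", "weight")])
print(1 / unique(psu$f))   # weights of a perfectly self-weighting design
```

The implied weights are all close to the inverse of the stratum sampling fraction, and differ from it only because the number of SSUs per PSU has to be an integer. This is what makes the design approximately self-weighting.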

6 Selection of SSUs

Finally, we are able to select the Secondary Stage Units (the individuals) from the already selected PSUs (the municipalities). First, we load the population frame:

load("pop.RData")

and we proceed to select the sample in this way:

samp <- select_SSU(df=pop,
                   PSU_code="municipality",
                   SSU_code="id_ind",
                   PSU_sampled=selected_PSU[selected_PSU$Sampled_PSU==1,],
                   verbose=FALSE)

We can check that the total number of selected units is practically equal to the one determined in the allocation step:

nrow(samp)
## [1] 8302
sum(allocat$ALLOC)
## [1] 8300

and that the sum of the weights equals the population size:

nrow(pop)
## [1] 2258507
sum(samp$weight)
## [1] 2258507

This is the distribution of weights:

par(mfrow=c(1, 2))
boxplot(samp$weight,col="orange")
title("Weights distribution (total sample)",cex.main=0.7)
boxplot(weight ~ region, data=samp,col="orange")
title("Weights distribution by region",cex.main=0.7)

boxplot(weight ~ province, data=samp,col="orange")
title("Weights distribution by province",cex.main=0.7)
boxplot(weight ~ stratum, data=samp,col="orange")
title("Weights distribution by stratum",cex.main=0.7)

It can be seen that the sample is fully self-weighted inside strata, and approximately self-weighted in aggregations of strata, which is the result we wanted to obtain.
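
Besides the boxplots, the same property can be summarised numerically with the coefficient of variation of the weights. The sketch below uses a small toy data frame with hypothetical weights (it is not the output of ‘select_SSU’): a value of zero within a stratum means that the stratum is fully self-weighted.

```r
# Toy sample with hypothetical weights (not the output of select_SSU)
toy <- data.frame(
  stratum = c("1000", "1000", "1000", "2000", "2000", "2000"),
  weight  = c(686, 686, 686, 699, 699, 699)
)

# Coefficient of variation of the weights within each stratum
cv_w <- tapply(toy$weight, toy$stratum, function(w) sd(w) / mean(w))
print(cv_w)

# Coefficient of variation of the weights in the whole sample
cv_all <- sd(toy$weight) / mean(toy$weight)
print(cv_all)
```

On the real sample the same two lines can be applied to ‘samp$weight’ and ‘samp$stratum’ (or ‘samp$region’) to quantify what the boxplots show.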