This vignette describes a generalized procedure that uses the methods implemented in R2BEAT (“Multistage Sampling Allocation and PSU selection”), an R package developed at the Italian National Institute of Statistics (Istat).
This package determines the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs). It also selects the PSUs in such a way that the final sample of SSUs is self-weighting, i.e. the total inclusion probabilities (the product of the inclusion probabilities of the PSUs and those of the SSUs) are nearly equal for all SSUs, or at least have minimum variability.
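The self-weighting property can be illustrated with a small base-R sketch. It assumes that PSUs are drawn with probability proportional to size and that a fixed number of SSUs is taken by simple random sampling in each selected PSU; the PSU sizes and sample sizes are invented for illustration and the code does not use R2BEAT.

```r
# Toy stratum: 5 PSUs with hypothetical sizes (number of SSUs in each PSU)
psu_size <- c(1200, 800, 500, 300, 200)
n_psu    <- 2    # PSUs to select (hypothetical)
n_ssu    <- 20   # SSUs to interview in each selected PSU (hypothetical)

# First stage: inclusion probability proportional to PSU size
pi_psu <- n_psu * psu_size / sum(psu_size)

# Second stage: simple random sampling of n_ssu units inside a selected PSU
pi_ssu <- n_ssu / psu_size

# Total inclusion probability of an SSU = product of the two stages
pi_tot <- pi_psu * pi_ssu
print(pi_tot)

# The PSU size cancels out, so every SSU has the same probability
# n_psu * n_ssu / N: the design is self-weighting
all.equal(pi_tot, rep(n_psu * n_ssu / sum(psu_size), length(psu_size)))
```

In practice very large PSUs would get a first-stage probability above 1; such PSUs are treated as self-representative, which is why the sample design tables distinguish SR and non-SR PSUs.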
This general flow assumes that a sampling frame is available, containing, among others, the following variables:
The variables of the last type cannot, of course, be directly available: instead, the sampling frame will contain proxy variables, or the same variables with predicted values.
Having this sampling frame, the workflow is based on the following steps:
We make use of a synthetic population data frame (pop), which is available at:
https://github.com/barcaroli/R2BEAT/tree/master/data
load("pop.RData")
str(pop)
## 'data.frame': 2258507 obs. of 21 variables:
## $ region : Factor w/ 3 levels "north","center",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ province : Factor w/ 6 levels "north_1","north_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ municipality : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pop_m : num 1546 1546 1546 1546 1546 ...
## $ type_m : num 6 6 6 6 6 6 6 6 6 6 ...
## $ id_hh : Factor w/ 963018 levels "H1","H10","H100",..: 1 1 1 2 3 3 3 3 1114 1114 ...
## $ id_ind : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : int 1 2 1 2 1 1 2 2 1 1 ...
## $ age : int 33 69 81 46 38 64 63 35 37 6 ...
## $ edu : int 5 3 3 4 5 3 4 12 5 2 ...
## $ marital : num 1 2 2 1 1 2 2 2 1 1 ...
## $ foreigner : int 0 0 0 0 0 0 0 0 0 0 ...
## $ earner : int 1 1 1 1 1 1 1 1 1 0 ...
## $ income_hh : num 30488 30488 30488 21756 29871 ...
## $ work : num 1 1 2 1 1 1 1 1 1 2 ...
## $ unemployed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ one : num 1 1 1 1 1 1 1 1 1 1 ...
## $ stratum : Factor w/ 24 levels "1000","2000",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ stratum_n : num 12000 12000 12000 12000 12000 12000 12000 12000 12000 12000 ...
## $ stratum_label: chr "north_1_6" "north_1_6" "north_1_6" "north_1_6" ...
## $ prog_hh : int 1 2 3 1 1 2 3 4 1 2 ...
In this phase we may have to derive new variables, corresponding to the parameters required by the following steps. Here this is hardly necessary, as almost all the variables are already available. We only have to derive two binary target variables from ‘work’ and add a working variable (‘one’):
pop$active <- ifelse(pop$work==1,1,0)
pop$inactive <- ifelse(pop$work==2,1,0)
pop$one <- 1
Great attention must be paid to the nature of the target variables, especially those of ‘factor’ type. The procedure illustrated here is suitable only when categorical variables are binary, with values 0 and 1, and the aim is to estimate the proportion of ‘1’ in the population. If factor variables are of a different nature, an error message is printed.
This is why the ‘work’ variable needs handling: its values 0, 1 and 2 indicate respectively people not in the labour force, active people and inactive people, so we derived from it the two binary variables ‘active’ and ‘inactive’.
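The recoding can be tried on a small vector first. The sketch below uses invented ‘work’ codes and shows why 0/1 indicators are convenient: the mean of each indicator is the proportion of units in that category.

```r
# Hypothetical 'work' codes: 0 = not in labour force, 1 = active, 2 = inactive
work <- c(1, 1, 2, 0, 1, 2, 1, 0)

# One 0/1 indicator per category of interest
active   <- ifelse(work == 1, 1, 0)
inactive <- ifelse(work == 2, 1, 0)

# Each indicator takes only the values 0 and 1 ...
stopifnot(all(active %in% c(0, 1)), all(inactive %in% c(0, 1)))

# ... so its mean is the proportion of units in that category
mean(active)     # proportion of active people
mean(inactive)   # proportion of inactive people
```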
We are now able to populate the required parameters:
samp_frame <- pop
id_PSU <- "municipality" # only one
id_SSU <- "id_ind" # only one
strata_var <- "stratum" # only one
target_vars <- c("income_hh","active","inactive","unemployed") # more than one
deff_var <- "stratum" # only one
domain_var <- "region" # only one
minimum <- 50 # minimum number of SSUs to be interviewed in each selected PSU
delta <- 1 # average dimension of the SSU in terms of elementary survey units
f <- 0.05 # suggestion for the sampling fraction
deff_sugg <- 1.5 # suggestion for the deff value
With the parameters assigned, we can execute the ‘prepareInputToAllocation’ function:
inp <- prepareInputToAllocation(samp_frame,
                                id_PSU,
                                id_SSU,
                                strata_var,
                                target_vars,
                                deff_var,
                                domain_var,
                                minimum,
                                delta,
                                f,
                                deff_sugg)
The function ‘prepareInputToAllocation’ produces a list composed of six elements, stored in the ‘inp’ object:
(Actually, the ‘deff’ dataframe is not used in the following steps, it just remains for documentation purposes.)
Let us see the content of these objects:
head(inp$strata)
 | N | M1 | M2 | M3 | M4 | S1 | S2 | S3 | S4 | COST | CENS | DOM1 | DOM2 | STRATUM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | 197007 | 23959.87 | 0.6650322 | 0.2285807 | 0.1063871 | 22179.08 | 0.4719792 | 0.4199185 | 0.3083324 | 1 | 0 | 1 | 2 | 1000 |
2000 | 261456 | 20966.65 | 0.6709886 | 0.2297519 | 0.0992595 | 19624.65 | 0.4698541 | 0.4206732 | 0.2990102 | 1 | 0 | 1 | 2 | 2000 |
3000 | 115813 | 19814.73 | 0.6644591 | 0.2315975 | 0.1039434 | 14754.88 | 0.4721792 | 0.4218532 | 0.3051871 | 1 | 0 | 1 | 2 | 3000 |
4000 | 17241 | 18732.72 | 0.6273418 | 0.2499275 | 0.1227307 | 13462.74 | 0.4835122 | 0.4329708 | 0.3281278 | 1 | 0 | 1 | 2 | 4000 |
5000 | 101067 | 22070.31 | 0.6134445 | 0.2338845 | 0.1526710 | 17187.98 | 0.4869603 | 0.4232996 | 0.3596701 | 1 | 0 | 1 | 2 | 5000 |
6000 | 47218 | 21069.07 | 0.6135796 | 0.2348469 | 0.1515736 | 17342.74 | 0.4869288 | 0.4239031 | 0.3586070 | 1 | 0 | 1 | 2 | 6000 |
inp$deff
 | STRATUM | DEFF1 | DEFF2 | DEFF3 | DEFF4 | b_nar |
---|---|---|---|---|---|---|
1 | 1000 | 1.5 | 1.5 | 1.5 | 1.5 | 4925.17500 |
12 | 2000 | 1.5 | 1.5 | 1.5 | 1.5 | 1005.60000 |
18 | 3000 | 1.5 | 1.5 | 1.5 | 1.5 | 222.71731 |
19 | 4000 | 1.5 | 1.5 | 1.5 | 1.5 | 47.89167 |
20 | 5000 | 1.5 | 1.5 | 1.5 | 1.5 | 2526.67500 |
21 | 6000 | 1.5 | 1.5 | 1.5 | 1.5 | 786.96667 |
22 | 7000 | 1.5 | 1.5 | 1.5 | 1.5 | 168.72222 |
23 | 8000 | 1.5 | 1.5 | 1.5 | 1.5 | 69.78421 |
24 | 9000 | 1.5 | 1.5 | 1.5 | 1.5 | 4641.65000 |
2 | 10000 | 1.5 | 1.5 | 1.5 | 1.5 | 883.58333 |
3 | 11000 | 1.5 | 1.5 | 1.5 | 1.5 | 174.49153 |
4 | 12000 | 1.5 | 1.5 | 1.5 | 1.5 | 57.65700 |
5 | 13000 | 1.5 | 1.5 | 1.5 | 1.5 | 5146.65000 |
6 | 14000 | 1.5 | 1.5 | 1.5 | 1.5 | 1049.78750 |
7 | 15000 | 1.5 | 1.5 | 1.5 | 1.5 | 194.15625 |
8 | 16000 | 1.5 | 1.5 | 1.5 | 1.5 | 44.59672 |
9 | 17000 | 1.5 | 1.5 | 1.5 | 1.5 | 3055.85000 |
10 | 18000 | 1.5 | 1.5 | 1.5 | 1.5 | 618.79167 |
11 | 19000 | 1.5 | 1.5 | 1.5 | 1.5 | 189.70676 |
13 | 20000 | 1.5 | 1.5 | 1.5 | 1.5 | 55.32091 |
14 | 21000 | 1.5 | 1.5 | 1.5 | 1.5 | 2757.20000 |
15 | 22000 | 1.5 | 1.5 | 1.5 | 1.5 | 696.51667 |
16 | 23000 | 1.5 | 1.5 | 1.5 | 1.5 | 240.55000 |
17 | 24000 | 1.5 | 1.5 | 1.5 | 1.5 | 48.19583 |
inp$effst
STRATUM | EFFST1 | EFFST2 | EFFST3 | EFFST4 |
---|---|---|---|---|
1000 | 1 | 1 | 1 | 1 |
2000 | 1 | 1 | 1 | 1 |
3000 | 1 | 1 | 1 | 1 |
4000 | 1 | 1 | 1 | 1 |
5000 | 1 | 1 | 1 | 1 |
6000 | 1 | 1 | 1 | 1 |
7000 | 1 | 1 | 1 | 1 |
8000 | 1 | 1 | 1 | 1 |
9000 | 1 | 1 | 1 | 1 |
10000 | 1 | 1 | 1 | 1 |
11000 | 1 | 1 | 1 | 1 |
12000 | 1 | 1 | 1 | 1 |
13000 | 1 | 1 | 1 | 1 |
14000 | 1 | 1 | 1 | 1 |
15000 | 1 | 1 | 1 | 1 |
16000 | 1 | 1 | 1 | 1 |
17000 | 1 | 1 | 1 | 1 |
18000 | 1 | 1 | 1 | 1 |
19000 | 1 | 1 | 1 | 1 |
20000 | 1 | 1 | 1 | 1 |
21000 | 1 | 1 | 1 | 1 |
22000 | 1 | 1 | 1 | 1 |
23000 | 1 | 1 | 1 | 1 |
24000 | 1 | 1 | 1 | 1 |
inp$rho
 | STRATUM | RHO_AR1 | RHO_NAR1 | RHO_AR2 | RHO_NAR2 | RHO_AR3 | RHO_NAR3 | RHO_AR4 | RHO_NAR4 |
---|---|---|---|---|---|---|---|---|---|
1 | 1000 | 1 | 0.0001015 | 1 | 0.0001015 | 1 | 0.0001015 | 1 | 0.0001015 |
12 | 2000 | 1 | 0.0004977 | 1 | 0.0004977 | 1 | 0.0004977 | 1 | 0.0004977 |
18 | 3000 | 1 | 0.0022551 | 1 | 0.0022551 | 1 | 0.0022551 | 1 | 0.0022551 |
19 | 4000 | 1 | 0.0106629 | 1 | 0.0106629 | 1 | 0.0106629 | 1 | 0.0106629 |
20 | 5000 | 1 | 0.0001980 | 1 | 0.0001980 | 1 | 0.0001980 | 1 | 0.0001980 |
21 | 6000 | 1 | 0.0006362 | 1 | 0.0006362 | 1 | 0.0006362 | 1 | 0.0006362 |
22 | 7000 | 1 | 0.0029811 | 1 | 0.0029811 | 1 | 0.0029811 | 1 | 0.0029811 |
23 | 8000 | 1 | 0.0072691 | 1 | 0.0072691 | 1 | 0.0072691 | 1 | 0.0072691 |
24 | 9000 | 1 | 0.0001077 | 1 | 0.0001077 | 1 | 0.0001077 | 1 | 0.0001077 |
2 | 10000 | 1 | 0.0005665 | 1 | 0.0005665 | 1 | 0.0005665 | 1 | 0.0005665 |
3 | 11000 | 1 | 0.0028820 | 1 | 0.0028820 | 1 | 0.0028820 | 1 | 0.0028820 |
4 | 12000 | 1 | 0.0088250 | 1 | 0.0088250 | 1 | 0.0088250 | 1 | 0.0088250 |
5 | 13000 | 1 | 0.0000972 | 1 | 0.0000972 | 1 | 0.0000972 | 1 | 0.0000972 |
6 | 14000 | 1 | 0.0004767 | 1 | 0.0004767 | 1 | 0.0004767 | 1 | 0.0004767 |
7 | 15000 | 1 | 0.0025886 | 1 | 0.0025886 | 1 | 0.0025886 | 1 | 0.0025886 |
8 | 16000 | 1 | 0.0114688 | 1 | 0.0114688 | 1 | 0.0114688 | 1 | 0.0114688 |
9 | 17000 | 1 | 0.0001637 | 1 | 0.0001637 | 1 | 0.0001637 | 1 | 0.0001637 |
10 | 18000 | 1 | 0.0008093 | 1 | 0.0008093 | 1 | 0.0008093 | 1 | 0.0008093 |
11 | 19000 | 1 | 0.0026496 | 1 | 0.0026496 | 1 | 0.0026496 | 1 | 0.0026496 |
13 | 20000 | 1 | 0.0092046 | 1 | 0.0092046 | 1 | 0.0092046 | 1 | 0.0092046 |
14 | 21000 | 1 | 0.0001814 | 1 | 0.0001814 | 1 | 0.0001814 | 1 | 0.0001814 |
15 | 22000 | 1 | 0.0007189 | 1 | 0.0007189 | 1 | 0.0007189 | 1 | 0.0007189 |
16 | 23000 | 1 | 0.0020872 | 1 | 0.0020872 | 1 | 0.0020872 | 1 | 0.0020872 |
17 | 24000 | 1 | 0.0105942 | 1 | 0.0105942 | 1 | 0.0105942 | 1 | 0.0105942 |
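The printed values appear consistent with the usual relation between design effect, average cluster size and intra-cluster correlation, deff = 1 + rho (b - 1). The sketch below checks this on the first four strata, using figures copied from the ‘inp$deff’ and ‘inp$rho’ printouts. That ‘prepareInputToAllocation’ computes RHO_NAR in exactly this way is an assumption, not something stated in the text.

```r
# Values copied from the inp$deff and inp$rho printouts (first four strata)
stratum <- c(1000, 2000, 3000, 4000)
deff    <- c(1.5, 1.5, 1.5, 1.5)                           # DEFF1
b_nar   <- c(4925.17500, 1005.60000, 222.71731, 47.89167)  # b_nar
rho_tab <- c(0.0001015, 0.0004977, 0.0022551, 0.0106629)   # RHO_NAR1

# Invert deff = 1 + rho * (b - 1) to obtain rho
rho_calc <- (deff - 1) / (b_nar - 1)

# Side-by-side comparison of tabulated and recomputed values
data.frame(stratum, rho_tab, rho_calc = round(rho_calc, 7))
```

The same inversion explains why strata with a small ‘b_nar’ show a larger intra-cluster correlation for the same suggested deff of 1.5.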
inp$des_file
STRATUM | STRAT_MOS | DELTA | MINIMUM |
---|---|---|---|
1000 | 197007 | 1 | 50 |
2000 | 261456 | 1 | 50 |
3000 | 115813 | 1 | 50 |
4000 | 17241 | 1 | 50 |
5000 | 101067 | 1 | 50 |
6000 | 47218 | 1 | 50 |
7000 | 30370 | 1 | 50 |
8000 | 26518 | 1 | 50 |
9000 | 92833 | 1 | 50 |
10000 | 106030 | 1 | 50 |
11000 | 205900 | 1 | 50 |
12000 | 57657 | 1 | 50 |
13000 | 102933 | 1 | 50 |
14000 | 83983 | 1 | 50 |
15000 | 186390 | 1 | 50 |
16000 | 108816 | 1 | 50 |
17000 | 61117 | 1 | 50 |
18000 | 74255 | 1 | 50 |
19000 | 140383 | 1 | 50 |
20000 | 60853 | 1 | 50 |
21000 | 55144 | 1 | 50 |
22000 | 41791 | 1 | 50 |
23000 | 72165 | 1 | 50 |
24000 | 11567 | 1 | 50 |
head(inp$PSU_file)
## NULL
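The NULL above is probably a matter of letter case rather than an empty result: in R, ‘$’ on a list returns NULL when no element has the requested name, and element names are case-sensitive. The later call to ‘beat.2st’ refers to ‘inp$psu_file’ in lower case. A minimal sketch with a stand-in list (the list and its contents are invented for illustration):

```r
# Stand-in for the 'inp' list; the data frame content is hypothetical
inp_demo <- list(psu_file = data.frame(PSU_ID  = 1:3,
                                       PSU_MOS = c(1546, 980, 2210)))

# Wrong case: no element is called 'PSU_file', so the result is NULL
is.null(inp_demo$PSU_file)

# Listing the names shows the exact spelling to use
names(inp_demo)

# Correct case returns the data frame
head(inp_demo$psu_file)
```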
It may happen that the population in strata (variable ‘N’ in the ‘inp$strata’ dataset) and the one derived from the PSU dataset (variable ‘STRAT_MOS’ in the ‘inp$des_file’ dataset) are not the same.
We can check it by applying the function ‘check_input’ in this way:
newstrata <- check_input(strata=inp$strata,
                         des=inp$des_file,
                         strata_var_strata="STRATUM",
                         strata_var_des="STRATUM")
##
## --------------------------------------------------
## Differences between population in strata and PSUs
## --------------------------------------------------
## STRATUM N_in_strata N_in_PSUs relative_difference
## 1 1000 197007 197007 0
## 2 10000 106030 106030 0
## 3 11000 205900 205900 0
## 4 12000 57657 57657 0
## 5 13000 102933 102933 0
## 6 14000 83983 83983 0
## 7 15000 186390 186390 0
## 8 16000 108816 108816 0
## 9 17000 61117 61117 0
## 10 18000 74255 74255 0
## 11 19000 140383 140383 0
## 12 2000 261456 261456 0
## 13 20000 60853 60853 0
## 14 21000 55144 55144 0
## 15 22000 41791 41791 0
## 16 23000 72165 72165 0
## 17 24000 11567 11567 0
## 18 3000 115813 115813 0
## 19 4000 17241 17241 0
## 20 5000 101067 101067 0
## 21 6000 47218 47218 0
## 22 7000 30370 30370 0
## 23 8000 26518 26518 0
## 24 9000 92833 92833 0
##
## --------------------------------------------------
## Population of PSUs has been attributed to strata
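The comparison reported above can be reproduced in spirit with base R. The following sketch aggregates PSU sizes by stratum and computes the relative difference with the stratum population. The two small data frames are invented, and the exact formula used by ‘check_input’ for ‘relative_difference’ is an assumption.

```r
# Hypothetical strata file: population N per stratum
strata_demo <- data.frame(STRATUM = c("1000", "2000", "3000"),
                          N       = c(5000, 3000, 1200))

# Hypothetical PSU file: measure of size (MOS) of each PSU
psu_demo <- data.frame(STRATUM = c("1000", "1000", "2000", "2000", "3000"),
                       PSU_MOS = c(3000, 2000, 1800, 1200, 1150))

# Population per stratum as derived from the PSUs
n_psu <- aggregate(PSU_MOS ~ STRATUM, data = psu_demo, FUN = sum)

# Put the two counts side by side and compute the relative difference
chk <- merge(strata_demo, n_psu, by = "STRATUM")
chk$relative_difference <- (chk$N - chk$PSU_MOS) / chk$N
chk

# Strata where the two sources disagree
chk$STRATUM[chk$relative_difference != 0]
```

In the toy data only the third stratum disagrees; in the vignette output all relative differences are 0, so no adjustment is needed.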
To run ‘prepareInputToAllocation’, values must be assigned to its parameters. Some of them can be derived directly from the available data, but for others, namely:
choosing a value is more difficult without a reference.
To guide the choice of these values, the function ‘sensitivity’ performs a sensitivity analysis for each of these parameters.
To execute this function, the name of the parameter has to be given, together with the minimum and maximum value. On the basis of these minimum and maximum values, 10 different values will be used for carrying out the allocation. The output will be a graphical one.
This function requires also the definition of the precision constraints on the target values:
cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.03,0.04),
                         CV2=c(0.06,0.08),
                         CV3=c(0.06,0.08),
                         CV4=c(0.06,0.08)))
cv
DOM | CV1 | CV2 | CV3 | CV4 |
---|---|---|---|---|
DOM1 | 0.03 | 0.06 | 0.06 | 0.06 |
DOM2 | 0.04 | 0.08 | 0.08 | 0.08 |
The meaning of these constraints is that, once we select a sample and produce estimates, we expect a maximum coefficient of variation for the first variable (‘income_hh’) equal to 3% at national level (‘DOM1’) and to 4% at regional level (‘DOM2’); the corresponding limits for the other three variables are 6% and 8%.
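To give a feeling for what a CV constraint implies, the sketch below computes the sample size needed to estimate a mean with a given CV under simple random sampling, then inflates it by a design effect. It uses the mean and standard deviation of ‘income_hh’ printed for stratum 1000 in ‘inp$strata’ and treats that stratum as if it were a whole domain, which is a simplification. The helper ‘n_for_cv’ is hypothetical and is not part of R2BEAT, whose allocation handles all variables and domains jointly.

```r
# Hypothetical helper: SRS sample size giving a target CV for a mean,
# with finite population correction, optionally inflated by a design effect
n_for_cv <- function(N, mean_y, sd_y, cv_target, deff = 1) {
  n0 <- (sd_y / (cv_target * mean_y))^2   # infinite-population size
  n  <- n0 / (1 + n0 / N)                 # finite population correction
  ceiling(n * deff)
}

# Figures printed for stratum 1000 (N, M1, S1) and the 3% constraint on CV1
N  <- 197007
M1 <- 23959.87
S1 <- 22179.08

n_srs  <- n_for_cv(N, M1, S1, cv_target = 0.03)
n_deff <- n_for_cv(N, M1, S1, cv_target = 0.03, deff = 1.5)
c(srs = n_srs, with_deff = n_deff)

# A looser constraint requires fewer units
n_for_cv(N, M1, S1, cv_target = 0.04)
```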
For instance, we can analyze the impact of the ‘deff_sugg’ parameter on the final sample design by executing the following code:
sensitivity (samp_frame=pop,
id_PSU="municipality",
id_SSU="id_ind",
strata_var="stratum",
target_vars=c("income_hh","active","inactive","unemployed"),
deff_var="stratum",
domain_var="region",
minimum=50,
delta=1,
f=0.05,
search="deff",
min=1,
max=2)
The same for the ‘minimum’ parameter:
sensitivity (samp_frame=pop,
id_PSU="municipality",
id_SSU="id_ind",
strata_var="stratum",
target_vars=c("income_hh","active","inactive","unemployed"),
deff_var="stratum",
domain_var="region",
delta=1,
f=0.05,
deff_sugg=1.5,
search="min_SSU",
min=30,
max=80)
And, finally, for the initial sampling rate (‘f’):
sensitivity (samp_frame=pop,
id_PSU="municipality",
id_SSU="id_ind",
strata_var="stratum",
target_vars=c("income_hh","active","inactive","unemployed"),
deff_var="stratum",
domain_var="region",
delta=1,
minimum=50,
deff_sugg=1.5,
search="sample_fraction",
min=0.01,
max=0.10)
By analysing the above graphs we can decide which values are the most suitable for the sample design.
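The idea behind the sensitivity analysis can be mimicked with a few lines of base R: build a grid of 10 candidate values between the minimum and the maximum and look at how a size measure reacts. The evenly spaced grid and the simple rule "sample size = SRS size x deff" are assumptions made for illustration, and the SRS size of 5000 is invented; the ‘sensitivity’ function itself reruns the full allocation for each value.

```r
# Grid of 10 candidate deff values between min = 1 and max = 2
# (evenly spaced grid: an assumption, not confirmed by the source)
deff_grid <- seq(from = 1, to = 2, length.out = 10)

# Hypothetical sample size required under simple random sampling
n_srs <- 5000

# Under a constant design effect the required size grows proportionally
n_needed <- ceiling(n_srs * deff_grid)

data.frame(deff = round(deff_grid, 3), n_needed)

# Quick graphical check, in the spirit of the plots produced by 'sensitivity'
plot(deff_grid, n_needed, type = "b",
     xlab = "deff", ylab = "Sample size", main = "Toy sensitivity to deff")
```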
Using the function ‘beat.2st’ in ‘R2BEAT’ package we can perform the optimization of PSU and SSU allocation in strata:
alloc <- beat.2st(inp$strata,
                  cv,
                  inp$des_file,
                  inp$psu_file,
                  inp$rho,
                  deft_start = NULL,
                  inp$effst,
                  epsilon1 = 5,
                  mmdiff_deft = 1,
                  maxi = 15,
                  epsilon = 10^(-11),
                  minnumstrat = 2,
                  maxiter = 200,
                  maxiter1 = 25)
## iterations PSU_SR PSU NSR PSU Total SSU
## 1 0 0 0 0 7391
## 2 1 20 86 106 8302
## 3 2 21 99 120 8300
This is the sensitivity of the solution:
alloc$sensitivity
 | Type | Dom | V1 | V2 | V3 | V4 |
---|---|---|---|---|---|---|
1 | DOM1 | 1 | 1 | 0 | 1 | 1 |
5 | DOM2 | 1 | 0 | 0 | 0 | 1283 |
9 | DOM2 | 2 | 1 | 0 | 1 | 247 |
13 | DOM2 | 3 | 1 | 1 | 129 | 1 |
i.e., for each domain value and each variable, the table reports the gain, in terms of reduction in sample size, obtained if the corresponding precision constraint is relaxed by 10%.
These are the expected values of the coefficients of variation:
alloc$expected
 | Type | Dom | V1 | V2 | V3 | V4 |
---|---|---|---|---|---|---|
1 | DOM1 | 1 | 0.0123 | 0.0109 | 0.0288 | 0.0447 |
5 | DOM2 | 1 | 0.0117 | 0.0074 | 0.0260 | 0.0800 |
9 | DOM2 | 2 | 0.0256 | 0.0213 | 0.0533 | 0.0799 |
13 | DOM2 | 3 | 0.0387 | 0.0438 | 0.0799 | 0.0609 |
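A quick way to read this table is to compare each expected CV with the constraint of its domain type. The sketch below rebuilds both tables from the values printed in this vignette and checks that no constraint is exceeded; the small tolerance is an assumption introduced because the printed CVs are rounded to four decimals.

```r
# Precision constraints, with the values shown in the 'cv' table
cv_target <- data.frame(DOM = c("DOM1", "DOM2"),
                        CV1 = c(0.03, 0.04), CV2 = c(0.06, 0.08),
                        CV3 = c(0.06, 0.08), CV4 = c(0.06, 0.08))

# Expected CVs copied from the alloc$expected printout
expected <- data.frame(Type = c("DOM1", "DOM2", "DOM2", "DOM2"),
                       Dom  = c(1, 1, 2, 3),
                       V1 = c(0.0123, 0.0117, 0.0256, 0.0387),
                       V2 = c(0.0109, 0.0074, 0.0213, 0.0438),
                       V3 = c(0.0288, 0.0260, 0.0533, 0.0799),
                       V4 = c(0.0447, 0.0800, 0.0799, 0.0609))

# Match each row to the constraints of its domain type
limits <- cv_target[match(expected$Type, cv_target$DOM), paste0("CV", 1:4)]

# TRUE where the expected CV does not exceed the constraint
# (small tolerance because the printed values are rounded)
ok <- as.matrix(expected[, paste0("V", 1:4)]) <= as.matrix(limits) + 1e-4
all(ok)
```

In this output the CVs that sit at their limit (V4 in regions 1 and 2, V3 in region 3) are the same cells that show the largest values in the sensitivity table, which suggests that these are the constraints driving the sample size.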
Using the function ‘StratSel’, we execute the selection of PSUs in strata:
set.seed(1234)
allocat <- alloc$alloc[-nrow(alloc$alloc),]
sample_2st <- StratSel(dataPop= inp$psu_file,
                       idpsu= ~ PSU_ID,
                       dom= ~ STRATUM,
                       final_pop= ~ PSU_MOS,
                       size= ~ PSU_MOS,
                       PSUsamplestratum= 1,
                       min_sample= minimum,
                       min_sample_index= FALSE,
                       dataAll=allocat,
                       domAll= ~ factor(STRATUM),
                       f_sample= ~ ALLOC,
                       planned_min_sample= NULL,
                       launch= F)
This is the overall sample design:
sample_2st[[2]]
Domain | SRdom | nSRdom | SRdom+nSRdom | SR_PSU_final_sample_unit | NSR_PSU_final_sample_unit |
---|---|---|---|---|---|
1000 | 2 | 0 | 2 | 287 | 0 |
2000 | 3 | 4 | 7 | 143 | 232 |
3000 | 0 | 4 | 4 | 0 | 176 |
4000 | 0 | 1 | 1 | 0 | 33 |
5000 | 2 | 0 | 2 | 172 | 0 |
6000 | 1 | 1 | 2 | 33 | 49 |
7000 | 0 | 1 | 1 | 0 | 55 |
8000 | 0 | 1 | 1 | 0 | 56 |
9000 | 1 | 0 | 1 | 581 | 0 |
10000 | 6 | 0 | 6 | 611 | 0 |
11000 | 4 | 20 | 24 | 155 | 1053 |
12000 | 0 | 8 | 8 | 0 | 396 |
13000 | 1 | 0 | 1 | 733 | 0 |
14000 | 4 | 0 | 4 | 601 | 0 |
15000 | 9 | 17 | 26 | 434 | 883 |
16000 | 0 | 19 | 19 | 0 | 975 |
17000 | 1 | 0 | 1 | 70 | 0 |
18000 | 0 | 2 | 2 | 0 | 88 |
19000 | 0 | 4 | 4 | 0 | 175 |
20000 | 0 | 2 | 2 | 0 | 91 |
21000 | 1 | 0 | 1 | 64 | 0 |
22000 | 0 | 1 | 1 | 0 | 49 |
23000 | 0 | 2 | 2 | 0 | 89 |
24000 | 0 | 1 | 1 | 0 | 18 |
Total | 35 | 88 | 123 | 3884 | 4418 |
Mean | | | | 162 | 184 |
des <- sample_2st[[2]]
des <- des[1:(nrow(des)-1),]
strat <- c(as.character(as.numeric(des$Domain[1:(nrow(des)-1)])),"Tot")
barplot(t(des[1:(nrow(des)),2:3]), names=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, cex.names=0.7)
legend("topleft",
legend = c("Self Representative","Non Self Representative"),
fill = c("darkblue", "red"))
title("Distribution of allocated PSUs by domain")
barplot(t(des[1:(nrow(des)),5:6]), names=strat,
col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7,
cex.names=0.7)
legend("topleft",
legend = c("Self Representative","Non Self Representative"),
fill = c("darkblue", "red"))
title("Distribution of allocated SSUs by domain")
and these are the selected PSUs:
selected_PSU <- sample_2st[[4]]
selected_PSU <- selected_PSU[selected_PSU$PSU_final_sample_unit > 0,]
write.table(sample_2st[[4]],"Selected_PSUs.csv",sep=";",row.names=F,quote=F)
head(selected_PSU)
 | Sampled_PSU | Pik | Size_Stratum | STRATUM | PSU_ID | PSU_MOS | PSU_MOS.1 | ALLOC | threshold | final_populationdom | sampling_fraction | SR | SizeSR | stratum | N_PSU_Stratum | PSU_final_sample_unit | nSR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1.0000000 | 146162 | 1000 | 330 | 146162 | 146162 | 287 | 34321.78 | 197007 | 0.0014568 | 1 | 146162 | 10001 | 1 | 213 | 0 |
2 | 1 | 1.0000000 | 50845 | 1000 | 309 | 50845 | 50845 | 287 | 34321.78 | 197007 | 0.0014568 | 1 | 50845 | 10002 | 1 | 74 | 0 |
3 | 1 | 1.0000000 | 36195 | 2000 | 304 | 36195 | 36195 | 374 | 34954.01 | 261456 | 0.0014305 | 1 | 36195 | 20001 | 1 | 52 | 0 |
4 | 1 | 1.0000000 | 34156 | 2000 | 342 | 34156 | 34156 | 374 | 34954.01 | 261456 | 0.0014305 | 1 | 0 | 20002 | 1 | 49 | 0 |
5 | 1 | 1.0000000 | 29376 | 2000 | 315 | 29376 | 29376 | 374 | 34954.01 | 261456 | 0.0014305 | 1 | 0 | 20003 | 1 | 42 | 0 |
1100 | 1 | 0.5583857 | 44403 | 2000 | 292 | 24794 | 24794 | 374 | 34954.01 | 261456 | 0.0014305 | 0 | 0 | 20004 | 2 | 64 | 1 |
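The rows above already show the self-weighting mechanism at work. The sketch below recomputes an approximate design weight for each of the six PSUs as the inverse of the product of the first-stage probability (‘Pik’) and a second-stage probability, and compares it with the inverse of the stratum sampling fraction. It assumes simple random sampling of SSUs inside each PSU and that ‘PSU_MOS’ counts the SSUs in the PSU; the weights actually produced later by ‘select_SSU’ may be computed differently.

```r
# First six rows of 'selected_PSU', copied from the printout above
psu <- data.frame(
  PSU_ID    = c(330, 309, 304, 342, 315, 292),
  Pik       = c(1, 1, 1, 1, 1, 0.5583857),   # first-stage inclusion probability
  PSU_MOS   = c(146162, 50845, 36195, 34156, 29376, 24794),
  n_ssu     = c(213, 74, 52, 49, 42, 64),    # PSU_final_sample_unit
  samp_frac = c(0.0014568, 0.0014568, 0.0014305,
                0.0014305, 0.0014305, 0.0014305)  # sampling_fraction
)

# Second-stage inclusion probability: SSUs sampled over SSUs in the PSU
pi_ssu <- psu$n_ssu / psu$PSU_MOS

# Design weight = inverse of the product of the two inclusion probabilities
psu$weight <- 1 / (psu$Pik * pi_ssu)

# Compare with the inverse of the stratum sampling fraction
psu$target_weight <- 1 / psu$samp_frac
psu$rel_diff <- psu$weight / psu$target_weight - 1
psu[, c("PSU_ID", "weight", "target_weight", "rel_diff")]
```

The small differences come from rounding the number of SSUs in each PSU to an integer.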
Finally, we are able to select the Secondary Stage Units (the individuals) from the already selected PSUs (the municipalities). First, we load the population frame:
load("pop.RData")
and we proceed to select the sample in this way:
samp <- select_SSU(df=pop,
                   PSU_code="municipality",
                   SSU_code="id_ind",
                   PSU_sampled=selected_PSU[selected_PSU$Sampled_PSU==1,],
                   verbose=FALSE)
We can check that the sample size is practically equal to the one determined in the allocation step:
nrow(samp)
## [1] 8302
sum(allocat$ALLOC)
## [1] 8300
and that the sum of the weights equals the population size:
nrow(pop)
## [1] 2258507
sum(samp$weight)
## [1] 2258507
This is the distribution of weights:
par(mfrow=c(1, 2))
boxplot(samp$weight,col="orange")
title("Weights distribution (total sample)",cex.main=0.7)
boxplot(weight ~ region, data=samp,col="orange")
title("Weights distribution by region",cex.main=0.7)
boxplot(weight ~ province, data=samp,col="orange")
title("Weights distribution by province",cex.main=0.7)
boxplot(weight ~ stratum, data=samp,col="orange")
title("Weights distribution by stratum",cex.main=0.7)
It can be seen that the sample is fully self-weighted inside strata, and approximately self-weighted in aggregations of strata, which is the result we wanted to obtain.
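The visual impression given by the boxplots can be turned into a number by computing the coefficient of variation of the weights within each group. The sketch below does so on an invented data frame with the same column names as ‘samp’; the helper ‘cv_weights’ is hypothetical and the weights are made up.

```r
# Hypothetical sample: weights constant inside strata, different across strata
samp_demo <- data.frame(
  region  = c("north", "north", "north", "north",
              "center", "center", "center", "center"),
  stratum = c("1000", "1000", "2000", "2000",
              "3000", "3000", "4000", "4000"),
  weight  = c(686, 686, 699, 699, 705, 705, 712, 712)
)

# Hypothetical helper: coefficient of variation of the weights in a group
cv_weights <- function(w) if (length(w) > 1) sd(w) / mean(w) else 0

# Zero within strata: fully self-weighting
cv_stratum <- tapply(samp_demo$weight, samp_demo$stratum, cv_weights)
cv_stratum

# Small but non-zero within regions: approximately self-weighting
cv_region <- tapply(samp_demo$weight, samp_demo$region, cv_weights)
cv_region
```

Applied to the real ‘samp’ object, the same two ‘tapply’ calls would be expected to give values equal or very close to zero by stratum and small positive values by region.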