The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This vignette explains the usage of the ipf() function,
which has been used for calibrating the labour force survey of Austria
for several years. It is based on the Iterative Proportional Fitting
algorithm and gives some flexibility about the details of the
implementation. See (Meraner, Gumprecht, and
Kowarik 2016) or vignette("methodology") for more
details.
We will assume the output of demo.eusilc() is our
population. From this population, a sample without replacement is drawn.
The sample covers 10 percent of the population. We assign a weight of
one for all observations of the population and a weight of ten for all
observations of the sample.
We will start with an example where we want to adapt the weights of
pop_sample such that the weighted number of males and
females matches the ones of population. We can see that
this is currently not the case.
(gender_distribution <- xtabs(pWeight ~ gender, population))
#> gender
#>   male female 
#>   7267   7560
xtabs(pWeight ~ gender, pop_sample)
#> gender
#>   male female 
#>   6930   7890Due to random sampling (rather than stratified sampling), there are
differences between the gender distributions. We can pass
gender_distribution as a parameter to ipf() to
obtain modified weights.
The resulting dataset, pop_sample_c is similar to
pop_sample but has an additional column with the adjusted
weights.
dim(pop_sample)
#> [1] 1482   30
dim(pop_sample_c)
#> [1] 1482   31
setdiff(names(pop_sample_c), names(pop_sample))
#> [1] "calibWeight"We can now calculate the weighted number of males and females according to this new weight. This will result in a match for the constraints.
xtabs(calibWeight ~ gender, pop_sample_c)
#> gender
#>   male female 
#>   7267   7560
xtabs(pWeight ~ gender, population)
#> gender
#>   male female 
#>   7267   7560In this simple case, ipf just performs a post
stratification step. This means, that all males and all females have the
same weight.
xtabs(~ calibWeight + gender, pop_sample_c)
#>                   gender
#> calibWeight        male female
#>   9.58174904942966    0    789
#>   10.4862914862915  693      0All females have been weighted down
(calibWeight < 10) to compensate for the
overrepresentation in the sample.
Let’s now assume that we want to put constraints on the number of
males and females for each age group. The numbers from the original
population can be obtained with xtabs().
(con_ga <- xtabs(pWeight ~ gender + age, population))
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(pWeight ~ gender + age, pop_sample)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1380     770    2200    1560      1020
#>   female      1380     780    2560    1980      1190Again, we can see that those constraints are not met. Supplying the
contingency table con_ga to ipf() will again
resolve this.
pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237Now we assume that we know the number of persons living in each nuts2 region from registry data.
However, those registry data does not provide any information about
age or gender. Therefore, the two contingency tables
(con_ga and registry_table) have to be
specified independently. This can be done by supplying a list to
conP
pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(pWeight ~ region, population)
#> region
#>    Burgenland     Carinthia Lower Austria      Salzburg        Styria 
#>           549          1078          2804           924          2295 
#>         Tyrol Upper Austria        Vienna    Vorarlberg 
#>          1317          2805          2322           733
xtabs(calibWeight ~ region, pop_sample_c2)
#> region
#>    Burgenland     Carinthia Lower Austria      Salzburg        Styria 
#>      548.9999     1077.9999     2804.0000      924.0000     2295.0001 
#>         Tyrol Upper Austria        Vienna    Vorarlberg 
#>     1317.0000     2804.9999     2322.0003      732.9998this time, the constraints are not matched perfectly. That is,
because we provided more than one constraint. therefore, the
ipf() algorithm had to work iteratively.
If the dataset has a household structure, household constraints can
be passed via the parameter conH. If this parameter is
used, it is also necessary to supply hid, which defines the
column names that contains household ids.
(conH1 <- xtabs(pWeight ~ hsize + region, data = population[!duplicated(hid)]))
#>      region
#> hsize Burgenland Carinthia Lower Austria Salzburg Styria Tyrol Upper Austria
#>     1         58       117           325      103    264   118           262
#>     2         82       126           345      102    260   149           321
#>     3         37        80           189       55    187    79           203
#>     4         33        63           169       71    122   102           168
#>     5         16        39           103       30     83    48           114
#>      region
#> hsize Vienna Vorarlberg
#>     1    431         67
#>     2    355         72
#>     3    175         44
#>     4     96         53
#>     5     50         34
pop_sample_hh <- ipf(pop_sample, hid = "hid", conH = list(conH1), w = "pWeight",
                     bound = 10)
xtabs(calibWeight ~ hsize + region, data = pop_sample_hh[!duplicated(hid)])
#>      region
#> hsize Burgenland Carinthia Lower Austria Salzburg Styria Tyrol Upper Austria
#>     1         58       117           325      103    264   118           262
#>     2         82       126           345      102    260   149           321
#>     3         37        80           189       55    187    79           203
#>     4         33        63           169       71    122   102           168
#>     5         16        39           103       30     83    48           114
#>      region
#> hsize Vienna Vorarlberg
#>     1    431         67
#>     2    355         72
#>     3    175         44
#>     4     96         53
#>     5     50         34If conP or conH contain several contingency
tables or if conP and conH are used at the
same time, the ipf algorithm will operate iteratively. This means that
the calibrated dataset will satisfy the constraints only approximately.
The default tolerances of the approximation can be overwritten using the
parameters conP and conH.
Lowering the tolerances will improve the match between the constraints and the contingency tables according to the calibrated weights. However, lower tolerances will also make it so more iterations are necessary until a convergence is met. If the constraints are too small, ipf will return with a warning that indicates that a convergence could not be reached.
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
    verbose = TRUE, epsP = 0.01)
#> Iteration stopped after 3 steps
#> Convergence reached
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
    verbose = TRUE, epsP = 0.0001)
#> Iteration stopped after 4 steps
#> Convergence reachedWe see that changing the tolerances from 0.01 (one
percent) to 0.0001 increases the number of required
iterations.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.