Introduction to the cellKey-Package

Bernhard Meindl

2023-03-10

About the cellKey package

This package implements methods to perturb statistical tables. The implementation is greatly inspired by the paper Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics (Thompson, Broadfoot, Elazar). The approach was generalized, however, and a new specification of how record keys are defined and how the lookup tables are structured is used. The package makes use of perturbation tables that can be defined with the ptable package. This document describes the usage of version 1.0.0 of the package.

Main Features

With the cellKey package it is possible to perturb multidimensional count and magnitude tables. Functionality to generate suitable record keys is provided in function ck_generate_rkeys(). By using sha1-checksums of the input data set, the same record keys are generated whenever the same input data set is used. It is of course also possible to use pre-generated record keys that are available as a variable in the microdata set.

The package supports sampling weights. Thus, both weighted and unweighted count and magnitude tables can be perturbed. The hierarchical structure of the variables spanning the table can be arbitrarily complex. To generate these hierarchies, the cellKey package depends on functionality available in package sdcHierarchies. Finally, auxiliary methods are provided that allow extracting valuable information from objects created by the main function of this package, perturbTable(). For example, mod_counts() returns a data object showing for each cell the perturbation value and where it was found in the lookup table.

What remains is to identify bugs and issues, to optimize the performance of the package and to include some real-world examples, e.g. census tables.

An Example

We now show the capabilities of the cellKey package by running an example.

Load the Package

library(cellKey)
packageVersion("cellKey")
## [1] '1.0.0'

The first step in this approach is to generate a statistical table, which can be achieved using ck_setup(). This function generates an object that contains all the information required to perturb count variables (and optionally continuously scaled variables). There are however a few inputs that ck_setup() requires and that need to be generated beforehand, such as the input data (including record keys) and the hierarchies of the dimensional variables.

We continue to show how to generate the required inputs. The first step is to prepare the input data.

Specifying inputdata

The test data set we are using in this example contains person-level information along with sampling weights as well as some categorical and continuously scaled variables. We also compute a binary variable for which we want to create perturbed tables as well.

dat <- ck_create_testdata()
dat <- dat[, c("sex", "age", "savings", "income", "sampling_weight")]
dat[, cnt_highincome := ifelse(income >= 9000, 1, 0)]

We are now adding record keys (which will be referenced later) with 7 digits to the data set using function ck_generate_rkeys as shown below:

dat$rkeys <- ck_generate_rkeys(dat = dat, nr_digits = 7)
print(head(dat))
##       sex        age savings income sampling_weight cnt_highincome     rkeys
## 1:   male age_group3      12   5780              99              0 0.6278257
## 2: female age_group3      28   2530              80              0 0.5779098
## 3:   male age_group1     550   6920              68              0 0.0455409
## 4:   male age_group1     870   7960              98              0 0.5637732
## 5:   male age_group4      20   9030              69              1 0.8930465
## 6: female age_group3     102   3290              88              0 0.9764589

To ensure that the same record keys are computed each and every time for the same data set, a seed based on the sha1-hash of the input dataset is computed by default in ck_generate_rkeys. This seed is used before sampling the record keys and may be overwritten using the seed argument.
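The idea of deriving the seed from the data itself can be sketched in base R. This is only an illustration and not the implementation used in ck_generate_rkeys(): the helper sketch_rkeys() is hypothetical, and it uses an md5 checksum from tools::md5sum() as a stand-in for the sha1-hash because base R does not ship a sha1 function.

```r
# Hypothetical helper (not part of cellKey): record keys that depend only
# on the content of the data set unless a seed is supplied explicitly.
sketch_rkeys <- function(dat, nr_digits = 7, seed = NULL) {
  if (is.null(seed)) {
    # write the data to a temporary file and hash its content
    tmp <- tempfile(fileext = ".csv")
    on.exit(unlink(tmp))
    write.csv(dat, tmp, row.names = FALSE)
    hash <- unname(tools::md5sum(tmp))
    # the first 7 hex digits always fit into a 32-bit integer
    seed <- strtoi(substr(hash, 1, 7), base = 16L)
  }
  set.seed(seed)  # note: this changes the global RNG state
  round(runif(nrow(dat)), digits = nr_digits)
}

toy <- data.frame(id = 1:5, sex = c("m", "f", "f", "m", "f"))
keys_a <- sketch_rkeys(toy)
keys_b <- sketch_rkeys(toy)
identical(keys_a, keys_b)
```

Because the seed is a function of the data, repeated calls on an unchanged data set give identical keys, while passing seed explicitly overrides this behaviour.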

The goal of this introduction is to create a perturbed table of counts of variables sex by age for all observations as well as for the subgroup given by non-zero values of cnt_highincome. We also want to create perturbed tables of the continuously scaled variables savings and income, using the same hierarchical structure defined by sex by age.

Specifying dimensions

It is required to define hierarchies for each of the classifying variables of the desired table. There are two ways in which the hierarchical structure of these variables (including sub-totals) can be specified. One way is to use the "@; value" format as (also) used in sdcTable. We suggest, however, using the second alternative, which relies on functionality from package sdcHierarchies. In this vignette, we only show the preferred way to generate hierarchies for the categorical variables age and sex below:

dim_sex <- hier_create(root = "Total", nodes = c("male", "female"))
hier_display(dim_sex)
## Total
## ├─male
## └─female

For variable age the process is very much the same:

dim_age <- hier_create(root = "Total", nodes = paste0("age_group", 1:6))
hier_display(dim_age)
## Total
## ├─age_group1
## ├─age_group2
## ├─age_group3
## ├─age_group4
## ├─age_group5
## └─age_group6

The idea of sdcHierarchies is to create a tree object using hier_create() and then add (using hier_add()), delete (with hier_delete()) or rename (using hier_rename()) elements of this hierarchical structure. For more examples have a look at the vignette of the package, which can be accessed with sdcHierarchies::hier_vignette(). sdcHierarchies also contains a shiny-based app that can be called with hier_app(). The app allows interactively modifying a hierarchy and converting tree- and data.frame-based inputs into various formats. For more information, have a look at ?hier_app and the other help files of the package.

After all dimensions have been specified, these inputs must then be combined into a named list. In this object, the list-names refer to variable names of the input data set and the elements the data objects that hold the hierarchy specification itself. This is why the list elements of dims in this example have to be named sex and age as the specification refers to variables age and sex in input set dat.

dims <- list(sex = dim_sex, age = dim_age)

Setup a table instance

We now have prepared the inputs and can define a generic statistical table using ck_setup() as shown below:

tab <- ck_setup(
  x = dat,
  rkey = "rkeys",
  dims = dims,
  w = "sampling_weight",
  countvars = "cnt_highincome",
  numvars = c("income", "savings"))

ck_setup() returns an R6 class object that contains not only the relevant data but also all available methods. Thus, it is not required to assign the results of such methods to new objects; instead, the object itself is automatically updated.
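The in-place update can be illustrated without the R6 package, because R6 objects are built on environments, which have reference semantics. The following base R sketch uses a plain environment as a stand-in; the names counter and increment() are hypothetical and unrelated to cellKey.

```r
# A plain environment as a stand-in for an R6 object
counter <- new.env()
counter$value <- 0

# The function modifies the object that was passed in, not a copy
increment <- function(obj) {
  obj$value <- obj$value + 1
  invisible(obj)
}

increment(counter)  # the result is not assigned ...
increment(counter)
counter$value       # ... but the object has changed nevertheless
```

This is the reason why calls such as tab$perturb() further below are written without an assignment.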

These objects also have a custom print method, showing some general information about the object:

print(tab)
## ── Table Information ───────────────────────────────────────────────────────────
## ✔ 21 cells in 2 dimensions ('sex', 'age')
## ✔ weights: yes
## ── Tabulated / Perturbed countvars ─────────────────────────────────────────────
## ☐ 'total'
## ☐ 'cnt_highincome'
## ── Tabulated / Perturbed numvars ───────────────────────────────────────────────
## ☐ 'income'
## ☐ 'savings'
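The number of cells reported above follows from the hierarchies: each dimension contributes its categories plus the total, so the table has (2 + 1) x (6 + 1) = 21 cells. A small base R sketch that enumerates them (independent of cellKey; the object names are chosen for illustration only):

```r
# all codes per dimension, including the overall total
sex_codes <- c("Total", "male", "female")
age_codes <- c("Total", paste0("age_group", 1:6))

# every combination of codes is one table cell
cells <- expand.grid(sex = sex_codes, age = age_codes,
                     stringsAsFactors = FALSE)
n_cells <- nrow(cells)
n_cells
head(cells, 4)
```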

Defining perturbation parameters

Perturbation parameters for count variables

The next task is to define parameters that are used to perturb count variables, which can be achieved with ck_params_cnts(). This function requires as input the result of either pt_create_pParams, pt_create_pTable or create_cnt_ptable from the ptable package. Please also refer to the documentation of that package for information on the required parameters. In this example we are going to use - amongst others - exemplary ptables that are provided by the ptable package for demonstration purposes:

# two different perturbation parameter sets from the ptable-pkg
# an example ptable provided directly
ptab1 <- ptable::pt_ex_cnts()

# creating a ptable by specifying parameters
para2 <- ptable::create_cnt_ptable(
  D = 8, V = 3, js = 2, pstay = 0.5, 
  optim = 1, mono = TRUE)

We then need to create the required inputs for the cellKey package.

p_cnts1 <- ck_params_cnts(ptab = ptab1)
p_cnts2 <- ck_params_cnts(ptab = para2)

ck_params_cnts() returns objects that can be used as inputs in method params_cnts_set(). In argument v one may specify count variables for which the supplied perturbation parameters should be used. If v is not specified, the perturbation parameters are used for all count variables.

# use `p_cnts1` for variable "total" (which always exists)
tab$params_cnts_set(val = p_cnts1, v = "total")
## --> setting perturbation parameters for variable 'total'
# use `p_cnts2` for "cnt_highincome"
tab$params_cnts_set(val = p_cnts2, v = "cnt_highincome")
## --> setting perturbation parameters for variable 'cnt_highincome'

It is therefore entirely possible to use different parameter sets for different variables. Modifying perturbation parameters for some variables is easy, too. It is only required to apply the params_cnts_set()-method again which will replace any previously defined parameters.

Perturbation parameters for continuous variables

Setting and defining perturbation parameters for continuous variables works similarly. The required function is ck_params_nums(), which creates input objects that can be set with the params_nums_set() method. Please note that by specifying the path argument in both ck_params_nums() and ck_params_cnts(), it is possible to additionally save the parameters as a yaml file. Using ck_read_yaml(), these files can later be imported again. This feature is useful for re-using parameter settings.

The underlying framework for perturbing continuous tables differs from the method proposed by the ABS. One possible approach is based on a “flex function”. This approach (which is described in deliverable D4.2 in the project perturbative confidentiality methods) allows applying different magnitudes of noise to larger and smaller cells. Users can define the required parameters for the flex-approach with function ck_flexparams(). The required inputs are:

  • fp: the flexpoint defining at which point the underlying noise coefficient function should reach its desired maximum (which is defined by the first element of p)
  • p: numeric vector of length 2 with p[1] > p[2] where both elements specify a percentage. The first value refers to the desired maximum perturbation percentage for small cells (depending on fp) while the second element refers to the desired maximum perturbation percentage for large cells.
  • epsilon: a numeric vector in descending order with all values in [0; 1] and with the first element forced to equal 1. The length of this parameter must correspond to the number of top_k specified in ck_params_nums() (which will be discussed later).

# parameters for the flex-function
p_flex <- ck_flexparams(
  fp = 1000,
  p = c(0.3, 0.03),
  epsilon = c(1, 0.5, 0.2))

In the cellKey package it is possible to select the underlying data that form the base for the perturbation differently. In ck_params_nums() the specific approach can be selected in argument type. The valid choices for this argument are:

  • "top_contr": the k largest contributions to each cell are used in the perturbation procedure with the number k required to be specified in argument top_k
  • "mean": weighted cellmeans are used as starting points
  • "range": the difference between largest and smallest unweighted contributions for each cell are used as base for the perturbation procedure
  • "sum": weighted cellvalues are used as starting points for the perturbation

Another, more basic approach is to use a constant perturbation magnitude for all cells, independent of their (weighted) values. The required parameters can be defined with ck_simpleparams() as shown below:

# parameters for the simple approach
p_simple <- ck_simpleparams(
  p = 0.05,
  epsilon = 1)

In this approach it is only required to specify a single percentage value p and - as in the case of the flex function - a vector of epsilons that are used when top_k > 1.

Further important parameters for ck_params_nums() are:

  • mu_c: an extra amount of perturbation applied to sensitive cells (restricted to the first of top_k noise components). In the following example we demonstrate how to identify sensitive cells for numeric variables.
  • same_key: a logical value specifying if the original cell key (TRUE) should be used for the lookup of the largest contributor of a cell or if a perturbation of the cellkey itself (FALSE) should take place.
  • use_zero_rkeys: a logical value defining if record keys of units not contributing to a specific numeric variable should be used (TRUE) or ignored (FALSE) when cell keys are computed.
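The effect of use_zero_rkeys can be illustrated with a small base R sketch. It assumes, as in the general cell-key idea, that a cell key is the fractional part of the sum of the record keys of the units in a cell; the exact computation inside cellKey is not shown in this document, so this is an assumption made for illustration. The helper cell_key() and the toy values are hypothetical.

```r
# record keys and contributions of four units in one cell
rkeys  <- c(0.25, 0.5, 0.75, 0.625)
income <- c(10, 0, 5, 20)  # the second unit does not contribute

# assumed definition: fractional part of the sum of record keys
cell_key <- function(keys) sum(keys) %% 1

# all units are used (corresponds to use_zero_rkeys = TRUE)
ckey_all <- cell_key(rkeys)

# only contributing units are used (corresponds to use_zero_rkeys = FALSE)
ckey_nonzero <- cell_key(rkeys[income != 0])

c(all = ckey_all, nonzero = ckey_nonzero)
```

Under this assumption the two settings can lead to different cell keys for the same cell and therefore to different perturbation values being looked up.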

A very important parameter is ptab which actually holds the perturbation tables in which perturbation values are looked up. This input can be specified differently in the case when numeric variables should be perturbed. It can be either an object derived from ptable::pt_create_pTable(..., table = "nums") in the most simple case. More advanced is to supply a named list, where the allowed names are shown below and each element must be the output of ptable::pt_create_pTable(..., table = "nums").

  • "all": this ptable will be used for all cells; if specified, list-elements named "even" or "odd" are ignored
  • "even": this perturbation table will be used to look up perturbation values for cells with an even number of contributors
  • "odd": will be used to look up perturbation values for cells with an odd number of contributors
  • "small_cells": if specified, this ptable will be used to extract perturbation values for very small cells

Please note that if the goal is to have different perturbation tables for cells with an even/odd number of contributors, both "even" and "odd" must be available in the input list. In the chunks below we create example perturbation tables. For details on the parameters, please look at the documentation of the ptable package, especially ptable::create_num_ptable().

# same ptable for all cells except for very small ones
ex_ptab1 <- ptable::pt_ex_nums(parity = TRUE, separation = TRUE)

We can now use these tables to finally create objects containing all the required information to create perturbed magnitude tables using ck_params_nums. In the first case we want the same perturbation table (ptab_all) for cells with an even/odd number of contributors but want to use ptab_sc for very small cells.

p_nums1 <- ck_params_nums(
  type = "top_contr",
  top_k = 3,
  ptab = ex_ptab1,
  mult_params = p_flex,
  mu_c = 2,
  same_key = FALSE,
  use_zero_rkeys = TRUE)

The second input we generate should use different ptables for cells with an even/odd number of contributing units (ptab_even and ptab_odd) but should not use a specific perturbation table for very small cells.

ex_ptab2 <- ptable::pt_ex_nums(parity = FALSE, separation = FALSE)

As above, we need to use ck_params_nums() to compute suitable inputs.

p_nums2 <- ck_params_nums(
  type = "mean",
  ptab = ex_ptab2,
  mult_params = p_simple,
  mu_c = 1.5,
  same_key = FALSE,
  use_zero_rkeys = TRUE)

The package internally computes the separation point that is used for very small cells in case this is required. Details on this can also be found in deliverable D4.2.

Now we can attach the results from ck_params_nums() to numeric variables using the params_nums_set()-method as shown below:

tab$params_nums_set(v = "income", val = p_nums1)
## --> setting perturbation parameters for variable 'income'
tab$params_nums_set(v = "savings", val = p_nums1)
## --> setting perturbation parameters for variable 'savings'

In order to make use of parameter mu_c, which allows adding an extra amount of protection to sensitive cells, one may identify sensitive cells according to some rules. The following methods to identify sensitive cells are implemented:

  • supp_p(): identify sensitive cells based on p%-rule
  • supp_pq(): identify sensitive cells based on pq%-rule
  • supp_nk(): identify sensitive cells based on nk-dominance rule
  • supp_freq(): identify sensitive cells based on minimal frequencies for (weighted) number of contributors
  • supp_val(): identify sensitive cells based on (weighted) cell values
  • supp_cells(): identify sensitive cells based on their “names”

We now want to mark all cells of variable income as sensitive to which fewer than 15 units contribute.

tab$supp_freq(v = "income", n = 15, weighted = FALSE)
## freq-rule: 3 new sensitive cells (incl. duplicates) found (total: 3)
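The logic of the frequency rule can be reproduced in base R. The sketch below uses the unweighted cell counts of variable total as they appear in the freqtab() output later in this document and assumes that every unit in a cell counts as a contributor to income; under this assumption the result agrees with the three sensitive cells reported above. It is not the package's implementation.

```r
# unweighted counts of the 21 cells (sex x age, including totals)
uwc <- c(4580, 1969, 1143, 864, 423, 168, 13,
         2296, 1015,  571, 424, 195,  84,  7,
         2284,  954,  572, 440, 228,  84,  6)

n <- 15               # minimal number of contributors
sensitive <- uwc < n  # cells with fewer than n contributors
n_sensitive <- sum(sensitive)
n_sensitive
uwc[sensitive]        # the three age_group6 cells
```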

To mark specific cells as sensitive based on their names rather than their values, one may use the $supp_cells()-method. This method requires a data.frame as input that contains a column for each dimensional variable specified. Each row of this input is considered as a cell where NAs are used as placeholders that match any characteristic of the relevant variable. Using the data.frame inp shown below, the program would suppress the following cells:

  • female x age_group1
  • male x age_group3
  • male x any age group available in the data

inp <- data.frame(
  "sex" = c("female", "male", "male"),
  "age" = c("age_group1", "age_group3", NA)
)
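How NA placeholders select cells can be sketched in base R as follows. The helper matches_row() is hypothetical and only mimics the matching rule described above; it assumes that NA also matches the total category of a variable, which the text does not state explicitly.

```r
# all cells of the table (sex x age, including totals)
cells <- expand.grid(
  sex = c("Total", "male", "female"),
  age = c("Total", paste0("age_group", 1:6)),
  stringsAsFactors = FALSE)

pattern <- data.frame(
  sex = c("female", "male", "male"),
  age = c("age_group1", "age_group3", NA),
  stringsAsFactors = FALSE)

# TRUE for each cell that matches one pattern row; NA matches anything
matches_row <- function(cells, pattern_row) {
  keep <- rep(TRUE, nrow(cells))
  for (v in names(pattern_row)) {
    val <- pattern_row[[v]]
    if (!is.na(val)) keep <- keep & cells[[v]] == val
  }
  keep
}

# a cell is selected if it matches at least one row of the pattern
hits <- lapply(seq_len(nrow(pattern)), function(i) {
  matches_row(cells, pattern[i, , drop = FALSE])
})
flagged <- Reduce(`|`, hits)
cells[flagged, ]
```

Note that the second pattern row is redundant here, because male x age_group3 is already covered by the third row.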

Compute perturbed tables

It is now finally possible to perturb variables using the perturb()-method. As tab - the object created with ck_setup() - already contains all the data needed, the only required input is the name of a variable that should be perturbed.

tab$perturb(v = "total")
## Count variable 'total' was perturbed.

After this call, object tab is updated and now also contains perturbed values for variable total. We note that no explicit assignment is required. The following code shows that we can also perturb count and numerical variables in a single call.

tab$perturb(v = c("cnt_highincome", "savings", "income"))
## Count variable 'cnt_highincome' was perturbed.
## Numeric variable 'savings' was perturbed.
## Numeric variable 'income' was perturbed.

A data.table containing original and perturbed values can now be extracted using the freqtab()- and numtab() methods as discussed next.

Extracting results

Obtain perturbed tables for count tables

Applying the freqtab()-method to one or more already perturbed variables returns a data.table that contains for each table cell the unperturbed and perturbed (weighted and/or unweighted) counts. This function has the following arguments:

  • v: one or more variable names of already perturbed count variables
  • path: if not NULL, a (relative or absolute) path to which the resulting output table should be written. A csv file will be generated and .csv will be appended to the value provided.

The method returns a data.table with all combinations of the dimensional variables in the first \(n\) columns and after those the following columns:

  • vname: name of the perturbed variable
  • uwc: unweighted counts
  • wc: weighted counts
  • puwc: perturbed unweighted counts
  • pwc: perturbed weighted counts

tab$freqtab(v = c("total", "cnt_highincome"))
##        sex        age          vname  uwc     wc puwc         pwc
##  1:  Total      Total          total 4580 274237 4580 274237.0000
##  2:  Total age_group1          total 1969 117458 1968 117398.3464
##  3:  Total age_group2          total 1143  67825 1144  67884.3395
##  4:  Total age_group3          total  864  52688  863  52627.0185
##  5:  Total age_group4          total  423  25022  424  25081.1537
##  6:  Total age_group5          total  168  10483  168  10483.0000
##  7:  Total age_group6          total   13    761   11    643.9231
##  8:   male      Total          total 2296 138349 2295 138288.7435
##  9:   male age_group1          total 1015  61295 1016  61355.3892
## 10:   male age_group2          total  571  33753  571  33753.0000
## 11:   male age_group3          total  424  26312  423  26249.9434
## 12:   male age_group4          total  195  11538  194  11478.8308
## 13:   male age_group5          total   84   5064   84   5064.0000
## 14:   male age_group6          total    7    387    8    442.2857
## 15: female      Total          total 2284 135888 2283 135828.5044
## 16: female age_group1          total  954  56163  954  56163.0000
## 17: female age_group2          total  572  34072  572  34072.0000
## 18: female age_group3          total  440  26376  438  26256.1091
## 19: female age_group4          total  228  13484  228  13484.0000
## 20: female age_group5          total   84   5419   85   5483.5119
## 21: female age_group6          total    6    374    5    311.6667
## 22:  Total      Total cnt_highincome  445  27069  447  27190.6584
## 23:  Total age_group1 cnt_highincome  192  11255  192  11255.0000
## 24:  Total age_group2 cnt_highincome  123   7609  127   7856.4472
## 25:  Total age_group3 cnt_highincome   82   5081   82   5081.0000
## 26:  Total age_group4 cnt_highincome   34   2258   31   2058.7647
## 27:  Total age_group5 cnt_highincome   14    866   15    927.8571
## 28:  Total age_group6 cnt_highincome    0      0    0      0.0000
## 29:   male      Total cnt_highincome  219  13610  219  13610.0000
## 30:   male age_group1 cnt_highincome   90   5457   89   5396.3667
## 31:   male age_group2 cnt_highincome   66   4229   66   4229.0000
## 32:   male age_group3 cnt_highincome   41   2611   41   2611.0000
## 33:   male age_group4 cnt_highincome   15    963   13    834.6000
## 34:   male age_group5 cnt_highincome    7    350    7    350.0000
## 35:   male age_group6 cnt_highincome    0      0    0      0.0000
## 36: female      Total cnt_highincome  226  13459  226  13459.0000
## 37: female age_group1 cnt_highincome  102   5798  101   5741.1569
## 38: female age_group2 cnt_highincome   57   3380   57   3380.0000
## 39: female age_group3 cnt_highincome   41   2470   41   2470.0000
## 40: female age_group4 cnt_highincome   19   1295   22   1499.4737
## 41: female age_group5 cnt_highincome    7    516    7    516.0000
## 42: female age_group6 cnt_highincome    0      0    0      0.0000
##        sex        age          vname  uwc     wc puwc         pwc
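The printed values appear to be consistent with the relation pwc = puwc * wc / uwc, i.e. the perturbed unweighted count scaled by the average weight of the cell. This relation is inferred from the output above and is not a documented statement about the package internals. A quick base R check for three of the cells:

```r
# values copied from rows 2, 7 and 14 of the output above
uwc  <- c(1969, 13, 7)
wc   <- c(117458, 761, 387)
puwc <- c(1968, 11, 8)

# perturbed unweighted count times the average weight of the cell
# (for empty cells with uwc = 0 this ratio is not defined)
pwc <- puwc * wc / uwc
round(pwc, 4)
```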

Obtain perturbed tables for magnitude tables

The numtab()-method allows extracting results for continuous variables. The required inputs are the same as for the freqtab()-method and the output is a data.table with all combinations of the dimensional variables in the first \(n\) columns and the following additional columns:

  • vname: name of the perturbed variable
  • uws: unweighted sum
  • ws: weighted cellsum
  • pws: perturbed weighted sum of the given cell

We now have a look at the results of the variables savings and income that we already have perturbed.

tab$numtab(v = c("savings", "income"))
##        sex        age   vname      uws         ws          pws
##  1:  Total      Total savings  2273532  136170447  136173843.1
##  2:  Total age_group1 savings   982386   58442508   58448803.0
##  3:  Total age_group2 savings   552336   32563281   32561910.1
##  4:  Total age_group3 savings   437101   26605134   26614735.9
##  5:  Total age_group4 savings   214661   12960961   12963684.0
##  6:  Total age_group5 savings    80451    5205377    5203819.5
##  7:  Total age_group6 savings     6597     393186     386674.0
##  8:   male      Total savings  1159816   69853100   69855211.2
##  9:   male age_group1 savings   517660   31084464   31090585.8
## 10:   male age_group2 savings   280923   16450477   16454439.8
## 11:   male age_group3 savings   214970   13274983   13281374.2
## 12:   male age_group4 savings    99420    6147416    6143482.0
## 13:   male age_group5 savings    43233    2674894    2677686.0
## 14:   male age_group6 savings     3610     220866     219237.3
## 15: female      Total savings  1113716   66317347   66309167.0
## 16: female age_group1 savings   464726   27358044   27352666.2
## 17: female age_group2 savings   271413   16112804   16114738.4
## 18: female age_group3 savings   222131   13330151   13326818.5
## 19: female age_group4 savings   115241    6813545    6813337.9
## 20: female age_group5 savings    37218    2530483    2530421.2
## 21: female age_group6 savings     2987     172320     172115.2
## 22:  Total      Total  income 22952978 1372531159 1372562394.8
## 23:  Total age_group1  income  9810547  583385782  583439989.1
## 24:  Total age_group2  income  5692119  339883540  339863756.2
## 25:  Total age_group3  income  4406946  267407882  267489603.4
## 26:  Total age_group4  income  2133543  124903566  124952624.2
## 27:  Total age_group5  income   848151   53297550   53277370.7
## 28:  Total age_group6  income    61672    3652839    3560784.8
## 29:   male      Total  income 11262049  679518216  679536671.7
## 30:   male age_group1  income  4877164  294032335  294056625.4
## 31:   male age_group2  income  2811379  167614619  167662763.8
## 32:   male age_group3  income  2168169  134600127  134649749.8
## 33:   male age_group4  income   978510   57622860   57519554.9
## 34:   male age_group5  income   393134   23639144   23669082.2
## 35:   male age_group6  income    33693    2009131    1994026.5
## 36: female      Total  income 11690929  693012943  692882453.8
## 37: female age_group1  income  4933383  289353447  289288097.0
## 38: female age_group2  income  2880740  172268921  172293631.1
## 39: female age_group3  income  2238777  132807755  132777518.5
## 40: female age_group4  income  1155033   67280706   67268674.1
## 41: female age_group5  income   455017   29658406   29655554.7
## 42: female age_group6  income    27979    1643708    1680270.2
##        sex        age   vname      uws         ws          pws

Utility-measures

Utility measures for count variables

Method measures_cnts() allows to compute information loss measures for perturbed count variables. Its application is as simple as:

tab$measures_cnts(v = "total", exclude_zeros = TRUE)
## $overview
##    noise cnt       pct
## 1:    -1   5 0.2380952
## 2:     0   7 0.3333333
## 3:     1   7 0.3333333
## 4:     2   2 0.0952381
## 
## $measures
##       what    d1    d2    d3
##  1:    Min 0.000 0.000 0.000
##  2:    Q10 0.000 0.000 0.000
##  3:    Q20 0.000 0.000 0.000
##  4:    Q30 0.000 0.000 0.000
##  5:    Q40 1.000 0.000 0.010
##  6:   Mean 0.762 0.024 0.045
##  7: Median 1.000 0.001 0.015
##  8:    Q60 1.000 0.001 0.017
##  9:    Q70 1.000 0.002 0.024
## 10:    Q80 1.000 0.005 0.048
## 11:    Q90 1.000 0.143 0.183
## 12:    Q95 2.000 0.154 0.213
## 13:    Q99 2.000 0.164 0.274
## 14:    Max 2.000 0.167 0.289
## 
## $cumdistr_d1
##    cat cnt       pct
## 1:   0   7 0.3333333
## 2:   1  19 0.9047619
## 3:   2  21 1.0000000
## 
## $cumdistr_d2
##            cat cnt       pct
## 1:    [0,0.02]  18 0.8571429
## 2: (0.02,0.05]  18 0.8571429
## 3:  (0.05,0.1]  18 0.8571429
## 4:   (0.1,0.2]  21 1.0000000
## 5:   (0.2,0.3]  21 1.0000000
## 6:   (0.3,0.4]  21 1.0000000
## 7:   (0.4,0.5]  21 1.0000000
## 8:   (0.5,Inf]  21 1.0000000
## 
## $cumdistr_d3
##            cat cnt       pct
## 1:    [0,0.02]  13 0.6190476
## 2: (0.02,0.05]  17 0.8095238
## 3:  (0.05,0.1]  18 0.8571429
## 4:   (0.1,0.2]  19 0.9047619
## 5:   (0.2,0.3]  21 1.0000000
## 6:   (0.3,0.4]  21 1.0000000
## 7:   (0.4,0.5]  21 1.0000000
## 8:   (0.5,Inf]  21 1.0000000
## 
## $false_zero
## [1] 0
## 
## $false_nonzero
## [1] 0
## 
## $exclude_zeros
## [1] TRUE

This function returns a list with several utility measures that are now discussed. For further information have a look at ?ck_cnt_measures as the same set of measures can also be computed for two vectors containing original and perturbed values.

  • overview: a data.table with the following three columns:
    • noise: amount of noise computed as orig - pert
    • cnt: number of cells perturbed with the value given in column noise
    • pct: percentage of cells perturbed with the value given in column noise
  • measures: a data.table containing measures of the distribution of three different distances between original and perturbed values of the unweighted counts. Column what specifies the computed measure. The three distances considered are:
    • d1: absolute distance between original and masked values
    • d2: relative absolute distance between original and masked values
    • d3: absolute distance between square-roots of original and perturbed values
  • cumdistr_d1, cumdistr_d2 and cumdistr_d3: for each distance d1, d2 and d3, a data.table with the following three columns:
    • cat: a specific value (for d1) or interval (for distances d2 and d3)
    • cnt: number of cells with a distance smaller than or equal to the value in column cat for the given distance
    • pct: proportion of cells with a distance smaller than or equal to the value in column cat for the selected distance
  • false_zero: number of cells that were perturbed to zero
  • false_nonzero: number of cells that were initially zero but have been perturbed to a number different from zero

If argument exclude_zeros is TRUE (the default setting), empty cells are excluded when computing distance-based measures d1, d2 and d3.
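The distances can be recomputed by hand in base R. The sketch below uses the unweighted original and perturbed counts of variable total from the freqtab() output above and assumes d2 = |orig - pert| / orig and d3 = |sqrt(orig) - sqrt(pert)|, which matches the verbal definitions and the printed results. It is a cross-check and not the code used by measures_cnts().

```r
# original and perturbed unweighted counts for the 21 cells
orig <- c(4580, 1969, 1143, 864, 423, 168, 13,
          2296, 1015,  571, 424, 195,  84,  7,
          2284,  954,  572, 440, 228,  84,  6)
pert <- c(4580, 1968, 1144, 863, 424, 168, 11,
          2295, 1016,  571, 423, 194,  84,  8,
          2283,  954,  572, 438, 228,  85,  5)

# distribution of the noise as in element "overview"
overview <- table(noise = orig - pert)

d1 <- abs(orig - pert)              # absolute distance
d2 <- abs(orig - pert) / orig       # relative absolute distance
d3 <- abs(sqrt(orig) - sqrt(pert))  # distance between square roots

overview
round(c(mean_d1 = mean(d1), max_d2 = max(d2), max_d3 = max(d3)), 3)
```

The rounded results agree with rows Mean (d1) and Max (d2, d3) of element measures shown above.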

Utility measures for continuous variables

For now, no utility measures for continuous variables are available. This will change in a future version.

Additional Features

Finally we note that there are print and summary methods implemented for objects created with ck_setup(), which can be used as shown below.

The print()-method shows the dimension of the table as well as the variables that are (or possibly can be) perturbed. It is also displayed if the table was constructed using weights.

tab$print() # same as (print(tab))
## ── Table Information ───────────────────────────────────────────────────────────
## ✔ 21 cells in 2 dimensions ('sex', 'age')
## ✔ weights: yes
## ── Tabulated / Perturbed countvars ─────────────────────────────────────────────
## ☒ 'total' (perturbed)
## ☒ 'cnt_highincome' (perturbed)
## ── Tabulated / Perturbed numvars ───────────────────────────────────────────────
## ☒ 'income' (perturbed)
## ☒ 'savings' (perturbed)

The summary-method shows some utility statistics for already perturbed variables as shown below:

tab$summary()
## ┌──────────────────────────────────────────────┐
## │Utility measures for perturbed count variables│
## └──────────────────────────────────────────────┘
## ── Distribution statistics of perturbations ────────────────────────────────────
##          countvar Min Q10 Q20 Q30 Q40   Mean Median Q60 Q70 Q80 Q90 Q95 Q99 Max
## 1:          total  -2  -1  -1  -1  -1 -0.286      0   0   0   1   1   1 1.0   1
## 2: cnt_highincome  -3  -1   0   0   0  0.143      0   0   0   0   2   3 3.8   4
## 
## ── Distance-based measures ─────────────────────────────────────────────────────
## ✔ Variable: 'total'
## 
##       what    d1    d2    d3
##  1:    Min 0.000 0.000 0.000
##  2:    Q10 0.000 0.000 0.000
##  3:    Q20 0.000 0.000 0.000
##  4:    Q30 0.000 0.000 0.000
##  5:    Q40 1.000 0.000 0.010
##  6:   Mean 0.762 0.024 0.045
##  7: Median 1.000 0.001 0.015
##  8:    Q60 1.000 0.001 0.017
##  9:    Q70 1.000 0.002 0.024
## 10:    Q80 1.000 0.005 0.048
## 11:    Q90 1.000 0.143 0.183
## 12:    Q95 2.000 0.154 0.213
## 13:    Q99 2.000 0.164 0.274
## 14:    Max 2.000 0.167 0.289
## 
## ✔ Variable: 'cnt_highincome'
## 
##       what    d1    d2    d3
##  1:    Min 0.000 0.000 0.000
##  2:    Q10 0.000 0.000 0.000
##  3:    Q20 0.000 0.000 0.000
##  4:    Q30 0.000 0.000 0.000
##  5:    Q40 0.000 0.000 0.000
##  6:   Mean 0.944 0.028 0.073
##  7: Median 0.000 0.000 0.000
##  8:    Q60 1.000 0.006 0.048
##  9:    Q70 1.000 0.011 0.053
## 10:    Q80 2.000 0.056 0.160
## 11:    Q90 3.000 0.102 0.264
## 12:    Q95 3.150 0.137 0.277
## 13:    Q99 3.830 0.154 0.321
## 14:    Max 4.000 0.158 0.332
## 
## ┌──────────────────────────────────────────────────┐
## │Utility measures for perturbed numerical variables│
## └──────────────────────────────────────────────────┘
## ── Distribution statistics of perturbations ────────────────────────────────────
##      vname         Min        Q10        Q20        Q30       Q40      Mean
## 1:  income -130489.209 -92054.182 -30236.471 -19783.823 -12031.94 -2068.528
## 2: savings   -8180.028  -5377.784  -3332.478  -1557.465   -207.13   617.254
##       Median       Q60       Q70       Q80       Q90       Q95       Q99
## 1: 18455.682 24710.086 31235.785 48144.774 49622.849 54207.067 76218.571
## 2:   -61.752  2111.192  2792.006  3962.805  6294.981  6391.223  8959.746
##          Max
## 1: 81721.448
## 2:  9601.877

Summary

The package is “work in progress” and therefore suggestions and/or bug reports are welcome. Please feel free to file an issue at our issue tracker or contribute to the package by filing a pull request against the master branch.