
Transformations

Out of the box, deident provides a set of transformations to aid in the de-identification of data sets. Each transformation is implemented as an R6Class that extends BaseDeident. User-defined transformations can be implemented in the same way.

To demonstrate the different transformations, we supply a toy data set, df, comprising 26 observations of three variables:

A: character, a to z
B: numeric, 1 to 26
C: character, X if B <= 13, Y if B > 13
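For readers who want to follow along without the packaged data, the description above can be reproduced in base R. This is a sketch built only from the three bullet points; the object shipped with deident may differ in details such as column types.

```r
# Toy data set as described above: 26 rows, three variables.
df <- data.frame(
  A = letters,               # character, "a" to "z"
  B = 1:26,                  # numeric, 1 to 26
  stringsAsFactors = FALSE   # keep A as character on older R versions
)
df$C <- ifelse(df$B <= 13, "X", "Y")   # "X" if B <= 13, "Y" otherwise

str(df)
```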

Pseudonymizer

Apply a cached random replacement cipher: each distinct key is assigned a random replacement, and any re-occurrence of the same key receives the same replacement.

Implemented deident options:

deident(df, "psudonymize", A)
deident(df, "Pseudonymizer", A)
deident(df, Pseudonymizer, A)
deident(df, Pseudonymizer$new(), A)

psu <- Pseudonymizer$new()
deident(df, psu, A)

Options

By default, Pseudonymizer replaces each value in a variable with a random alphanumeric string of 5 characters. To change this, call set_method on an instantiated Pseudonymizer with the desired function:

psu <- Pseudonymizer$new()

new_method <- function(key, ...) {
  paste(sample(letters, 12, replace = TRUE), collapse = "")
}

psu$set_method(new_method)

deident(df, psu, A)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Pseudonymizer' on variable(s) A 
#> For data:
#>    columns: A, B, C

The first argument to the method receives the key to be transformed.
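The caching behaviour described in this section can be illustrated in base R. The helper make_pseudonymizer below is hypothetical and is not part of deident; it only sketches the idea that a replacement is generated once per key and then reused, and it does not guard against two keys drawing the same random string.

```r
# Hypothetical helper (not deident's implementation): returns a function
# that remembers the replacement issued for each key.
make_pseudonymizer <- function(method) {
  cache <- new.env(parent = emptyenv())
  function(keys) {
    vapply(as.character(keys), function(key) {
      if (!exists(key, envir = cache, inherits = FALSE)) {
        assign(key, method(key), envir = cache)   # first sighting: generate
      }
      get(key, envir = cache, inherits = FALSE)   # otherwise: reuse
    }, character(1), USE.NAMES = FALSE)
  }
}

# Same shape as the custom method above: the key arrives as first argument.
new_method <- function(key, ...) {
  paste(sample(letters, 12, replace = TRUE), collapse = "")
}

pseudonymize <- make_pseudonymizer(new_method)
first  <- pseudonymize(c("a", "b", "a"))
second <- pseudonymize("a")

first[1] == first[3]   # TRUE: repeated key, same replacement
first[1] == second     # TRUE: the cache persists across calls
```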

Shuffler

Implemented deident options:

deident(df, "shuffle", A)
deident(df, "Shuffler", A)
deident(df, Shuffler, A)
deident(df, Shuffler$new(), A)

shuffle <- Shuffler$new()
deident(df, shuffle, A)
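As a rough illustration, and assuming that Shuffler randomly permutes the values of a column (as its name and the GroupedShuffler description suggest), the effect on a single vector can be sketched in base R:

```r
set.seed(1)        # only to make this sketch reproducible
A <- letters

shuffled <- sample(A)   # random permutation of the values

# The set of values is unchanged; only their order differs.
identical(sort(shuffled), sort(A))   # TRUE
```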

Encrypter

Apply cryptographic hashing to a variable.

Implemented deident options:

deident(df, "encrypt", A)
deident(df, "Encrypter", A)
deident(df, Encrypter, A)
deident(df, Encrypter$new(), A)

encrypt <- Encrypter$new()
deident(df, encrypt, A)

Options

At initialization, Encrypter can be given hash_key and seed values to control the hashing. Users are advised to set both values and to keep them secret.

encrypt <- Encrypter$new(hash_key="deident_hash_key_123", seed=202)
deident(df, encrypt, A)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Encrypter' on variable(s) A 
#> For data:
#>    columns: A, B, C

Perturber

Apply Gaussian white noise to a numeric variable.

Implemented deident options:

deident(df, "perturb", B)
deident(df, "Perturber", B)
deident(df, Perturber, B)
deident(df, Perturber$new(), B)

perturb <- Perturber$new()
deident(df, perturb, B)

Options

At initialization, Perturber can be given a noise function that sets the scale of the white noise. The commented-out example below passes adaptive_noise(0.2) through the noise argument.

# perturb <- Perturber$new(noise=adaptive_noise(0.2))
# deident(df, perturb, B)
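Independently of the package, adding Gaussian white noise to a numeric vector can be sketched in base R as follows. The name sd_noise and its value are illustrative assumptions and do not reflect deident's defaults or how adaptive_noise scales the noise.

```r
set.seed(42)       # only to make this sketch reproducible
B <- 1:26

sd_noise <- 0.2    # illustrative standard deviation, not a package default
perturbed <- B + rnorm(length(B), mean = 0, sd = sd_noise)

# Inspect the noise added to the first few values.
head(round(perturbed - B, 3))
```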

Blurer

Aggregate categorical values according to a user-supplied list. The list must be supplied to Blurer at initialization.

Implemented deident options:

letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

blur <- Blurer$new(blur = letter_blur)
deident(df, blur, A)
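The mapping that letter_blur encodes is an ordinary named-vector lookup. The base R sketch below shows that lookup on its own; it is not a claim about how Blurer applies it internally.

```r
letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

A <- letters

# Index the named vector by the original values to get the coarser labels.
blurred <- unname(letter_blur[A])

table(blurred)   # 13 "Early", 13 "Late"
```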

NumericBlurer

Aggregate numeric values according to a user-supplied vector of breaks (cuts). If no vector is supplied, NumericBlurer defaults to a binary classification about 0.

Implemented deident options:

deident(df, "numeric_blur", B)
deident(df, "NumericBlurer", B)
deident(df, NumericBlurer, B)
deident(df, NumericBlurer$new(), B)

numeric_blur <- NumericBlurer$new()
deident(df, numeric_blur, B)

Options

At initialization, NumericBlurer takes an argument cuts that defines the limits of each interval.

numeric_blur <- NumericBlurer$new(cuts=c(5, 10, 15, 20))
deident(df, numeric_blur, B)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'NumericBlurer' on variable(s) B 
#> For data:
#>    columns: A, B, C
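Binning on a vector of cuts resembles base R's cut(). The sketch below assumes open-ended outer intervals and right-closed bins, which are cut()'s defaults; whether NumericBlurer uses the same conventions is not stated here.

```r
B <- 1:26
cuts <- c(5, 10, 15, 20)

# Pad the cuts with -Inf and Inf so every value falls into some interval.
binned <- cut(B, breaks = c(-Inf, cuts, Inf))

table(binned)   # counts per interval: 5, 5, 5, 5, 6
```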

GroupedShuffler

Apply Shuffler to a data set having first grouped the data on column(s). The grouping needs to be defined at initialization.

Implemented deident options:

grouped_shuffle <- GroupedShuffler$new(C)
deident(df, grouped_shuffle, B)

Options

At initialization, GroupedShuffler takes an argument limit such that, if any aggregated subgroup has fewer than limit observations, all values are dropped.

grouped_shuffle <- GroupedShuffler$new(C, limit=1)
deident(df, grouped_shuffle, B)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'GroupedShuffler(group_on=C)' on variable(s) B 
#> For data:
#>    columns: A, B, C
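A within-group shuffle can be sketched in base R with ave(). The helper shuffle_group is hypothetical, and treating "dropped" as replacement by NA for the undersized group is an assumption made for illustration; deident's actual behaviour may differ.

```r
set.seed(7)   # only to make this sketch reproducible
df <- data.frame(A = letters, B = 1:26, stringsAsFactors = FALSE)
df$C <- ifelse(df$B <= 13, "X", "Y")

limit <- 1

# Hypothetical helper: permute values within one group, or blank the
# group out (assumed meaning of "dropped") when it is smaller than limit.
shuffle_group <- function(x) {
  if (length(x) < limit) return(rep(NA, length(x)))
  x[sample.int(length(x))]
}

df$B_shuffled <- ave(df$B, df$C, FUN = shuffle_group)

# Values never cross group boundaries.
all(df$B_shuffled[df$C == "X"] <= 13)   # TRUE
```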

Drop

Define a column to be removed from the pipeline.

Implemented deident options:


deident(df, Drop, B)

drop <- deident:::Drop$new()
deident(df, drop, B)
