The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Blur Example

One way to reduce identifiability of a data set is by converting a categorical variable to have a more aggregated taxonomy (i.e. a many-to-one mapping). Here we refer to such a method as a ‘blur’ as it causes features to be joined together in such a way to hide the underlying information.

As an example, consider the ShiftsWorked data:

library(deident)
head(ShiftsWorked)
#> # A tibble: 6 × 7
#>   `Record ID` Employee    Date       Shift `Shift Start` `Shift End` `Daily Pay`
#>         <int> <chr>       <date>     <chr> <chr>         <chr>             <dbl>
#> 1           1 Maria Cook  2015-01-01 Night 17:01         00:01              78.1
#> 2           2 Stephen Cox 2015-01-01 Day   08:01         16:01             155. 
#> 3           3 Kimberly O… 2015-01-01 Day   08:01         16:01              77.8
#> 4           4 Nathan Alv… 2015-01-01 Day   08:01         15:01             203. 
#> 5           5 Samuel Par… 2015-01-01 Night 16:01         23:01             211. 
#> 6           6 Scott Morr… 2015-01-01 Night 17:01         00:01             142.

A simple ‘blur’ might be to change the taxonomy of ‘Shift’ e.g. combine ‘Day’ and ‘Night’ into a new group ‘Working’ and ignore the ‘Rest’ shifts. To do this we define the values we wish to change as a vector, build a pipeline and apply it to the data:

shift_blur <- c("Day" = "Working", "Night" = "Working")
blur_pipe <- ShiftsWorked |>
  add_blur(Shift, blur=shift_blur)

apply_deident(ShiftsWorked, blur_pipe)
#> # A tibble: 3,100 × 7
#>    `Record ID` Employee   Date       Shift `Shift Start` `Shift End` `Daily Pay`
#>          <int> <chr>      <date>     <chr> <chr>         <chr>             <dbl>
#>  1           1 Maria Cook 2015-01-01 Work… 17:01         00:01              78.1
#>  2           2 Stephen C… 2015-01-01 Work… 08:01         16:01             155. 
#>  3           3 Kimberly … 2015-01-01 Work… 08:01         16:01              77.8
#>  4           4 Nathan Al… 2015-01-01 Work… 08:01         15:01             203. 
#>  5           5 Samuel Pa… 2015-01-01 Work… 16:01         23:01             211. 
#>  6           6 Scott Mor… 2015-01-01 Work… 17:01         00:01             142. 
#>  7           7 Nathan Sa… 2015-01-01 Rest  <NA>          <NA>                0  
#>  8           8 Jose Lopez 2015-01-01 Work… 17:01         00:01             213. 
#>  9           9 Donna Bro… 2015-01-01 Work… 16:01         00:01             219. 
#> 10          10 George Ki… 2015-01-01 Work… 16:01         00:01             242. 
#> # ℹ 3,090 more rows

The category_blur utility

Applying the blur is relatively simple, but constructing it can be complex. Consider the starwars data set supplied by dplyr:

starwars <- dplyr::starwars
head(starwars)
#> # A tibble: 6 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
#> 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

And notably the species variable:

table(starwars$species)
#> 
#>         Aleena       Besalisk         Cerean       Chagrian       Clawdite 
#>              1              1              1              1              1 
#>          Droid            Dug           Ewok      Geonosian         Gungan 
#>              6              1              1              1              3 
#>          Human           Hutt       Iktotchi        Kaleesh       Kaminoan 
#>             35              1              1              1              2 
#>        Kel Dor       Mirialan   Mon Calamari           Muun       Nautolan 
#>              1              2              1              1              1 
#>      Neimodian         Pau'an       Quermian         Rodian        Skakoan 
#>              1              1              1              1              1 
#>      Sullustan     Tholothian        Togruta          Toong      Toydarian 
#>              1              1              1              1              1 
#>     Trandoshan        Twi'lek     Vulptereen        Wookiee          Xexto 
#>              1              2              1              2              1 
#> Yoda's species         Zabrak 
#>              1              2

Imagine we wanted to reduce identifiability by aggregating the data into Human vs Non-Human. We could code the vector by hand, but human error can lead to mistakes. To aid in designing complex blurs we supply the category_blur utility which uses regex to define the groups.

human_blur <- category_blur(
  starwars$species,
  "NotHuman" = "^(?!Human)" # Doesn't start with "Human"
)

And the vector returned can be passed into a new pipeline as before.

species_pipe <- starwars |>
  add_blur(species, blur=human_blur)
  
new_starwars <- apply_deident(starwars, species_pipe)

table(new_starwars$species)
#> 
#>    Human NotHuman 
#>       35       48

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.