Introduction to synthesizer

Package version 0.6.0. Please use citation('synthesizer') to cite the package.

Introduction

synthesizer is an R package for quickly and easily synthesizing data. It also provides a few basic functions, based on pMSE, for measuring the utility of the synthesized data.
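
To make the pMSE idea concrete, here is a minimal base R sketch of the propensity-score mean squared error in its usual form: stack the real and synthetic records, fit a model that tries to tell them apart, and average the squared distance between each predicted probability and the share of synthetic records. The helper name pmse_sketch, the choice of a main-effects logistic regression, and the two toy "synthetic" datasets are assumptions made for illustration; this is not the package's own implementation.

```r
# Hypothetical helper for illustration; not a function from the synthesizer package.
pmse_sketch <- function(real, synth) {
  combined <- rbind(real, synth)
  # label each record: 0 = real, 1 = synthetic
  combined$is_synth <- rep(c(0, 1), times = c(nrow(real), nrow(synth)))
  # propensity model: logistic regression on all variables (main effects only)
  fit <- suppressWarnings(glm(is_synth ~ ., data = combined, family = binomial()))
  p  <- fitted(fit)                    # estimated probability of being synthetic
  c0 <- nrow(synth) / nrow(combined)   # share of synthetic records
  mean((p - c0)^2)                     # near 0: the model cannot tell them apart
}

real <- iris[, 1:4]
set.seed(1)
# toy stand-ins for synthetic data (assumptions, not output of synthesize())
close_copy <- real[sample(nrow(real)), ]                       # same records, shuffled
shifted    <- transform(real, Sepal.Length = Sepal.Length + 0.5)  # distorted marginal

pmse_close   <- pmse_sketch(real, close_copy)
pmse_shifted <- pmse_sketch(real, shifted)
pmse_close < pmse_shifted
```

A lower value suggests that the synthetic records are harder to distinguish from the real ones under the chosen model. A main-effects model only picks up differences it can express, so a richer propensity model may rank datasets differently.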

The package supports numerical, categorical/ordinal, and mixed data. It also synthesizes time series (ts) objects and correctly accounts for missing values and mixed (or zero-inflated) distributions. A rankcor parameter lets you gradually shift between realistic data with high utility and less realistic data with decreased correlations between original and synthesized data.

Installation

Once the package is installed, for example from CRAN with install.packages("synthesizer"), it can be loaded. You can use packageVersion (from base R) to check which version you have installed.

> library(synthesizer)
> # check the package version
> packageVersion("synthesizer")
[1] ‘0.6.0’

A first example

> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> set.seed(1)
> synth_iris <- synthesize(iris)

By default, synthesize returns a dataset of the same size as the input dataset. However, it is possible to ask for any number of records with the n argument.

> more_synth <- synthesize(iris, n=250)
> dim(more_synth)
[1] 250   5

Controlling the utility-privacy trade-off

Synthetic data can be too realistic, in the sense that it might reveal actual properties of the original entities used to create the synthetic data. One way to mitigate this is to decrease the rank correlation between the original and the synthetic data.

When synthesizing data frames, this can be controlled with the rankcor parameter. It ranges from 0, representing the lowest utility, to 1, the default and maximum utility, and sets the maximum rank correlation between original and synthesized variables. If rankcor is a single (unnamed) value, all synthetic variables are rank-decorrelated from the original data by random permutations until the rank correlation between synthetic and original data drops below the rankcor value. It is also possible to lower the utility of a selection of variables by passing a named vector. Variables for which rankcor is not specified then default to perfect rank correlation (rankcor=1).

> # decorrelate rank matching to 0.5
> s1 <- synthesize(iris, rankcor=0.5)
> # decorrelate only Species
> s2 <- synthesize(iris, rankcor=c("Species"=0.5))

In the left figure, we show the three variables of a synthesized iris dataset, where all variables are decorrelated. Both the geometric clustering and the species are now garbled. In the right figure we only decorrelate the Species variable. Here, the spatial clustering is retained while the correlation between color (Species) and location is lost.
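
The decorrelation by random permutations described above can be pictured with a small base R sketch for a single numeric variable. It is a sketch under stated assumptions rather than the package's actual algorithm: the helper decorrelate_sketch, its pairwise-swap strategy, and the max_iter safety cap are hypothetical, and the "synthetic" input is a simple rank-matched bootstrap sample standing in for real synthesize() output.

```r
# Hypothetical helper for illustration; not a function from the synthesizer package.
decorrelate_sketch <- function(orig, synth, rankcor, max_iter = 10000L) {
  for (iter in seq_len(max_iter)) {
    # stop as soon as the rank correlation is at or below the target
    if (cor(orig, synth, method = "spearman") <= rankcor) break
    ij <- sample(length(synth), 2)   # pick two distinct positions at random
    synth[ij] <- synth[rev(ij)]      # swap the two synthetic values
  }
  synth
}

set.seed(1)
orig  <- iris$Sepal.Length
# stand-in for a synthetic variable whose rank order matches the original
synth <- sort(sample(orig, replace = TRUE))[rank(orig, ties.method = "first")]

garbled <- decorrelate_sketch(orig, synth, rankcor = 0.5)
cor(orig, synth,   method = "spearman")   # close to 1 before decorrelation
cor(orig, garbled, method = "spearman")   # at or below 0.5 afterwards
```

Because values are only swapped, the marginal distribution of the synthetic variable is unchanged; only its alignment with the original records is loosened.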

Synthesizing (multivariate) time series

Synthesizing time series is as easy as synthesizing data frames, but there are a few differences.

As a demonstration, we create a synthetic version of the UKDriverDeaths dataset that is included with base R.

> data(UKDriverDeaths)
> synth_udd <- synthesize(UKDriverDeaths)

How it works

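The synthesis steps ensure a synthetic dataset that closely resembles the original data. The rank order matching ensures a certain resilience to the influence of outliers. If the rankcor argument has a value less than the default 1, a third step is performed, in which the synthetic values are randomly permuted until the rank correlation between synthetic and original data drops below the rankcor value.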
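
The rank order matching mentioned in this section can be pictured with a short base R sketch. It assumes that each variable is resampled from its own empirical distribution and then reordered so that its rank order follows the original; the helper rank_match_sketch is hypothetical, handles numeric variables only, and may differ from what the package does internally.

```r
# Hypothetical helper for illustration; not a function from the synthesizer package.
rank_match_sketch <- function(x) {
  # draw new values from the empirical distribution of x
  s <- sample(x, size = length(x), replace = TRUE)
  # reorder them so the k-th smallest synthetic value sits where
  # the k-th smallest original value sits
  sort(s)[rank(x, ties.method = "first")]
}

set.seed(1)
orig  <- iris[, 1:4]
synth <- as.data.frame(lapply(orig, rank_match_sketch))

# rank correlation between each original variable and its synthetic counterpart
rank_cors <- mapply(function(o, s) cor(o, s, method = "spearman"), orig, synth)
rank_cors

# relations between variables carry over through the shared rank order
cor(orig$Petal.Length,  orig$Petal.Width,  method = "spearman")
cor(synth$Petal.Length, synth$Petal.Width, method = "spearman")
```

Since only the order of the original values is used for the alignment, a single extreme value affects where one synthetic value is placed but not how the remaining values are arranged.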

Except in the case of time series, it is possible to sample datasets that are larger or smaller than their originals. This is done by creating multiple synthetic datasets (if necessary) and sampling records uniformly without replacement from the combined dataset.
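
The resampling described above could look roughly as follows in base R. Both helpers are hypothetical: synth_once is a simple numeric-only stand-in for a single synthesis pass (it is not the package's synthesize()), and resample_sketch assumes that the number of synthetic copies is the smallest one that provides at least n records.

```r
# Hypothetical stand-in for one synthesis pass; NOT the package's synthesize().
synth_once <- function(df) {
  as.data.frame(lapply(df, function(x) {
    sort(sample(x, replace = TRUE))[rank(x, ties.method = "first")]
  }))
}

# Hypothetical helper: draw n records from as many synthetic copies as needed.
resample_sketch <- function(df, n) {
  k    <- ceiling(n / nrow(df))                  # copies needed to reach n records
  pool <- do.call(rbind, lapply(seq_len(k), function(i) synth_once(df)))
  pool[sample(nrow(pool), n), , drop = FALSE]    # uniform, without replacement
}

set.seed(1)
larger  <- resample_sketch(iris[, 1:4], n = 250)  # 2 copies, 250 of 300 records kept
smaller <- resample_sketch(iris[, 1:4], n = 40)   # 1 copy, 40 of 150 records kept
dim(larger)
dim(smaller)
```

Sampling without replacement from the pooled copies means that no synthetic record is returned twice, which is why more than one copy is needed as soon as n exceeds the size of the original.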
