synthesizer
Package version 0.3.1.
Use citation('synthesizer') to cite the package.
synthesizer is an R package for quickly and easily synthesizing data. It also provides a few basic functions, based on pMSE, for measuring the utility of the synthesized data.
The package supports numerical, categorical/ordinal, and mixed data, and correctly accounts for missing values.
The method looks promising, but we are still investigating where it works well and where it fails, so there are no guarantees yet on utility, privacy, and so on. That said, our preliminary results are encouraging, and the package is very easy to use.
The latest CRAN release can be installed as follows.
install.packages("synthesizer")
Next, the package can be loaded. You can use packageVersion (from base R) to check which version you have installed.
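A minimal sketch of this step, assuming the package was installed from CRAN as shown above:

```r
# Attach the package and report the installed version.
library(synthesizer)
packageVersion("synthesizer")
```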
We will use the iris dataset, which is built into R.
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Creating a synthetic version of this dataset is easy.
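A short sketch of the call, assuming that synthesize accepts a data frame as its first argument. The seed and the name synth_iris are illustrative choices, not part of the package.

```r
library(synthesizer)

# Synthesis involves random sampling, so fix the seed to make the
# result reproducible.
set.seed(1)

# Create a synthetic version of iris; by default it has as many
# records as the input.
synth_iris <- synthesize(iris)
head(synth_iris)
```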
To compare the datasets we can make some side-by-side scatterplots.
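One way to draw such a comparison with base graphics is sketched below. The choice of variables, the colours, and the name synth_iris are illustrative assumptions.

```r
library(synthesizer)

set.seed(1)
synth_iris <- synthesize(iris)

# Two panels side by side: original data on the left, synthetic on the right.
oldpar <- par(mfrow = c(1, 2))

# Colour points by species; converting to integer codes gives valid colours
# whether the column is stored as a factor or as character.
plot(Sepal.Length ~ Petal.Length, data = iris,
     col = as.integer(as.factor(iris$Species)),
     pch = 16, las = 1, main = "Iris")
plot(Sepal.Length ~ Petal.Length, data = synth_iris,
     col = as.integer(as.factor(synth_iris$Species)),
     pch = 16, las = 1, main = "Synthetic iris")

# Restore the previous graphics settings.
par(oldpar)
```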
By default, synthesize will return a dataset of the same size as the input dataset. However, it is possible to request any number of records.
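A sketch of requesting a different number of records, assuming the size is passed through an argument named n. The value 250 and the name more_synth are arbitrary.

```r
library(synthesizer)

set.seed(1)

# Ask for 250 synthetic records instead of the default nrow(iris) = 150.
more_synth <- synthesize(iris, n = 250)
dim(more_synth)
```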
The pMSE method is a popular way of measuring the quality of a synthetic dataset. The idea is to train a model to predict whether a record is synthetic or not. The worse the model performs at this task, the more closely the synthetic data resembles the real data. The value ranges between 0 and 0.25 (if the synthetic and real datasets have the same number of records). Smaller is better.
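To make the idea concrete, here is a base-R sketch of a pMSE computation using logistic regression. It is not the package's own implementation: pmse_sketch is a hypothetical helper, and the two "synthetic" datasets are crude stand-ins (a bootstrap resample and a deliberately shifted copy) chosen only to show a small and a large value.

```r
# pMSE sketch: stack real and synthetic records, fit a classifier that
# predicts the synthetic label, and average the squared distance between
# the predicted probabilities and the share of synthetic records.
pmse_sketch <- function(real, synth) {
  stopifnot(identical(names(real), names(synth)))
  combined <- rbind(real, synth)
  combined$is_synth <- rep(c(0, 1), times = c(nrow(real), nrow(synth)))
  # Perfectly separable data triggers glm warnings; they are harmless here.
  fit <- suppressWarnings(
    glm(is_synth ~ ., data = combined, family = binomial())
  )
  share <- nrow(synth) / nrow(combined)  # 0.5 when both sets are equal in size
  mean((fitted(fit) - share)^2)
}

set.seed(1)

# Stand-in 1: a bootstrap resample of iris, hard to tell apart from the original.
close_copy <- iris[sample(nrow(iris), replace = TRUE), ]
rownames(close_copy) <- NULL

# Stand-in 2: iris with one column shifted far away, trivially detectable.
shifted_copy <- iris
shifted_copy$Sepal.Length <- shifted_copy$Sepal.Length + 100

pmse_close   <- pmse_sketch(iris, close_copy)    # close to 0
pmse_shifted <- pmse_sketch(iris, shifted_copy)  # close to the 0.25 maximum
```

With equally sized datasets the share is 0.5, so each squared term is at most 0.25, which is where the upper bound comes from.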
Synthetic data is prepared as follows.
Given an original dataset with n records:
If m < n records are needed, sample m records uniformly from the dataset just created. If m > n records are needed, create ⌈m/n⌉ synthetic datasets of size n and sample m records uniformly from the combined datasets.
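The resizing rule can be sketched in base R as follows, assuming each synthetic dataset has n records (the same size as the original). Both resize_synthetic and shuffle_columns are hypothetical names. shuffle_columns is only a stand-in generator that permutes each column independently; it is not the package's synthesis method.

```r
# Return m synthetic records, given a generator `make_one` that produces
# one synthetic data frame with the same number of rows as `x`.
resize_synthetic <- function(x, m, make_one) {
  n <- nrow(x)
  if (m == n) {
    return(make_one(x))
  }
  if (m < n) {
    # Fewer records needed: sample m rows from one synthetic dataset.
    synth <- make_one(x)
    return(synth[sample(n, m), , drop = FALSE])
  }
  # More records needed: build ceiling(m / n) synthetic datasets,
  # stack them, and sample m rows from the pool.
  k <- ceiling(m / n)
  pool <- do.call(rbind, lapply(seq_len(k), function(i) make_one(x)))
  pool[sample(nrow(pool), m), , drop = FALSE]
}

# Stand-in generator (an assumption for illustration only).
shuffle_columns <- function(x) as.data.frame(lapply(x, sample))

set.seed(1)
small <- resize_synthetic(iris, 40, shuffle_columns)
large <- resize_synthetic(iris, 400, shuffle_columns)
```

Sampling is done without replacement, so the pool of ⌈m/n⌉ · n records always contains at least the m rows requested.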