categoryEncodings intends to provide a fast way to encode ‘factor’ or qualitative variables through various methods. The package uses data.table as the backend for speed, with as few other dependencies as possible. Most of the methods are based on the paper by Johannemann et al. (2019), Sufficient Representations for Categorical Variables (arXiv:1908.09874).
The current version features automatic inference of factors and uses a very simple heuristic for encoding, as well as allowing manual controls.
You can install the latest version of categoryEncodings from GitHub using the devtools package:
devtools::install_github("JSzitas/categoryEncodings")
Soon the package will be submitted to CRAN, and hopefully will be accepted.
Here we want to encode all of the factors in a given data.frame.
library(categoryEncodings)
# currently
data_fm <- cbind( data.frame(matrix(rnorm(5*100),ncol = 5)),
                  sample(sample(letters, 10), 100, replace = TRUE))
colnames(data_fm)[6] <- "few_letters"
# encoding is done automatically, as is the inference of factors
result <- encode_categories(X = data_fm)
# note that due to the data.table backend, the result has to be saved to an object to be
# visible: otherwise printing is suppressed.
print(result)
data_fm <- cbind( data.frame(
                    matrix( rnorm(5*100),ncol = 5)),
                  sample(sample(letters, 10), 100, replace = TRUE),
                  sample(sample(letters, 20), 100, replace = TRUE),
                  sample(sample(1:10, 5), 100, replace = TRUE),
                  sample(sample(1:50, 35), 100, replace = TRUE ),
                  sample(1:2, 100, replace = TRUE ))
colnames(data_fm)[6:10] <- c( "few_letters", "many_letters",
"some_numbers", "many_numbers",
"binary" )
# it does not matter how many factor variables there are, whether they are encoded as factors,
# or whether you supply a method to encode them by - some simple inference of factors is done
# based on the number of distinct values in every variable - over a certain threshold
# a variable is deemed to be essentially a factor, and is treated as such for conversion
# you will be notified of which variables are being converted via a warning
result <- encode_categories(data_fm)
print(result)
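To make the idea of inferring factors from the number of distinct values more concrete, here is a minimal base R sketch. It is not the package's implementation: the helper name `guess_factor_columns`, the `max_distinct` argument, its default of 10, and the assumption that columns with few distinct values count as categorical are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper, NOT part of categoryEncodings: flag columns that look
# categorical. A column is flagged if it is already a factor or character
# vector, or if it has at most `max_distinct` distinct values (an assumed rule).
guess_factor_columns <- function(df, max_distinct = 10) {
  is_candidate <- vapply(df, function(col) {
    is.factor(col) || is.character(col) || length(unique(col)) <= max_distinct
  }, logical(1))
  names(df)[is_candidate]
}

set.seed(1)
toy <- data.frame(
  x = rnorm(100),                                          # continuous
  few_letters = sample(letters[1:5], 100, replace = TRUE), # text categories
  binary = sample(1:2, 100, replace = TRUE)                # two-level integer
)

# Returns the names of the columns that would be treated as factors
guess_factor_columns(toy)
# "few_letters" "binary"
```

In practice, encode_categories performs its own inference and warns about the variables it converts, so a helper like this is only useful for reasoning about which columns might be affected.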
Pull requests are welcome. All contributions will be considered for acceptance, provided they are justifiable and the code is reasonable, regardless of anything related to the person submitting the pull request. Please keep things civil - there is no need for negativity. Also, please refrain from adding unnecessary dependencies (e.g. the pipe) to the package: pull requests that add an unnecessary dependency will be denied or suspended until the code can be made dependency free.