This vignette will go over the steps required to implement a custom user-defined function within the bestNormalize framework.

There are 3 steps:

1. Create transformation function
2. Create predict method for transformation function (that can be applied to new data)
3. Pass through new function and predict method to bestNormalize
Here, we start by defining a new function that we'll call cuberoot_x. Like the sqrt_x function, it takes an argument a, which is used to add a constant to x if any negative values are detected. It also takes the argument standardize, which centers and scales the transformed data so that it is centered at 0 with SD = 1.
## Define user-function
cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) {
  stopifnot(is.numeric(x))

  min_a <- max(0, -(min(x, na.rm = TRUE)))
  if (!length(a))
    a <- min_a
  if (a < min_a) {
    warning("Setting a < max(0, -(min(x))) can lead to transformation issues",
            "Standardize set to FALSE")
    standardize <- FALSE
  }

  x.t <- (x + a)^(1/3)
  mu <- mean(x.t, na.rm = TRUE)
  sigma <- sd(x.t, na.rm = TRUE)
  if (standardize) x.t <- (x.t - mu) / sigma

  # Get in-sample normality statistic results
  ptest <- nortest::pearson.test(x.t)

  val <- list(
    x.t = x.t,
    x = x,
    mean = mu,
    sd = sigma,
    a = a,
    n = length(x.t) - sum(is.na(x)),
    norm_stat = unname(ptest$statistic / ptest$df),
    standardize = standardize
  )

  # Assign class
  class(val) <- c('cuberoot_x', class(val))
  val
}
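As a quick sanity check, we can call the new function on some arbitrary data and confirm that the returned object carries the cuberoot_x class (a minimal illustration; the object name obj is arbitrary, and nortest must be installed):

# Illustrative check: the returned object should carry the 'cuberoot_x' class,
# which is what allows S3 dispatch to find the methods defined below
obj <- cuberoot_x(rgamma(20, 1, 1))
class(obj)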
Note that we assigned the returned object a class of the same name as the function; this is necessary for successful implementation within bestNormalize. We'll also need an associated predict method that is used to apply the transformation to newly observed data.
predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) {

  # If no data supplied and not inverse
  if (is.null(newdata) & !inverse)
    newdata <- object$x

  # If no data supplied and inverse
  if (is.null(newdata) & inverse)
    newdata <- object$x.t

  # Actually performing transformations

  # Perform inverse transformation as estimated
  if (inverse) {

    # Reverse-standardize
    if (object$standardize)
      newdata <- newdata * object$sd + object$mean

    # Reverse-cube-root (cube)
    newdata <- newdata^3 - object$a

    # Otherwise, perform transformation as estimated
  } else if (!inverse) {
    # Take cube root
    newdata <- (newdata + object$a)^(1/3)

    # Standardize to mean 0, sd 1
    if (object$standardize)
      newdata <- (newdata - object$mean) / object$sd
  }

  # Return transformed data
  unname(newdata)
}
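As a brief illustration of the two functions working together (the names y and tf below are arbitrary and not part of the vignette's workflow), transforming and then inverse-transforming should recover the original data:

# Illustrative round-trip: transform, then invert, and compare to the input
y <- rgamma(30, 1, 1)
tf <- cuberoot_x(y)
all.equal(predict(tf, newdata = predict(tf), inverse = TRUE), y)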
The following print method defines what will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.
print.cuberoot_x <- function(x, ...) {
  cat(ifelse(x$standardize, "Standardized", "Non-Standardized"),
      'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n',
      'Relevant statistics:\n',
      '- a =', x$a, '\n',
      '- mean (before standardization) =', x$mean, '\n',
      '- sd (before standardization) =', x$sd, '\n')
}
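With this method in place, any fitted cuberoot_x object will display in the format defined above; for example (arbitrary data, purely illustrative):

# Printing a fitted object dispatches to print.cuberoot_x
print(cuberoot_x(rgamma(10, 1, 1)))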
Note: if you can find a similar transformation in the source code, it's easy to model your code after it. For instance, for cuberoot_x and predict.cuberoot_x, I used sqrt_x.R as a template file.
library(bestNormalize)

# Store custom functions into list
custom_transform <- list(
  cuberoot_x = cuberoot_x,
  predict.cuberoot_x = predict.cuberoot_x,
  print.cuberoot_x = print.cuberoot_x
)

set.seed(123129)
x <- rgamma(100, 1, 1)
(b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE))
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 1.2347
## - Box-Cox: 1.0267
## - cuberoot_x: 0.9787
## - Double Reversed Log_b(x+a): 2.4507
## - Exp(x): 4.7947
## - Log_b(x+a): 1.3547
## - No transform: 2.0027
## - orderNorm (ORQ): 1.1627
## - sqrt(x + a): 1.0907
## - Yeo-Johnson: 1.0987
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Non-Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - a = 0
## - mean (before standardization) = 0.9588261
## - sd (before standardization) = 0.3298665
Evidently, the cube-rooting was the best normalizing transformation!
Is this code actually performing the cube-rooting?
all.equal(x^(1/3), b$chosen_transform$x.t)
## [1] TRUE
all.equal(x^(1/3), predict(b))
## [1] TRUE
It does indeed.
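The fitted bestNormalize object can also be applied to newly observed values (and inverted) through its predict method; here is a short illustration with arbitrary new values:

# Transform new observations using the selected cube-root transformation
new_x <- c(0.5, 1, 2)
predict(b, newdata = new_x)

# Map transformed values back to the original scale
predict(b, newdata = predict(b, newdata = new_x), inverse = TRUE)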
The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the norm_stat_fn argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degrees of freedom. The user-defined function must take an argument x, which indicates the data on which a user wants to evaluate the statistic.
Here is an example using the Lilliefors (Kolmogorov-Smirnov) normality test statistic:
bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat)
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.1958
## - Box-Cox: 0.1785
## - Center+scale: 0.2219
## - Double Reversed Log_b(x+a): 0.2624
## - Exp(x): 0.3299
## - Log_b(x+a): 0.1959
## - orderNorm (ORQ): 0.186
## - sqrt(x + a): 0.1829
## - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.3281193
## - mean (before standardization) = -0.1263882
## - sd (before standardization) = 0.9913552
Here is an example using the Lilliefors (Kolmogorov-Smirnov) normality test's p-value:
(dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.4327
## - Box-Cox: 0.4831
## - Center+scale: 0.2958
## - Double Reversed Log_b(x+a): 0.2028
## - Exp(x): 0.0675
## - Log_b(x+a): 0.3589
## - orderNorm (ORQ): 0.4492
## - sqrt(x + a): 0.4899
## - Yeo-Johnson: 0.4531
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized exp(x) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - mean (before standardization) = 6.885396
## - sd (before standardization) = 13.66084
Note: bestNormalize will attempt to minimize this statistic by default, which is definitely not what you want to do when working with a p-value (for a normality test, a larger p-value indicates less evidence against normality). This is seen in the example above, where the WORST normalizing transformation is chosen.
In this case, a user is advised to either manually select the best one:
best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]])
## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.:
## Relevant statistics:
## - a = 0
## - mean (before standardization) = 0.9811849
## - sd (before standardization) = 0.4779252
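The extracted transformation object behaves like any other fitted transformation in the package, so its predict method can be used directly (a brief, illustrative sketch):

# Apply the manually selected transformation to the original data ...
head(predict(do_this))

# ... or to newly observed values
predict(do_this, newdata = c(0.5, 1, 2))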
Or, the user can reverse their defined statistic (in this case by subtracting it from 1):
(do_this <- bestNormalize(x, norm_stat_fn = function(x) 1 - nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
## Estimated Normality Statistics (using custom normalization statistic)
## - arcsinh(x): 0.5166
## - Box-Cox: 0.4191
## - Center+scale: 0.6521
## - Double Reversed Log_b(x+a): 0.8005
## - Exp(x): 0.9601
## - Log_b(x+a): 0.5338
## - orderNorm (ORQ): 0.4646
## - sqrt(x + a): 0.4475
## - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.3281193
## - mean (before standardization) = -0.1263882
## - sd (before standardization) = 0.9913552