The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Customization within bestNormalize

Ryan A Peterson

2023-08-17

Custom functions with bestNormalize

This vignette will go over the steps required to implement a custom user-defined function within the bestNormalize framework.

There are 3 steps.

  1. Create transformation function

  2. Create predict method for transformation function (that can be applied to new data)

  3. Pass through new function and predict method to bestNormalize

Example: cube-root

S3 methods

Here, we start by defining a new function that we’ll call cuberoot_x, which will take an argument a (as does the sqrt_x function) which will try to add a constant if it sees any negative numbers in x. It will also take the argument standardize which will center and scale the transformed data so that it’s centered at 0 with SD = 1.

## Define user-function
cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) {
  stopifnot(is.numeric(x))
  
  min_a <- max(0, -(min(x, na.rm = TRUE)))
  if(!length(a)) 
    a <- min_a
  if(a < min_a) {
    warning("Setting a <  max(0, -(min(x))) can lead to transformation issues",
            "Standardize set to FALSE")
    standardize <- FALSE
  }
  
  
  x.t <- (x + a)^(1/3)
  mu <- mean(x.t, na.rm = TRUE)
  sigma <- sd(x.t, na.rm = TRUE)
  if (standardize) x.t <- (x.t - mu) / sigma
  
  # Get in-sample normality statistic results
  ptest <- nortest::pearson.test(x.t)
  
  val <- list(
    x.t = x.t,
    x = x,
    mean = mu,
    sd = sigma,
    a = a,
    n = length(x.t) - sum(is.na(x)),
    norm_stat = unname(ptest$statistic / ptest$df),
    standardize = standardize
  )
  
  # Assign class
  class(val) <- c('cuberoot_x', class(val))
  val
}

Note that we assigned a class to the object this returns of the same name; this is necessary for successful implementation within bestNormalize. We’ll also need an associated predict method that is used to apply the transformation to newly observed data. `

predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) {
  
  # If no data supplied and not inverse
  if (is.null(newdata) & !inverse)
    newdata <- object$x
  
  # If no data supplied and inverse
  if (is.null(newdata) & inverse)
    newdata <- object$x.t
  
  # Actually performing transformations
  
  # Perform inverse transformation as estimated
  if (inverse) {
    
    # Reverse-standardize
    if (object$standardize) 
      newdata <- newdata * object$sd + object$mean
    
    # Reverse-cube-root (cube)
    newdata <-  newdata^3 - object$a
    
    
    # Otherwise, perform transformation as estimated
  } else if (!inverse) {
    # Take cube root
    newdata <- (newdata + object$a)^(1/3)
    
    # Standardize to mean 0, sd 1
    if (object$standardize) 
      newdata <- (newdata - object$mean) / object$sd
  }
  
  # Return transformed data
  unname(newdata)
}

Optional: print method

This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.

print.cuberoot_x <- function(x, ...) {
  cat(ifelse(x$standardize, "Standardized", "Non-Standardized"),
      'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n', 
      'Relevant statistics:\n',
      '- a =', x$a, '\n',
      '- mean (before standardization) =', x$mean, '\n',
      '- sd (before standardization) =', x$sd, '\n')
}

Note: if you can find a similar transformation in the source code, it’s easy to model your code after it. For instance, for cuberoot_x and predict.cuberoot_x, I used sqrt_x.R as a template file.

Implementing with bestNormalize

# Store custom functions into list
custom_transform <- list(
  cuberoot_x = cuberoot_x,
  predict.cuberoot_x = predict.cuberoot_x,
  print.cuberoot_x = print.cuberoot_x
)

set.seed(123129)
x <- rgamma(100, 1, 1)
(b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.2347
##  - Box-Cox: 1.0267
##  - cuberoot_x: 0.9787
##  - Double Reversed Log_b(x+a): 2.4507
##  - Exp(x): 4.7947
##  - Log_b(x+a): 1.3547
##  - No transform: 2.0027
##  - orderNorm (ORQ): 1.1627
##  - sqrt(x + a): 1.0907
##  - Yeo-Johnson: 1.0987
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Non-Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - a = 0 
##  - mean (before standardization) = 0.9588261 
##  - sd (before standardization) = 0.3298665

Evidently, the cube-rooting was the best normalizing transformation!

Sanity check

Is this code actually performing the cube-rooting?

all.equal(x^(1/3), b$chosen_transform$x.t)
## [1] TRUE
all.equal(x^(1/3), predict(b))
## [1] TRUE

It does indeed.

Using custom normalization statistics

The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the norm_stat_fn argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degree of freedom.

The user-defined function must take an argument x, which indicates the data on which a user wants to evaluate the statistic.

Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic:

bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat)
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.1958
##  - Box-Cox: 0.1785
##  - Center+scale: 0.2219
##  - Double Reversed Log_b(x+a): 0.2624
##  - Exp(x): 0.3299
##  - Log_b(x+a): 0.1959
##  - orderNorm (ORQ): 0.186
##  - sqrt(x + a): 0.1829
##  - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193 
##  - mean (before standardization) = -0.1263882 
##  - sd (before standardization) = 0.9913552

Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test’s p-value:

(dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.4327
##  - Box-Cox: 0.4831
##  - Center+scale: 0.2958
##  - Double Reversed Log_b(x+a): 0.2028
##  - Exp(x): 0.0675
##  - Log_b(x+a): 0.3589
##  - orderNorm (ORQ): 0.4492
##  - sqrt(x + a): 0.4899
##  - Yeo-Johnson: 0.4531
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized exp(x) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - mean (before standardization) = 6.885396 
##  - sd (before standardization) = 13.66084

Note: bestNormalize will attempt to minimize this statistic by default, which is definitely not what you want to do when calculating the p-value. This is seen in the example above, as the WORST normalization transformation is chosen.

In this case, a user is advised to either manually select the best one:

best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]])
## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.:
##  Relevant statistics:
##  - a = 0 
##  - mean (before standardization) = 0.9811849 
##  - sd (before standardization) = 0.4779252

Or, the user can reverse their defined statistic (in this case by subtracting it from 1):

(do_this <- bestNormalize(x, norm_stat_fn = function(x) 1-nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.5166
##  - Box-Cox: 0.4191
##  - Center+scale: 0.6521
##  - Double Reversed Log_b(x+a): 0.8005
##  - Exp(x): 0.9601
##  - Log_b(x+a): 0.5338
##  - orderNorm (ORQ): 0.4646
##  - sqrt(x + a): 0.4475
##  - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193 
##  - mean (before standardization) = -0.1263882 
##  - sd (before standardization) = 0.9913552

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.