Getting started with dcorBSS

Sarah Leyder and Klaus Nordhausen

Overview

The dcorBSS package provides distance correlation based tools for blind source separation (BSS) and dependence analysis. The main functionality can be grouped into four parts:

  1. distance covariance and distance correlation calculations, including blockwise versions for larger data sets
  2. robust transformations, in particular the bowl and biloop transformations, tailored for integration with distance-based measures
  3. independent component analysis through dcorICA()
  4. serial-dependence diagnostics and tests based on distance correlation or HSIC

library(dcorBSS)

This vignette introduces the main workflow with small simulated examples. The examples are deliberately modest so that the vignette can be built quickly.

Measuring dependence with distance correlation

Distance correlation (dCor) was introduced by Székely et al. (2007) as a measure of statistical dependence between random variables. Unlike classical correlation measures, it captures a broad class of dependence structures, including nonlinear relationships, and it can be applied to both univariate and multivariate data. It is defined as a normalized version of the distance covariance (dCov), obtained by scaling with the marginal distance variances \(\mathrm{dCov}(X,X)\) and \(\mathrm{dCov}(Y,Y)\). The squared distance covariance is given by

\[ \mathrm{dCov}^2(X,Y) = \mathbb{E}\big[\|X - X'\|\,\|Y - Y'\|\big] + \mathbb{E}\big[\|X - X'\|\big]\mathbb{E}\big[\|Y - Y'\|\big] - 2\,\mathbb{E}\big[\|X - X'\|\,\|Y - Y''\|\big], \]

where \((X,Y)\), \((X',Y')\), and \((X'',Y'')\) are independent and identically distributed copies of \((X,Y)\), and \(\|\cdot\|\) denotes the Euclidean norm. The corresponding distance correlation is then defined as

\[ \mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dCov}(X,X)\,\mathrm{dCov}(Y,Y)}} \in [0,1]. \]

For random vectors with finite first moments, it has the important property that \(\mathrm{dCor}(X,Y) = 0\) if and only if \(X\) and \(Y\) are statistically independent.
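As a rough illustration of the definitions above, the sample versions can be sketched in base R with double-centered pairwise distance matrices, which is the V-statistic estimator of Székely et al. (2007). The helper names double_center(), dcov2_sketch() and dcor_sketch() are hypothetical and not part of dcorBSS, and it is an assumption, not a confirmed fact, that the package functions compute this particular estimator.

```r
# Hypothetical helpers for illustration only (base R, not part of dcorBSS).

# Double-center a distance matrix: subtract row and column means,
# then add back the grand mean.
double_center <- function(D) {
  sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
}

# Squared sample distance covariance (V-statistic form).
# Rows are observations; plain vectors are treated as one column.
dcov2_sketch <- function(x, y) {
  A <- double_center(as.matrix(dist(x)))
  B <- double_center(as.matrix(dist(y)))
  mean(A * B)
}

# Sample distance correlation; set to 0 when a distance variance vanishes.
dcor_sketch <- function(x, y) {
  v_xy <- max(dcov2_sketch(x, y), 0)  # guard against tiny negative rounding error
  v_xx <- dcov2_sketch(x, x)
  v_yy <- dcov2_sketch(y, y)
  if (v_xx * v_yy <= 0) return(0)
  sqrt(v_xy / sqrt(v_xx * v_yy))
}

set.seed(1)
x_demo <- rnorm(100)
dcor_sketch(x_demo, x_demo^2 + 0.25 * rnorm(100))  # nonlinear dependence
dcor_sketch(x_demo, rnorm(100))                    # independent draws
```

In finite samples this estimator is strictly positive even for independent data, so values are best compared with each other or calibrated by a permutation test rather than compared with zero.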

The dcorBSS package provides several functions for computing the above distance measures. The function dcor_large() computes the distance correlation between two numeric vectors or matrices. The related functions dcov_large() and dcov2_large() compute the distance covariance and squared distance covariance, respectively.

set.seed(1)
n <- 200

x <- rnorm(n)
y_dep <- x^2 + 0.25 * rnorm(n)
y_ind <- rnorm(n)

c(
  dependent = dcor_large(x, y_dep),
  independent = dcor_large(x, y_ind)
)
#>   dependent independent 
#>   0.5427687   0.1377335

The larger value for y_dep reflects the nonlinear relationship between x and y_dep. In contrast, y_ind is generated independently of x.

For multivariate observations, rows are observations and columns are variables.

X <- cbind(rnorm(n), rnorm(n))
Y <- cbind(X[, 1]^2 + 0.2 * rnorm(n), rnorm(n))

dcor_large(X, Y)
#> [1] 0.3690015

For larger samples, the block_size argument can be used to avoid allocating the full n x n pairwise distance matrices at once. This reduces memory use while computing the same sample quantity up to small floating-point differences.

r_full <- dcor_large(X, Y)
r_block <- dcor_large(X, Y, block_size = 64L)

c(full = r_full, blockwise = r_block)
#>      full blockwise 
#> 0.3690015 0.3690015
all.equal(r_full, r_block, tolerance = 1e-10)
#> [1] TRUE
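One way such a blockwise evaluation could be organised is sketched below in base R. This is an assumption about the general idea, not a description of how dcorBSS implements block_size; the helpers cross_dist(), dcov2_blockwise() and dcov2_full() are hypothetical. The sketch makes two passes over blocks of rows, so that only a block_size x n slice of each distance matrix is held in memory at a time.

```r
# Hypothetical sketch (base R); dcorBSS may organise its computation differently.

# Distances from the rows in `idx` to all rows of X: a length(idx) x n matrix.
cross_dist <- function(X, idx) {
  Xb <- X[idx, , drop = FALSE]
  sq <- matrix(0, nrow(Xb), nrow(X))
  for (j in seq_len(ncol(X))) sq <- sq + outer(Xb[, j], X[, j], "-")^2
  sqrt(sq)
}

dcov2_blockwise <- function(X, Y, block_size = 64L) {
  X <- as.matrix(X); Y <- as.matrix(Y)
  n <- nrow(X)
  blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))

  # Pass 1: row means of both distance matrices, one block of rows at a time.
  a_row <- numeric(n); b_row <- numeric(n)
  for (idx in blocks) {
    a_row[idx] <- rowMeans(cross_dist(X, idx))
    b_row[idx] <- rowMeans(cross_dist(Y, idx))
  }
  a_bar <- mean(a_row); b_bar <- mean(b_row)

  # Pass 2: double-center each block (column means equal row means by
  # symmetry) and accumulate the sum of elementwise products.
  total <- 0
  for (idx in blocks) {
    A <- sweep(cross_dist(X, idx) - a_row[idx], 2, a_row) + a_bar
    B <- sweep(cross_dist(Y, idx) - b_row[idx], 2, b_row) + b_bar
    total <- total + sum(A * B)
  }
  total / n^2
}

# Full-matrix reference for comparison.
dcov2_full <- function(X, Y) {
  ctr <- function(D) sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
  mean(ctr(as.matrix(dist(X))) * ctr(as.matrix(dist(Y))))
}

set.seed(1)
X_demo <- cbind(rnorm(200), rnorm(200))
Y_demo <- cbind(X_demo[, 1]^2 + 0.2 * rnorm(200), rnorm(200))
all.equal(dcov2_full(X_demo, Y_demo), dcov2_blockwise(X_demo, Y_demo, block_size = 64L))
```

The trade-off in this sketch is that every distance is computed twice, once per pass, in exchange for never storing a full n x n matrix.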

Distance correlation can also be used to quantify lagged dependence between \(X_t\) and \(X_{t+k}\). This is referred to as distance autocorrelation at lag \(k\). It is computed by evaluating the distance correlation between x[1:(n - lag), ] and x[(1 + lag):n, ], as implemented in the function dacor_large(). The following example illustrates this procedure for an MA(1) time series.

set.seed(1)

x <- arima.sim(n = 500, list(ma = c(0.8)))

dacor_large(x, lag = 1)
#> [1] 0.4341547
dacor_large(x, lag = 2)
#> [1] 0.06566249

dacov_large() and dacov2_large() operate similarly.
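The lag alignment described above can be sketched in base R as follows. The helper lagged_pair() is hypothetical and only illustrates the indexing. The final call uses dcor_large() as documented in this vignette and is guarded so that it runs only when dcorBSS is installed.

```r
# Hypothetical helper (base R): align a series with its lag-k shifted copy.
lagged_pair <- function(x, lag = 1L) {
  if (!is.matrix(x)) x <- matrix(as.numeric(x), ncol = 1)
  n <- nrow(x)
  stopifnot(lag >= 1L, lag < n)
  list(
    past   = x[1:(n - lag), , drop = FALSE],  # X_t
    future = x[(1 + lag):n, , drop = FALSE]   # X_{t + lag}
  )
}

set.seed(1)
x_ma <- arima.sim(model = list(ma = 0.8), n = 500)
pair <- lagged_pair(x_ma, lag = 1)
dim(pair$past)    # 499 rows, 1 column
dim(pair$future)  # 499 rows, 1 column

# By the description above, this should agree with dacor_large(x_ma, lag = 1).
if (requireNamespace("dcorBSS", quietly = TRUE)) {
  dcorBSS::dcor_large(pair$past, pair$future)
}
```

For an MA(1) process, observations more than one step apart are independent, which is why the lag-2 value in the example above is much smaller than the lag-1 value.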

Robust transformations

The package includes bounded, redescending data transformations that can be useful before computing dependence measures on heavy-tailed or contaminated data. Their goal is to increase robustness against outliers.

The first transform is the bowl transform of Leyder et al. (2026), a robust nonlinear transformation designed to reduce the influence of outliers. Because it is injective, the bowl transform preserves the underlying dependence structure of the data. It maps observations into a bounded embedding in which extreme values are smoothly "pulled back" towards the origin, which improves the stability of dependence measures such as distance correlation. This makes it particularly useful in robust multivariate analysis and independence testing.

The bowl transform maps a p-dimensional observation to p + 1 transformed coordinates as follows. Let \(x_i \in \mathbb{R}^p\) be an observation, \(\|x_i\|\) its Euclidean norm, and \(q = \sqrt{\chi^2_{p,\alpha}}\) a hyperparameter. Define \(u_i = \tanh(\|x_i\| / q).\) The bowl transformed observation is then \(\left(10u_i^2(1-u_i)^2 x_i,\; 10u_i^6(1-u_i)^2\right) \in \mathbb{R}^{p+1}.\)
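The formula can be transcribed almost literally into base R. The sketch below is not the package's bowl_transform(): the name bowl_sketch() is hypothetical, the default alpha = 0.998 is an assumption (it mirrors the alpha value printed in the dcorICA() output later in this vignette), and the input is assumed to be centered and scaled already.

```r
# Literal transcription of the formula in the text (hypothetical helper, base R).
# Assumes the rows of X have already been centered and scaled.
bowl_sketch <- function(X, alpha = 0.998) {
  X <- as.matrix(X)
  p <- ncol(X)
  q <- sqrt(qchisq(alpha, df = p))      # hyperparameter q
  u <- tanh(sqrt(rowSums(X^2)) / q)     # u_i in [0, 1]
  w <- 10 * u^2 * (1 - u)^2             # redescending weight on x_i
  cbind(w * X, 10 * u^6 * (1 - u)^2)    # n x (p + 1) matrix
}

set.seed(2)
X_raw <- cbind(rt(100, df = 3), rnorm(100))
# Classical scaling is used here only for brevity; a robust scaling would be
# the natural choice for contaminated data.
Z_bowl <- bowl_sketch(scale(X_raw))
dim(Z_bowl)
range(Z_bowl)
```

Since \(1 - u_i\) decays exponentially in \(\|x_i\|\), very distant points are mapped close to the origin, which is what bounds their influence on any distance-based measure computed afterwards.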

It is used, for example, by dcorICA(transform = "bowl") for robust independent component analysis, as described in the section on dcorICA() below. The data should be scaled before the bowl transform is applied.

set.seed(2)
X_heavy <- cbind(rt(100, df = 3), rnorm(100))

head(bowl_transform(X_heavy, do_scale = TRUE))
#>                                          y
#> [1,] -0.553740386 -0.60302640 1.359686e-02
#> [2,]  1.045776548 -0.01697597 1.082630e-01
#> [3,]  0.007580968  0.03605834 1.419488e-05
#> [4,] -0.017558991  0.01468177 5.282323e-06
#> [5,] -0.127506848  0.67114610 8.620165e-03
#> [6,]  0.708005230 -0.82878439 2.981865e-02
dim(bowl_transform(X_heavy, do_scale = TRUE))
#> [1] 100   3

A second transformation is the biloop transform of Leyder et al. (2025). In contrast to the bowl transform, it is applied columnwise by mapping each univariate variable \(x\) to two coordinates \((u(x),v(x))\) by a nonlinear embedding as follows:

\[ u(x) = \begin{cases} c_2\big(1 + \cos(2\pi \tanh(x/c_1) + \pi)\big), & x \ge 0 \\ -c_2\big(1 + \cos(2\pi \tanh(x/c_1) - \pi)\big), & x < 0 \end{cases} \]

\[ v(x) = \sin(2\pi \tanh(x/c_2)). \]

The constants c1 and c2 both default to 4 and can be changed through function arguments. It is advisable to scale the data robustly before applying the transformation.
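A direct base R transcription of the two formulas, exactly as printed above, is given below as a sketch. The name biloop_sketch() and its argument names are hypothetical, and no scaling is performed, so the output is not expected to match biloop_transform(z, do_scale = TRUE).

```r
# Literal transcription of u(x) and v(x) as printed above (hypothetical helper).
biloop_sketch <- function(x, c1 = 4, c2 = 4) {
  t1 <- tanh(x / c1)
  u <- ifelse(
    x >= 0,
     c2 * (1 + cos(2 * pi * t1 + pi)),
    -c2 * (1 + cos(2 * pi * t1 - pi))
  )
  v <- sin(2 * pi * tanh(x / c2))
  cbind(u = u, v = v)
}

# Both coordinates are bounded: |u| <= 2 * c2 and |v| <= 1.
biloop_sketch(c(-100, -2, 0, 2, 100))
```

With the default constants, the first coordinate stays within [-8, 8] and the second within [-1, 1], which agrees with the magnitudes in the printed output above.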

z <- rt(100, df = 3)
z_biloop <- biloop_transform(z, do_scale = TRUE)

head(z_biloop)
#>      V1_biloop1 V1_biloop2
#> [1,]  -7.092396 -0.6342853
#> [2,]   1.448988  0.7702409
#> [3,]   2.686603  0.9445564
#> [4,]  -2.150797 -0.8867232
#> [5,]  -1.986848 -0.8641188
#> [6,]   7.942019  0.1696482
dim(z_biloop)
#> [1] 100   2

A typical use of the bowl and biloop transforms is to compute a dependence measure on the transformed data, which makes the result more robust to outliers.

set.seed(3)
x <- rt(200, df = 3)
y <- x^2 + 0.3 * rnorm(200)

x_b <- biloop_transform(x, do_scale = TRUE)
y_b <- biloop_transform(y, do_scale = TRUE)

dcor_large(x_b, y_b)
#> [1] 0.5138393

Independent component analysis with dcorICA()

Blind source separation is a class of statistical methods that starts from observed mixtures and attempts to recover latent components that are mutually independent. The dcorICA() function implements an independent component analysis (ICA) approach for linearly mixed data based on distance measures. It builds on the ICA framework of Matteson and Tsay (2017). The algorithm first whitens the observations to remove second-order dependencies and then sequentially searches for a rotation in the form of an orthogonal matrix that minimizes distance correlation between components.
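The two ingredients named above, whitening and rotation by an orthogonal matrix, can be sketched in base R. The helpers whiten_sketch() and givens() are hypothetical. Building the rotation from Givens (plane) rotations is one common choice and is an assumption here, since this vignette does not state how dcorICA() parametrizes or searches over orthogonal matrices.

```r
# Hypothetical sketch (base R) of whitening followed by a rotation.

# Whitening: center, then multiply by the inverse symmetric square root
# of the sample covariance matrix.
whiten_sketch <- function(X) {
  X <- as.matrix(X)
  mu <- colMeans(X)
  eig <- eigen(cov(X), symmetric = TRUE)
  inv_sqrt <- eig$vectors %*%
    diag(1 / sqrt(eig$values), nrow = length(eig$values)) %*%
    t(eig$vectors)
  list(Z = sweep(X, 2, mu) %*% inv_sqrt, mu = mu, inv_sqrt = inv_sqrt)
}

# Givens rotation by angle theta in the (i, j) plane of a p-dimensional space.
givens <- function(p, i, j, theta) {
  G <- diag(p)
  G[c(i, j), c(i, j)] <- matrix(
    c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2
  )
  G
}

set.seed(4)
S_src <- cbind(runif(300, -1, 1), rexp(300))
X_obs <- S_src %*% t(matrix(c(2, 1, -1, 3), 2, 2))

Z_white <- whiten_sketch(X_obs)$Z
round(cov(Z_white), 10)                 # identity matrix up to rounding

Z_rot <- Z_white %*% t(givens(2, 1, 2, pi / 6))
round(cov(Z_rot), 10)                   # still the identity matrix
```

Because every rotation of whitened data remains uncorrelated, second-order information cannot identify the sources. This is why the remaining search is driven by a full dependence measure such as distance correlation.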

The following example simulates three independent sources and mixes them with a random matrix.

set.seed(4)
n <- 300

S <- cbind(
  uniform = runif(n, -1, 1),
  normal = rnorm(n),
  chisq = rchisq(n, df = 3)
)

A <- matrix(rnorm(9), 3, 3)
X <- tcrossprod(S, A)

fit <- dcorICA(X, seed = 1, sweeps = 2)

str(fit, max.level = 1)
#> List of 7
#>  $ W          : num [1:3, 1:3] 2.5436 -0.0103 -2.5747 0.6035 -0.1434 ...
#>  $ S          : num [1:300, 1:3] -0.797 0.785 -1.689 -1.581 0.546 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ mu         : num [1:3] -4.71 -3.92 -6.22
#>  $ transform  : chr "none"
#>  $ alpha      : num 0.998
#>  $ block_size : NULL
#>  $ convergence:List of 3
#>  - attr(*, "class")= chr "bss"
head(fit$S)
#>            IC.1       IC.2       IC.3
#> [1,] -0.7966035 -1.1018374  0.3007555
#> [2,]  0.7852794 -0.8772089 -1.7342561
#> [3,] -1.6892157  0.5840958 -0.6004088
#> [4,] -1.5806593  1.1170900 -0.6520316
#> [5,]  0.5458954 -0.4245358  1.0027885
#> [6,] -0.5426240 -0.9298818 -0.8086506

The returned object contains the estimated unmixing matrix W, the estimated components S, and the centering vector mu, such that \(S = (X - \mu) W^\top\), together with some additional optimization diagnostics.

fit$W
#>             [,1]       [,2]        [,3]
#> [1,]  2.54364186  0.6035219 -2.29397167
#> [2,] -0.01028315 -0.1433589 -0.09028123
#> [3,] -2.57470675 -2.1760647  3.31456053
pairs(fit$S, main = "Estimated components from dcorICA()")
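The relation \(S = (X - \mu) W^\top\) can be checked on simulated data where the mixing matrix is known. The base R sketch below uses hypothetical object names and takes the ideal unmixing matrix to be the inverse of the mixing matrix. An estimate such as fit$W can recover the sources only up to permutation, sign and scale of the components, so it should not be expected to equal this inverse entry by entry.

```r
# Hypothetical check (base R) of S = (X - mu) W^T with a known mixing matrix.
set.seed(4)
n_demo <- 300
S_true <- cbind(runif(n_demo, -1, 1), rnorm(n_demo), rchisq(n_demo, df = 3))
A_true <- matrix(rnorm(9), 3, 3)
X_mix <- tcrossprod(S_true, A_true)          # X = S A^T

W_ideal <- solve(A_true)                     # ideal unmixing matrix A^{-1}
mu_hat <- colMeans(X_mix)                    # centering vector
S_back <- sweep(X_mix, 2, mu_hat) %*% t(W_ideal)

# The centered sources are recovered up to numerical error.
S_centered <- sweep(S_true, 2, colMeans(S_true))
max(abs(S_back - S_centered))
```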

For larger data sets, block_size can be passed to dcorICA() so that the distance-correlation objective is evaluated blockwise to reduce memory usage.

fit_block <- dcorICA(X, seed = 1, sweeps = 2, block_size = 128L)

A robust variant of the ICA method is obtained by computing the dependence measure in the algorithm after applying the bowl transformation; see Leyder et al. (2026) for details. Robust location and scatter estimates should then also be supplied. For robust whitening, the scatter matrices should have the independence property; alternatively, one needs an assumption such as at most one independent component being skewed.

The robust ICA example below is not evaluated by default because it depends on the optional robustbase package.

if (requireNamespace("robustbase", quietly = TRUE)) {
  mcd <- robustbase::covMcd(X)

  fit_robust <- dcorICA(
    X,
    mu = mcd$center,
    scatter = mcd$cov,
    transform = "bowl",
    seed = 1,
    sweeps = 2
  )

  head(fit_robust$S)
}

Serial dependence diagnostics

The package also provides tools for detecting serial dependence in univariate time series. The function dacf_curve() computes distance autocovariance or distance autocorrelation over a set of lags and returns an object that can be plotted as a dependogram.

A small real-data example is the annual flow of the river Nile at Aswan, available as the Nile time series in the stats package:

data(Nile)

curve <- dacf_curve(Nile, lags = 1:12, measure = "dcor")
curve$estimate
#>     lag_1     lag_2     lag_3     lag_4     lag_5     lag_6     lag_7     lag_8 
#> 0.4773073 0.3929648 0.3838925 0.3365129 0.2903405 0.2708372 0.2996517 0.3402672 
#>     lag_9    lag_10    lag_11    lag_12 
#> 0.2352337 0.1856797 0.2853965 0.2939691
plot(curve, type = "line")

The function dcor_serial_test() performs a permutation-based portmanteau test that uses lagwise distance covariance or distance correlation values to detect serial dependence. Two portmanteau statistics are implemented: the classical Box-Pierce statistic (BP) and the kernel-weighted Fokianos-Pitsillou statistic (FP). For more details, see Fokianos and Pitsillou (2017).

dcor_serial_test() returns an object of the class "sdt", a serial dependence test object, which can be used to plot the accompanying dependogram.

test_dcor <- dcor_serial_test(
  Nile,
  type = "FP",
  measure = "dcor",
  lags = 1:6,
  B = 99,
  seed = 1
)

test_dcor
#> 
#>  Permutation test of serial dependence based on dCor using a FP
#>  portmanteau statistic
#> 
#> data:  Nile
#> T = 27.393, B = 99, lags = 6, p-value = 0.01
#> alternative hypothesis: serial dependence
#> sample estimates:
#>     lag_1     lag_2     lag_3     lag_4     lag_5     lag_6 
#> 0.4773073 0.3929648 0.3838925 0.3365129 0.2903405 0.2708372
plot(test_dcor)

For final analyses, it is advised to use a larger number of permutations, for example B = 2000 or more, to obtain more stable p-values.
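The role of B can be seen in a minimal base R sketch of a permutation portmanteau test. Everything in it is an assumption made for illustration: the helper perm_serial_test() is hypothetical, the statistic is the classical Box-Pierce sum of squared autocorrelations (a stand-in for the distance-based statistics of dcor_serial_test()), and the p-value convention (1 + number of exceedances) / (B + 1) is a common choice that is consistent with, but not confirmed by, the reported p-value of 0.01 for B = 99.

```r
# Hypothetical sketch (base R) of a permutation portmanteau test.
perm_serial_test <- function(x, lags = 1:6, B = 99, seed = 1) {
  x <- as.numeric(x)
  n <- length(x)
  stat <- function(z) {
    r <- acf(z, lag.max = max(lags), plot = FALSE)$acf[-1]  # drop lag 0
    n * sum(r[lags]^2)                                      # Box-Pierce form
  }
  set.seed(seed)
  t_obs <- stat(x)
  # Permuting the series destroys its serial order, which mimics the
  # null hypothesis of no serial dependence.
  t_perm <- replicate(B, stat(sample(x)))
  p_value <- (1 + sum(t_perm >= t_obs)) / (B + 1)
  list(statistic = t_obs, p.value = p_value)
}

res <- perm_serial_test(Nile, lags = 1:6, B = 99)
res$p.value   # cannot be smaller than 1 / (B + 1) = 0.01
```

Under this convention the p-value is a multiple of 1 / (B + 1), so B = 99 cannot distinguish strong evidence from very strong evidence, whereas B = 2000 resolves p-values down to about 0.0005.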

The package also includes a serial-dependence test based on the Hilbert-Schmidt Independence Criterion (HSIC) of Gretton et al. (2005), combined with the portmanteau statistic of Hong (1996). It uses the same "sdt" plotting interface.

test_hsic <- hsic_serial_test(
  Nile,
  lags = 1:6,
  type = "H96",
  B = 99,
  seed = 1
)

test_hsic
#> 
#>  Permutation test of serial dependence based on normalized HSIC using a
#>  H96 portmanteau statistic
#> 
#> data:  Nile
#> T = 15.903, B = 99, lags = 6, normalize = 1, p-value = 0.01
#> alternative hypothesis: serial dependence
#> sample estimates:
#>     lag_1     lag_2     lag_3     lag_4     lag_5     lag_6 
#> 0.3437072 0.2995278 0.3172480 0.3018563 0.2567619 0.2257860
plot(test_hsic, type = "line")

Normalized HSIC

The helper function nHSIC() computes a normalized Hilbert–Schmidt independence criterion. Like distance correlation, it is intended to behave as a scale-free dependence measure.

set.seed(6)
x <- matrix(rnorm(200), ncol = 1)
y <- x^2 + 0.3 * rnorm(200)
z <- matrix(rnorm(200), ncol = 1)

c(
  dependent = nHSIC(x, y),
  independent = nHSIC(x, z)
)
#>   dependent independent 
#>   0.6403181   0.1233047
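To give an idea of what a normalized HSIC can look like, the base R sketch below divides the biased HSIC estimate of Gretton et al. (2005) by the geometric mean of the two self-HSIC values. The helper nhsic_sketch() is hypothetical, and the Gaussian kernel with a median-heuristic bandwidth is an assumption. This vignette does not state which kernel, bandwidth or normalization nHSIC() uses, so the values need not match.

```r
# Hypothetical sketch (base R); kernel and bandwidth choices are assumptions.
nhsic_sketch <- function(x, y) {
  gram_centered <- function(z) {
    D2 <- as.matrix(dist(z))^2
    sigma2 <- median(D2[upper.tri(D2)])   # median heuristic for the bandwidth
    K <- exp(-D2 / (2 * sigma2))          # Gaussian kernel Gram matrix
    # Double centering, equivalent to H K H with H = I - 11^T / n.
    sweep(sweep(K, 1, rowMeans(K)), 2, colMeans(K)) + mean(K)
  }
  Kc <- gram_centered(x)
  Lc <- gram_centered(y)
  # HSIC(x, y) / sqrt(HSIC(x, x) * HSIC(y, y)); the 1 / n^2 factors cancel.
  sum(Kc * Lc) / sqrt(sum(Kc * Kc) * sum(Lc * Lc))
}

set.seed(6)
x_demo <- rnorm(200)
nhsic_sketch(x_demo, x_demo^2 + 0.3 * rnorm(200))  # dependent pair
nhsic_sketch(x_demo, rnorm(200))                   # independent pair
```

By the Cauchy-Schwarz inequality this ratio lies in [0, 1] and equals 1 when both arguments are identical, which is the scale-free behaviour described above.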

Overview of functions

| Task | Function |
|---|---|
| Distance correlation or covariance | dcor_large(), dcov_large(), dcov2_large() |
| Distance autocorrelation | dacor_large(), dacov_large(), dacov2_large() |
| Robust bounded transformations | bowl_transform(), biloop_transform() |
| Independent component analysis | dcorICA() |
| Dependogram | dacf_curve() |
| Distance-correlation serial-dependence test | dcor_serial_test() |
| HSIC serial-dependence test | hsic_serial_test() |
| Normalized HSIC | nHSIC() |

References

Fokianos, K. and Pitsillou, M. (2017). Consistent testing for pairwise dependence in time series. Technometrics, 59(2), 262–270.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. International Conference on Algorithmic Learning Theory, 63–77.

Hong, Y. (1996). Consistent testing for serial correlation of unknown form. Econometrica, 64(4), 837–864.

Leyder, S., Raymaekers, J., and Rousseeuw, P. J. (2025). Robust Distance Covariance. International Statistical Review, 94(1), 1–25.

Leyder, S., Raymaekers, J., Rousseeuw, P. J., Van Deuren, T., and Verdonck, T. (2026). Independent component analysis by robust distance correlation. Advances in Data Analysis and Classification.

Matteson, D. S. and Tsay, R. S. (2017). Independent component analysis via distance covariance. Journal of the American Statistical Association, 112(518), 623–637.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769–2794.