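The dcorBSS package provides distance-correlation-based tools for blind source separation (BSS) and dependence analysis. The main functionality can be grouped into four parts: distance correlation and covariance measures, robust bounded transformations, independent component analysis with dcorICA(), and tools for detecting serial dependence in time series.
This vignette introduces the main workflow with small simulated examples. The examples are deliberately modest so that the vignette can be built quickly.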
Distance correlation (dCor) was introduced by Székely et al. (2007) as a measure of statistical dependence between random variables. Unlike classical correlation measures, it captures a broad class of dependence structures, including nonlinear relationships, and can be applied to both univariate and multivariate data. It is defined as a normalized version of the distance covariance (dCov), obtained by scaling with the marginal distance variances. The squared distance covariance is given by \[ \mathrm{dCov}^2(X,Y) = \mathbb{E}\big[\|X - X'\|\,\|Y - Y'\|\big] + \mathbb{E}\big[\|X - X'\|\big]\mathbb{E}\big[\|Y - Y'\|\big] - 2\,\mathbb{E}\big[\|X - X'\|\,\|Y - Y''\|\big], \] where \((X,Y)\), \((X',Y')\), and \((X'',Y'')\) are independent and identically distributed copies of \((X,Y)\), and \(\|\cdot\|\) denotes the Euclidean norm. The corresponding distance correlation is then defined as \[ \mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dCov}(X,X)\,\mathrm{dCov}(Y,Y)}} \in [0,1]. \] For random vectors with finite first moments, it has the important property that \(\mathrm{dCor}(X,Y) = 0\) if and only if \(X\) and \(Y\) are statistically independent.
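To make the definition concrete, the sample version can be sketched in a few lines of base R. This is a naive illustration that stores the full pairwise distance matrices and uses their double-centered form; it is not the package's implementation, and `naive_dcov2()` and `naive_dcor()` are hypothetical helper names introduced only for this sketch.

```r
# Naive sample distance covariance and correlation in base R (illustration only).
# naive_dcov2() and naive_dcor() are hypothetical helpers, not dcorBSS functions.
naive_dcov2 <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  stopifnot(nrow(x) == nrow(y))
  # Double-center a distance matrix: subtract row and column means,
  # then add back the grand mean.
  double_center <- function(D) {
    D - rowMeans(D) - rep(colMeans(D), each = nrow(D)) + mean(D)
  }
  A <- double_center(as.matrix(dist(x)))
  B <- double_center(as.matrix(dist(y)))
  mean(A * B)  # squared sample distance covariance
}

naive_dcor <- function(x, y) {
  # dCov(X, X) * dCov(Y, Y) equals sqrt(dCov^2(X, X) * dCov^2(Y, Y))
  denom <- sqrt(naive_dcov2(x, x) * naive_dcov2(y, y))
  if (denom <= 0) return(0)
  sqrt(max(naive_dcov2(x, y), 0) / denom)
}

set.seed(1)
x <- rnorm(200)
naive_dcor(x, x^2 + 0.25 * rnorm(200))  # nonlinear dependence
naive_dcor(x, rnorm(200))               # independent draws
```

The sketch needs memory of order n squared, which is why it is only suitable for small samples.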
The dcorBSS package provides several functions for
computing the above distance measures. The function
dcor_large() computes the distance correlation between two
numeric vectors or matrices. The related functions
dcov_large() and dcov2_large() compute the
distance covariance and squared distance covariance, respectively.
set.seed(1)
n <- 200
x <- rnorm(n)
y_dep <- x^2 + 0.25 * rnorm(n)
y_ind <- rnorm(n)
c(
dependent = dcor_large(x, y_dep),
independent = dcor_large(x, y_ind)
)
#> dependent independent
#> 0.5427687 0.1377335
The larger value for y_dep reflects the nonlinear
relationship between x and y_dep. In contrast,
y_ind is generated independently of x.
For multivariate observations, rows are observations and columns are variables.
X <- cbind(rnorm(n), rnorm(n))
Y <- cbind(X[, 1]^2 + 0.2 * rnorm(n), rnorm(n))
dcor_large(X, Y)
#> [1] 0.3690015
For larger samples, the block_size argument can be used
to avoid allocating the full n x n pairwise distance
matrices at once. This reduces memory use while computing the same
sample quantity up to small floating-point differences.
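The idea behind blockwise evaluation can be sketched in base R. The squared sample distance covariance splits into three sums that mirror the three expectations in the definition, and each sum can be accumulated from a slice of block_size rows of the distance matrices. The helper `block_dcov2()` below is hypothetical and only illustrates the principle; how dcor_large() organizes its blocks internally is not shown in this vignette.

```r
# Blockwise evaluation of the squared sample distance covariance (illustration).
# block_dcov2() is a hypothetical helper; it never stores more than a
# block_size x n slice of each distance matrix.
block_dcov2 <- function(x, y, block_size = 64L) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  stopifnot(nrow(y) == n, block_size >= 1L)

  # Euclidean distances from the rows in `idx` to all rows of m.
  cross_dist <- function(m, idx) {
    d2 <- matrix(0, length(idx), nrow(m))
    for (j in seq_len(ncol(m))) {
      d2 <- d2 + outer(m[idx, j], m[, j], "-")^2
    }
    sqrt(d2)
  }

  sum_ab <- 0           # running sum of a_ij * b_ij
  row_a <- numeric(n)   # row sums of the x-distances
  row_b <- numeric(n)   # row sums of the y-distances
  for (start in seq(1L, n, by = block_size)) {
    idx <- start:min(start + block_size - 1L, n)
    a <- cross_dist(x, idx)
    b <- cross_dist(y, idx)
    sum_ab <- sum_ab + sum(a * b)
    row_a[idx] <- rowSums(a)
    row_b[idx] <- rowSums(b)
  }

  # mean(a * b) + mean(a) * mean(b) - 2 * mean over i of rowmean_a * rowmean_b
  sum_ab / n^2 + (sum(row_a) / n^2) * (sum(row_b) / n^2) -
    2 * sum(row_a * row_b) / n^3
}

set.seed(1)
X <- cbind(rnorm(200), rnorm(200))
Y <- cbind(X[, 1]^2 + 0.2 * rnorm(200), rnorm(200))
all.equal(block_dcov2(X, Y, block_size = 64L),
          block_dcov2(X, Y, block_size = 200L))
```

Different block sizes change only the order of summation, so the results agree up to floating-point rounding.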
r_full <- dcor_large(X, Y)
r_block <- dcor_large(X, Y, block_size = 64L)
c(full = r_full, blockwise = r_block)
#> full blockwise
#> 0.3690015 0.3690015
all.equal(r_full, r_block, tolerance = 1e-10)
#> [1] TRUE
Distance correlation can also be used to quantify lagged dependence
between \(X_t\) and \(X_{t+k}\). This is referred to as distance
autocorrelation at lag \(k\). It is
computed by evaluating the distance correlation between the rows x[1:(n - k), ] and x[(1 + k):n, ], as implemented in the
function dacor_large(), where the lag is passed as the lag argument. The following example illustrates
this procedure for an MA(1) time series.
set.seed(1)
x <- arima.sim(n = 500, list(ma = c(0.8)))
dacor_large(x, lag = 1)
#> [1] 0.4341547
dacor_large(x, lag = 2)
#> [1] 0.06566249
dacov_large() and dacov2_large() operate
similarly.
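The lagged-pairs construction can be written out in base R as a rough sketch. The helpers `dcor_sketch()` and `lagged_dcor()` are hypothetical and use a naive full-matrix distance correlation, so they only illustrate the indexing and are not a substitute for dacor_large().

```r
# Distance autocorrelation at a given lag via lagged pairs (illustration only).
# dcor_sketch() and lagged_dcor() are hypothetical helpers, not dcorBSS functions.
dcor_sketch <- function(x, y) {
  centered_dist <- function(v) {
    D <- as.matrix(dist(as.matrix(v)))
    D - rowMeans(D) - rep(colMeans(D), each = nrow(D)) + mean(D)
  }
  A <- centered_dist(x)
  B <- centered_dist(y)
  sqrt(max(mean(A * B), 0) / sqrt(mean(A * A) * mean(B * B)))
}

lagged_dcor <- function(x, lag = 1L) {
  # Treat a univariate series as a one-column matrix.
  x <- if (is.null(dim(x))) matrix(as.numeric(x), ncol = 1) else as.matrix(x)
  n <- nrow(x)
  stopifnot(lag >= 1L, lag < n)
  head_part <- x[1:(n - lag), , drop = FALSE]  # X_t
  tail_part <- x[(1 + lag):n, , drop = FALSE]  # X_{t+lag}
  dcor_sketch(head_part, tail_part)
}

set.seed(1)
x <- arima.sim(model = list(ma = 0.8), n = 500)
sapply(1:3, function(k) lagged_dcor(x, lag = k))
```

For an MA(1) process only consecutive observations share an innovation, so the value at lag 1 should stand out from those at higher lags.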
The package includes bounded, redescending data transformations that can be useful before computing dependence measures on heavy-tailed or contaminated data. Their goal is to increase robustness against outliers.
A first transform is the bowl transform of Leyder et al. (2026). It is a robust, nonlinear transformation designed to reduce the influence of outliers. Because it is injective, the bowl transform preserves the underlying dependence structure of the data. It maps observations into a bounded embedding in which extreme values are smoothly "pulled back" towards the origin, which improves the stability of dependence measures such as distance correlation. This makes it particularly useful in robust multivariate analysis and independence testing.
The bowl transform maps a p-dimensional observation to
p + 1 transformed coordinates as follows. Let \(x_i \in \mathbb{R}^p\) be an observation,
\(\|x_i\|\) its Euclidean norm, and
\(q = \sqrt{\chi^2_{p,\alpha}}\) a
hyperparameter. Define \(u_i = \tanh(\|x_i\| /
q).\) The bowl transformed observation is then \(\left(10u_i^2(1-u_i)^2 x_i,\;
10u_i^6(1-u_i)^2\right) \in \mathbb{R}^{p+1}.\)
It is used, for example, by dcorICA(transform = "bowl")
for robust independent component analysis, as discussed below. The data
should be scaled before the bowl transform is applied.
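The formula above can be transcribed directly into base R. The helper `bowl_sketch()` is hypothetical, the value alpha = 0.9975 is only an illustrative choice for the hyperparameter, and the median/MAD scaling in the usage lines is one possible way to scale the data; none of these is claimed to match what bowl_transform() does internally.

```r
# Literal transcription of the bowl transform formula (illustration only).
# bowl_sketch() is a hypothetical helper; alpha = 0.9975 is an assumed value.
bowl_sketch <- function(X, alpha = 0.9975) {
  X <- as.matrix(X)
  p <- ncol(X)
  q <- sqrt(qchisq(alpha, df = p))       # q = sqrt(chi^2_{p, alpha})
  u <- tanh(sqrt(rowSums(X^2)) / q)      # u_i = tanh(||x_i|| / q)
  w <- 10 * u^2 * (1 - u)^2              # weight applied to x_i
  # First p columns: w_i * x_i; extra column: 10 * u_i^6 * (1 - u_i)^2
  cbind(w * X, 10 * u^6 * (1 - u)^2)
}

set.seed(2)
X_heavy <- cbind(rt(100, df = 3), rnorm(100))
# Assumed scaling step: center by the median and divide by the MAD, columnwise.
X_scaled <- apply(X_heavy, 2, function(col) (col - median(col)) / mad(col))
Z <- bowl_sketch(X_scaled)
dim(Z)    # p + 1 columns
range(Z)  # all coordinates stay bounded
```

Since u tends to 1 for observations far from the origin, the factor (1 - u)^2 shrinks both the weighted coordinates and the extra coordinate back towards zero.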
set.seed(2)
X_heavy <- cbind(rt(100, df = 3), rnorm(100))
head(bowl_transform(X_heavy, do_scale = TRUE))
#> y
#> [1,] -0.553740386 -0.60302640 1.359686e-02
#> [2,] 1.045776548 -0.01697597 1.082630e-01
#> [3,] 0.007580968 0.03605834 1.419488e-05
#> [4,] -0.017558991 0.01468177 5.282323e-06
#> [5,] -0.127506848 0.67114610 8.620165e-03
#> [6,] 0.708005230 -0.82878439 2.981865e-02
dim(bowl_transform(X_heavy, do_scale = TRUE))
#> [1] 100 3
A second transformation is the biloop transform of Leyder et al. (2025). In contrast to the bowl transform, it is applied columnwise by mapping each univariate variable \(x\) to two coordinates \((u(x),v(x))\) by a nonlinear embedding as follows:
\[ u(x) = \begin{cases} c_2\big(1 + \cos(2\pi \tanh(x/c_1) + \pi)\big), & x \ge 0 \\ -c_2\big(1 + \cos(2\pi \tanh(x/c_1) - \pi)\big), & x < 0 \end{cases} \]
\[ v(x) = \sin(2\pi \tanh(x/c_2)). \]
The constants c1 and c2 default to 4, but
can be changed through the corresponding arguments. It is advisable to
scale the data robustly before applying the transformation.
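As a sketch, the two coordinate functions can be coded exactly as written above. The helper `biloop_sketch()` is hypothetical, and the median/MAD scaling in the usage lines is an assumed example of robust scaling rather than the behaviour of biloop_transform().

```r
# Literal transcription of the biloop formulas u(x) and v(x) (illustration only).
# biloop_sketch() is a hypothetical helper, not a dcorBSS function.
biloop_sketch <- function(x, c1 = 4, c2 = 4) {
  x <- as.numeric(x)
  angle <- 2 * pi * tanh(x / c1)
  u <- ifelse(x >= 0,
              c2 * (1 + cos(angle + pi)),    # branch for x >= 0
              -c2 * (1 + cos(angle - pi)))   # branch for x < 0
  v <- sin(2 * pi * tanh(x / c2))
  cbind(u = u, v = v)
}

set.seed(3)
z <- rt(100, df = 3)
z_scaled <- (z - median(z)) / mad(z)  # assumed robust scaling
head(biloop_sketch(z_scaled))
```

With these formulas the first coordinate stays within plus or minus 2 * c2 and the second within plus or minus 1, so arbitrarily large inputs map to bounded outputs.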
z <- rt(100, df = 3)
z_biloop <- biloop_transform(z, do_scale = TRUE)
head(z_biloop)
#> V1_biloop1 V1_biloop2
#> [1,] -7.092396 -0.6342853
#> [2,] 1.448988 0.7702409
#> [3,] 2.686603 0.9445564
#> [4,] -2.150797 -0.8867232
#> [5,] -1.986848 -0.8641188
#> [6,] 7.942019 0.1696482
dim(z_biloop)
#> [1] 100 2
A typical use of the bowl and biloop transforms is to compute dependence robustly after applying the transformation.
Blind source separation is a class of statistical methods that starts from observed mixtures and attempts to recover latent components that are mutually independent. The dcorICA() function implements an independent component analysis (ICA) approach for linearly mixed data based on distance measures. It builds on the ICA framework of Matteson and Tsay (2017). The algorithm first whitens the observations to remove second-order dependencies and then sequentially searches for a rotation in the form of an orthogonal matrix that minimizes distance correlation between components.
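The whitening step can be illustrated with classical covariance whitening in base R. This is a minimal sketch under the assumption of a sample mean and sample covariance; `whiten_sketch()` is a hypothetical helper, and the location and scatter estimates actually used by dcorICA() may differ.

```r
# Classical whitening with the sample mean and covariance (illustration only).
# whiten_sketch() is a hypothetical helper, not a dcorBSS function.
whiten_sketch <- function(X) {
  X <- as.matrix(X)
  mu <- colMeans(X)
  Xc <- sweep(X, 2, mu)                       # center each column
  eig <- eigen(cov(Xc), symmetric = TRUE)
  # Symmetric inverse square root of the covariance matrix.
  W <- eig$vectors %*%
    diag(1 / sqrt(eig$values), nrow = length(eig$values)) %*%
    t(eig$vectors)
  list(Z = Xc %*% W, W = W, mu = mu)
}

set.seed(4)
S <- cbind(runif(300, -1, 1), rnorm(300), rchisq(300, df = 3))
X <- tcrossprod(S, matrix(rnorm(9), 3, 3))
w <- whiten_sketch(X)
round(cov(w$Z), 10)  # uncorrelated columns with unit variance
```

After whitening, any remaining dependence between the columns is of higher order, and the unmixing problem reduces to finding an orthogonal rotation of Z.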
The following example simulates three independent sources and mixes them with a random matrix.
set.seed(4)
n <- 300
S <- cbind(
uniform = runif(n, -1, 1),
normal = rnorm(n),
chisq = rchisq(n, df = 3)
)
A <- matrix(rnorm(9), 3, 3)
X <- tcrossprod(S, A)
fit <- dcorICA(X, seed = 1, sweeps = 2)
str(fit, max.level = 1)
#> List of 7
#> $ W : num [1:3, 1:3] 2.5436 -0.0103 -2.5747 0.6035 -0.1434 ...
#> $ S : num [1:300, 1:3] -0.797 0.785 -1.689 -1.581 0.546 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ mu : num [1:3] -4.71 -3.92 -6.22
#> $ transform : chr "none"
#> $ alpha : num 0.998
#> $ block_size : NULL
#> $ convergence:List of 3
#> - attr(*, "class")= chr "bss"
head(fit$S)
#> IC.1 IC.2 IC.3
#> [1,] -0.7966035 -1.1018374 0.3007555
#> [2,] 0.7852794 -0.8772089 -1.7342561
#> [3,] -1.6892157 0.5840958 -0.6004088
#> [4,] -1.5806593 1.1170900 -0.6520316
#> [5,] 0.5458954 -0.4245358 1.0027885
#> [6,] -0.5426240 -0.9298818 -0.8086506
The returned object contains the estimated unmixing matrix
W, the estimated components S, the centering
vector mu, such that \(S = (X - \mu) W^{\top}\), and some additional optimization diagnostics.
fit$W
#> [,1] [,2] [,3]
#> [1,] 2.54364186 0.6035219 -2.29397167
#> [2,] -0.01028315 -0.1433589 -0.09028123
#> [3,] -2.57470675 -2.1760647 3.31456053
pairs(fit$S, main = "Estimated components from dcorICA()")
For larger data sets, block_size can be passed to
dcorICA() so that the distance-correlation objective is
evaluated blockwise to reduce memory usage.
A robust variant of the ICA method is obtained by computing the dependence measures in the algorithm after applying the bowl transformation; see Leyder et al. (2026) for details. Robust location and scatter estimates should then also be supplied. For robust whitening, the scatter matrices should have the independence property; alternatively, one should make an assumption such as at most one independent component being skewed.
Below, a robust ICA example is shown, although the code is not evaluated by default because it depends on the optional robustbase package.
The package also provides tools for detecting serial dependence in
univariate time series. The function dacf_curve() computes
distance autocovariance or distance autocorrelation over a set of lags
and returns an object that can be plotted as a dependogram.
A small real-data example is the annual flow of the river Nile at Aswan, available as the Nile time series in the stats package:
data(Nile)
curve <- dacf_curve(Nile, lags = 1:12, measure = "dcor")
curve$estimate
#> lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8
#> 0.4773073 0.3929648 0.3838925 0.3365129 0.2903405 0.2708372 0.2996517 0.3402672
#> lag_9 lag_10 lag_11 lag_12
#> 0.2352337 0.1856797 0.2853965 0.2939691
plot(curve, type = "line")
The function dcor_serial_test() performs a
permutation-based portmanteau test that uses lagwise distance covariance or
distance correlation values to detect serial dependence. Two types of
test statistic are implemented: the classical Box-Pierce statistic (BP) and
the kernel-weighted Fokianos-Pitsillou statistic (FP). For
more details, see Fokianos and Pitsillou (2017).
dcor_serial_test() returns an object of the class
"sdt", a serial dependence test object, which can be used
to plot the accompanying dependogram.
test_dcor <- dcor_serial_test(
Nile,
type = "FP",
measure = "dcor",
lags = 1:6,
B = 99,
seed = 1
)
test_dcor
#>
#> Permutation test of serial dependence based on dCor using a FP
#> portmanteau statistic
#>
#> data: Nile
#> T = 27.393, B = 99, lags = 6, p-value = 0.01
#> alternative hypothesis: serial dependence
#> sample estimates:
#> lag_1 lag_2 lag_3 lag_4 lag_5 lag_6
#> 0.4773073 0.3929648 0.3838925 0.3365129 0.2903405 0.2708372
plot(test_dcor)
For final analyses, it is advised to use a larger number of
permutations, for example B = 2000 or more, to obtain more
stable p-values.
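The logic of a permutation-based portmanteau test can be sketched in base R. The statistic below is a simplified, unweighted Box-Pierce-type sum of squared lagwise distance correlations, and the p-value uses the common (1 + count) / (B + 1) convention. Both are assumptions made for illustration: the helpers `dcor_sketch()` and `serial_perm_sketch()` are hypothetical, and the exact weighting and normalization of dcor_serial_test() are not reproduced.

```r
# Simplified permutation test for serial dependence (illustration only).
# dcor_sketch() and serial_perm_sketch() are hypothetical helpers.
dcor_sketch <- function(x, y) {
  centered_dist <- function(v) {
    D <- as.matrix(dist(as.matrix(v)))
    D - rowMeans(D) - rep(colMeans(D), each = nrow(D)) + mean(D)
  }
  A <- centered_dist(x)
  B <- centered_dist(y)
  sqrt(max(mean(A * B), 0) / sqrt(mean(A * A) * mean(B * B)))
}

serial_perm_sketch <- function(x, lags = 1:6, B = 99, seed = 1) {
  x <- as.numeric(x)
  n <- length(x)
  # Distance correlation between z_t and z_{t+k} for each lag k.
  lag_dcor <- function(z) {
    vapply(lags, function(k) dcor_sketch(z[1:(n - k)], z[(1 + k):n]),
           numeric(1))
  }
  set.seed(seed)
  est <- lag_dcor(x)
  t_obs <- n * sum(est^2)  # unweighted Box-Pierce-type statistic
  # Shuffling the series destroys serial dependence but keeps the marginal.
  t_perm <- replicate(B, n * sum(lag_dcor(sample(x))^2))
  list(statistic = t_obs, estimate = est,
       p.value = (1 + sum(t_perm >= t_obs)) / (B + 1))
}

res <- serial_perm_sketch(Nile, lags = 1:6, B = 99, seed = 1)
res$p.value
```

Under this convention the smallest attainable p-value is 1 / (B + 1), which is one reason a larger B gives finer and more stable p-values.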
The package also includes a serial-dependence test based on the
Hilbert-Schmidt Independence Criterion of Gretton et al. (2005) and the
test of Hong (1996). It uses the same "sdt" plotting
interface.
test_hsic <- hsic_serial_test(
Nile,
lags = 1:6,
type = "H96",
B = 99,
seed = 1
)
test_hsic
#>
#> Permutation test of serial dependence based on normalized HSIC using a
#> H96 portmanteau statistic
#>
#> data: Nile
#> T = 15.903, B = 99, lags = 6, normalize = 1, p-value = 0.01
#> alternative hypothesis: serial dependence
#> sample estimates:
#> lag_1 lag_2 lag_3 lag_4 lag_5 lag_6
#> 0.3437072 0.2995278 0.3172480 0.3018563 0.2567619 0.2257860
plot(test_hsic, type = "line")
The helper function nHSIC() computes a normalized
Hilbert–Schmidt independence criterion. Like distance correlation, it is
intended to behave as a scale-free dependence measure.
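One way such a normalized criterion can be built is sketched below in base R. The sketch assumes a Gaussian kernel with the median pairwise distance as bandwidth and divides the biased HSIC estimate by the geometric mean of the two self-HSIC values. These choices, and the helper name `nhsic_sketch()`, are assumptions for illustration; the kernel, bandwidth, and normalization used by nHSIC() may differ.

```r
# Normalized HSIC with a Gaussian kernel and median bandwidth (illustration only).
# gauss_gram(), center_gram() and nhsic_sketch() are hypothetical helpers.
gauss_gram <- function(x) {
  D <- as.matrix(dist(as.matrix(x)))
  sigma <- median(D[upper.tri(D)])  # median heuristic, assumed here
  exp(-D^2 / (2 * sigma^2))
}

center_gram <- function(K) {
  K - rowMeans(K) - rep(colMeans(K), each = nrow(K)) + mean(K)
}

nhsic_sketch <- function(x, y) {
  Kc <- center_gram(gauss_gram(x))
  Lc <- center_gram(gauss_gram(y))
  # HSIC(x, y) / sqrt(HSIC(x, x) * HSIC(y, y)); the 1/n^2 factors cancel.
  sum(Kc * Lc) / sqrt(sum(Kc * Kc) * sum(Lc * Lc))
}

set.seed(3)
x <- rnorm(100)
y <- x^2 + 0.25 * rnorm(100)
nhsic_sketch(x, y)
nhsic_sketch(10 * x, 3 * y)  # rescaling the inputs leaves the value unchanged
```

Because the bandwidth scales with the data, multiplying either input by a positive constant does not change the Gram matrices, which is what makes this version scale-free.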
| Task | Function |
|---|---|
| Distance correlation or covariance | dcor_large(), dcov_large(), dcov2_large() |
| Distance autocorrelation | dacor_large(), dacov_large(), dacov2_large() |
| Robust bounded transformations | bowl_transform(), biloop_transform() |
| Independent component analysis | dcorICA() |
| Dependogram | dacf_curve() |
| Distance-correlation serial-dependence test | dcor_serial_test() |
| HSIC serial-dependence test | hsic_serial_test() |
| Normalized HSIC | nHSIC() |
Fokianos, K. and Pitsillou, M. (2017). Consistent testing for pairwise dependence in time series. Technometrics, 59(2), 262–270.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. International conference on algorithmic learning theory, 63–77.
Hong, Y. (1996). Consistent testing for serial correlation of unknown form. Econometrica, 64(4), 837–864.
Leyder, S., Raymaekers, J., and Rousseeuw, P. J. (2025). Robust Distance Covariance. International Statistical Review, 94(1), 1–25.
Leyder, S., Raymaekers, J., Rousseeuw, P. J., Van Deuren, T., and Verdonck, T. (2026). Independent component analysis by robust distance correlation. Advances in Data Analysis and Classification.
Matteson, D. S. and Tsay, R. S. (2017). Independent component analysis via distance covariance. Journal of the American Statistical Association, 112(518), 623–637.
Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769–2794.