---
title: "Getting started with dcorBSS"
author: "Sarah Leyder and Klaus Nordhausen"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with dcorBSS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4,
  message = FALSE,
  warning = FALSE
)
```

## Overview

The **dcorBSS** package provides distance-correlation-based tools for blind
source separation (BSS) and dependence analysis. The main functionality can be grouped
into four parts:

1. distance covariance and distance correlation calculations, including
   blockwise versions for larger data sets
2. robust transformations, in particular the bowl and biloop transformations, tailored for integration with distance-based measures
3. independent component analysis through `dcorICA()`
4. serial-dependence diagnostics and tests based on distance correlation or HSIC

```{r setup}
library(dcorBSS)
```

This vignette introduces the main workflow with small simulated examples. The
examples are deliberately modest so that the vignette can be built quickly.

## Measuring dependence with distance correlation

Distance correlation (dCor) was introduced by Székely et al. (2007) as a measure of statistical dependence between random variables. Unlike classical correlation measures, it captures a broad class of dependence structures, including nonlinear relationships, and it applies to both univariate and multivariate data, which makes it useful in a wide range of statistical applications. Distance correlation is a normalized version of the distance covariance (dCov), obtained by scaling with the marginal distance variances. The squared distance covariance is given by
\[
\mathrm{dCov}^2(X,Y)
= \mathbb{E}\big[\|X - X'\|\,\|Y - Y'\|\big]
+ \mathbb{E}\big[\|X - X'\|\big]\mathbb{E}\big[\|Y - Y'\|\big]
- 2\,\mathbb{E}\big[\|X - X'\|\,\|Y - Y''\|\big],
\]
where $(X,Y)$, $(X',Y')$, and $(X'',Y'')$ are independent and identically distributed copies of $(X,Y)$, and $\|\cdot\|$ denotes the Euclidean norm. The corresponding distance correlation is then defined as
\[
\mathrm{dCor}(X,Y)
= \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dCov}(X,X)\,\mathrm{dCov}(Y,Y)}} \in [0,1].
\]
For random vectors with finite first moments, it has the important property that $\mathrm{dCor}(X,Y) = 0$ if and only if $X$ and $Y$ are statistically independent.
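
To make the definition concrete, the sample version of these quantities can be sketched in a few lines of base R. The helper `naive_dcor()` below is hypothetical and is not part of **dcorBSS**. It uses the plain V-statistic form with double-centred distance matrices and builds the full `n x n` matrices, so it is only meant for small examples and need not agree exactly with the estimator used by the package functions.

```{r sketch-naive-dcor}
# Naive O(n^2) sample distance correlation (V-statistic form), base R only.
# `naive_dcor()` is a hypothetical helper, not a dcorBSS function.
naive_dcor <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  stopifnot(nrow(x) == nrow(y))

  # Double-centre a distance matrix: subtract row and column means,
  # then add back the grand mean.
  double_center <- function(D) {
    sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
  }

  A <- double_center(as.matrix(dist(x)))
  B <- double_center(as.matrix(dist(y)))

  dcov2_xy <- mean(A * B)  # sample dCov^2(X, Y)
  dcov2_xx <- mean(A * A)  # sample dCov^2(X, X)
  dcov2_yy <- mean(B * B)  # sample dCov^2(Y, Y)

  sqrt(dcov2_xy / sqrt(dcov2_xx * dcov2_yy))
}

# Deterministic toy data: a symmetric grid and its square.
t_grid <- seq(-2, 2, length.out = 101)
c(
  dcor = naive_dcor(t_grid, t_grid^2),
  pearson = cor(t_grid, t_grid^2)
)
```

On this symmetric grid the Pearson correlation between `t_grid` and `t_grid^2` is essentially zero, while the distance correlation is positive, which illustrates the sensitivity to nonlinear dependence.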

The **dcorBSS** package provides several functions for computing the above distance measures. The function `dcor_large()` computes the distance correlation between two numeric vectors or matrices. The related functions `dcov_large()` and `dcov2_large()` compute the distance covariance and squared distance covariance, respectively.

```{r distance-correlation}
set.seed(1)
n <- 200

x <- rnorm(n)
y_dep <- x^2 + 0.25 * rnorm(n)
y_ind <- rnorm(n)

c(
  dependent = dcor_large(x, y_dep),
  independent = dcor_large(x, y_ind)
)
```

The larger value for `y_dep` reflects the nonlinear relationship between `x` and
`y_dep`. In contrast, `y_ind` is generated independently of `x`.

For multivariate observations, rows are observations and columns are variables.

```{r multivariate-dcor}
X <- cbind(rnorm(n), rnorm(n))
Y <- cbind(X[, 1]^2 + 0.2 * rnorm(n), rnorm(n))

dcor_large(X, Y)
```

For larger samples, the `block_size` argument can be used to avoid allocating the
full `n x n` pairwise distance matrices at once. This reduces memory use while computing
the same sample quantity up to small floating-point differences.

```{r blockwise-dcor}
r_full <- dcor_large(X, Y)
r_block <- dcor_large(X, Y, block_size = 64L)

c(full = r_full, blockwise = r_block)
all.equal(r_full, r_block, tolerance = 1e-10)
```
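
The idea behind blockwise evaluation can be sketched in base R. The helper `block_dcov2()` below is hypothetical and is not the implementation used by `dcor_large()`. It assumes the V-statistic form of the squared sample distance covariance and processes the rows in blocks, so that only `block_size x n` distances are held in memory at a time: a first pass collects the row means of both distance matrices, and a second pass accumulates the products of the double-centred distances.

```{r sketch-blockwise-dcov}
# Hypothetical helper (not the dcorBSS implementation): squared sample
# distance covariance accumulated over row blocks.
block_dcov2 <- function(x, y, block_size = 64L) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))

  # Euclidean distances from the rows in `idx` to all rows of `z`.
  dist_block <- function(z, idx) {
    d2 <- matrix(0, length(idx), nrow(z))
    for (k in seq_len(ncol(z))) {
      d2 <- d2 + outer(z[idx, k], z[, k], "-")^2
    }
    sqrt(d2)
  }

  # Pass 1: row means of both distance matrices (equal to the column
  # means by symmetry).
  rm_x <- numeric(n)
  rm_y <- numeric(n)
  for (idx in blocks) {
    rm_x[idx] <- rowMeans(dist_block(x, idx))
    rm_y[idx] <- rowMeans(dist_block(y, idx))
  }

  # Pass 2: accumulate the products of the double-centred distances.
  total <- 0
  for (idx in blocks) {
    m <- length(idx)
    A <- dist_block(x, idx) - rm_x[idx] - rep(rm_x, each = m) + mean(rm_x)
    B <- dist_block(y, idx) - rm_y[idx] - rep(rm_y, each = m) + mean(rm_y)
    total <- total + sum(A * B)
  }
  total / n^2
}

# Deterministic toy data, so the random number stream is left untouched.
t_idx <- 1:50
X_demo <- cbind(sin(t_idx), cos(t_idx^2))
Y_demo <- cbind(X_demo[, 1]^2, sin(3 * t_idx))

# Reference value computed from the full n x n matrices.
dc <- function(D) sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
full <- mean(dc(as.matrix(dist(X_demo))) * dc(as.matrix(dist(Y_demo))))

all.equal(block_dcov2(X_demo, Y_demo, block_size = 7L), full)
```

The sketch trades memory for a second pass over the data; the block size only changes how the sums are grouped, not the quantity being computed.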

Distance correlation can also be used to quantify lagged dependence between $X_t$ and $X_{t+k}$, which is referred to as distance autocorrelation at lag $k$. The function `dacor_large()` computes it as the distance correlation between `x[1:(n - lag), ]` and `x[(1 + lag):n, ]`. The following example illustrates this for an MA(1) time series.

```{r lagged-dcor}
set.seed(1)

x <- arima.sim(n = 500, list(ma = c(0.8)))

dacor_large(x, lag = 1)
dacor_large(x, lag = 2)

```

`dacov_large()` and `dacov2_large()` operate similarly.
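
The lagged construction described in this section can be written out in base R. The helper `lagged_pairs()` is hypothetical and only shows which observations are paired at a given lag; it does not compute any dependence measure itself.

```{r sketch-lagged-pairs}
# Hypothetical helper: split a series into the two aligned samples whose
# dependence is measured at a given lag.
lagged_pairs <- function(x, lag) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(lag >= 1, lag < n)
  list(
    past = x[1:(n - lag), , drop = FALSE],
    future = x[(1 + lag):n, , drop = FALSE]
  )
}

pairs_lag2 <- lagged_pairs(1:10, lag = 2)
cbind(past = pairs_lag2$past[, 1], future = pairs_lag2$future[, 1])
```

Each row of the printed matrix is one pair $(X_t, X_{t+2})$, so a series of length `n` contributes `n - lag` pairs at a given lag.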

## Robust transformations

The package includes bounded, redescending data transformations that can be useful before computing
dependence measures on heavy-tailed or contaminated data. Their goal is to increase robustness against outliers.

A first transform is the bowl transform of Leyder et al. (2026), a robust, nonlinear transformation designed to reduce the influence of outliers. Because the transform is injective, it preserves the underlying dependence structure of the data. It maps observations into a bounded embedding in which extreme values are smoothly “pulled back” towards the origin, which improves the stability of dependence measures such as distance correlation. This makes it particularly useful in robust multivariate analysis and independence testing.

The bowl transform maps a `p`-dimensional observation to `p + 1` transformed coordinates as follows. Let $x_i \in \mathbb{R}^p$ be an observation, $\|x_i\|$ its Euclidean norm, and
$q = \sqrt{\chi^2_{p,\alpha}}$ a hyperparameter. Define
\[
u_i = \tanh(\|x_i\| / q).
\]
The bowl transformed observation is then
\[
\left(10u_i^2(1-u_i)^2 x_i,\; 10u_i^6(1-u_i)^2\right) \in \mathbb{R}^{p+1}.
\]

The bowl transform is used, for example, by `dcorICA(transform = "bowl")` for robust independent component analysis, as described in the ICA section below. The data should be scaled before the bowl transform is applied.

```{r bowl-transform}
set.seed(2)
X_heavy <- cbind(rt(100, df = 3), rnorm(100))

head(bowl_transform(X_heavy, do_scale = TRUE))
dim(bowl_transform(X_heavy, do_scale = TRUE))
```
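
For readers who want to see the bowl formula in code, the following base R sketch is a literal transcription of the expression given above. The helper `bowl_sketch()` is hypothetical and is not the package's `bowl_transform()`. Reading $\chi^2_{p,\alpha}$ as the $\alpha$-quantile of the chi-squared distribution with $p$ degrees of freedom and using `alpha = 0.975` are assumptions of this sketch, and no scaling step is included, so the values need not match the package output.

```{r sketch-bowl-formula}
# Literal transcription of the bowl formula stated in the text
# (hypothetical helper; the quantile reading and alpha are assumptions).
bowl_sketch <- function(x, alpha = 0.975) {
  x <- as.matrix(x)
  p <- ncol(x)
  q <- sqrt(qchisq(alpha, df = p))
  u <- tanh(sqrt(rowSums(x^2)) / q)
  radial_weight <- 10 * u^2 * (1 - u)^2
  # First p columns: reweighted observation; last column: extra coordinate.
  cbind(radial_weight * x, 10 * u^6 * (1 - u)^2)
}

# Deterministic points, including the origin and a far outlier.
pts <- rbind(c(0, 0), c(1, 1), c(3, -2), c(100, 100))
round(bowl_sketch(pts), 4)
```

The output has `p + 1 = 3` columns, and under this transcription the far outlier in the last row is mapped to values that are numerically zero, which reflects the bounded, redescending behaviour described above.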

A second transformation is the biloop transform of Leyder et al. (2025). In contrast to the bowl transform, it is applied columnwise by mapping each univariate variable $x$ to two coordinates $(u(x),v(x))$ by a nonlinear embedding as follows:

\[
u(x) =
\begin{cases}
c_2\big(1 + \cos(2\pi \tanh(x/c_1) + \pi)\big), & x \ge 0 \\
-c_2\big(1 + \cos(2\pi \tanh(x/c_1) - \pi)\big), & x < 0
\end{cases}
\]

\[
v(x) = \sin(2\pi \tanh(x/c_2)).
\]

The constants `c1` and `c2` both default to 4 and can be changed through the corresponding arguments. It is advised to scale the data robustly before applying the transformation.

```{r biloop-transform}
z <- rt(100, df = 3)
z_biloop <- biloop_transform(z, do_scale = TRUE)

head(z_biloop)
dim(z_biloop)
```
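
The two coordinate functions of the biloop can be transcribed directly into base R. The helper `biloop_sketch()` below is hypothetical and follows the formulas for $u(x)$ and $v(x)$ exactly as written above, with both constants set to 4. It performs no scaling, so it is a sketch of the embedding only and not a replacement for `biloop_transform()`.

```{r sketch-biloop-formula}
# Literal transcription of u(x) and v(x) as stated in the text
# (hypothetical helper, no scaling step).
biloop_sketch <- function(x, c1 = 4, c2 = 4) {
  angle <- 2 * pi * tanh(x / c1)
  u <- ifelse(
    x >= 0,
    c2 * (1 + cos(angle + pi)),
    -c2 * (1 + cos(angle - pi))
  )
  v <- sin(2 * pi * tanh(x / c2))
  cbind(u = u, v = v)
}

# Deterministic grid with two extreme values at the ends.
x_grid <- c(-1000, -2, 0, 2, 1000)
round(biloop_sketch(x_grid), 4)
```

The embedding is odd in `x`, bounded by `2 * c2` in the first coordinate and by 1 in the second, and it sends both the origin and very large values to (numerically) zero.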

A typical use of the bowl and biloop transforms is to apply them first and then compute the dependence measure on the transformed data, which makes the result more robust to outliers.

```{r transformed-dependence}
set.seed(3)
x <- rt(200, df = 3)
y <- x^2 + 0.3 * rnorm(200)

x_b <- biloop_transform(x, do_scale = TRUE)
y_b <- biloop_transform(y, do_scale = TRUE)

dcor_large(x_b, y_b)
```

## Independent component analysis with `dcorICA()`

Blind source separation is a class of statistical methods that start from observed mixtures and attempt to recover
latent components that are mutually independent. The `dcorICA()` function implements an independent component analysis (ICA) approach for linearly mixed data based on distance measures. It builds on the ICA framework of Matteson and Tsay (2017). The algorithm first whitens the observations to remove second-order dependencies and then sequentially searches for a rotation, in the form of an orthogonal matrix, that
minimizes the distance correlation between components.
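
The whitening step can be illustrated in base R. The helper `whiten_sketch()` below is hypothetical and does not reproduce the internals of `dcorICA()`. It assumes symmetric whitening with the sample mean and sample covariance, and it also shows why only a rotation remains to be estimated afterwards: rotating whitened data leaves the covariance equal to the identity.

```{r sketch-whitening}
# Hypothetical helper (not dcorICA() internals): symmetric whitening with
# the sample mean and covariance.
whiten_sketch <- function(X) {
  X <- as.matrix(X)
  mu <- colMeans(X)
  Xc <- sweep(X, 2, mu)
  e <- eigen(cov(Xc), symmetric = TRUE)
  # Inverse square root of the covariance matrix.
  W0 <- e$vectors %*% diag(1 / sqrt(e$values), nrow = ncol(X)) %*% t(e$vectors)
  list(Z = Xc %*% W0, mu = mu, W0 = W0)
}

# Deterministic toy mixture of two signals.
t_demo <- seq(0, 1, length.out = 200)
S_demo <- cbind(sin(7 * pi * t_demo), (t_demo * 13) %% 1)
X_demo <- S_demo %*% matrix(c(2, 1, -1, 3), 2, 2)

Z_demo <- whiten_sketch(X_demo)$Z
round(cov(Z_demo), 10)

# A rotation keeps the data white, so only the angle is left to optimise.
theta <- pi / 5
R_demo <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
round(cov(Z_demo %*% R_demo), 10)
```

Both printed covariance matrices are the identity up to rounding, so second-order information cannot distinguish between rotations, and a dependence measure such as distance correlation is needed to select one.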

The following example simulates three independent sources and mixes them with a random matrix.

```{r dcorica-simulation}
set.seed(4)
n <- 300

S <- cbind(
  uniform = runif(n, -1, 1),
  normal = rnorm(n),
  chisq = rchisq(n, df = 3)
)

A <- matrix(rnorm(9), 3, 3)
X <- tcrossprod(S, A)

fit <- dcorICA(X, seed = 1, sweeps = 2)

str(fit, max.level = 1)
head(fit$S)
```

The returned object contains the estimated unmixing matrix `W`, the estimated components `S`, the centering vector `mu`, such that $S = (X - \mu)W^{T}$, and some additional optimization diagnostics.

```{r dcorica-result}
fit$W
pairs(fit$S, main = "Estimated components from dcorICA()")
```

For larger data sets, `block_size` can be passed to `dcorICA()` so that the
distance-correlation objective is evaluated blockwise to reduce memory usage.

```{r dcorica-blockwise, eval = FALSE}
fit_block <- dcorICA(X, seed = 1, sweeps = 2, block_size = 128L)
```

A robust variant of the ICA method is obtained by computing the dependencies inside the algorithm after applying the bowl transformation; see Leyder et al. (2026) for details. Robust location and
scatter estimates should then also be supplied. For robust whitening, the scatter
matrices should have the independence property; alternatively, one should make
an assumption such as at most one independent component being skewed.

Below, a robust ICA example is shown, although the code is not evaluated by default because it depends on the optional **robustbase** package.

```{r dcorica-robust, eval = FALSE}
if (requireNamespace("robustbase", quietly = TRUE)) {
  mcd <- robustbase::covMcd(X)

  fit_robust <- dcorICA(
    X,
    mu = mcd$center,
    scatter = mcd$cov,
    transform = "bowl",
    seed = 1,
    sweeps = 2
  )

  head(fit_robust$S)
}
```

## Serial dependence diagnostics

The package also provides tools for detecting serial dependence in univariate
time series. The function `dacf_curve()` computes distance autocovariance or
distance autocorrelation over a set of lags and returns an object that can be
plotted as a dependogram.

A small real-data example is the annual flow of the river Nile at Aswan,
available as the `Nile` time series in the **datasets** package:

```{r dacf-curve}
data(Nile)

curve <- dacf_curve(Nile, lags = 1:12, measure = "dcor")
curve$estimate
plot(curve, type = "line")
```

The function `dcor_serial_test()` performs a permutation-based portmanteau test for serial dependence using lagwise distance covariance or distance correlation values. Two types of tests are implemented: the classical Box-Pierce test (`BP`) and the kernel-weighted Fokianos-Pitsillou statistic (`FP`). For more details, see Fokianos and Pitsillou (2017).

`dcor_serial_test()` returns an object of the class `"sdt"`, a serial dependence test object, which can be used to plot the accompanying dependogram.

```{r dcor-serial-test}
test_dcor <- dcor_serial_test(
  Nile,
  type = "FP",
  measure = "dcor",
  lags = 1:6,
  B = 99,
  seed = 1
)

test_dcor
plot(test_dcor)
```

For final analyses, it is advised to use a larger number of permutations, for example `B = 2000`
or more, to obtain more stable p-values.
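
The role of `B` can be seen in a minimal base R sketch of a permutation-based portmanteau test. The helper `perm_portmanteau()` is hypothetical and is not the procedure implemented in `dcor_serial_test()`. It assumes a Box-Pierce-type statistic, `n` times the sum of squared lagwise values, and uses the ordinary Pearson autocorrelation as a stand-in for distance correlation so that no package function is needed.

```{r sketch-permutation-test}
# Hypothetical helper (not dcor_serial_test()): permutation p-value for a
# Box-Pierce-type sum of squared lagwise dependence values. Pearson
# autocorrelation is a stand-in for distance correlation here.
perm_portmanteau <- function(x, lags = 1:6, B = 99,
                             stat_fun = function(z, k) {
                               cor(head(z, -k), tail(z, -k))
                             }) {
  x <- as.numeric(x)
  n <- length(x)
  portmanteau <- function(z) {
    n * sum(vapply(lags, function(k) stat_fun(z, k)^2, numeric(1)))
  }
  t_obs <- portmanteau(x)
  # Permuting the series destroys serial dependence but keeps the
  # marginal distribution.
  t_perm <- replicate(B, portmanteau(sample(x)))
  list(
    statistic = t_obs,
    p.value = (1 + sum(t_perm >= t_obs)) / (B + 1)
  )
}

set.seed(1)
perm_portmanteau(Nile, B = 99)$p.value
```

With this definition the p-value is always a multiple of `1 / (B + 1)`, so `B = 99` cannot resolve p-values below 0.01, and the estimate itself varies from one set of permutations to the next. Both effects shrink as `B` grows.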

The package also includes a serial-dependence test based on the Hilbert-Schmidt Independence Criterion of Gretton et al. (2005) and the test of Hong (1996). It uses the same `"sdt"` plotting interface.

```{r hsic-serial-test}
test_hsic <- hsic_serial_test(
  Nile,
  lags = 1:6,
  type = "H96",
  B = 99,
  seed = 1
)

test_hsic
plot(test_hsic, type = "line")
```

## Normalized HSIC

The helper function `nHSIC()` computes a normalized Hilbert-Schmidt independence criterion. Like distance correlation, it is intended to behave as a scale-free dependence measure.

```{r nhsic}
set.seed(6)
x <- matrix(rnorm(200), ncol = 1)
y <- x^2 + 0.3 * rnorm(200)
z <- matrix(rnorm(200), ncol = 1)

c(
  dependent = nHSIC(x, y),
  independent = nHSIC(x, z)
)
```
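
One common way to normalize HSIC can be sketched in base R. The helper `nhsic_sketch()` below is hypothetical, and the definition used by `nHSIC()` may differ. The sketch assumes Gaussian kernels with a median-heuristic bandwidth and divides the biased HSIC estimate by the geometric mean of the two self-HSIC values, in analogy with the normalization of distance covariance.

```{r sketch-nhsic}
# Hypothetical helper (not the package's nHSIC()): normalized HSIC with
# Gaussian kernels; the median-heuristic bandwidth is an assumption.
nhsic_sketch <- function(x, y) {
  gram <- function(z) {
    d2 <- as.matrix(dist(z))^2
    sigma2 <- median(d2[upper.tri(d2)])
    exp(-d2 / sigma2)
  }
  # Double-centre a Gram matrix (same as H K H with H = I - 11'/n).
  center <- function(K) {
    sweep(sweep(K, 1, rowMeans(K)), 2, colMeans(K)) + mean(K)
  }
  Kc <- center(gram(x))
  Lc <- center(gram(y))
  sum(Kc * Lc) / sqrt(sum(Kc * Kc) * sum(Lc * Lc))
}

# Deterministic toy data: a symmetric grid and its square.
t_grid <- seq(-2, 2, length.out = 101)
c(
  quadratic = nhsic_sketch(t_grid, t_grid^2),
  identical = nhsic_sketch(t_grid, t_grid)
)
```

Under this normalization the value lies between 0 and 1 and equals 1 when both arguments are the same sample.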

## Overview of functions

| Task | Function |
|---|---|
| Distance correlation or covariance | `dcor_large()`, `dcov_large()`, `dcov2_large()` |
| Distance autocorrelation | `dacor_large()`, `dacov_large()`, `dacov2_large()` |
| Robust bounded transformations | `bowl_transform()`, `biloop_transform()` |
| Independent component analysis | `dcorICA()` |
| Dependogram | `dacf_curve()` |
| Distance-correlation serial-dependence test | `dcor_serial_test()` |
| HSIC serial-dependence test | `hsic_serial_test()` |
| Normalized HSIC | `nHSIC()` |

## References

Fokianos, K. and Pitsillou, M. (2017). Consistent testing for pairwise dependence in time series. *Technometrics*, 59(2), 262--270.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. *International Conference on Algorithmic Learning Theory*, 63--77.

Hong, Y. (1996). Consistent testing for serial correlation of unknown form. *Econometrica*, 64(4), 837--864.

Leyder, S., Raymaekers, J., and Rousseeuw, P. J. (2025). Robust distance covariance. *International Statistical Review*, 94(1), 1--25.

Leyder, S., Raymaekers, J., Rousseeuw, P. J., Van Deuren, T., and Verdonck, T.
(2026). Independent component analysis by robust distance correlation. *Advances
in Data Analysis and Classification*.

Matteson, D. S. and Tsay, R. S. (2017). Independent component analysis via
distance covariance. *Journal of the American Statistical Association*, 112(518),
623--637.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing
dependence by correlation of distances. *Annals of Statistics*, 35(6),
2769--2794.

