Data depth and rank-based tests for HPD matrices

Joris Chau

2017-12-09

Introduction

In second-order stationary multivariate time series analysis, non-degenerate autocovariance matrices or spectral density matrices at the Fourier frequencies are necessarily elements of the space of Hermitian positive definite (HPD) matrices. In (Chau, Ombao, and von Sachs 2017), we generalize the classical concept of data depth for Euclidean vectors to intrinsic manifold data depth for matrix-valued observations in the non-Euclidean space of HPD matrices. Data depth is an important tool in statistical data analysis that measures how central a point is with respect to a data cloud or probability distribution. In this way, data depth provides a center-to-outward ordering of multivariate data observations, generalizing the notion of a rank for univariate observations.

The proposed data depth measures can be used to characterize central regions or detect outlying observations in samples of HPD matrices, such as collections of covariance or spectral density matrices. The depth functions also provide a practical framework to perform rank-based hypothesis testing for samples of HPD matrices by replacing the usual ranks by their depth-induced counterparts. Other applications of data depth include the construction of confidence regions, clustering, or classification for samples of HPD matrices.

In this vignette we demonstrate the use of the functions pdDepth() and pdRankTests() to compute data depth values of HPD matrix-valued observations and perform rank-based hypothesis testing for samples of HPD matrices, where the space of HPD matrices can be equipped with several different metrics, e.g. the invariant Riemannian metric discussed in (Chau, Ombao, and von Sachs 2017).

Shiny app

A demo Shiny app for data depth and rank-based tests in the context of samples of HPD matrices is available at https://jchau.shinyapps.io/pdSpecEst/. Under the tab ‘Data depth’, the user can examine, for simulated samples of HPD matrices, the performance of the different rank-based tests (via pdRankTests()) based on depth-induced ranks computed with pdDepth().

Data depth of HPD matrices with pdDepth()

First, we generate a pointwise random sample of (2,2)-dimensional HPD matrix-valued observations using the exponential map Expm(), with underlying geometric (i.e. Karcher or Fréchet) mean equal to the identity matrix diag(2). Second, we generate a random sample of sequences (curves) of (2,2)-dimensional HPD matrix-valued observations, with underlying geometric mean curve equal to an array of rescaled identity matrices. We can think of the first sample as a random collection of covariance matrices, and the second sample as a random collection of spectral matrices along frequency.

library(pdSpecEst)
set.seed(100)

## Pointwise random sample
X1 <- replicate(50, Expm(diag(2), H.coeff(0.5 * rnorm(4), inverse = T))) 
str(X1)
#>  cplx [1:2, 1:2, 1:50] 0.7794+0i -0.0314-0.0523i -0.0314+0.0523i ...

## Curve random sample
X2 <- replicate(50, sapply(1:5, function(i) Expm(i * diag(2), H.coeff(0.5 * rnorm(4), inverse = T) / i), simplify = "array"))
str(X2)
#>  cplx [1:2, 1:2, 1:5, 1:50] 1.074+0i 0.352+0.147i 0.352-0.147i ...

Remark: The function H.coeff() converts (real-valued) basis components to Hermitian matrices via an inverse orthonormal basis expansion on the real vector space of Hermitian matrices.
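
For instance, the correspondence can be checked with a short round trip (a minimal sketch, assuming the forward transform, i.e. the default inverse = FALSE, returns the real basis components; output not shown):

## Real basis coefficients -> (2,2) Hermitian matrix and back
h <- H.coeff(rnorm(4), inverse = TRUE)  ## inverse expansion: 4 real coefficients to a Hermitian matrix
isTRUE(all.equal(h, Conj(t(h))))        ## h is Hermitian by construction
H.coeff(h)                              ## forward expansion recovers the original coefficients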

pdDepth() computes the data depth of a single HPD matrix (resp. curve of HPD matrices) y with respect to a sample of HPD matrices (resp. sample of curves of HPD matrices) X. The function computes the intrinsic data depths based on the metric space of HPD matrices equipped with one of the following metrics: (i) Riemannian metric (default), (ii) log-Euclidean metric, the Euclidean inner product between matrix logarithms, (iii) Cholesky metric, the Euclidean inner product between Cholesky decompositions, (iv) Euclidean metric and (v) root-Euclidean metric, the Euclidean inner product between Hermitian square root matrices. The default choice (Riemannian) has several appealing properties not shared by the other metrics. See (Chau, Ombao, and von Sachs 2017) for more details and additional properties of the available manifold depth functions.
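
The metric is selected through the metric argument. As a quick illustration (a minimal sketch; the option string "logEuclidean" matches the usage of pdRankTests() later in this vignette, while the option string "Riemannian" for the default is an assumption; output not shown):

## Geodesic distance depth of the identity under two different metrics
pdDepth(y = diag(2), X = X1, method = "gdd", metric = "Riemannian")   ## default
pdDepth(y = diag(2), X = X1, method = "gdd", metric = "logEuclidean")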

## Pointwise depth
pdDepth(y = diag(2), X = X1, method = "gdd") ## geodesic distance depth
#> [1] 0.4326915
pdDepth(y = diag(2), X = X1, method = "zonoid") ## manifold zonoid depth
#> [1] 0.7600822
pdDepth(y = diag(2), X = X1, method = "spatial") ## manifold spatial depth
#> [1] 0.7932682

## Integrated depth 
pdDepth(y = sapply(1:5, function(i) i * diag(2), simplify = "array"), X = X2, method = "gdd") 
#> [1] 0.751167
pdDepth(y = sapply(1:5, function(i) i * diag(2), simplify = "array"), X = X2, method = "zonoid") 
#> [1] 0.8631369
pdDepth(y = sapply(1:5, function(i) i * diag(2), simplify = "array"), X = X2, method = "spatial") 
#> [1] 0.8825155

We can also compute the data depth of each individual object in X with respect to the sample X itself by leaving the argument y in the function pdDepth() unspecified.

(dd1 <- pdDepth(X = X1, method = "gdd")) ## pointwise geodesic distance depth
#>  [1] 0.4136872 0.4072360 0.4025387 0.4044413 0.2559378 0.3873209 0.3832640
#>  [8] 0.3002407 0.4034977 0.3358114 0.2597040 0.3120139 0.1967228 0.1876342
#> [15] 0.1922631 0.2483946 0.3696490 0.3386755 0.2068996 0.2058298 0.2020919
#> [22] 0.3757002 0.2523476 0.2142259 0.2864097 0.3353675 0.3192103 0.3354539
#> [29] 0.3210167 0.3004936 0.3990564 0.3107404 0.3665464 0.2987992 0.4306943
#> [36] 0.2448184 0.3170413 0.2921210 0.4100427 0.4292756 0.4257262 0.3645616
#> [43] 0.3668837 0.2471900 0.2340217 0.3391741 0.3632063 0.3623502 0.2293419
#> [50] 0.2719987

(dd2 <- pdDepth(X = X2, method = "gdd")) ## integrated geodesic distance depth
#>  [1] 0.7001805 0.6885243 0.5746611 0.7066090 0.6931021 0.6423633 0.6738678
#>  [8] 0.6197673 0.5654389 0.7123940 0.6469901 0.7158848 0.6487223 0.6243373
#> [15] 0.7075508 0.6356606 0.6654029 0.7162644 0.6380360 0.7091247 0.6595957
#> [22] 0.6520224 0.6627836 0.6932087 0.6783008 0.6798760 0.6490358 0.7024616
#> [29] 0.6324775 0.6889687 0.6728189 0.6927310 0.6464986 0.6492694 0.6591718
#> [36] 0.6874394 0.6743475 0.6365783 0.6644634 0.7112869 0.6935939 0.7361826
#> [43] 0.7209363 0.6161036 0.6952725 0.6470956 0.6250286 0.7059682 0.7125312
#> [50] 0.7113942

A center-to-outwards ordering of the individual objects is then obtained by computing the data depth induced ranks, with the most central observation having smallest rank and the most outlying observation having largest rank.

(dd1.ranks <- rank(1 - dd1)) ## pointwise depth ranks
#>  [1]  4  6  9  7 37 11 12 31  8 22 36 28 48 50 49 39 14 21 45 46 47 13 38
#> [24] 44 34 24 26 23 25 30 10 29 16 32  1 41 27 33  5  2  3 17 15 40 42 20
#> [47] 18 19 43 35

(dd2.ranks <- rank(1 - dd2)) ## integrated depth ranks
#>  [1] 14 21 49 11 18 40 26 47 50  6 38  4 36 46 10 43 28  3 41  9 31 33 30
#> [24] 17 24 23 35 13 44 20 27 19 39 34 32 22 25 42 29  8 16  1  2 48 15 37
#> [47] 45 12  5  7

## Explore sample X1
head(order(dd1.ranks)) ## most central observations 
#> [1] 35 40 41  1 39  2
rev(tail(order(dd1.ranks))) ## most outlying observations
#> [1] 14 15 13 21 20 19
X1[ , , which(dd1.ranks == 1)] ## most central HPD matrix 
#>                       [,1]                  [,2]
#> [1,]  0.9407902+0.0000000i -0.0154483+0.1752587i
#> [2,] -0.0154483-0.1752587i  1.2710700+0.0000000i
X1[ , , which(dd1.ranks == 50)] ## most outlying HPD matrix
#>                     [,1]                [,2]
#> [1,]  1.847918+0.000000i -1.288255+1.075924i
#> [2,] -1.288255-1.075924i  2.490724+0.000000i

We can compare the most central HPD matrix above with the (approximate) empirical geometric mean of the observations obtained with pdMean() based on the Riemannian metric. The empirical geometric mean is known to maximize the data depth for observations from a centrally symmetric distribution (as in this example). For more details, see (Chau, Ombao, and von Sachs 2017).

(mean.X1 <- pdMean(X1)) 
#>                         [,1]                    [,2]
#> [1,]  0.92208421+0.00000000i -0.04690471+0.02281304i
#> [2,] -0.04690471-0.02281304i  1.14165705+0.00000000i

pdDepth(y = mean.X1, X = X1, method = "gdd")
#> [1] 0.4407462

Computation times

The figure below displays average computation times in milliseconds (single core, Intel Xeon E5-2650) of the depth of a single (d,d)-dimensional HPD matrix with respect to a sample of (d,d)-dimensional HPD matrices of size n under the Riemannian metric. The computation times of the data depths based on the other metrics are either similar to or faster than the times displayed below; in particular, the computation times of the manifold zonoid and spatial depths are significantly faster.

In the left-hand image, the sample size is fixed at n = 500, and in the right-hand image the dimension is fixed at d = 6. The computation times are based on the median computation time of 100 depth calculations for 50 random samples. The manifold zonoid depth can only be calculated if d^2 < n, which explains the missing values in the left-hand image.
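
A minimal sketch to reproduce such timings (assuming the microbenchmark package, which is not part of pdSpecEst, is installed; the values of d and n are illustrative):

## Time 100 geodesic distance depth calculations for a single HPD matrix
library(microbenchmark)
d <- 6; n <- 500
X <- replicate(n, Expm(diag(d), H.coeff(rnorm(d^2), inverse = TRUE)))
microbenchmark(pdDepth(y = diag(d), X = X, method = "gdd"), times = 100, unit = "ms")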

Rank-based tests for HPD matrices with pdRankTests()

The null hypotheses of the available rank-based hypothesis tests in pdRankTests() are:

"rank.sum": homogeneity of the distributions of two independent samples of HPD matrices (or curves of HPD matrices), via a manifold Wilcoxon rank-sum test.
"krusk.wall": homogeneity of the distributions of more than two independent samples of HPD matrices (or curves of HPD matrices), via a manifold Kruskal-Wallis test.
"signed.rank": homogeneity of the distributions of two paired or matched samples of HPD matrices, via a manifold Wilcoxon signed-rank test.
"bartels": randomness (absence of a trend) within a single independent sample of HPD matrices (or curves of HPD matrices), via a manifold Bartels-von Neumann test.

Below, we construct several simulated examples in which the null hypotheses listed above are either satisfied or violated. Analogous to the previous section, we generate pointwise random samples (resp. random samples of sequences) of (2,2)-dimensional HPD matrix-valued observations, with underlying geometric mean equal to the identity matrix (resp. a sequence of scaled identity matrices).

Instead of equipping the space of HPD matrices with the Riemannian metric (the default), pdRankTests() can also compute rank-based test statistics in the complete metric space of HPD matrices equipped with the Log-Euclidean metric (metric = 'logEuclidean'); a short example follows the rank-sum tests below. The default Riemannian metric is invariant under congruence transformation by any invertible matrix, whereas the Log-Euclidean metric is invariant only under congruence transformation by unitary matrices. The space of HPD matrices equipped with one of the other metrics listed above (i.e. Cholesky, Euclidean, and root-Euclidean) is not a complete metric space, and the derivations of the asymptotic null distributions of the rank-based test statistics break down.

Let us first consider simulated examples of the manifold Wilcoxon rank-sum test ("rank.sum") and manifold Kruskal-Wallis test ("krusk.wall").

## Generate data (null true)
data1 <- array(c(X1, replicate(50, Expm(diag(2), H.coeff(0.5 * rnorm(4), inverse = T)))), dim = c(2, 2, 100)) ## pointwise sample
data2 <- array(c(X2, replicate(50, sapply(1:5, function(i) Expm(i * diag(2), H.coeff(0.5 * rnorm(4), inverse = T) / i), simplify = "array"))), dim = c(2, 2, 5, 100)) ## curve sample

## Generate data (null false)
data1a <- array(c(X1, replicate(50, Expm(diag(2), H.coeff(rnorm(4), inverse = T)))), dim = c(2, 2, 100)) ## pointwise scale change
data2a <- array(c(X2, replicate(50, sapply(1:5, function(i) Expm(i * diag(2), H.coeff(rnorm(4), inverse = T) / i), simplify = "array"))), dim = c(2, 2, 5, 100)) ## curve scale change

## Rank-sum test
pdRankTests(data1, sample.sizes = c(50, 50), "rank.sum")[1:4] ## null true (pointwise)
#> $test
#> [1] "Manifold Wilcoxon rank-sum"
#> 
#> $p.value
#> [1] 0.1037495
#> 
#> $statistic
#> [1] 1.626941
#> 
#> $null.distr
#> [1] "Standard normal distribution"
pdRankTests(data2, sample.sizes = c(50, 50), "rank.sum")[2] ## null true (curve)
#> $p.value
#> [1] 0.9834998
pdRankTests(data1a, sample.sizes = c(50, 50), "rank.sum")[2] ## null false (pointwise)
#> $p.value
#> [1] 6.958285e-11
pdRankTests(data2a, sample.sizes = c(50, 50), "rank.sum")[2] ## null false (curve)
#> $p.value
#> [1] 1.020181e-15
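
As noted earlier, the same tests can also be computed under the Log-Euclidean metric via the metric argument. A minimal sketch, reusing data1 from above (output not shown):

## Rank-sum test in the Log-Euclidean geometry
pdRankTests(data1, sample.sizes = c(50, 50), "rank.sum", metric = "logEuclidean")[2]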

## Kruskal-Wallis test
pdRankTests(data1, sample.sizes = c(50, 25, 25), "krusk.wall")[1:4] ## null true (pointwise)
#> $test
#> [1] "Manifold Kruskal-Wallis"
#> 
#> $p.value
#> [1] 0.1443239
#> 
#> $statistic
#> [1] 3.87139
#> 
#> $null.distr
#> [1] "Chi-squared distribution (df = 2)"
pdRankTests(data2, sample.sizes = c(50, 25, 25), "krusk.wall")[2] ## null true (curve)
#> $p.value
#> [1] 0.1526552
pdRankTests(data1a, sample.sizes = c(50, 25, 25), "krusk.wall")[2] ## null false (pointwise)
#> $p.value
#> [1] 5.495634e-10
pdRankTests(data2a, sample.sizes = c(50, 25, 25), "krusk.wall")[2] ## null false (curve)
#> $p.value
#> [1] 1.033775e-14

To apply the manifold Wilcoxon signed-rank test ("signed.rank"), we generate paired observations for independent trials (or subjects) by introducing trial-specific random effects, such that the paired observations in each trial share a trial-specific geometric mean. Note that for such data the manifold Wilcoxon rank-sum test is no longer valid due to the introduced dependence between samples.

## Trial-specific means
mu <- replicate(50, Expm(diag(2), H.coeff(0.1 * rnorm(4), inverse = T)))

## Generate paired samples X,Y
make_sample <- function(null) sapply(1:50, function(i) Expm(mu[, , i], pdSpecEst:::T_coeff_inv(ifelse(null, 1, 0.5) * rexp(4) - 1, mu[, , i])), simplify = "array") 

X3 <- make_sample(null = T)
Y3 <- make_sample(null = T) ## null true
Y3a <- make_sample(null = F) ## null false (scale change)

## Signed-rank test
pdRankTests(array(c(X3, Y3), dim = c(2, 2, 100)), test = "signed.rank")[1:4] ## null true
#> $test
#> [1] "Manifold Wilcoxon signed-rank"
#> 
#> $p.value
#> [1] 0.2025809
#> 
#> $statistic
#>   V 
#> 505 
#> 
#> $null.distr
#> [1] "Wilcoxon signed rank test with continuity correction"
pdRankTests(array(c(X3, Y3a), dim = c(2, 2, 100)), test = "signed.rank")[2] ## null false
#> $p.value
#> [1] 0.001444632

The manifold signed-rank test also provides a valid procedure to test for equivalence of spectral matrices of two (independent) multivariate stationary time series based on the HPD periodogram matrices obtained via pdPgram(). In contrast to other available tests in the literature, this asymptotic test does not require consistent spectral estimators or resampling/bootstrapping of test statistics, and therefore remains computationally efficient for higher-dimensional spectral matrices or a large number of sampled Fourier frequencies.

## Signed-rank test for equivalence of spectra
## vARMA(1,1) process: Example 11.4.1 in (Brockwell and Davis, 1991)
Phi <- array(c(0.7, 0, 0, 0.6, rep(0, 4)), dim = c(2, 2, 2))
Theta <- array(c(0.5, -0.7, 0.6, 0.8, rep(0, 4)), dim = c(2, 2, 2))
Sigma <- matrix(c(1, 0.71, 0.71, 2), nrow = 2)
pgram <- function(Sigma) pdPgram(rARMA(2^10, 2, Phi, Theta, Sigma)$X)$P ## HPD periodogram

## Null is true
pdRankTests(array(c(pgram(Sigma), pgram(Sigma)), dim = c(2, 2, 2^10)), test = "signed.rank")[2]
#> $p.value
#> [1] 0.3846862

## Null is false
pdRankTests(array(c(pgram(Sigma), pgram(0.5 * Sigma)), dim = c(2, 2, 2^10)), test = "signed.rank")[2]
#> $p.value
#> [1] 1.03424e-38

To apply the manifold Bartels-von Neumann test ("bartels"), we generate an independent but non-identically distributed sample with a trend in the scale of the distribution across observations, such that the null hypothesis of randomness breaks down.

## Null is true
data3 <- replicate(200, Expm(diag(2), H.coeff(rnorm(4), inverse = T))) ## pointwise samples
data4 <- replicate(100, sapply(1:5, function(i) Expm(i * diag(2), H.coeff(rnorm(4), inverse = T) / i), simplify = "array")) ## curve samples

## Null is false
data3a <- sapply(1:200, function(j) Expm(diag(2), H.coeff(((200 - j) / 200 + j * 2 / 200) * rnorm(4), inverse = T)), simplify = "array") ## pointwise trend in scale
data4a <- sapply(1:100, function(j) sapply(1:5, function(i) Expm(i * diag(2), H.coeff(((100 - j) / 100 + j * 2 / 100) * rnorm(4), inverse = T) / i), simplify = "array"), simplify = "array") ## curve trend in scale

## Bartels-von Neumann test
pdRankTests(data3, test = "bartels")[1:4] ## null true (pointwise)
#> $test
#> [1] "Manifold Bartels-von Neumann"
#> 
#> $p.value
#> [1] 0.2641266
#> 
#> $statistic
#> [1] 1.116691
#> 
#> $null.distr
#> [1] "Standard normal distribution"
pdRankTests(data4, test = "bartels")[2] ## null true (curve)
#> $p.value
#> [1] 0.7612759
pdRankTests(data3a, test = "bartels")[2] ## null false (pointwise)
#> $p.value
#> [1] 1.612058e-05
pdRankTests(data4a, test = "bartels")[2] ## null false (curve)
#> $p.value
#> [1] 0.000732231

Computation times

The figures below display average computation times in milliseconds (single core, Intel Xeon E5-2650) of the rank-based hypothesis tests under the Riemannian metric, in terms of the dimension of the matrix-valued data and the size of the samples considered in the tests. The computation times based on the Log-Euclidean metric are either similar to or significantly faster than the times displayed below, in particular for the rank-based tests based on the manifold zonoid or spatial depth.

Manifold Wilcoxon rank-sum test

The figure below displays average computation times of the manifold Wilcoxon rank-sum test for two equal-sized samples of (d,d)-dimensional HPD matrices of size n, with depth-induced ranks based on the geodesic distance depth (gdd), manifold zonoid depth (zonoid), and manifold spatial depth (spatial), respectively.


Manifold Kruskal-Wallis test

The figure below displays average computation times of the manifold Kruskal-Wallis test for three equal-sized samples of (d,d)-dimensional HPD matrices of size n, with depth-induced ranks based on the geodesic distance depth (gdd), manifold zonoid depth (zonoid), and manifold spatial depth (spatial), respectively.


Manifold Bartels-von Neumann test

The figure below displays average computation times of the manifold Bartels-von Neumann test for a single sample of size n of (d,d)-dimensional HPD matrices, with depth-induced ranks based on the geodesic distance depth (gdd), manifold zonoid depth (zonoid), and manifold spatial depth (spatial), respectively.


Manifold Wilcoxon signed-rank test

The figure below displays average computation times of the manifold Wilcoxon signed-rank test for two equal-sized samples of size n of (d,d)-dimensional HPD matrices. This test is not based on data depth, but on a specific intrinsic manifold difference score.

To conclude, we note again that an interactive demo Shiny app to test and tune the different rank-based test procedures detailed in this vignette is available at https://jchau.shinyapps.io/pdSpecEst/.

References

Brockwell, P. J., and R. A. Davis. 1991. Time Series: Theory and Methods. 2nd ed. New York: Springer.

Chau, J., H. Ombao, and R. von Sachs. 2017. “Data Depth and Rank-Based Tests for Covariance and Spectral Density Matrices.” http://arxiv.org/abs/1706.08289.