| Title: | Clustering and Classification Inference with U-Statistics | 
| Version: | 1.0.0 | 
| Description: | Clustering and classification inference for high dimension low sample size (HDLSS) data with U-statistics. The package contains implementations of nonparametric statistical tests for sample homogeneity, group separation, clustering, and classification of multivariate data. The methods have high statistical power and are tailored for data in which the dimension L is much larger than sample size n. See Gabriela B. Cybis, Marcio Valk and Sílvia RC Lopes (2018) <doi:10.1080/00949655.2017.1374387>, Marcio Valk and Gabriela B. Cybis (2020) <doi:10.1080/10618600.2020.1796398>, Debora Z. Bello, Marcio Valk and Gabriela B. Cybis (2021) <doi:10.48550/arXiv.2106.09115>. | 
| Depends: | R (≥ 3.4.0),dendextend,robcor | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.1.1 | 
| Suggests: | testthat | 
| NeedsCompilation: | no | 
| Packaged: | 2021-06-18 17:26:56 UTC; gabrielacybis | 
| Author: | Gabriela Cybis [aut, cre], Marcio Valk [aut], Kazuki Yokoyama [ctb], Debora Zava Bello [ctb] | 
| Maintainer: | Gabriela Cybis <gcybis@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2021-06-18 22:10:02 UTC | 
Computes Bn Statistic.
Description
Returns the value for the Bn statistic that measures the degree of separation between two groups. The statistic is computed through the difference of average within group distances to average between group distances. Large values of Bn indicate large group separation. Under overall sample homogeneity we have E(Bn)=0.
Usage
bn(group_id, md = NULL, data = NULL)
Arguments
| group_id | A vector of 0s and 1s indicating to which group the samples belong. Must be in the same order as data or md. | 
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
Details
Either data OR md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance, which is compatible with
is_homo, uclust and uhclust.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
Value
Value of the Bn statistic.
Examples
n=5
x=matrix(rnorm(n*10),ncol=10)
bn(c(1,0,0,0,0),data=x)     # option (a) entering the data matrix directly
md=as.matrix(dist(x))^2
bn(c(0,1,1,1,1),md)         # option (b) entering the distance matrix
Computes Bn Statistic for 3 Groups.
Description
Returns the value for the Bn statistic that measures the degree of separation between three groups. The statistic is computed as a combination of differences of average within group and between group distances. Large values of Bn indicate large group separation. Under overall sample homogeneity we have E(Bn)=0.
Usage
bn3(group_id, md = NULL, data = NULL)
Arguments
| group_id | A vector of 1s, 2s and 3s indicating to which group the samples belong. Must be in the same order as data or md. | 
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
Details
Either data OR md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Value
Value of the Bn3 statistic.
Examples
n=7
set.seed(1234)
x=matrix(rnorm(n*10),ncol=10)
bn3(c(1,2,2,2,3,3,3),data=x)     # option (a) entering the data matrix directly
md=as.matrix(dist(x))^2
bn3(c(1,2,2,2,3,3,3),md)         # option (b) entering the distance matrix
U-statistic based homogeneity test
Description
Homogeneity test based on the statistic bn. The test assesses whether there exists a data partition
for which group separation is statistically significant according to the U-test. The null hypothesis
is overall sample homogeneity, and a sample is considered homogeneous if it cannot be divided into
two statistically significant subgroups.
Usage
is_homo(md = NULL, data = NULL, rep = 10)
Arguments
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| rep | Number of times to repeat optimization procedure. Important for problems with multiple optima. | 
Details
This is the homogeneity test of Cybis et al. (2017) extended to account for groups of size 1. The test is performed through two steps: an optimization procedure that finds the data partition that maximizes the standardized Bn and a test for the resulting maximal partition. Should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, and thus, p-values may vary a bit in different runs.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
Value
Returns a list with the following elements:
- minFobj
- Test statistic. Minimum of the objective function for optimization (-stdBn). 
- group1
- Elements in group 1 in the maximal partition. (obs: this is not the best partition for the data, see - uclust)
- group2
- Elements in group 2 in the maximal partition. 
- p.MaxTest
- P-value for the homogeneity test. 
- Rep.Fobj
- Values for the minimum objective function on all - repoptimization runs.
- bootB
- Resampling variance estimate for partitions with groups of size n/2 (or (n-1)/2 and (n+1)/2 if n is odd). 
- bootB1
- Resampling variance estimate for partitions with one group of size 1. 
Examples
x = matrix(rnorm(500000),nrow=50)  #creating homogeneous Gaussian dataset
res = is_homo(data=x)
x[1:30,] = x[1:30,]+0.15   #Heterogeneous dataset (first 30 samples have different mean)
res = is_homo(data=x)
md = as.matrix(dist(x)^2)   #squared Euclidean distances for the same data
res = is_homo(md)
# Multidimensional sacling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x,y, main=paste("Homogeneity test: p-value =",res$p.MaxTest))
U-statistic based homogeneity test for 3 groups
Description
Homogeneity test based on the statistic bn3. The test assesses whether there exists a data partition for which three group separation is statistically significant according to utest3. The null hypothesis is overall sample homogeneity, and a sample is considered homogeneous if it cannot be divided into three groups with at least one significantly different from the others.
Usage
is_homo3(md = NULL, data = NULL, rep = 20, test_max = TRUE, alpha = 0.05)
Arguments
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| rep | Number of times to repeat optimization procedure. Important for problems with multiple optima. | 
| test_max | Logical indicating whether to employ the max test | 
| alpha | Significance level | 
Details
This is the homogeneity test of Bello et al. (2021). The test is performed through two steps: an optimization procedure that finds the data partition that maximizes the standardized Bn and a test for the resulting maximal partition. Should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, and thus, p-values may vary a bit in different runs.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Value
Returns a list with the following elements:
- stdBn
- Test statistic. Maximum standardized Bn. 
- group1
- Elements in group 1 in the maximal partition. (obs: this is not the best partition for the data, see - uclust3)
- group2
- Elements in group 2 in the maximal partition. 
- group3
- Elements in group 3 in the maximal partition. 
- pvalue.Bonferroni
- P-value for the homogeneity test. 
- alpha_Bonferroni
- Alpha after Bonferroni correction 
- bootB
- Resampling variance estimate for partitions with central group sizes. 
- bootB1
- Resampling variance estimate for partitions with one group of size 1. 
- varBn
- Estimated variance of Bn for maximal standardized Bn configuration. 
Examples
set.seed(123)
x = matrix(rnorm(70000),nrow=7)  #creating homogeneous Gaussian dataset
res = is_homo3(data=x)
res
#uncomment to run
# x = matrix(rnorm(18000),nrow=18)
# x[1:5,] = x[1:5,]+0.5 #Heterogeneous dataset (first 5 samples have different mean)
# x[6:9,] = x[6:9,]+1.5
# res = is_homo3(data=x)
#  res
# md = as.matrix(dist(x)^2) #squared Euclidean distances for the same data
# res = is_homo3(md)       # uncomment to run
# Multidimensional sacling plot of distance matrix
#fit <- cmdscale(md, eig = TRUE, k = 2)
#x <- fit$points[, 1]
#y <- fit$points[, 2]
#plot(x,y, main=paste("Homogeneity test: p-value =",res$p.MaxTest))
Plot function for the result of uhclust
Description
This function plots the p-value annotated dendrogram resulting from uhclust
Usage
plot_uhclust(
  uhclust,
  pvalues_cex = 0.8,
  pvalues_dx = 2,
  pvalues_dy = 0.08,
  print_pvalues = TRUE
)
Arguments
| uhclust | Result from  | 
| pvalues_cex | Graphical parameter for p-value font size. | 
| pvalues_dx | Graphical parameter for p-value position shift on x axis. | 
| pvalues_dy | Graphical parameter for p-value position shift on y axis. | 
| print_pvalues | Logical. Should the p-values be printed? | 
Examples
x = matrix(rnorm(100000),nrow=50)
x[1:35,] = x[1:35,]+0.7
x[1:15,] = x[1:15,]+0.4
res = uhclust(data=x, plot=FALSE)
plot_uhclust(res)
Simple print method for utest_classify objects.
Description
Simple print method for utest_classify objects.
Usage
## S3 method for class 'utest_classify'
print(x, ...)
Arguments
| x | utest_classify object | 
| ... | additional parameters passed to the function | 
Optimization function with multiple staring points (for local optima)
Description
Finds the configuration with max Bn among all configurations.
Usage
rep_optimBn(mdm, rep = 15, bootB = -1)
Arguments
| mdm | Matrix of squared Euclidean distances between all data points. | 
| rep | Number of replications | 
| bootB | Result of previous bootstrap (if available). If, -1, a new bootstrap is performed for the variance of Bn. | 
U-statistic based significance clustering
Description
Partitions the sample into the two significant subgroups with the largest Bn statistic. If no significant partition exists, the test will return "homogeneous".
Usage
uclust(md = NULL, data = NULL, alpha = 0.05, rep = 15)
Arguments
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| alpha | Significance level. | 
| rep | Number of times to repeat optimization procedures. Important for problems with multiple optima. | 
Details
This is the significance clustering procedure of Valk and Cybis (2018).
The method first performs a homogeneity test to verify whether the data can be significantly
partitioned. If the hypothesis of homogeneity is rejected, then the method will search, among all
the significant partitions, for the partition that better separates the data, as measured by larger
bn statistic. This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, and thus, p-values may vary a bit in different runs.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics."
Journal of Statistical Computation and Simulation 88.10 (2018)
and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
See also is_homo, uhclust, Utest_class.
Value
Returns a list with the following elements:
- cluster1
- Elements in group 1 in the final partition. This is the significant partition with maximal Bn, if sample is heterogeneous. 
- cluster2
- Elements in group 2 in the final partition. 
- p.value
- P-value for the test that renders the final partition, if heterogeneous. Homogeneity test p-value, if homogeneous. 
- alpha_corrected
- Bonferroni corrected significance level for the test that renders the final partition, if heterogeneous. Homogeneity test significance level, if homogeneous. 
- n1
- Size of the smallest cluster 
- ishomo
- Logical, returns - TRUEwhen the sample is homogeneous.
- Bn
- Value of Bn statistic for the final partition, if heterogeneous. Value of Bn statistic for the maximal homogeneity test partition, if homogeneous. 
- varBn
- Variance estimate for final partition, if heterogeneous. Variance estimate for the maximal homogeneity test partition, if homogeneous. 
- ishomoResult
- Result of homogeneity test (see - is_homo).
Examples
set.seed(17161)
x = matrix(rnorm(100000),nrow=50)  #creating homogeneous Gaussian dataset
res = uclust(data=x)
x[1:30,] = x[1:30,]+0.25   #Heterogeneous dataset (first 30 samples have different mean)
res = uclust(data=x)
md = as.matrix(dist(x)^2)   #squared Euclidean distances for the same data
res = uclust(md)
# Multidimensional scaling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
col=rep(3,dim(md)[1])
col[res$cluster2]=2
plot(x,y, main=paste("Multidimensional scaling plot of data:
                    homogeneity p-value =",res$ishomoResult$p.MaxTest),col=col)
U-statistic based significance clustering for three way partitions
Description
Partitions data into three groups only when these partitions are statistically significant. If no significant partition exists, the test will return "homogeneous".
Usage
uclust3(md = NULL, data = NULL, alpha = 0.05, rep = 15)
Arguments
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| alpha | Significance level. | 
| rep | Number of times to repeat optimization procedures. Important for problems with multiple optima. | 
Details
This is the significance clustering procedure of Bello et al. (2021).
The method first performs a homogeneity test to verify whether the data can be significantly
partitioned. If the hypothesis of homogeneity is rejected, then the method will search, among all
the significant partitions, for the partition that better separates the data, as measured by larger
bn statistic. This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, and thus, p-values may vary a bit in different runs.
For more detail see
Bello, Debora Zava,  Marcio Valk and Gabriela Bettella Cybis.
"Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
See also is_homo3, uclust.
Value
Returns a list with the following elements:
- groups
- List with elements of final three groups 
- p.value
- P-value for the test that renders the final partition, if heterogeneous. Homogeneity test p-value, if homogeneous. 
- alpha_corrected
- Bonferroni corrected significance level for the test that renders the final partition, if heterogeneous. Homogeneity test significance level, if homogeneous. 
- ishomo
- Logical, returns - TRUEwhen the sample is homogeneous.
- Bn
- Value of Bn statistic for the final partition, if heterogeneous. Value of Bn statistic for the maximal homogeneity test partition, if homogeneous. 
- varBn
- Variance estimate for final partition, if heterogeneous. Variance estimate for the maximal homogeneity test partition, if homogeneous. 
Examples
set.seed(123)
x = matrix(rnorm(70000),nrow=7)  #creating homogeneous Gaussian dataset
res = uclust3(data=x)
res
# uncomment to run
# x = matrix(rnorm(15000),nrow=15)
# x[1:6,] = x[1:6,]+1.5 #Heterogeneous dataset (first 5 samples have different mean)
# x[7:12,] = x[7:12,]+3
# res = uclust3(data=x)
# res$groups
U-statistic based significance hierarchical clustering
Description
Hierarchical clustering method that partitions the data only when these partitions are statistically significant.
Usage
uhclust(md = NULL, data = NULL, alpha = 0.05, rep = 15, plot = TRUE)
Arguments
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| alpha | Significance level. | 
| rep | Number of times to repeat optimization procedures. Important for problems with multiple optima. | 
| plot | Logical,  | 
Details
This is the significance hierarchical clustering procedure of Valk and Cybis (2018). The data are
repeatedly partitioned into two subgroups, through function uclust, according to a hierarchical scheme.
The procedure stops when resulting subgroups are homogeneous or have fewer than 3 elements.
This function should be used in high dimension small sample size settings.
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean distance.
Variance of bn is estimated through resampling, and thus, p-values may vary a bit in different runs.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
See also is_homo, uclust and Utest_class.
Value
Returns an object of class hclust with three additional attribute arrays:
- Pvalues
- P-values from uclust for the final data partition at each node of the dendrogram. This array is in the same order of - height, and only contains values for tests that were performed.
- alpha
- Bonferroni corrected significance levels for uclust for the data partitions at each node of the dendrogram. This array is in the same order of - height, and only contains values for tests that were performed.
- groups
- Final group assignments. 
Examples
x = matrix(rnorm(100000),nrow=50)  #creating homogeneous Gaussian dataset
res = uhclust(data=x)
x[1:30,] = x[1:30,]+0.7   #Heterogeneous dataset
x[1:10,] = x[1:10,]+0.4
res = uhclust(data=x)
res$groups
U test
Description
Test for the separation of two groups. The null hypothesis states that the groups are homogeneous and the alternative hypothesis states that they are separate.
Usage
utest(group_id, md = NULL, data = NULL, numB = 1000)
Arguments
| group_id | A vector of 0s and 1s indicating to which group the samples belong. Must be in the same order as data or md. | 
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| numB | Number of resampling iterations. | 
Details
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean
distance, which is compatible with is_homo, uclust and
uhclust.
For more details see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018)
Value
Returns a list with the following elements:
- Bn
- Test Statistic 
- Pvalue
- Replication based p-value 
- Replication
- Number of replications used to compute p-value 
See Also
Examples
# Simulate a dataset with two separate groups, the first 5 rows have mean 0 and
# the last 5 rows have mean 5.
data <- matrix(c(rnorm(75, 0), rnorm(75, 5)), nrow = 10, byrow=TRUE)
# U test for mixed up groups
utest(group_id=c(1,0,1,0,1,0,1,0,1,0), data=data, numB=3000)
# U test for correct group definitions
utest(group_id=c(1,1,1,1,1,0,0,0,0,0), data=data, numB=3000)
U-test for three groups
Description
Test for the separation of three groups. The null hypothesis states that the groups are homogeneous and the alternative hypothesis states that at least one is separated from the others.
Usage
utest3(group_id, md = NULL, data = NULL, alpha = 0.05, numB = 1000)
Arguments
| group_id | A vector of 1s, 2s and 3s indicating to which group the samples belong. Must be in the same order as data or md. | 
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| alpha | Significance level | 
| numB | Number of resampling iterations. | 
Details
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean
distance.
For more detail see Bello, Debora Zava, Marcio Valk and Gabriela Bettella Cybis. "Clustering inference in multiple groups." arXiv preprint arXiv:2106.09115 (2021).
Value
Returns a list with the following elements:
- is.homo
- Logical of whether test indicates that data is homogeneous 
- Pvalue
- Replication based p-value 
- Bn
- Test Statistic 
- sdBn
- Standard error for Bn statistic computed through resampling 
See Also
Examples
# Simulate a dataset with two separate groups,
# the first row has mean -4, the next 5 rows have mean 0 and the last 5 rows have mean 4.
data <- matrix(c(rnorm(15, -4),rnorm(75, 0), rnorm(75, 4)), nrow = 11, byrow=TRUE)
# U test for mixed up groups
utest3(group_id=c(1,2,3,1,2,3,1,2,3,1,2), data=data, numB=3000)
# U test for correct group definitions
utest3(group_id=c(1,2,2,2,2,2,3,3,3,3,3), data=data, numB=3000)
Test for classification of a sample in one of two groups.
Description
The null hypothesis is that the new data is not well classified into the first group when compared to the second group. The alternative hypothesis is that the data is well classified into the first group.
Usage
utest_classify(x, data, group_id, bootstrap_iter = 1000)
Arguments
| x | A numeric vector to be classified. | 
| data | Data matrix. Each row represents an observation. | 
| group_id | A vector of 0s (first group) and 1s indicating to which group the samples belong. Must be in the same order as data. | 
| bootstrap_iter | Numeric scalar. The number of bootstraps. It's recommended
 | 
Details
The test is performed considering the squared Euclidean distance.
For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." arXiv preprint arXiv:1805.12179 (2018).
Value
A list with class "utest_classify" containing the following components:
| statistic | the value of the test statistic. | 
| p_value | The p-value for the test. | 
| bootstrap_iter | the number of bootstrap iterations. | 
Examples
# Example 1
# Five observations from each group, G1 and G2. Each observation has 60 dimensions.
data <- matrix(c(rnorm(300, 0), rnorm(300, 10)), ncol = 60, byrow=TRUE)
# Test data comes from G1.
x <- rnorm(60, 0)
# The test correctly indicates that the test data should be classified into G1 (p < 0.05).
utest_classify(x, data, group_id = c(rep(0,times=5),rep(1,times=5)))
# Example 2
# Five observations from each group, G1 and G2. Each observation has 60 dimensions.
data <- matrix(c(rnorm(300, 0), rnorm(300, 10)), ncol = 60, byrow=TRUE)
# Test data comes from G2.
x <- rnorm(60, 10)
# The test correctly indicates that the test data should be classified into G2 (p > 0.05).
utest_classify(x, data, group_id = c(rep(1,times=5),rep(0,times=5)))
Variance of Bn
Description
Estimates the variance of the Bn statistic using the resampling procedure described in Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." Journal of Computational and Graphical Statistics 30(1) (2021).
Usage
var_bn(group_sizes, md = NULL, data = NULL, numB = 2000)
Arguments
| group_sizes | A vector with two entries: size of group 1 and size of group 2. | 
| md | Matrix of distances between all data points. | 
| data | Data matrix. Each row represents an observation. | 
| numB | Number of resampling iterations. Only used if no groups are of size 1. | 
Details
Either data or md should be provided.
If data are entered directly, Bn will be computed considering the squared Euclidean
distance, which is compatible with is_homo, uclust and
uhclust.
Value
Variance of Bn
See Also
Examples
n=5
x=matrix(rnorm(n*20),ncol=20)
# option (a) entering the data matrix directly and considering a group of size 1
var_bn(c(1,4),data=x)
# option (b) entering the distance matrix and considering a groups of size 2 and 3
md=as.matrix(dist(x))^2
var_bn(c(2,3),md)