The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Clustering of unknown subpopulations in admixture models

Xavier Milhaud

library(admix)

The clustering of populations following admixture models is, for now, based on the K-sample test theory (see (Milhaud and Vandekerkhove 2023). Consider \(K\) samples. For \(i=1,...,K\), sample \(X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.\]

We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that cluster the \(K\) populations into \(N\) subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold:

This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the K-sample 2-component mixture clustering (K2MC).

Algorithm

We now detail the steps of the algorithm.

  1. Initialization: create the first cluster to be filled, i.e. \(c = 1\). By convention, \(S_0=\emptyset\).
  2. Select \(\{x,y\}={\rm argmin}\{d_n(i,j); i \neq j \in S \setminus \bigcup_{k=1}^c S_{k-1}\}\).
  3. Test \(H_0\) between \(x\) and \(y\).
If $H_0$ is not rejected then $S_1 = \{x,y\}$,\\
Else $S_1 = \{x\}$, $S_{c+1} = \{y\}$ and then $c=c+1$.
  1. While \(S\setminus \bigcup_{k=1}^c S_k = \emptyset\) do
Select $u={\rm argmin}\{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k\}$;
Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\\
  If $H_0$ not rejected, then put $S_c=S_c\bigcup \{u\}$;\\
  Else $S_{c+1} = \{u\}$ and $c = c+1$.

Applications

On \(\mathbb{R}^+\)

We present a case study with 5 populations to cluster on \(\mathbb{R}^+\), with Gamma-Exponential and Gamma-Gamma mixtures.

## Simulate data (chosen parameters indicate 3 clusters (populations (1,3), (2,5) and 4)!):
list.comp <- list(f1 = "gamma", g1 = "exp",
                  f2 = "gamma", g2 = "exp",
                  f3 = "gamma", g3 = "gamma",
                  f4 = "exp", g4 = "exp",
                  f5 = "gamma", g5 = "exp")
list.param <- list(f1 = list(shape = 16, rate = 4), g1 = list(rate = 1/3.5),
                   f2 = list(shape = 14, rate = 2), g2 = list(rate = 1/5),
                   f3 = list(shape = 16, rate = 4), g3 = list(shape = 12, rate = 2),
                   f4 = list(rate = 1/2), g4 = list(rate = 1/7),
                   f5 = list(shape = 14, rate = 2), g5 = list(rate = 1/6))
A.sim <- rsimmix(n=6000, unknownComp_weight=0.8, comp.dist = list(list.comp$f1,list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
B.sim <- rsimmix(n=6000, unknownComp_weight=0.5, comp.dist = list(list.comp$f2,list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
C.sim <- rsimmix(n=6000, unknownComp_weight=0.65, comp.dist = list(list.comp$f3,list.comp$g3),
                 comp.param = list(list.param$f3, list.param$g3))$mixt.data
D.sim <- rsimmix(n=6000, unknownComp_weight=0.75, comp.dist = list(list.comp$f4,list.comp$g4),
                 comp.param = list(list.param$f4, list.param$g4))$mixt.data
E.sim <- rsimmix(n=6000, unknownComp_weight=0.55, comp.dist = list(list.comp$f5,list.comp$g5),
                 comp.param = list(list.param$f5, list.param$g5))$mixt.data
## Look for the clusters:
list.comp <- list(f1 = NULL, g1 = "exp",
                  f2 = NULL, g2 = "exp",
                  f3 = NULL, g3 = "gamma",
                  f4 = NULL, g4 = "exp",
                  f5 = NULL, g5 = "exp")
list.param <- list(f1 = NULL, g1 = list(rate = 1/3.5),
                   f2 = NULL, g2 = list(rate = 1/5),
                   f3 = NULL, g3 = list(shape = 12, rate = 2),
                   f4 = NULL, g4 = list(rate = 1/7),
                   f5 = NULL, g5 = list(rate = 1/6))
clusters <- admix_clustering(samples = list(A.sim,B.sim,C.sim,D.sim,E.sim), n_sim_tab = 10,
                             comp.dist=list.comp, comp.param=list.param, parallel=FALSE, n_cpu=2)
#>   |                                                          |                                                  |   0%  |                                                          |====                                              |   8%  |                                                          |==============================                    |  60%  |                                                          |========================================          |  80%  |                                                          |==================================================| 100%
clusters
#> Call:
#> admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim, E.sim), 
#>     n_sim_tab = 10, comp.dist = list.comp, comp.param = list.param, 
#>     parallel = FALSE, n_cpu = 2)
#> 
#> The number of populations/samples under study is 5.
#> The level of the underlying k-sample testing procedure is set to 5%.
#> 
#> The number of detected clusters in these populations equals 3.
#> The p-values of the k-sample tests (showing when to close the clusters (i.e. p-value < 0.05) equal: 0.556, 0, 0, 0.857.
#> 
#> The list of clusters with populations belonging to them (in numeric format, i.e. inside c()) :
#>    - Cluster #1: vector of populations c(1, 3)
#>   - Cluster #2: vector of populations 4
#>   - Cluster #3: vector of populations c(2, 5)
#> 
#> The list of estimated weights for the unknown component distributions in each detected cluster
#>       (in the same format and order as listed populations for clusters just above) :
#>    - estimated weights of the unknown component distributions for cluster  1 :  c(0.797839734013291, 0.619500508134639)
#>   - estimated weights of the unknown component distributions for cluster  2 :  numeric(0)
#>   - estimated weights of the unknown component distributions for cluster  3 :  c(0.573687909599797, 0.629136693101988)
#> 
#> The matrix giving the distances between populations, used in the clustering procedure through the k-sample tests:
#>             [,1]        [,2]        [,3]      [,4]        [,5]
#> [1,]  0.00000000 30.96428162  0.02947413  7.001645 35.09156510
#> [2,] 30.96428162  0.00000000 19.81562521  4.223762  0.03876151
#> [3,]  0.02947413 19.81562521  0.00000000 65.371758 13.24943788
#> [4,]  7.00164538  4.22376209 65.37175813  0.000000  2.39935356
#> [5,] 35.09156510  0.03876151 13.24943788  2.399354  0.00000000
Milhaud, Pommeret, X., and P. Vandekerkhove. 2023. “Contamination-Source Based k-Sample Clustering.” https://hal.science/hal-04129130.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.