Choosing the NMF rank on data with a known true rank

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Introduction

Choosing the number of basis vectors (the rank Q) is the central tuning problem in NMF. nmfkc offers several rank-selection tools, each based on a different principle:

function	principle	a good rank…
`nmfkc.rank`	integrated diagnostics: R-squared, effective rank, element-wise CV	minimises CV error / R-squared elbow
`nmfkc.ecv`	element-wise cross-validation (Wold)	minimises held-out error
`nmfkc.bicv`	block bi-cross-validation (Owen & Perry)	minimises held-out error
`nmfkc.consensus`	clustering stability (Brunet)	maximises stability (or minimises PAC)
`nmfkc.ard`	Bayesian automatic relevance determination (Tan & Fevotte)	surviving components after pruning

Since there is no single “best” criterion, this vignette applies all of them to a synthetic dataset whose true rank is known, so we can see which ones recover it.

library(nmfkc)

A synthetic dataset (true rank = 3)

We build a non-negative 24 x 60 matrix from three feature blocks and three sample clusters (20 samples each), then add noise. By construction the true rank is 3.

set.seed(123)
r_true <- 3
P <- 24                       # features
n_each <- 20                  # samples per cluster
N <- r_true * n_each          # 60 samples

# basis: each latent factor is a distinct block of features
X <- matrix(0, P, r_true)
X[1:8, 1] <- 1; X[9:16, 2] <- 1; X[17:24, 3] <- 1
X <- X + matrix(runif(P * r_true, 0, 0.1), P, r_true)   # small background

# coefficients: each sample loads mainly on its own cluster's factor
cl_true <- rep(1:r_true, each = n_each)
B <- matrix(runif(r_true * N, 0, 0.2), r_true, N)
for (j in seq_len(N)) B[cl_true[j], j] <- runif(1, 1, 2)

Y <- X %*% B
Y <- Y + matrix(rnorm(P * N, 0, 0.15 * mean(X %*% B)), P, N)  # noise
Y[Y < 0] <- 0

dim(Y)
#> [1] 24 60
image(1:N, 1:P, t(Y), xlab = "sample", ylab = "feature",
      main = "Synthetic data (true rank = 3)")

1. Integrated diagnostics: `nmfkc.rank`

nmfkc.rank() fits each rank and draws three curves on one plot: R-squared (red, with a “Best (Elbow)” marker), the broken-stick-corrected effective rank index (green, shown for context), and the element-wise ECV error (blue, with a “Best (Min)” marker). The recommended rank is the ECV minimum (or the R-squared elbow).

rk <- nmfkc.rank(Y, rank = 1:6)
#> Y(24,60)~X(24,1)B(1,60)...0sec
#> Y(24,60)~X(24,2)B(2,60)...0sec
#> Y(24,60)~X(24,3)B(3,60)...0sec
#> Y(24,60)~X(24,4)B(4,60)...0sec
#> Y(24,60)~X(24,5)B(5,60)...0sec
#> Y(24,60)~X(24,6)B(6,60)...0sec
#> Running Element-wise CV (this may take time)...
#> Performing Element-wise CV for Q = 1,2,3,4,5,6 (5-fold)...
#> Note: sample-clustering quality (silhouette / CPCC / dist.cor) is not part of rank selection; compute it from a list of fits with nmf.cluster.criteria().  See ?nmf.cluster.criteria

rk$rank.best                       # recommended rank (ECV minimum)
#> [1] 3
round(rk$criteria, 3)
#>   rank effective.rank effective.rank.ratio r.squared sigma.ecv
#> 1    1             NA                   NA     0.044     0.710
#> 2    2          2.000                1.000     0.549     0.520
#> 3    3          2.988                0.996     0.984     0.106
#> 4    4          3.438                0.860     0.985     0.113
#> 5    5          3.440                0.688     0.986     0.120
#> 6    6          4.587                0.764     0.987     0.132
#>   effective.rank.expected effective.rank.index
#> 1                   1.000                   NA
#> 2                   1.649                0.999
#> 3                   2.301                0.982
#> 4                   2.955                0.462
#> 5                   3.609                0.000
#> 6                   4.263                0.186

2. Cross-validation: `nmfkc.ecv` and `nmfkc.bicv`

Both are predictive engines that return the held-out error per rank; the recommended rank minimises it. nmfkc.ecv holds out scattered entries (Wold’s CV); nmfkc.bicv holds out a block of rows and columns at once (Owen & Perry).

ev <- nmfkc.ecv (Y, rank = 1:6)
#> Performing Element-wise CV for Q = 1,2,3,4,5,6 (5-fold)...
bv <- nmfkc.bicv(Y, rank = 1:6)   # nfolds = 2 (Owen & Perry) by default
#> bi-CV: ranks 1,2,3,4,5,6, 2x2 fold grid (Owen-Perry 2009)...
data.frame(rank = 1:6,
           sigma.ecv  = round(ev$sigma, 3),
           sigma.bicv = round(bv$sigma, 3))
#>     rank sigma.ecv sigma.bicv
#> Q=1    1     0.710      0.725
#> Q=2    2     0.520      0.531
#> Q=3    3     0.106      0.110
#> Q=4    4     0.113      0.111
#> Q=5    5     0.120      0.111
#> Q=6    6     0.132      0.111
cat("ECV  picks rank", which.min(ev$sigma),
    "| bi-CV picks rank", bv$rank[which.min(bv$sigma)], "\n")
#> ECV  picks rank 3 | bi-CV picks rank 3

3. Clustering stability: `nmfkc.consensus`

For each rank, NMF is run many times from random initialisations and the reproducibility of the sample clustering is summarised by cophenetic (Brunet), dispersion (Kim & Park; higher = crisper) and pac (Senbabaoglu; lower = less ambiguous).

cs <- nmfkc.consensus(Y, rank = 2:6, nrun = 20, keep.consensus = TRUE)
#> consensus: ranks 2,3,4,5,6, 20 runs/rank (Brunet 2004); 100 fits total...
cs
#> Consensus rank selection (Brunet 2004), 20 runs/rank
#>   best: dispersion max = rank 3 | PAC min = rank 2
#>  rank cophenetic dispersion   pac
#>     2      0.999      0.916 0.000
#>     3      1.000      1.000 0.000
#>     4      1.000      0.968 0.009
#>     5      1.000      0.956 0.016
#>     6      0.999      0.904 0.081
plot(cs)                               # stability curves

Each panel is the consensus matrix at one rank, reordered by hierarchical clustering (blue = 0 / different cluster, red = 1 / same cluster):

plot(cs, type = "heatmap")   # all ranks

A striking point: the consensus stays essentially three crisp blocks even at ranks 4–6. The surplus basis vectors do not create new reproducible clusters — at ranks 4–5 they stay near-empty (the hard clustering still uses only three factors), and at rank 6 a cluster splits only erratically across restarts, which averages out (just a few intermediate, pink entries, and a small drop in dispersion / rise in PAC). So consensus reveals the genuine stable structure (3) robustly, even when the rank is over-specified — a useful property. (Use plot(cs, type = "heatmap", rank = 3) to enlarge a single rank with sample labels.)

4. Bayesian ARD: `nmfkc.ard`

ARD fits NMF once at an over-complete rank and prunes unneeded components automatically; the relevance bar plot is read like a scree plot (look for the knee). Because it is a sensitive point estimate, we aggregate over several random restarts (nrun).

ar <- nmfkc.ard(Y, rank = 8, nrun = 20)   # >=10-20 restarts for a stable mode
ar                                        # see "rank over runs" for confidence
#> ARD-NMF rank determination (Tan-Fevotte 2013, L2 prior, 20 runs)
#>   starting rank: 8  ->  estimated rank (mode): 3
#>   rank over runs: 3:20 (median 3)
#>   relevance (representative run): 1 0.989 0.875 0 0 0 0 0
plot(ar)

Summary: did they recover the true rank?

data.frame(
  method = c("nmfkc.rank (ECV)", "nmfkc.ecv", "nmfkc.bicv",
             "nmfkc.consensus (dispersion)", "nmfkc.consensus (PAC)",
             "nmfkc.ard"),
  estimate = c(rk$rank.best,
               which.min(ev$sigma),
               bv$rank[which.min(bv$sigma)],
               cs$rank[which.max(cs$dispersion)],
               cs$rank[which.min(cs$pac)],
               ar$rank),
  true = r_true
)
#>                         method estimate true
#> 1             nmfkc.rank (ECV)        3    3
#> 2                    nmfkc.ecv        3    3
#> 3                   nmfkc.bicv        3    3
#> 4 nmfkc.consensus (dispersion)        3    3
#> 5        nmfkc.consensus (PAC)        2    3
#> 6                    nmfkc.ard        3    3

How the clustering evolves with rank

A complementary view: nmf.cluster.flow() draws a Sankey / alluvial diagram of how the hard sample clustering changes as the rank grows. We fit ranks 1:6 (so the list index equals the rank) and colour the flows by the clustering at the true rank, reference = 3. The adjacent-rank ARI (agreement) is printed along the top.

fits <- lapply(1:6, function(q) nmfkc(Y, Q = q, print.dims = FALSE))
fl <- nmf.cluster.flow(fits, reference = 3)

head(fl$clusters)      # rows = individuals, columns = rank, entries = cluster id
#>    1 2 3 4 5 6
#> i1 1 1 1 2 2 3
#> i2 1 1 1 2 3 2
#> i3 1 1 1 1 1 1
#> i4 1 1 1 1 1 1
#> i5 1 1 1 1 1 1
#> i6 1 1 1 1 3 2

At rank = 1 all 60 samples form a single group; at rank = 3 they split cleanly into the three true clusters (20 each); beyond that the groups over-split (the ARI to the next rank starts to fall) — visual confirmation that 3 is the right number of basis vectors here.

Remarks

On this clean dataset with a clear rank-3 structure, almost every criterion recovers the true rank: the predictive cross-validations (nmfkc.ecv, nmfkc.bicv), the integrated nmfkc.rank, the consensus dispersion (with a crisp three-block consensus matrix) and ARD (every nrun restart agrees on 3, relevance dropping to zero after the third component) all give 3. Only PAC leans to the coarser rank 2 — a small hint of the low-rank bias of stability measures.

On real data the criteria often disagree much more: stability/consensus tends to favour low (coarse) ranks, and ARD’s count can track the starting rank when the energy spectrum decays smoothly (no clear knee). A robust practical recipe is to triangulate: use ECV for the predictive upper bound, the effective-rank / relevance knee for the lower bound, and the consensus heatmap to confirm an interpretable block structure — then choose one rank by purpose and domain knowledge.

See ?nmfkc.rank, ?nmfkc.ecv, ?nmfkc.bicv, ?nmfkc.consensus and ?nmfkc.ard (all cross-referenced) for details.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Choosing the NMF rank on data with a known true rank

Introduction

A synthetic dataset (true rank = 3)

1. Integrated diagnostics: nmfkc.rank

2. Cross-validation: nmfkc.ecv and nmfkc.bicv

3. Clustering stability: nmfkc.consensus

4. Bayesian ARD: nmfkc.ard

Summary: did they recover the true rank?

How the clustering evolves with rank

Remarks

1. Integrated diagnostics: `nmfkc.rank`

2. Cross-validation: `nmfkc.ecv` and `nmfkc.bicv`

3. Clustering stability: `nmfkc.consensus`

4. Bayesian ARD: `nmfkc.ard`