Getting started with moc.gapbk

Overview

The moc.gapbk package implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) proposed by Parraga-Alava et al. (2018). The algorithm combines:

NSGA-II as the underlying multi-objective evolutionary engine,
Path-Relinking as an intensification strategy, and
Pareto Local Search as a diversification strategy.

It receives two distance matrices and produces a set of non-dominated clustering solutions. The second matrix is typically used to encode a-priori biological knowledge (for example, semantic similarity between genes).
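To make "non-dominated" concrete, here is a minimal base-R sketch of a Pareto filter. It assumes both objectives are minimised; the helper name non_dominated and the objective values are illustrative only and are not part of the package.

```r
# Return TRUE for rows of `obj` that no other row dominates (minimisation).
# `obj` is a numeric matrix: one row per solution, one column per objective.
non_dominated <- function(obj) {
  obj <- as.matrix(obj)
  n <- nrow(obj)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next
      # j dominates i: no worse on every objective, strictly better on one.
      if (all(obj[j, ] <= obj[i, ]) && any(obj[j, ] < obj[i, ])) {
        keep[i] <- FALSE
        break
      }
    }
  }
  keep
}

# Made-up objective values for four candidate clusterings.
obj <- rbind(c(3.06, 4.82),
             c(3.36, 3.09),
             c(3.50, 5.00),   # dominated by the first row
             c(4.00, 3.09))   # dominated by the second row
front <- non_dominated(obj)
print(obj[front, , drop = FALSE])
```

The double loop is quadratic in the number of solutions, which is fine for the small populations used in this vignette.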

Basic usage

library(moc.gapbk)

set.seed(2025)

# Toy data: 50 objects (e.g. genes) described by 20 features (e.g. samples).
x <- matrix(stats::runif(50 * 20, min = -5, max = 10),
            nrow = 50, ncol = 20)

# Two distance matrices over the same set of objects.
# Here we use amap if available (correlation distance is biologically
# common), and fall back to base R otherwise so the vignette knits
# under any configuration.
if (requireNamespace("amap", quietly = TRUE)) {
  d1 <- as.matrix(amap::Dist(x, method = "euclidean"))
  d2 <- as.matrix(amap::Dist(x, method = "correlation"))
} else {
  d1 <- as.matrix(stats::dist(x, method = "euclidean"))
  d2 <- as.matrix(stats::dist(x, method = "manhattan"))
}

res <- moc.gapbk(dmatrix1 = d1,
                 dmatrix2 = d2,
                 num_k = 3,
                 generation = 5,
                 pop_size = 6)
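Before calling the function it can help to confirm that the two matrices really describe the same objects. The following is a hedged sketch of such a check in base R; check_dist_pair is a hypothetical helper, not a function exported by moc.gapbk, and the package may perform its own validation.

```r
# Hypothetical helper: verify that two inputs look like valid distance
# matrices over the same set of objects.
check_dist_pair <- function(d1, d2, tol = 1e-8) {
  for (d in list(d1, d2)) {
    stopifnot(is.matrix(d), is.numeric(d),
              nrow(d) == ncol(d),                 # square
              !anyNA(d),                          # no missing distances
              all(d >= 0),                        # non-negative
              all(abs(diag(d)) < tol),            # zero self-distance
              isSymmetric(unname(d), tol = tol))  # d[i, j] == d[j, i]
  }
  # Both matrices must cover the same number of objects.
  stopifnot(identical(dim(d1), dim(d2)))
  invisible(TRUE)
}

set.seed(1)
x_small <- matrix(stats::runif(10 * 4), nrow = 10)
d_a <- as.matrix(stats::dist(x_small, method = "euclidean"))
d_b <- as.matrix(stats::dist(x_small, method = "manhattan"))
ok <- check_dist_pair(d_a, d_b)
```

The check does not verify that rows appear in the same order in both matrices; when the matrices carry row names, comparing rownames(d_a) with rownames(d_b) is a sensible extra step.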

Pareto-front population

res$population contains the medoids that survived the last generation, together with the values of the two objective functions, the Pareto ranking and the crowding distance.

head(res$population)
#>   V1 V2 V3     obj1     obj2 paretoranking crowding
#> 1  1 28  9 3.060216 4.821277             1      Inf
#> 2  1 28  3 3.357799 3.090347             1      Inf
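When a single clustering is needed, one common heuristic is to pick the solution closest to the ideal point after rescaling both objectives. The sketch below assumes both objectives are minimised and uses a made-up data frame with the same obj1 and obj2 column names as res$population; the values are illustrative, not package output.

```r
# Made-up objective values for four non-dominated solutions.
pop <- data.frame(obj1 = c(3.06, 3.20, 3.36, 3.10),
                  obj2 = c(4.82, 3.60, 3.09, 4.70))

# Rescale a vector to [0, 1] so that neither objective dominates the distance.
rescale01 <- function(v) {
  rng <- range(v)
  if (diff(rng) == 0) return(rep(0, length(v)))
  (v - rng[1]) / diff(rng)
}
z <- cbind(rescale01(pop$obj1), rescale01(pop$obj2))

# Euclidean distance to the ideal point (0, 0) in the rescaled space.
dist_to_ideal <- sqrt(rowSums(z^2))
best <- which.min(dist_to_ideal)
print(pop[best, ])
```

With only two solutions on the front, the rescaled points are always (0, 1) and (1, 0), so the heuristic cannot separate them; it becomes informative once the front holds three or more solutions.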

Cluster assignments per solution

res$matrix.solutions is a data frame whose columns are the clustering assignments produced by each non-dominated solution.

head(res$matrix.solutions)
#>   1 2
#> 1 1 1
#> 2 1 1
#> 3 3 3
#> 4 1 1
#> 5 1 1
#> 6 1 1

Convenient per-solution vectors

res$clustering exposes the same information as a list of named integer vectors, ready to be passed to validation indices, plotting helpers, etc.

str(res$clustering[[1]])
#>  Named int [1:50] 1 1 3 1 1 1 3 3 3 3 ...
#>  - attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
table(res$clustering[[1]])
#> 
#>  1  2  3 
#> 24  6 20
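As one example of what can be done with such label vectors, the sketch below computes the Rand index between two clusterings in base R. The function rand_index and the second label vector are illustrative assumptions; with a fitted object one would pass two elements of res$clustering instead.

```r
# Rand index: share of object pairs on which two clusterings agree
# (both put the pair together, or both put it apart).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

sol1 <- c(1, 1, 3, 1, 1, 1, 3, 3, 3, 3)   # first ten labels shown above
sol2 <- c(1, 1, 3, 1, 1, 1, 3, 3, 2, 2)   # made-up second solution
ri <- rand_index(sol1, sol2)
print(ri)
```

A value of 1 means the two solutions induce the same partition, regardless of how the cluster labels are numbered.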

Enabling Path-Relinking and Pareto Local Search

The full algorithm activates the intensification and diversification strategies through the local_search argument. Because the cost of Pareto Local Search grows quadratically with the size of the Pareto front, the vignette leaves this option off, and the example below is shown but not evaluated.

res_full <- moc.gapbk(d1, d2,
                      num_k = 3,
                      generation = 10,
                      pop_size = 10,
                      local_search = TRUE,
                      cores = 2)

Tips for biological applications

In bioinformatics workflows, dmatrix1 is usually a distance derived from numerical expression profiles (for example, correlation or Euclidean distance on log-expression values), while dmatrix2 is a distance derived from a-priori biological knowledge (for example, semantic similarity between Gene Ontology terms). The Xie-Beni validity index is computed independently on each matrix, and each of the two resulting values serves as one objective function of the NSGA-II engine.
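For intuition, a common crisp, medoid-based form of the Xie-Beni index divides the total squared distance of objects to their nearest medoid by n times the smallest squared distance between two medoids. The sketch below implements that form in base R; the function xie_beni is an assumption made for illustration, and the exact formulation used inside the package may differ.

```r
# Crisp, medoid-based Xie-Beni index on a distance matrix (lower is better).
# d: n x n distance matrix; medoids: integer indices of the medoid objects.
xie_beni <- function(d, medoids) {
  d <- as.matrix(d)
  stopifnot(length(medoids) >= 2)
  # Compactness: squared distance from every object to its nearest medoid.
  to_medoids <- d[, medoids, drop = FALSE]
  nearest <- apply(to_medoids, 1, min)
  compactness <- sum(nearest^2)
  # Separation: smallest squared distance between two distinct medoids.
  between <- d[medoids, medoids]
  separation <- min(between[upper.tri(between)]^2)
  compactness / (nrow(d) * separation)
}

# Six points on a line forming two obvious groups.
pts <- c(0, 1, 2, 10, 11, 12)
d_line <- as.matrix(stats::dist(pts))
xb_good <- xie_beni(d_line, medoids = c(2, 5))  # medoids at the group centres
xb_bad  <- xie_beni(d_line, medoids = c(1, 2))  # both medoids in one group
print(c(good = xb_good, bad = xb_bad))
```

Evaluating the same set of medoids once on the expression-based matrix and once on the knowledge-based matrix yields a pair of values of the kind the two objectives represent.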

Backward compatibility

Versions before 0.2.0 exported the function as moc.gabk (without the p). That name is preserved as a deprecated alias that emits a warning; all new code should call moc.gapbk directly.
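A deprecated alias of this kind is typically a thin wrapper that warns and forwards its arguments. The sketch below shows the general pattern with base R's .Deprecated(); new_name and old_name are hypothetical stand-ins, and this is not the package's actual source.

```r
# Hypothetical function names; not taken from the package.
new_name <- function(x) x * 2

old_name <- function(...) {
  .Deprecated("new_name")   # signals a deprecation warning
  new_name(...)             # forward all arguments unchanged
}

# The old name still returns the same result; here the warning is
# intercepted and reported as a message instead.
out <- withCallingHandlers(
  old_name(21),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(out)
```

Muffling the warning as above is only appropriate during a migration; the longer-term fix is to switch the call to the new name.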

References

Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Mining 11(1), 1-16. https://doi.org/10.1186/s13040-018-0178-4
