mwcsr
is an R package to solve maximum weight connected subgraph (MWCS) problem and its variants. The package implements and provides an interface to several solvers: both exact and heuristic.
mwcsr
can be installed from GitHub repository using devtools
package:
library(devtools)
install_github("ctlab/mwcsr")
Load mwcsr
, as well as igraph
package, which contains functions for graph manipulations.
library(mwcsr)
library(igraph)
Let’s load an example instance of MWCS problem. The instance is a simple igraph
object with weight
vertex attribute.
data("mwcs_example")
print(mwcs_example)
## IGRAPH f86c56f UN-- 194 209 --
## + attr: name (v/c), label (v/c), weight (v/n), label (e/c)
## + edges from f86c56f (vertex names):
## [1] C00022_2--C00024_0 C00022_0--C00024_1 C00025_0--C00026_0
## [4] C00025_1--C00026_1 C00025_2--C00026_2 C00025_4--C00026_4
## [7] C00025_7--C00026_7 C00024_1--C00033_0 C00024_0--C00033_1
## [10] C00022_0--C00041_0 C00022_1--C00041_1 C00022_2--C00041_2
## [13] C00036_0--C00049_0 C00036_1--C00049_1 C00036_2--C00049_2
## [16] C00036_4--C00049_4 C00037_1--C00065_0 C00037_0--C00065_1
## [19] C00022_0--C00074_5 C00022_1--C00074_6 C00022_2--C00074_7
## [22] C00024_0--C00083_0 C00024_1--C00083_1 C00026_1--C00091_0
## + ... omitted several edges
summary(V(mwcs_example)$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7379 -0.7379 1.9294 5.9667 7.2931 38.1546
Now let us initialize a heuristic relax-and-cut MWCS solver (Alvarez-Miranda and Sinnl, 2017):
<- rmwcs_solver() rcsolver
Now we can use this solver to solve the example instance:
<- solve_mwcsp(rcsolver, mwcs_example)
m print(m$graph)
## IGRAPH 5ad5bde UN-- 162 164 --
## + attr: name (v/c), label (v/c), weight (v/n)
## + edges from 5ad5bde (vertex names):
## [1] C00022_0--C00024_1 C00022_0--C00041_0 C00022_0--C00074_5
## [4] C00022_0--C00149_0 C00022_1--C00041_1 C00022_1--C00074_6
## [7] C00022_1--C00149_2 C00022_2--C00024_0 C00022_2--C00041_2
## [10] C00022_2--C00074_7 C00022_2--C00149_1 C00024_0--C00033_1
## [13] C00024_0--C00222_0 C00024_1--C00033_0 C00024_1--C00222_2
## [16] C00025_0--C00026_0 C00025_1--C00026_1 C00025_2--C00026_2
## [19] C00025_4--C00026_4 C00025_7--C00026_7 C00026_0--C00091_1
## [22] C00026_0--C00311_1 C00026_1--C00091_0 C00026_1--C00311_0
## + ... omitted several edges
print(m$weight)
## [1] 1178.432
Supported MWCS variants are:
In mwcsr
, instances of all of the above problems are represented by an igraph
object with certain specified attributes. The validity and the type of the instance can be checked using get_instance_type
function, for example:
get_instance_type(mwcs_example)
## $type
## [1] "MWCS"
##
## $valid
## [1] TRUE
Simple maximum weight connected subgraph (MWCS) problem can be defined as follows. Let \(G = (V, E)\) be an undirected graph and \(\omega : V \rightarrow \mathbb{R}\) is a weight function defined on the vertices. Then MWCS problem consists in finding a connected subgraph \(\widetilde{G} = (\widetilde{V}, \widetilde{E})\) with a maximal total sum of vertex weights:
\[\Omega(\widetilde{G}) = \sum_{v \in \widetilde{V}} \omega(v) \rightarrow max.\]
An important property of MWCS is that solution can always be represented as a tree.
In mwcsr
an MWCS instance is defined as an igraph
with weight
vertex attribute (and without weight
edge attribute).
mwcs_example
## IGRAPH f86c56f UN-- 194 209 --
## + attr: name (v/c), label (v/c), weight (v/n), label (e/c)
## + edges from f86c56f (vertex names):
## [1] C00022_2--C00024_0 C00022_0--C00024_1 C00025_0--C00026_0
## [4] C00025_1--C00026_1 C00025_2--C00026_2 C00025_4--C00026_4
## [7] C00025_7--C00026_7 C00024_1--C00033_0 C00024_0--C00033_1
## [10] C00022_0--C00041_0 C00022_1--C00041_1 C00022_2--C00041_2
## [13] C00036_0--C00049_0 C00036_1--C00049_1 C00036_2--C00049_2
## [16] C00036_4--C00049_4 C00037_1--C00065_0 C00037_0--C00065_1
## [19] C00022_0--C00074_5 C00022_1--C00074_6 C00022_2--C00074_7
## [22] C00024_0--C00083_0 C00024_1--C00083_1 C00026_1--C00091_0
## + ... omitted several edges
summary(V(mwcs_example)$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7379 -0.7379 1.9294 5.9667 7.2931 38.1546
Budget MWCS is a slight modification of the original problem. While the objective function remains the same, an additional constraint is introduced. In this problem nonnegative values called budget costs are assigned to vertices. The sum of budget costs of the vertices in the solution is then constrained to not exceed a predefined budget.
More formally, let \(c: V \rightarrow R^+\) be the function of the budget costs. \(B \in R^+\) is a budget of the problem. In this terms Budget MWCS is defined as a problem where the following conditions are satisfied:
\[ \Omega(\widetilde{G}) = \sum\limits_{v \in \widetilde{V}} \omega(v) \rightarrow max.\\ \text{subject to} \sum\limits_{v \in \widetilde{V}} c(v) \leq B \]
In mwcsr
a Budget MWCS instance is an igraph
object with weight
vertex attribute as in original MWCS problem and additional budget_cost
vertex attribute. Value of the budget not the part of the instance itself but rather a parameter to be passed to a function that solves an instance.
<- mwcs_example
budget_mwcs_example set.seed(42)
V(budget_mwcs_example)$budget_cost <- runif(vcount(budget_mwcs_example))
get_instance_type(budget_mwcs_example)
## $type
## [1] "Budget MWCS"
##
## $valid
## [1] TRUE
Generalized MWCS (GMWCS) is similar to MWCS, but edges are also weighted. More formally, let \(G = (V, E)\) be an undirected graph and \(\omega : (V \cup E) \rightarrow \mathbb{R}\) is a weight function defined on the vertices and the edges. Then GMWCS problem consists in finding a connected subgraph \(\widetilde{G} = (\widetilde{V}, \widetilde{E})\) with a a maximal total sum of vertex and edge weights:
\[ \Omega(\widetilde{G}) = \sum_{v \in \widetilde{V}} \omega(v) + \sum_{e \in \widetilde{E}} \omega(e) \rightarrow max. \]
An important consequence of edge weights is that the optimal solution can contain cycles.
A GMWCS instance is defined as an igraph
with weight
attribute defined for both vertices and edges.
data(gmwcs_example)
gmwcs_example
## IGRAPH f86c56f UNW- 194 209 --
## + attr: name (v/c), label (v/c), weight (v/n), label (e/c), weight
## | (e/n)
## + edges from f86c56f (vertex names):
## [1] C00022_2--C00024_0 C00022_0--C00024_1 C00025_0--C00026_0 C00025_1--C00026_1
## [5] C00025_2--C00026_2 C00025_4--C00026_4 C00025_7--C00026_7 C00024_1--C00033_0
## [9] C00024_0--C00033_1 C00022_0--C00041_0 C00022_1--C00041_1 C00022_2--C00041_2
## [13] C00036_0--C00049_0 C00036_1--C00049_1 C00036_2--C00049_2 C00036_4--C00049_4
## [17] C00037_1--C00065_0 C00037_0--C00065_1 C00022_0--C00074_5 C00022_1--C00074_6
## [21] C00022_2--C00074_7 C00024_0--C00083_0 C00024_1--C00083_1 C00026_1--C00091_0
## [25] C00042_2--C00091_0 C00026_0--C00091_1 C00042_1--C00091_1
## + ... omitted several edges
summary(V(gmwcs_example)$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7379 -0.7379 1.9294 5.9667 7.2931 38.1546
summary(E(gmwcs_example)$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2715 -0.5762 -0.1060 0.5452 1.0458 8.5829
The signal generalized MWCS (SGMWCS) variant continues to generalize MWCS problem and introduces a concept of signals. Instead of specifying vertex and edge weights directly, the weights are defined for a set of signals, and vertices and edges are marked with these signals. The difference from GMWCS is that the signals can be repeated in the graph, while the weight of the subgraph is defined as a sum of weights of its unique signals.
Formally, let \(G = (V, E)\) be an undirected graph, \(S\) – a set of signlas with weights \(\omega: S \rightarrow \mathbb{R}\), and \(\sigma: (V \cup E) \rightarrow 2^S\) – markings of the vertices and edges with signals. An SGMWCS problem consists in finding a connected subgraph, with a maximal total sum of its unique signals:
\[ \Omega(\widetilde{G}) = \sum_{s \in \sigma(\widetilde{V} \cup \widetilde{E})} \omega(s) \rightarrow max, \] where \(\sigma(\widetilde{V} \cup \widetilde{E}) = \bigcup_{x \in (\widetilde{V} \cup \widetilde{E})} \sigma(x)\).
SGMWCS instances can arise when the data, from which the weights are inferred, map to the graph ambigously. For example, when m/z peak from mass-spectrometry data is assigned to multiple isomer metabolites, or when the same enzyme catalyze multiple reactions, and so on. Practically, it is usually assumed that the signals with negative weights are not repeated.
An SGMWCS is represented as an igraph
object, with signal
attribute defined for both vertices and edges and a signals
attribute defined for the graph, containing the signal weights. Specification of multiple signals per node or edge is not yet supported.
data("sgmwcs_example")
sgmwcs_example
## IGRAPH f86c56f UN-- 194 209 --
## + attr: signals (g/n), name (v/c), label (v/c), signal (v/c), label
## | (e/c), signal (e/c)
## + edges from f86c56f (vertex names):
## [1] C00022_2--C00024_0 C00022_0--C00024_1 C00025_0--C00026_0 C00025_1--C00026_1
## [5] C00025_2--C00026_2 C00025_4--C00026_4 C00025_7--C00026_7 C00024_1--C00033_0
## [9] C00024_0--C00033_1 C00022_0--C00041_0 C00022_1--C00041_1 C00022_2--C00041_2
## [13] C00036_0--C00049_0 C00036_1--C00049_1 C00036_2--C00049_2 C00036_4--C00049_4
## [17] C00037_1--C00065_0 C00037_0--C00065_1 C00022_0--C00074_5 C00022_1--C00074_6
## [21] C00022_2--C00074_7 C00024_0--C00083_0 C00024_1--C00083_1 C00026_1--C00091_0
## [25] C00042_2--C00091_0 C00026_0--C00091_1 C00042_1--C00091_1
## + ... omitted several edges
str(V(sgmwcs_example)$signal)
## chr [1:194] "s1" "s1" "s1" "s2" "s3" "s4" "s4" "s4" "s4" "s4" "s5" "s5" ...
str(E(sgmwcs_example)$signal)
## chr [1:209] "s103" "s104" "s105" "s106" "s107" "s108" "s109" "s110" "s110" ...
head(sgmwcs_example$signals)
## s1 s2 s3 s4 s5 s6
## 5.008879 -0.737898 -0.737898 20.112627 19.890279 2.069292
Sometimes, construction of SGMWCS instances can be simplified using normalize_sgmwcs_instance
function.
Let consider an example graph obtained from gatom package.
data("gatom_example")
print(gatom_example)
## IGRAPH f86c56f UNW- 194 209 --
## + attr: name (v/c), metabolite (v/c), element (v/c), label (v/c), url
## | (v/c), pval (v/n), origin (v/n), HMDB (v/c), log2FC (v/n), baseMean
## | (v/n), logPval (v/n), signal (v/c), signalRank (v/n), score (v/n),
## | weight (v/n), label (e/c), pval (e/n), origin (e/n), RefSeq (e/c),
## | gene (e/c), enzyme (e/c), reaction_name (e/c), reaction_equation
## | (e/c), url (e/c), reaction (e/c), rpair (e/c), log2FC (e/n), baseMean
## | (e/n), logPval (e/n), signal (e/c), signalRank (e/n), score (e/n),
## | weight (e/n)
## + edges from f86c56f (vertex names):
## [1] C00022_2--C00024_0 C00022_0--C00024_1 C00025_0--C00026_0 C00025_1--C00026_1
## + ... omitted several edges
In this graph, the signals
graph attributed is absent with weights specified directly as vertex or edge attributes along with signal
attributes, which is a very practical intermediate representation. However, it is recognized as a GMWCS instance:
get_instance_type(gatom_example)
## $type
## [1] "GMWCS"
##
## $valid
## [1] TRUE
Let convert this represantation into a valid SGMWCS instance using normalize_sgmwcs_instance
function:
<- normalize_sgmwcs_instance(gatom_example)
gatom_instance get_instance_type(gatom_instance)
## $type
## [1] "SGMWCS"
##
## $valid
## [1] TRUE
And let call the same function with explicit arguments:
<- normalize_sgmwcs_instance(gatom_example,
gatom_instance nodes.weight.column = "weight",
edges.weight.column = "weight",
nodes.group.by = "signal",
edges.group.by = "signal",
group.only.positive = TRUE)
The function does the following:
NULL
can be specified as a value of nodes.group.by
or edges.group.by
if there are no corresponding signals in the data, in which case zero signals will be created.group.only.positive
is set to FALSE
.Currently, four solvers are supported:
rmwcs_solver
for MWCS and Budget MWCS;rnc_solver
for MWCS/GMWCS/SGMWCS;annealing_solver
for MWCS/GMWCS/SGMWCS;virgo_solver
for MWCS/GMWCS/SGMWCS.While selecting a particular solver depends on the particular class of instances, the general recommendations are:
rmwcs_solver
if small suboptimality can be tolerated. It is very fast and usually is able to find optimal or very close to optimal solution. To find exact solution virgo_solver
can be used if CPLEX library is available.rmwcs_solver
is available.virgo_solver
is the recommended solver if CPLEX library is available. Without CPLEX it is recommended to use rnc_solver
giving both acceptable primal solution and an upper bound on the objective function. Manual tuning of the annealing_solver
may give better solutions in some cases.Relax-and-cut solver is a heuristic solver able to rapidly find high-quality solutions for MWCS problem (Alvarez-Miranda and Sinnl, 2017, https://doi.org/10.1016/j.cor.2017.05.015). The solver does not require any additional libraries.
Relax-and-cut solver can be constructed using rmwcs_solver
function with the default arguments.
<- rmwcs_solver()
rmwcs <- solve_mwcsp(rmwcs, mwcs_example)
m print(m$weight)
## [1] 1178.432
print(m$solved_to_optimality)
## [1] FALSE
rmwcs_solver
supports Budget MWCS instances and its special case where all budget costs are set to one and called MWCS with cardinality constraints. In mwcsr
such problems are represented as Simple MWCS problems and maximum cardinality is passed to solve_mwcsp
funciton as argument:
<- solve_mwcsp(rmwcs, mwcs_example, max_cardinality = 10)
m print(vcount(m$graph))
## [1] 10
print(m$weight)
## [1] 134.9068
To solve Budget MWCS passing budget limit is necessary as well:
<- solve_mwcsp(rmwcs, budget_mwcs_example, budget = 10)
m print(sum(V(m$graph)$budget_cost))
## [1] 4.625048
print(m$weight)
## [1] 163.6254
Rnc solver is another relax-and-cut solver made for GMWCS/SGMWCS problems inspired by rmwcs solver. The solver does not require any libraries as well. Although with rnc_solver
it is possible to solve MWCS problems, running rmwcs_solver
for this type of problems of this type is preferable. No budget and cardinality variants are available.
The solver can be constructed using rnc_solver
function.
<- rnc_solver()
rnc <- solve_mwcsp(rnc, gmwcs_example)
m print(m$weight)
## [1] 1295.815
print(m$solved_to_optimality)
## [1] FALSE
And for SGMWCS instance:
<- rnc_solver()
rnc <- solve_mwcsp(rnc, sgmwcs_example)
m print(m$weight)
## [1] 263.8406
Another heuristic solver is a simulated annealing based solver. The solver is rather generic, but can produce good enough solutions if parameters are tuned well. As it is heuristic solver with no estimate on upper bound of the objective function the solved to optimality flag is always set to FALSE
.
This solver does support warm start allowing to run series of restarts of annealing solver with different temperature schedules.
The use of this solver may be beneficial in some cases, although there are no generic guidelines.
<- NULL
m for (i in 0:15) {
<- annealing_solver(schedule = "boltzmann", initial_temperature = 8.0 / (2 ** i),
asolver final_temperature = 1 / (2 ** i))
if (i != 0) {
<- solve_mwcsp(asolver, gmwcs_example, warm_start = m)
m else {
} <- solve_mwcsp(asolver, gmwcs_example)
m
}print(m$weight)
}
## [1] 955.6224
## [1] 1157.73
## [1] 1216.681
## [1] 1218.817
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
## [1] 1222.339
The mwcsr
package also provide and interface to exact Java-based Virgo solver
(https://github.com/ctlab/virgo-solver) which can be used to solve MWCS, GMWCS and SGMWCS instances. The solver requires Java (11+) to be installed on your machine.
There are two modes of execution:
Heuristic solver can be constructed, by specifying cplex_dir=NULL
. As it is a heuristic solver, the solved to optimality flag is always set to FALSE
.
<- virgo_solver(cplex_dir=NULL)
mst_solver <- solve_mwcsp(mst_solver, sgmwcs_example)
m print(m$weight)
## [1] 227.3432
print(m$solved_to_optimality)
## [1] FALSE
Exact solver requires setting cplex_dir
argument with a path to CPLEX installation. The cplex_dir
requires, that cplex.jar
file and CPLEX dynamic library file (depending on the operating system: libcplex<version>.dll
for Windows, libcplex<version>.so
for Linux, libcplex<version>.jnilib
for OS X) can be found there with recursive search. Alternatively, cplex_jar
argument pointing to cplex.jar
file and cplex_bin
argument pointing to the directory with CPLEX dynamic library files can be specified. Additionally, it is convenient to put the path to CPLEX into a CPLEX_HOME
environment variable, so that it does not have to be changed from one system to another, when run.
<- Sys.getenv('CPLEX_HOME')
cplex_dir <- virgo_solver(cplex_dir=cplex_dir)
exact_solver <- solve_mwcsp(exact_solver, sgmwcs_example) m
As the CPLEX found the optimal solution, the corresponding flag is set to TRUE
:
print(m$weight)
## [1] 270.7676
print(m$solved_to_optimality)
## [1] TRUE
Some additional information like the running time, instance files, solver version is available as in the stats
field. Refer to Virgo documentation for the description of the values.
print(m$stats)
## isOpt VPrep EPrep time nodes edges
## 1 1 136 148 679 40 39
## nodefile
## 1 /tmp/Rtmpc6LW3I/graph384842f2bee0b/nodes.txt
## edgefile
## 1 /tmp/Rtmpc6LW3I/graph384842f2bee0b/edges.txt
## sigfile version
## 1 /tmp/Rtmpc6LW3I/graph384842f2bee0b/signals.txt 0.1.3
Computational resources availble for Virgo can be specified with the following parameters:
memory
– maximum amount of memory, more specifically Java heap size, specified via -Xmx
Java option, default: 2G
.threads
– number of threads for simultaneous computation, default: the number of available cores.timelimit
– maximum time in seconds to solve the problem, if the solver is interrupted due to time limit, the best solution found so far is reported and solved_to_optimality
flag is set to FALSE
.Another useful parameter is penalty
. The non-zero penalty make the solver run an additional pass over the solution, with each edge penalized with the specified value. As there can be multiple solutions, having the same weight, especially in case of SGMWCS, this procedure allows to locally minimize the solution size, while preserving the weight.
<- virgo_solver(cplex_dir=cplex_dir, penalty=0.001)
psolver <- solve_mwcsp(psolver, sgmwcs_example)
min_m print(min_m$weight)
## [1] 270.7676
print(min_m$stats)
## isOpt VPrep EPrep time nodes edges
## 1 1 136 148 723 39 38
## nodefile
## 1 /tmp/Rtmpc6LW3I/graph384842722d3c37/nodes.txt
## edgefile
## 1 /tmp/Rtmpc6LW3I/graph384842722d3c37/edges.txt
## sigfile version
## 1 /tmp/Rtmpc6LW3I/graph384842722d3c37/signals.txt 0.1.3
Now the solution, has min_m$stats$nodes
nodes instead of m$stats$nodes
for the solution that was found before, while having the same total weight of min_m$weight
.
This part of the tutorial shows how mwcsr
solvers can be combined with BioNet
package to find an active module in a protein-protein interaction network.
Let start with generating an example scored network, following BioNet tutorial:
library(BioNet)
library(DLBCL)
data(dataLym)
data(interactome)
<- cbind(t = dataLym$t.pval, s = dataLym$s.pval)
pvals rownames(pvals) <- dataLym$label
<- aggrPvals(pvals, order = 2, plot = FALSE)
pval <- dataLym$diff
logFC names(logFC) <- dataLym$label
<- subNetwork(dataLym$label, interactome)
subnet <- rmSelfLoops(subnet)
subnet <- fitBumModel(pval, plot = FALSE)
fb <- scoreNodes(subnet, fb, fdr = 0.001) scores
Here we have network object subnet
of type graphNEL
and a vector of node scores scores
:
subnet
## A graphNEL graph with undirected edges
## Number of Nodes = 2559
## Number of Edges = 7788
str(scores)
## Named num [1:2559] -4.829 -4.225 -5.514 3.931 -0.386 ...
## - attr(*, "names")= chr [1:2559] "ABL1(25)" "ACPP(55)" "ACTG1(71)" "ACTN1(87)" ...
BioNet comes with a heuristic MWCS FastHeinz solver, that we can use to find the module following the BioNet tutorial:
<- runFastHeinz(subnet, scores)
bionet_h plotModule(bionet_h, scores=scores, diff.expr=logFC)
sum(scores[nodes(bionet_h)])
## [1] 60.40548
We can construct an MWCS instance by converting graphNEL
object into igraph
and add node weights:
<- igraph.from.graphNEL(subnet, weight=FALSE) # ignoring edge weights of 1
bionet_example V(bionet_example)$weight <- scores[V(bionet_example)]
get_instance_type(bionet_example)
## $type
## [1] "MWCS"
##
## $valid
## [1] TRUE
Now the instance can be solved with the relax-and-cut solver:
<- rmwcs_solver()
rmwcs <- solve_mwcsp(rmwcs, bionet_example)
bionet_m plotModule(bionet_m$graph, scores=scores, diff.expr=logFC)
Note that the weight increased, compared to FastHeinz solution:
print(bionet_m$weight)
## [1] 70.16602
Similarly, Virgo can be used to solve the instance to provable optimality, but in this case it produces the same results:
<- solve_mwcsp(exact_solver, bionet_example)
bionet_m_exact print(bionet_m_exact$weight)
## [1] 70.16602
print(bionet_m_exact$solved_to_optimality)
## [1] TRUE