The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
sparsecommunity implements spectral clustering for community detection in sparse networks. It provides:
community_detect(): main estimation function,
supporting both the stochastic block model ("sbm") and the
degree-corrected stochastic block model ("dcsbm").simulate_sbm() / simulate_dcsbm():
simulation utilities for both models.misclustering_rate(): best-permutation evaluation
metric.The methods implement the regularized normalized Laplacian spectral clustering of Lei and Rinaldo (2015), which is consistent for community recovery when the maximum expected degree grows as slowly as \(\log(n)\) — the sparse regime where many standard spectral methods break down.
C:5xW2PpK8D-intro.R
A stochastic block model with \(K\) communities is specified by a block probability matrix \(B \in [0,1]^{K \times K}\). Nodes in community \(k\) and community \(\ell\) are connected independently with probability \(B_{k\ell}\).
# Two balanced communities, n = 300 nodes
# Within-community edge probability: 0.25; between: 0.04
B_sbm <- matrix(c(0.25, 0.04,
0.04, 0.25), nrow = 2)
sim_sbm <- simulate_sbm(n = 300, K = 2, B = B_sbm, seed = 1)
print(sim_sbm)
#> SBM simulation: n = 300 , K = 2
#> Community sizes: 140 160
#> Mean degree : 43.69C:5xW2PpK8D-intro.R
The returned object contains the sparse adjacency matrix
A and the true community labels:
# Mean degree (note: sparse regime ~ log(n)/n * n = log(n))
mean(Matrix::rowSums(sim_sbm$A))
#> [1] 43.69333C:5xW2PpK8D-intro.R
For the standard SBM, community_detect() embeds the
network via the regularized normalized Laplacian and applies \(k\)-means:
fit_sbm <- community_detect(sim_sbm$A, K = 2, model = "sbm", seed = 1)
print(fit_sbm)
#> Community detection fit (sparsecommunity)
#> Model : sbm
#> Nodes (n) : 300
#> Communities : 2
#> Regularized : TRUE
#> Objective : 0.0399
#> Sizes : 160 140C:5xW2PpK8D-intro.R
The fitted object contains the community labels, the spectral embedding, the top eigenvalues, and the clustering objective:
# Top-K eigenvalues of the regularized Laplacian
fit_sbm$eigenvalues
#> [1] 0.5024848 0.3675187
# Community sizes
table(fit_sbm$labels)
#>
#> 1 2
#> 160 140C:5xW2PpK8D-intro.R
misclustering_rate() computes the fraction of
misclassified nodes under the best alignment of estimated and true
labels (Hungarian algorithm):
C:5xW2PpK8D-intro.R
The two-dimensional embedding reveals the cluster structure directly:
U <- fit_sbm$embedding
plot(U[, 1], U[, 2],
col = sim_sbm$labels + 1,
pch = 19, cex = 0.6,
xlab = "Eigenvector 1",
ylab = "Eigenvector 2",
main = "SBM: spectral embedding")
legend("topright", legend = c("Community 1", "Community 2"),
col = 2:3, pch = 19, bty = "n")Spectral embedding for SBM. Points are colored by true community.
C:5xW2PpK8D-intro.R
In many real networks, nodes within the same community have very different degrees. The DCSBM captures this by introducing per-node degree parameters \(\theta_i > 0\): the edge probability between nodes \(i\) and \(j\) is \(\theta_i \theta_j B_{z_i z_j}\).
Under degree heterogeneity, standard spectral methods fail because high-degree nodes concentrate together in the embedding regardless of their community. The spherical \(k\)-median algorithm (Lei and Rinaldo 2015, Theorem 4.2) corrects for this by row-normalizing the embedding before clustering.
# Three communities with strong degree heterogeneity
B_dcsbm <- matrix(c(0.5, 0.04, 0.04,
0.04, 0.5, 0.04,
0.04, 0.04, 0.5), nrow = 3)
# Degree parameters: Uniform(0.3, 1.7), creating substantial heterogeneity
set.seed(2)
theta <- runif(400, min = 0.3, max = 1.7)
sim_dcsbm <- simulate_dcsbm(n = 400, K = 3, B = B_dcsbm,
theta = theta, seed = 2)
print(sim_dcsbm)
#> DCSBM simulation: n = 400 , K = 3
#> Community sizes: 136 153 111
#> Mean degree : 77.13
#> Theta range : 0.31 to 1.697C:5xW2PpK8D-intro.R
Applying the SBM method to DCSBM data shows the problem:
fit_wrong <- community_detect(sim_dcsbm$A, K = 3, model = "sbm", seed = 2)
cat("Misclustering rate (SBM method on DCSBM data):",
misclustering_rate(sim_dcsbm$labels, fit_wrong$labels), "\n")
#> Misclustering rate (SBM method on DCSBM data): 0C:5xW2PpK8D-intro.R
The "dcsbm" model row-normalizes the embedding and
applies spherical \(k\)-median
clustering:
fit_dcsbm <- community_detect(sim_dcsbm$A, K = 3, model = "dcsbm", seed = 2)
print(fit_dcsbm)
#> Community detection fit (sparsecommunity)
#> Model : dcsbm
#> Nodes (n) : 400
#> Communities : 3
#> Regularized : TRUE
#> Objective : 22.052
#> Sizes : 136 111 153
cat("Misclustering rate (DCSBM method):",
misclustering_rate(sim_dcsbm$labels, fit_dcsbm$labels), "\n")
#> Misclustering rate (DCSBM method): 0C:5xW2PpK8D-intro.R
The row-normalized embedding maps all nodes to the unit sphere. Degree heterogeneity is absorbed by the normalization, and clusters separate by angle:
U_dc <- fit_dcsbm$embedding
plot(U_dc[, 1], U_dc[, 2],
col = sim_dcsbm$labels + 1,
pch = 19, cex = 0.5,
xlab = "Eigenvector 1 (normalized)",
ylab = "Eigenvector 2 (normalized)",
main = "DCSBM: row-normalized spectral embedding")
legend("topright",
legend = paste("Community", 1:3),
col = 2:4, pch = 19, bty = "n")Row-normalized spectral embedding for DCSBM. Colors indicate true communities.
C:5xW2PpK8D-intro.R
Zachary’s (1977) karate club network is a benchmark for community detection. The network has 34 members (nodes) and 78 friendships (edges). A conflict led to a split into two factions, providing known ground-truth community membership.
if (!requireNamespace("igraphdata", quietly = TRUE)) {
message("igraphdata not installed; skipping real-data example.")
knitr::knit_exit()
}
library(igraph)
#> Warning: package 'igraph' was built under R version 4.3.1
data("karate", package = "igraphdata")
# Extract adjacency matrix and true community labels
A_karate <- igraph::as_adjacency_matrix(karate, sparse = TRUE)
true_comm <- igraph::V(karate)$Faction
cat("Nodes:", vcount(karate), "| Edges:", ecount(karate),
"| Communities:", length(unique(true_comm)), "\n")
#> Nodes: 34 | Edges: 78 | Communities: 2
cat("Community sizes:", table(true_comm), "\n")
#> Community sizes: 16 18
cat("Mean degree:", round(mean(degree(karate)), 2), "\n")
#> Mean degree: 4.59C:5xW2PpK8D-intro.R
fit_karate <- community_detect(A_karate, K = 2, model = "sbm",
n_init = 30, seed = 42)
summary(fit_karate)
#> Community detection summary
#> Call : community_detect(A = A_karate, K = 2, model = "sbm", n_init = 30, seed = 42)
#> Model : sbm
#> Nodes (n) : 34
#> Communities : 2
#> Regularized : TRUE
#> Restarts : 30
#>
#> Community sizes:
#> Community Size Proportion
#> 1 16 0.471
#> 2 18 0.529
#>
#> Top eigenvalues:
#> [1] 0.5463 0.4289
#>
#> Objective (within-cluster sum of distances): 0.3856
cat("Misclustering rate:", misclustering_rate(true_comm, fit_karate$labels), "\n")
#> Misclustering rate: 0C:5xW2PpK8D-intro.R
# Plot network colored by detected community
shape_map <- ifelse(true_comm == 1, "circle", "square")
igraph::plot.igraph(
karate,
vertex.color = fit_karate$labels + 1,
vertex.shape = shape_map,
vertex.size = 8,
vertex.label = NA,
main = "Karate club: detected vs. true communities"
)
legend("bottomleft",
legend = c("Detected: 1", "Detected: 2"),
fill = 2:3, bty = "n", cex = 0.9)
legend("bottomright",
legend = c("True: faction 1", "True: faction 2"),
pch = c(19, 15), bty = "n", cex = 0.9)Karate club network. Node color = detected community; node shape = true faction.
C:5xW2PpK8D-intro.R
The NCAA football network is a standard benchmark for community detection with \(K > 2\) communities. The network has 115 Division I-A college football teams (nodes) connected by regular-season games during Fall 2000 (613 edges). Teams are divided into 12 athletic conferences, providing unambiguous ground-truth community labels (Girvan & Newman 2002).
The network is bundled directly in sparsecommunity:
data("football")
cat("Nodes:", nrow(football_A), "| Edges:", sum(football_A) / 2, "\n")
#> Nodes: 115 | Edges: 613
cat("Mean degree:", round(mean(Matrix::rowSums(football_A)), 2),
" log(n):", round(log(nrow(football_A)), 2), "\n")
#> Mean degree: 10.66 log(n): 4.74
table(football_labels) # 12 conferences
#> football_labels
#> 1 2 3 4 5 6 7 8 9 10 11 12
#> 9 8 11 12 10 5 13 8 10 12 7 10C:5xW2PpK8D-intro.R
The mean degree (10.66) is well above \(\log(115) = 4.74\), placing the network in
the sparse-but-identifiable regime. We first use
estimate_K() to check whether the number of conferences can
be recovered from the data alone:
C:5xW2PpK8D-intro.R
The estimator returns 10, slightly underestimating due to two very small conferences (sizes 5 and 7) that are hard to distinguish spectrally. We proceed with the known \(K = 12\):
fit_football <- community_detect(football_A, K = 12, model = "sbm",
n_init = 30, seed = 1)
misclustering_rate(football_labels, fit_football$labels)
#> [1] 0.08695652C:5xW2PpK8D-intro.R
sparsecommunity misclassifies 10 of 115 teams (8.7%). These are primarily independent teams and teams from small conferences that schedule games across conference lines. The spectral embedding confirms the 12-cluster structure:
Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated.
C:5xW2PpK8D-intro.R
community_detect() interface| Argument | Default | Description |
|---|---|---|
A |
— | Adjacency matrix (sparse or dense) |
K |
— | Number of communities |
model |
"dcsbm" |
"sbm" for k-means; "dcsbm" for spherical
k-median |
n_init |
20 |
Random restarts for clustering |
max_iter |
100 |
Max iterations per restart |
reg |
TRUE |
Degree regularization in Laplacian embedding |
seed |
NULL |
Random seed |
The returned "sparsecommunity" object contains:
| Component | Description |
|---|---|
labels |
Integer vector of community assignments (length \(n\)) |
embedding |
\(n \times K\) spectral embedding matrix |
eigenvalues |
Top-\(K\) eigenvalues of the regularized Laplacian |
centers |
\(K \times K\) cluster centers in embedding space |
objective |
Within-cluster sum of distances at convergence |
Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1), 215–237. https://doi.org/10.1214/14-AOS1274
Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826. https://doi.org/10.1073/pnas.122653799
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4), 452–473.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.