SpoMAG is an R-based machine learning tool developed to predict the sporulation potential of Metagenome-Assembled Genomes (MAGs) from uncultivated Firmicutes species, particularly from the Bacilli and Clostridia classes.
SpoMAG leverages the complex combination of presence or absence of sporulation-associated genes to infer whether a genome is capable of undergoing sporulation, even in the absence of cultivation or complete genome assemblies. This strategy allows researchers to assess sporulation potential using only functional annotations from metagenomic data.
SpoMAG predicts the sporulation potential of a given genome through a three-step workflow:
1. `sporulation_gene_name()` parses a functional annotation table, such as those generated by eggNOG-mapper, and identifies genes related to sporulation using curated gene names and KEGG orthologs. It requires as input a data frame with the columns `Preferred_name`, `KEGG_ko`, and `genome_ID` (named exactly as such). The function outputs a filtered table of annotations containing sporulation-related genes.
2. `build_binary_matrix()` converts the filtered annotations into a binary matrix, with rows representing genomes and columns representing genes (1 = present, 0 = absent). Missing genes are automatically filled with zeros to ensure consistent input for machine learning models.
3. `predict_sporulation()` applies a pre-trained ensemble model combining Random Forest and Support Vector Machine predictions into a stacked meta-classifier. It outputs the classification label (`Sporulating` or `Non_sporulating`), the individual model probabilities, and the final ensemble probability value.
SpoMAG abstracts the complexity of machine learning, allowing users to simply provide an annotation table and receive interpretable predictions. The tool is designed to be accessible even to those without prior expertise in bioinformatics or machine learning.
Its ensemble learning approach combines the predictions from Random Forest and Support Vector Machine classifiers, trained on high-quality labeled datasets of known spore-formers and non-spore-formers. The predictions are then used as features in a meta-classifier using model stacking, enhancing prediction accuracy and allowing SpoMAG to capture complementary decision boundaries from each model.
Model performance was evaluated using cross-validation and standard metrics, including AUC-ROC, F1-score, accuracy, precision, specificity, and recall. In these evaluations, SpoMAG showed high sensitivity and specificity across MAGs recovered from the microbiota of different hosts.
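The threshold-based metrics listed above all derive from the four cells of a confusion matrix. The following is a minimal base R sketch of those formulas. The helper `classification_metrics()` is hypothetical and not part of SpoMAG, and the `truth` and `pred` vectors are toy labels rather than SpoMAG results. AUC-ROC is omitted because it needs predicted probabilities rather than labels.

```r
# Hypothetical helper (not part of SpoMAG): metrics from true and predicted labels.
classification_metrics <- function(truth, pred, positive = "Sporulating") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives

  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)  # also called sensitivity
  specificity <- tn / (tn + fp)

  c(
    accuracy    = (tp + tn) / length(truth),
    precision   = precision,
    recall      = recall,
    specificity = specificity,
    f1          = 2 * precision * recall / (precision + recall)
  )
}

# Toy labels for illustration only (4 spore-formers, 4 non-spore-formers)
truth <- rep(c("Sporulating", "Non_sporulating"), each = 4)
pred  <- c("Sporulating", "Sporulating", "Sporulating", "Non_sporulating",
           rep("Non_sporulating", 4))

metrics <- classification_metrics(truth, pred)
print(round(metrics, 3))
```

Here one spore-former is missed and no non-spore-former is misclassified, so recall drops below 1 while precision and specificity stay at 1.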
Whether analyzing hundreds of MAGs or a single novel lineage, SpoMAG offers a robust, automated, and cultivation-independent solution to assess sporulation potential. No phenotypic validation or manual annotation is required, making it a practical tool for exploring the ecological and functional roles of spore-forming bacteria.
The SpoMAG repository is hosted on GitHub at https://github.com/labinfo-lncc/SpoMAG, where you can report bugs and get help.
Paper under publication.
You can install the SpoMAG package directly from GitHub using:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install SpoMAG from GitHub
devtools::install_github("labinfo-lncc-br/SpoMAG")
SpoMAG depends on the following packages:
sporulation_gene_name()
Extracts sporulation-related genes from an annotation dataframe by searching for gene names and KEGG orthologs.
- Input: A dataframe with at least the `Preferred_name`, `KEGG_ko`, and `genome_ID` columns.
- Output: A filtered dataframe of sporulation-related genes that includes the columns `spo_gene_name` and `spo_process`.
genes <- sporulation_gene_name(df)
build_binary_matrix()
Creates a binary matrix indicating the presence (1) or absence (0) of known sporulation genes in each genome.
- Input: A dataframe output from the `sporulation_gene_name()` function.
matrix <- build_binary_matrix(genes)
Note: The function automatically fills in missing genes with 0 to ensure consistent input for sporulation-capacity prediction.
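To illustrate what zero-filling means, here is a toy base R sketch of a presence/absence matrix. It is not SpoMAG's implementation: `toy_binary_matrix()` and the three-gene `reference_genes` list are hypothetical, and the input is assumed to carry `genome_ID` and `spo_gene_name` columns.

```r
# Toy sketch (not SpoMAG's implementation): presence/absence matrix with zero-filling.
toy_binary_matrix <- function(genes, reference_genes) {
  genomes <- unique(as.character(genes$genome_ID))
  # Start from an all-zero matrix so genes that are never observed stay 0
  m <- matrix(0L, nrow = length(genomes), ncol = length(reference_genes),
              dimnames = list(genomes, reference_genes))
  # Keep only rows whose gene is in the reference list, then mark them present
  hits <- genes[genes$spo_gene_name %in% reference_genes, ]
  m[cbind(as.character(hits$genome_ID), as.character(hits$spo_gene_name))] <- 1L
  m
}

# Made-up filtered annotations for two genomes
genes <- data.frame(
  genome_ID     = c("G001", "G001", "G002"),
  spo_gene_name = c("spo0A", "spoIIIE", "spo0A")
)
reference_genes <- c("spo0A", "spoIIIE", "sigE")  # toy reference list

bin <- toy_binary_matrix(genes, reference_genes)
print(bin)
```

Every genome ends up with the same columns in the same order, which is what a pre-trained model needs regardless of which genes were actually annotated.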
predict_sporulation()
Applies a pre-trained ensemble machine learning model to predict the sporulation potential of genomes based on the binary matrix of genes.
Input:
- binary_matrix: Output from `build_binary_matrix()`
Output: A dataframe with:
- genome_ID: the genome ID you are using as input
- RF_Prob: Random Forest probability of being a spore-former
- SVM_Prob: Support Vector Machine probability of being a spore-former
- Meta_Prob_Sporulating: Ensemble probability of being a spore-former
- Meta_Prediction: Final prediction (`Sporulating` or `Non_sporulating`)
results <- predict_sporulation(binary_matrix = matrix)
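Because the output is an ordinary dataframe, it can be summarised with base R. The sketch below uses a mock `results` table that has the documented columns; every value in it is made up for illustration and none comes from the SpoMAG model.

```r
# Mock output with the documented columns; all values are made up for illustration.
results <- data.frame(
  genome_ID             = c("G001", "G002", "G003"),
  RF_Prob               = c(0.91, 0.12, 0.66),
  SVM_Prob              = c(0.88, 0.20, 0.71),
  Meta_Prob_Sporulating = c(0.93, 0.10, 0.70),
  Meta_Prediction       = c("Sporulating", "Non_sporulating", "Sporulating")
)

# Count genomes per predicted class
class_counts <- table(results$Meta_Prediction)
print(class_counts)

# Keep predicted spore-formers, most confident first
spore_formers <- results[results$Meta_Prediction == "Sporulating", ]
spore_formers <- spore_formers[order(-spore_formers$Meta_Prob_Sporulating), ]
print(spore_formers$genome_ID)
```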
To use SpoMAG, your input must be a functional annotation table, such as the output from eggNOG-mapper, containing at least three columns:
genome_ID | Preferred_name | KEGG_ko |
---|---|---|
G001 | spoIIIE | K03466 |
G001 | spo0A | K07699 |
G001 | - | K01056 |
G001 | pth | - |
… | … | … |
Each row should represent one gene annotation.
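Since the column names must match exactly, it can help to check a table before running the workflow. The following is a minimal sketch; `check_annotation_columns()` is a hypothetical helper that is not part of SpoMAG, and the dataframe simply reproduces the example rows shown above.

```r
# Hypothetical helper (not part of SpoMAG): verify the required columns exist.
check_annotation_columns <- function(df) {
  required <- c("genome_ID", "Preferred_name", "KEGG_ko")
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# The example rows from the table above, one gene annotation per row
annotations <- data.frame(
  genome_ID      = c("G001", "G001", "G001", "G001"),
  Preferred_name = c("spoIIIE", "spo0A", "-", "pth"),
  KEGG_ko        = c("K03466", "K07699", "K01056", "-")
)

check_annotation_columns(annotations)  # passes silently
```

A misspelled or missing column, such as `kegg_ko` instead of `KEGG_ko`, makes the helper stop with a message naming the missing column.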
Another distinguishing feature of SpoMAG is its ability to infer gene presence from the annotation file even when annotations are ambiguous. As shown in the example above, some rows contain a valid `KEGG_ko` code but a missing or undefined `Preferred_name` (e.g., "-"), while others have a predicted gene name but no associated KO. SpoMAG integrates both fields to assign a unified `spo_gene_name`:
- If `Preferred_name` is missing but `KEGG_ko` matches a known sporulation-associated KO, the gene is identified based on the KO.
- If `KEGG_ko` is missing but `Preferred_name` matches a known sporulation gene, the gene is identified based on the name.
- If both are informative and match known references, preference is given to `Preferred_name`.
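The three rules above can be sketched in a few lines of base R. This is only an illustration of the precedence logic, not SpoMAG's code: `unify_gene_name()` is a hypothetical helper, and the two lookup tables are toy stand-ins built from the gene/KO pairs in the example table, whereas the package relies on its own curated references.

```r
# Toy lookup tables built from the example rows; SpoMAG's curated lists are not shown here.
known_names <- c("spoIIIE", "spo0A")
ko_to_gene  <- c(K03466 = "spoIIIE", K07699 = "spo0A")

# Hypothetical helper (not SpoMAG's code): gene name first, KO as fallback.
unify_gene_name <- function(preferred_name, kegg_ko) {
  by_name <- ifelse(preferred_name %in% known_names, preferred_name, NA_character_)
  by_ko   <- unname(ko_to_gene[kegg_ko])  # NA when the KO is not in the lookup
  ifelse(!is.na(by_name), by_name, by_ko)
}

annotations <- data.frame(
  Preferred_name = c("spoIIIE", "-", "spo0A", "pth"),
  KEGG_ko        = c("K03466", "K07699", "-", "-")
)
annotations$spo_gene_name <- unify_gene_name(annotations$Preferred_name,
                                             annotations$KEGG_ko)
print(annotations)
```

The second row is resolved through its KO, the third through its name, and the last row stays `NA` because neither field matches the toy references.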
This is a quick example using the included files `one_sporulating.csv` (a known spore-former) and `one_asporogenic.csv` (a known non-spore-former). The genome used for the spore-former here is the following:
genome_ID | Preferred_name | KEGG_ko |
---|---|---|
GCF_000007625.1 | spoIIIE | K03466 |
GCF_000007625.1 | spo0A | K07699 |
GCF_000007625.1 | - | K01056 |
GCF_000007625.1 | pth | - |
… | … | … |
The genome used for the non-spore-former here is the following:
genome_ID | Preferred_name | KEGG_ko |
---|---|---|
GCF_000006785.2 | spo0A | K07699 |
GCF_000006785.2 | - | K01056 |
GCF_000006785.2 | pth | - |
… | … | … |
# Load package
library(SpoMAG)
# Load example annotation tables
file_spor <- system.file("extdata", "one_sporulating.csv", package = "SpoMAG")
file_aspo <- system.file("extdata", "one_asporogenic.csv", package = "SpoMAG")
# Read files
df_spor <- readr::read_csv(file_spor, show_col_types = FALSE)
df_aspo <- readr::read_csv(file_aspo, show_col_types = FALSE)
# Step 1: Extract sporulation-related genes
genes_spor <- sporulation_gene_name(df_spor)
genes_aspo <- sporulation_gene_name(df_aspo)
# Step 2: Convert to binary matrix
bin_spor <- build_binary_matrix(genes_spor)
bin_aspo <- build_binary_matrix(genes_aspo)
# Step 3: Predict using ensemble model (preloaded in package)
result_spor <- predict_sporulation(bin_spor)
result_aspo <- predict_sporulation(bin_aspo)
# View results
print(result_spor)
print(result_aspo)
This is a quick example using the included files `ten_sporulating.csv` (ten known spore-formers) and `ten_asporogenic.csv` (ten known non-spore-formers). The genomes used for the spore-formers here are the following:
genome_ID | Preferred_name | KEGG_ko |
---|---|---|
GCF_000011985.1 | spoIIIE | K03466 |
GCF_000011985.1 | spo0A | K07699 |
GCF_000011045.1 | - | K01056 |
GCF_000011045.1 | pth | - |
… | … | … |
The genomes used for the non-spore-formers here are the following:
genome_ID | Preferred_name | KEGG_ko |
---|---|---|
GCF_000010165.1 | spoIIIE | K03466 |
GCF_000009205.2 | - | K01056 |
GCF_000009205.2 | pth | - |
… | … | … |
# Load package
library(SpoMAG)
# Load example annotation tables
file_spor <- system.file("extdata", "ten_sporulating.csv", package = "SpoMAG")
file_aspo <- system.file("extdata", "ten_asporogenic.csv", package = "SpoMAG")
# Read files
df_spor <- readr::read_csv(file_spor, show_col_types = FALSE)
df_aspo <- readr::read_csv(file_aspo, show_col_types = FALSE)
# Step 1: Extract sporulation-related genes
genes_spor <- sporulation_gene_name(df_spor)
genes_aspo <- sporulation_gene_name(df_aspo)
# Step 2: Convert to binary matrix
bin_spor <- build_binary_matrix(genes_spor)
bin_aspo <- build_binary_matrix(genes_aspo)
# Step 3: Predict using ensemble model (preloaded in package)
result_spor <- predict_sporulation(bin_spor)
result_aspo <- predict_sporulation(bin_aspo)
# View results
print(result_spor)
print(result_aspo)
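Both examples repeat the same three calls, so they could be wrapped in a small function. The sketch below is one possible way to do that. `run_spomag_pipeline()` is a hypothetical wrapper that is not part of SpoMAG, and it reads the file with base R's `read.csv()` instead of `readr::read_csv()`, on the assumption that SpoMAG accepts a plain dataframe.

```r
# Hypothetical convenience wrapper (not part of SpoMAG) chaining the three documented steps.
run_spomag_pipeline <- function(annotation_csv) {
  if (!requireNamespace("SpoMAG", quietly = TRUE)) {
    stop("Package 'SpoMAG' is required but not installed.")
  }
  # Assumption: a base R dataframe is an acceptable input for SpoMAG
  annotations <- utils::read.csv(annotation_csv, stringsAsFactors = FALSE)
  genes <- SpoMAG::sporulation_gene_name(annotations)        # Step 1
  binary_matrix <- SpoMAG::build_binary_matrix(genes)        # Step 2
  SpoMAG::predict_sporulation(binary_matrix = binary_matrix) # Step 3
}

# Usage, assuming SpoMAG is installed:
# path <- system.file("extdata", "ten_sporulating.csv", package = "SpoMAG")
# results <- run_spomag_pipeline(path)
```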