Repository Mirror for your Cloud Server and Webhosting

Title:

Find Neighbor Species of a Bacteria of Interest in the Human Gut Microbiota

Version:

1.0.1

Description:

Implementation of the local approach described in Sola et al., 2026 <doi:10.64898/2025.12.05.692507> to identify companion species of a bacteria of interest. From several abundance tables of metagenomic data, 'NeighborFinder' suggests a shortlist of companion species based on the integration of results. A visualization via a network is proposed.

License:

MIT + file LICENSE

Depends:

R (≥ 3.5.0)

Imports:

dplyr, GGally, glmnet, glue, igraph, magrittr, Matrix, mvtnorm, network, purrr, rlang, sna, stats, stringr, tibble, tidyr, utils

Suggests:

knitr, rmarkdown, testthat

VignetteBuilder:

knitr

Config/fusen/version:

0.7.2

Config/roxygen2/version:

8.0.0

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.3

NeedsCompilation:

Packaged:

2026-06-24 15:05:58 UTC; mmariadasso

Author:

Mathilde Sola

[aut, cre], Mahendra Mariadassou

[aut], Magali Berland

[aut]

Maintainer:

Mathilde Sola <mathilde.sola@inrae.fr>

Repository:

CRAN

Date/Publication:

2026-06-30 12:20:32 UTC

Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Value

No return value, called for side effects

Apply NeighborFinder simplest version on raw data

Description

Apply NeighborFinder simplest version on raw data

Usage

apply_NF_simple(
  data_with_annotation,
  object_of_interest,
  col_module_id,
  annotation_level,
  prev_level = 0.3,
  filtering_top = 20,
  seed = NULL,
  ...
)

Arguments

data_with_annotation

Dataframe. The abundance table merged with the module names. Required format: modules are the rows and samples are the columns. The first column must be the modules name (e.g. species), the second is the module ID (e.g. msp), and each subsequent column is a sample

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

prev_level

Numeric. The prevalence to be studied. Required format is decimal: 0.20 for 20% of prevalence

filtering_top

Numeric. The filtering top percentage to be studied. Required format is: 10 for top 10%

seed

Numeric. The seed number, ensuring reproducibility

...

Additional arguments passed on to cvglm_to_coeffs_by_object()

Value

Dataframe. Returns results after using apply_NeighborFinder(): for each module ID from 'object_of_interest', the names of their neighbors and the corresponding coefficients calculated by cv.glmnet()

Examples

data(data)
res_CRC_JPN <- apply_NF_simple(
  data$CRC_JPN[, 1:100],
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  seed = 20242025
)

Apply NeighborFinder on raw data

Description

Apply NeighborFinder on raw data

Usage

apply_NeighborFinder(
  data_with_annotation,
  object_of_interest,
  col_module_id,
  annotation_level,
  prev_level = 0.3,
  filtering_top = 20,
  .seed = NULL,
  ...
)

Arguments

data_with_annotation

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

prev_level

Numeric. The prevalence to be studied. Required format is decimal: 0.20 for 20% of prevalence

filtering_top

Numeric. The filtering top percentage to be studied. Required format is: 10 for top 10%

.seed

Integer. Top level RNG seed to control the generation of RNG seed for the inner loop or NULL if reproducibility is not required.

...

Additional arguments passed on to apply_NF_simple()

Value

Dataframe. Returns results after using apply_NeighborFinder(): for each module ID from 'object_of_interest', the names of their neighbors and the corresponding coefficients calculated by cv.glmnet()

Examples

data(data)
res_CRC_JPN <- apply_NeighborFinder(
  data$CRC_JPN[, 1:100],
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  .seed = 123
)

Render a table to give an indication of the values to choose for the prevalence level and the top filtering percentage

Description

Render a table to give an indication of the values to choose for the prevalence level and the top filtering percentage

Usage

choose_params_values(
  data_with_annotation,
  object_of_interest,
  sample_size,
  prev_list = c(0.2, 0.3, 0.4),
  filtering_list = c(10, 20, 30),
  graph_file = NULL,
  col_module_id,
  annotation_level,
  seed = NULL
)

Arguments

data_with_annotation

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

sample_size

Numeric. Number of samples in each dataset.

prev_list

List of numeric. The prevalences to be studied. Required format is decimal: 0.20 for 20% of prevalence

filtering_list

List of numeric. The filtering top percentages to be studied. Required format is: 10 for the top 10%

graph_file

Dataframe. The object generated by graph_step() function

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

seed

Numeric. The seed number, ensuring reproducibility

Value

Dataframe. Returns F1 rates before and after using apply_NeighborFinder()

Examples


data(data)
data(graphs)
choose_params_values(
  data_with_annotation = data$CRC_JPN,
  object_of_interest = "Escherichia coli",
  sample_size = 100,
  prev_list = c(0.20, 0.30),
  filtering_list = c(10, 20),
  graph_file = graphs$CRC_JPN,
  col_module_id = "msp_id",
  annotation_level = "species",
  seed = 123
)

Compute precision rate

Description

Compute precision rate

Usage

compute_precision(true, detected)

Arguments

true

List. The one of true neighbors

detected

List. The one of detected neighbors

Value

Numeric. Returns the precision rate

Examples

compute_precision(c("a"), c("a", "b", "c"))
compute_precision(c("a", "b"), c("a", "c"))

Compute recall rate

Description

Compute recall rate

Usage

compute_recall(true, detected)

Arguments

true

List. The one of true neighbors

detected

List. The one of detected neighbors

Value

Numeric. Returns the recall rate

Examples

compute_recall(c("a"), c("a", "b", "c"))
compute_recall(c("a", "b"), c("a", "c"))

Apply cv.glmnet() for a list of module IDs and for each prevalence level

Description

Apply cv.glmnet() for a list of module IDs and for each prevalence level

Usage

cvglm_to_coeffs_by_object(
  list_dfs,
  test_module = identify_module(),
  seed = NULL,
  ...
)

Arguments

list_dfs

List of dataframe. A normalized dataframe

test_module

List of string. The module IDs

seed

Numeric. The seed number, ensuring reproducibility

...

Additional arguments passed on to find_all_module_neighbors()

Value

Dataframe. Returns the module ID, its detected neighbor and the corresponding coefficient

Examples

data(data)
data(metadata)
# Simple example
normed_JPN <- norm_data(
  data$CRC_JPN,
  col_module_id = "msp_id",
  annotation_level = "species",
  prev_list = c(0.25, 0.30)
)
neighbors_JPN <- cvglm_to_coeffs_by_object(
  list_dfs = normed_JPN,
  test_module = c("msp_0030", "msp_0345"),
  seed = 20242025
)
# Example with covariate
# normed_CHN <- norm_data(
#   data$CRC_CHN,
#   col_module_id = "msp_id",
#   annotation_level = "species",
#   prev_list = c(0.25, 0.30)
# )
# neighbors_CHN <- cvglm_to_coeffs_by_object(
#   list_dfs = normed_CHN,
#   test_module = c("msp_0030", "msp_0345"),
#   seed = 20242025,
#   covar = ~study_accession,
#   meta_df = metadata$CRC_CHN,
#   sample_col = "secondary_sample_accession"
# )

data

Description

#' @format A list of dataframes corresponding to abundance tables merges with taxonomic information.

CRC_JPN: dataframe with only Japanese patients diagnosed with colorectal cancer
CRC_CHN: dataframe with only Chinese patients diagnosed with colorectal cancer
CRC_EUR: dataframe with only European patients diagnosed with colorectal cancer
CRC_JPN_CHN_EUR: dataframe with patients diagnosed with colorectal cancer from the 3 previous countries

Usage

data

Format

An object of class list of length 4.

Source

https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/7IVO3E

Gather lists of neighbors of true ones from the graph and detected ones from cv.glmnet()

Description

Gather lists of neighbors of true ones from the graph and detected ones from cv.glmnet()

Usage

final_step(df_truth, df_glm, robustness_step = NULL)

Arguments

df_truth

Dataframe. The one resulting from truth_by_prevalence()

df_glm

Dataframe. The one resulting from cvglm_to_coeffs_by_object()

robustness_step

Boolean. When TRUE, filtering_top will be different from 100%, when FALSE the reults from the naïve method are looked at

Value

Dataframe. Returns for each level of prevalence and module ID, the list of true and/or detected neighbors and the corresponding list of coefficients

Examples

# Dataframe with true neighbors
df_true <- list(
  tibble::tibble(
    node1 = c("msp_1", "msp_1", "msp_2", "msp_3"),
    node2 = c("msp_55", "msp_20", "msp_3", "msp_18"),
    prev1 = c(0.28, 0.28, 0.96, 0.75),
    prev2 = c(0.76, 0.25, 0.75, 0.60)
  ),
  tibble::tibble(
    node1 = c("msp_2", "msp_3"),
    node2 = c("msp_3", "msp_18"),
    prev1 = c(0.96, 0.75),
    prev2 = c(0.75, 0.60)
  )
) %>%
  rlang::set_names(c("0.20", "0.30"))

# Dataframe with detected neighbors
df_detected <- list(
  tibble::tibble(
    prev_level = c("0.20", "0.30", "0.30", "0.30"),
    node1 = c("msp_2", "msp_2", "msp_3", "msp_3"),
    node2 = c("msp_3", "msp_3", "msp_18", "msp_8"),
    coef = c(0.406, -0.025, 0.160, 0.005),
    filtering_top = c(100, 100, 100, 100)
  ),
  tibble::tibble()
) %>%
  rlang::set_names(c("0.20", "0.30"))
# Use final_step() to gather both
neighbors <- final_step(df_true, df_detected, robustness_step = FALSE)

Apply cv.glmnet() for a list of module IDs

Description

Apply cv.glmnet() for a list of module IDs

Usage

find_all_module_neighbors(df, test_module, seed = NULL, ...)

Arguments

df

Dataframe. A normalized dataframe

test_module

List of string. The module IDs

seed

Numeric. The seed number, ensuring reproducibility

...

Additional arguments passed on to find_module_neighbors()

Value

Dataframe. Returns the module ID, its detected neighbor and the corresponding coefficient

Examples

data(data)
data(metadata)
# Simple example
x <- norm_data(data$CRC_JPN, 0.30, annotation_level = "species")[[1]]
neighbors_JPN <- find_all_module_neighbors(
  df = x,
  test_module = c("msp_0030", "msp_0345"),
  seed = 20242025
)
# Example with covariate
# x <- norm_data(data$CRC_CHN, 0.30, annotation_level = "species")[[1]]
# neighbors_CHN <- find_all_module_neighbors(
#   df = x,
#   test_module = c("msp_0030", "msp_0345"),
#   seed = 20242025,
#   covar = ~study_accession,
#   meta_df = metadata$CRC_CHN,
#   sample_col = "secondary_sample_accession"
# )

Apply cv.glmnet() for a given mmodule ID

Description

Apply cv.glmnet() for a given mmodule ID

Usage

find_module_neighbors(
  df,
  module,
  seed = NULL,
  covar = NULL,
  meta_df = NULL,
  sample_col = NULL
)

Arguments

df

Dataframe. A normalized dataframe

module

String. The module ID name

seed

Numeric. The seed number, ensuring reproducibility

covar

String or formula. Formula or the name of the column of the covariate in the metadata table. Note that "study_accession" is equivalent to ~study_accession

meta_df

Dataframe. The dataframe giving metadata information

sample_col

String. The name of the column in metadata indicating the sample names, it should be consistent with the colnames of 'df'

Value

Dataframe. Returns the module ID, its detected neighbor and the corresponding coefficient

Examples

data(data)
data(metadata)
# Simple example
x <- norm_data(data$CRC_JPN, 0.30)[[1]]
neighbors_JPN <- find_module_neighbors(
  df = x,
  module = "msp_0030",
  seed = 20242025
)
# Example with covariate
# x <- norm_data(data$CRC_CHN, 0.30)[[1]]
# neighbors_CHN <- find_module_neighbors(
#   df = x,
#   module = "msp_0030",
#   seed = 20242025,
#   covar = ~study_accession,
#   meta_df = metadata$CRC_CHN,
#   sample_col = "secondary_sample_accession"
# )

Conversion to count table function with prevalence filter

Description

Conversion to count table function with prevalence filter

Usage

get_count_table(
  abund.path = NULL,
  abund.table = NULL,
  sample.id = NULL,
  prev.min,
  verbatim = TRUE,
  msp = NULL
)

Arguments

abund.path

String. Path to the abundance table

abund.table

Dataframe. Abundance table, it should have the bacterial species names as first column

sample.id

String vector. IDs of samples to keep in the final table

prev.min

Numeric. The value is between 0 and 1 and corresponds to the minimal prevalence threshold of bacterial species to keep in the final table

verbatim

Boolean. Controls verbosity

msp

String vector. It indicates bacterial species names, if they are not specified in the abundance table first column

Value

A list containing

data:

the final count table (tibble)

prevalences:

a tibble gathering the prevalence of each bacterial species

Source

    The MIT License text for the original package is as follows:
    ---
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
    ---

Examples

tiny_data <- data.frame(
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07),
  SAMPLE2 = c(1.251707e-07, 1.251707e-07, 3.985320e-07, 0),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0)
)
# Applying a prevalence filter of 30% on the new count_table
count_table <- get_count_table(
  abund.table = tiny_data,
  sample.id = colnames(tiny_data),
  prev.min = 0.3
)

Generate a graph with a "cluster-like" structure, only needed for simulation purposes

Description

Generate a graph with a "cluster-like" structure, only needed for simulation purposes

Usage

graph_step(
  data_with_annotation,
  col_module_id,
  annotation_level,
  seed = 10010,
  data_type = "shotgun",
  ...
)

Arguments

data_with_annotation

col_module_id

String. The name of the column with the module names in the annotation table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

seed

Numeric. Seed number for data generation (new_synth_data)

data_type

String. Enables the treatment of 16S data with "16S", default value is "shotgun"

...

Additional arguments passed on to the generator_graph() routine of new_synth_data()

Value

Dataframe. The dataframe is composed of 0 and 1 corresponding to the existence of edges on the graph.

Examples

tiny_data <- data.frame(
  species = c(
    "One bacteria",
    "One bacterium L",
    "One bacterium G",
    "Two bact",
    "Three bact A",
    "Three bact B"
  ),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4", "msp_5", "msp_6"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07, 0, 0),
  SAMPLE2 = c(1.251707e-07, 0, 3.985320e-07, 0, 1.33607e-04, 0.8675e-03),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06, 0, 0.662e-03),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0, 1.275e-04, 0),
  SAMPLE5 = c(0.0976, 0.9862, 2.98320e-03, 0, 3.9754e-03, 0),
  SAMPLE6 = c(0.26417e-06, 0, 1.0077e-05, 3.983320e-08, 0, 0)
)

tiny_graph <- graph_step(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  seed = 20242025
) %>%
  suppressWarnings()

graphs

Description

graphs

Usage

graphs

Format

A list of dataframes corresponding to graphs based on synthetic data.

CRC_JPN: graph corresponding to data with only Japanese patients diagnosed with colorectal cancer
CRC_JPN_CHN_EUR: graph corresponding to data gathering Japanese, Chinese and European patients diagnosed with colorectal cancer

List the modules corresponding to a given object of interest

Description

List the modules corresponding to a given object of interest

Usage

identify_module(
  object_of_interest,
  annotation_table,
  col_module_id,
  annotation_level = "species"
)

Arguments

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

annotation_table

Dataframe. The dataframe gathering the taxonomic or functional module correspondence information

col_module_id

String. The name of the column with the module names in the annotation table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1. Default value is set to the species level

Value

List of string. The module names of the corresponding object of interest

Examples

df_taxo <- data.frame(
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  genus = c("One", "One", "One", "Two"),
  species = c("One bacteria", "One bacterium L", "One bacterium G", "Two bact")
)
identify_module(
  object_of_interest = "bacterium",
  annotation_table = df_taxo,
  col_module_id = "msp_name",
  annotation_level = "species"
)
identify_module(
  object_of_interest = "One",
  annotation_table = df_taxo,
  col_module_id = "msp_name",
  annotation_level = "species"
)

Display the intersection network from 2 or more datasets

Description

Display the intersection network from 2 or more datasets

Usage

intersections_network(
  res_list,
  threshold,
  annotation_table,
  col_module_id,
  annotation_level,
  object_of_interest,
  annotation_option = FALSE,
  node_size = 12,
  label_size = 4,
  edge_label_size = 2,
  object_color = "cadetblue2",
  seed = NULL
)

Arguments

res_list

List of dataframes. The results from apply_NeighborFinder() on several datasets

threshold

Numeric. Integer corresponding to the minimum number of datasets in which you want neighbors to have been found

annotation_table

Dataframe. The dataframe gathering the taxonomic or functional module correspondence information

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

annotation_option

Boolean. Default value is False. If True: labels on nodes become module names instead of module IDs

node_size

Numeric. The parameter to adjust size of nodes

label_size

Numeric. The parameter to adjust size of labels

edge_label_size

Numeric. The parameter to adjust the size of edge labels

object_color

String. The name of the color to differentiate the nodes corresponding to 'object_of_interest' from the other module IDs

seed

Numeric. The seed number, ensuring reproducibility

Value

Network. Visualization of NeighborFinder results from several datasets. Edge color encodes the mean coefficient sign: green if positive, red if negative.

Examples

data(taxo)
data(result_example)
intersections_network(
  res_list = list(
    result_example$res_CRC_JPN,
    result_example$res_CRC_CHN,
    result_example$res_CRC_EUR
  ),
  taxo,
  threshold = 2,
  "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 7,
  edge_label_size = 4,
  node_size = 15,
  annotation_option = TRUE,
  seed = 3
)

Display the intersection table summarizing the results from 2 or more datasets

Description

Display the intersection table summarizing the results from 2 or more datasets

Usage

intersections_table(
  res_list,
  threshold,
  annotation_table,
  col_module_id,
  annotation_level,
  object_of_interest
)

Arguments

res_list

List of dataframes. The results from apply_NeighborFinder() on several datasets

threshold

Numeric. Integer corresponding to the minimum number of datasets in which you want neighbors to have been found

annotation_table

Dataframe. The dataframe gathering the taxonomic or functional module correspondence information

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

Value

Dataframe. Table gathering the intersection of NeighborFinder results from several datasets. The column 'datasets' indicates the datasets in which the same neighbor has been found, the column 'intersections' indicates the number of datasets in which the same neighbor has been found

Examples

data(taxo)
data(result_example)
intersections_table(
  res_list = list(
    result_example$res_CRC_JPN,
    result_example$res_CRC_CHN,
    result_example$res_CRC_EUR
  ),
  threshold = 2,
  taxo,
  col_module_id = "msp_id",
  annotation_level = "species",
  "Escherichia coli"
)

Modified central log ratio (mclr) transformation extracted from the SPRING package

Description

Modified central log ratio (mclr) transformation extracted from the SPRING package

Usage

mclr(dat, base = exp(1), tol = 1e-16, eps = NULL, atleast = 1)

Arguments

dat

raw count data or compositional data (n by p) does not matter.

base

exp(1) for natural log

tol

tolerance for checking zeros

eps

epsilon in eq (2) of the paper "Yoon, Gaynanova, Müller (2019), Frontiers in Genetics". positive shifts to all non-zero compositions. Refer to the paper for more details. eps = absolute value of minimum of log ratio counts plus c.

atleast

default value is 1. Constant c which ensures all nonzero values to be strictly positive. default is 1.

Details

This function implements the mclr normalization introduced in Yoon G, Gaynanova I and Müller CL (2019) Microbial Networks in SPRING - Semi-parametric Rank-Based Correlation and Partial Correlation Estimation for Quantitative Microbiome Data. Front. Genet. 10:516. doi:10.3389/fgene.2019.00516

Value

mclr returns a data matrix of the same dimension with input data matrix.

metadata

Description

#' @format A list of dataframes corresponding to metadata, giving more information on each patient.

CRC_JPN: dataframe with only Japanese patients diagnosed with colorectal cancer
CRC_CHN: dataframe with only Chinese patients diagnosed with colorectal cancer
CRC_EUR: dataframe with only European patients diagnosed with colorectal cancer
CRC_JPN_CHN_EUR: dataframe with patients diagnosed with colorectal cancer from the 3 previous countries

Usage

metadata

Format

An object of class list of length 4.

Source

https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/7IVO3E

Correspondence between the module ID (msp or functional module) and its name (bacteria or function)

Description

Correspondence between the module ID (msp or functional module) and its name (bacteria or function)

Usage

module_to_node(
  module,
  annotation_table,
  col_module_id = "msp_name",
  annotation_level
)

Arguments

module

String. The name of the biological object (msp or functional module), can be a single one or a list

annotation_table

Dataframe. The dataframe gathering the taxonomic or functional module correspondence information

col_module_id

String. The name of the column with the module names in the annotation table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

Value

Dictionary. The name of the module, can be a single one or a list

Examples

df_taxo <- data.frame(
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  genus = c("A", "B", "C", "D"),
  species = c("A a", "B a", "C c", "D b")
)
# Correspondence for one specific msp
module_to_node(
  "msp_1",
  annotation_table = df_taxo,
  col_module_id = "msp_name",
  annotation_level = "species"
)
# or for several msps
module_to_node(
  c("msp_1", "msp_3", "msp_4"),
  annotation_table = df_taxo,
  col_module_id = "msp_name",
  annotation_level = "species"
)
# and if one msp is repeated
module_to_node(
  c("msp_1", "msp_1", "msp_2"),
  annotation_table = df_taxo,
  col_module_id = "msp_name",
  annotation_level = "genus"
)

Simulate data from some empirical count dataset with a "cluster-like" structure

Description

Simulate data from some empirical count dataset with a "cluster-like" structure

Usage

new_synth_data(
  real_data,
  graph_type = "cluster",
  must_connect = TRUE,
  graph = NULL,
  n = 300,
  seed = NULL,
  r = 50,
  dens = 4,
  k = 3,
  verbatim = TRUE,
  signed = FALSE
)

Arguments

real_data

Matrix. Empirical count table

graph_type

String. Structure type for the conditional dependency structure. Here only "cluster" was kept, see EMtree package for more options

must_connect

Boolean. TRUE to force the output graph to be connected

graph

Boolean. Optional graph to be used, must have rownames and colnames and reference all features from real_data

n

Numeric. Number of samples to simulate

seed

Numeric. Seed number for data generation (rmvnorm)

r

Numeric. For cluster structure, controls the within/between ratio connection probability

dens

Numeric. Graph density (for cluster graphs) or edges probability (for erdös-renyi graphs)

k

Numeric. For cluster structure, number of groups

verbatim

Boolean. Controls verbosity

signed

Boolean. TRUE for simulating both positive and negative partial correlations. Default is to FALSE, which implies only negative partial correlations

Value

List. Containing the simulated discrete counts, the corresponding true partial correlation matrix from the latent Gaussian layer of the model and the original graph structure that was used

Source

    The MIT License text for the original package is as follows:
    ---
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
    ---

Examples

tiny_data <- data.frame(
  species = c(
    "One bacteria",
    "One bacterium L",
    "One bacterium G",
    "Two bact",
    "Three bact A",
    "Three bact B"
  ),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4", "msp_5", "msp_6"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07, 0, 0),
  SAMPLE2 = c(1.251707e-07, 0, 3.985320e-07, 0, 1.33607e-04, 0.8675e-03),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06, 0, 0.662e-03),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0, 1.275e-04, 0),
  SAMPLE5 = c(0.0976, 0.9862, 2.98320e-03, 0, 3.9754e-03, 0),
  SAMPLE6 = c(0.26417e-06, 0, 1.0077e-05, 3.983320e-08, 0, 0)
)

count_table <- get_count_table(
  abund.table = tiny_data %>% dplyr::select(-species),
  sample.id = colnames(tiny_data),
  prev.min = 0.1
)
tiny_graph <- graph_step(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  seed = 20242025
) %>%
  suppressWarnings()
sim_data <- new_synth_data(
  count_table$data,
  n = 50,
  graph = as.matrix(tiny_graph %>% dplyr::select(-species)),
  verbatim = FALSE,
  seed = 20242025
) %>%
  suppressWarnings()

Normalize data and filters it by prevalence level

Description

Normalize data and filters it by prevalence level

Usage

norm_data(
  data_with_annotation,
  col_module_id,
  prev_list = c(0.3),
  annotation_level,
  data_type = "shotgun"
)

Arguments

data_with_annotation

col_module_id

String. The name of the column with the module names in annotation_table

prev_list

Vector of numeric. The prevalences to be studied. Required format is decimal: 0.20 for 20% of prevalence

annotation_level

String. Annotation level to aggregate the taxa

data_type

String. Enables the treatment of 16S data with "16S", default value is "shotgun"

Value

List of dataframes. Each element of the list corresponds to a normalized 'data_with_annotation', by level of prevalence

Examples

tiny_data <- data.frame(
  species = c("One bacteria", "One bacterium L", "One bacterium G", "Two bact"),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07),
  SAMPLE2 = c(1.251707e-07, 1.251707e-07, 3.985320e-07, 0),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0)
)

tiny_normed <- norm_data(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  prev_list = c(0.20, 0.30)
)

Extract edges in graph involving any module in object_of_interest set

Description

Extract edges in graph involving any module in object_of_interest set

Usage

prev_for_selected_nodes(
  data_with_annotation,
  graph_file,
  col_module_id,
  annotation_level,
  object_of_interest = NULL
)

Arguments

data_with_annotation

graph_file

Dataframe. The object generated by graph_step() function

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

Value

Dataframe. The dataframe of edges in the graph involving modules corresponding to object_of_interest and their corresponding prevalences.

Examples

tiny_data <- data.frame(
  species = c("One bacteria", "One bacterium L", "One bacterium G", "Two bact"),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07),
  SAMPLE2 = c(1.251707e-07, 1.251707e-07, 3.985320e-07, 0),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0)
)
tiny_graph <- graph_step(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  seed = 20242025
) %>%
  suppressWarnings()

tiny_truth <- prev_for_selected_nodes(
  tiny_data,
  tiny_graph,
  col_module_id = "msp_name",
  annotation_level = "species",
  object_of_interest = "bacterium"
)

result_example

Description

result_example

Usage

result_example

Format

A list of dataframes corresponding to apply_NeighborFinder() results.

res_CRC_JPN: edge table of Escherichia coli neighbors with only Japanese patients diagnosed with colorectal cancer
res_CRC_CHN: edge table of Escherichia coli neighbors with only Chinese patients diagnosed with colorectal cancer
res_CRC_EUR: edge table of Escherichia coli neighbors with only European patients diagnosed with colorectal cancer

List the simulated count tables by level of prevalence

Description

List the simulated count tables by level of prevalence

Usage

simulate_by_prevalence(
  data_with_annotation,
  prev_list,
  graph_file = NULL,
  col_module_id,
  annotation_level,
  sample_size = 500,
  seed = NULL,
  verbatim = FALSE,
  data_type = "shotgun"
)

Arguments

data_with_annotation

prev_list

List of numeric. The prevalences to be studied. Required format is decimal: 0.20 for 20% of prevalence.

graph_file

Dataframe. The object generated by graph_step() function

col_module_id

String. The name of the column with the module names in data_with_annotation

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

sample_size

Numeric. The size to be considerated, the value of 500 is recommended

seed

Numeric. The seed number, ensuring reproducibility

verbatim

Boolean. Controls verbosity

data_type

String. Enables the treatment of 16S data with "16S", default value is "shotgun"

Value

List of dataframes. Each element of the list corresponds to a level of prevalence and is a simulated abundance table

Examples

tiny_data <- data.frame(
  species = c("One bacteria", "One bacterium L", "One bacterium G", "Two bact"),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07),
  SAMPLE2 = c(1.251707e-07, 1.251707e-07, 3.985320e-07, 0),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0)
)

tiny_graph <- graph_step(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  seed = 20242025
) %>%
  suppressWarnings()

tiny_sims <- simulate_by_prevalence(
  tiny_data,
  prev_list = c(0.20, 0.30),
  graph_file = tiny_graph,
  col_module_id = "msp_name",
  annotation_level = "species",
  sample_size = 500,
  seed = 20242025
)

Simulate data Generates synthetic count data based on empirical cumulative distribution (ecdf) of real count data

Description

Simulate data Generates synthetic count data based on empirical cumulative distribution (ecdf) of real count data

Usage

simulate_from_ecdf(real_data, Sigma, n, seed = NULL, verbatim = FALSE)

Arguments

real_data

Matrix. Contains real count data of size n by p

Sigma

Matrix. Covariance structure of size p by p

n

Numeric. Number of samples

seed

Numeric. Seed number for data generation

verbatim

Boolean. If TRUE: iteration and index calculation for each step are printed out

Value

Matrix. The vector from the upper triangular part of A.mat

Source

    The MIT License text for the original package is as follows:
    ---
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
    ---

taxo

Description

#' @format A dataframe with 2537 msps (rows) and 4 columns:

msp_id: string
species: string
genus: string
catalogue: string indicating gut and/or oral

Usage

taxo

Format

An object of class data.frame with 2537 rows and 4 columns.

Source

https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/7IVO3E

Render a table gathering precision and recall rates before and after filtering on coefficient values

Description

Render a table gathering precision and recall rates before and after filtering on coefficient values

Usage

test_filter(df_before, df_after, prevs = NULL)

Arguments

df_before

Dataframe. The one of true neighbors

df_after

Dataframe. The one of detected neighbors

prevs

List of numeric. The prevalences to be studied. Required format is decimal: 0.20 for 20% of prevalence

Value

Dataframe. Returns the precision and recall rates before and after the modification

Examples

# Dataframe with true neighbors
list_true <- list(
  tibble::tibble(
    node1 = c("msp_1", "msp_1", "msp_2", "msp_3"),
    node2 = c("msp_55", "msp_20", "msp_3", "msp_18"),
    prev1 = c(0.28, 0.28, 0.96, 0.75),
    prev2 = c(0.76, 0.25, 0.75, 0.60)
  ),
  tibble::tibble(
    node1 = c("msp_2", "msp_3"),
    node2 = c("msp_3", "msp_18"),
    prev1 = c(0.96, 0.75),
    prev2 = c(0.75, 0.60)
  )
) %>%
  rlang::set_names(c("0.20", "0.30"))

# Dataframes with detected neighbors
list_detected <- list(
  tibble::tibble(
    prev_level = c("0.20", "0.30", "0.30", "0.30"),
    node1 = c("msp_2", "msp_2", "msp_3", "msp_3"),
    node2 = c("msp_3", "msp_3", "msp_18", "msp_8"),
    coef = c(0.406, -0.025, 0.160, 0.005),
    filtering_top = c(100, 100, 100, 100)
  ),
  tibble::tibble()
) %>%
  rlang::set_names(c("0.20", "0.30"))
list_detected2 <- list(
  tibble::tibble(
    prev_level = c("0.20", "0.20"),
    node1 = c("msp_2", "msp_3"),
    node2 = c("msp_3", "msp_18"),
    coef = c(0.160, 0.005),
    filtering_top = c(100, 100)
  ),
  tibble::tibble()
) %>%
  rlang::set_names(c("0.20", "0.30"))
# Use final_step() to gather both
neighbors <- final_step(list_true, list_detected, robustness_step = FALSE)
neighbors2 <- final_step(list_true, list_detected2, robustness_step = FALSE) %>%
  dplyr::mutate(filtering_top = 10)
# Calculate scores
scores <- test_filter(neighbors, neighbors2)

Give true neighbors by level of prevalence

Description

Give true neighbors by level of prevalence

Usage

truth_by_prevalence(edge_table, prev_list)

Arguments

edge_table

Dataframe. The result of prev_for_selected_nodes()

prev_list

List of numeric. The prevalences to be studied. Required format is decimal: 0.20 for 20% of prevalence

Value

List of dataframes. Each element of the list corresponds to a dataframe of true edges by level of prevalence

Examples

tiny_data <- data.frame(
  species = c("One bacteria", "One bacterium L", "One bacterium G", "Two bact"),
  msp_name = c("msp_1", "msp_2", "msp_3", "msp_4"),
  SAMPLE1 = c(0, 1.328425e-06, 0, 1.527688e-07),
  SAMPLE2 = c(1.251707e-07, 1.251707e-07, 3.985320e-07, 0),
  SAMPLE3 = c(0, 0, 4.926046e-09, 5.626392e-06),
  SAMPLE4 = c(0, 0, 2.98320e-05, 0)
)

tiny_graph <- graph_step(
  tiny_data,
  col_module_id = "msp_name",
  annotation_level = "species",
  seed = 20242025
) %>%
  suppressWarnings()

tiny_truth <- prev_for_selected_nodes(
  tiny_data,
  tiny_graph,
  col_module_id = "msp_name",
  annotation_level = "species",
  object_of_interest = "bacterium"
)

tiny_true_edges <- truth_by_prevalence(tiny_truth, c(0.20, 0.30))

Display network after applying NeighborFinder

Description

Display network after applying NeighborFinder

Usage

visualize_network(
  res_NeighborFinder,
  annotation_table,
  col_module_id,
  annotation_level,
  object_of_interest,
  annotation_option = FALSE,
  node_size = 12,
  label_size = 4,
  object_color = "cadetblue2",
  seed = NULL
)

Arguments

res_NeighborFinder

Dataframe. The result from apply_NeighborFinder()

annotation_table

Dataframe. The dataframe gathering the taxonomic or functional module correspondence information

col_module_id

String. The name of the column with the module names in annotation_table

annotation_level

String. The name of the column with the level to be studied. Examples: species, genus, level_1

object_of_interest

String. The name of the bacteria or species of interest or a key word in the functional module definition

annotation_option

Boolean. Default value is False. If True: labels on nodes become module names instead of module IDs

node_size

Numeric. The parameter to adjust size of nodes

label_size

Numeric. The parameter to adjust size of labels

object_color

String. The name of the color to differentiate the nodes corresponding to 'object_of_interest' from the other module IDs

seed

Numeric. The seed number, ensuring reproducibility

Value

Network. Visualization of NeighborFinder results. Edge color encodes coefficient sign: green if positive, red if negative; edge width encodes magnitude.

Examples

data(taxo)
visualize_network(
  result_example$res_CRC_JPN,
  taxo,
  object_of_interest = "Escherichia coli",
  col_module_id = "msp_id",
  annotation_level = "species",
  label_size = 5
)
# #With species names instead of msp names
# visualize_network(
#   result_example$res_CRC_JPN,
#   taxo,
#   object_of_interest = "Escherichia coli",
#   col_module_id = "msp_id",
#   annotation_level = "species",
#   label_size = 5,
#   annotation_option = TRUE,
#   seed = 2
# )
# #With esthetic changes
# visualize_network(
#   result_example$res_CRC_JPN,
#   taxo,
#   object_of_interest = "Escherichia coli",
#   col_module_id = "msp_id",
#   annotation_level = "species",
#   annotation_option = TRUE,
#   node_size = 15,
#   label_size = 6,
#   object_color = "orange",
#   seed = 2
# )

Package {NeighborFinder}

Pipe operator

Description

Usage

Value

Apply NeighborFinder simplest version on raw data

Description

Usage

Arguments

Value

Examples

Apply NeighborFinder on raw data

Description

Usage

Arguments

Value

Examples

Render a table to give an indication of the values to choose for the prevalence level and the top filtering percentage

Description

Usage

Arguments

Value

Examples

Compute precision rate

Description

Usage

Arguments

Value

Examples

Compute recall rate

Description

Usage

Arguments

Value

Examples

Apply cv.glmnet() for a list of module IDs and for each prevalence level

Description

Usage

Arguments

Value

Examples

data

Description

Usage

Format

Source

Gather lists of neighbors of true ones from the graph and detected ones from cv.glmnet()

Description

Usage

Arguments

Value

Examples

Apply cv.glmnet() for a list of module IDs

Description

Usage

Arguments

Value

Examples

Apply cv.glmnet() for a given mmodule ID

Description

Usage

Arguments

Value

Examples

Conversion to count table function with prevalence filter

Description

Usage

Arguments

Value

Source

Examples

Generate a graph with a "cluster-like" structure, only needed for simulation purposes

Description

Usage

Arguments

Value

Examples

graphs

Description

Usage