ActivePathways is a statistical method and software tool for pathway enrichment analysis across multiple omics datasets. Different omics techniques and analyses reveal different facets of underlying biology and such integrative analysis across multivariate data helps find their commonalities in the context of pathways, biological processes and functionally related gene sets. ActivePathways uses a data fusion technique to merge multiple gene lists with p-values into an integrated gene list and conducts enrichment analysis using the ranked hypergeometric statistic.
From a matrix of p-values representing genes and their significance in different omics datasets, ActivePathways creates an integrated gene list (i.e., a query) by merging the p-values to determine a single p-value for each gene. The integrated gene list is then filtered and sorted by merged p-values. ActivePathways tests if a pathway is ‘enriched’, or over-represented in our gene list, by performing a hypergeometric test on the gene list and the genes annotated to the pathway. However, rather than testing the entire gene list, ActivePathways performs incremental enrichment analysis with an increasingly larger subset of the gene list starting at the top of the list and identifies the subset of gene list that yields the smallest p-value (known as the ranked hypergeometric statistic). This approach is useful when the genes in the gene list are of decreasing biological importance. The ranked test is more sensitive than a single test of a regular “flat” gene list as it finds smaller and more specific pathways of genes ranked at the top of the list, as well as larger and more general pathways distributed across the entire input list.
The following example uses a dataset that contains a small subset of driver genes in a cohort of adenocarcinomas of multiple cancer types. The driver scores are p-values which represent the confidence that a given gene is a driver. Driver scores are provided separately for different genetic elements: untranslated regions (5’ and 3’ UTRs), coding regions, and promoters.
The GMT file, representing a set of molecular pathways, are a small subset of the pathways curated by the Reactome database Gene Matrix Transposed file. Broadly, GMT files can be downloaded from the Bader Lab repository, the MSigDB database hosted at the Broad Institute or the g:ProfileR web server. Click on ‘Data sources’, select database preferences and download the name.gmt
file. Note that the gene IDs or symbols in the GMT file need to match the IDs or symbols used in the gene matrix.
Downloading GMT Files from gProfileR
Let’s start by reading the input data from the file. ActivePathways
expects a matrix, so the table has to be cast to the correct class.
scores <- read.table(system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package='ActivePathways'), header=TRUE, sep='\t', row.names='Gene')
scores <- as.matrix(scores)
scores
## X3UTR X5UTR CDS promCore
## A2M NA 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS NA 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT 0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1 0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5 0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8 NA 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6 0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1 0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
## [ reached getOption("max.print") -- omitted 2406 rows ]
Additionally, ActivePathways
will not allow any missing values (NA) in the input. Rows containing missing values may be removed. Alternatively, as conducted in this example, missing values will be assigned a value of 1, indicating that the element is not a driver.
scores[is.na(scores)] <- 1
The ActivePathways
function requires two inputs. The first is the data matrix of p-values. The second is the path to the GMT file.
library(ActivePathways)
gmt.file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package='ActivePathways')
ActivePathways(scores, gmt.file)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
A pathway is considered to be significantly enriched if the associated p-value is less than the significance threshold (i.e. p.val <= significant
). The significant
parameter represents the maximum p-value for a result to be considered statistically significant after multiple testing correction. By default, only the significant pathways are returned. If all results are desired, the user can set the return.all
parameter to TRUE
. ActivePathways conducts multiple testing correction by default using the Holm method.
nrow(ActivePathways(scores, gmt.file, return.all=FALSE, significant=0.05))
## [1] 20
nrow(ActivePathways(scores, gmt.file, return.all=FALSE, significant=0.1))
## [1] 23
nrow(ActivePathways(scores, gmt.file, return.all=TRUE, significant=0.05))
## [1] 265
Note that while setting significant=1
gives the same result as return.all=TRUE
, it is strongly recommended the latter is used because the results written to cytoscape.file.dir
use the significant
value to determine which pathways to include in the network. Thus ActivePathways(scores, gmt.file, return.all=TRUE, cytoscape.file.dir='/results/directory/')
and ActivePathways(scores, gmt.file, significant=1, cytoscape.file.dir='/results/directory/)
will return the same result table but will write different results files.
The GMT file contains the pathways to be analyzed for gene enrichment. Each line corresponds to a pathway and contains the pathway ID and name followed by all associated genes. While the location of the GMT file to ActivePathways
can be directly specified, in some cases the user may desire to examine the GMT file interactively. The ActivePathways package includes an interface for working with GMT objects. The GMT file can be read into R as a GMT object with the read.GMT
function. The GMT is structured as a list of terms, where each term is a list containing an id, a name, and a list of genes.
gmt <- read.GMT(gmt.file)
names(gmt[[1]])
## [1] "id" "name" "genes"
# Pretty-print the GMT
gmt[1:3]
## $`REAC:1630316`
## $`REAC:1630316`$id
## [1] "REAC:1630316"
##
## $`REAC:1630316`$name
## [1] "Glycosaminoglycan metabolism"
##
## $`REAC:1630316`$genes
## [1] "ST3GAL1" "CHP1" "B4GALT5" "ST3GAL4" "SLC9A1"
## [6] "CHST11" "B3GAT3" "SLC26A1" "ARSB" "ABCC5"
## [11] "LUM" "PAPSS1" "SDC4" "NAGLU" "AC022400.6"
## [16] "HYAL1" "NDST3" "SLC35B2" "B3GAT1" "CHST2"
## [21] "DCN" "CEMIP" "IDS" "IDUA" "HS3ST3B1"
## [26] "B3GNT7" "CHST12" "CHST14" "BCAN" "B4GALT3"
## [31] "HS3ST1" "B4GALT1" "CSPG4" "CHPF2" "HEXB"
## [ reached getOption("max.print") -- omitted 92 entries ]
##
##
## $`REAC:5633007`
## $`REAC:5633007`$id
## [1] "REAC:5633007"
##
## $`REAC:5633007`$name
## [1] "Regulation of TP53 Activity"
##
## $`REAC:5633007`$genes
## [1] "PPP1R13L" "RFC3" "SGK1" "HDAC1" "SETD9" "RPA3"
## [7] "CSNK2A2" "EHMT1" "PDPK1" "CCNG1" "TAF6" "ING5"
## [13] "TP53INP1" "BRCA1" "TAF1" "TAF9" "EHMT2" "RBBP8"
## [19] "CHD4" "USP2" "BRPF1" "BANP" "PPP2R5C" "SSRP1"
## [25] "TAF4" "TMEM55B" "PPP1R13B" "TAF10" "PRR5" "ATM"
## [31] "ZNF385A" "MRE11" "PIP4K2C" "SMYD2" "BARD1"
## [ reached getOption("max.print") -- omitted 125 entries ]
##
##
## $`REAC:5358346`
## $`REAC:5358346`$id
## [1] "REAC:5358346"
##
## $`REAC:5358346`$name
## [1] "Hedgehog ligand biogenesis"
##
## $`REAC:5358346`$genes
## [1] "PSMB11" "PSMD8" "PSMB6" "PSMB8" "PSMD13"
## [6] "PSMA2" "PSMB3" "PSMB7" "PSME2" "PSMD1"
## [11] "PSMA8" "PSMA1" "PSMD2" "PSMD14" "PSMC1"
## [16] "IHH" "SEL1L" "SCUBE2" "PSMB10" "PSMD10"
## [21] "PSMA5" "PSMD3" "PSMC2" "PSMC6" "PSME1"
## [26] "PSME4" "PSMC4" "AC010132.3" "PSMD5" "UBB"
## [31] "GPC5" "PSMA7" "PSMB2" "PSMD9" "SEM1"
## [ reached getOption("max.print") -- omitted 31 entries ]
##
##
## attr(,"class")
## [1] "GMT"
# Look at the genes annotated to the first term
gmt[[1]]$genes
## [1] "ST3GAL1" "CHP1" "B4GALT5" "ST3GAL4" "SLC9A1"
## [6] "CHST11" "B3GAT3" "SLC26A1" "ARSB" "ABCC5"
## [11] "LUM" "PAPSS1" "SDC4" "NAGLU" "AC022400.6"
## [16] "HYAL1" "NDST3" "SLC35B2" "B3GAT1" "CHST2"
## [21] "DCN" "CEMIP" "IDS" "IDUA" "HS3ST3B1"
## [26] "B3GNT7" "CHST12" "CHST14" "BCAN" "B4GALT3"
## [31] "HS3ST1" "B4GALT1" "CSPG4" "CHPF2" "HEXB"
## [ reached getOption("max.print") -- omitted 92 entries ]
# Get the full name of Reactome pathway 2424491
gmt$`REAC:2424491`$name
## [1] "DAP12 signaling"
GMT files can be filtered for size to remove large and small pathways which are overly generic or specific. Here, pathways that have less than 10 or more than 500 genes are removed. This step removes very small and specific gene sets that often represent a large fraction of all pathways, and large and unspecific gene sets that are subject to p-value inflation due to the properties of the hypergeometric distribution. Both very small and very large gene sets are often not helpful for data interpretation and can be safely discarded in most cases.
gmt <- Filter(function(term) length(term$genes) >= 10, gmt)
gmt <- Filter(function(term) length(term$genes) <= 500, gmt)
This GMT object can be written to a file.
write.GMT(gmt, 'hsapiens_REAC_subset_filtered.gmt', path = ".")
This filtering can also be done automatically using the geneset.filter
parameter in the ActivePathways
function by specifying a vector of length 2 containing the minimum and maximum pathway size desired by the user. For example, the above task can be completed by setting geneset.filter=c(10, 500)
. By default, ActivePathways
removes any pathways with less than 5 or more than 1000 annotated genes.
The new GMT object can now be used for analysis with ActivePathways
.
ActivePathways(scores, gmt)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 0.0002904356 358
## 2: REAC:177929 Signaling by EGFR 0.0032110362 366
## 3: REAC:2559583 Cellular Senescence 0.0001568181 196
## overlap evidence Genes_X3UTR
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 2: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS NA
## 3: TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,... CDS NA
## Genes_X5UTR Genes_CDS Genes_promCore
## 1: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 3: NA TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,... NA
## [ reached getOption("max.print") -- omitted 14 rows ]
To perform pathway enrichment analysis, ActivePathways
uses a set of genes as a statistical background. By default, the background
is every gene found at least once in the GMT, but it can also be specified by the user by supplying a vector of genes to the background
parameter in ActivePathways
. For example, if the analysis is conducted on a set of genes captured by a sequencing panel, then the user may wish to limit the background
to the genes whose characteristics can be measured. Genes that have no plausible signal in an experiment or analysis of interest should not be part of a statistical background as their inclusion in an analysis would lead to statistical inflation.
The list of genes in the GMT can be extracted with the makeBackground
function (automatically performed by ActivePathways
). In this example, let’s try excluding the gene TP53. This gene can be removed from the background and the new background can be used for the analysis.
background <- makeBackground(gmt)
background <- background[background != 'TP53']
ActivePathways(scores, gmt.file, background=background)
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 0.002258987 358
## 2: REAC:422475 Axon guidance 0.009416431 555
## 3: REAC:177929 Signaling by EGFR 0.021767252 366
## overlap evidence
## 1: PIK3CA,KRAS,PTEN,BRAF,NRAS,B2M,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR
## 3: PIK3CA,KRAS,PTEN,BRAF,NRAS,CALM2,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS Genes_promCore
## 1: PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,... NA
## 2: NA NA
## 3: PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,... NA
## [ reached getOption("max.print") -- omitted 13 rows ]
Note that only genes found in the background are used for testing enrichment. Any genes in the input data that are not in the background will be automatically removed by ActivePathways
.
ActivePathways
statistically merges evidence from each source (e.e., omics dataset or analysis) by merging the p-values for each gene into a single combined score. The two main methods to merge p-values are Brown’s method (the default, which assumes a correlation between different sources of evidence) and Fisher’s method (which assumes independence of different sources of evidence), but other methods are also available.
scores
## X3UTR X5UTR CDS promCore
## A2M 1.0000000000 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS 1.0000000000 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT 0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1 0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5 0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8 1.0000000000 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6 0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1 0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
## [ reached getOption("max.print") -- omitted 2406 rows ]
merge_p_values(scores, 'Fisher')
## A2M AAAS ABAT ABCC1 ABCC5
## 8.580195e-01 9.310603e-01 2.463661e-01 5.587650e-01 8.697586e-01
## ABCC8 ABHD6 ABI1 ABL1 AC142381.1
## 9.655215e-01 6.087795e-01 3.184240e-01 7.462578e-01 1.000000e+00
## ACAA1 ACADS ACAT1 ACD ACOT6
## 5.552635e-01 4.613152e-01 4.281889e-01 9.692612e-01 7.933962e-02
## ACOT8 ACOX2 ACOX3 ACSM5 ACTB
## 3.011535e-01 9.039274e-01 8.307694e-01 3.486612e-01 2.750060e-04
## ACTL6A ACTN1 ACTN2 ACTR1B ACVR2A
## 3.361929e-01 3.449892e-01 8.649147e-01 5.511921e-01 9.876590e-12
## ACVR2B ACY1 ACY3 ADAM10 ADAM17
## 7.538053e-01 9.751877e-01 7.524688e-01 1.915695e-02 2.445514e-01
## ADAM22 ADARB1 ADAT2 ADAT3 ADCY1
## 2.508503e-01 6.775969e-01 5.005089e-01 9.450412e-01 8.432588e-01
## [ reached getOption("max.print") -- omitted 2379 entries ]
merge_p_values(scores, 'Brown')
## A2M AAAS ABAT ABCC1 ABCC5
## 7.784384e-01 8.627499e-01 2.672114e-01 5.148585e-01 7.908670e-01
## ABCC8 ABHD6 ABI1 ABL1 AC142381.1
## 9.127001e-01 5.551149e-01 3.254985e-01 6.714814e-01 1.000000e+00
## ACAA1 ACADS ACAT1 ACD ACOT6
## 5.120659e-01 4.379288e-01 4.119781e-01 9.189688e-01 1.165005e-01
## ACOT8 ACOX2 ACOX3 ACSM5 ACTB
## 3.116936e-01 8.291729e-01 7.506855e-01 3.494921e-01 2.490608e-03
## ACTL6A ACTN1 ACTN2 ACTR1B ACVR2A
## 3.396220e-01 3.465883e-01 7.856999e-01 5.088222e-01 3.755813e-08
## ACVR2B ACY1 ACY3 ADAM10 ADAM17
## 6.782366e-01 9.294653e-01 6.770367e-01 4.313449e-02 2.657156e-01
## ADAM22 ADARB1 ADAT2 ADAT3 ADCY1
## 2.709006e-01 6.120292e-01 4.687120e-01 8.817002e-01 7.632312e-01
## [ reached getOption("max.print") -- omitted 2379 entries ]
The user can directly calculate merged p-values using the merge_p_values
function. For example, the X3UTR
, X5UTR
, and promCore
columns can be merged into a single non_coding
column. This will consider the three non-coding regions as a single column, rather than giving them all equal weight as the CDS
column.
scores2 <- cbind(scores[, 'CDS'], merge_p_values(scores[, c('X3UTR', 'X5UTR', 'promCore')], 'Brown'))
colnames(scores2) <- c('CDS', 'non_coding')
scores # with separate columns for 3' UTR, 5'UTR & Promoter
## X3UTR X5UTR CDS promCore
## A2M 1.0000000000 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS 1.0000000000 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT 0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1 0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5 0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8 1.0000000000 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6 0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1 0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
## [ reached getOption("max.print") -- omitted 2406 rows ]
scores2 # with a single 'noncoding' column for the merged p-value of 3' UTR, 5'UTR & Promoter
## CDS non_coding
## A2M 9.051708e-01 6.310758e-01
## AAAS 7.047723e-01 8.042034e-01
## ABAT 7.600985e-01 1.692698e-01
## ABCC1 2.599319e-01 5.952884e-01
## ABCC5 4.788846e-01 8.002823e-01
## ABCC8 6.890250e-01 8.829674e-01
## ABHD6 7.587176e-01 4.219903e-01
## ABI1 4.815032e-01 2.669029e-01
## ABL1 6.685222e-01 5.729940e-01
## AC142381.1 9.999658e-01 1.000000e+00
## ACAA1 4.139803e-01 4.917295e-01
## ACADS 6.352493e-01 3.370438e-01
## ACAT1 6.031193e-02 7.938979e-01
## ACD 9.487000e-01 8.196343e-01
## ACOT6 5.149489e-01 7.837917e-02
## ACOT8 5.532113e-01 2.374103e-01
## ACOX2 6.549615e-01 7.752572e-01
## [ reached getOption("max.print") -- omitted 2397 rows ]
ActivePathways(scores, gmt.file) # results on the dataset with separate columns for 3' UTR, 5'UTR & Promoter
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
ActivePathways(scores2, gmt.file) # results on the dataset with a single 'noncoding' column for the merged p-value of 3' UTR, 5'UTR & Promoter, reducing the impact of noncoding mutations
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 0.0003694968 358
## 2: REAC:422475 Axon guidance 0.0051873493 555
## 3: REAC:177929 Signaling by EGFR 0.0018584534 366
## 4: REAC:2559583 Cellular Senescence 0.0005822291 196
## overlap evidence
## 1: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,RPS6KA3,CALM2,... combined
## 3: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## 4: TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,... CDS
## Genes_CDS Genes_non_coding
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 4: TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,... NA
## [ reached getOption("max.print") -- omitted 26 rows ]
In addition to removing any genes not found in the background, ActivePathways
also removes any genes that have a p-value higher than the cutoff
parameter after merging the scores. This threshold represents the maximum p-value for a gene to be considered of interest in our analysis. The cutoff
is 0.1
by default.
To avoid inflated p-values that arise from multiple testing, p-values are adjusted after running enrichment analysis on each pathway. ActivePathways
uses the base R p.adjust
function to adjust p-values. Thus, all statistical methods performed by p.adjust
can be used in ActivePathways
by altering the correction.method
parameter. By default, the holm
correction method is used. Setting correction.method='none'
will instruct the software to avoid p-value adjustment.
Here are the results from a basic ActivePathways analysis.
res <- ActivePathways(scores, gmt.file)
res
## term.id term.name adjusted.p.val term.size
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 2.028966e-02 555
## 3: REAC:177929 Signaling by EGFR 6.245734e-04 366
## overlap evidence
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## 3: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## 3: NA NA
## Genes_CDS
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## Genes_promCore
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 17 rows ]
The term.id
, term.name
, and term.size
columns give the identifier, name and number of genes in each pathway. The adjusted.p.val
column gives the adjusted p-value for each pathway indicating the confidence that the pathway is enriched. If the return.all
option is FALSE
, all of these p-values will be smaller than the significance threshold, 0.05, and the table contains only the pathways that were found to be significantly enriched.
The overlap
column gives the intersection between the genes annotated to the term and the subset of genes in the merged gene list that returned this p-value.
res$overlap[1:3]
## [[1]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "B2M"
## [8] "CALM2" "CDKN1A" "CDKN1B"
##
## [[2]]
## [1] "PIK3CA" "KRAS" "BRAF" "NRAS" "CALM2" "RPS6KA3" "ACTB"
## [8] "EFNA1" "SCN2B" "EPHA2" "GAP43" "COL4A1" "RASAL2" "CLTC"
## [15] "IQGAP1" "NF1" "FGF9" "ADAM10" "PTPRC" "ITGA10" "PDGFB"
## [22] "COL4A2" "RGMB" "RASA1" "FGF6" "CALM1" "PSMB7" "PPP2R5D"
## [29] "PPP2R1A"
##
## [[3]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "CALM2" "CDKN1A"
## [9] "CDKN1B"
The overlaps are in the same order as the merged gene list, meaning that more significant genes from scores
are ranked higher, allowing the user to return to the gene space and identify significant genes in each pathway.
The evidence
column gives the names of columns (sources of evidence such as omics datasets) in scores
in which a given pathway is found to be enriched. For example, the majority of the pathways in this example have ‘CDS’ as the only evidence, meaning that the pathway is enriched in this column (and in the combined data). Pathway REAC:422475
has as evidence list('X3UTR', 'promCore')
, which means that the pathway is found to be enriched when considering the X3UTR
column, the promCore
column, and the combined data. If a pathway is found to be enriched only in the combined data and not in any individual column, ‘combined’ will be listed as the evidence. If a pathway is not found to be enriched in any individual column or in the combined data (only possible if return.all==TRUE
), then the evidence will be ‘none’.
The rest of the columns are named as Genes_{column}
, and represent the gene overlaps of each individual column if a pathway is found to be enriched using the p-values from that column. Otherwise, the column is NA
.
The results are returned as a data.table
, but since the overlap column is a list, it can be a challenge to write the table to a file. The usual write.table
and write.csv
functions will struggle with this column unless the user manually transforms the list into a string. Fortunately, data.table
includes a function fwrite
to handle such complex structures.
result.file <- paste(system.file('extdata', package='ActivePathways'), 'example_write.csv', sep='/')
data.table::fwrite(res, result.file)
term.id | term.name | adjusted.p.val | term.size |
---|---|---|---|
REAC:2424491 | DAP12 signaling | 0.0000449 | 358 |
REAC:422475 | Axon guidance | 0.0202897 | 555 |
REAC:177929 | Signaling by EGFR | 0.0006246 | 366 |
The Cytoscape program, combined with the EnrichmentMap app, provides a powerful tool to visualize pathway enrichment analysis as a network. ActivePathways
can help speed up this process by automatically writing the files that can be uploaded to Cytoscape to create a network. To create these files, a file directory must be supplied to ActivePathways
using the parameter cytoscape.file.dir
. If the user plans on reanalyzing the same data with changes to the parameters, the parameter reanalyze
can be set to TRUE
, and subdirectories of the form Version1A
, Version1B
, etc. will be created in the cytoscape.file.dir
to ensure that each version of the analysis is stored.
res <- ActivePathways(scores, gmt.file, cytoscape.file.dir = system.file('extdata', package='ActivePathways'))
Four files are written:
pathways.txt
contains the list of significant pathways and the associated p-values. Note that only terms with p.val <= significant
are written, even if the option return.all=TRUE
is specified.
subgroups.txt
contains a matrix indicating the columns of the original scores matrix which contributed significant pathways. A 1 indicates that the pathway is significant if ActivePathways
is run on the p-values in that column.
pathways.gmt
contains a shortened version of the supplied GMT file which consists of only the significant pathways. This file can accelerate the process of creating the network using Cytoscape.
legend.pdf
a pdf file of a legend which can be used as a reference to the generated Cytoscape network.
files <- paste(system.file('extdata', package='ActivePathways'),
c('pathways.txt',
'subgroups.txt',
'pathways.gmt',
'legend.pdf'),
sep='/')
cat(paste(readLines(files[1])[1:5], collapse='\n'))
## term.id term.name adjusted.p.val
## REAC:2424491 DAP12 signaling 4.49126833230489e-05
## REAC:422475 Axon guidance 0.0202896586319552
## REAC:177929 Signaling by EGFR 0.000624573372270557
## REAC:2559583 Cellular Senescence 6.59544666946083e-05
cat(paste(readLines(files[2])[1:5], collapse='\n'))
## term.id CDS X3UTR promCore combined instruct
## REAC:2424491 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:422475 0 1 1 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:177929 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:2559583 1 0 0 0 piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
cat(paste(readLines(files[3])[1], collapse='\n'))
## REAC:2424491 DAP12 signaling IL17RD PSMC1 PDGFRB PSMD14 TNRC6C CD80 DUSP10 SPTBN4 PIP5K1B NRG1 TNRC6A FGF22 ADCY5 CHUK PSME2 CUL3 NRG2 PSMB8 PSMA2 PSMB3 PSMD13 AC010132.3 PIK3CB FGF18 PRKCE FN1 SHC2 RBX1 PSMD5 FGF19 NF1 ARRB1 FGFR1 PRKACG FGB PSME1 VAV1 PSMD3 ITGA2B HBEGF ERBB3 FGF3 KSR2 PPP2R5C VAV3 PPP2CB IL5RA PSMB9 CDKN1B RASA2 PPP2R5E PIK3AP1 PSMF1 FOXO4 PSMC3 PSMA6 AL358075.4 FGF2 ICOS IL2RB BTC ADCY2 RAPGEF2 CDKN1A DUSP6 ERBB4 PTK2 MET FGA TSC2 NRAS SPTB SPRED3 TNRC6B FOXO3 GSK3A FGF1 IL3RA PIP5K1A IL2RA GFRA4 PSMD1 TEK RICTOR PSMD10 CASP9 CAMK2G RASAL3 PPP2R1A PSMB6 FGF6 IL3 MAPKAP1 SYK CNKSR1 RASGRF2 GRIN2C PSMB7 CD86 ADCY3 TLN1 PSMC4 HRAS MLST8 VAV2 LAMTOR2 THEM4 DUSP7 RASGRF1 CALM3 RAP1B PSMC2 KSR1 RAC1 ADCY1 JAK1 SPRED2 PDE1A FGF10 PSME4 EGF PDE1B GRIN2A IRS2 VWF PIK3CD FGFR4 PHB AKT1 IQGAP1 PTPRA PSMB1 PRKAR1B KL PRKCD PSMD9 PSMB2 EGFR MAP2K2 PRKAR2B KRAS CAMK2D SRC PIK3CA NRTN IL2 CAMK2B KIT CSF2RA CSK UBC SPTBN1 RASA4 CD19 DUSP2 LAMTOR3 ADCY4 PHLPP2 ARAF PSMD2 PDGFB PSMA8 FGFR3 NEFL TRAT1 MIR26A2 FGF23 AGO3 PTPN11 PSMB10 GAB1 MAPK1 TREM2 PRKAR2A LCK ADCY6 SPRED1 MIR26A1 SPTAN1 PSMB11 PDPK1 ITPR3 KBTBD7 RET PSMD8 FGG GFRA1 AHCYL1 FOXO1 PIK3R1 RASGEF1A JAK2 B2M FGF20 PIK3R2 FGF4 RASAL1 FGF7 PIP4K2C PDGFA PRR5 TYROBP EREG PPP2R5B GRAP2 SOS1 DUSP8 PRKACA RASA1 PSMA5 DUSP4 PRKCA ADCY7 CSF2 DUSP16 PHLPP1 CAMK4 KLB GDNF AKT2 CREB1 PPP2CA FGF8 FGF9 ITPR2 BAD PAQR3 SYNGAP1 APBB1IP SEM1 RPS6KB2 AKT1S1 PPP5C PLCG1 PSMA7 SHC1 ARTN PRKCG PSMA3 KITLG GRK2 AKAP9 ANGPT1 FGF17 MTOR PDE1C GRIN2B NR4A1 ITGB3 PSMC5 AGO1 KLRD1 BRAF AKT3 PSMD4 PIP4K2A TRIB3 RASA3 MDM2 PSMA4 SPTBN2 DAB2IP DUSP5 PSMB4 PSMA1 MAPK3 KLRK1 GSK3B PDGFRA CAMK2A VCL GFRA2 ITPR1 KLRC2 LCP2 ERBB2 GRIN1 GRB2 LAT RAF1 SPTA1 RANBP9 PIP4K2B FRS3 INS-IGF2 RASGRP3 SHC3 IER3 INSR BRAP PEBP1 CD28 GRIN2D FYN YWHAB IL5 AGO4 UBB FGF16 IL2RG FGF5 PSMC6 IRS1 PPP2R1B ARRB2 MARK3 BTK PLCG2 RAP1A PEA15 PSMD6 ADCY8 JAK3 SPTBN5 ACTN2 PPP2R5A MAP3K11 DUSP1 DUSP9 RPS27A CSF2RB HLA-E AL672043.1 RASAL2 CALM1 PIK3R3 FRS2 PRKAR1A ADCY9 FGFR2 DLG4 PPP2R5D CNKSR2 NCAM1 RASGRP4 PSMD12 PSMB5 TP53 WDR83 NRG4 PRKACB MAP2K1 MOV10 PIP5K1C NRG3 CALM2 PTEN UBA52 PSMD7 RNASE1 PSPN GFRA3 PSMD11 INS RASGRP1 AGO2 HGF PSME3
To learn how to use these files to create an enrichment map, take a look at the enrichmentMap-vignette
included in this package.