AbsFilterGSEA provides efficient gene-permuting GSEA methods for small-replicate RNA-seq data
Gene-set enrichment analysis (GSEA) is widely used to assess the enrichment of differential signal in a pre-defined gene-set without applying a cutoff threshold for differential expression. The significance of enrichment is evaluated by a sample- or gene-permutation method. Although the sample-permutation approach is highly recommended for its good false positive control, the gene-permuting method is the only choice for small-replicate data. However, gene-permuting GSEA (or preranked GSEA) generates many false positive gene-sets because of the inter-gene correlation within each gene-set. These false positives can be effectively reduced by filtering with the one-tailed absolute GSEA results. This package provides a function that performs gene-permuting GSEA with or without the absolute filtering. One-tailed absolute GSEA is also provided.
The AbsFilterGSEA package contains a single function, GenePermGSEA, which performs gene-permuting GSEA methods for four different gene statistics. Here, we give example scripts as a quick introduction.
The GenePermGSEA function accepts a gene expression matrix composed of raw read counts. Already normalized counts or microarray data are also accepted with the option normalization = 'AlreadyNormalized'. Here, the example dataset, example, contains raw read counts generated from a negative binomial distribution using the rnbinom function. It contains 10000 genes, and the case and control groups each consist of three replicates.
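For readers who want to build a similar toy input, the following is a minimal sketch of simulating a count matrix with rnbinom. It is not the script used to create the packaged example data: the matrix size, the gene-wise means, the dispersion (size = 5) and the object name simCounts are all assumptions chosen for illustration.

```r
# Hypothetical simulation of a small raw-count matrix (illustrative parameters only)
set.seed(1)
nGenes = 1000
nSamples = 6

# Assumed gene-wise mean expression levels with a log-normal spread
geneMeans = round(exp(rnorm(nGenes, mean = 6, sd = 1.5)))

# rnbinom(n, size, mu): 'size' is the dispersion parameter, 'mu' the mean.
# rep(geneMeans, nSamples) gives every column the same gene-wise means.
simCounts = matrix(rnbinom(nGenes * nSamples, size = 5, mu = rep(geneMeans, nSamples)),
                   nrow = nGenes, ncol = nSamples)
rownames(simCounts) = paste("Gene", 1:nGenes, sep = "")
colnames(simCounts) = c(paste("groupA", 1:3, sep = ""), paste("groupB", 1:3, sep = ""))
head(simCounts)
```

The resulting matrix has genes in rows and samples in columns, which is the same layout as the example dataset shown next.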
library(AbsFilterGSEA, quietly = TRUE)
data(example)
head(example)
## groupA1 groupA2 groupA3 groupB1 groupB2 groupB3
## Gene1 5642 900 236 1239 20673 2699
## Gene2 2401 1264 1464 2413 2456 1969
## Gene3 611 352 364 548 651 332
## Gene4 140 50 111 97 111 80
## Gene5 2224 915 2129 1930 2720 1757
## Gene6 5987 2959 6094 6700 8630 4974
Next, a gene-set file should be prepared. It must be tab-delimited. For example, a .gmt file from the Molecular Signatures Database (MSigDB, http://software.broadinstitute.org/gsea/msigdb) is directly applicable. Here, an example gene-set file (geneset.txt) is generated and stored in your working directory. Each gene-set contains 100 non-overlapping genes, and the inter-gene correlation of each gene-set varies from 5% to 60%. The 41st to 50th gene-sets are set as differentially expressed gene-sets.
# Example gene-set generation: 50 gene-sets, each having 100 genes.
# Geneset_41 ~ Geneset_50 are differentially expressed and the others are not.
# The file is stored in your working directory under the name 'geneset.txt'.
unlink("geneset.txt") # Remove an old copy so that re-running does not append duplicates.
for(i in 1:50)
{
  GenesetName = paste("Geneset", i, sep = "_")
  # Member genes Gene(100*i-99) ... Gene(100*i), joined by tabs
  Genes = paste("Gene", (i*100-99):(i*100), sep="", collapse = '\t')
  GenesetLine = paste(GenesetName, Genes, sep = '\t')
  write(GenesetLine, file = "geneset.txt", append = TRUE, ncolumns = 1)
}
readLines('geneset.txt', 2) # Show two gene-sets.
## [1] "Geneset_1\tGene1\tGene2\tGene3\tGene4\tGene5\tGene6\tGene7\tGene8\tGene9\tGene10\tGene11\tGene12\tGene13\tGene14\tGene15\tGene16\tGene17\tGene18\tGene19\tGene20\tGene21\tGene22\tGene23\tGene24\tGene25\tGene26\tGene27\tGene28\tGene29\tGene30\tGene31\tGene32\tGene33\tGene34\tGene35\tGene36\tGene37\tGene38\tGene39\tGene40\tGene41\tGene42\tGene43\tGene44\tGene45\tGene46\tGene47\tGene48\tGene49\tGene50\tGene51\tGene52\tGene53\tGene54\tGene55\tGene56\tGene57\tGene58\tGene59\tGene60\tGene61\tGene62\tGene63\tGene64\tGene65\tGene66\tGene67\tGene68\tGene69\tGene70\tGene71\tGene72\tGene73\tGene74\tGene75\tGene76\tGene77\tGene78\tGene79\tGene80\tGene81\tGene82\tGene83\tGene84\tGene85\tGene86\tGene87\tGene88\tGene89\tGene90\tGene91\tGene92\tGene93\tGene94\tGene95\tGene96\tGene97\tGene98\tGene99\tGene100"
## [2] "Geneset_2\tGene101\tGene102\tGene103\tGene104\tGene105\tGene106\tGene107\tGene108\tGene109\tGene110\tGene111\tGene112\tGene113\tGene114\tGene115\tGene116\tGene117\tGene118\tGene119\tGene120\tGene121\tGene122\tGene123\tGene124\tGene125\tGene126\tGene127\tGene128\tGene129\tGene130\tGene131\tGene132\tGene133\tGene134\tGene135\tGene136\tGene137\tGene138\tGene139\tGene140\tGene141\tGene142\tGene143\tGene144\tGene145\tGene146\tGene147\tGene148\tGene149\tGene150\tGene151\tGene152\tGene153\tGene154\tGene155\tGene156\tGene157\tGene158\tGene159\tGene160\tGene161\tGene162\tGene163\tGene164\tGene165\tGene166\tGene167\tGene168\tGene169\tGene170\tGene171\tGene172\tGene173\tGene174\tGene175\tGene176\tGene177\tGene178\tGene179\tGene180\tGene181\tGene182\tGene183\tGene184\tGene185\tGene186\tGene187\tGene188\tGene189\tGene190\tGene191\tGene192\tGene193\tGene194\tGene195\tGene196\tGene197\tGene198\tGene199\tGene200"
The example gene-set file looks like the output above: each line contains a gene-set name followed by its member genes, all tab-delimited.
Now you can run gene-permuting GSEA using the GenePermGSEA function, as shown in the code below.
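To make the file layout concrete, here is a small sketch that reads a file in this layout into a named list using base R only. The temporary file, the toy gene-set names and the object names are assumptions for illustration; GenePermGSEA reads the file itself through its GenesetFile argument, so this step is not required. Note that an MSigDB .gmt file carries a description in its second field, which this sketch would treat as a gene.

```r
# Write two tiny gene-sets to a temporary file (toy data, not the vignette's geneset.txt)
gmtFile = tempfile(fileext = ".txt")
writeLines(c("SetA\tGene1\tGene2\tGene3", "SetB\tGene4\tGene5"), gmtFile)

# Split each line on tabs: the first field is the name, the rest are member genes
fields = strsplit(readLines(gmtFile), "\t", fixed = TRUE)
genesets = lapply(fields, function(x) x[-1])
names(genesets) = sapply(fields, function(x) x[1])

sapply(genesets, length) # Number of member genes per gene-set
```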
# If you want to perform absolute filtering GSEA...
res = GenePermGSEA(countMatrix = example, GeneScoreType = 'moderated_t', idxCase=1:3, idxControl = 4:6, GenesetFile = './geneset.txt', normalization = 'DESeq', GSEAtype = 'absFilter', minCount = 3, FDR = 0.05)
res
## GenesetName Size NES Nominal.P.value FDR.Q.value Direction
## 12 Geneset_15 98 -2.542121 0.00000 0.00000 DOWN
## 33 Geneset_43 97 2.205307 0.00000 0.00000 UP
## 35 Geneset_45 98 2.435747 0.00000 0.00000 UP
## 36 Geneset_46 99 -2.659055 0.00000 0.00000 DOWN
## 40 Geneset_50 97 -2.244998 0.00000 0.00000 DOWN
## 32 Geneset_42 98 1.376518 0.03792 0.04082 UP
Once the gene scores are chosen, GSEA uses a (weighted) Kolmogorov-Smirnov (K-S) statistic to calculate the enrichment score (ES) of each pre-defined gene-set. AbsFilterGSEA provides three modes of gene-permuting GSEA: (1) original two-tailed GSEA, (2) absolute one-tailed GSEA and (3) the original GSEA filtered with the absolute GSEA results.
1. Original two-tailed GSEA (GSEAtype = ‘original’)
This is the standard GSEA method introduced in the original GSEA paper [1]. When \(S\) is a gene-set of size \(N_H\), \(N\) is the total number of genes, and \(r_i\) is the gene score of the \(i\)-th gene \(g_i\) in the list ranked by gene score, the enrichment score \(ES(S)\) is defined as the maximum deviation of \(p_{hit}-p_{miss}\) from zero, that is \[ ES(S) =
\begin{cases}
\max_i(p_{hit,i}-p_{miss,i}), & \text{if } |\max_i(p_{hit,i}-p_{miss,i})| \geq |\min_i(p_{hit,i}-p_{miss,i})|\\
\min_i(p_{hit,i}-p_{miss,i}), & \text{otherwise}
\end{cases}
\] where \[p_{hit,i}=\sum_{g_j \in S, j \leq i} \frac{|r_j|^q}{N_R}\] \[p_{miss,i}=\sum_{g_j \in S^c, j \leq i}\frac{1}{(N-N_H)}\] and \[N_R = \sum_{g_j \in S}|r_j|^q.\] Here \(q\) is the exponent that weights the gene scores (the q argument of GenePermGSEA, 1 by default).
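The definition above can be sketched in a few lines of base R. The helper enrichmentScore and the toy scores are hypothetical and written only to mirror the formulas; this is not the package's internal implementation, and it assumes that genes are ranked in decreasing order of score and that at least one gene in the set has a nonzero score.

```r
# Hypothetical helper mirroring the two-tailed ES definition (not part of AbsFilterGSEA)
enrichmentScore = function(scores, inSet, q = 1)
{
  # Rank genes by decreasing gene score
  ord = order(scores, decreasing = TRUE)
  r = scores[ord]
  hit = inSet[ord]

  N = length(r)           # Total number of genes
  NH = sum(hit)           # Gene-set size
  NR = sum(abs(r[hit])^q) # Normalizing constant N_R

  # Running sums p_hit and p_miss along the ranked list
  pHit = cumsum(ifelse(hit, abs(r)^q, 0)) / NR
  pMiss = cumsum(!hit) / (N - NH)
  dev = pHit - pMiss

  # Keep the deviation with the largest magnitude, sign included
  if(abs(max(dev)) >= abs(min(dev))) max(dev) else min(dev)
}

# Toy example: 10 genes, the three top-ranked genes form the gene-set
scores = c(3.1, 2.4, 1.9, 0.5, 0.2, -0.1, -0.4, -0.8, -1.5, -2.2)
inSet = c(TRUE, TRUE, TRUE, rep(FALSE, 7))
es = enrichmentScore(scores, inSet)
es
```

Because all members sit at the top of the ranked list in this toy case, the running sum reaches its maximum of 1 right after the third gene. Placing the same three genes at the bottom of the list would give a negative ES instead.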
2. Absolute one-tailed GSEA (GSEAtype = ‘absolute’)
It is shown in [3] and the main manuscript that the absolute gene statistic effectively reduces false positives while maintaining good statistical power for gene-permuting GSEA methods. In this approach, the absolute gene scores are used to calculate one-tailed K-S statistics that consider only the positive deviation. Thus, the enrichment score for the absolute one-tailed GSEA is simply defined as \[ES(S)=\max_i(p_{hit,i}-p_{miss,i})\]
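As a rough illustration of the one-tailed variant, the sketch below ranks genes by absolute score, takes only the positive deviation, and estimates a nominal p-value by permuting the gene scores. The helper absEnrichmentScore, the toy data and the p-value rule (fraction of permuted scores at least as large as the observed one) are assumptions for illustration, not the package's code.

```r
# Hypothetical helper for the absolute one-tailed ES (not part of AbsFilterGSEA)
absEnrichmentScore = function(scores, inSet, q = 1)
{
  a = abs(scores)
  ord = order(a, decreasing = TRUE) # Rank genes by absolute score
  r = a[ord]
  hit = inSet[ord]
  pHit = cumsum(ifelse(hit, r^q, 0)) / sum(r[hit]^q)
  pMiss = cumsum(!hit) / sum(!hit)
  max(pHit - pMiss)                 # Positive deviation only
}

# Toy gene-set whose members lie at both extremes (genes 1, 9 and 10)
scores = c(3.1, 2.4, 1.9, 0.5, 0.2, -0.1, -0.4, -0.8, -1.5, -2.2)
inSet = c(TRUE, rep(FALSE, 7), TRUE, TRUE)
observed = absEnrichmentScore(scores, inSet)

# Gene permutation: shuffle the gene scores and recompute the statistic
set.seed(1)
nullES = replicate(1000, absEnrichmentScore(sample(scores), inSet))
pValue = mean(nullES >= observed)
c(ES = observed, p = pValue)
```

The toy gene-set mixes up- and down-regulated genes, which a two-tailed score would partly cancel out. Taking absolute values places all three members near the top of the ranking.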
3. Absolute filtering GSEA (GSEAtype = ‘absFilter’)
To provide a robust GSEA result that retains the direction of regulation, we recommend the absolute filtering GSEA method. It filters the result obtained from two-tailed GSEA with that obtained from absolute one-tailed GSEA. In other words, it returns the gene-sets that are significant in both the two-tailed and the absolute one-tailed GSEA.
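The filtering step amounts to intersecting two lists of significant gene-sets. The sketch below shows the idea on two made-up result tables. The table contents, the gene-set names and the way the two cutoffs are applied (FDR to the two-tailed result, FDRfilter to the absolute result, both with <=) are assumptions for illustration and may differ from what GenePermGSEA does internally.

```r
# Made-up two-tailed result: carries the direction of regulation
twoTailed = data.frame(GenesetName = c("SetA", "SetB", "SetC", "SetD"),
                       FDR.Q.value = c(0.00, 0.00, 0.01, 0.03),
                       Direction = c("DOWN", "UP", "UP", "UP"))

# Made-up absolute one-tailed result: no direction, used only as a filter
absolute = data.frame(GenesetName = c("SetB", "SetC", "SetD", "SetA"),
                      FDR.Q.value = c(0.00, 0.02, 0.40, 0.01))

FDR = 0.05       # Assumed cutoff for the two-tailed result
FDRfilter = 0.05 # Assumed cutoff for the absolute one-tailed result

sigTwo = twoTailed[twoTailed$FDR.Q.value <= FDR, ]
sigAbs = absolute$GenesetName[absolute$FDR.Q.value <= FDRfilter]

# Keep the two-tailed rows whose gene-set is also significant in the absolute test
filtered = sigTwo[sigTwo$GenesetName %in% sigAbs, ]
filtered
```

In this toy case SetD passes the two-tailed cutoff but fails the absolute one, so it is dropped. The surviving rows keep the Direction column from the two-tailed result.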
There are 14 options in the GenePermGSEA function.
head(GenePermGSEA, 4)
##
## 1 function (countMatrix, GeneScoreType, idxCase, idxControl, GenesetFile,
## 2 normalization, minGenesetSize = 10, maxGenesetSize = 300,
## 3 q = 1, nPerm = 1000, GSEAtype = "absFilter", FDR = 0.05,
## 4 FDRfilter = 0.05, minCount = 3)
1. countMatrix: Input gene expression matrix. Both microarray and (normalized) RNA-seq count data can be used.
[1] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15545-15550.
[2] Wu D, Smyth GK: Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Research 2012, 40(17).
[3] Nam D: Effect of the absolute statistic on gene-sampling gene-set analysis methods. Statistical Methods in Medical Research 2015.
[4] Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology 2010, 11(10).