gsEasy

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

gsEasy

Daniel Greene

2024-02-20

Calculate p-values for enrichment of set

gsEasy has a function gset for calculating p-values of enrichment for sets (of genes) in ranked/scored lists (of genes) by permutation (see ‘Gene Set Enrichment Analysis’ described by Subramanian et al, 2005). The arguments of gset are named as in the paper:

N: the total number of genes,
S: integer vector giving the ranks of the genes in the test set amongst the N, or giving the indices within the scores vector r (see below) or a character vector of the names of genes in the test set,
r: (optional) vector of length N of correlation scores, e.g. gene expression correlation. If unspecified, it defaults to 1-(i-1)/N for the ith gene. If S is given as the names of genes, r must either be a character vector of genes in rank order or named by genes (necessarily containing all the genes in S).
p: a numeric value used to exponentiate the enrichment scores given by r, with higher values having the effect of increasing the weight on the highest scores/ranks (for more details, see Subramanian et al, 2005). The default value is 1 [i.e. not transformed].

Say we had a set of 5 genes which appeared at the top five ranks out of 1000 (i.e. highly enriched at the high ranks!). We could then calculate an enrichment p-value using the command:

gset(S=1:5, N=1000)

## [1] 9.9999e-06

So the p-value is close to zero. However for random sets, the p-values are distributed uniformly:

replicate(n=10, expr=gset(S=sample.int(n=1000, size=5), N=1000))

##  [1] 0.7213930 0.1111111 0.1058394 0.5870647 0.1142857 0.9850746 0.9701493
##  [8] 0.1840796 0.5074627 0.8905473

Alternatively, you can pass the names of genes as S with a sorted list of gene names as r (in which case the scores default to the ranks in the list), or a numeric vector of scores named by genes as r.

gset(S=c("gene 1", "gene 5", "gene 40"), r=paste("gene", 1:100))

## [1] 0.07792208

Multiple gene sets can thus be tested for enrichment with a single call to a high level function such as sapply (or, if you have many sets to test and multiple cores available, mclapply), for instance:

gene_sets <- c(list(1:5), replicate(n=10, simplify=FALSE, expr=sample.int(n=1000, size=5)))
names(gene_sets) <- c("enriched set", paste("unenriched set", 1:10))
gene_sets

## $`enriched set`
## [1] 1 2 3 4 5
## 
## $`unenriched set 1`
## [1]   6  53 974 143 936
## 
## $`unenriched set 2`
## [1] 438 237 252 458 293
## 
## $`unenriched set 3`
## [1] 624 680 778 230 727
## 
## $`unenriched set 4`
## [1] 997 254 714 307 513
## 
## $`unenriched set 5`
## [1] 104 786 263 857 632
## 
## $`unenriched set 6`
## [1] 149 970 961  73 265
## 
## $`unenriched set 7`
## [1] 435 138 982 387 722
## 
## $`unenriched set 8`
## [1] 667 441 506 186 913
## 
## $`unenriched set 9`
## [1] 791 911 638 468 274
## 
## $`unenriched set 10`
## [1] 758 626 817 698  45

sapply(gene_sets, function(set) gset(S=set, N=1000))

##      enriched set  unenriched set 1  unenriched set 2  unenriched set 3 
##      0.0000099999      0.0017199828      0.2388059701      0.9353233831 
##  unenriched set 4  unenriched set 5  unenriched set 6  unenriched set 7 
##      0.7562189055      0.5024875622      0.0247097529      0.4228855721 
##  unenriched set 8  unenriched set 9 unenriched set 10 
##      0.7462686567      0.9054726368      0.5124378109

Ontological annotations

gsEasy has a function get_ontological_gene_sets for creating lists of gene sets corresponding to annotation with ontological terms such that ontological is-a relations are propagated. get_ontological_gene_sets accepts an ontological_index (see the R package ontologyIndex for more details) argument and two character vectors, corresponding to genes and terms respectively, whereby the n-th element in each vector corresponds to one annotation pair. The result, a list of character vectors of gene names, can then be used as an argument of gset.

library(ontologyIndex)
data(hpo)
df <- data.frame(
    gene=c("gene 1", "gene 2"), 
    term=c("HP:0000598", "HP:0000118"), 
    name=hpo$name[c("HP:0000598", "HP:0000118")], 
    stringsAsFactors=FALSE,
    row.names=NULL)
df

##     gene       term                   name
## 1 gene 1 HP:0000598 Abnormality of the ear
## 2 gene 2 HP:0000118 Phenotypic abnormality

get_ontological_gene_sets(hpo, gene=df$gene, term=df$term)

## $`HP:0000001`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000118`
## [1] "gene 1" "gene 2"
## 
## $`HP:0000598`
## [1] "gene 1"

Gene Ontology (GO) annotations

gsEasy comes with a list of GO annotations, GO_gene_sets [based on annotations downloaded from geneontology.org on 07/08/2016], which can be loaded with data. This comprises a list of all gene sets (i.e. character vectors of gene names) associated with each GO term, for GO terms being annotated with at most 500 genes.

data(GO_gene_sets)
GO_gene_sets[1:6]

## $`GO:0000002`
##  [1] "AKT3"     "LONP1"    "MEF2A"    "MGME1"    "MPV17"    "MRPL17"  
##  [7] "MRPL39"   "OPA1"     "PIF1"     "SLC25A33" "SLC25A36" "SLC25A4" 
## [13] "TYMP"    
## 
## $`GO:0000003`
##  [1] "EIF4H"  "IL12B"  "LEP"    "LEPR"   "MMP23A" "RHOXF1" "SEPP1"  "STAT3" 
##  [9] "TNP1"   "VGF"    "WDR43" 
## 
## $`GO:0000009`
## [1] "ALG12"
## 
## $`GO:0000010`
## [1] "PDSS1" "PDSS2"
## 
## $`GO:0000011`
## [1] "RBSN"
## 
## $`GO:0000012`
##  [1] "APLF"   "APTX"   "E9PQ18" "LIG4"   "M0R2N6" "Q6ZNB5" "SIRT1"  "TDP1"  
##  [9] "TNP1"   "XRCC1"

It also has a function get_GO_gene_sets which is a specialisation of get_ontological_gene_sets for the Gene Ontology (GO) which can be called passing just a file path to the annotation file (official up-to-date version available at https://geneontology.org/).

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.