ontologySimilarity uses ontologyIndex’s ontology_index classed objects to access information about a given ontology, and character vectors of term IDs to refer to terms. It has functions for calculating:
- similarity between terms, get_term_sim_mat
- similarity between term sets, get_sim_mat and get_sim_grid
- similarity of a term set to a profile, get_sim_to_profile (p-value estimated by permutation of the terms included in the term set)
- group similarity of term sets, get_sim
- p-value of group similarity, get_sim_p (again, p-value estimated by permutation)
To try out the package, we first load the ontologyIndex package to get an example ontology_index object, the Human Phenotype Ontology, hpo.
suppressPackageStartupMessages(library(ontologyIndex))
suppressPackageStartupMessages(library(ontologySimilarity))
data(hpo)
set.seed(1)
Next, choose a random vector of term IDs to be the global set of terms (we’ll sample term sets from it later). To get the between-term similarities, we need to assign an information content to each term. Ordinarily, this might be based on frequency of annotation in a database (for example, the proportion of diseases annotated with a particular term), or on some estimate of population frequency. Here we use an information content based on each term having a hypothetical frequency of 1/length(terms). Once we have the global set of terms, terms, and information_content, we can compute a matrix of similarities between terms, tsm.
#random set of terms with ancestors
terms <- get_ancestors(hpo, sample(hpo$id, size=30))
#set information content of terms (as if each term occurs with frequency `1/n`)
information_content <- get_term_info_content(hpo, term_sets=as.list(terms))
#similarity of term pairs
tsm <- get_term_sim_mat(hpo, information_content)
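To make the two steps above concrete, here is a minimal base-R sketch on a made-up five-term ontology. It assumes that information content is the negative log of a term's frequency (counting a term as present whenever it or one of its descendants is annotated) and that term similarity follows Lin's measure; whether these match the package's defaults is an assumption, and the names parents, ancestors_of and lin_sim are hypothetical.

```r
# Toy ontology (hypothetical): each term lists its direct parents.
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = "A",
  A2   = "A"
)

# All ancestors of a term, including the term itself.
ancestors_of <- function(term) {
  unique(c(term, unlist(lapply(parents[[term]], ancestors_of))))
}

# Treat every term as one annotation, so each has raw frequency 1/n.
# A term counts as present whenever it or any of its descendants is annotated.
term_ids <- names(parents)
closures <- lapply(term_ids, ancestors_of)
frequency <- sapply(term_ids, function(id) {
  mean(sapply(closures, function(s) id %in% s))
})

# Assumed definition: information content = -log(frequency).
info_content <- -log(frequency)

# Lin similarity (assumed measure): twice the information content of the most
# informative common ancestor, divided by the summed information content.
# Note: this is undefined (0/0) when both terms have zero information content.
lin_sim <- function(a, b) {
  common <- intersect(ancestors_of(a), ancestors_of(b))
  unname(2 * max(info_content[common]) / (info_content[a] + info_content[b]))
}

lin_sim("A1", "A2")  # siblings sharing the informative ancestor A
lin_sim("A1", "B")   # only the root in common, so similarity is 0
```

Under these assumptions, the two siblings score above zero because their shared ancestor A is rarer than the root, while A1 and B share only the root and score zero.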
To try the functions for calculating similarity between term sets, we need to sample some term sets. We’ll sample 5 random term sets (call them phenotypes) of at most 8 terms each from terms, removing redundant terms with the minimal_set function from the ontologyIndex package.
phenotypes <- replicate(simplify=FALSE, n=5, expr=minimal_set(hpo, sample(terms, size=8)))
Similarity matrix, containing between term-set similarities:
sim_mat <- get_sim_mat(tsm, phenotypes)
sim_mat
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.4794360 0.5152054 0.5312025 0.5078293
## [2,] 0.4794360 1.0000000 0.6169854 0.4169368 0.6305535
## [3,] 0.5152054 0.6169854 1.0000000 0.3850290 0.7020807
## [4,] 0.5312025 0.4169368 0.3850290 1.0000000 0.2636186
## [5,] 0.5078293 0.6305535 0.7020807 0.2636186 1.0000000
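One common way to turn a term-by-term similarity matrix into a similarity between two term sets is the best-match average. The sketch below illustrates that idea in base R on a hypothetical 4-term matrix; it is not taken from the package source, and the names tsm_toy and best_match_average are made up for illustration.

```r
# Hypothetical term-by-term similarity matrix (symmetric, 1 on the diagonal).
toy_ids <- c("t1", "t2", "t3", "t4")
tsm_toy <- matrix(
  c(1.0, 0.8, 0.1, 0.0,
    0.8, 1.0, 0.2, 0.0,
    0.1, 0.2, 1.0, 0.5,
    0.0, 0.0, 0.5, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(toy_ids, toy_ids)
)

# Best-match average: for each term in one set take its best match in the
# other set, average those scores, do the same in the opposite direction,
# then average the two directions so the result is symmetric.
best_match_average <- function(sim, set_a, set_b) {
  block <- sim[set_a, set_b, drop = FALSE]
  a_to_b <- mean(apply(block, 1, max))
  b_to_a <- mean(apply(block, 2, max))
  (a_to_b + b_to_a) / 2
}

best_match_average(tsm_toy, c("t1", "t3"), c("t2", "t4"))  # 0.65
best_match_average(tsm_toy, c("t1", "t3"), c("t1", "t3"))  # 1
```

A measure of this kind is symmetric and gives 1 for a set compared with itself, which is at least consistent with the shape of the matrix printed above.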
Group similarity of phenotypes 1-3:
get_sim(sim_mat, 1:3)
## [1] 0.6914726
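The value above can be reproduced by hand from the printed matrix: it equals the mean of all entries in the 3-by-3 block for phenotypes 1-3, diagonal included. The base-R check below copies the rounded values from the output; that get_sim computes exactly this is inferred from the numbers, not from the package source.

```r
# Rows and columns 1-3 of the similarity matrix printed above.
sub_mat <- matrix(
  c(1.0000000, 0.4794360, 0.5152054,
    0.4794360, 1.0000000, 0.6169854,
    0.5152054, 0.6169854, 1.0000000),
  nrow = 3, byrow = TRUE
)

# Mean over the whole block, diagonal included.
group_sim_with_diag <- mean(sub_mat)
group_sim_with_diag  # prints 0.6914726

# For comparison: mean over distinct pairs only (diagonal excluded).
group_sim_pairs_only <- mean(sub_mat[upper.tri(sub_mat)])
```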
Group similarity p-value of group of phenotypes 1-3:
get_sim_p(sim_mat, 1:3)
## [1] 0.393
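A permutation p-value of this kind asks how often a randomly chosen group of the same size is at least as similar as the observed group. With only 5 phenotypes there are just 10 possible groups of 3, so the sketch below enumerates them all instead of sampling. It assumes group similarity is the mean of the sub-matrix (diagonal included), reuses the rounded values printed earlier, and uses the hypothetical names sim_mat_copy and group_sim.

```r
# Similarity matrix copied from the output above (rounded values).
sim_mat_copy <- matrix(
  c(1.0000000, 0.4794360, 0.5152054, 0.5312025, 0.5078293,
    0.4794360, 1.0000000, 0.6169854, 0.4169368, 0.6305535,
    0.5152054, 0.6169854, 1.0000000, 0.3850290, 0.7020807,
    0.5312025, 0.4169368, 0.3850290, 1.0000000, 0.2636186,
    0.5078293, 0.6305535, 0.7020807, 0.2636186, 1.0000000),
  nrow = 5, byrow = TRUE
)

# Assumed group similarity: mean of the sub-matrix, diagonal included.
group_sim <- function(m, idx) mean(m[idx, idx])

observed <- group_sim(sim_mat_copy, 1:3)

# Enumerate every group of 3 phenotypes and score each one.
all_groups <- combn(nrow(sim_mat_copy), 3, simplify = FALSE)
null_sims <- sapply(all_groups, function(idx) group_sim(sim_mat_copy, idx))

# Proportion of groups at least as similar as the observed one.
p_exhaustive <- mean(null_sims >= observed)
p_exhaustive  # 0.4
```

Four of the ten groups (including the observed one) score at least as high, giving 0.4, which is close to the sampled estimate of 0.393 reported above.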
Similarity p-value of phenotype 2 to phenotype 1 (taking phenotype 1 as the ‘profile’):
get_sim_to_profile_p(tsm, phenotypes[[1]], phenotypes[[2]])
## [1] 0.189