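ontologySimilarity uses ontologyIndex’s ontology_index classed objects to access information about a given ontology, and character vectors of term IDs to refer to terms. It has functions for calculating similarities between terms, similarities between term sets, group similarities, and similarity p-values.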

Set up

To try out the package, we first load the ontologyIndex package to get an example ontology_index object, the Human Phenotype Ontology, hpo.

suppressPackageStartupMessages(library(ontologyIndex))
suppressPackageStartupMessages(library(ontologySimilarity))
data(hpo)
set.seed(1)

Next, choose a random vector of term IDs to be the global set of terms (we’ll sample term sets from it later). To get the similarities between terms, we need to assign an information content to each term. Ordinarily, this might be based on frequency of annotation in a database (for example, the proportion of diseases annotated with a particular term), or on some estimate of population frequency. Here we use an information content based on each term having a hypothetical frequency 1/length(terms). Once we have the global set of terms (terms) and their information content (information_content), we can compute a matrix of similarities between terms, tsm.

#random set of terms with ancestors
terms <- get_ancestors(hpo, sample(hpo$id, size=30))

#set information content of terms (as if each term occurs with frequency `1/n`)
information_content <- get_term_info_content(hpo, term_sets=as.list(terms))

#similarity of term pairs
tsm <- get_term_sim_mat(hpo, information_content)
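
To make the link between frequency and information content concrete, here is a minimal base R sketch. It assumes the common definition of information content as the negative log of a term's frequency; the term names and counts are made up for illustration and are not taken from hpo.

```r
# Hypothetical counts: how many of 8 annotated items carry each term,
# either directly or through a more specific descendant term.
n_items <- 8
counts <- c(root = 8, organ = 4, tissue = 2, cell = 1)

# Frequency of each term among the annotated items.
frequency <- counts / n_items

# Assumed definition: information content is -log(frequency),
# so rarer (more specific) terms carry more information.
ic <- -log(frequency)
print(ic)
```

The root term, carried by every item, gets an information content of 0, and the value grows as terms become rarer.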

To try the functions for calculating similarities between term sets, we need to sample some term sets. We’ll sample 5 random term sets (call them phenotypes) of (at most) 8 terms each from terms, removing redundant terms with the minimal_set function from the ontologyIndex package.

phenotypes <- replicate(simplify=FALSE, n=5, expr=minimal_set(hpo, sample(terms, size=8)))
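
The "at most" above comes from removing redundant terms: a term that is an ancestor of another term in the set adds nothing and is dropped. The base R sketch below illustrates that idea on a toy ontology. The ontology and the helpers strict_ancestors and drop_redundant are hypothetical names written for this example, not the ontologyIndex implementation.

```r
# Toy ontology (hypothetical): each term maps to its parent terms.
parents <- list(
  animal = character(0),
  mammal = "animal",
  cat    = "mammal",
  fish   = "animal"
)

# Collect all strict ancestors of a term by walking up the parent links.
strict_ancestors <- function(term) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    found <- union(found, queue)
    queue <- setdiff(unlist(parents[queue], use.names = FALSE), found)
  }
  found
}

# Drop every term that is an ancestor of another term in the set.
drop_redundant <- function(terms) {
  implied <- unique(unlist(lapply(terms, strict_ancestors), use.names = FALSE))
  setdiff(terms, implied)
}

reduced <- drop_redundant(c("animal", "mammal", "cat", "fish"))
print(reduced)  # "cat" "fish"
```

Four sampled terms shrink to two here, because animal and mammal are both implied by cat.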

Calculations

Similarity matrix, containing between term-set similarities:

sim_mat <- get_sim_mat(tsm, phenotypes)
sim_mat
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.4794360 0.5152054 0.5312025 0.5078293
## [2,] 0.4794360 1.0000000 0.6169854 0.4169368 0.6305535
## [3,] 0.5152054 0.6169854 1.0000000 0.3850290 0.7020807
## [4,] 0.5312025 0.4169368 0.3850290 1.0000000 0.2636186
## [5,] 0.5078293 0.6305535 0.7020807 0.2636186 1.0000000

Group similarity of phenotypes 1-3:

get_sim(sim_mat, 1:3)
## [1] 0.6914726
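
As a sanity check on this value, the printed matrix can be re-entered by hand in base R. With these numbers, the mean of the 3-by-3 sub-matrix for phenotypes 1-3 (diagonal included) reproduces 0.6914726. This is an observation about the output shown here, not a description of how get_sim is implemented; printed_sim is a hypothetical name for the re-typed matrix.

```r
# Similarity matrix re-typed from the printed output above.
printed_sim <- matrix(c(
  1.0000000, 0.4794360, 0.5152054, 0.5312025, 0.5078293,
  0.4794360, 1.0000000, 0.6169854, 0.4169368, 0.6305535,
  0.5152054, 0.6169854, 1.0000000, 0.3850290, 0.7020807,
  0.5312025, 0.4169368, 0.3850290, 1.0000000, 0.2636186,
  0.5078293, 0.6305535, 0.7020807, 0.2636186, 1.0000000
), nrow = 5, byrow = TRUE)

# Mean over all entries of the sub-matrix for phenotypes 1-3,
# including the diagonal of self-similarities.
group <- 1:3
group_mean <- mean(printed_sim[group, group])
print(round(group_mean, 7))  # 0.6914726
```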

P-value for the group similarity of phenotypes 1-3:

get_sim_p(sim_mat, 1:3)
## [1] 0.393
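
One way to read this p-value is as the proportion of equally sized groups that score at least as high as the observed group. The sketch below assumes a mean-of-sub-matrix group score and, instead of sampling random groups, enumerates all ten groups of three exhaustively in base R, using the similarity values printed earlier. It illustrates the idea only and is not the package's algorithm; printed_sim and group_score are hypothetical names.

```r
# Similarity matrix re-typed from the printed output of sim_mat.
printed_sim <- matrix(c(
  1.0000000, 0.4794360, 0.5152054, 0.5312025, 0.5078293,
  0.4794360, 1.0000000, 0.6169854, 0.4169368, 0.6305535,
  0.5152054, 0.6169854, 1.0000000, 0.3850290, 0.7020807,
  0.5312025, 0.4169368, 0.3850290, 1.0000000, 0.2636186,
  0.5078293, 0.6305535, 0.7020807, 0.2636186, 1.0000000
), nrow = 5, byrow = TRUE)

# Assumed group score: mean of the sub-matrix for the group members.
group_score <- function(members) mean(printed_sim[members, members])

# Score of the observed group, phenotypes 1-3.
observed <- group_score(1:3)

# Scores of every possible group of 3 out of the 5 phenotypes.
all_groups <- combn(5, 3, simplify = FALSE)
scores <- vapply(all_groups, group_score, numeric(1))

# Proportion of groups scoring at least as high as the observed group.
exact_p <- mean(scores >= observed)
print(exact_p)  # 0.4
```

Four of the ten possible groups score at least as high as phenotypes 1-3, giving 0.4, which is close to the value of 0.393 reported above.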

Similarity p-value of phenotype 2 to phenotype 1 (taking phenotype 1 as the ‘profile’):

get_sim_to_profile_p(tsm, phenotypes[[1]], phenotypes[[2]])
## [1] 0.189