This RMarkdown document demonstrates how key elements from the notebook for case study 3 in the EpiGraphDB paper can be achieved using the R package. For detailed explanations of the case study please refer to the paper or the case study notebook.
The biomedical literature contains a wealth of information than far exceeds our capacity for systematic manual extraction. For this reason, there are many existing literature mining methods to extract the key concepts and content. Here we use data from SemMedDB, a well established database that provides subject-predicate-object triples from all PubMed titles and abstracts. Using a subset of this data we created MELODI-presto (https://melodi-presto.mrcieu.ac.uk/), a method to assign triples to any given biomedical query via a PubMed search and some basic enrichment, and have applied this systematically to traits represented in EpiGraphDB. This allows us to identify overlapping terms connecting any set of GWAS traits, e.g. exposure and disease outcome. From here we can attempt to triangulate causal estimates, and conversely, check the mechanisms identified from the literature against the causal evidence.
This case study goes as follows:
library("magrittr")
library("dplyr")
library("purrr")
library("glue")
library("ggplot2")
library("igraph")
library("epigraphdb")
Here we set the starting trait, which we will use to explore associated disease traits.
<- "Sleep duration" STARTING_TRAIT
Given an exposure trait, find all traits with causal evidence. This method searches the causal evidence data for cases where our exposure trait has a potential casual effect on an outcome trait.
<- function(trait) {
get_mr <- "/mr"
endpoint <- list(
params exposure_trait = trait,
pval_threshold = 1e-10
)<- query_epigraphdb(route = endpoint, params = params, mode = "table")
mr_df
mr_df
}
<- get_mr(STARTING_TRAIT)
mr_df %>% glimpse()
mr_df #> Rows: 1,178
#> Columns: 10
#> $ exposure.id <chr> "ieu-a-1088", "ieu-a-1088", "ieu-a-1088", "ieu-a-1088",…
#> $ exposure.trait <chr> "Sleep duration", "Sleep duration", "Sleep duration", "…
#> $ outcome.id <chr> "met-a-353", "met-a-307", "ieu-a-99", "ieu-a-59", "ieu-…
#> $ outcome.trait <chr> "Cortisol", "Cholesterol", "Hip circumference", "Hip ci…
#> $ mr.b <dbl> 0.059176931, -0.024282343, -0.288661815, 0.072874008, 0…
#> $ mr.se <dbl> 1.385651e-03, 5.409778e-05, 5.928478e-03, 1.617032e-04,…
#> $ mr.pval <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ mr.method <chr> "FE IVW", "FE IVW", "FE IVW", "FE IVW", "FE IVW", "FE I…
#> $ mr.selection <chr> "DF", "DF", "DF", "DF", "DF", "DF", "DF", "DF", "DF", "…
#> $ mr.moescore <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
For this example, we are interested in traits mapped to a disease node. To do this we utilise the mapping from GWAS trait to Disease via EFO term.
# NOTE: this block takes a bit of time, depending on the size of `mr_df` above.
# We will not run the block here in package building,
# but readers are encouraged to try it offline!
<- function(trait) {
trait_to_disease <- "/ontology/gwas-efo-disease"
endpoint <- list(trait = trait)
params <- query_epigraphdb(route = endpoint, params = params, mode = "table")
disease_df if (nrow(disease_df) > 0) {
<- disease_df %>% pull(`disease.label`)
res else {
} <- c()
res
}
res
}
<- mr_df %>%
outcome_disease_df select(`outcome.trait`) %>%
distinct() %>%
mutate(disease = map(`outcome.trait`, trait_to_disease)) %>%
filter(map_lgl(`disease`, function(x) !is.null(x)))
outcome_disease_df
For the multiple exposure -> outcome
relationships as reported from the table above, here we look at the literature evidence for one pair in detail:
The following looks for enriched triples of information (Subject-Predicate-Object) associated with our two traits. These have been derived via PubMed searches and corresponding SemMedDB data.
<- function(gwas_id, assoc_gwas_id) {
get_gwas_pair_literature <- "/literature/gwas/pairwise"
endpoint # NOTE in this example we blacklist to semmentic types
<- list(
params gwas_id = gwas_id,
assoc_gwas_id = assoc_gwas_id,
by_gwas_id = TRUE,
pval_threshold = 1e-1,
semmantic_types = "nusq",
semmantic_types = "dsyn",
blacklist = TRUE,
limit = 1000
)<- query_epigraphdb(route = endpoint, params = params, mode = "table")
lit_df
lit_df
}
<- "ieu-a-1088"
GWAS_ID_X <- "ieu-a-6"
GWAS_ID_Y <- get_gwas_pair_literature(GWAS_ID_X, GWAS_ID_Y)
lit_df
glimpse(lit_df)
#> Rows: 1,000
#> Columns: 20
#> $ gwas.id <chr> "ieu-a-1088", "ieu-a-1088", "ieu-a-1088", "ieu-a-1088…
#> $ gwas.trait <chr> "Sleep duration", "Sleep duration", "Sleep duration",…
#> $ gs1.pval <dbl> 9.468989e-05, 2.478516e-06, 2.478516e-06, 1.854114e-0…
#> $ gs1.localCount <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ st1.name <chr> "Sulpiride", "Glycine", "Glycine", "gamma-aminobutyri…
#> $ s1.id <chr> "C0038803:INTERACTS_WITH:C0003596", "C0017890:STIMULA…
#> $ s1.subject_id <chr> "C0038803", "C0017890", "C0017890", "C0016904", "C001…
#> $ s1.object_id <chr> "C0003596", "C0033487", "C0033487", "C0033487", "C003…
#> $ s1.predicate <chr> "INTERACTS_WITH", "STIMULATES", "STIMULATES", "STIMUL…
#> $ st.name <chr> "Apomorphine", "propofol", "propofol", "propofol", "p…
#> $ st.type <list> <"orch", "phsu">, <"orch", "phsu">, <"orch", "phsu">…
#> $ s2.id <chr> "C0003596:TREATS:C0242350", "C0033487:AUGMENTS:C01517…
#> $ s2.subject_id <chr> "C0003596", "C0033487", "C0033487", "C0033487", "C003…
#> $ s2.object_id <chr> "C0242350", "C0151744", "C0010054", "C0151744", "C001…
#> $ s2.predicate <chr> "TREATS", "AUGMENTS", "TREATS", "AUGMENTS", "TREATS",…
#> $ gs2.pval <dbl> 0.045878175, 0.002682225, 0.001627351, 0.002682225, 0…
#> $ gs2.localCount <int> 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ st2.name <chr> "Erectile dysfunction", "Myocardial Ischemia", "Coron…
#> $ assoc_gwas.id <chr> "ieu-a-6", "ieu-a-6", "ieu-a-6", "ieu-a-6", "ieu-a-6"…
#> $ assoc_gwas.trait <chr> "Coronary heart disease", "Coronary heart disease", "…
# Predicate counts for SemMed triples for trait X
%>%
lit_df count(`s1.predicate`) %>%
arrange(desc(n))
#> # A tibble: 7 x 2
#> s1.predicate n
#> <chr> <int>
#> 1 INTERACTS_WITH 719
#> 2 STIMULATES 96
#> 3 INHIBITS 93
#> 4 NEG_INTERACTS_WITH 51
#> 5 higher_than 23
#> 6 COEXISTS_WITH 15
#> 7 NEG_INHIBITS 3
# Predicate counts for SemMed triples for trait Y
%>%
lit_df count(`s2.predicate`) %>%
arrange(desc(n))
#> # A tibble: 16 x 2
#> s2.predicate n
#> <chr> <int>
#> 1 TREATS 191
#> 2 STIMULATES 132
#> 3 CAUSES 120
#> 4 PREVENTS 117
#> 5 ASSOCIATED_WITH 103
#> 6 PREDISPOSES 83
#> 7 AFFECTS 71
#> 8 INTERACTS_WITH 62
#> 9 INHIBITS 47
#> 10 COEXISTS_WITH 34
#> 11 DISRUPTS 19
#> 12 NEG_PREVENTS 13
#> 13 higher_than 4
#> 14 AUGMENTS 2
#> 15 NEG_ASSOCIATED_WITH 1
#> 16 same_as 1
Sometimes it is preferable to filter the SemMedDB data, e.g. to remove less informative Predicates, such as COEXISTS_WITH and ASSOCIATED_WITH.
# Filter out some predicates that are not informative
<- c("COEXISTS_WITH", "ASSOCIATED_WITH")
pred_filter <- lit_df %>%
lit_df_filter filter(
!`s1.predicate` %in% pred_filter,
!`s2.predicate` %in% pred_filter
)%>%
lit_df_filter count(`s1.predicate`) %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> s1.predicate n
#> <chr> <int>
#> 1 INTERACTS_WITH 649
#> 2 INHIBITS 87
#> 3 STIMULATES 66
#> 4 NEG_INTERACTS_WITH 26
#> 5 higher_than 22
%>%
lit_df_filter count(`s2.predicate`) %>%
arrange(desc(n))
#> # A tibble: 14 x 2
#> s2.predicate n
#> <chr> <int>
#> 1 TREATS 180
#> 2 STIMULATES 132
#> 3 CAUSES 120
#> 4 PREVENTS 117
#> 5 PREDISPOSES 83
#> 6 AFFECTS 70
#> 7 INTERACTS_WITH 62
#> 8 INHIBITS 46
#> 9 DISRUPTS 19
#> 10 NEG_PREVENTS 13
#> 11 higher_than 4
#> 12 AUGMENTS 2
#> 13 NEG_ASSOCIATED_WITH 1
#> 14 same_as 1
If we explore the full table in lit_df_filter
, we can see lots of links between the two traits, pinned on specific overlapping terms.
We can summarise the SemMedDB semantic type and number of overlapping terms:
<- lit_df_filter %>%
lit_counts count(`st.type`, `st.name`) %>%
arrange(`st.type`, desc(`n`))
%>% print(n = 30)
lit_counts #> # A tibble: 33 x 3
#> st.type st.name n
#> <list> <chr> <int>
#> 1 <chr [2]> ethanol 571
#> 2 <chr [2]> Oral anticoagulants 24
#> 3 <chr [2]> Metoprolol 20
#> 4 <chr [2]> morphine 7
#> 5 <chr [2]> Benzodiazepines 4
#> 6 <chr [2]> propofol 4
#> 7 <chr [2]> Diazepam 2
#> 8 <chr [2]> Midazolam 2
#> 9 <chr [2]> Thiopental 2
#> 10 <chr [2]> Apomorphine 1
#> 11 <chr [2]> Luteolin 1
#> 12 <chr [2]> Neostigmine 1
#> 13 <chr [2]> gamma hydroxybutyrate 1
#> 14 <chr [2]> Glycoproteins 3
#> 15 <chr [2]> glutathione 1
#> 16 <chr [3]> Cytochrome P450 6
#> 17 <chr [3]> mitogen-activated protein kinase p38 4
#> 18 <chr [3]> AHSA1 2
#> 19 <chr [3]> AIMP2 2
#> 20 <chr [3]> CRK 2
#> 21 <chr [3]> GRAP2 2
#> 22 <chr [3]> MAPK1 2
#> 23 <chr [3]> MAPK14 2
#> 24 <chr [3]> POLDIP2 2
#> 25 <chr [3]> Protein Kinase C 2
#> 26 <chr [1]> agonists 31
#> 27 <chr [1]> Antidepressive Agents 12
#> 28 <chr [1]> Anesthetics 8
#> 29 <chr [2]> melatonin 88
#> 30 <chr [2]> Medroxyprogesterone 17-Acetate 5
#> # … with 3 more rows
Note, the SemMedDB semantic types have been pre-filtered to only include a subset of possibilities.
We can also visualise the above table as a bar chart. In this case we will remove Ethanol as it is an outlier.
%>%
lit_counts filter(n < 100) %>%
{ggplot(.) +
aes(x = `st.name`, y = `n`) +
geom_col() +
geom_text(
aes(label = `n`),
position = position_dodge(0.9),
hjust = 0
+
) coord_flip()
}
Here we look at cases where leptin
is the central overlapping term.
<- "leptin"
focus_term <- lit_df_filter %>% filter(`st.name` == focus_term)
lit_detail %>% head()
lit_detail #> # A tibble: 6 x 20
#> gwas.id gwas.trait gs1.pval gs1.localCount st1.name s1.id s1.subject_id
#> <chr> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> 2 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> 3 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> 4 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> 5 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> 6 ieu-a-1… Sleep dura… 2.48e-6 2 Cortico… C0010124… C0010124
#> # … with 13 more variables: s1.object_id <chr>, s1.predicate <chr>,
#> # st.name <chr>, st.type <list>, s2.id <chr>, s2.subject_id <chr>,
#> # s2.object_id <chr>, s2.predicate <chr>, gs2.pval <dbl>,
#> # gs2.localCount <int>, st2.name <chr>, assoc_gwas.id <chr>,
#> # assoc_gwas.trait <chr>
We can create a network diagram to visualise these relationships.
<- lit_detail %>%
lit_detail mutate_at(vars(`gwas.trait`, `assoc_gwas.trait`), stringr::str_to_upper)
# add node types: 1 - selected GWAS, 2 - traits from literature, 3 - current focus term connecting 1 and 2
<- bind_rows(
nodes %>% select(node = `gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
lit_detail %>% select(node = `assoc_gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
lit_detail %>% select(node = `st1.name`) %>% distinct() %>% mutate(node_type = 2),
lit_detail %>% select(node = `st2.name`) %>% distinct() %>% mutate(node_type = 2),
lit_detail %>% select(node = `st.name`) %>% distinct() %>% mutate(node_type = 3),
lit_detail %>% distinct()
)
nodes#> # A tibble: 20 x 2
#> node node_type
#> <chr> <dbl>
#> 1 SLEEP DURATION 1
#> 2 CORONARY HEART DISEASE 1
#> 3 Corticosterone 2
#> 4 Sleep Apnea, Obstructive 2
#> 5 Phosphotransferases 2
#> 6 JAK2 2
#> 7 Janus kinase 2 2
#> 8 Insulin 2
#> 9 Adiponectin 2
#> 10 Coronary heart disease 2
#> 11 Proteome 2
#> 12 Endothelial dysfunction 2
#> 13 Cardiovascular Diseases 2
#> 14 Congestive heart failure 2
#> 15 Acute Coronary Syndrome 2
#> 16 Coronary Arteriosclerosis 2
#> 17 Stable angina 2
#> 18 Metabolic syndrome 2
#> 19 Acute myocardial infarction 2
#> 20 leptin 3
<- bind_rows(
edges # exposure -> s1 subject
%>%
lit_detail select(node = `gwas.trait`, assoc_node = `st1.name`) %>%
distinct(),
# s2 object -> outcome
%>%
lit_detail select(node = `st2.name`, assoc_node = `assoc_gwas.trait`) %>%
distinct(),
# s1 subject - s1 predicate -> s1 object
%>%
lit_detail select(
node = `st1.name`, assoc_node = `st.name`,
label = `s1.predicate`
%>%
) distinct(),
# s2 subject - s2 predicate -> s2 object
%>%
lit_detail select(
node = `st.name`, assoc_node = `st2.name`,
label = `s2.predicate`
%>%
) distinct()
%>%
) distinct()
edges#> # A tibble: 40 x 3
#> node assoc_node label
#> <chr> <chr> <chr>
#> 1 SLEEP DURATION Corticosterone <NA>
#> 2 Sleep Apnea, Obstructive CORONARY HEART DISEASE <NA>
#> 3 Phosphotransferases CORONARY HEART DISEASE <NA>
#> 4 JAK2 CORONARY HEART DISEASE <NA>
#> 5 Janus kinase 2 CORONARY HEART DISEASE <NA>
#> 6 Insulin CORONARY HEART DISEASE <NA>
#> 7 Adiponectin CORONARY HEART DISEASE <NA>
#> 8 Coronary heart disease CORONARY HEART DISEASE <NA>
#> 9 Proteome CORONARY HEART DISEASE <NA>
#> 10 Endothelial dysfunction CORONARY HEART DISEASE <NA>
#> # … with 30 more rows
<- function(edges, nodes) {
plot_network
<- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
graph $layout <- layout_with_kk
graph
# generate colors based on node type
<- c("tomato", "lightblue", "gold")
colors V(graph)$color <- colors[V(graph)$node_type]
# Configure canvas
<- par("mar")
default_mar <- c(0, 0, 0, 0)
new_mar par(mar = new_mar)
plot.igraph(
graph,vertex.size = 13,
vertex.label.color = "black",
vertex.label.family = "Helvetica",
vertex.label.cex = 0.8,
edge.arrow.size = 0.4,
edge.label.color = "black",
edge.label.family = "Helvetica",
edge.label.cex = 0.5
)par(mar = default_mar)
}
plot_network(edges, nodes)
We can refer back to the articles to check the text that was used to derive the SemMedDB data. This is important due to the imperfect nature of the SemRep annotation process (https://semrep.nlm.nih.gov/).
<- function(gwas_id, semmed_triple_id) {
get_literature <- "/literature/gwas"
endpoint <- list(
params gwas_id = gwas_id,
semmed_triple_id = semmed_triple_id,
by_gwas_id = TRUE,
pval_threshold = 1e-1
)<- query_epigraphdb(route = endpoint, params = params, mode = "table")
df %>% select(`triple.id`, `triple.name`, `lit.id`)
df
}
<- bind_rows(
pub_df %>%
lit_detail select(gwas_id = `gwas.id`, semmed_triple_id = `s1.id`) %>%
distinct(),
%>%
lit_detail select(gwas_id = `assoc_gwas.id`, semmed_triple_id = `s2.id`) %>%
distinct()
%>%
) transpose() %>%
map_df(function(x) get_literature(x$gwas_id, x$semmed_triple_id))
pub_df#> # A tibble: 124 x 3
#> triple.id triple.name lit.id
#> <chr> <chr> <chr>
#> 1 C0010124:NEG_INTERACTS_WITH:C0… Corticosterone NEG_INTERACTS_WITH Le… 200515…
#> 2 C0010124:NEG_INTERACTS_WITH:C0… Corticosterone NEG_INTERACTS_WITH Le… 154755…
#> 3 C0299583:TREATS:C0520679 Leptin|LEP TREATS Sleep Apnea, Obstr… 104496…
#> 4 C0299583:TREATS:C0520679 Leptin|LEP TREATS Sleep Apnea, Obstr… 104496…
#> 5 C0299583:STIMULATES:C0031727 Leptin STIMULATES Phosphotransferases 159272…
#> 6 C0299583:STIMULATES:3717 Leptin|LEP STIMULATES Janus kinase 2… 159272…
#> 7 C0299583:STIMULATES:C0169661 Leptin|LEP STIMULATES Janus kinase 2… 159272…
#> 8 C0299583:INTERACTS_WITH:C00216… Leptin INTERACTS_WITH Insulin 156480…
#> 9 C0299583:INTERACTS_WITH:C00216… Leptin INTERACTS_WITH Insulin 127240…
#> 10 C0299583:STIMULATES:C0389071 Leptin STIMULATES Adiponectin 158112…
#> # … with 114 more rows
sessionInfo
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Manjaro Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas.so.3.9.0
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.3.2 epigraphdb_0.2.2 igraph_1.2.6 glue_1.4.2
#> [5] purrr_0.3.4 dplyr_1.0.2 magrittr_2.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] pillar_1.5.1 compiler_4.0.4 highr_0.8 tools_4.0.4
#> [5] digest_0.6.27 gtable_0.3.0 jsonlite_1.7.2 evaluate_0.14
#> [9] lifecycle_1.0.0 tibble_3.1.0 debugme_1.1.0 pkgconfig_2.0.3
#> [13] rlang_0.4.10 cli_2.3.1 rstudioapi_0.13 curl_4.3
#> [17] yaml_2.2.1 xfun_0.21 withr_2.4.1 stringr_1.4.0
#> [21] httr_1.4.2 knitr_1.31 generics_0.1.0 vctrs_0.3.6
#> [25] gtools_3.8.2 grid_4.0.4 tidyselect_1.1.0 R6_2.5.0
#> [29] fansi_0.4.2 rmarkdown_2.1 farver_2.1.0 scales_1.1.1
#> [33] ps_1.6.0 ellipsis_0.3.1 htmltools_0.5.1.1 assertthat_0.2.1
#> [37] colorspace_2.0-0 labeling_0.4.2 utf8_1.2.1 stringi_1.5.3
#> [41] munsell_0.5.0 crayon_1.4.1