The package mglR standing for “Master Gene List” was developed to download and organize large-scale, publicly available genomic studies on a candidate gene scale. It also includes functions to integrate these datasources and compare features across candidate genes.
This package makes use of the list structure in ‘R’ to organize datasources in an identical fashion for each gene of interest. It is building a nested list from which data can be pulled in a very systematic way. Practically speaking, the package can be divided into two components:
1: building the list structure. These are functions assigned to the family elements and all begin with add. These functions download and organize various datasources.
2: integrating datasources across candidate genes. These are functions assigned to the family output and all begin with make. These functions reflect some simple questions that maybe of general interest. For example: Are two genes co-expressed? Do any genes have the same traits assigned by GWAS? Are any SNPs assigned to a given gene both because they appear as an eQTL and a GWAS hit?
The first section of this vignette covers how to build the list structure, the second discusses some pecularities of the list structure, and the third presents the functions introduced to integrate these datasources. A separate vignette “Building on mglR” presents some ideas about how this package may be expanded by the user to include additional datasources or functions to accomodate their specific interests.
This package relies on several packages that have been developed by others to handle many of these datasources. Perhaps the most valuable, and the only one that is truly essential, is biomaRt. Although biomaRt does a phenomenal job in facilitating navigation between gene names, postions, etc. in a way that is critical for this package, it at times will throw error messages “Request to BioMart web service failed. Verify if you are still connected to the internet. Alternatively the BioMart web service is temporarily down.”. If this occurs please try the function again.
Also, please note that a sample mgl list myMgl is included as a dataset in the package. As this list is complete it can be used to try any of the make functions.
An mgl list can be generated from gene names, ENSG ids, or genomic coordinates using one of the following three functions.
buildFromNames
buildFromEnsgs
buildFromRegion
Each of these functions will generate a standard list in R with length corresponding to the number of candidate genes. Each element of this list is itself a list with length of twenty corresponding to the number of datasources.
The following are all equivalent.
# Load the library
library(mglR)
# Build an empty list from gene names
buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB", "NCOA4", "TIMM23")) -> exampleMgl
# Build an empty list from ENSG identifiers
buildFromEnsgs(c("ENSG00000228326", "ENSG00000230553", "ENSG00000138294", "ENSG00000138293", "ENSG00000138297")) -> exampleMgl
# Build an empty list from a single gene region
buildFromRegion(10,51518775,51600147) -> exampleMgl
For each of the above, the output is an empty, nested list. Where the first level of heirarchy is the gene and the second is the datasource (i.e. mgl[[gene]][[datasource]]
)
# The output is a nested list
class(exampleMgl)
## [1] "list"
#
# The length of the list corresponds to the number of candidate genes
length(exampleMgl)
## [1] 5
names(exampleMgl)
## [1] "RP11-109G10.2" "RP11-481A12.2" "MSMB" "NCOA4"
## [5] "TIMM23"
#
# Each element corresponding to a gene is itself a list
class(exampleMgl[[1]])
## [1] "list"
names(exampleMgl[[1]])
## [1] "name" "enst"
## [3] "location" "antisense"
## [5] "go" "pubmed"
## [7] "gtex.normalized" "gtex.gene.counts"
## [9] "gtex.transcript.counts" "gtex.gene.rpkm"
## [11] "gtex.transcript.rpkm" "dnase"
## [13] "transEqtls" "cisEqtls"
## [15] "sqtlSeek" "sqtlAltrans"
## [17] "pqtl" "gwasCatalog"
## [19] "grasp" "aei"
#
# Currently these are empty placeholders for a particular data source
exampleMgl[[1]][2]
## $enst
## [1] 2
Functions beginning with add can be used to fill in the various elements for each gene. Only those datasources of interest need be filled in, it is no problem for elements to remain empty.
Note: the example mgl ‘myMgl’ was too large to include in the package so the function exMgl
can be used to download and load it
# Load `myMgl`, example mgl list
exMgl() -> myMgl
Element 2: enst
## # Add transcript information
## addEnst(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]
myMgl[[1]][2]
## $enst
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 RP11-109G10.2-001 ENST00000439573 processed_pseudogene
Element 3: location
## # Add position information
## addLoc(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]
myMgl[[1]][3]
## $location
## chromosome_name start_position end_position strand
## 1 10 51532102 51532572 -1
Element 4: antisense
## # Add information about transcripts in antisense position
## addAntisense(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]. note: none of these genes have antisense transcripts
myMgl[[1]][4]
## $antisense
## [1] external_gene_id ensembl_transcript_id transcript_biotype
## [4] chromosome_name start_position end_position
## [7] strand
## <0 rows> (or 0-length row.names)
Element 5: go
## # Add gene ontology terms
## addGo(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]
myMgl[[3]][5]
## $go
## name_1006
## 1 biological_process
## 2 extracellular space
## 3 nucleus
## 4 molecular_function
Element 6: pubmed
## # Add number of papers cataloged in PubMed
## addPubmed(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]
myMgl[[4]][6]
## $pubmed
## [1] 152
Elements 7-11: gtex.normalized, gtex.gene.counts, gtex.transcript.counts, gtex.gene.rpkm, gtex.transcript.rpkm Even with relatively small candidate gene lists (< 10 genes), adding expression data to the list (elements 7-11) can be time and memory intensive. The function works by downloading the data locally (which can be saved for future use using the flag saveDownload = TRUE
), subsetting the datasets for the genes of interest using grep commands run by R through unix, loading the subsetted data into R and incorporating the data into the mgl list. The most time consuming parts are downloading and subsetting the data. Both of these steps can be run outside of R using the command line if preferred.
## # Add expression data from GTEx
## addExpression(myMgl, download = TRUE, saveDownload = FALSE, normalized = TRUE, xpGeneReads = TRUE, xpTranscriptReads = TRUE, xpGeneRpkm = TRUE, xpTranscriptRpkm = TRUE) -> myMgl
##
## # examples mgl[[gene]][[datasource]][[tissue]][expression&covariates, people]
## # normalized expression data
myMgl[[3]][[7]][[1]][1:3,1:3]
## GTEX.111CU GTEX.111FC GTEX.111VG
## ENSG00000138294.9 0.9388143 -1.116486 0.1049847
## C1 -0.0180000 -0.015600 -0.0197000
## C2 0.0010000 0.016800 0.0108000
## # gene reads
myMgl[[3]][[8]][[1]][,1:2]
## GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8
## ENSG00000138294 4 0
## # transcript RPKM
myMgl[[3]][[11]][[1]][,1:2]
## GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8
## ENST00000358559 0.42109 0
## ENST00000487536 0.00000 0
## ENST00000474170 0.00000 0
## ENST00000478719 0.00000 0
## ENST00000466268 0.00000 0
## ENST00000298239 0.00000 0
Element 12: dnase
## # Add results from Maurano Nat. Genetics 2015
## addDnase(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]]
myMgl[[3]][12]
## $dnase
## chrom chromStart chromEnd rs numhets allele.1 readsA allele.2
## 1 chr10 51553064 51553065 rs10994201 4 G 86 A
## 2 chr10 51558124 51558125 rs7076948 7 C 71 T
## 3 chr10 51559468 51559469 rs7904463 8 T 177 C
## 4 chr10 51561798 51561799 rs17178655 14 A 235 G
## readsB totalReads pctRef q.value significance.level
## 1 76 162 0.469 7.63e-01 not_imbalanced
## 2 64 135 0.474 8.53e-01 not_imbalanced
## 3 71 248 0.286 5.17e-10 imbalanced_(0.1%_FDR)
## 4 126 361 0.349 2.70e-07 imbalanced_(5%_FDR)
Element 13: transEqtls
## # Add transEqtl data from GTEx
## addTransEqtl(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[13]][20]
## $`Heart - Left Ventricle`
## [1] DiscoveryMode rs_id_dbSNP142_GRCh37p13
## [3] variant_id snp_chr
## [5] snp_pos gene_id
## [7] gene_name tissue_site_detail
## [9] pvalue FDR
## [11] cis.eGene_id cis.eGene_name
## [13] GWAS_pvalue trait
## [15] PMID
## <0 rows> (or 0-length row.names)
Element 14: cisEqtls
## # Add cisEqtl data from GTEx
## addCisEqtl(myMgl, download = TRUE, saveDownload = FALSE) -> myMgl
##
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[14]][[44]][1:3,]
## tissue variant_id gene_id tss_distance
## 1 Whole_Blood 10_51489257_T_C_b37 ENSG00000138293.15 -75851
## 2 Whole_Blood 10_51489667_T_C_b37 ENSG00000138293.15 -75441
## 3 Whole_Blood 10_51490628_T_C_b37 ENSG00000138293.15 -74480
## pval_nominal slope slope_se slope_fpkm slope_fpkm_se
## 1 8.87850e-06 -0.143856 0.0318135 -1.92895 2.39498
## 2 2.45591e-05 -0.129756 0.0302692 -1.78156 2.27139
## 3 1.45179e-05 -0.137201 0.0311152 -1.50279 2.33963
## pval_nominal_threshold min_pval_nominal pval_beta rsId
## 1 7.60232e-05 1.12164e-07 0.000288916 rs11595194
## 2 7.60232e-05 1.12164e-07 0.000288916 rs12783102
## 3 7.60232e-05 1.12164e-07 0.000288916 rs10825136
Element 15: sqtlSeek
## # Add splicing QTL data from GTEx sqtlSeek algorithm
## addSqtlSeek(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]][[tissue]]. note: none of these genes have splicing QTls as defined by sqtlSeek
myMgl[[1]][[15]][1]
## $AdiposeTissue
## [1] snpId pv transId1 transId2
## <0 rows> (or 0-length row.names)
Element 16: sqtlAltrans
## # Add splicing QTL data from GTEx Altrans algorithm
## addSqtlAltrans(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[16]][2]
## $ArteryTibial
## Variation X.log10.p.
## 2921 rs4519043 6.4089
## 2922 rs79490275 6.1975
Element 17: pqtl
## # Add protein truncating variants from GTEx
## addPtv(myMgl) -> myMgl
##
## # example mgl[[gene]][[datasource]][[tissue]]
myMgl[[4]][[17]][[1]][1:2,]
## identifier of snp alternative identifier of SNP
## 13458 rs117273914 -
## 13935 rs72795897 -
## name of exon that exon belongs to chr_start_end
## 13458 NCOA4 chr10_51580880_51580968
## 13935 NCOA4 chr10_51580880_51580968
## chromosome of SNP chromosome of exon location of SNP
## 13458 10 10 51580469
## 13935 10 10 51581889
## location of middle point of exon distance between 8 and 9
## 13458 51580924 455
## 13935 51580924 965
## spearman correlation pvalue -log10(pvalue) NA
## 13458 0.135 0.1940369 0.7121 NA
## 13935 0.132 0.2062249 0.6857 NA
Element 18: gwasCatalog
## # Add GWAS data from the NHGRI-EBI GWAS Catalog
## addGwasCatalog(myMgl, range = 10, download = TRUE, saveDownload = FALSE) -> myMgl
##
## # example mgl[[gene]][[datasource]][GWASEntry, Details]
myMgl[[3]][[18]][1:2,]
## DATE.ADDED.TO.CATALOG PUBMEDID FIRST.AUTHOR DATE JOURNAL
## 27 2008-06-16 18264097 Eeles RA 2008-02-10 Nat Genet
## 34 2008-06-16 18264096 Thomas G 2008-02-10 Nat Genet
## LINK
## 27 www.ncbi.nlm.nih.gov/pubmed/18264097
## 34 www.ncbi.nlm.nih.gov/pubmed/18264096
## STUDY
## 27 Multiple newly identified loci associated with prostate cancer susceptibility.
## 34 Multiple loci identified in a genome-wide association study of prostate cancer.
## DISEASE.TRAIT
## 27 Prostate cancer
## 34 Prostate cancer
## INITIAL.SAMPLE.SIZE
## 27 1,854 European ancestry cases, 1,894 European ancestry controls
## 34 1,172 European ancestry cases, 1,157 European ancestry controls
## REPLICATION.SAMPLE.SIZE
## 27 3,268 European ancestry cases, 3,366 European ancestry controls
## 34 3,941 European ancestry cases, 3,964 European ancestry controls
## REGION CHR_ID CHR_POS REPORTED.GENE.S. MAPPED_GENE
## 27 10q11.22 10 46046326 MSMB MSMB - RPL23AP61
## 34 10q11.22 10 46046326 MSMB MSMB - RPL23AP61
## UPSTREAM_GENE_ID DOWNSTREAM_GENE_ID SNP_GENE_IDS UPSTREAM_GENE_DISTANCE
## 27 4477 728484 57
## 34 4477 728484 57
## DOWNSTREAM_GENE_DISTANCE STRONGEST.SNP.RISK.ALLELE SNPS MERGED
## 27 16924 rs10993994-T rs10993994 0
## 34 16924 rs10993994-T rs10993994 0
## SNP_ID_CURRENT CONTEXT INTERGENIC RISK.ALLELE.FREQUENCY
## 27 10993994 upstream_gene_variant 1 0.40
## 34 10993994 upstream_gene_variant 1 0.40
## P.VALUE PVALUE_MLOG P.VALUE..TEXT. OR.or.BETA X95..CI..TEXT.
## 27 9e-29 28.04576 1.25 [1.17-1.34]
## 34 7e-13 12.15490 1.16 [1.04-1.29]
## PLATFORM..SNPS.PASSING.QC. CNV
## 27 Illumina [541129] N
## 34 Illumina [527869] N
Element 19: grasp
# Add GWAS data from GRASP
addGrasp(myMgl, range = 0) -> myMgl
# example
#myMgl[[1]][[19]][1:3,]
Element 20: aei Please note that access to allelic ratios from GTEx requires approval and download through dbGaP. See http:/www.ncbi.nlm.nih.gov/gap
# Add allelic ratios from GTEx
addAei(myMgl, fp = '~/Downloads') -> myMgl
# example mgl[[gene]][[datasource]][[tissue]][SNP,Details]
myMgl[[4]][[20]][[1]][1:3,]
The list that is created can be compared to filing cabinet. Where the cabinet represents the entirety of the list, each drawer a different gene, and each datasource a folder within the drawer. If you simply print the list to the screen in R, it would be like dumping an entire filing cabinet full of papers all over the floor. Even printing one element/gene is too much and in some cases the datasource itself (i.e. the folder) is really big and even this is too much to print to the screen. To cautiously work your way through the list, use class
and dim
or length
to preview and subsetting either of lists [[#]]
or of dataframes [row#, column#]
.
# check class
class(myMgl[[3]][[12]])
## [1] "data.frame"
# check dimensions
dim(myMgl[[3]][[12]])
## [1] 4 13
# print subset
myMgl[[3]][[12]][1:3,]
## chrom chromStart chromEnd rs numhets allele.1 readsA allele.2
## 1 chr10 51553064 51553065 rs10994201 4 G 86 A
## 2 chr10 51558124 51558125 rs7076948 7 C 71 T
## 3 chr10 51559468 51559469 rs7904463 8 T 177 C
## readsB totalReads pctRef q.value significance.level
## 1 76 162 0.469 7.63e-01 not_imbalanced
## 2 64 135 0.474 8.53e-01 not_imbalanced
## 3 71 248 0.286 5.17e-10 imbalanced_(0.1%_FDR)
Included are two functions to make navigating the mgl list a bit easier.
(1) Which elements have been added? Use listElements
to display which elements have been added or which ones are still missing.
# display missing elements
listElements(exampleMgl, added = FALSE)
## enst location antisense
## 2 3 4
## go pubmed gtex.normalized
## 5 6 7
## gtex.gene.counts gtex.transcript.counts gtex.gene.rpkm
## 8 9 10
## gtex.transcript.rpkm dnase transEqtls
## 11 12 13
## cisEqtls sqtlSeek sqtlAltrans
## 14 15 16
## pqtl gwasCatalog grasp
## 17 18 19
## aei
## 20
#
# display elements that have already been added
listElements(exampleMgl, added = TRUE)
## name
## 1
(2) One element at a time. Use listSeparate
to subset the list with just one datasource for all genes (i.e. to get ensts only for all genes)
# subset the second element with ENST information
listSeparate(myMgl, 2) -> subset
#
subset
## $`RP11-109G10.2`
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 RP11-109G10.2-001 ENST00000439573 processed_pseudogene
##
## $`RP11-481A12.2`
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 RP11-481A12.2-001 ENST00000449598 processed_pseudogene
##
## $MSMB
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 MSMB-001 ENST00000358559 protein_coding
## 2 MSMB-003 ENST00000487536 processed_transcript
## 3 MSMB-006 ENST00000474170 processed_transcript
## 4 MSMB-004 ENST00000478719 processed_transcript
## 5 MSMB-005 ENST00000466268 processed_transcript
## 6 MSMB-002 ENST00000298239 protein_coding
##
## $NCOA4
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 NCOA4-005 ENST00000498586 processed_transcript
## 2 NCOA4-001 ENST00000374087 protein_coding
## 3 NCOA4-003 ENST00000344348 protein_coding
## 4 NCOA4-002 ENST00000374082 protein_coding
## 5 NCOA4-004 ENST00000431200 protein_coding
## 6 NCOA4-203 ENST00000438493 protein_coding
## 7 NCOA4-205 ENST00000452682 protein_coding
## 8 NCOA4-204 ENST00000443446 protein_coding
## 9 NCOA4-202 ENST00000430396 protein_coding
## 10 NCOA4-201 ENST00000414907 protein_coding
##
## $TIMM23
## external_transcript_id ensembl_transcript_id transcript_biotype
## 1 TIMM23-007 ENST00000444743 protein_coding
## 2 TIMM23-001 ENST00000260867 protein_coding
## 3 TIMM23-006 ENST00000485812 processed_transcript
## 4 TIMM23-002 ENST00000374065 protein_coding
## 5 TIMM23-003 ENST00000476778 processed_transcript
## 6 TIMM23-004 ENST00000469116 processed_transcript
## 7 TIMM23-005 ENST00000480051 processed_transcript
## 8 TIMM23-201 ENST00000374064 protein_coding
Colloquial gene names (e.g. RP11-109G10.2) are often far more convienent for every day use than ENSG identifiers (e.g. ENSG00000228326), although colloquial names are not always specific. biomaRt does a phenomenal job converting between different gene names and identifiers. However, in the case that biomaRt is not able to find a corresponding ENSG id for a given gene a warning message will be displayed. This should be fixed using missNames
and fixNames
as various functions in the package rely on ENSG ids. bklIn the cases where these ENSG ids are not identified through biomaRt and must be fixed, they can often be found by googling the gene name. GeneCards http://www.genecards.org/ does a particularly good job cataloging various colloquial names for a given gene and always provides the ENSG id.
# Build an empty list from gene names
buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB", "NCOA4", "TIMM23")) -> newMgl
## Warning in buildFromNames(c("RP11-109G10.2", "RP11-481A12.2", "MSMB",
## "NCOA4", : Gene names missing: RP11-109G10.2, RP11-481A12.2
#
# Gene name was not found by *biomaRt*
newMgl[[1]][1]
## $name
## [1] "NA"
#
# Identify missing names
missNames(newMgl)
## [1] "RP11-109G10.2" "RP11-481A12.2"
#
# Fix missing names
fixNames(newMgl, c('ENSG00000228326', 'ENSG00000220508')) -> newMgl
#
# Gene names now filled in
newMgl[[1]][1]
## $name
## external_gene_id ensembl_gene_id description
## 1 RP11-109G10.2 ENSG00000228326 NA
The cisEqtl data released by GTEx uses chr_pos… instead of rs numbers to identify SNPs. In order to add rs ids use fixSnpIds
. It is preferable to use rs Ids for functions like makeSnps
, makeMultiEqtl
, makeOverlap
, and makeOverlapTable
.
# add rs Id
fixSnpIds(myMgl) -> myMgl
Some elements are dependent on one another. For example adding information from the GWAS Catalog (element 18) requires that position information for each gene (element 3) has been filled in because GWAS-based associations are pulled not only based on the assigned gene name but also on the position range of the gene. In these instances of inter-dependent elements, if the required element has not yet been filled in an error message appears directing the user to add the required information.
# Build list
## buildFromRegion(10,51518775,51600147) -> exampleMgl
#
# Add GWAS data from the NHGRI-EBI GWAS Catalog [element #18]
addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE) -> exampleMgl
## Error in addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE): location data not available. see function addLoc
#
# Add position information [element #3]
addLoc(exampleMgl) -> exampleMgl
#
exampleMgl[[1]][[3]]
## chromosome_name start_position end_position strand
## 1 10 51532102 51532572 -1
#
# Add GWAS data from the NHGRI-EBI GWAS Catalog [element #18]
addGwasCatalog(exampleMgl, range = 10, download = TRUE, saveDownload = FALSE) -> exampleMgl
#
# example mgl[[gene]][[datasource]][GWASEntry, Details]
exampleMgl[[3]][[18]][1:2,1:5]
## DATE.ADDED.TO.CATALOG PUBMEDID FIRST.AUTHOR DATE JOURNAL
## 1 2008-06-16 18264097 Eeles RA 2008-02-10 Nat Genet
## 2 2008-06-16 18264096 Thomas G 2008-02-10 Nat Genet
Two functions makeGo
and makeGoSearch
are included to organize Gene Ontology terms and determine those GO terms that are shared among candidate genes.
# summarize gene ontology (GO) terms
makeGo(myMgl, saveFile = F) -> myGo
#
# output of makeGo is a list with three elements
class(myGo)
## [1] "list"
names(myGo)
## [1] "goRes" "goTable" "goCount"
#
#goRes is a list with GO terms for each gene (subset of element 5 from mgl)
myGo[[1]][[4]]
## name_1006
## 1 androgen receptor signaling pathway
## 2 response to hormone
## 3 male gonad development
## 4 positive regulation of transcription, DNA-dependent
## 5 transcription, DNA-templated
## 6 nucleus
## 7 androgen receptor binding
## 8 transcription coactivator activity
#
# goTable is a dataframe with GO term and gene
myGo[[2]][1:2,]
## GoTerm Gene
## 1 biological_process MSMB
## 2 extracellular space MSMB
#
# goCount is a ranking of GO terms and the number of genes they are assigned to
myGo[[3]][1:2]
##
## nucleus androgen receptor binding
## 2 1
#
# to search for genes with a specific GO term
makeGoSearch(myMgl, 'nucleus', go = myGo, saveFile = F)
## [1] "MSMB" "NCOA4"
Similar to the functions for summarizing Gene Ontology terms, two functions makePhenotypes
and makePhenotypeSearch
are included to organize GWAS-based trait associations and determine those phenotypes that are shared among candidate genes.
# summarize phenotypes
makePhenotypes(myMgl, saveFile = F) -> myPhenotypes
#
# output of makePhenotypes is a list with four elements
class(myPhenotypes)
## [1] "list"
names(myPhenotypes)
## [1] "phenRes" "phenList" "phenTable" "phenCount"
#
# phenRes is a list with GWAS-based trait associations for each gene split by datasource (NHGRI-EBI GWAS Catalog and GRASP)
myPhenotypes[[1]][[5]]
## $GwasCatalog
## [1] "Mean corpuscular hemoglobin" "Mean corpuscular volume"
## [3] "Prostate cancer"
##
## $GRASP
## [1] "HDL cholesterol"
## [2] "Mean corpuscular hemoglobin (MCH)"
## [3] "Mean corpuscular volume (MCV)"
## [4] "Neuroblastoma (brain cancer)"
## [5] "Gene expression of TIMM23 [probe ILMN_22871] in osteoblasts treated with BMP2"
## [6] "TIMM23 gene expression in liver tissue"
## [7] "Comorbid depressive syndrome and alcohol dependence"
## [8] "Mean corpuscular hemoglobin concentration (MCHC)"
## [9] "Infant head circumference"
## [10] "Gene expression of NCOA4 in blood"
#
# phenList is a list with GWAS-based trait associations for each gene
myPhenotypes[[2]][[4]][1:5]
## [1] "Birth weight"
## [2] "Carboplatin-induced neutropenia/leukopenia in cancer patients"
## [3] "Differential exon level expression of CTGLF5 [probe 3288865] in brain cortex"
## [4] "Differential exon level expression of CTGLF5 [probe 3288879] in brain cortex"
## [5] "Differential exon level expression of CTGLF5 [probe 3288880] in brain cortex"
#
# phenTable is a dataframe with phenotype and gene
myPhenotypes[[3]][1:2,]
## Phenotype Gene
## [1,] "Fasting insulin" "RP11-109G10.2"
## [2,] "HDL cholesterol" "RP11-109G10.2"
#
# phenCount is a ranking of phenotypes and the number of genes they are assigned to
myPhenotypes[[4]][1:2]
##
## HDL cholesterol Fasting insulin
## 4 3
#
# to search for genes with a specific GO term
makePhenotypeSearch(myMgl, 'Mean corpuscular hemoglobin', phen = myPhenotypes, saveFile = F)
## [1] "MSMB" "NCOA4" "TIMM23"
Again similar to Gene Ontology terms and Phenotypes, two functions makeSnps
and makeSnpSearch
are included to organize genetic variants associated with candidate genes and determine those SNPs that are shared among candidate genes.
# summarize SNPs
makeSnps(myMgl, saveFile = F) -> mySnps
#
# output of makeSnps is a list with four elements
class(mySnps)
## [1] "list"
names(mySnps)
## [1] "snpRes" "snpList" "snpTable" "snpCount"
#
# snpRes is a list with SNPs for each gene split by datasource
mySnps[[1]][[5]]
## $TranseQTL
## [1] "None"
##
## $CiseQTL
## [1] "rs17720205" "rs78937603"
## [3] "rs11592181" "rs11597667"
## [5] "rs61847076" "rs61847077"
## [7] "10_51593069_GACA_G_b37" "rs10994675"
## [9] "rs41306524" "rs7085433"
## [11] "rs3122258" "rs188856704"
## [13] "rs142782792" "rs181655860"
## [15] "rs117979646" "10_51507845_GA_G_b37"
## [17] "rs75901817" "rs17720193"
## [19] "rs17776547" "rs41306524"
## [21] "rs74877435" "rs7085433"
##
## $sqtlSeek
## [1] "None"
##
## $sqtlAltrans
## [1] "rs7085005" "rs2177370"
##
## $Ptv
## [1] "None"
##
## $GwasCatalog
## [1] "rs7085433" "rs10993994"
##
## $GRASP
## [1] "rs7085433" "rs7350420"
##
## $DNAse
## [1] "None"
#
# snpList is a list with SNPs for each gene
mySnps[[2]][[5]]
## [1] "10_51507845_GA_G_b37" "10_51593069_GACA_G_b37"
## [3] "rs10993994" "rs10994675"
## [5] "rs11592181" "rs11597667"
## [7] "rs117979646" "rs142782792"
## [9] "rs17720193" "rs17720205"
## [11] "rs17776547" "rs181655860"
## [13] "rs188856704" "rs2177370"
## [15] "rs3122258" "rs41306524"
## [17] "rs61847076" "rs61847077"
## [19] "rs7085005" "rs7085433"
## [21] "rs7350420" "rs74877435"
## [23] "rs75901817" "rs78937603"
#
# snpTable is a dataframe with SNP and gene
mySnps[[3]][1:3,]
## SNP Gene
## [1,] "rs2611504" "RP11-109G10.2"
## [2,] "rs4581397" "RP11-109G10.2"
## [3,] "rs4630240" "RP11-109G10.2"
#
# snpCount is a ranking of SNPs and the number of genes they are assigned to
mySnps[[4]][1:3]
##
## rs2611504 rs4581397 rs4630240
## 3 3 3
#
# to search for genes with a specific SNP
makeSnpSearch(myMgl, 'rs2611504', snp = mySnps, saveFile = F)
## [1] "RP11-109G10.2" "MSMB" "NCOA4"
Two functions makeOverlap
and makeOverlapTable
have been developed to identify SNPs that are assigned to a given gene for multiple reasons, e.g. as both a cis-eQTL and a GWAS based variant. makeOverlap
identifies overlapping SNPs for two user-defined groups, while makeOverlapTable
gives the details for these SNPs.
# find SNPs that are both cis-eQTLs and GWAS-based variants
makeOverlap(myMgl, snpsA = 'cisEqtls', snpsB = 'gwasCatalog', saveFile = FALSE) -> myOverlap
#
myOverlap
## $overlap
## $overlap$`RP11-109G10.2`
## [1] "None"
##
## $overlap$`RP11-481A12.2`
## [1] "None"
##
## $overlap$MSMB
## [1] "rs10993994" "rs3123078" "rs4631830"
##
## $overlap$NCOA4
## [1] "None"
##
## $overlap$TIMM23
## [1] "rs7085433"
##
##
## $snpsA
## [1] "cisEqtls"
##
## $snpsB
## [1] "gwasCatalog"
# get details for these SNPs
makeOverlapTable(myMgl, myOverlap, saveFile = FALSE) -> myOverlapTable
#[[gene]][[snp]][[datasource]]
myOverlapTable[[1]][[1]][[1]][,1:3]
## tissue variant_id gene_id
## 1 Lung 10_51549496_T_C_b37 ENSG00000138294.9
## 2 Skin_Not_Sun_Exposed_Suprapubic 10_51549496_T_C_b37 ENSG00000138294.9
## 3 Skin_Sun_Exposed_Lower_leg 10_51549496_T_C_b37 ENSG00000138294.9
## 4 Stomach 10_51549496_T_C_b37 ENSG00000138294.9
myOverlapTable[[1]][[1]][[2]][,1:3]
## DATE.ADDED.TO.CATALOG PUBMEDID FIRST.AUTHOR
## 1 2008-06-16 18264097 Eeles RA
## 2 2008-06-16 18264096 Thomas G
## 3 2011-07-30 21743057 Schumacher FR
## 4 2010-09-11 20676098 Takata R
## 5 2011-01-13 21160077 Gudmundsson J
## 6 2013-03-22 23269536 Sun J
## 7 2013-09-18 23555189 Chen Z
## 8 2014-11-09 24740154 Lange EM
## 9 2014-11-07 24753544 Knipe DW
## 10 2016-06-13 26443449 Wang M
## 11 2016-02-12 25939597 Berndt SI
## 12 2016-07-15 26034056 Hoffmann TJ
## 13 2016-07-15 26034056 Hoffmann TJ
Two functions makeCoXpGene
and makeCoXpTranscript
are included to test co-expression between gene or transcript pairs in specific tissues. For each pair of genes or transcripts in the specified tissues, co-expression is tested using cor.test. Two flags makePlot
and saveFile
can be used to save a pdf with corresponding plots and/or a csv with the results of cor.test. The plot includes expression of one gene on the x-axis and the other gene on the y-axis with results of the cor.test as part of the figure legend. The csv includes the cor.test output for all combinations of genes or transcripts in the specified tissues.
# plot co-expressions between genes using count normalized data
makeCoXpGene(myMgl, gene = c("RP11-109G10.2", "NCOA4"), makePlot = FALSE, saveFile = FALSE, data = 7, tissue = c('Prostate'))
## $`RP11-109G10.2|NCOA4`
## $`RP11-109G10.2|NCOA4`$Prostate
##
## Pearson's product-moment correlation
##
## data: g1 and g2
## t = 7.0002, df = 85, p-value = 5.546e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4515997 0.7232408
## sample estimates:
## cor
## 0.6047216
To show only those results from Maurano et. al that are significant
# filter DNAse results for significant p-values
makeDnaseSig(myMgl) -> results
#
results
## $`RP11-109G10.2`
## character(0)
##
## $`RP11-481A12.2`
## character(0)
##
## $MSMB
## [1] "rs7904463" "rs17178655"
##
## $NCOA4
## character(0)
##
## $TIMM23
## character(0)
Allelic ratios can be plotted for one gene at a time using makeAeiPlot
. A single pdf entitle ‘AeiPlot_gene.pdf’ will be generated with each individual as a separate page in the document. Tissues are displayed on the x-axis and allelic ratios on the y-axis. Each do represents an individual SNP. Dot colors correspond to the SNP (identified by its position), while dot shapes correspond to the class of SNP - intron, exon, etc. The results used to generate the plot can be saved in an csv file using the ‘saveFile’ flag.
# plot allelic ratios
makeAeiPlot(myMgl, gene = c('NCOA4'), saveFile = FALSE) -> myAeiPlot
The function makeMultiEqtl
was designed to identify eQTL variants that serve as eQTLs for multiple candidate genes. The search can be done in a tissue-specific fashion by specifying a particular tissue using the tissue
flag or irrespective of tissue by using tissue = 'All'
. There are two optional flags makePlot
and saveFile
that will generate, respectively, either a pdf or a csv summarizing the results.
# if there are no MultiEqtls in a given tissue an error message appears
makeMultiEqtl(myMgl, makePlot = FALSE, saveFile = FALSE, cis = TRUE, trans = TRUE, tissue = c('All', 'Prostate')) -> myMultiEqtl
#
# summarize MultiEqtls
makeMultiEqtl(myMgl, makePlot = FALSE, saveFile = FALSE, cis = TRUE, trans = TRUE, tissue = c('All')) -> myMultiEqtl
#
# output of makeMultiEqtl is a list with four elements
# note in the event of multiple tissues the output is a list with length corresponding to the number tissues where each element is a list with four elements as described below
class(myMultiEqtl)
## [1] "list"
names(myMultiEqtl)
## [1] "fullTable" "ranking" "topSnps" "topSummary"
#
# fullTable is a dataframe where each row is an eQTL
myMultiEqtl[[1]][1:3,]
## V1 V2 V3 V4
## 1 RP11-109G10.2 Prostate rs2611504 cis
## 2 RP11-109G10.2 Prostate rs4630240 cis
## 3 MSMB Lung rs2611504 cis
#
# ranking is a list of tables ranking the number of genes each eQTL is assigned to
myMultiEqtl[[2]][[1]][1:3]
##
## rs2611504 rs4630240 10_51501963_AGTT_A_b37
## 3 3 2
#
# topSnps is a list of vectors of eQTLs appearing for the maximum number of genes (i.e. if the most number of genes an eQTL applies to is 3 all eQTLs assinged to 3 genes will be displayed)
myMultiEqtl[[3]]
## [[1]]
##
## rs2611504 rs4630240
## 3 3
#
# topSummary shows the information from fullTable (gene, tissue, snp) for those snps included in topSnps
myMultiEqtl[[4]]
## [[1]]
## V1 V2 V3 V4
## 1 RP11-109G10.2 Prostate rs2611504 cis
## 3 MSMB Lung rs2611504 cis
## 81 MSMB Skin_Not_Sun_Exposed_Suprapubic rs2611504 cis
## 207 MSMB Skin_Sun_Exposed_Lower_leg rs2611504 cis
## 368 MSMB Stomach rs2611504 cis
## 545 NCOA4 Whole_Blood rs2611504 cis
## 2 RP11-109G10.2 Prostate rs4630240 cis
## 4 MSMB Lung rs4630240 cis
## 399 MSMB Stomach rs4630240 cis
## 575 NCOA4 Whole_Blood rs4630240 cis
The function makeSummary
was designed to give a quick overview of the number of SNPs identified for each candidate gene. Of particular interest maybe those genes with molecular-based SNPs (e.g eQTLs) that do not have GWAS hits and vice versa.
# summarize associated SNPs
makeSummary(myMgl, saveFile = F) -> summary
#
summary
## DNAse TranseQTL CiseQTL sqtlSeek sqtlAltrans Ptv GwasCatalog
## RP11-109G10.2 0 0 2 0 0 0 0
## RP11-481A12.2 0 0 1 0 0 0 0
## MSMB 2 0 454 0 0 0 5
## NCOA4 0 0 163 0 6 6 4
## TIMM23 0 0 22 0 2 0 2
## GRASP
## RP11-109G10.2 2
## RP11-481A12.2 0
## MSMB 9
## NCOA4 11
## TIMM23 2