The BioMart project enables users to retrieve a vast diversity of annotation data for specific organisms. Steffen Durinck and Wolfgang Huber provide a powerful interface between the R language and BioMart in the form of the R package biomaRt. The following sections introduce users to the functionality and data retrieval procedures of the biomaRt package and then to the interface functions biomart() and organismBM() implemented in biomartr. These functions are based on the biomaRt methodology but aim to offer a more intuitive way of interacting with BioMart.
The best way to get started with the methodology presented by biomaRt is to understand the workflow of data retrieval. The database provided by BioMart is organized into so-called marts, datasets, and attributes. When users want to retrieve information for a specific organism of interest, they first need to specify the mart and dataset in which the information for that organism can be found; subsequently they can specify the attributes that shall be returned for that organism.
The availability of marts, datasets, and attributes can be checked with the following functions:
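# install the biomaRt package from Bioconductor
# (biocLite() has been replaced by the BiocManager package)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")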
# load biomaRt
library(biomaRt)
# look at top 10 databases
head(listMarts(host = "www.ensembl.org"), 10)
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 83
2 ENSEMBL_MART_SNP Ensembl Variation 83
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 83
4 ENSEMBL_MART_VEGA Vega 63
5 pride PRIDE (EBI UK)
Users will observe that several marts providing annotation for specific classes of organisms or groups of organisms are available.
For our example, we will choose the ENSEMBL_MART_ENSEMBL mart and list all available datasets that belong to this mart.
head(listDatasets(useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10)
dataset description version
1 oanatinus_gene_ensembl Ornithorhynchus anatinus genes (OANA5) OANA5
2 cporcellus_gene_ensembl Cavia porcellus genes (cavPor3) cavPor3
3 gaculeatus_gene_ensembl Gasterosteus aculeatus genes (BROADS1) BROADS1
4 lafricana_gene_ensembl Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2
6 choffmanni_gene_ensembl Choloepus hoffmanni genes (choHof1) choHof1
7 csavignyi_gene_ensembl Ciona savignyi genes (CSAV2.0) CSAV2.0
8 fcatus_gene_ensembl Felis catus genes (Felis_catus_6.2) Felis_catus_6.2
9 rnorvegicus_gene_ensembl Rattus norvegicus genes (Rnor_6.0) Rnor_6.0
10 psinensis_gene_ensembl Pelodiscus sinensis genes (PelSin_1.0) PelSin_1.0
The useMart() function is a wrapper function provided by biomaRt to connect to a selected BioMart database (mart) and, optionally, to a dataset stored within this mart.
We select the dataset hsapiens_gene_ensembl and now check for available attributes (annotation data) that can be accessed for Homo sapiens genes.
head(listAttributes(useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",host = "www.ensembl.org"))), 10)
name description
1 ensembl_gene_id Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3 ensembl_peptide_id Ensembl Protein ID
4 ensembl_exon_id Ensembl Exon ID
5 description Description
6 chromosome_name Chromosome Name
7 start_position Gene Start (bp)
8 end_position Gene End (bp)
9 strand Strand
10 band Band
Please note the nested structure of this attribute query. An attribute query requires an additional wrapper function named useDataset(), in which the dataset and the connection returned by useMart() need to be specified. The result is a table storing the names of the available attributes for Homo sapiens as well as a short description of each.
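The nested call can be easier to read when it is split into named steps. The following is a minimal sketch that wraps the same two biomaRt calls used above, useMart() and useDataset(), in a small helper. The helper name connect_dataset() is hypothetical and not part of biomaRt, and actually calling it requires the biomaRt package and network access to www.ensembl.org.

```r
# Hypothetical helper (not part of biomaRt): connect to a mart first,
# then select a dataset within it, mirroring the nested call above.
connect_dataset <- function(dataset,
                            mart = "ENSEMBL_MART_ENSEMBL",
                            host = "www.ensembl.org") {
  mart_connection <- biomaRt::useMart(mart, host = host)
  biomaRt::useDataset(dataset = dataset, mart = mart_connection)
}

# Usage (requires biomaRt and network access), left commented out:
# hsapiens <- connect_dataset("hsapiens_gene_ensembl")
# head(biomaRt::listAttributes(hsapiens), 10)
```

Keeping the connection in a named object also avoids reconnecting for every listAttributes() or listFilters() call.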
Furthermore, users can retrieve all filters for Homo sapiens that can be specified in the actual BioMart query.
head(listFilters(useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))), 10)
name description
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
7 marker_end Marker End
8 encode_region Encode region
9 strand Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
After accumulating all this information, it is now possible to perform an actual BioMart query using the getBM() function.
In this example we will retrieve the attributes start_position, end_position, and description for the Homo sapiens gene "GUCA2A".
Since the input gene is given as an HGNC symbol, we need to specify the filters argument as filters = "hgnc_symbol".
# 1) select a mart and data set
mart <- useDataset("hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))
# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- "GUCA2A"
resultTable <- getBM(attributes = c("start_position","end_position","description"),
filters = "hgnc_symbol", values = geneSet, mart = mart)
resultTable
start_position end_position
1 42162691 42164718
description
1 guanylate cyclase activator 2A (guanylin) [Source:HGNC Symbol;Acc:HGNC:4682]
When using getBM(), users can pass any of the attributes returned by listAttributes() to the attributes argument of the getBM() function.
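Because attribute names must match the entries returned by listAttributes() exactly, it can help to check a request before sending the query. The sketch below works on a small mock table instead of a live listAttributes() result, and the helper missing_attributes() is a hypothetical name introduced only for illustration.

```r
# Mock excerpt of a listAttributes() result (names taken from the output above).
available <- data.frame(
  name = c("ensembl_gene_id", "description", "start_position", "end_position"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the requested attributes that are NOT available.
missing_attributes <- function(requested, available_names) {
  setdiff(requested, available_names)
}

requested <- c("start_position", "end_position", "descripton")  # note the typo
missing_attributes(requested, available$name)
```

In this mock example the misspelled "descripton" is reported, so the typo is caught before a remote query is attempted.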
biomartr
This query methodology provided by BioMart and the biomaRt package is a very well-defined approach for accurate annotation retrieval. Nevertheless, when learning this query methodology it can (subjectively) seem non-intuitive from the user's perspective. Therefore, the biomartr package provides another query methodology that aims to be more organism-centric.
Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart() function implemented in the biomartr package:
1) get attributes, datasets, and marts via organismAttributes()
2) choose available filters via organismFilters()
3) specify a set of query genes
4) specify all arguments of the biomart() function using steps 1) - 3) and perform a BioMart query
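The workflow steps can be sketched as a single function. This is only an illustration under stated assumptions: the wrapper name run_biomart_workflow() is hypothetical, the biomartr calls use the argument names shown elsewhere in this section, and running the function requires the biomartr package and network access.

```r
# Hypothetical wrapper around the workflow steps; requires biomartr and
# network access when it is actually called.
run_biomart_workflow <- function(organism, genes, mart, dataset,
                                 attributes, filters) {
  # look up the attributes and filters offered for this organism
  candidate_attributes <- biomartr::organismAttributes(organism)
  candidate_filters    <- biomartr::organismFilters(organism)
  # fail early if a requested name is not offered for this organism
  stopifnot(all(attributes %in% candidate_attributes$name),
            all(filters %in% candidate_filters$name))
  # pass the query genes and all remaining arguments to biomart()
  biomartr::biomart(genes = genes, mart = mart, dataset = dataset,
                    attributes = attributes, filters = filters)
}

# Usage (left commented out because it queries BioMart remotely):
# run_biomart_workflow("Homo sapiens", genes = "NP_000005",
#                      mart = "ENSEMBL_MART_ENSEMBL",
#                      dataset = "hsapiens_gene_ensembl",
#                      attributes = c("ensembl_gene_id", "ensembl_peptide_id"),
#                      filters = "refseq_peptide")
```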
Note that dataset names change frequently as dataset versions are updated. If some query functions do not work properly, users should check with organismAttributes(update = TRUE) whether their dataset name has changed. For example, organismAttributes("Homo sapiens", topic = "id", update = TRUE) might reveal that a dataset name listed under the mart ENSEMBL_MART_ENSEMBL has changed.
The getMarts() function allows users to list all available databases that can be accessed through BioMart interfaces.
# load the biomartr package
library(biomartr)
# list all available databases
getMarts()
mart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 83
2 ENSEMBL_MART_SEQUENCE Sequence
3 ENSEMBL_MART_ONTOLOGY Ontology
4 ENSEMBL_MART_GENOMIC Genomic features 83
5 ENSEMBL_MART_SNP Ensembl Variation 83
6 ENSEMBL_MART_FUNCGEN Ensembl Regulation 83
7 ENSEMBL_MART_VEGA Vega 63
8 pride PRIDE (EBI UK)
Now users can select a specific database and list all datasets that can be accessed through it. In this example we choose the ENSEMBL_MART_ENSEMBL database.
head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5)
dataset description version
1 oanatinus_gene_ensembl Ornithorhynchus anatinus genes (OANA5) OANA5
2 cporcellus_gene_ensembl Cavia porcellus genes (cavPor3) cavPor3
3 gaculeatus_gene_ensembl Gasterosteus aculeatus genes (BROADS1) BROADS1
4 lafricana_gene_ensembl Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2
The dataset for Homo sapiens, hsapiens_gene_ensembl, appears further down in this table, as the remaining rows show.
tail(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38)
dataset description version
32 hsapiens_gene_ensembl Homo sapiens genes (GRCh38.p5) GRCh38.p5
33 pformosa_gene_ensembl Poecilia formosa genes (PoeFor_5.1.2) PoeFor_5.1.2
34 mfuro_gene_ensembl Mustela putorius furo genes (MusPutFur1.0) MusPutFur1.0
35 tbelangeri_gene_ensembl Tupaia belangeri genes (tupBel1) tupBel1
36 ggallus_gene_ensembl Gallus gallus genes (Galgal4) Galgal4
37 xtropicalis_gene_ensembl Xenopus tropicalis genes (JGI4.2) JGI4.2
38 ecaballus_gene_ensembl Equus caballus genes (EquCab2) EquCab2
39 pabelii_gene_ensembl Pongo abelii genes (PPYG2) PPYG2
40 xmaculatus_gene_ensembl Xiphophorus maculatus genes (Xipmac4.4.2) Xipmac4.4.2
41 drerio_gene_ensembl Danio rerio genes (GRCz10) GRCz10
42 lchalumnae_gene_ensembl Latimeria chalumnae genes (LatCha1) LatCha1
43 tnigroviridis_gene_ensembl Tetraodon nigroviridis genes (TETRAODON8.0) TETRAODON8.0
44 amelanoleuca_gene_ensembl Ailuropoda melanoleuca genes (ailMel1) ailMel1
45 mmulatta_gene_ensembl Macaca mulatta genes (MMUL_1) MMUL_1
46 pvampyrus_gene_ensembl Pteropus vampyrus genes (pteVam1) pteVam1
47 panubis_gene_ensembl Papio anubis genes (PapAnu2.0) PapAnu2.0
48 mdomestica_gene_ensembl Monodelphis domestica genes (monDom5) monDom5
49 acarolinensis_gene_ensembl Anolis carolinensis genes (AnoCar2.0) AnoCar2.0
50 vpacos_gene_ensembl Vicugna pacos genes (vicPac1) vicPac1
51 tsyrichta_gene_ensembl Tarsius syrichta genes (tarSyr1) tarSyr1
52 ogarnettii_gene_ensembl Otolemur garnettii genes (OtoGar3) OtoGar3
53 dmelanogaster_gene_ensembl Drosophila melanogaster genes (BDGP6) BDGP6
54 mmurinus_gene_ensembl Microcebus murinus genes (micMur1) micMur1
55 loculatus_gene_ensembl Lepisosteus oculatus genes (LepOcu1) LepOcu1
56 olatipes_gene_ensembl Oryzias latipes genes (HdrR) HdrR
57 ggorilla_gene_ensembl Gorilla gorilla genes (gorGor3.1) gorGor3.1
58 oprinceps_gene_ensembl Ochotona princeps genes (OchPri2.0) OchPri2.0
59 dordii_gene_ensembl Dipodomys ordii genes (dipOrd1) dipOrd1
60 oaries_gene_ensembl Ovis aries genes (Oar_v3.1) Oar_v3.1
61 mmusculus_gene_ensembl Mus musculus genes (GRCm38.p4) GRCm38.p4
62 mgallopavo_gene_ensembl Meleagris gallopavo genes (UMD2) UMD2
63 gmorhua_gene_ensembl Gadus morhua genes (gadMor1) gadMor1
64 aplatyrhynchos_gene_ensembl Anas platyrhynchos genes (BGI_duck_1.0) BGI_duck_1.0
65 saraneus_gene_ensembl Sorex araneus genes (sorAra1) sorAra1
66 sharrisii_gene_ensembl Sarcophilus harrisii genes (DEVIL7.0) DEVIL7.0
67 meugenii_gene_ensembl Macropus eugenii genes (Meug_1.0) Meug_1.0
68 btaurus_gene_ensembl Bos taurus genes (UMD3.1) UMD3.1
69 cfamiliaris_gene_ensembl Canis familiaris genes (CanFam3.1) CanFam3.1
Now that you have selected a database (ENSEMBL_MART_ENSEMBL) and a dataset (hsapiens_gene_ensembl), you can list all available attributes for this dataset using the getAttributes() function.
# list all available attributes for dataset: hsapiens_gene_ensembl
head( getAttributes(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl"), 10 )
name description
1 ensembl_gene_id Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3 ensembl_peptide_id Ensembl Protein ID
4 ensembl_exon_id Ensembl Exon ID
5 description Description
6 chromosome_name Chromosome Name
7 start_position Gene Start (bp)
8 end_position Gene End (bp)
9 strand Strand
10 band Band
Finally, the getFilters() function allows users to list the filters available for a specific dataset that can be used in a biomart() query.
# list all available filters for dataset: hsapiens_gene_ensembl
head( getFilters(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl"), 10 )
name description
1 chromosome_name Chromosome name
2 start Gene Start (bp)
3 end Gene End (bp)
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
7 marker_end Marker End
8 encode_region Encode region
9 strand Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
In most use cases, users will work with a single model organism or a set of model organisms, and they will mostly be interested in specific annotations for these organisms. The organismBM() function addresses this need and provides users with an organism-centric query to the marts and datasets that are available for a particular organism of interest.
Note that when running the following functions for the first time, the data retrieval procedure will take some time due to the remote access to BioMart. The corresponding result is then saved in a *.txt file named _biomart/listDatasets.txt within the tempdir() folder, allowing subsequent queries to run much faster. The tempdir() folder, however, is deleted when the R session ends, so in a new session the initial call of the subsequent functions will again take time to retrieve all organism-specific data from the BioMart API.
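The caching behaviour described above can be illustrated with a small, self-contained sketch. It does not reproduce the biomartr internals: the helper cached_table(), the file name, and the stand-in query are hypothetical and only demonstrate the general pattern of storing a result under tempdir() and reusing it within the same session.

```r
# Hypothetical helper illustrating the caching pattern described above:
# compute a table once, store it as a text file under tempdir(), and
# read it back on later calls within the same R session.
cached_table <- function(file_name, compute) {
  cache_file <- file.path(tempdir(), file_name)
  if (!file.exists(cache_file)) {
    utils::write.table(compute(), cache_file, sep = "\t",
                       quote = FALSE, row.names = FALSE)
  }
  utils::read.table(cache_file, sep = "\t", header = TRUE,
                    stringsAsFactors = FALSE)
}

# Stand-in for a slow remote query (mock data, no network access).
n_calls <- 0
slow_query <- function() {
  n_calls <<- n_calls + 1
  data.frame(dataset = "hsapiens_gene_ensembl", version = "GRCh38.p5")
}

first  <- cached_table("example_listDatasets.txt", slow_query)
second <- cached_table("example_listDatasets.txt", slow_query)
identical(first, second)  # both calls return the same table
```

Only the first call runs the stand-in query; the second call reads the stored file, which is why repeated queries in one session are faster.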
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Homo sapiens")
organism_name
1 Homo sapiens
2 Homo sapiens
3 Homo sapiens
4 Homo sapiens
5 Homo sapiens
6 Homo sapiens
7 Homo sapiens
8 Homo sapiens
9 Homo sapiens
10 Homo sapiens
11 Homo sapiens
12 Homo sapiens
description
1 Homo sapiens genes (GRCh38.p5)
2 Homo sapiens Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p5)
3 Homo sapiens Structural Variants (GRCh38.p5)
4 Homo sapiens Somatic Structural Variants (GRCh38.p5)
5 Homo sapiens Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p5)
6 Homo sapiens Regulatory Evidence (GRCh38.p5)
7 Homo sapiens Binding Motifs (GRCh38.p5)
8 Homo sapiens Regulatory Features (GRCh38.p5)
9 Homo sapiens miRNA Target Regions (GRCh38.p5)
10 Homo sapiens Regulatory Segments (GRCh38.p5)
11 Homo sapiens Other Regulatory Regions (GRCh38.p5)
12 Homo sapiens genes (GRCh38.p5)
mart dataset version
1 ENSEMBL_MART_ENSEMBL hsapiens_gene_ensembl GRCh38.p5
2 ENSEMBL_MART_SNP hsapiens_snp GRCh38.p5
3 ENSEMBL_MART_SNP hsapiens_structvar GRCh38.p5
4 ENSEMBL_MART_SNP hsapiens_structvar_som GRCh38.p5
5 ENSEMBL_MART_SNP hsapiens_snp_som GRCh38.p5
6 ENSEMBL_MART_FUNCGEN hsapiens_annotated_feature GRCh38.p5
7 ENSEMBL_MART_FUNCGEN hsapiens_motif_feature GRCh38.p5
8 ENSEMBL_MART_FUNCGEN hsapiens_regulatory_feature GRCh38.p5
9 ENSEMBL_MART_FUNCGEN hsapiens_mirna_target_feature GRCh38.p5
10 ENSEMBL_MART_FUNCGEN hsapiens_segmentation_feature GRCh38.p5
11 ENSEMBL_MART_FUNCGEN hsapiens_external_feature GRCh38.p5
12 ENSEMBL_MART_VEGA hsapiens_gene_vega GRCh38.p5
The result is a table storing all marts and datasets from which annotations can be retrieved for Homo sapiens. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.
Users will observe that 4 different marts provide 12 different datasets storing annotation information for Homo sapiens.
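A quick way to count the marts and datasets in such a result is to tabulate the mart column. The sketch below uses a mock copy of the mart column from the organismBM() output shown above, so it runs without network access; with a live result the same table() call could be applied to the returned data.frame.

```r
# Mock copy of the mart column from the organismBM() output above.
hsapiens_bm <- data.frame(
  mart = c("ENSEMBL_MART_ENSEMBL",
           rep("ENSEMBL_MART_SNP", 4),
           rep("ENSEMBL_MART_FUNCGEN", 6),
           "ENSEMBL_MART_VEGA"),
  stringsAsFactors = FALSE
)

# Number of datasets offered by each mart
datasets_per_mart <- table(hsapiens_bm$mart)
datasets_per_mart

length(datasets_per_mart)  # number of distinct marts
nrow(hsapiens_bm)          # total number of datasets
```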
Please note, however, that scientific names of organisms must be written correctly. For example, "Homo Sapiens" will not be recognized, whereas "Homo sapiens" will be.
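A simple spelling check can catch such mistakes before a query is sent. The following sketch is an assumption-laden illustration: the helper is_binomial_name() is hypothetical, and it only accepts the plain "Genus species" form, so valid names with subspecies or hyphens would be rejected by it.

```r
# Hypothetical check for the "Genus species" spelling: capitalised genus,
# lower-case species epithet, separated by a single space.
is_binomial_name <- function(x) {
  grepl("^[[:upper:]][[:lower:]]+ [[:lower:]]+$", x)
}

is_binomial_name(c("Homo sapiens", "Homo Sapiens", "homo sapiens"))
```

Only the first spelling passes the check; the other two differ from it in capitalisation.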
Similar to the biomaRt package query methodology, users need to specify attributes and filters to be able to perform accurate BioMart queries. Here the functions organismAttributes() and organismFilters() provide useful and intuitive ways to obtain this information.
# return available attributes for "Homo sapiens"
head(organismAttributes("Homo sapiens"), 20)
name description dataset
1 ensembl_gene_id Ensembl Gene ID hsapiens_gene_ensembl
2 ensembl_transcript_id Ensembl Transcript ID hsapiens_gene_ensembl
3 ensembl_peptide_id Ensembl Protein ID hsapiens_gene_ensembl
4 ensembl_exon_id Ensembl Exon ID hsapiens_gene_ensembl
5 description Description hsapiens_gene_ensembl
6 chromosome_name Chromosome Name hsapiens_gene_ensembl
7 start_position Gene Start (bp) hsapiens_gene_ensembl
8 end_position Gene End (bp) hsapiens_gene_ensembl
9 strand Strand hsapiens_gene_ensembl
10 band Band hsapiens_gene_ensembl
11 transcript_start Transcript Start (bp) hsapiens_gene_ensembl
12 transcript_end Transcript End (bp) hsapiens_gene_ensembl
13 transcription_start_site Transcription Start Site (TSS) hsapiens_gene_ensembl
14 transcript_length Transcript length (including UTRs and CDS) hsapiens_gene_ensembl
15 transcript_tsl Transcript Support Level (TSL) hsapiens_gene_ensembl
16 transcript_gencode_basic GENCODE basic annotation hsapiens_gene_ensembl
17 transcript_appris APPRIS annotation hsapiens_gene_ensembl
18 external_gene_name Associated Gene Name hsapiens_gene_ensembl
19 external_gene_source Associated Gene Source hsapiens_gene_ensembl
20 external_transcript_name Associated Transcript Name hsapiens_gene_ensembl
mart
1 ENSEMBL_MART_ENSEMBL
2 ENSEMBL_MART_ENSEMBL
3 ENSEMBL_MART_ENSEMBL
4 ENSEMBL_MART_ENSEMBL
5 ENSEMBL_MART_ENSEMBL
6 ENSEMBL_MART_ENSEMBL
7 ENSEMBL_MART_ENSEMBL
8 ENSEMBL_MART_ENSEMBL
9 ENSEMBL_MART_ENSEMBL
10 ENSEMBL_MART_ENSEMBL
11 ENSEMBL_MART_ENSEMBL
12 ENSEMBL_MART_ENSEMBL
13 ENSEMBL_MART_ENSEMBL
14 ENSEMBL_MART_ENSEMBL
15 ENSEMBL_MART_ENSEMBL
16 ENSEMBL_MART_ENSEMBL
17 ENSEMBL_MART_ENSEMBL
18 ENSEMBL_MART_ENSEMBL
19 ENSEMBL_MART_ENSEMBL
20 ENSEMBL_MART_ENSEMBL
Users will observe that the organismAttributes() function returns a data.frame storing the attribute names, datasets, and marts that are available for Homo sapiens.
An additional feature provided by organismAttributes() is the topic argument. The topic argument allows users to search for specific attributes, topics, or categories for faster filtering.
# search for attribute topic "id"
head(organismAttributes("Homo sapiens", topic = "id"), 20)
name description
1 ensembl_gene_id Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3 ensembl_peptide_id Ensembl Protein ID
4 ensembl_exon_id Ensembl Exon ID
34 study_external_id Study External Reference
35 go_id GO Term Accession
49 dbass3_id Database of Aberrant 3' Splice Sites (DBASS3) IDs
51 dbass5_id Database of Aberrant 5' Splice Sites (DBASS5) IDs
64 hgnc_id HGNC ID(s)
68 mim_morbid_accession MIM Morbid Accession
69 mim_morbid_description MIM Morbid Description
73 mirbase_id miRBase ID(s)
76 protein_id Protein (Genbank) ID [e.g. AAA02487]
84 refseq_peptide RefSeq Protein ID [e.g. NP_001005353]
85 refseq_peptide_predicted RefSeq Predicted Protein ID [e.g. XP_001720922]
96 wikigene_id WikiGene ID
182 ensembl_gene_id Ensembl Gene ID
183 ensembl_transcript_id Ensembl Transcript ID
184 ensembl_peptide_id Ensembl Protein ID
213 ensembl_exon_id Ensembl Exon ID
dataset mart
1 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
34 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
35 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
49 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
51 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
64 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
68 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
69 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
73 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
76 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
84 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
85 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
96 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
182 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
183 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
184 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
213 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
Now all attributes that have id as part of their name are returned.
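The effect of the topic argument can be mimicked offline with a plain substring match on the attribute names. This is a sketch under the assumption, suggested by the output above, that topic matches anywhere within the name; the helper filter_by_topic() and the mock table are hypothetical and not part of biomartr.

```r
# Mock excerpt of an attribute table (names taken from the outputs in this section).
attribute_tbl <- data.frame(
  name = c("ensembl_gene_id", "description", "chromosome_name",
           "go_id", "refseq_peptide", "strand"),
  stringsAsFactors = FALSE
)

# Keep rows whose name contains the topic as a plain substring.
filter_by_topic <- function(tbl, topic) {
  tbl[grepl(topic, tbl$name, fixed = TRUE), , drop = FALSE]
}

filter_by_topic(attribute_tbl, "id")$name
```

Note that refseq_peptide is kept as well, because "peptide" contains the letters "id". This matches the behaviour visible in the output above.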
Another example is topic = "homolog".
# search for attribute topic "homolog"
head(organismAttributes("Homo sapiens", topic = "homolog"), 20)
name description
229 vpacos_homolog_ensembl_gene Alpaca Ensembl Gene ID
230 vpacos_homolog_canonical_transcript_protein Canonical Protein or Transcript ID
231 vpacos_homolog_ensembl_peptide Alpaca Ensembl Protein ID
232 vpacos_homolog_chromosome Alpaca Chromosome Name
233 vpacos_homolog_chrom_start Alpaca Chromosome Start (bp)
234 vpacos_homolog_chrom_end Alpaca Chromosome End (bp)
235 vpacos_homolog_orthology_type Homology Type
236 vpacos_homolog_subtype Ancestor
237 vpacos_homolog_orthology_confidence Orthology confidence [0 low, 1 high]
238 vpacos_homolog_perc_id % Identity with respect to query gene
239 vpacos_homolog_perc_id_r1 % Identity with respect to Alpaca gene
240 pformosa_homolog_ensembl_gene Amazon molly Ensembl Gene ID
241 pformosa_homolog_canonical_transcript_protein Canonical Protein or Transcript ID
242 pformosa_homolog_ensembl_peptide Amazon molly Ensembl Protein ID
243 pformosa_homolog_chromosome Amazon molly Chromosome Name
244 pformosa_homolog_chrom_start Amazon molly Chromosome Start (bp)
245 pformosa_homolog_chrom_end Amazon molly Chromosome End (bp)
246 pformosa_homolog_orthology_type Homology Type
247 pformosa_homolog_subtype Ancestor
248 pformosa_homolog_orthology_confidence Orthology confidence [0 low, 1 high]
dataset mart
229 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
230 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
231 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
232 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
233 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
234 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
235 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
236 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
237 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
238 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
239 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
240 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
241 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
242 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
243 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
244 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
245 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
246 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
247 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
248 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
Or topic = "dn" and topic = "ds" for dN and dS value retrieval.
# search for attribute topic "dn"
head(organismAttributes("Homo sapiens", topic = "dn"))
name description
209 cdna_coding_start cDNA coding start
210 cdna_coding_end cDNA coding end
262 acarolinensis_homolog_dn dN
264 dnovemcinctus_homolog_ensembl_gene Armadillo Ensembl Gene ID
265 dnovemcinctus_homolog_canonical_transcript_protein Canonical Protein or Transcript ID
266 dnovemcinctus_homolog_ensembl_peptide Armadillo Ensembl Protein ID
dataset mart
209 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
210 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
262 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
264 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
265 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
266 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
# search for attribute topic "ds"
head(organismAttributes("Homo sapiens", topic = "ds"))
name description dataset mart
48 ccds CCDS ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
199 cds_length CDS Length hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
214 cds_start CDS Start hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
215 cds_end CDS End hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
263 acarolinensis_homolog_ds dS hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
276 dnovemcinctus_homolog_ds dS hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
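As the outputs above show, short topics such as "dn" also match unrelated names like cdna_coding_start. A possible follow-up step, sketched here with the names copied from the topic = "dn" output, is to keep only the names that end in "_dn". The same idea applies to "_ds".

```r
# Attribute names copied from the topic = "dn" output above.
dn_hits <- c("cdna_coding_start", "cdna_coding_end",
             "acarolinensis_homolog_dn",
             "dnovemcinctus_homolog_ensembl_gene",
             "dnovemcinctus_homolog_canonical_transcript_protein",
             "dnovemcinctus_homolog_ensembl_peptide")

# Keep only names that end in "_dn", i.e. the dN value attributes.
dn_only <- dn_hits[grepl("_dn$", dn_hits)]
dn_only
```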
Analogous to the organismAttributes() function, the organismFilters() function returns all filters that are available for a query organism of interest.
# return available filters for "Homo sapiens"
head(organismFilters("Homo sapiens"), 20)
name
1 chromosome_name
2 start
3 end
4 band_start
5 band_end
6 marker_start
7 marker_end
8 encode_region
9 strand
10 chromosomal_region
11 with_hgnc
12 with_hgnc_transcript_name
13 with_ox_arrayexpress
14 with_ccds
15 with_chembl
16 with_ox_clone_based_ensembl_gene
17 with_ox_clone_based_ensembl_transcript
18 with_ox_clone_based_vega_gene
19 with_ox_clone_based_vega_transcript
20 with_dbass3
description dataset
1 Chromosome name hsapiens_gene_ensembl
2 Gene Start (bp) hsapiens_gene_ensembl
3 Gene End (bp) hsapiens_gene_ensembl
4 Band Start hsapiens_gene_ensembl
5 Band End hsapiens_gene_ensembl
6 Marker Start hsapiens_gene_ensembl
7 Marker End hsapiens_gene_ensembl
8 Encode region hsapiens_gene_ensembl
9 Strand hsapiens_gene_ensembl
10 Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1) hsapiens_gene_ensembl
11 with HGNC ID(s) hsapiens_gene_ensembl
12 with HGNC transcript name(s) hsapiens_gene_ensembl
13 with ArrayExpress ID(s) hsapiens_gene_ensembl
14 with CCDS ID(s) hsapiens_gene_ensembl
15 with ChEMBL ID(s) hsapiens_gene_ensembl
16 with clone based Ensembl gene ID(s) hsapiens_gene_ensembl
17 with clone based Ensembl transcript ID(s) hsapiens_gene_ensembl
18 with clone based VEGA gene ID(s) hsapiens_gene_ensembl
19 with clone based VEGA transcript ID(s) hsapiens_gene_ensembl
20 with DBASS3 ID(s) hsapiens_gene_ensembl
mart
1 ENSEMBL_MART_ENSEMBL
2 ENSEMBL_MART_ENSEMBL
3 ENSEMBL_MART_ENSEMBL
4 ENSEMBL_MART_ENSEMBL
5 ENSEMBL_MART_ENSEMBL
6 ENSEMBL_MART_ENSEMBL
7 ENSEMBL_MART_ENSEMBL
8 ENSEMBL_MART_ENSEMBL
9 ENSEMBL_MART_ENSEMBL
10 ENSEMBL_MART_ENSEMBL
11 ENSEMBL_MART_ENSEMBL
12 ENSEMBL_MART_ENSEMBL
13 ENSEMBL_MART_ENSEMBL
14 ENSEMBL_MART_ENSEMBL
15 ENSEMBL_MART_ENSEMBL
16 ENSEMBL_MART_ENSEMBL
17 ENSEMBL_MART_ENSEMBL
18 ENSEMBL_MART_ENSEMBL
19 ENSEMBL_MART_ENSEMBL
20 ENSEMBL_MART_ENSEMBL
The organismFilters() function also allows users to search for filters that correspond to a specific topic or category.
# search for filter topic "id"
head(organismFilters("Homo sapiens", topic = "id"), 20)
name description
31 with_go_id with GO Term Accession(s)
36 with_mim_morbid with MIM disease ID(s)
43 with_protein_id with protein (Genbank) ID(s)
53 with_refseq_peptide with RefSeq protein ID(s)
54 with_refseq_peptide_predicted with RefSeq predicted protein ID(s)
63 ensembl_gene_id Ensembl Gene ID(s) [e.g. ENSG00000139618]
64 ensembl_transcript_id Ensembl Transcript ID(s) [e.g. ENST00000380152]
65 ensembl_peptide_id Ensembl protein ID(s) [e.g. ENSP00000369497]
66 ensembl_exon_id Ensembl exon ID(s) [e.g. ENSE00001508081]
67 hgnc_id HGNC ID(s) [e.g. HGNC:8030]
87 go_id GO Term Accession(s) [e.g. GO:0005515]
92 mim_morbid_accession MIM Morbid Accession(s) [e.g. 540000]
93 mirbase_id miRBase ID(s) [e.g. hsa-mir-137]
97 protein_id Protein (Genbank) ID(s) [e.g. ACU09872]
105 refseq_peptide RefSeq protein ID(s) [e.g. NP_001005353]
106 refseq_peptide_predicted RefSeq predicted protein ID(s) [e.g. XP_011520427]
119 wikigene_id WikiGene ID(s) [e.g. 115286]
197 go_evidence_code GO Evidence code
300 with_validated_snp Variant supporting evidence
325 with_validated Variant supporting evidence
dataset mart
31 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
36 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
43 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
53 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
54 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
63 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
64 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
65 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
66 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
67 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
87 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
92 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
93 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
97 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
105 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
106 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
119 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
197 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
300 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
325 hsapiens_snp ENSEMBL_MART_ENSEMBL
This short introduction to the functionality of organismBM(), organismAttributes(), and organismFilters() allows users to perform BioMart queries in an intuitive, organism-centric way. The main function to perform BioMart queries is biomart().
For the following examples we will assume that we are interested in the annotation of specific genes from the Homo sapiens proteome. We want to map the corresponding refseq gene ids to a set of other gene ids used in other databases. For this purpose, we first need to consult the organismAttributes() function.
head(organismAttributes("Homo sapiens", topic = "id"))
name description dataset mart
1 ensembl_gene_id Ensembl Gene ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2 ensembl_transcript_id Ensembl Transcript ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3 ensembl_peptide_id Ensembl Protein ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4 ensembl_exon_id Ensembl Exon ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
34 study_external_id Study External Reference hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
35 go_id GO Term Accession hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
# retrieve the proteome of Homo sapiens from refseq
getProteome( db = "refseq",
kingdom = "vertebrate_mammalian",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes") )
file_path <- file.path("_ncbi_downloads","proteomes","Homo_sapiens_protein.faa.gz")
Hsapiens_proteome <- read_proteome(file_path, format = "fasta")
# remove splice variants from id
gene_set <- unlist(sapply(strsplit(Hsapiens_proteome[1:5 , geneids], ".",fixed = TRUE),function(x) x[1]))
result_BM <- biomart( genes = gene_set,
mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
attributes = c("ensembl_gene_id","ensembl_peptide_id"),
filters = "refseq_peptide")
result_BM
refseq_peptide ensembl_gene_id ensembl_peptide_id
1 NP_000005 ENSG00000175899 ENSP00000323929
2 NP_000006 ENSG00000156006 ENSP00000286479
3 NP_000007 ENSG00000117054 ENSP00000359878
4 NP_000008 ENSG00000122971 ENSP00000242592
5 NP_000009 ENSG00000072778 ENSP00000349297
The biomart() function takes as arguments a set of genes (gene ids of the type specified in the filters argument), the corresponding mart and dataset, as well as the attributes which shall be returned.
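The id clean-up step used before the query above can be tried offline. The sketch below uses illustrative versioned RefSeq ids (the version suffixes are example values, not taken from the proteome file) and compares the split-based approach with an equivalent sub() call.

```r
# Example RefSeq protein ids with version suffixes (illustrative values).
versioned_ids <- c("NP_000005.2", "NP_000006.2", "NP_000007.1")

# Approach used above: split at the dot and keep the first piece.
ids_split <- unlist(lapply(strsplit(versioned_ids, ".", fixed = TRUE),
                           function(x) x[1]))

# Equivalent base R alternative: drop a trailing ".<digits>" suffix.
ids_sub <- sub("\\.[0-9]+$", "", versioned_ids)

identical(ids_split, ids_sub)
```

Either form yields ids without the version suffix, which is the format expected by the refseq_peptide filter in the query above.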
The biomartr package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO() function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO() function allows GO information retrieval from the BioMart database.
In this example we will retrieve GO information for the Homo sapiens gene GUCA2A, identified by its HGNC symbol.
The getGO() function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism of interest needs to be specified. Furthermore, a set of gene ids as well as their corresponding filter notation (GUCA2A gene ids have the filter notation hgnc_symbol; see organismFilters() for details) needs to be specified. The database argument then defines the database from which GO information shall be retrieved.
# search for GO terms of an example Homo sapiens gene
GO_tbl <- getGO(organism = "Homo sapiens",
genes = "GUCA2A",
filters = "hgnc_symbol")
# print the resulting table
GO_tbl
hgnc_symbol goslim_goa_description goslim_goa_accession
1 GUCA2A biological_process GO:0008150
2 GUCA2A molecular_function GO:0003674
3 GUCA2A cellular nitrogen compound metabolic process GO:0034641
4 GUCA2A cellular_component GO:0005575
5 GUCA2A biosynthetic process GO:0009058
6 GUCA2A small molecule metabolic process GO:0044281
7 GUCA2A organelle GO:0043226
8 GUCA2A enzyme regulator activity GO:0030234
9 GUCA2A extracellular region GO:0005576
Hence, for each gene id the resulting table stores all annotated GO terms found in BioMart.
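When several genes are queried, it is often convenient to collect the GO accessions per gene. The sketch below works on a mock table containing three of the rows shown above, so it runs without network access; applying it to a real getGO() result assumes the same column names as in that output.

```r
# Mock copy of three rows of the getGO() result shown above.
GO_example <- data.frame(
  hgnc_symbol = rep("GUCA2A", 3),
  goslim_goa_description = c("biological_process", "molecular_function",
                             "extracellular region"),
  goslim_goa_accession = c("GO:0008150", "GO:0003674", "GO:0005576"),
  stringsAsFactors = FALSE
)

# One character vector of GO accessions per gene
go_by_gene <- split(GO_example$goslim_goa_accession, GO_example$hgnc_symbol)
go_by_gene$GUCA2A
```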