The biomartr package allows users to retrieve biological sequences in a simple and intuitive way.
Using biomartr, users can retrieve genomes, proteomes, or CDS data with the following specialized functions:
getGenome()
getProteome()
getCDS()
meta.retrieval()
First, users can check whether the genome, proteome, or CDS of interest is available for download.
Using the scientific name of the organism of interest, users can check whether the corresponding genome is available via the is.genome.available() function.
# check whether the Arabidopsis thaliana
# genome is available for download
is.genome.available("Arabidopsis thaliana")
[1] TRUE
By specifying the details = TRUE argument, the genome file size as well as additional information can be printed to the console.
# printing details to the console
is.genome.available("Arabidopsis thaliana", details = TRUE)
organism_name kingdoms group subgroup file_size_MB chrs organelles plasmids bio_projects
682 Arabidopsis thaliana Eukaryota Plants Land Plants 119.668 6 2 NA 6
Users will observe that the Arabidopsis thaliana genome file has a size of 119.668 MB.
Note: The availability of genomes has been taken from NCBI.
Users can determine the total number of available genomes using the listGenomes() function.
length(listGenomes())
[1] 15512
Hence, currently 15512 genomes (including all kingdoms of life) are stored on NCBI servers.
Optionally, users can also specify the database for which the availability of organisms shall be checked.
# check whether A. thaliana is available in the refseq database
is.genome.available("Arabidopsis thaliana", database = "refseq")
[1] TRUE
Users can also determine the total number of genomes stored in refseq.
length(listGenomes(database = "refseq"))
[1] 5519
This result shows that so far (year 2016) 5519 genomes are stored in refseq.
The simplest way to work with listGenomes() is to print available genomes to the console.
# the simplest way to retrieve names of available genomes stored within NCBI databases
head(listGenomes() , 5)
[1] "'Chrysanthemum coronarium' phytoplasma"
[2] "'Deinococcus soli' Cha et al. 2014"
[3] "'Echinacea purpurea' witches'-broom phytoplasma"
[4] "Abaca bunchy top virus"
[5] "Abalone herpesvirus Victoria/AUS/2009"
In case users are interested in a detailed output of the corresponding organism file stored on NCBI, they can again specify the details = TRUE argument.
# show all details
head(listGenomes(details = TRUE) , 5)
organism_name kingdoms group
1 'Chrysanthemum coronarium' phytoplasma Bacteria Terrabacteria group
2 'Deinococcus soli' Cha et al. 2014 Bacteria Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria Terrabacteria group
4 Abaca bunchy top virus Viruses ssDNA viruses
5 Abalone herpesvirus Victoria/AUS/2009 Viruses dsDNA viruses, no RNA stage
subgroup file_size_MB chrs organelles plasmids bio_projects
1 Tenericutes 0.739592 NA NA NA 1
2 Deinococcus-Thermus 3.236980 1 NA NA 1
3 Tenericutes 0.545427 NA NA NA 1
4 Nanoviridae 0.006422 6 NA NA 1
5 unclassified 0.211518 1 NA NA 1
Users will observe that the detailed information output includes the columns organism_name, kingdoms, group, subgroup, file_size_MB, chrs, organelles, plasmids, and bio_projects.
In case users are interested in organisms classified into a specific kingdom of life, they can use the kingdom argument to filter for organisms that are classified into the corresponding kingdom.
# show all details only for Bacteria
head(listGenomes(kingdom = "Bacteria", details = TRUE) , 5)
organism_name kingdoms group
1 'Chrysanthemum coronarium' phytoplasma Bacteria Terrabacteria group
2 'Deinococcus soli' Cha et al. 2014 Bacteria Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria Terrabacteria group
4 Abiotrophia defectiva Bacteria Terrabacteria group
5 Acaricomes phytoseiuli Bacteria Terrabacteria group
subgroup file_size_MB chrs organelles plasmids bio_projects
1 Tenericutes 0.739592 NA NA NA 1
2 Deinococcus-Thermus 3.236980 1 NA NA 1
3 Tenericutes 0.545427 NA NA NA 1
4 Firmicutes 2.043440 NA NA NA 1
5 Actinobacteria 2.419520 NA NA NA 1
The following filters can be specified for the kingdom argument: all, Archaea, Bacteria, Eukaryota, Viroids, and Viruses.
Furthermore, users can count the kingdom-specific availability of genomes as follows:
# the number of genomes available for each kingdom
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "kingdoms"])
Archaea Bacteria Eukaryota Viroids Viruses
516 7918 1660 48 5370
Analogous computations can be performed for group, subgroup, etc.
# the number of genomes available for each group
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "group"])
Acidobacteria Animals
30 617
Aquificae Avsunviroidae
17 4
Caldiserica Chrysiogenetes
2 2
Deferribacteres Deltavirus
6 1
Dictyoglomi DPANN group
2 30
dsDNA viruses, no RNA stage dsRNA viruses
2405 244
Elusimicrobia environmental samples
3 7
Euryarchaeota FCB group
322 690
Fungi Fusobacteria
662 31
Nitrospinae/Tectomicrobia group Nitrospirae
3 19
Other Plants
18 179
Pospiviroidae Proteobacteria
36 2718
Protists PVC group
190 113
Retro-transcribing viruses Satellites
134 223
Spirochaetes ssDNA viruses
84 848
ssRNA viruses Synergistetes
1389 23
TACK group Terrabacteria group
130 3097
Thermodesulfobacteria Thermotogae
9 28
unassigned viruses unclassified Archaea
11 29
unclassified archaeal viruses unclassified Bacteria
3 1039
unclassified phages unclassified viroids
29 8
unclassified virophages unclassified viruses
6 71
Users can also order organisms by their file size.
# order by file size
library(dplyr)
ncbi_genomes <- listGenomes(details = TRUE)
head(arrange(ncbi_genomes, desc(file_size_MB)) , 10)
organism_name kingdoms group subgroup file_size_MB chrs organelles
1 Pinus lambertiana Eukaryota Plants Land Plants 27602.70 NA NA
2 Picea glauca Eukaryota Plants Land Plants 24627.00 NA 1
3 Pinus taeda Eukaryota Plants Land Plants 22061.90 NA NA
4 Pseudotsuga menziesii Eukaryota Plants Land Plants 14673.20 NA NA
5 Locusta migratoria Eukaryota Animals Insects 5759.80 NA NA
6 Orycteropus afer Eukaryota Animals Mammals 4444.08 NA 1
7 Chrysochloris asiatica Eukaryota Animals Mammals 4210.11 NA 1
8 Elephantulus edwardii Eukaryota Animals Mammals 3843.98 NA NA
9 Apodemus sylvaticus Eukaryota Animals Mammals 3758.14 NA NA
10 Triticum urartu Eukaryota Plants Land Plants 3747.05 NA NA
plasmids bio_projects
1 NA 1
2 NA 2
3 NA 1
4 NA 1
5 NA 1
6 NA 1
7 NA 1
8 NA 1
9 NA 1
10 NA 1
This analysis shows that Pinus lambertiana has the largest genome available on the NCBI server.
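Because file sizes span several orders of magnitude, it can be useful to filter by size before downloading anything. The following is a minimal sketch in base R (no dplyr needed). It assumes a small hand-built data frame standing in for the output of listGenomes(details = TRUE); the three rows reuse file sizes quoted in this document, and the 1000 MB threshold is an arbitrary illustrative choice.

```r
# Stand-in for listGenomes(details = TRUE); values copied from the outputs above.
genomes <- data.frame(
  organism_name = c("Pinus lambertiana", "Locusta migratoria", "Arabidopsis thaliana"),
  kingdoms      = "Eukaryota",
  file_size_MB  = c(27602.70, 5759.80, 119.668),
  stringsAsFactors = FALSE
)

# Keep only genomes below an (arbitrary) size threshold ...
max_size_MB <- 1000
small <- genomes[genomes$file_size_MB < max_size_MB, ]

# ... order them by file size (ascending) using base R ...
small <- small[order(small$file_size_MB), ]

# ... and estimate the total download volume in MB.
total_MB <- sum(small$file_size_MB)
print(small)
print(total_MB)
```

With real data, the same three steps can be applied directly to the data frame returned by listGenomes(details = TRUE).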
Internally, the listGenomes() function downloads the Genome Reports file from NCBI and stores it in a tempfile() folder under _ncbi_downloads/overview.txt. The file is downloaded only once and is then read from your hard drive. In case users would like to update the Genome Reports file, they can specify the update = TRUE argument, which reloads the Genome Reports file from the NCBI server.
# users can also update the organism table using the 'update' argument
head(listGenomes(details = TRUE, update = TRUE) , 5)
organism_name kingdoms group
1 'Chrysanthemum coronarium' phytoplasma Bacteria Terrabacteria group
2 'Deinococcus soli' Cha et al. 2014 Bacteria Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria Terrabacteria group
4 Abaca bunchy top virus Viruses ssDNA viruses
5 Abalone herpesvirus Victoria/AUS/2009 Viruses dsDNA viruses, no RNA stage
subgroup file_size_MB chrs organelles plasmids bio_projects
1 Tenericutes 0.739592 NA NA NA 1
2 Deinococcus-Thermus 3.236980 1 NA NA 1
3 Tenericutes 0.545427 NA NA NA 1
4 Nanoviridae 0.006422 6 NA NA 1
5 unclassified 0.211518 1 NA NA 1
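The download-once-then-reuse behaviour with an optional update can be sketched as follows. This is not the biomartr implementation: cached_fetch() and fake_download() are hypothetical helpers introduced only for illustration, and the fake download returns two hard-coded lines instead of contacting NCBI.

```r
# Hypothetical helper (not part of biomartr): call fetch_fun() only if the
# cache file is missing or update = TRUE, then read the local copy.
cached_fetch <- function(fetch_fun, cache_file, update = FALSE) {
  if (update || !file.exists(cache_file)) {
    dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
    writeLines(fetch_fun(), cache_file)
  }
  readLines(cache_file)
}

# Stand-in for the real download; counts how often it is called.
n_calls <- 0
fake_download <- function() {
  n_calls <<- n_calls + 1
  c("organism_name\tkingdoms", "Arabidopsis thaliana\tEukaryota")
}

cache_file <- file.path(tempdir(), "_ncbi_downloads", "overview.txt")
unlink(cache_file)  # start from a clean state for this demonstration

first  <- cached_fetch(fake_download, cache_file)                 # triggers a download
second <- cached_fetch(fake_download, cache_file)                 # served from disk
third  <- cached_fetch(fake_download, cache_file, update = TRUE)  # forces a reload
print(n_calls)
```

Only the first call and the call with update = TRUE invoke the download function; the second call reads the cached file.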
Again, the listGenomes() function can be used to filter for available genome information in refseq.
# list all Eukaryota that are stored in refseq
head(listGenomes(kingdom = "Eukaryota", database = "refseq") , 20)
organism_name
1 Agaricus bisporus
2 Auricularia subglabra
3 Komagataella phaffii
4 Arthroderma benhamiae
5 Arthroderma otae
6 Aspergillus clavatus
7 Aspergillus flavus
8 Aspergillus fumigatus
9 Aspergillus nidulans
10 Aspergillus niger
11 Aspergillus oryzae
12 Aspergillus terreus
13 Phaeoacremonium minimum
14 Batrachochytrium dendrobatidis
15 Ordospora colligata
16 Bipolaris oryzae
17 Bipolaris sorokiniana
18 Bipolaris zeicola
19 Botrytis cinerea
20 Candida albicans
Or, analogously:
# the number of genomes available for each kingdom stored in refseq
ncbi_genomes <- listGenomes(details = TRUE, database = "refseq")
table(ncbi_genomes[ , "kingdoms"])
Archaea Bacteria Eukaryota
258 4679 582
Note that when running the listGenomes() function for the first time, it might take a while until the function returns any results, because the necessary information needs to be downloaded from NCBI databases. All subsequent executions of listGenomes() will then respond very fast, because they access the corresponding files stored on your hard drive.
After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, or CDS file in fasta format. The following functions allow users to download proteomes, genomes, and CDS files from database resources such as refseq. When a proteome, genome, or CDS file has been downloaded to your hard drive, a documentation *.txt file is generated that stores File Name, Organism, Database, URL, and DATE information. This improves the reproducibility of the proteome, genome, and CDS versions used in subsequent data analyses.
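To illustrate the idea of such a documentation file, here is a minimal sketch that writes the five fields named above next to a download. The helper write_retrieval_doc(), the file name doc_Arabidopsis_thaliana.txt, the exact line layout, and the placeholder URL string are all assumptions made for this example; the file that biomartr actually writes may look different.

```r
# Hypothetical helper (not part of biomartr): record provenance of a download.
write_retrieval_doc <- function(doc_file, file_name, organism, database, url) {
  doc <- c(
    paste0("File Name: ", file_name),
    paste0("Organism: ",  organism),
    paste0("Database: ",  database),
    paste0("URL: ",       url),
    paste0("DATE: ",      format(Sys.Date()))
  )
  writeLines(doc, doc_file)
  invisible(doc)
}

# Assumed output location and placeholder URL, for illustration only.
doc_file <- file.path(tempdir(), "doc_Arabidopsis_thaliana.txt")
doc <- write_retrieval_doc(
  doc_file  = doc_file,
  file_name = "Arabidopsis_thaliana_genomic.fna.gz",
  organism  = "Arabidopsis thaliana",
  database  = "refseq",
  url       = "<download URL goes here>"
)
cat(readLines(doc_file), sep = "\n")
```

Keeping such a record alongside each sequence file makes it possible to state later exactly which database and date a given analysis was based on.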
The easiest way to download a genome is to use the getGenome() function.
In this example we will download the genome of A. thaliana.
The getGenome() function is an interface function to the NCBI refseq or NCBI genbank databases, from which the corresponding genomes can be retrieved.
For this purpose, users need to specify the kingdom into which their organism of interest is classified, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other" (see also ?getKingdoms), and then the scientific name of the organism of interest.
# download the genome of Arabidopsis thaliana from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
getGenome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","genomes") )
The getGenome() function creates a directory named '_ncbi_downloads/genomes' into which the corresponding genome named Arabidopsis_thaliana_genomic.fna.gz is downloaded. The read_genome() function enables users to work with the genome as a data.table object.
# path to genome: '_ncbi_downloads/genomes/Arabidopsis_thaliana_genomic.fna.gz'
file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genomic.fna.gz")
# read genome as data.table object
Ath_genome <- read_genome(file_path, format = "fasta")
In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here") argument.
In case users wish to download genomes from NCBI genbank instead of NCBI refseq, they can specify the argument db = "genbank" in getGenome().
The getProteome() function is also an interface function to the NCBI refseq or NCBI genbank databases, from which the corresponding proteomes can be retrieved. It works analogously to getGenome().
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","proteomes") )
The getProteome() function creates a directory named _ncbi_downloads/proteomes into which the corresponding proteome named Arabidopsis_thaliana_protein.faa.gz is downloaded. The read_proteome() function enables users to work with the proteome as a data.table object.
# path to proteome: '_ncbi_downloads/proteomes/Arabidopsis_thaliana_protein.faa.gz'
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
# read proteome as data.table object
Ath_proteome <- read_proteome(file_path, format = "fasta")
In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here") argument.
In case users wish to download proteomes from NCBI genbank instead of NCBI refseq, they can specify the argument db = "genbank" in getProteome().
The getCDS() function is also an interface function to the NCBI refseq database, from which the corresponding CDS files are downloaded. It works analogously to the getGenome() and getProteome() functions, but for CDS files.
# download the CDS of Arabidopsis thaliana from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
getCDS( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","CDS") )
The getCDS() function creates a directory named _ncbi_downloads/CDS into which the corresponding CDS file is downloaded. The read_cds() function allows you to read the corresponding CDS as a data.table object.
# path to CDS file: '_ncbi_downloads/CDS/Arabidopsis_thaliana_rna.fna.gz'
file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
# read CDS as data.table object
Ath_cds <- read_cds(file_path, format = "fasta")
In case users would like to store the CDS file at a different location, they can specify the path = file.path("put","your","path","here") argument.
Furthermore, the getCDS() function checks whether the length of every CDS sequence is divisible by 3 (i.e. whether it consists of complete codons). In case the sequences of particular genes are not divisible by 3, a warning message is returned that reports the number of affected genes. In case users would like to remove these sequences from their data, they can specify the delete_corrupt = TRUE argument, which will then delete all corrupt CDS sequences.
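The divisibility check itself is simple to express in base R. The sketch below is not the package's internal code; it uses three made-up toy sequences in a named character vector and reports the count via message() rather than a warning, purely to illustrate the logic of detecting and dropping sequences with incomplete codons.

```r
# Toy CDS sequences (made up for illustration); gene_b has 8 bases.
cds <- c(
  gene_a = "ATGGCCTAA",
  gene_b = "ATGGCCTA",
  gene_c = "ATGAAATTTTGA"
)

# A CDS is considered corrupt if its length is not a multiple of 3.
is_corrupt <- nchar(cds) %% 3 != 0

# Report how many sequences are affected.
if (any(is_corrupt)) {
  message(sum(is_corrupt), " sequence(s) have a length that is not divisible by 3: ",
          paste(names(cds)[is_corrupt], collapse = ", "))
}

# Keep only sequences made of complete codons.
clean_cds <- cds[!is_corrupt]
print(names(clean_cds))
```

Here gene_b is flagged and removed, while gene_a and gene_c are kept.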
For most analyses, only subsets of sequences (taken from the entire genome) are needed. This section introduces several approaches to select a set of sequences for further analyses.
getProteome()
As seen before, the getProteome() function allows users to download the entire proteome of a specific organism of interest that is stored in refseq.
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db = "refseq",
kingdom = "plant",
organism = "Arabidopsis thaliana",
path = file.path("_ncbi_downloads","proteomes") )
As before, we first download the A. thaliana proteome. Suppose we are interested in the following two genes for subsequent analyses: AT1G06090 and AT1G06100. Both genes are members of the fatty acid desaturase family and have functions in oxidoreductase activity.
For this purpose, users can use the biomart() function [see Functional Annotation for details].
The number of genome sequences generated and stored in sequence databases is growing exponentially every day. With the availability of this growing amount of data, meta-genomics studies become more popular and useful for finding patterns within genomes by comparing them to thousands of other genomes. However, the first step in any meta-genomics study is the retrieval of the genomes that shall be compared or investigated.
For this purpose, I implemented the meta.retrieval() function to allow users to perform easy meta-genome retrieval in R.
The getKingdoms() function returns the names of all available kingdoms of life.
getKingdoms()
[1] "archaea" "bacteria" "fungi"
[4] "invertebrate" "plant" "protozoa"
[7] "vertebrate_mammalian" "vertebrate_other"
These kingdoms can be specified in meta.retrieval().
The meta.retrieval() function aims to simplify the genome retrieval process for subsequent meta-genomics studies.
Usually this step is performed with shell scripts. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing workflows.
For example, the pipeline logic of the magrittr package can be used with meta.retrieval().
# download all vertebrate genomes, then apply ...
meta.retrieval(kingdom = "vertebrate_mammalian", type = "genome") %>% ...
Here ... denotes any subsequent meta-genomics analysis. Hence, meta.retrieval() enables the pipelining methodology for meta-genomics.
The meta.retrieval() function can retrieve genomes, proteomes, and CDS files.
Download all mammalian vertebrate genomes from RefSeq.
# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")
All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.
Alternatively, download all mammalian vertebrate genomes from genbank.
# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome")
Download all mammalian vertebrate proteomes from RefSeq.
# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome")
Alternatively, download all mammalian vertebrate proteomes from genbank.
# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome")
Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data).
# download all mammalian vertebrate CDS files
meta.retrieval(kingdom = "vertebrate_mammalian", type = "CDS")
Users can obtain alternative kingdoms using getKingdoms().
Finally, users can download all genomes stored in the RefSeq database with one command:
# download all genomes stored in RefSeq
sapply(getKingdoms(), function(x) meta.retrieval(x, type = "genome"))
Analogously, proteomes or CDS files can be retrieved by replacing type = "genome" with type = "proteome" or type = "CDS".
Users can download all genomes stored in the Genbank database with one command:
# download all genomes stored in Genbank
sapply(getKingdoms(), function(x) meta.retrieval(x, db = "genbank", type = "genome"))
Analogously, proteomes can be retrieved by replacing type = "genome" with type = "proteome".
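Bulk retrieval over all kingdoms can run for a long time, so one failed kingdom should not abort the whole loop. The following sketch wraps each retrieval in tryCatch() and records a status per kingdom. The helper retrieve_all() and the stand-in fake_retrieval() are hypothetical and introduced only for this example; the simulated failure for "protozoa" is artificial.

```r
# Hypothetical helper (not part of biomartr): run retrieve_fun() for each
# kingdom and record whether it succeeded instead of stopping at the first error.
retrieve_all <- function(kingdoms, retrieve_fun) {
  status <- vapply(kingdoms, function(k) {
    tryCatch(
      {
        retrieve_fun(k)
        "ok"
      },
      error = function(e) paste("failed:", conditionMessage(e))
    )
  }, character(1))
  names(status) <- kingdoms
  status
}

# Stand-in for a real retrieval call; fails artificially for one kingdom.
fake_retrieval <- function(kingdom) {
  if (kingdom == "protozoa") stop("connection lost")
  invisible(NULL)
}

status <- retrieve_all(c("archaea", "protozoa", "plant"), fake_retrieval)
print(status)
```

In a real session, the stand-in could be replaced by function(k) meta.retrieval(k, type = "genome") and the kingdom vector by getKingdoms(), so that failed kingdoms can be retried afterwards.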