Sequence Retrieval

2016-08-07

The biomartr package allows users to retrieve biological sequences in a very simple and intuitive way.

Using biomartr, users can retrieve genomes, proteomes, or CDS data using the specialized functions getGenome(), getProteome(), and getCDS().

Getting Started with Sequence Retrieval

First, users can check whether the genome, proteome, or CDS of interest is available for download.

Using the scientific name of the organism of interest, users can check whether the corresponding genome is available via the is.genome.available() function.

# checking whether the Arabidopsis thaliana
# genome is available for download
is.genome.available("Arabidopsis thaliana")
[1] TRUE
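
To check several organisms in one go, the call above can be wrapped in a small helper. This is only a sketch: check_availability() and its checker argument are hypothetical names introduced here and are not part of biomartr. By default the helper calls is.genome.available() as shown above, which requires an internet connection.

```r
# Hypothetical helper (not part of biomartr): check several organisms at once.
# 'checker' defaults to is.genome.available(); a stand-in function can be
# supplied to try the helper without network access.
check_availability <- function(organisms,
                               checker = biomartr::is.genome.available) {
  # one TRUE/FALSE per organism, named by the organism
  vapply(organisms, function(org) isTRUE(checker(org)), logical(1))
}

# usage (requires biomartr and an internet connection):
# check_availability(c("Arabidopsis thaliana", "Homo sapiens"))
```

The result is a named logical vector, so it can be used directly to subset the list of organisms before any download is started.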

By specifying the details = TRUE argument, the genome file size as well as additional information can be printed to the console.

# printing details to the console
is.genome.available("Arabidopsis thaliana", details = TRUE)
           organism_name  kingdoms  group    subgroup file_size_MB chrs organelles plasmids bio_projects
682 Arabidopsis thaliana Eukaryota Plants Land Plants      119.668    6          2       NA            6

Users will observe that the Arabidopsis thaliana genome file has a size of 119.668 MB.

Note: The availability of genomes has been taken from NCBI.

Users can determine the total number of available genomes using the listGenomes() function.

length(listGenomes())
[1] 15512

Hence, currently 15512 genomes (including all kingdoms of life) are stored on NCBI servers.

Optionally, users can also specify the database for which the availability of organisms shall be checked.

# checking whether A. thaliana is available in the refseq database
is.genome.available("Arabidopsis thaliana", database = "refseq")
[1] TRUE

Users can also determine the total number of genomes stored in refseq.

length(listGenomes(database = "refseq"))
[1] 5519

This result shows that so far (year 2016) 5519 genomes are stored in refseq.

The simplest way to work with listGenomes() is to print available genomes to the console.

# the simplest way to retrieve names of available genomes stored within NCBI databases
head(listGenomes() , 5)
[1] "'Chrysanthemum coronarium' phytoplasma"         
[2] "'Deinococcus soli' Cha et al. 2014"             
[3] "'Echinacea purpurea' witches'-broom phytoplasma"
[4] "Abaca bunchy top virus"                         
[5] "Abalone herpesvirus Victoria/AUS/2009"

In case users are interested in a detailed output of the corresponding organism file stored on NCBI, again they can specify the details = TRUE argument.

# show all details
head(listGenomes(details = TRUE) , 5)
                                    organism_name kingdoms                       group
1          'Chrysanthemum coronarium' phytoplasma Bacteria         Terrabacteria group
2              'Deinococcus soli' Cha et al. 2014 Bacteria         Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria         Terrabacteria group
4                          Abaca bunchy top virus  Viruses               ssDNA viruses
5           Abalone herpesvirus Victoria/AUS/2009  Viruses dsDNA viruses, no RNA stage
             subgroup file_size_MB chrs organelles plasmids bio_projects
1         Tenericutes     0.739592   NA         NA       NA            1
2 Deinococcus-Thermus     3.236980    1         NA       NA            1
3         Tenericutes     0.545427   NA         NA       NA            1
4         Nanoviridae     0.006422    6         NA       NA            1
5        unclassified     0.211518    1         NA       NA            1

Users will observe that the detailed output includes the columns organism_name, kingdoms, group, subgroup, file_size_MB, chrs, organelles, plasmids, and bio_projects.

In case users are interested in organisms classified into a specific kingdom of life, they can use the kingdom argument to filter for organisms that are classified into the corresponding kingdom.

# show all details only for Bacteria
head(listGenomes(kingdom = "Bacteria", details = TRUE) , 5)
                                    organism_name kingdoms               group
1          'Chrysanthemum coronarium' phytoplasma Bacteria Terrabacteria group
2              'Deinococcus soli' Cha et al. 2014 Bacteria Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria Terrabacteria group
4                           Abiotrophia defectiva Bacteria Terrabacteria group
5                          Acaricomes phytoseiuli Bacteria Terrabacteria group
             subgroup file_size_MB chrs organelles plasmids bio_projects
1         Tenericutes     0.739592   NA         NA       NA            1
2 Deinococcus-Thermus     3.236980    1         NA       NA            1
3         Tenericutes     0.545427   NA         NA       NA            1
4          Firmicutes     2.043440   NA         NA       NA            1
5      Actinobacteria     2.419520   NA         NA       NA            1

The following filters can be specified for the kingdom argument: all, Archaea, Bacteria, Eukaryota, Viroids, and Viruses.

Furthermore, users can count the number of available genomes per kingdom as follows:

# the number of genomes available for each kingdom
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "kingdoms"])
 Archaea  Bacteria Eukaryota   Viroids   Viruses 
      516      7918      1660        48      5370 
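
The counts can also be turned into percentages. The sketch below copies the numbers from the output above into a named vector so that it runs without a download; with live data, prop.table(table(ncbi_genomes[ , "kingdoms"])) would give the same kind of result as fractions.

```r
# kingdom counts copied from the table() output above
kingdom_counts <- c(Archaea = 516, Bacteria = 7918, Eukaryota = 1660,
                    Viroids = 48, Viruses = 5370)

# total number of genomes; this equals the 15512 genomes
# reported by length(listGenomes())
total <- sum(kingdom_counts)

# percentage of all genomes contributed by each kingdom
kingdom_share <- round(100 * kingdom_counts / total, 1)
print(kingdom_share)
```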

Analogous computations can be performed for group, subgroup, etc.

# the number of genomes available for each group
ncbi_genomes <- listGenomes(details = TRUE)
table(ncbi_genomes[ , "group"])
                  Acidobacteria                         Animals 
                             30                             617 
                      Aquificae                   Avsunviroidae 
                             17                               4 
                    Caldiserica                  Chrysiogenetes 
                              2                               2 
                Deferribacteres                      Deltavirus 
                              6                               1 
                    Dictyoglomi                     DPANN group 
                              2                              30 
    dsDNA viruses, no RNA stage                   dsRNA viruses 
                           2405                             244 
                  Elusimicrobia           environmental samples 
                              3                               7 
                  Euryarchaeota                       FCB group 
                            322                             690 
                          Fungi                    Fusobacteria 
                            662                              31 
Nitrospinae/Tectomicrobia group                     Nitrospirae 
                              3                              19 
                          Other                          Plants 
                             18                             179 
                  Pospiviroidae                  Proteobacteria 
                             36                            2718 
                       Protists                       PVC group 
                            190                             113 
     Retro-transcribing viruses                      Satellites 
                            134                             223 
                   Spirochaetes                   ssDNA viruses 
                             84                             848 
                  ssRNA viruses                   Synergistetes 
                           1389                              23 
                     TACK group             Terrabacteria group 
                            130                            3097 
          Thermodesulfobacteria                     Thermotogae 
                              9                              28 
             unassigned viruses            unclassified Archaea 
                             11                              29 
  unclassified archaeal viruses           unclassified Bacteria 
                              3                            1039 
            unclassified phages            unclassified viroids 
                             29                               8 
        unclassified virophages            unclassified viruses 
                              6                              71 

Users can also order organisms by their file size.

# order by file size
library(dplyr)

ncbi_genomes <- listGenomes(details = TRUE)
head(arrange(ncbi_genomes, desc(file_size_MB)) , 10)
            organism_name  kingdoms   group    subgroup file_size_MB chrs organelles
1       Pinus lambertiana Eukaryota  Plants Land Plants     27602.70   NA         NA
2            Picea glauca Eukaryota  Plants Land Plants     24627.00   NA          1
3             Pinus taeda Eukaryota  Plants Land Plants     22061.90   NA         NA
4   Pseudotsuga menziesii Eukaryota  Plants Land Plants     14673.20   NA         NA
5      Locusta migratoria Eukaryota Animals     Insects      5759.80   NA         NA
6        Orycteropus afer Eukaryota Animals     Mammals      4444.08   NA          1
7  Chrysochloris asiatica Eukaryota Animals     Mammals      4210.11   NA          1
8   Elephantulus edwardii Eukaryota Animals     Mammals      3843.98   NA         NA
9     Apodemus sylvaticus Eukaryota Animals     Mammals      3758.14   NA         NA
10        Triticum urartu Eukaryota  Plants Land Plants      3747.05   NA         NA
   plasmids bio_projects
1        NA            1
2        NA            2
3        NA            1
4        NA            1
5        NA            1
6        NA            1
7        NA            1
8        NA            1
9        NA            1
10       NA            1

This analysis shows that Pinus lambertiana has the largest genome available on the NCBI server.
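
The same ordering can be done without dplyr. The following base R sketch uses a hypothetical helper, largest_genomes(), and a three-row example table whose values are copied from the output above. It assumes only that the table has a file_size_MB column.

```r
# Hypothetical helper: return the n largest genomes from a details table.
largest_genomes <- function(genomes, n = 10) {
  ord <- order(genomes$file_size_MB, decreasing = TRUE)
  head(genomes[ord, , drop = FALSE], n)
}

# small example table using three rows from the output above
example_genomes <- data.frame(
  organism_name = c("Locusta migratoria", "Pinus lambertiana", "Picea glauca"),
  file_size_MB  = c(5759.80, 27602.70, 24627.00),
  stringsAsFactors = FALSE
)

largest_genomes(example_genomes, n = 2)
```

With live data, largest_genomes(listGenomes(details = TRUE)) should correspond to the dplyr call shown above.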

Internally, the listGenomes() function downloads the Genome Reports file from NCBI and stores it in a tempfile() folder named _ncbi_downloads/overview.txt. It is only downloaded once and is then accessed from your hard drive. In case users would like to update the Genome Reports file, they can specify the update = TRUE argument which allows them to reload the Genome Reports file from the NCBI server.

# users can also update the organism table using the 'update' argument
head(listGenomes(details = TRUE, update = TRUE) , 5)
                                    organism_name kingdoms                       group
1          'Chrysanthemum coronarium' phytoplasma Bacteria         Terrabacteria group
2              'Deinococcus soli' Cha et al. 2014 Bacteria         Terrabacteria group
3 'Echinacea purpurea' witches'-broom phytoplasma Bacteria         Terrabacteria group
4                          Abaca bunchy top virus  Viruses               ssDNA viruses
5           Abalone herpesvirus Victoria/AUS/2009  Viruses dsDNA viruses, no RNA stage
             subgroup file_size_MB chrs organelles plasmids bio_projects
1         Tenericutes     0.739592   NA         NA       NA            1
2 Deinococcus-Thermus     3.236980    1         NA       NA            1
3         Tenericutes     0.545427   NA         NA       NA            1
4         Nanoviridae     0.006422    6         NA       NA            1
5        unclassified     0.211518    1         NA       NA            1
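
The download-once behaviour described above follows a common caching pattern. The sketch below illustrates that pattern in general terms. It is not biomartr's actual code, and cached_file(), fake_fetch(), and the cache location are hypothetical names chosen for this example.

```r
# Sketch of a download-once cache (not biomartr's actual code).
# 'fetch' is any function that writes the file to 'path'.
cached_file <- function(path, fetch, update = FALSE) {
  if (update || !file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    fetch(path)
  }
  path
}

# usage with a stand-in fetch function that writes a placeholder file
cache_path <- file.path(tempdir(), "_cache_sketch", "overview.txt")
unlink(dirname(cache_path), recursive = TRUE)  # start from an empty cache

fetch_calls <- 0
fake_fetch <- function(path) {
  fetch_calls <<- fetch_calls + 1
  writeLines("placeholder", path)
}

cached_file(cache_path, fake_fetch)                 # fetches the file
cached_file(cache_path, fake_fetch)                 # reuses the cached copy
cached_file(cache_path, fake_fetch, update = TRUE)  # forces a refresh
```

Only the first and the third call trigger the fetch function, which mirrors the difference between the default call and update = TRUE.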

Again, the listGenomes() function can be used to filter for available genome information in refseq.

# list all Eukaryota that are stored in refseq
head(listGenomes(kingdom = "Eukaryota", database = "refseq") , 20)
                    organism_name
1               Agaricus bisporus
2           Auricularia subglabra
3            Komagataella phaffii
4           Arthroderma benhamiae
5                Arthroderma otae
6            Aspergillus clavatus
7              Aspergillus flavus
8           Aspergillus fumigatus
9            Aspergillus nidulans
10              Aspergillus niger
11             Aspergillus oryzae
12            Aspergillus terreus
13        Phaeoacremonium minimum
14 Batrachochytrium dendrobatidis
15            Ordospora colligata
16               Bipolaris oryzae
17          Bipolaris sorokiniana
18              Bipolaris zeicola
19               Botrytis cinerea
20               Candida albicans

Or analogously:

# the number of genomes available for each kingdom stored in refseq
ncbi_genomes <- listGenomes(details = TRUE, database = "refseq")
table(ncbi_genomes[ , "kingdoms"])
  Archaea  Bacteria Eukaryota 
      258      4679       582 
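
Comparing these refseq counts with the counts over all databases shown earlier gives a rough idea of how much of each kingdom is covered by refseq. The sketch below copies both sets of numbers from the outputs above, so it runs without a download.

```r
# counts copied from the two table() outputs above
all_counts    <- c(Archaea = 516, Bacteria = 7918, Eukaryota = 1660)
refseq_counts <- c(Archaea = 258, Bacteria = 4679, Eukaryota = 582)

# the refseq counts add up to the 5519 genomes reported for refseq
sum(refseq_counts)

# fraction of each kingdom's genomes that is available in refseq
refseq_fraction <- round(refseq_counts / all_counts[names(refseq_counts)], 2)
print(refseq_fraction)
```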

Note that when running the listGenomes() function for the first time, it might take a while until the function returns any results, because the necessary information needs to be downloaded from NCBI databases. All subsequent executions of listGenomes() will then respond very fast, because they access the corresponding files stored on your hard drive.

Downloading Biological Sequences

After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, or CDS file in fasta format. The following functions allow users to download proteomes, genomes, and CDS files from database resources such as refseq. When a proteome, genome, or CDS file is downloaded to your hard drive, a documentation *.txt file is generated that stores the File Name, Organism, Database, URL, and DATE information. This improves the reproducibility of the proteome, genome, and CDS versions used for subsequent data analyses.
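
To illustrate what such a documentation file might contain, here is a minimal sketch that writes the five fields named above to a text file. The helper write_doc_file(), the output file name, and the exact line layout are assumptions made for this example and do not reproduce biomartr's implementation.

```r
# Hypothetical helper (not biomartr's implementation): write a small
# documentation file with the fields named above.
write_doc_file <- function(doc_path, file_name, organism, database, url) {
  lines <- c(
    paste("File Name:", file_name),
    paste("Organism:", organism),
    paste("Database:", database),
    paste("URL:", url),
    paste("DATE:", format(Sys.Date()))
  )
  writeLines(lines, doc_path)
  invisible(doc_path)
}

# the documentation file name below is made up for this example
doc_path <- file.path(tempdir(), "doc_Arabidopsis_thaliana.txt")
write_doc_file(doc_path,
               file_name = "Arabidopsis_thaliana_genomic.fna.gz",
               organism  = "Arabidopsis thaliana",
               database  = "refseq",
               url       = NA)  # no download URL is given here, so it stays NA
readLines(doc_path)
```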

Genome Retrieval

The easiest way to download a genome is to use the getGenome() function.

In this example we will download the genome of A. thaliana.

The getGenome() function is an interface function to the NCBI refseq or NCBI genbank databases from which corresponding genomes can be retrieved.

For this purpose, users need to specify the kingdom into which their organism of interest is classified, e.g. "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa", "vertebrate_mammalian", or "vertebrate_other" (see also ?getKingdoms), and then the scientific name of the organism of interest.

# download the genome of Arabidopsis thaliana from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
getGenome( db       = "refseq", 
           kingdom  = "plant",
           organism = "Arabidopsis thaliana",
           path     = file.path("_ncbi_downloads","genomes") )

The getGenome() function creates a directory named '_ncbi_downloads/genomes' into which the corresponding genome file, named Arabidopsis_thaliana_genomic.fna.gz, is downloaded. The read_genome() function enables users to work with the genome as a data.table object.

# path to genome: '_ncbi_downloads/genomes/Arabidopsis_thaliana_genomic.fna.gz'
file_path <- file.path("_ncbi_downloads","genomes","Arabidopsis_thaliana_genomic.fna.gz")
# read genome as data.table object
Ath_genome <- read_genome(file_path, format = "fasta")

In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here") argument.

In case users wish to download genomes from NCBI genbank instead of NCBI refseq, they can specify the argument db = "genbank" in getGenome().

Proteome Retrieval

The getProteome() function is also an interface function to the NCBI refseq or NCBI genbank databases, from which the corresponding proteomes can be retrieved. It works analogously to getGenome().

# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db       = "refseq", 
             kingdom  = "plant",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","proteomes") )

The getProteome() function creates a directory named _ncbi_downloads/proteomes into which the corresponding proteome file, named Arabidopsis_thaliana_protein.faa.gz, is downloaded. The read_proteome() function enables users to work with the proteome as a data.table object.

# path to proteome: '_ncbi_downloads/proteomes/Arabidopsis_thaliana_protein.faa.gz'
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
# read proteome as data.table object
Ath_proteome <- read_proteome(file_path, format = "fasta")

In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here") argument.

In case users wish to download proteomes from NCBI genbank instead of NCBI refseq, they can specify the argument db = "genbank" in getProteome().

CDS Retrieval

The getCDS() function is also an interface function to the NCBI refseq database, from which the corresponding CDS files are downloaded. It works analogously to the getGenome() and getProteome() functions, but for CDS files.

# download the CDS of Arabidopsis thaliana from refseq
# and store the corresponding CDS file in '_ncbi_downloads/CDS'
getCDS( db       = "refseq", 
        kingdom  = "plant",
        organism = "Arabidopsis thaliana",
        path     = file.path("_ncbi_downloads","CDS") )

The getCDS() function creates a directory named _ncbi_downloads/CDS into which the corresponding CDS file is downloaded. The read_cds() function allows you to read the corresponding CDS as a data.table object.

# path to CDS file: '_ncbi_downloads/CDS/Arabidopsis_thaliana_rna.fna.gz'
file_path <- file.path("_ncbi_downloads","CDS","Arabidopsis_thaliana_rna.fna.gz")
# read CDS as data.table object
Ath_cds <- read_cds(file_path, format = "fasta")

In case users would like to store the CDS file at a different location, they can specify the path = file.path("put","your","path","here") argument.
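
The three examples above build the local file path by hand. The sketch below collects that pattern in one hypothetical helper, expected_file(). The folder names and file suffixes are taken from the A. thaliana examples in this section. Whether the same naming holds for every organism and database is an assumption.

```r
# Hypothetical helper: build the expected local file path for a download.
# Suffixes and sub-folders are copied from the A. thaliana examples above.
expected_file <- function(organism, type = c("genome", "proteome", "CDS"),
                          folder = "_ncbi_downloads") {
  type <- match.arg(type)
  suffix <- c(genome   = "_genomic.fna.gz",
              proteome = "_protein.faa.gz",
              CDS      = "_rna.fna.gz")[[type]]
  subdir <- c(genome = "genomes", proteome = "proteomes", CDS = "CDS")[[type]]
  # scientific name with spaces replaced by underscores, plus the suffix
  file.path(folder, subdir, paste0(gsub(" ", "_", organism), suffix))
}

expected_file("Arabidopsis thaliana", "proteome")
```

Combined with file.exists(), such a helper can be used to skip a download when the file is already present.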

Furthermore, the getCDS() function checks whether the length of each CDS sequence is divisible by 3 (codons). If the sequences of particular genes are not divisible by 3, a warning message is returned that reports the number of affected genes. In case users would like to remove these sequences from their data, they can specify the delete_delete_corrupt = TRUE argument, which will then delete all corrupt CDS sequences.
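
The divisibility check itself is simple to express. The following sketch works on plain character strings and is not the check implemented inside biomartr. The helper is_codon_complete() and the toy sequences are made up for illustration.

```r
# Sketch with plain character strings (not biomartr's internal check):
# flag coding sequences whose length is not a multiple of 3.
is_codon_complete <- function(seqs) {
  nchar(seqs) %% 3 == 0
}

# toy sequences, made up for illustration
cds <- c(gene_a = "ATGGCCTAA", gene_b = "ATGGCCTA", gene_c = "ATGTAA")
ok  <- is_codon_complete(cds)

# report how many sequences fail the check
if (any(!ok)) {
  message(sum(!ok), " sequence(s) have a length that is not a multiple of 3")
}

cds_clean <- cds[ok]  # keep only the sequences that pass the check
```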

Retrieving sequences for a set of genes

For most analyses, only subsets of sequences (taken from the entire genome) are needed. This section introduces several approaches to select a set of sequences for further analyses.

Using the output from getProteome()

As seen before, the getProteome() function allows users to download the entire proteome of a specific organism of interest that is stored in refseq.

# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db       = "refseq", 
             kingdom  = "plant",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","proteomes") )

Again, we first download the A. thaliana proteome. Suppose we are interested in the following two genes for subsequent analyses: AT1G06090 and AT1G06100. Both genes are members of the fatty acid desaturase family and have functions in oxidoreductase activity.

For this purpose users can use the biomart() function [see Functional Annotation for details].

Perform Meta-Genome Retrieval

The number of genome sequences generated and stored in sequence databases is growing exponentially every day. With the availability of this growing amount of data, meta-genomics studies become more popular and useful for finding patterns within genomes by comparing them to thousands of other genomes. However, the first step in any meta-genomics study is the retrieval of the genomes that shall be compared or investigated.

For this purpose, I implemented the meta.retrieval() function to allow users to perform easy meta-genome retrieval in R.

The getKingdoms() function stores a list of all available kingdoms of life.

getKingdoms()
[1] "archaea"              "bacteria"             "fungi"               
[4] "invertebrate"         "plant"                "protozoa"            
[7] "vertebrate_mammalian" "vertebrate_other"

These kingdoms can be specified in meta.retrieval().
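
Because a mistyped kingdom name would only surface once a long download has been started, it can be useful to validate the name first. The sketch below hard-codes the kingdoms printed by getKingdoms() above so that it runs offline. check_kingdom() is a hypothetical helper and not part of biomartr.

```r
# kingdoms as printed by getKingdoms() above
known_kingdoms <- c("archaea", "bacteria", "fungi", "invertebrate", "plant",
                    "protozoa", "vertebrate_mammalian", "vertebrate_other")

# Hypothetical helper: fail early on an unknown kingdom name.
check_kingdom <- function(kingdom, choices = known_kingdoms) {
  if (!is.character(kingdom) || length(kingdom) != 1 || !kingdom %in% choices) {
    stop("unknown kingdom; choose one of: ", paste(choices, collapse = ", "))
  }
  kingdom
}

check_kingdom("plant")
```

With biomartr loaded, choices = getKingdoms() could be passed instead of the hard-coded vector.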

The meta.retrieval() function aims to simplify the genome retrieval process for subsequent meta-genomics studies.

Usually this step is performed with shell scripts. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing workflows.

For example, the pipeline logic of the magrittr package can be used with meta.retrieval().

# download all mammalian vertebrate genomes, then apply ...
meta.retrieval(kingdom = "vertebrate_mammalian", type = "genome") %>% ...

Here ... denotes any subsequent meta-genomics analysis. Hence, meta.retrieval() enables the pipelining methodology for meta-genomics.

The meta.retrieval() function can retrieve genomes, proteomes, and CDS files.

Retrieve Genomic Sequences

Download all mammalian vertebrate genomes from RefSeq.

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

All genomes are stored in a folder named after the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.

Alternatively, download all mammalian vertebrate genomes from genbank.

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome")

Retrieve Protein Sequences

Download all mammalian vertebrate proteomes from RefSeq.

# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome")

Alternatively, download all mammalian vertebrate proteomes from genbank.

# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome")

Retrieve CDS Sequences

Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data).

# download all mammalian vertebrate CDS files
meta.retrieval(kingdom = "vertebrate_mammalian", type = "CDS")

Users can obtain alternative kingdoms using getKingdoms().

Retrieve Genomes for all kingdoms of life stored in RefSeq

Finally, users can download all genomes stored in the RefSeq database with one command:

# download all genomes stored in RefSeq
sapply(getKingdoms(), function(x) meta.retrieval(x, type = "genome"))

Analogously, proteomes or CDS files can be retrieved by replacing type = "genome" with type = "proteome" or type = "CDS".
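
Since a retrieval over all kingdoms can run for a long time, a single failure (for example a dropped connection) should ideally not abort the whole loop. The sketch below wraps the per-kingdom call in tryCatch() and records a status per kingdom. retrieve_all() is a hypothetical wrapper and not part of biomartr. It assumes only that the retrieval function takes the kingdom as its first argument, as in the sapply() call above.

```r
# Hypothetical wrapper (not part of biomartr): run a retrieval function for
# each kingdom and record failures instead of stopping at the first error.
retrieve_all <- function(kingdoms, retrieve, ...) {
  vapply(kingdoms, function(k) {
    tryCatch({
      retrieve(k, ...)
      "ok"
    }, error = function(e) paste("failed:", conditionMessage(e)))
  }, character(1))
}

# usage (requires biomartr and an internet connection; downloads a lot of data):
# status <- retrieve_all(getKingdoms(), meta.retrieval, type = "genome")
# status[status != "ok"]   # kingdoms that need to be retried
```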

Retrieve Genomes for all kingdoms of life stored in GenBank

Users can download all genomes stored in the Genbank database with one command:

# download all genomes stored in Genbank
sapply(getKingdoms(), function(x) meta.retrieval(x, db = "genbank", type = "genome"))

Analogously, proteomes can be retrieved by replacing type = "genome" with type = "proteome".