Meta-Genome Retrieval

2016-11-21

Perform Meta-Genome Retrieval

The number of genome sequences generated and stored in sequence databases is growing exponentially every day. With the availability of this growing amount of data, meta-genomics studies become more popular and useful for finding patterns within genomes by comparing them to thousands of other genomes. However, the first step in any meta-genomics study is the retrieval of the genomes that shall be compared or investigated.

For this purpose, I implemented the meta.retrieval() function to allow users to perform easy meta-genome retrieval in R.

The getKingdoms() function stores a list of all available kingdoms of life. Using the argument db users can specify from which database kingdom information shall be retrieved.

Example RefSeq:

getKingdoms(db = "refseq")
[1] "archaea"              "bacteria"             "fungi"                "invertebrate"        
[5] "plant"                "protozoa"             "vertebrate_mammalian" "vertebrate_other"    
[9] "viral"

Example Genbank:

getKingdoms(db = "genbank")
[1] "archaea"              "bacteria"             "fungi"               
[4] "invertebrate"         "plant"                "protozoa"            
[7] "vertebrate_mammalian" "vertebrate_other"

In these examples the difference betwenn db = "refseq" and db = "genbank" is that db = "genbank" does not store viral information.

These kingdoms can be specified in meta.retrieval().

The meta.retrieval() function aims to simplify the genome retrieval process for subsequent meta-genomics studies.

Usually this step is performed with shell scripts. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing workflows.

For example, the pipeline logic of the magrittr package can be used with meta.retrieval().

# download all vertebrate genomes, then apply ...
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% ...

Here ... denotes any subsequent meta-genomics analysis. Hence, meta.retrieval() enables the pipelining methodology for meta-genomics.

The meta.retrieval() function can retrieve genomes, proteomes, and CDS files.

Retrieve Genomic Sequences

Download all mammalian vertebrate genomes from RefSeq.

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

All geneomes are stored in the folder named according to the kingdom. In this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.

Example Bacteria

# download all bacteria genomes
meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

Example Viruses

# download all virus genomes
meta.retrieval(kingdom = "viral", db = "refseq", type = "genome")

Example Archaea

# download all archaea genomes
meta.retrieval(kingdom = "archaea", db = "refseq", type = "genome")

Example Fungi

# download all fungi genomes
meta.retrieval(kingdom = "fungi", db = "refseq", type = "genome")

Example Plants

# download all plant genomes
meta.retrieval(kingdom = "plant", db = "refseq", type = "genome")

Example Invertebrates

# download all invertebrate genomes
meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome")

Example Protozoa

# download all invertebrate genomes
meta.retrieval(kingdom = "protozoa", db = "refseq", type = "genome")

Alternatively, download all mammalian vertebrate genomes from Genbank, e.g.

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome")

Metagenome project retrieval from NCBI Genbank

NCBI Genbank stores metagenome projects in addition to species specific genome, proteome or CDS sequences. To retrieve these metagenomes users can perform the following combination of commands:

First, users can list the project names of available metagenomes by typing

# list available metagenomes at NCBI Genbank
listMetaGenomes()
[1] "metagenome"                     "human gut metagenome"           "epibiont metagenome"           
 [4] "marine metagenome"              "soil metagenome"                "mine drainage metagenome"      
 [7] "mouse gut metagenome"           "marine sediment metagenome"     "termite gut metagenome"        
[10] "hot springs metagenome"         "human lung metagenome"          "fossil metagenome"             
[13] "freshwater metagenome"          "saltern metagenome"             "stromatolite metagenome"       
[16] "coral metagenome"               "mosquito metagenome"            "fish metagenome"               
[19] "bovine gut metagenome"          "chicken gut metagenome"         "wastewater metagenome"         
[22] "microbial mat metagenome"       "freshwater sediment metagenome" "human metagenome"              
[25] "hydrothermal vent metagenome"   "compost metagenome"             "wallaby gut metagenome"        
[28] "groundwater metagenome"         "gut metagenome"                 "sediment metagenome"           
[31] "ant fungus garden metagenome"   "food metagenome"                "hypersaline lake metagenome"   
[34] "hydrocarbon metagenome"         "activated sludge metagenome"    "viral metagenome"              
[37] "bioreactor metagenome"          "wasp metagenome"                "permafrost metagenome"         
[40] "sponge metagenome"              "aquatic metagenome"             "insect gut metagenome"         
[43] "activated carbon metagenome"    "anaerobic digester metagenome"  "rock metagenome"               
[46] "terrestrial metagenome"         "rock porewater metagenome"      "seawater metagenome"           
[49] "scorpion gut metagenome"        "soda lake metagenome"           "glacier metagenome"

Internally the listMetaGenomes() function downloads the assembly_summary.txt file from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/ to retrieve available metagenome information. This procedure might take a few seconds during the first run of listMetaGenomes(). Subsequently, the assembly_summary.txt file will be stored in the tempdir() directory to achieve a much faster access of this information during following uses of listMetaGenomes().

In case users wish to retrieve detailed information about available metagenome projects they can specify the details = TRUE argument.

# detailed information on available metagenomes at NCBI Genbank
listMetaGenomes(details = TRUE)
# A tibble: 857 x 21
   assembly_accession bioproject    biosample     wgs_master refseq_category  taxid species_taxid
                <chr>      <chr>        <chr>          <chr>           <chr>  <int>         <int>
1     GCA_000206185.1 PRJNA32359 SAMN02954317 AAGA00000000.1              na 256318        256318
2     GCA_000206205.1 PRJNA32355 SAMN02954315 AAFZ00000000.1              na 256318        256318
3     GCA_000206225.1 PRJNA32357 SAMN02954316 AAFY00000000.1              na 256318        256318
4     GCA_000208265.2 PRJNA17779 SAMN02954240 AASZ00000000.1              na 256318        256318
5     GCA_000208285.1 PRJNA17657 SAMN02954268 AATO00000000.1              na 256318        256318
6     GCA_000208305.1 PRJNA17659 SAMN02954269 AATN00000000.1              na 256318        256318
7     GCA_000208325.1 PRJNA16729 SAMN02954263 AAQL00000000.1              na 256318        256318
8     GCA_000208345.1 PRJNA16729 SAMN02954262 AAQK00000000.1              na 256318        256318
9     GCA_000208365.1 PRJNA13699 SAMN02954283 AAFX00000000.1              na 256318        256318
10    GCA_900010595.1 PRJEB11544 SAMEA3639840 CZPY00000000.1              na 256318        256318
# ... with 847 more rows, and 14 more variables: organism_name <chr>, infraspecific_name <chr>,
#   isolate <chr>, version_status <chr>, assembly_level <chr>, release_type <chr>, genome_rep <chr>,
#   seq_rel_date <date>, asm_name <chr>, submitter <chr>, gbrs_paired_asm <chr>, paired_asm_comp <chr>,
#   ftp_path <chr>, excluded_from_refseq <chr>

Finally, users can retrieve available metagenomes using getMetaGenomes(). The name argument receives the metagenome project name retrieved with listMetaGenomes(). The path argument specifies the folder path in which corresponding genomes shall be stored.

# retrieve all genomes belonging to the human gut metagenome project
getMetaGenomes(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut"))
1] "The metagenome of 'human gut metagenome' has been downloaded to '_ncbi_downloads/human_gut'."
  [1] "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz"
  [2] "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
  [3] "_ncbi_downloads/human_gut/GCA_000205785.1_ASM20578v1_genomic.fna.gz"
  [4] "_ncbi_downloads/human_gut/GCA_000207925.1_ASM20792v1_genomic.fna.gz"
  [5] "_ncbi_downloads/human_gut/GCA_000207945.1_ASM20794v1_genomic.fna.gz"
  [6] "_ncbi_downloads/human_gut/GCA_000207965.1_ASM20796v1_genomic.fna.gz"
  [7] "_ncbi_downloads/human_gut/GCA_000207985.1_ASM20798v1_genomic.fna.gz"
  [8] "_ncbi_downloads/human_gut/GCA_000208005.1_ASM20800v1_genomic.fna.gz"
  [9] "_ncbi_downloads/human_gut/GCA_000208025.1_ASM20802v1_genomic.fna.gz"
 [10] "_ncbi_downloads/human_gut/GCA_000208045.1_ASM20804v1_genomic.fna.gz"
 [11] "_ncbi_downloads/human_gut/GCA_000208065.1_ASM20806v1_genomic.fna.gz"
 [12] "_ncbi_downloads/human_gut/GCA_000208085.1_ASM20808v1_genomic.fna.gz"
 [13] "_ncbi_downloads/human_gut/GCA_000208105.1_ASM20810v1_genomic.fna.gz"
 [14] "_ncbi_downloads/human_gut/GCA_000208125.1_ASM20812v1_genomic.fna.gz"
 [15] "_ncbi_downloads/human_gut/GCA_000208145.1_ASM20814v1_genomic.fna.gz"
 [16] "_ncbi_downloads/human_gut/GCA_000208165.1_ASM20816v1_genomic.fna.gz"
 ...

Internally, getMetaGenomes() creates a folder specified in the path argument. Genomes associated with the metagenomes project specified in the name argument will then be downloaded and stored in this folder. As return value getMetaGenomes() returns the file paths to the genome files which can then be used as input to the read*() functions.

Alternatively or subsequent to the metagenome retrieval, users can retrieve annotation files of genomes belonging to a metagenome project selected with listMetaGenomes() by using the getMetaGenomeAnnotations() function.

# retrieve all genomes belonging to the human gut metagenome project
getMetaGenomeAnnotations(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut","annotations"))
[1] "The annotations of metagenome 'human gut metagenome' have been downloaded and stored at '_ncbi_downloads/human_gut/annotations'."
  [1] "_ncbi_downloads/human_gut/annotations/GCA_000205525.2_ASM20552v2_genomic.gff.gz"
  [2] "_ncbi_downloads/human_gut/annotations/GCA_000205765.1_ASM20576v1_genomic.gff.gz"
  [3] "_ncbi_downloads/human_gut/annotations/GCA_000205785.1_ASM20578v1_genomic.gff.gz"
  [4] "_ncbi_downloads/human_gut/annotations/GCA_000207925.1_ASM20792v1_genomic.gff.gz"
  [5] "_ncbi_downloads/human_gut/annotations/GCA_000207945.1_ASM20794v1_genomic.gff.gz"
  [6] "_ncbi_downloads/human_gut/annotations/GCA_000207965.1_ASM20796v1_genomic.gff.gz"
  [7] "_ncbi_downloads/human_gut/annotations/GCA_000207985.1_ASM20798v1_genomic.gff.gz"
  [8] "_ncbi_downloads/human_gut/annotations/GCA_000208005.1_ASM20800v1_genomic.gff.gz"
  [9] "_ncbi_downloads/human_gut/annotations/GCA_000208025.1_ASM20802v1_genomic.gff.gz"
 [10] "_ncbi_downloads/human_gut/annotations/GCA_000208045.1_ASM20804v1_genomic.gff.gz"
 [11] "_ncbi_downloads/human_gut/annotations/GCA_000208065.1_ASM20806v1_genomic.gff.gz"
 [12] "_ncbi_downloads/human_gut/annotations/GCA_000208085.1_ASM20808v1_genomic.gff.gz"
 [13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz"
 [13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz"
 [14] "_ncbi_downloads/human_gut/annotations/GCA_000208125.1_ASM20812v1_genomic.gff.gz"
 [15] "_ncbi_downloads/human_gut/annotations/GCA_000208145.1_ASM20814v1_genomic.gff.gz"
 [16] "_ncbi_downloads/human_gut/annotations/GCA_000208165.1_ASM20816v1_genomic.gff.gz"
 ...

The file paths of the downloaded *.gff are retured by getMetaGenomeAnnotations() and can be used as input for the read.gff() function in the seqreadr package.

Retrieve Protein Sequences

Download all mammalian vertebrate proteomes.

Example RefSeq:

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome")

Example Genbank:

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome")

Retrieve CDS Sequences

Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data).

Example RefSeq:

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "CDS")

Example Genbank:

# download all vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "CDS")

Retrieve GFF files

Download all mammalian vertebrate gff files.

Example RefSeq:

# download all vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "gff")

Example Genbank:

# download all vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "gff")

Users can obtain alternative kingdoms using getKingdoms().

Retrieve Genomes for all kingdoms of life

If users wish to download the all genomes, proteome, CDS, or gff files for all species available in RefSeq or Genbank, they can use the meta.retrieval.all() function for this purpose.

Genome Retrieval

Example RefSeq:

# download all geneomes stored in RefSeq
meta.retrieval.all(db = "refseq", type = "genome")

Example Genbank:

# download all geneomes stored in Genbank
meta.retrieval.all(db = "genbank", type = "genome")

Proteome Retrieval

Example RefSeq:

# download all proteome stored in RefSeq
meta.retrieval.all(db = "refseq", type = "proteome")

Example Genbank:

# download all proteome stored in Genbank
meta.retrieval.all(db = "genbank", type = "proteome")