bold is an R package for querying BOLD Systems through its public API. Its functions let you search for sequence data, specimen data, or combined sequence and specimen data, and download raw trace files.

bold info

Using bold

Install

Install bold from CRAN

install.packages("bold")

Or install the development version from GitHub

devtools::install_github("ropensci/bold")

Load the package

library("bold")

Search for taxonomic information by taxon name

bold_tax_name looks up BOLD taxonomic records by taxon name. A single name can match more than one taxon.

bold_tax_name(name = 'Diplura')
##     input  taxid   taxon tax_rank tax_division parentid       parentname
## 1 Diplura 591238 Diplura    order      Animals       82          Insecta
## 2 Diplura 603673 Diplura    genus     Protists    53974 Scytosiphonaceae
##   taxonrep
## 1  Diplura
## 2     <NA>
bold_tax_name(name = c('Diplura', 'Osmia'))
##     input  taxid   taxon tax_rank tax_division parentid       parentname
## 1 Diplura 591238 Diplura    order      Animals       82          Insecta
## 2 Diplura 603673 Diplura    genus     Protists    53974 Scytosiphonaceae
## 3   Osmia   4940   Osmia    genus      Animals     4962     Megachilinae
##   taxonrep
## 1  Diplura
## 2     <NA>
## 3    Osmia
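
Because one name can match taxa in different divisions (Diplura above is both an order in Animals and a genus in Protists), you will often want to filter the result before using the IDs. A minimal sketch, assuming a hand-built data.frame that copies a few columns of the output above so that it runs without a network call; in practice `hits` would come from bold_tax_name(), and the names `hits`, `animals` and `animal_ids` are only illustrative:

```r
# Stand-in for bold_tax_name(name = c('Diplura', 'Osmia')); values copied
# from the output shown above (only some columns kept).
hits <- data.frame(
  input        = c("Diplura", "Diplura", "Osmia"),
  taxid        = c(591238, 603673, 4940),
  taxon        = c("Diplura", "Diplura", "Osmia"),
  tax_rank     = c("order", "genus", "genus"),
  tax_division = c("Animals", "Protists", "Animals"),
  stringsAsFactors = FALSE
)

# Keep only the animal taxa, then pull the matching BOLD taxon IDs
animals <- hits[hits$tax_division == "Animals", ]
animal_ids <- animals$taxid
animal_ids
```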

Search for taxonomic names via BOLD identifiers

bold_tax_id looks up taxa by their BOLD taxon identifiers.

bold_tax_id(id = 88899)
##   input taxid   taxon tax_rank tax_division parentid parentname
## 1 88899 88899 Momotus    genus      Animals    88898  Momotidae
bold_tax_id(id = c(88899, 125295))
##    input  taxid      taxon tax_rank tax_division parentid parentname
## 1  88899  88899    Momotus    genus      Animals    88898  Momotidae
## 2 125295 125295 Helianthus    genus       Plants   100962 Asteraceae

Search for sequence data only

The BOLD sequence API gives back sequence data, with a bit of metadata.

By default you get back a list with one element per record

bold_seq(taxon = 'Coelioxys')[1:2]
## [[1]]
## [[1]]$id
## [1] "BCHYM446-13"
## 
## [[1]]$name
## [1] "Coelioxys afra"
## 
## [[1]]$gene
## [1] "BCHYM446-13"
## 
## [[1]]$sequence
## [1] "-------------------------------------------------------------------------------------------------------------------------------------------TTTTTAATAATTTTTTTTTTAGTTATACCATTTTTAATTGGAGGATTTGGAAATTGATTAGTACCTTTAATACTAGGAGCCCCCGATATAGCTTTTCCACGAATAAATAATGTAAGATTTTGACTATTACCTCCCTCAATTTTCTTATTATTATCAAGAACCCTAATTAACCCAAGAGCTGGTACTGGATGAACTGTANCTCCTCCTTTATCCTTATATACATTTCATGCCTCACCTTCCGTTGATTTAGCAATTTTTTCACTTCATTTATCAGGAATTTCATCAATTATTGGATCAATAAATTTTATTGTTACAATCTTAATAATAAAAAATTTTTCTTTAAATTATAGACAAATACCATTATTTTCATGATCAGTTTTAATTACTACAATTTTACTTTTATTATCATTACCAATTTTAGCTGGAGCAATTACTATACTCCTATTTGATCGAAATTTAAATACCTCATTCTTTGACC-----------------------------------------"
## 
## 
## [[2]]
## [[2]]$id
## [1] "FBAPB481-09"
## 
## [[2]]$name
## [1] "Coelioxys afra"
## 
## [[2]]$gene
## [1] "FBAPB481-09"
## 
## [[2]]$sequence
## [1] "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TTTCCACGAATAAATAATGTAAGATTTTGACTATTACCTCCCTCAATTTTCTTATTATTATCAAGAACCCTAATTAACCCAAGTGCTGGTACTGGATGAACTGTATATCCTCCTTTATCCTTATATACATTTCATGCCTCACCTTCCGTTGATTTAGCAATTTTTTCACTTCATTTATCAGGAATTTCATCAATTATTGGATCAATAAATTTTATTGTTACAATCTTAATAATAAAAAATTTTTCTTTAAATTATAGACAAATACCATTATTTTCATGATCAGTTTTAATTACTACAATTTTACTTTTATTATCATTACCAATTTTAGCTGGAGCAATTACTATACTCCTATTTGATCGAAATTTAAATACCTCATTCTTTGACCCAATAGGAGGAGGAGATCCAATTTTATATCAACATTTATTT"
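
The returned sequences are padded with alignment gaps ("-"), so their raw string length is not the number of bases. One way to summarise such a list is sketched below. It assumes the list structure shown above (elements with id, name and sequence); the sequences are short made-up placeholders rather than real barcodes, and `ungapped_length` and `summary_df` are hypothetical names:

```r
# Stand-in for bold_seq(taxon = 'Coelioxys'): same list structure as above,
# but the sequences are short made-up placeholders, not real barcodes.
seqs <- list(
  list(id = "BCHYM446-13", name = "Coelioxys afra", sequence = "---TTTTTAATAATT--"),
  list(id = "FBAPB481-09", name = "Coelioxys afra", sequence = "-----TTTCCACG")
)

# Count characters after removing alignment gaps ("-")
ungapped_length <- function(x) nchar(gsub("-", "", x, fixed = TRUE))

# One row per record: identifier, taxon name, and ungapped length
summary_df <- data.frame(
  id      = vapply(seqs, function(s) s$id, character(1)),
  name    = vapply(seqs, function(s) s$name, character(1)),
  n_bases = vapply(seqs, function(s) ungapped_length(s$sequence), integer(1)),
  stringsAsFactors = FALSE
)
summary_df
```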

You can optionally get back the httr response object by setting response = TRUE

res <- bold_seq(taxon = 'Coelioxys', response = TRUE)
res$headers
## $date
## [1] "Fri, 17 Apr 2015 19:12:26 GMT"
## 
## $server
## [1] "Apache/2.2.15 (Red Hat)"
## 
## $`x-powered-by`
## [1] "PHP/5.3.15"
## 
## $`content-disposition`
## [1] "attachment; filename=fasta.fas"
## 
## $connection
## [1] "close"
## 
## $`transfer-encoding`
## [1] "chunked"
## 
## $`content-type`
## [1] "application/x-download"
## 
## attr(,"class")
## [1] "insensitive" "list"
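
The content-disposition header above suggests the body of this response is a FASTA file. If you work with the raw response, you need to parse that text yourself. The sketch below is a small base R FASTA parser; it assumes you have already extracted the body as a single string (for example with httr::content(res, as = "text")), and the header layout and sequences in `fasta_txt` are illustrative assumptions, not actual API output. `parse_fasta` is a hypothetical helper and expects every header to be followed by at least one sequence line:

```r
# Stand-in for the text body of the response. Header layout and sequences
# are illustrative assumptions, not real BOLD output.
fasta_txt <- paste0(
  ">BCHYM446-13|Coelioxys afra|COI-5P\n",
  "TTTTTAATAATT\n",
  "TTTTTTTTAG\n",
  ">FBAPB481-09|Coelioxys afra|COI-5P\n",
  "TTTCCACGAATA\n"
)

parse_fasta <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  # Each header starts a new record; cumsum gives a record index per line
  record <- cumsum(is_header)
  headers <- sub("^>", "", lines[is_header])
  # Join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  stats::setNames(as.vector(seqs), headers)
}

recs <- parse_fasta(fasta_txt)
recs
```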

You can do geographic searches

bold_seq(geo = "USA")
## [[1]]
## [[1]]$id
## [1] "NEONV108-11"
## 
## [[1]]$name
## [1] "Aedes thelcter"
## 
## [[1]]$gene
## [1] "NEONV108-11"
## 
## [[1]]$sequence
## [1] "AACTTTATACTTCATCTTCGGAGTTTGATCAGGAATAGTTGGTACATCATTAAGAATTTTAATTCGTGCTGAATTAAGTCAACCAGGTATATTTATTGGAAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCTTTTATTATAATTTTCTTTATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGACTAGTTCCTCTAATATTAGGAGCCCCAGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATACTACCTCCCTCATTAACTCTTCTACTTTCAAGTAGTATAGTAGAAAATGGATCAGGAACAGGATGAACAGTTTATCCACCTCTTTCATCTGGAACTGCTCATGCAGGAGCCTCTGTTGATTTAACTATTTTTTCTCTTCATTTAGCCGGAGTTTCATCAATTTTAGGGGCTGTAAATTTTATTACTACTGTAATTAATATACGATCTGCAGGAATTACTCTTGATCGACTACCTTTATTCGTTTGATCTGTAGTAATTACAGCTGTTTTATTACTTCTTTCACTTCCTGTATTAGCTGGAGCTATTACAATACTATTAACTGATCGAAATTTAAATACATCTTTCTTTGATCCAATTGGAGGAGGAGACCCAATTTTATACCAACATTTATTT"
## 
## 
## [[2]]
## [[2]]$id
## [1] "NEONV109-11"
## 
## [[2]]$name
## [1] "Aedes thelcter"
## 
## [[2]]$gene
## [1] "NEONV109-11"
## 
## [[2]]$sequence
## [1] "AACTTTATACTTCATCTTCGGAGTTTGATCAGGAATAGTTGGTACATCATTAAGAATTTTAATTCGTGCTGAATTAAGTCAACCAGGTATATTTATTGGAAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCTTTTATTATAATTTTCTTTATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGACTAGTTCCTCTAATATTAGGAGCCCCAGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATACTACCTCCCTCATTAACTCTTCTACTTTCAAGTAGTATAGTAGAAAATGGGTCAGGAACAGGATGAACAGTTTATCCACCTCTTTCATCTGGAACTGCTCATGCAGGAGCCTCTGTTGATTTAACTATTTTTTCTCTTCATTTAGCCGGAGTTTCATCAATTTTAGGGGCTGTAAATTTTATTACTACTGTAATTAATATACGATCTGCAGGAATTACTCTTGATCGACTACCTTTATTCGTTTGATCTGTAGTAATTACAGCTGTTTTATTACTTCTTTCACTTCCTGTATTAGCTGGAGCTATTACAATACTATTAACTGATCGAAATTTAAATACATCTTTCTTTGACCCAATTGGAGGGGGAGACCCAATTTTATACCAACATTTATTT"

And you can search by researcher name

bold_seq(researchers = 'Thibaud Decaens')[[1]]
## $id
## [1] "BGABB1142-14"
## 
## $name
## [1] "Coleoptera"
## 
## $gene
## [1] "BGABB1142-14"
## 
## $sequence
## [1] "TTCTTATTTGGTGCTTGATCCGCAATAGTTGGAACTTCTCTTAGATTATTAATTCGATCTGAATTAGGATCCCCAGGATCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCTCATGCTTTTATTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCTTTCCCACGAATAAACAATATAAGATTTTGACTTCTTCCTCCTGCTCTCAGTTTATTAATTATAAGAAGAATTGTAGAAAGAGGGGCTGGAACAGGTTGAACTGTTTATCCTCCTCTATCAGCTAATTTAGCTCATAGAGGTTCTTCTGTAGATTTAGCTATTTTTAGCCTACATTTAGCAGGAGTTTCATCAATCCTTGGAGCTGTAAATTTTATTACTACCGTAATTAATATACGTCCTCAAGGTATAACCTTTGATCGTTTATCCTTATTTATTTGAGCAGTAAAAATTACAGCTATTCTTCTATTACTATCTCTTCCTGTTTTAGCAGGA---------------------------------------------------------------------------"

by record IDs (the example uses process IDs)

bold_seq(ids = c('ACRJP618-11', 'ACRJP619-11'))
## [[1]]
## [[1]]$id
## [1] "ACRJP618-11"
## 
## [[1]]$name
## [1] "Lepidoptera"
## 
## [[1]]$gene
## [1] "ACRJP618-11"
## 
## [[1]]$sequence
## [1] "------------------------TTGAGCAGGCATAGTAGGAACTTCTCTTAGTCTTATTATTCGAACAGAATTAGGAAATCCAGGATTTTTAATTGGAGATGATCAAATCTACAATACTATTGTTACGGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAATTGATTAGTTCCCCTTATACTAGGAGCCCCAGATATAGCTTTCCCTCGAATAAACAATATAAGTTTTTGGCTTCTTCCCCCTTCACTATTACTTTTAATTTCCAGAAGAATTGTTGAAAATGGAGCTGGAACTGGATGAACAGTTTATCCCCCACTGTCATCTAATATTGCCCATAGAGGTACATCAGTAGATTTAGCTATTTTTTCTTTACATTTAGCAGGTATTTCCTCTATTTTAGGAGCGATTAATTTTATTACTACAATTATTAATATACGAATTAACAGTATAAATTATGATCAAATACCACTATTTGTGTGATCAGTAGGAATTACTGCTTTACTCTTATTACTTTCTCTTCCAGTATTAGCAGGTGCTATCACTATATTATTAACGGATCGAAATTTAAATACATCATTTTTTGATCCTGCAGGAGGAGGAGATCCAATTTTATATCAACATTTATTT"
## 
## 
## [[2]]
## [[2]]$id
## [1] "ACRJP619-11"
## 
## [[2]]$name
## [1] "Lepidoptera"
## 
## [[2]]$gene
## [1] "ACRJP619-11"
## 
## [[2]]$sequence
## [1] "AACTTTATATTTTATTTTTGGTATTTGAGCAGGCATAGTAGGAACTTCTCTTAGTCTTATTATTCGAACAGAATTAGGAAATCCAGGATTTTTAATTGGAGATGATCAAATCTACAATACTATTGTTACGGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAATTGATTAGTTCCCCTTATACTAGGAGCCCCAGATATAGCTTTCCCTCGAATAAACAATATAAGTTTTTGGCTTCTTCCCCCTTCACTATTACTTTTAATTTCCAGAAGAATTGTTGAAAATGGAGCTGGAACTGGATGAACAGTTTATCCCCCACTGTCATCTAATATTGCCCATAGAGGTACATCAGTAGATTTAGCTATTTTTTCTTTACATTTAGCAGGTATTTCCTCTATTTTAGGAGCGATTAATTTTATTACTACAATTATTAATATACGAATTAACAGTATAAATTATGATCAAATACCACTATTTGTGTGATCAGTAGGAATTACTGCTTTACTCTTATTACTTTCTCTTCCAGTATTAGCAGGTGCTATCACTATATTATTAACGGATCGAAATTTAAATACATCATTTTTTGATCCTGCAGGAGGAGGAGATCCAATTTTATATCAACATTTATTT"

by container (containers include project codes and dataset codes)

bold_seq(container = 'ACRJP')[[1]]
## $id
## [1] "ACRJP167-09"
## 
## $name
## [1] "Lepidoptera"
## 
## $gene
## [1] "ACRJP167-09"
## 
## $sequence
## [1] "-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TAAGTTTTTGACTTTTACCCCCCTCTTTAATTTTATTAATCTCTAGAAGAATTGTCGAAAATGGAGCAGGTACAGGATGAACAGTATATCCCCCACTTTCATCTAATATTGCTCATGGTGGTTCTTCTGTTGATTTAGCTATTTTTTCTCTTCATTTAGCCGGAATTTCCTCTATTTTAGGAGCAATTAATTTTATTACTACTATTATTAATATACGAGTTAATAATTTATCATTTGATCAAATACCTTTATTTGTTTGAGCTGTTGGTATTACTGCCTTATTACTTCTTCTTTCTTTACCAGTTTTAGCTGGAGCCATTACTATACTTTTAACAGATCGAAATCTTAATACTTCATTTTTTGACCCAGCTGGAGGAGGAGACCCAATTTTATACCAACATTTATTT"

by bin (a bin is a Barcode Index Number)

bold_seq(bin = 'BOLD:AAA5125')[[1]]
## $id
## [1] "ASARD6776-12"
## 
## $name
## [1] "Lepidoptera"
## 
## $gene
## [1] "ASARD6776-12"
## 
## $sequence
## [1] "AACTTTATATTTTATTTTTGGAATTTGAGCAGGTATAGTAGGAACTTCTTTAAGATTACTAATTCGAGCAGAATTAGGTACCCCCGGATCTTTAATTGGAGATGACCAAATTTATAATACCATTGTAACAGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATACTAGGAGCTCCTGATATAGCTTTCCCCCGAATAAATAATATAAGATTTTGACTATTACCCCCATCTTTAACCCTTTTAATTTCTAGAAGAATTGTCGAAAATGGAGCTGGAACTGGATGAACAGTTTATCCCCCCCTTTCATCTAATATTGCTCATGGAGGCTCTTCTGTTGATTTAGCTATTTTTTCCCTTCATCTAGCTGGAATCTCATCAATTTTAGGAGCTATTAATTTTATCACAACAATCATTAATATACGACTAAATAATATAATATTTGACCAAATACCTTTATTTGTATGAGCTGTTGGTATTACAGCATTTCTTTTATTGTTATCTTTACCTGTACTAGCTGGAGCTATTACTATACTTTTAACAGATCGAAACTTAAATACATCATTTTTTGACCCAGCAGGAGGAGGAGATCCTATTCTCTATCAACATTTATTT"

There are more ways to query; see the documentation at ?bold_seq.

Search for specimen data only

The BOLD specimen API doesn't give back sequences, only specimen data. By default the data is downloaded in tsv format and returned to you as a data.frame.

res <- bold_specimens(taxon = 'Osmia')
head(res[,1:8])
##      processid         sampleid recordID       catalognum         fieldnum
## 1  BBHYL362-10     10BBCHY-3316  1769805     10BBCHY-3316   L#PC2010EI-002
## 2 BCHYM1499-13 BC ZSM HYM 19359  4005348 BC ZSM HYM 19359 BC ZSM HYM 19359
## 3  BCHYM411-13 BC ZSM HYM 18271  3896352 BC ZSM HYM 18271 BC ZSM HYM 18271
## 4  BCHYM413-13 BC ZSM HYM 18273  3896354 BC ZSM HYM 18273 BC ZSM HYM 18273
## 5  FBAPB700-09 BC ZSM HYM 02175  1289061 BC ZSM HYM 02175 BC ZSM HYM 02175
## 6  FBAPC355-10 BC ZSM HYM 05960  1709621 BC ZSM HYM 05960 BC ZSM HYM 05960
##                    institution_storing      bin_uri phylum_taxID
## 1    Biodiversity Institute of Ontario BOLD:AAB8874           20
## 2 Bavarian State Collection of Zoology BOLD:AAD6282           20
## 3 Bavarian State Collection of Zoology BOLD:AAP2416           20
## 4 Bavarian State Collection of Zoology BOLD:AAP2416           20
## 5 Bavarian State Collection of Zoology BOLD:AAI1853           20
## 6 Bavarian State Collection of Zoology BOLD:AAK6070           20
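
Since the result is an ordinary data.frame, the usual base R tools apply. A minimal sketch, assuming a hand-built data.frame that copies three columns of the six rows printed above (in practice you would use `res` directly); the names `specimens`, `per_institution`, `n_bins` and `shared` are illustrative:

```r
# Stand-in for bold_specimens(taxon = 'Osmia'); values copied from the
# first six rows shown above, three columns only.
specimens <- data.frame(
  processid = c("BBHYL362-10", "BCHYM1499-13", "BCHYM411-13",
                "BCHYM413-13", "FBAPB700-09", "FBAPC355-10"),
  institution_storing = c("Biodiversity Institute of Ontario",
                          rep("Bavarian State Collection of Zoology", 5)),
  bin_uri = c("BOLD:AAB8874", "BOLD:AAD6282", "BOLD:AAP2416",
              "BOLD:AAP2416", "BOLD:AAI1853", "BOLD:AAK6070"),
  stringsAsFactors = FALSE
)

# Number of records held by each institution
per_institution <- table(specimens$institution_storing)

# Number of distinct Barcode Index Numbers among these records
n_bins <- length(unique(specimens$bin_uri))

# Records that share a BIN with at least one other record
dup <- duplicated(specimens$bin_uri) |
  duplicated(specimens$bin_uri, fromLast = TRUE)
shared <- specimens[dup, ]
shared
```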

You can optionally get back the data in XML format

bold_specimens(taxon = 'Osmia', format = 'xml')
<?xml version="1.0" encoding="UTF-8"?>
<bold_records  xsi:noNamespaceSchemaLocation="http://www.boldsystems.org/schemas/BOLDPublic_record.xsd"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record>
    <record_id>1470124</record_id>
    <processid>BOM1525-10</processid>
    <bin_uri>BOLD:AAN3337</bin_uri>
    <specimen_identifiers>
      <sampleid>DHB 1011</sampleid>
      <catalognum>DHB 1011</catalognum>
      <fieldnum>DHB1011</fieldnum>
      <institution_storing>Marjorie Barrick Museum</institution_storing>
    </specimen_identifiers>
    <taxonomy>

You can choose to get the httr response object back if you'd rather work with the raw data returned from the BOLD API.

res <- bold_specimens(taxon = 'Osmia', format = 'xml', response = TRUE)
res$url
## [1] "http://www.boldsystems.org/index.php/API_Public/specimen?taxon=Osmia&specimen_download=xml"
res$status_code
## [1] 200
res$headers
## $date
## [1] "Fri, 17 Apr 2015 19:14:09 GMT"
## 
## $server
## [1] "Apache/2.2.15 (Red Hat)"
## 
## $`x-powered-by`
## [1] "PHP/5.3.15"
## 
## $`content-disposition`
## [1] "attachment; filename=bold_data.xml"
## 
## $connection
## [1] "close"
## 
## $`transfer-encoding`
## [1] "chunked"
## 
## $`content-type`
## [1] "application/x-download"
## 
## attr(,"class")
## [1] "insensitive" "list"
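
When you ask for the raw response, it is worth checking the status code before parsing anything. The sketch below uses plain lists that mimic only the `url` and `status_code` fields shown above, so it runs without a network call; `check_response` is a hypothetical helper, and the 500 status is made up for illustration. On a real httr response, httr::stop_for_status() does a similar job.

```r
# Stop with an informative message unless the request succeeded
check_response <- function(res) {
  if (!identical(as.integer(res$status_code), 200L)) {
    stop("BOLD request failed with status ", res$status_code,
         " for ", res$url)
  }
  invisible(res)
}

# Plain lists standing in for httr response objects (illustrative only)
ok <- list(
  url = "http://www.boldsystems.org/index.php/API_Public/specimen?taxon=Osmia&specimen_download=xml",
  status_code = 200L
)
bad <- list(url = ok$url, status_code = 500L)

check_response(ok)                       # passes silently
failed <- inherits(try(check_response(bad), silent = TRUE), "try-error")
failed
```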

Search for specimen plus sequence data

The combined specimen/sequence API gives back both specimen and sequence data. Like the specimen API, it delivers tsv format data by default, which is returned to you as a data.frame. Here, we set sepfasta=TRUE so that the sequences are removed from the data.frame and returned separately as a list, which keeps the data.frame more manageable.

res <- bold_seqspec(taxon = 'Osmia', sepfasta = TRUE)
res$fasta[1:2]
## $`BBHYL362-10`
## [1] "AATTTTATATATAATTTTTGCTATATGATCAGGAATAATTGGTTCAGCAATAAGAATTATTATTCGAATAGAATTAAGAATTCCTGGTTCATGAATTTCAAATGATCAAACTTATAATTCTTTAGTTACTGCTCATGCTTTTTTAATAATTTTTTTTTTAGTTATACCATTCTTAATTGGGGGATTTGGAAATTGATTAATTCCTTTAATATTAGGAATTCCAGATATAGCATTTCCACGAATAAATAATATTAGATTTTGACTTTTACCTCCTTCTTTAATACTTTTATTATTAAGAAATTTTATAAATCCTAGTCCAGGAACTGGATGAACTGTTTATCCACCTTTATCTTCTCATTTATTTCATTCTTCTCCTTCAGTTGATATAGCTATTTTTTCTTTACATATTTCTGGTTTATCTTCTATTATAGGTTCATTAAATTTTATTGTTACAATTATTATAATAAAAAATATTTCTTTAAAACATATTCAATTACCTTTATTTCCTTGATCTGTCTTTATTACTACTATTTTATTACTTTTTTCTTTACCTGTTTTAGCAGGTGCAATTACTATATTATTATTTGATCGAAATTTTAATACTTCATTTTTTGATCCTACAGGAGGAGGAGATCCTATTCTTTATCAACATTTATTT"
## 
## $`BCHYM1499-13`
## [1] "AATTCTTTACATAATTTTTGCTTTATGATCTGGAATAATTGGGTCAGCAATAAGAATTATTATTCGAATAGAATTAAGTATCCCAGGTTCATGAATTACTAATGATCAAATTTATAATTCTTTAGTAACTGCACATGCTTTTTTAATAATTTTTTTTCTTGTGATACCATTTTTAATTGGAGGATTTGGAAATTGATTAATTCCTTTAATATTAGGAATTCCAGATATAGCTTTCCCACGAATAAACAATATTAGATTTTGATTATTACCGCCATCTTTAATATTATTACTTTTAAGAAATTTTTTAAATCCAAGTCCTGGAACAGGATGAACAGTTTATCCCCCTTTATCATCAAATTTATTTCATTCTTCTCCTTCAGTTGATTTAGCAATTTTTTCTTTACATATTTCAGGTTTATCTTCTATTATAGGTTCATTAAATTTTATTGTTACAATTATTATAATAAAAAATATTTCTTTAAAATATATTCAATTGCCTTTATTTCCTTGATCTGTATTTATTACTACTATTCTTTTATTATTTTCTTTACCTGTGTTAGCTGGAGCTATTACTATATTATTATTTGATCGAAATTTTAATACATCTTTTTTTGATCCTACAGGAGGAGGAGATCCAATTCTTTATCAACATTTATTT"

Or you can extract a specific sequence by its process ID

res$fasta['GBAH0293-06']
## $`GBAH0293-06`
## [1] "------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TTAATGTTAGGGATTCCAGATATAGCTTTTCCACGAATAAATAATATTAGATTTTGACTGTTACCTCCATCTTTAATATTATTACTTTTAAGAAATTTTTTAAATCCAAGTCCTGGAACAGGATGAACAGTTTATCCTCCTTTATCATCAAATTTATTTCATTCTTCTCCTTCAGTTGATTTAGCAATTTTTTCTTTACATATTTCAGGTTTATCTTCTATTATAGGTTCATTAAATTTTATTGTTACAATTATTATAATAAAAAATATTTCTTTAAAATATATTCAATTACCTTTATTTTCTTGATCTGTATTTATTACTACTATTCTTTTATTATTTTCTTTACCTGTATTAGCTGGAGCTATTACTATATTATTATTTGATCGAAATTTTAATACATCTTTTTTTGATCCAACAGGAGGGGGAGATCCAATTCTTTATCAACATTTATTTTGATTTTTTGGTCATCCTGAAGTTTATATTTTAATTTTACCTGGATTTGGATTAATTTCTCAAATTATTTCTAATGAAAGAGGAAAAAAAGAAACTTTTGGAAATATTGGTATAATTTATGCTATATTAAGAATTGGACTTTTAGGTTTTATTGTT---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
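
A named list of sequences like res$fasta is easy to save as a FASTA file for use in other tools. A minimal sketch, assuming a list of the same shape; the two sequences are only the first 16 bases of the records shown above, and `write_fasta` is a hypothetical helper, not a function from bold:

```r
# Stand-in for res$fasta[1:2]; sequences truncated to their first 16 bases
fasta <- list(
  `BBHYL362-10`  = "AATTTTATATATAATT",
  `BCHYM1499-13` = "AATTCTTTACATAATT"
)

write_fasta <- function(seqs, path) {
  # Interleave ">name" header lines with their sequences:
  # rbind() builds a 2-row matrix, as.vector() reads it column by column
  lines <- as.vector(rbind(paste0(">", names(seqs)),
                           unlist(seqs, use.names = FALSE)))
  writeLines(lines, path)
  invisible(path)
}

out <- tempfile(fileext = ".fas")
write_fasta(fasta, out)
readLines(out)
```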

Get trace files

bold_trace downloads trace files to your machine. It does not load them into your R session, but it prints the paths of the downloaded files.

bold_trace(taxon = 'Osmia', quiet = TRUE)
## 
## <bold trace files> 
## 
## /Users/sacmac/github/ropensci/bold/inst/vign/bold_trace_files/a12.Lep-R.ab1
## /Users/sacmac/github/ropensci/bold/inst/vign/bold_trace_files/a12.MLep-F.ab1
## /Users/sacmac/github/ropensci/bold/inst/vign/bold_trace_files/ASGCB253-13[LepF1,LepR1]_F.ab1

... cutoff
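
Because the files are only written to disk, a later step usually starts by collecting their paths. The sketch below shows one way to do that with base R. It creates a temporary directory with empty placeholder files so that it runs on its own; the directory name follows the output above, the file notes.txt is made up, and reading the .ab1 files themselves needs a separate package and is outside this sketch.

```r
# Temporary directory with empty placeholder files (illustration only)
trace_dir <- file.path(tempdir(), "bold_trace_files")
dir.create(trace_dir, showWarnings = FALSE)
file.create(file.path(trace_dir,
                      c("a12.Lep-R.ab1", "a12.MLep-F.ab1", "notes.txt")))

# Collect the full paths of all .ab1 trace files in that directory
ab1_files <- list.files(trace_dir, pattern = "\\.ab1$", full.names = TRUE)
basename(ab1_files)
```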