We have entered the age of data-intensive scientific discovery. As data sets increase in complexity and heterogeneity, we must preserve the cycle of data citation from primary data sources to aggregating databases to research products and back to primary data sources. The citation cycle keeps science transparent, and it is also key to supporting primary providers by documenting the use of their data. The Global Biodiversity Information Facility (GBIF), Botanical Information and Ecology Network (BIEN), and other data aggregators have made great strides in harvesting citation data from research products and linking them back to primary data providers. However, this only works if researchers who publish research products cite primary data sources in the first place. We developed occCite, a set of R-based tools for downloading, managing, and citing biodiversity data, to advance toward the goal of closing the data provenance cycle. These tools preserve links between occurrence data and primary providers once researchers download aggregated data, and facilitate the citation of primary data providers in research papers.
The occCite workflow follows a three-step process. First, the user inputs one or more taxonomic names (or a phylogeny). occCite then rectifies these names by checking them against one or more taxonomic databases, which can be specified by the user (see the Global Names List). The results of the taxonomic rectification are kept in an occCiteData object in local memory. Next, occCite takes the occCiteData object and user-defined search parameters to query BIEN (through the BIEN R package) and/or GBIF (through rgbif) for records. The results are appended to the occCiteData object, along with metadata on the search. Finally, the user can pass the occCiteData object to occCitation, which compiles citations for the primary providers, database aggregators, and R packages used to build the dataset.
Future iterations of occCite will track citation data through the data cleaning process and provide a series of visualizations on raw query results and final data sets. The package will also provide data citations in a format congruent with best-practice recommendations for large biodiversity data sets. Based on these data citation tools, we will also propose a new set of standards for citing primary biodiversity data in published research articles that provides due credit to contributors and allows them to track the use of their work. Keep checking back!
If you plan to query GBIF, you will need to provide your GBIF user login information. We have provided a dummy login below to show you the format; you will need to replace it with actual account information. A login is required because occCite downloads all of the records available for the species using occ_download(), instead of getting results from occ_search(), which has a hard limit of 200,000 occurrences.
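If you would rather keep credentials out of your scripts, one option is to read them from environment variables. The following is a minimal sketch under stated assumptions: readGBIFCredentials() is a hypothetical helper, not part of occCite, and it uses the variable names GBIF_USER, GBIF_EMAIL, and GBIF_PWD, which are the names rgbif looks for. The commented usage lines assume that occCite's GBIFLoginManager() takes user, email, and pwd arguments; check ?GBIFLoginManager before relying on that.

```r
# Hypothetical helper (not part of occCite): collect GBIF credentials from
# environment variables instead of hard-coding them in a script.
readGBIFCredentials <- function() {
  vars <- c(user = "GBIF_USER", email = "GBIF_EMAIL", pwd = "GBIF_PWD")
  creds <- lapply(vars, Sys.getenv, unset = "")
  # Identify any variable that is unset or empty.
  missing <- vars[vapply(creds, function(v) !nzchar(v), logical(1))]
  if (length(missing) > 0) {
    stop("Missing environment variable(s): ", paste(missing, collapse = ", "))
  }
  creds
}

# Usage, assuming the three variables are set (for example in ~/.Renviron)
# and that GBIFLoginManager() accepts these argument names:
# creds <- readGBIFCredentials()
# GBIFLogin <- GBIFLoginManager(user = creds$user,
#                               email = creds$email,
#                               pwd = creds$pwd)
```

The helper stops with an explicit message rather than returning empty strings, so a missing credential is caught before any download is requested.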
At its simplest, occCite allows you to search for occurrences for a single species. The taxonomy of the user-specified species will be verified using EOL and NCBI taxonomies by default.
# Simple search
mySimpleOccCiteObject <- occQuery(x = "Protea cynaroides",
                                  datasources = c("gbif", "bien"),
                                  GBIFLogin = GBIFLogin,
                                  GBIFDownloadDirectory =
                                    system.file('extdata/', package = 'occCite'),
                                  checkPreviousGBIFDownload = TRUE)
Here is what the GBIF results look like:
# GBIF search results
head(mySimpleOccCiteObject@occResults$`Protea cynaroides`$GBIF$OccurrenceTable)
## name longitude latitude day month year
## 1 Protea cynaroides 26.51756 -33.34703 22 10 2020
## 2 Protea cynaroides 19.45966 -34.52285 7 11 2020
## 3 Protea cynaroides 19.13672 -33.76127 1 11 2020
## 4 Protea cynaroides 18.42365 -33.96614 28 3 2019
## 5 Protea cynaroides 18.42872 -33.99052 6 9 2020
## 6 Protea cynaroides 25.23694 -33.88793 4 11 2020
## Dataset DatasetKey
## 1 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## 2 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## 3 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## 4 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## 5 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## 6 iNaturalist research-grade observations 50c9509d-22c7-4a22-a47d-8c48425ef4a7
## DataService
## 1 GBIF
## 2 GBIF
## 3 GBIF
## 4 GBIF
## 5 GBIF
## 6 GBIF
And here are the BIEN results:
#BIEN search results
head(mySimpleOccCiteObject@occResults$`Protea cynaroides`$BIEN$OccurrenceTable)
## name longitude latitude day month year Dataset DatasetKey
## 1 Protea cynaroides 22.875 -33.875 20 8 1973 SANBI 2249
## 2 Protea cynaroides 25.125 -33.875 3 7 1934 SANBI 2249
## 3 Protea cynaroides 20.375 -33.875 16 8 1952 SANBI 2249
## 4 Protea cynaroides 21.375 -33.375 20 3 1947 SANBI 2249
## 5 Protea cynaroides 20.875 -34.125 21 6 1987 SANBI 2249
## 6 Protea cynaroides 24.625 -33.625 12 9 1973 SANBI 2249
## DataService
## 1 BIEN
## 2 BIEN
## 3 BIEN
## 4 BIEN
## 5 BIEN
## 6 BIEN
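Because the GBIF and BIEN tables share the same columns, they can be stacked into a single data frame for downstream work. Here is a small sketch in base R; gbifRows and bienRows are stand-ins for the two OccurrenceTable data frames, filled with the first two rows of each table printed above, and the column types are an assumption.

```r
# Stand-ins for the two OccurrenceTable data frames, using the first two rows
# of each table printed above.
gbifRows <- data.frame(
  name = "Protea cynaroides",
  longitude = c(26.51756, 19.45966),
  latitude = c(-33.34703, -34.52285),
  day = c(22, 7), month = c(10, 11), year = c(2020, 2020),
  Dataset = "iNaturalist research-grade observations",
  DatasetKey = "50c9509d-22c7-4a22-a47d-8c48425ef4a7",
  DataService = "GBIF",
  stringsAsFactors = FALSE
)
bienRows <- data.frame(
  name = "Protea cynaroides",
  longitude = c(22.875, 25.125),
  latitude = c(-33.875, -33.875),
  day = c(20, 3), month = c(8, 7), year = c(1973, 1934),
  Dataset = "SANBI",
  DatasetKey = "2249",
  DataService = "BIEN",
  stringsAsFactors = FALSE
)

# Identical column names mean the tables can be stacked directly.
allRows <- rbind(gbifRows, bienRows)

# Count records per aggregator and find the span of collection years.
table(allRows$DataService)
range(allRows$year)
```

The DataService column keeps track of which aggregator each record came from, so the link back to the source survives the merge.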
There is also a summary method for occCite objects with some basic information about your search.
##
## OccCite query occurred on: 24 November, 2020
##
## User query type: User-supplied list of taxa.
##
## Sources for taxonomic rectification: NCBI
##
##
## Taxonomic cleaning results:
##
## Input Name Best Match Taxonomic Databases w/ Matches
## 1 Protea cynaroides Protea cynaroides NCBI
##
## Sources for occurrence data: gbif, bien
##
## Species Occurrences Sources
## 1 Protea cynaroides 1293 17
##
## GBIF dataset DOIs:
##
## Species GBIF Access Date GBIF DOI
## 1 Protea cynaroides 2020-11-23 10.15468/dl.2449qy
If you want to visualize the results of your search, you can use the plot method on occCite objects to generate several kinds of summary plots.
After doing a search for occurrence points, you can use occCitation() to generate citations for primary biodiversity databases, as well as database aggregators. Note: Currently, GBIF and BIEN are the only aggregators for which citations are supported.
Here is a simple way of generating a formatted citation document from the results of occCitation().
## Warning in utils::citation(pkg): no date field in DESCRIPTION file of package
## 'occCite'
## Warning in utils::citation(pkg): could not determine year for 'occCite' from
## package DESCRIPTION file
## Ignoring entry titled "occCite: Querying and Managing Large Biodiversity Occurrence Datasets" because owensocccite: A bibentry of bibtype 'Manual' has to specify the field: c("year", "date")
## Writing 4 Bibtex entries ... OK
## Results written to file 'temp.bib'
## AFFOUARD A, JOLY A, LOMBARDO J, CHAMP J, GOEAU H, BONNET P (2020). Pl@ntNet automatically identified occurrences. Version 1.2. Pl@ntNet. https://doi.org/10.15468/mma2ec. Accessed via GBIF on 2020-11-23.
## AFFOUARD A, JOLY A, LOMBARDO J, CHAMP J, GOEAU H, BONNET P (2020). Pl@ntNet observations. Version 1.2. Pl@ntNet. https://doi.org/10.15468/gtebaa. Accessed via GBIF on 2020-11-23.
## Cameron E, Auckland Museum A M (2020). Auckland Museum Botany Collection. Version 1.54. Auckland War Memorial Museum. https://doi.org/10.15468/mnjkvv. Accessed via GBIF on 2020-11-23.
## Capers R (2014). CONN. University of Connecticut. https://doi.org/10.15468/w35jmd. Accessed via GBIF on 2020-11-23.
## Chamberlain, S., Barve, V., Mcglinn, D., Oldoni, D., Desmet, P., Geffert, L., Ram, K. (2020). rgbif: Interface to the Global Biodiversity Information Facility API. R package version 3.3.0. https://CRAN.R-project.org/package=rgbif.
## Chamberlain, S., Boettiger, C. (2017). R Python, and Ruby clients for GBIF species occurrence data. PeerJ PrePrints.
## Fatima Parker-Allie, Ranwashe F (2018). PRECIS. South African National Biodiversity Institute. https://doi.org/10.15468/rckmn2. Accessed via GBIF on 2020-11-23.
## MNHN, Chagnoux S (2020). The vascular plants collection (P) at the Herbarium of the Muséum national d'Histoire Naturelle (MNHN - Paris). Version 69.189. MNHN - Museum national d'Histoire naturelle. https://doi.org/10.15468/nc6rxy. Accessed via GBIF on 2020-11-23.
## MNHN. https://doi.org/10.15468/dl.fpwlzt. Accessed via BIEN on 2018-08-14.
## Magill B, Solomon J, Stimmel H (2020). Tropicos Specimen Data. Missouri Botanical Garden. https://doi.org/10.15468/hja69f. Accessed via GBIF on 2020-11-23.
## Maitner, B. (2020). BIEN: Tools for Accessing the Botanical Information and Ecology. R package version 1.2.4. https://CRAN.R-project.org/package=BIEN.
## Missouri Botanical Garden,Herbarium. Accessed via BIEN on NA.
## NSW. https://doi.org/10.15468/dl.fpwlzt. Accessed via BIEN on 2018-08-14.
## R Core Team. (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
## SANBI. https://doi.org/10.15468/dl.fpwlzt. Accessed via BIEN on 2018-08-14.
## Senckenberg (2020). African Plants - a photo guide. https://doi.org/10.15468/r9azth. Accessed via GBIF on 2020-11-23.
## Tela Botanica. Carnet en Ligne. https://doi.org/10.15468/rydcn2. Accessed via GBIF on 2020-11-23.
## UConn. https://doi.org/10.15468/dl.fpwlzt. Accessed via BIEN on 2018-08-14.
## Ueda K (2020). iNaturalist Research-grade Observations. iNaturalist.org. https://doi.org/10.15468/ab3s5x. Accessed via GBIF on 2020-11-23.
## de Vries H, Lemmens M. Observation.org, Nature data from around the World. Observation.org. https://doi.org/10.15468/5nilie. Accessed via GBIF on 2020-11-23.
## naturgucker.de. naturgucker. https://doi.org/10.15468/uc1apo. Accessed via GBIF on 2020-11-23.
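The software entries in the list above (R itself and the R packages) can also be exported to a .bib file with base R alone. This is a minimal sketch and not the code that produced the output above. The package names in pkgs are placeholders that ship with R, so swap in packages such as rgbif or BIEN if you have them installed.

```r
# Packages to cite. "base" and "stats" ship with R and both resolve to the
# citation for R itself; replace them with the packages you actually used.
pkgs <- c("base", "stats")

# Convert each package citation to BibTeX.
entries <- lapply(pkgs, function(p) utils::toBibtex(utils::citation(p)))

# Write all entries to a single .bib file in a temporary location.
bibFile <- tempfile(fileext = ".bib")
writeLines(unlist(lapply(entries, as.character)), bibFile)

# Each BibTeX entry starts with an "@" line, so this counts the entries written.
sum(grepl("^@", readLines(bibFile)))
```

Citations for the primary data providers still need to come from occCitation(), since only it knows which datasets contributed records to your query.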
In the simplest of searches, such as the one above, the taxonomy of your input species name is automatically rectified through the occCite function studyTaxonList() using gnr_resolve() from the taxize R package. If you would like to change the source of the taxonomy being used to rectify your species names, you can specify as many taxonomic repositories as you like from the Global Names Index (GNI). The complete list of GNI repositories can be found here.
studyTaxonList() chooses the taxonomic names closest to those being input and documents which taxonomic repositories agreed with those names. studyTaxonList() instantiates an occCiteData object the same way occQuery() does. This object can be passed into occQuery() to perform your occurrence data search.
#Rectify taxonomy
myTROccCiteObject <- studyTaxonList(x = "Protea cynaroides",
                                    datasources = c("NCBI", "EOL", "ITIS"))
myTROccCiteObject@cleanedTaxonomy
## Input Name Best Match Taxonomic Databases w/ Matches
## 1 Protea cynaroides Protea cynaroides NCBI
Querying GBIF can take quite a bit of time, especially for multiple species and/or well-known species. In this case, you may wish to load previously downloaded data sets from your computer by specifying the general location of your downloaded .zip files. occQuery will crawl through your specified GBIFDownloadDirectory to collect all the .zip files contained in that folder and its subfolders. It will then import the most recent downloads that match your taxon list. These GBIF data will be appended to a BIEN search (if you chose BIEN as well as GBIF), just as in the simple real-time search shown above. checkPreviousGBIFDownload is TRUE by default, but if loadLocalGBIFDownload is TRUE, occQuery will ignore checkPreviousGBIFDownload. It is also worth noting that occCite does not currently support mixed data download sources. That is, you cannot run new GBIF queries for some taxa, download previously prepared data sets for others, and load the rest from local data sets on your computer in a single query.
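To get a feel for what a recursive search of a download directory involves, here is a minimal base R sketch. findZipDownloads() is a hypothetical helper, not occCite's internal code: it only lists .zip files newest first, whereas occQuery also matches downloads to your taxon list.

```r
# Hypothetical helper (not occCite's internal code): list every .zip file under
# a directory, including subfolders, sorted from newest to oldest.
findZipDownloads <- function(downloadDir) {
  zips <- list.files(downloadDir, pattern = "\\.zip$",
                     recursive = TRUE, full.names = TRUE)
  info <- file.info(zips)
  found <- data.frame(path = zips, modified = info$mtime,
                      stringsAsFactors = FALSE)
  found[order(found$modified, decreasing = TRUE), ]
}

# Example with a throwaway directory containing two empty .zip files.
demoDir <- file.path(tempdir(), "gbifDemo")
dir.create(file.path(demoDir, "older"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demoDir, "older", "a.zip"), file.path(demoDir, "b.zip"))
findZipDownloads(demoDir)
```

Sorting by modification time is one simple way to define "most recent"; the date recorded inside the download itself would be an alternative.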
# Simple load
myOldOccCiteObject <- occQuery(x = "Protea cynaroides",
                               datasources = c("gbif", "bien"),
                               GBIFLogin = NULL,
                               GBIFDownloadDirectory = system.file('extdata/', package = 'occCite'),
                               loadLocalGBIFDownload = TRUE,
                               checkPreviousGBIFDownload = FALSE)
## Error in is.nan(x): default method not implemented for type 'list'
Here is the result. Look familiar?
## Error in head(myOldOccCiteObject@occResults$`Protea cynaroides`$GBIF$OccurrenceTable): object 'myOldOccCiteObject' not found
## Error in summary(myOldOccCiteObject): object 'myOldOccCiteObject' not found
Getting citation data works exactly the same way with previously downloaded data as it does with a fresh data set.
## Error in occCitation(myOldOccCiteObject): object 'myOldOccCiteObject' not found
## Error in print(myOldOccCitations): object 'myOldOccCitations' not found
Note that you can also load multiple species using either a vector of species names or a phylogeny (provided you have previously downloaded data for all of the species of interest), and you can load occurrences from non-GBIF data sources (e.g. BIEN) in the same query.
In addition to doing a simple, single species search, you can also use occCite to search for and manage occurrence datasets for multiple species. You can either submit a vector of species names, or you can submit a phylogeny! In the case of multiple species, occCitation() will return a named list of citation tables.
Here is an example of how such a search is structured, using an unpublished phylogeny of billfishes.
library(ape)
#Get tree
treeFile <- system.file("extdata/Fish_12Tax_time_calibrated.tre", package='occCite')
phylogeny <- ape::read.nexus(treeFile)
tree <- ape::extract.clade(phylogeny, 18)
#Query databases for names
myPhyOccCiteObject <- studyTaxonList(x = tree, datasources = "NCBI")
#Query GBIF for occurrence data
myPhyOccCiteObject <- occQuery(x = myPhyOccCiteObject,
                               datasources = "gbif",
                               GBIFDownloadDirectory = system.file('extdata/', package = 'occCite'),
                               loadLocalGBIFDownload = TRUE,
                               checkPreviousGBIFDownload = FALSE)
## Error in is.nan(x): default method not implemented for type 'list'
##
##
## User query type: User-supplied phylogeny.
##
## Sources for taxonomic rectification: NCBI
##
##
## Taxonomic cleaning results:
##
## Input Name Best Match
## 1 Istiompax_indica Istiompax indica
## 2 Kajikia_albida Kajikia albida
## 3 Kajikia_audax Kajikia audax
## 4 Tetrapturus_angustirostris Tetrapturus angustirostris
## 5 Tetrapturus_belone Tetrapturus belone
## 6 Tetrapturus_georgii Tetrapturus georgii
## 7 Tetrapturus_pfluegeri Tetrapturus pfluegeri
## Taxonomic Databases w/ Matches
## 1 NCBI
## 2 NCBI
## 3 NCBI
## 4 NCBI
## 5 NCBI
## 6 NCBI
## 7 NCBI
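In the table above, the Input Name and Best Match columns differ only in that underscores in the tip labels have become spaces. If you want to build a plain vector of species names from tip labels yourself, the same normalisation is one line of base R. This is an illustration of the idea, not occCite's internal code.

```r
# Tip labels as they appear in the "Input Name" column above.
tipLabels <- c("Istiompax_indica", "Kajikia_albida", "Kajikia_audax")

# Replace underscores with spaces to get plain binomials.
queryNames <- gsub("_", " ", tipLabels, fixed = TRUE)
queryNames

# A character vector like this can be supplied as x in place of a phylogeny:
# studyTaxonList(x = queryNames, datasources = "NCBI")
```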
When you have results for multiple species, as in this case, you can also plot the summary figures either for the whole search…
## Error in d.res[[x]]: subscript out of bounds
or you can plot the results by species!
## Error in d.res[[x]]: subscript out of bounds
And then you can print out the citations, separated by species (or not, but in this example, they’re separate).
#Get citations
myPhyOccCitations <- occCitation(myPhyOccCiteObject)
#Print citations as text with accession dates.
print(myPhyOccCitations, bySpecies = TRUE)
## Error in x$occCitationResults[[i]]: subscript out of bounds