In this tutorial, we walk step by step through downloading occurrence data from GBIF, OBIS, and iDigBio using existing R packages, and from InvertEbase via a local CSV file. We then standardize the format of each data set so that they are compatible, and merge them into a single data set.
Example species: Mexacanthina lugubris
library(EcoCleanR)
library(rgbif)
#> Warning: package 'rgbif' was built under R version 4.4.3
library(robis)
#> Warning: package 'robis' was built under R version 4.4.1
#>
#> Attaching package: 'robis'
#> The following object is masked from 'package:rgbif':
#>
#> dataset
library(ridigbio)
#> Warning: package 'ridigbio' was built under R version 4.4.3
library(dplyr)
The attributes in this list can be changed or extended as required.
attribute_list <- c(
  "source", "catalogNumber", "basisOfRecord", "occurrenceStatus",
  "institutionCode", "verbatimEventDate", "scientificName",
  "individualCount", "organismQuantity", "abundance",
  "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters",
  "locality", "verbatimLocality", "municipality", "county",
  "stateProvince", "country", "countryCode"
)
This step uses the occ_data function of the rgbif package to extract data from GBIF.
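# Assumption: species_name and taxonkey are not defined in the text shown here.
# One way to obtain them is from the example species via rgbif's name_backbone().
species_name <- "Mexacanthina lugubris"
taxonkey <- name_backbone(name = species_name)$usageKey
gbif.occ <- occ_data(taxonKey = taxonkey, occurrenceStatus = NULL, limit = 10000L)$data
# See article/cite_data.Rmd for instructions on how to cite data from GBIF data providers
## Add a field that records the source of each record
gbif.occ$source <- "gbif"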
for (field in attribute_list) {
  if (!field %in% names(gbif.occ)) {
    gbif.occ[[field]] <- NA # Add the missing field as NA
  }
}
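The same loop is repeated for each source in this tutorial. As a minimal sketch, it could be wrapped in a small helper. The name `add_missing_fields` is hypothetical and not part of EcoCleanR, and the data frame below is toy data rather than a real download.

```r
# Hypothetical helper (not part of EcoCleanR): add every field that is
# missing from `df` as an NA column, leaving existing columns untouched.
add_missing_fields <- function(df, fields) {
  for (field in setdiff(fields, names(df))) {
    df[[field]] <- NA
  }
  df
}

# Toy data standing in for a downloaded occurrence table
toy <- data.frame(catalogNumber = c("a1", "a2"), stringsAsFactors = FALSE)
toy <- add_missing_fields(toy, c("source", "catalogNumber", "county"))
names(toy)
```

New columns are appended after the existing ones, so the later subsetting by `attribute_list` is still what fixes the column order.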
## Build a single abundance column: use individualCount, falling back to organismQuantity when it is missing
gbif.occ$abundance <- ifelse(
  is.na(as.numeric(gbif.occ$individualCount)),
  as.numeric(gbif.occ$organismQuantity),
  as.numeric(gbif.occ$individualCount)
)
gbif.occ_temp <- gbif.occ[, attribute_list]
str(gbif.occ_temp[, 1:3])
#> tibble [1,927 × 3] (S3: tbl_df/tbl/data.frame)
#> $ source : chr [1:1927] "gbif" "gbif" "gbif" "gbif" ...
#> $ catalogNumber: chr [1:1927] "258336784" "258586406" "260117394" "261990463" ...
#> $ basisOfRecord: chr [1:1927] "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" ...
This step uses the occurrence function of the robis package to extract data from OBIS.
obis.occ <- occurrence(species_name)
#> Retrieved 84 records of approximately 84 (100%)
for (field in attribute_list) {
  if (!field %in% names(obis.occ)) {
    obis.occ[[field]] <- NA # Add the missing field as NA
  }
}
obis.occ$abundance <- ifelse(is.na(as.numeric(obis.occ$individualCount)), as.numeric(obis.occ$organismQuantity), as.numeric(obis.occ$individualCount))
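The abundance fallback used for GBIF and OBIS (take individualCount, otherwise organismQuantity) can also be expressed with `dplyr::coalesce()`. The following is a sketch on toy data, offered as an alternative spelling of the same rule rather than as the tutorial's own method.

```r
library(dplyr)

# Toy data: individualCount is preferred, organismQuantity is the fallback
toy <- data.frame(
  individualCount = c("3", NA, NA),
  organismQuantity = c("10", "7", NA),
  stringsAsFactors = FALSE
)

# coalesce() returns the first non-missing value at each position
toy$abundance <- coalesce(
  as.numeric(toy$individualCount),
  as.numeric(toy$organismQuantity)
)
toy$abundance
```

In both spellings, `as.numeric()` turns non-numeric text into NA with a warning, so free-text quantities end up as missing abundance.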
obis.occ$source <- "obis"
obis.occ$municipality <- ""
obis.occ_temp <- obis.occ[, attribute_list]
str(obis.occ_temp[, 1:3])
#> tibble [84 × 3] (S3: tbl_df/tbl/data.frame)
#> $ source : chr [1:84] "obis" "obis" "obis" "obis" ...
#> $ catalogNumber: chr [1:84] NA "DMNS:Inv:25322" NA "483074" ...
#> $ basisOfRecord: chr [1:84] "HumanObservation" "PreservedSpecimen" "HumanObservation" "PreservedSpecimen" ...
This step uses the idig_search_records function of the ridigbio package to extract data from iDigBio.
idig.occ <- idig_search_records(
  type = "records",
  rq = list("scientificname" = species_name),
  fields = "all",
  max_items = 10000L,
  limit = 10000L,
  offset = 0
)
idig.occ <- idig.occ %>%
  mutate(
    abundance = as.numeric(individualcount),
    source = "idigbio",
    occurrenceStatus = "",
    organismQuantity = ""
  ) %>%
  rename(
    decimalLatitude = geopoint.lat,
    decimalLongitude = geopoint.lon,
    basisOfRecord = basisofrecord,
    catalogNumber = catalognumber,
    scientificName = scientificname,
    stateProvince = stateprovince,
    coordinateUncertaintyInMeters = coordinateuncertainty,
    individualCount = individualcount,
    institutionCode = institutioncode,
    verbatimLocality = verbatimlocality,
    verbatimEventDate = verbatimeventdate,
    countryCode = countrycode
  )
idig.occ_temp <- idig.occ[, attribute_list]
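The renaming above maps iDigBio's lowercase field names onto Darwin Core style names. As a sketch, the same mapping can be kept in a named vector and applied with `any_of()`, which skips names that are absent from a given download. The `lookup` vector and toy data below are illustrative assumptions and cover only three of the fields.

```r
library(dplyr)

# Named vector: new (Darwin Core style) name = old (lowercase) name
lookup <- c(
  catalogNumber = "catalognumber",
  basisOfRecord = "basisofrecord",
  countryCode = "countrycode"
)

# Toy data with only two of the three columns present
toy <- data.frame(
  catalognumber = c("x1", "x2"),
  basisofrecord = c("preservedspecimen", "fossilspecimen"),
  stringsAsFactors = FALSE
)

# any_of() renames the columns that exist and silently skips the rest
toy <- rename(toy, any_of(lookup))
names(toy)
```

Keeping the mapping in one vector makes it easier to extend when `attribute_list` grows.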
str(idig.occ_temp[, 1:3])
#> 'data.frame': 342 obs. of 3 variables:
#> $ source : chr "idigbio" "idigbio" "idigbio" "idigbio" ...
#> $ catalogNumber: chr "lacmip 66.1255" "lacm 1951-43.22" "1069" "239577" ...
#> $ basisOfRecord: chr "fossilspecimen" "preservedspecimen" "preservedspecimen" "preservedspecimen" ...
The local file “example_sp_invertebase” was downloaded manually from InvertEbase for Mexacanthina lugubris. See the example_sp_invertebase dataset for its attributes and DwC format.
sym.occ <- example_sp_invertebase
sym.occ$abundance <- as.numeric(sym.occ$individualCount)
for (field in attribute_list) {
  if (!field %in% names(sym.occ)) {
    sym.occ[[field]] <- NA # Add the missing field as NA
  }
}
str(sym.occ[, 1:3])
#> 'data.frame': 710 obs. of 3 variables:
#> $ source : chr "invert" "invert" "invert" "invert" ...
#> $ catalogNumber: chr "49323" "155070811" "66762485" "69352588" ...
#> $ basisOfRecord: chr "PreservedSpecimen" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" ...
The ec_db_merge function in the EcoCleanR package merges data from all sources, provided that each source has the same attribute names and the same number of columns. It also filters the data by the specified type (e.g., modern or fossil) and removes records whose occurrenceStatus is ‘absent’.
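The requirement that every source shares the same attribute names can be checked before merging. The following sketch runs on toy tables; `toy_fields` and `toy_list` are hypothetical stand-ins for `attribute_list` and the list of per-source tables, and the check compares names only, not column order.

```r
# Toy stand-ins for attribute_list and the per-source tables
toy_fields <- c("source", "catalogNumber", "basisOfRecord")
a <- data.frame(source = "gbif", catalogNumber = "1",
                basisOfRecord = "HUMAN_OBSERVATION")
b <- data.frame(source = "obis", catalogNumber = "2",
                basisOfRecord = "PreservedSpecimen")
toy_list <- list(a, b)

# TRUE for each table whose set of column names equals toy_fields
same_cols <- vapply(
  toy_list,
  function(df) setequal(names(df), toy_fields),
  logical(1)
)

# Stop early if any table has missing or extra columns
stopifnot(all(same_cols))
```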
db_list <- list(gbif.occ_temp, obis.occ_temp, idig.occ_temp, sym.occ)
Mixdb.occ <- ec_db_merge(db_list = db_list, datatype = "modern")
str(Mixdb.occ[, 1:3])
#> tibble [2,310 × 3] (S3: tbl_df/tbl/data.frame)
#> $ source : chr [1:2310] "gbif" "gbif" "gbif" "gbif" ...
#> $ catalogNumber: chr [1:2310] "258336784" "258586406" "260117394" "261990463" ...
#> $ basisOfRecord: chr [1:2310] "modern" "modern" "modern" "modern" ...
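To make the merge step more concrete, here is a rough sketch of row-binding sources and dropping absence records with dplyr. It is not the implementation of ec_db_merge, it ignores the modern/fossil datatype handling, and the case-insensitive "absent" rule is an assumption made for illustration. The data are toy values.

```r
library(dplyr)

toy_gbif <- data.frame(
  source = "gbif",
  catalogNumber = c("1", "2"),
  occurrenceStatus = c("PRESENT", "ABSENT"),
  stringsAsFactors = FALSE
)
toy_obis <- data.frame(
  source = "obis",
  catalogNumber = "3",
  occurrenceStatus = NA_character_,
  stringsAsFactors = FALSE
)

# Stack the sources; columns are matched by name
merged <- bind_rows(toy_gbif, toy_obis)

# Keep rows whose status is missing or anything other than "absent"
kept <- filter(
  merged,
  is.na(occurrenceStatus) | tolower(occurrenceStatus) != "absent"
)
kept$catalogNumber
```

Records with a missing occurrenceStatus are kept here, on the assumption that an unreported status should not be read as an absence.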
ec_geographic_map(Mixdb.occ, "decimalLatitude", longitude = "decimalLongitude") # plot the records that have coordinate values
#> Warning: Removed 667 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Further documents:
* For the data cleaning steps on the merged (Mixdb) dataset, see the data_cleaning vignette.
* For citation guidelines for files downloaded from GBIF, OBIS, iDigBio, and InvertEbase, see vignettes/article/cite_data.rmd.