Standardization and integration of different datasets

2022-02-26

Introduction

The first step of the bdc package harmonizes heterogeneous datasets into a standard format. It does this by replacing the headers of the original datasets with standardized terms. To set this up, you fill out a configuration table indicating which field names (i.e., column headers) of each original dataset match a list of Darwin Core standard terms.

Once standardized, the datasets are merged into a single database containing the minimum set of terms required for sharing biodiversity data and metadata across a wide variety of biodiversity applications (the Simple Darwin Core standard).
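To make the idea concrete, here is a minimal base R sketch of header replacement followed by integration. It is not the internal implementation of bdc: the two toy datasets, the mapping vectors, the helper standardize_headers, and the record_id column are all hypothetical. Only the Darwin Core term names (scientificName, decimalLatitude, decimalLongitude) come from the standard.

```r
# Two hypothetical source datasets that describe the same kind of
# information under different column headers.
ds_a <- data.frame(species = c("Myrcia splendens", "Inga vera"),
                   lat = c(-15.6, -22.9),
                   lon = c(-47.9, -43.2),
                   stringsAsFactors = FALSE)
ds_b <- data.frame(taxon = "Cedrela fissilis",
                   y = -25.4,
                   x = -49.3,
                   stringsAsFactors = FALSE)

# Hypothetical mappings (one per dataset): names are Darwin Core terms,
# values are the original headers. This plays the role of one row of
# the configuration table.
map_a <- c(scientificName = "species",
           decimalLatitude = "lat",
           decimalLongitude = "lon")
map_b <- c(scientificName = "taxon",
           decimalLatitude = "y",
           decimalLongitude = "x")

# Hypothetical helper: keep the mapped columns, rename them to the
# standard terms, and tag each record with its dataset of origin.
standardize_headers <- function(data, mapping, dataset_name) {
  out <- data[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out$record_id <- paste(dataset_name, seq_len(nrow(out)), sep = "_")
  out
}

# Integration: once the headers agree, the datasets can be stacked.
merged <- rbind(standardize_headers(ds_a, map_a, "A"),
                standardize_headers(ds_b, map_b, "B"))
```

Because every dataset ends up with identical column names, the final rbind call needs no dataset-specific logic, which is the main benefit of standardizing headers first.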



Installation

See the bdc package documentation for installation instructions.

Read the configuration table

The code below reads an example configuration table shipped with the package. We demonstrate the package using a database of terrestrial plant species occurring in Brazil.

metadata <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv",
                              package = "bdc"),
                  show_col_types = FALSE)

NOTE: Remember to set the “fileName” column to the paths of your input files, and make any other changes needed in the configuration table, before running bdc_standardize_datasets.
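One possible way to update the paths, sketched in base R. The small table below is a hypothetical stand-in for the real configuration table (which has more columns), the datasetName column is assumed for illustration, and the folder data/raw is a placeholder for wherever your input files live.

```r
# Hypothetical stand-in for the configuration table.
metadata <- data.frame(
  datasetName = c("GBIF", "BIEN"),
  fileName = c("old/path/gbif.csv", "old/path/bien.csv"),
  stringsAsFactors = FALSE
)

# Placeholder folder holding the input files on your machine.
input_dir <- file.path("data", "raw")

# Keep each file name but point it at the new folder.
metadata$fileName <- file.path(input_dir, basename(metadata$fileName))

# List any paths that do not exist yet, so problems surface before
# the standardization step is run.
missing_files <- metadata$fileName[!file.exists(metadata$fileName)]
```

Checking missing_files before calling bdc_standardize_datasets is optional, but it tends to give a clearer error than a failed file read partway through.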



The standardized database contains information on species taxonomy, geolocation, date of collection, and other relevant context. Each field is classified into one of three categories according to its importance for running the function: i) required, i.e., the minimum information necessary to run the function; ii) recommended, i.e., not mandatory but carrying important details on species records; and iii) additional, i.e., information potentially useful for detailed data analyses.
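Since required fields must be filled in for the function to run, it can help to check the configuration table beforehand. The following base R sketch shows one way to do that. The helper check_required is hypothetical, and the vector required_fields is an assumption made for illustration; consult the field specifications of the package for the authoritative list.

```r
# ASSUMPTION: an illustrative list of required fields, not taken from bdc.
required_fields <- c("datasetName", "fileName", "scientificName",
                     "decimalLatitude", "decimalLongitude")

# Hypothetical helper: report required columns that are absent from the
# configuration table, and required columns that contain blank entries.
check_required <- function(config, required) {
  absent <- setdiff(required, names(config))
  present <- intersect(required, names(config))
  has_blank <- vapply(present, function(field) {
    value <- config[[field]]
    any(is.na(value) | trimws(as.character(value)) == "")
  }, logical(1))
  list(absent_columns = absent,
       columns_with_blanks = present[has_blank])
}

# Toy configuration table with one missing column and one blank entry.
config <- data.frame(
  datasetName = c("GBIF", "BIEN"),
  fileName = c("gbif.csv", NA),
  scientificName = c("species", "taxon"),
  decimalLatitude = c("lat", "y"),
  stringsAsFactors = FALSE
)

result <- check_required(config, required_fields)
```

Here result$absent_columns flags the column that was never added to the table, while result$columns_with_blanks flags the one that exists but was left empty for a dataset.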

The specifications of each field of the configuration table are listed below:


config_description <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo_description.csv",
                              package = "bdc"),
                  show_col_types = FALSE)

Standardization and integration of datasets

Note that the standardized database integrating all datasets is saved in the folder “Output/Intermediate” as “00_merged_database” when save_database = TRUE. The file is written with a “csv” or “qs” extension, as set by the format argument; “qs” is a convenient format for quickly saving and reading large databases. “qs” files can be read with the qread function from the “qs” package.

database <-
  bdc_standardize_datasets(metadata = metadata,
                           format = "csv",
                           overwrite = TRUE,
                           save_database = TRUE)

#> Standardizing AT_EPIPHYTES file
#> Standardizing BIEN file
#> Standardizing DRYFLOR file
#> Standardizing GBIF file
#> Standardizing ICMBIO file
#> Standardizing IDIGBIO file
#> Standardizing NEOTROPTREE file
#> Standardizing SIBBR file
#> Standardizing SPECIESLINK file
#>
#> C:/Users/Bruno R. Ribeiro/Desktop/bdc/Output/Intermediate/00_merged_database.csv was created
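If you need the merged database again in a later session, you can read it back from disk rather than rerunning the standardization. The sketch below shows one way to pick a reader based on the file extension. The helper read_merged_database is hypothetical, and the temporary CSV with a made-up database_id column only stands in for the real output file so that the example is self-contained. The “qs” branch relies on qs::qread and therefore needs the “qs” package installed.

```r
# Hypothetical helper: choose a reader from the file extension.
read_merged_database <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = utils::read.csv(path, stringsAsFactors = FALSE),
    qs = qs::qread(path),  # only evaluated for ".qs" files
    stop("Unsupported extension: ", ext)
  )
}

# Stand-in for "Output/Intermediate/00_merged_database.csv".
tmp <- file.path(tempdir(), "00_merged_database.csv")
utils::write.csv(data.frame(database_id = c("GBIF_1", "BIEN_1")),
                 tmp, row.names = FALSE)

demo <- read_merged_database(tmp)
```

In practice you would pass the path reported by bdc_standardize_datasets instead of the temporary file used here.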

An example of a standardized database containing the fields required to run the bdc package.