| Title: | Automated and Controlled Extraction, Cleaning, and Processing of Occurrence Data for Generating Biogeographic Ranges of Marine Organisms |
| Version: | 1.0.1 |
| Description: | Provides step-by-step automation for integrating biodiversity data from multiple online aggregators, merging and cleaning datasets while addressing challenges such as taxonomic inconsistencies, georeferencing issues, and spatial or environmental outliers. Includes functions to extract environmental data and to define the biogeographic ranges in which species are most likely to occur. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | knitr, rgbif, robis, ridigbio, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| Imports: | dplyr, geodata, geosphere, ggplot2, mregions2, patchwork, rlang, sdmpredictors, sf, taxize, terra, tidyr |
| Depends: | R (≥ 3.5) |
| LazyData: | true |
| VignetteBuilder: | knitr |
| Config/Needs/website: | rmarkdown |
| NeedsCompilation: | no |
| Packaged: | 2025-11-14 18:27:10 UTC; sonip |
| Author: | Priyanka Soni |
| Maintainer: | Priyanka Soni <sonip@usc.edu> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-19 19:30:02 UTC |
Get Decimal Places of Coordinate Values
Description
Get Decimal Places of Coordinate Values
Usage
decimal_places(coord)
Arguments
coord |
A coordinate value in the numeric format of decimal degree |
Value
A numeric value giving the number of decimal places in the coordinate
Examples
decimal_places(12.7000000)
decimal_places(45.67788)
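One way to count decimal places in base R is to format the number as text and count the characters after the decimal point. The helper count_decimals() below is a hypothetical illustration and not the package's implementation of decimal_places(); it assumes coordinates carry no more than 15 significant digits.

```r
# Hypothetical helper (not part of the package): count digits after the decimal point.
count_decimals <- function(coord) {
  # Format without scientific notation; 15 significant digits hides floating-point noise.
  txt <- format(coord, scientific = FALSE, trim = TRUE, digits = 15)
  if (!grepl(".", txt, fixed = TRUE)) {
    return(0L) # whole number, so there is no decimal part
  }
  # Drop everything up to and including the decimal point, then count what is left.
  nchar(sub("^[^.]*\\.", "", txt))
}

count_decimals(12.7000000)
count_decimals(45.67788)
```

Because R stores coordinates as doubles, trailing zeros typed in the source (12.7000000) are not retained, so both 12.7 and 12.7000000 give the same count.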
Calculate geographic distance and Mahalanobis distance to estimate outlier probability of a data point
Description
Calculate geographic distance and Mahalanobis distance to estimate outlier probability of a data point
Usage
distance_calc(data, latitude, longitude, env_layers, itr = 15, k = 3)
Arguments
data |
data table with spatial and environmental variables |
latitude |
nested input from ec_flag_outlier |
longitude |
nested input from ec_flag_outlier |
env_layers |
column names of the environmental variables, e.g. env_layers <- c("Temperature", "pH") |
itr |
number of clustering iterations to run, e.g. 100 or 1000 (default 15) |
k |
number of clusters to use in each iteration |
Value
A list holding the calculated distances for each iteration
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
temperature_mean = c(12, 13, 14),
temperature_min = c(9, 6, 10),
temperature_max = c(14, 16, 18)
)
env_layers <- c("temperature_mean", "temperature_min", "temperature_max")
result_list <- distance_calc(data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers,
itr = 100,
k = 3
)
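The two distance measures named above can be illustrated with base R alone. This is a minimal sketch under stated assumptions: distances are measured from the overall centroid of the data, the haversine_km() helper and the six-point pts table are hypothetical, and the clustering over itr iterations and k clusters that distance_calc() performs is not reproduced here.

```r
# Hypothetical helper: great-circle distance in km on a sphere of radius r.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up occurrence table; the last point is deliberately far from the rest.
pts <- data.frame(
  decimalLongitude = c(-117.0, -117.8, -116.9, -117.3, -117.1, -110.0),
  decimalLatitude = c(32.9, 33.5, 31.9, 33.0, 32.5, 25.0),
  temperature_mean = c(12, 13, 14, 12.5, 13.5, 22),
  temperature_min = c(9, 6, 10, 8, 9.5, 15)
)

# Geographic distance of every point from the centroid of all points (assumption).
pts$geo_distance <- haversine_km(
  pts$decimalLongitude, pts$decimalLatitude,
  mean(pts$decimalLongitude), mean(pts$decimalLatitude)
)

# stats::mahalanobis() returns the SQUARED Mahalanobis distance in environmental space.
env <- as.matrix(pts[, c("temperature_mean", "temperature_min")])
pts$maha_distance <- mahalanobis(env, colMeans(env), cov(env))

pts[, c("geo_distance", "maha_distance")]
```

The Mahalanobis step needs more observations than environmental variables; otherwise the covariance matrix is singular and cannot be inverted.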
Merge the Data Sets Extracted from Various Data Sources
Description
Merges occurrence data frames from several sources into one. All data frames must have the same fields, following DwC standards, e.g. attribute_list <- c("source", "catalogNumber", "basisOfRecord", "occurrenceStatus", "institutionCode", "verbatimEventDate", "scientificName", "individualCount", "organismQuantity", "abundance", "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "locality", "verbatimLocality", "municipality", "county", "stateProvince", "country", "countryCode").
Before merging, assign the source name manually in the "source" field (e.g. gbif, obis, InvertEBase) and copy the individual count or organism quantity into "abundance"; most online sources populate only one of these with the specimen count.
This function depends on the data files having been downloaded successfully. CSV files from the local system can also be used as input.
Usage
ec_db_merge(
db_list,
datatype = "modern",
occurrenceStatus = "occurrenceStatus",
basisOfRecord = "basisOfRecord"
)
Arguments
db_list |
list of data frames to merge, e.g. from GBIF, iDigBio, InvertEBase or any local file. |
datatype |
text input, either "modern" or "fossil"; default "modern". |
occurrenceStatus |
name of the occurrence status column; default "occurrenceStatus". Supply a different name if required. |
basisOfRecord |
name of the basis of record column; default "basisOfRecord". Supply a different name if required. |
Value
A data frame of occurrence records filtered to include only those classified as "modern" or "fossil".
Examples
db1 <- data.frame(
species = "A",
decimalLongitude = c(-120, -117, NA, NA),
decimalLatitude = c(20, 34, NA, NA),
catalogNumber = c("12345", "89888", "LACM8898", "SDNHM6767"),
occurrenceStatus = c("present", "", "ABSENT", "Present"),
basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
source = "db1",
abundance = c(1, NA, 8, 23)
)
db2 <- data.frame(
species = "A",
decimalLongitude = c(-120.2, -117.1, NA, NA),
decimalLatitude = c(20.2, 34.1, NA, NA),
catalogNumber = c("123452", "898828", "LACM82898", "SDNHM62767"),
occurrenceStatus = c("present", "", "ABSENT", "Present"),
basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
source = "db2",
abundance = c(1, 2, 3, 19)
)
db_list <- list(db1, db2)
merge_modern_data <- ec_db_merge(db_list = db_list, "modern")
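The idea of stacking sources and then keeping only one record type can be sketched in base R. The function merge_sketch() below is hypothetical and is not how ec_db_merge() is implemented; in particular, treating any basisOfRecord value that contains "fossil" as a fossil record is an assumption made only for this illustration.

```r
# Hypothetical sketch: stack data frames on their shared columns, then filter by type.
merge_sketch <- function(db_list, datatype = "modern") {
  # Keep only the columns present in every source so rbind() can stack them.
  shared <- Reduce(intersect, lapply(db_list, names))
  merged <- do.call(
    rbind,
    lapply(db_list, function(d) d[, shared, drop = FALSE])
  )
  # Assumption: a record is "fossil" when basisOfRecord mentions fossil, else "modern".
  is_fossil <- grepl("fossil", tolower(merged$basisOfRecord), fixed = TRUE)
  out <- if (datatype == "fossil") {
    merged[is_fossil, , drop = FALSE]
  } else {
    merged[!is_fossil, , drop = FALSE]
  }
  rownames(out) <- NULL
  out
}

a <- data.frame(
  catalogNumber = c("1", "2"),
  basisOfRecord = c("preserved_specimen", "fossilspecimen"),
  source = "a"
)
b <- data.frame(
  catalogNumber = c("3", "4"),
  basisOfRecord = c("material_sample", "FossilSpecimen"),
  source = "b",
  extra = 1 # column missing from `a`, so it is dropped before stacking
)

modern <- merge_sketch(list(a, b), "modern")
fossil <- merge_sketch(list(a, b), "fossil")
```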
Extract the Environmental data
Description
Extract the Environmental data
Usage
ec_extract_env_layers(
data,
env_layers = env_layers,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
data |
data table which has coordinate information |
env_layers |
vector of codes for the environmental layers to extract, e.g. BO_sstmean, BO_sstmax, BO_sstmin, BO_chlomean, BO_phosphate or a MARSPEC layer. Check the layer list (list_layers in sdmpredictors) for the exact layer codes. |
latitude |
default assigned as "decimalLatitude" |
longitude |
default assigned as "decimalLongitude" |
Value
A data table which has unique coordinates and env predictors
Examples
env_layers <- c("BO_sstmean", "BO_chlomean", "BO_dissox", "BO_salinity")
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9)
)
data_x <- ec_extract_env_layers(data,
env_layers = env_layers,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Flag the Occurrences That Have an Extreme Uncertainty Error Radius
Description
Flag the occurrences that have an extreme uncertainty error radius
Usage
ec_filter_by_uncertainty(
data,
uncertainty_col = "coordinateUncertaintyInMeters",
percentile = 0.96,
ask = TRUE,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
data |
data table from which records with extreme uncertainty values are to be removed |
uncertainty_col |
name of the coordinate uncertainty column; default "coordinateUncertaintyInMeters" |
percentile |
percentile used to derive the uncertainty threshold, e.g. use 0.95 to remove the most uncertain 5% of data points (default 0.96) |
ask |
if TRUE, lets the user decide whether the derived uncertainty threshold is acceptable or too high/low |
latitude |
name of the latitude column; default "decimalLatitude". This column is used to filter out records that are not georeferenced. |
longitude |
name of the longitude column; default "decimalLongitude". |
Value
A data frame with the occurrences of extreme uncertainty removed
Examples
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -117, NA, NA),
decimalLatitude = c(20, 34, NA, NA),
cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
locality = c(NA, NA, "Los Angeles, CA", "San Pedro, CA"),
coordinateUncertaintyInMeters = c(1000, 2000, 9999900, NA)
)
data <- ec_filter_by_uncertainty(
data,
uncertainty_col = "coordinateUncertaintyInMeters",
latitude = "decimalLatitude",
longitude = "decimalLongitude",
percentile = 0.96,
ask = TRUE
)
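The percentile argument can be read as a quantile cut-off on the uncertainty column. A minimal base R sketch of that idea follows; the vector u is made up, and keeping records whose uncertainty is NA is an assumption of this sketch rather than documented behavior of ec_filter_by_uncertainty().

```r
# Made-up uncertainty values in meters, including one extreme value and one NA.
u <- c(100, 250, 500, 1000, 2000, 9999900, NA)

# Threshold at the chosen percentile, ignoring missing values.
threshold <- quantile(u, probs = 0.96, na.rm = TRUE, names = FALSE)

# Assumption: records without an uncertainty value are kept rather than dropped.
keep <- is.na(u) | u <= threshold

u[keep]
```

With so few values the threshold is interpolated between the two largest observations, which is why inspecting the threshold before accepting it is useful.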
Flag the occurrences that are not in the east Atlantic and are inland
Description
Flag the occurrences that are not in the east Atlantic and are inland
Usage
ec_flag_non_east_atlantic(
ocean_names,
buffer_distance = 50000,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
ocean_names |
Names of the oceans to use, e.g. "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland. Data points beyond this distance are treated as bad data points, e.g. buffer_distance <- 25000 |
data |
Data table which has latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
Examples
ocean_names <- c("North Atlantic Ocean", "South Atlantic Ocean")
buffer_distance <- 25000
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_east_atlantic(
ocean_names,
buffer_distance,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Flag occurrences that are not in the east Pacific and are inland
Description
Flag occurrences that are not in the east Pacific and are inland
Usage
ec_flag_non_east_pacific(
ocean_names,
buffer_distance = 50000,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
ocean_names |
Names of the oceans to use, e.g. "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland. Data points beyond this distance are treated as bad data points, e.g. buffer_distance <- 25000 |
data |
Data table which has latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
Examples
ocean_names <- c("North Pacific Ocean", "South Pacific Ocean")
buffer_distance <- 25000
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -78, -110),
decimalLatitude = c(20, 34, 30)
)
data$flag_non_region <- ec_flag_non_east_pacific(
ocean_names,
buffer_distance,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Flag Occurrences That Are in the Wrong Ocean Basin and Are Inland
Description
Flag occurrences that are in the wrong ocean basin and are inland
Usage
ec_flag_non_region(
direction,
ocean,
buffer = 50000,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
direction |
"east" or "west". These values select the shapefiles for the east or west side of the chosen ocean (e.g. pacific) in both the northern and southern hemispheres. |
ocean |
values such as "pacific" or "atlantic" |
buffer |
Buffer distance used to decide whether a data point is inland. Data points beyond this distance are treated as bad data points, e.g. buffer <- 25000 |
data |
Data table which has latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
Examples
direction <- "east"
buffer <- 25000
ocean <- "pacific"
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_region(
direction,
ocean,
buffer = buffer,
data
)
Flag Occurrences That Are Not in the West Atlantic and Are Inland
Description
Flag occurrences that are not in the west Atlantic and are inland
Usage
ec_flag_non_west_atlantic(
ocean_names,
buffer_distance = 50000,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
ocean_names |
Names of the oceans to use, e.g. "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland. Data points beyond this distance are treated as bad data points, e.g. buffer_distance <- 25000 |
data |
Data table which has latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
Examples
ocean_names <- c("North Atlantic Ocean", "South Atlantic Ocean")
buffer_distance <- 25000
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
data$flag_non_region <- ec_flag_non_west_atlantic(
ocean_names,
buffer_distance,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Flag occurrences that are not in the west Pacific and are inland
Description
Flag occurrences that are not in the west Pacific and are inland
Usage
ec_flag_non_west_pacific(
ocean_names,
buffer_distance = 50000,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
ocean_names |
Names of the oceans to use, e.g. "South Pacific Ocean", "North Pacific Ocean", "North Atlantic Ocean", "South Atlantic Ocean" |
buffer_distance |
Buffer distance used to decide whether a data point is inland. Data points beyond this distance are treated as bad data points, e.g. buffer_distance <- 25000 |
data |
Data table which has latitude and longitude information |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A new column of flag values, where 1 marks a bad record and 0 a good record. Column name: flag_non_region
Examples
ocean_names <- c("North Pacific Ocean", "South Pacific Ocean")
buffer_distance <- 25000
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -78, -110),
decimalLatitude = c(20, 34, 30)
)
data$flag_non_region <- ec_flag_non_west_pacific(
ocean_names,
buffer_distance,
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Flag Outlier Occurrences - using Spatial and Non-spatial Attributes
Description
Flag Outlier Occurrences - using Spatial and Non-spatial Attributes
Usage
ec_flag_outlier(
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers,
itr = 50,
k = 3,
geo_quantile = 0.99,
maha_quantile = 0.99
)
Arguments
data |
data table with spatial and environmental variables |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
column names of the environmental variables, e.g. env_layers <- c("Temperature", "pH") |
itr |
number of clustering iterations to run, e.g. 100 or 1000 (default 50) |
k |
number of clusters to use in each iteration |
geo_quantile |
quantile of geo_distance used as the threshold for deriving outliers; default 0.99 |
maha_quantile |
quantile of maha_distance used as the threshold for deriving outliers; default 0.99 |
Value
A column called flag_outlier holding an outlier probability from 0 to 1: values near 1 indicate likely outliers, values near 0 indicate good data points.
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
BO_sstmean = c(12, 13, 14),
BO_sstmin = c(9, 6, 10),
BO_sstmax = c(14, 16, 18)
)
env_layers <- c("BO_sstmean", "BO_sstmin", "BO_sstmax")
res <- ec_flag_outlier(data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers,
itr = 100,
k = 3,
geo_quantile = 0.99,
maha_quantile = 0.99
)
data$outlier <- res$outlier
iteration_list <- res$result$list
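One plausible way to turn per-iteration distances into a probability between 0 and 1 is to count how often a point exceeds the quantile thresholds. The sketch below is an assumption-laden illustration in base R, not the package's algorithm: the simulated sim_iterations list is made up, and the rule "a point counts as an outlier in an iteration when it exceeds both thresholds" is hypothetical.

```r
set.seed(1)
n <- 20

# Simulated stand-in for a list of per-iteration results; point 20 is extreme in both.
sim_iterations <- lapply(1:10, function(i) {
  data.frame(
    geo_distance = c(runif(n - 1, 0, 100), 500),
    maha_distance = c(runif(n - 1, 0, 10), 50)
  )
})

# Logical matrix (points x iterations): TRUE where a point exceeds BOTH thresholds.
exceeds <- sapply(sim_iterations, function(it) {
  it$geo_distance > quantile(it$geo_distance, 0.99) &
    it$maha_distance > quantile(it$maha_distance, 0.99)
})

# Share of iterations in which each point was beyond the thresholds.
outlier_prob <- rowMeans(exceeds)
outlier_prob
```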
Flag occurrences that have poor coordinate precision
Description
Flag occurrences that have poor coordinate precision
Usage
ec_flag_precision(
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
threshold = 2
)
Arguments
data |
data frame of occurrence records |
latitude |
name of the latitude field in the data; default "decimalLatitude", the accepted name under TDWG standards |
longitude |
name of the longitude field in the data; default "decimalLongitude", the accepted name under TDWG standards |
threshold |
default set to 2 |
Value
A column of flags marking records that are bad because of low precision or rounding
Examples
data <- data.frame(
species = "A",
decimalLongitude = c(-120.67, -78, -110, -60, -75.5, -130.78, -10.2, 5.4),
decimalLatitude = c(20.7, 34.6, 30.0, 10.5, 40.4, 25.66, 15.0, 35.9)
)
data$flag_cordinate_precision <- ec_flag_precision(
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
threshold = 2
)
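A precision flag of this kind can be sketched by counting decimal places and comparing them with the threshold. Everything below is a hypothetical illustration in base R: the helper count_decimals() is not part of the package, and the rule used here (flag a record when both coordinates have fewer decimal places than the threshold) is an assumption, not the documented behavior of ec_flag_precision().

```r
# Hypothetical helper: number of digits after the decimal point of one value.
count_decimals <- function(coord) {
  txt <- format(coord, scientific = FALSE, trim = TRUE, digits = 15)
  if (!grepl(".", txt, fixed = TRUE)) {
    return(0L)
  }
  nchar(sub("^[^.]*\\.", "", txt))
}

occ <- data.frame(
  decimalLongitude = c(-120.67, -78, -110, -60, -75.5, -130.78, -10.2, 5.4),
  decimalLatitude = c(20.7, 34.6, 30.0, 10.5, 40.4, 25.66, 15.0, 35.9)
)

threshold <- 2
lon_precision <- vapply(occ$decimalLongitude, count_decimals, integer(1))
lat_precision <- vapply(occ$decimalLatitude, count_decimals, integer(1))

# Assumed rule: 1 when BOTH coordinates fall below the threshold, otherwise 0.
flag <- as.integer(lon_precision < threshold & lat_precision < threshold)
flag
```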
Filter records to georeference using GEOLocate
Description
Filter records to georeference using GEOLocate
Usage
ec_flag_with_locality(
data,
uncertainty = "coordinateUncertaintyInMeters",
locality = "locality",
verbatimLocality = "verbatimLocality"
)
Arguments
data |
data table with occurrence information |
uncertainty |
Mandatory: the data table must contain a coordinateUncertaintyInMeters column. |
locality |
Mandatory: the data table must contain a locality column. |
verbatimLocality |
Mandatory: the data table must contain a verbatimLocality column. |
Details
Identifies records that have no coordinates assigned but do have locality and verbatim locality information, so that coordinates can be assigned with external tools such as GEOLocate.
Value
A column in which records flagged as 1 have the potential to be georeferenced.
Examples
data <- data.frame(
coordinateUncertaintyInMeters = c(NA, "N/A", 50, "30", NA, "N/A", NA),
locality = c("Santa Cruz", NA, "Los Angeles", "N/A", "", "San Diego", NA),
verbatimLocality = c(NA, "CA coast", "", "N/A", "Long Beach", NA, "")
)
data$flag_check_geolocate <- ec_flag_with_locality(
data, uncertainty = "coordinateUncertaintyInMeters",
locality = "locality",
verbatimLocality = "verbatimLocality"
)
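The flagging logic described in Details can be approximated with a small base R sketch. The rule below is an assumption for illustration only: NA, empty strings and "N/A" are all treated as missing, and a record is flagged 1 when its uncertainty is missing but at least one of the two locality fields has text. The helper is_blank() is hypothetical, and the real ec_flag_with_locality() may apply a different rule.

```r
# Hypothetical helper: TRUE for NA, empty strings and common "not available" codes.
is_blank <- function(x) {
  x <- trimws(as.character(x))
  is.na(x) | x %in% c("", "N/A", "n/a", "NA")
}

recs <- data.frame(
  coordinateUncertaintyInMeters = c(NA, "N/A", 50, "30", NA, "N/A", NA),
  locality = c("Santa Cruz", NA, "Los Angeles", "N/A", "", "San Diego", NA),
  verbatimLocality = c(NA, "CA coast", "", "N/A", "Long Beach", NA, "")
)

has_place_text <- !is_blank(recs$locality) | !is_blank(recs$verbatimLocality)

# Assumed rule: missing uncertainty plus some locality text means "worth georeferencing".
flag <- as.integer(is_blank(recs$coordinateUncertaintyInMeters) & has_place_text)
flag
```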
Map view of occurrence data points
Description
Map view of occurrence data points
Usage
ec_geographic_map(
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
data |
Data table |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
Value
A map view shows occurrence records.
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
temperature_mean = c(12, 13, 14),
temperature_min = c(9, 6, 10),
temperature_max = c(14, 16, 18)
)
ec_geographic_map(data,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Map view to visualize data points with outlier probability from 0 to 1
Description
Map view to visualize data points with outlier probability from 0 to 1
Usage
ec_geographic_map_w_flag(
data,
flag_column,
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Arguments
data |
Data table containing coordinates (decimalLongitude and decimalLatitude) and a column of flags from 0 to 1 |
flag_column |
name of the column holding the flag, e.g. flag_outlier |
latitude |
default set to "decimalLatitude"; change it if the column has a different name. |
longitude |
default set to "decimalLongitude"; change it if the column has a different name. |
Value
A geographic map of occurrence data points with a color gradient that shows flagged records in warm colors.
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
temperature_mean = c(12, 13, 14),
temperature_min = c(9, 6, 10),
temperature_max = c(14, 16, 18),
flag_outlier = c(0, 0.5, 1)
)
ec_geographic_map_w_flag(data,
flag_column = "flag_outlier",
latitude = "decimalLatitude",
longitude = "decimalLongitude"
)
Impute Environmental Variables using Mean Values of occurrences within a certain radius
Description
Impute Environmental Variables using Mean Values of occurrences within a certain radius
Usage
ec_impute_env_values(
data_x,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
radius_km = 10,
iter = 3
)
Arguments
data_x |
data table produced by ec_extract_env_layers |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
radius_km |
radius, in kilometers, within which the values of data points are averaged to impute values for missing data points |
iter |
number of times to iterate the imputation, e.g. 1, 2 or 3 |
Value
An updated data_x table with imputed values for the missing environmental variables. The imputation will not work if the data points are too sparse.
Examples
data_x <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
BO_sstmean = c(12, NA, 14),
BO_sstmin = c(9, NA, 10),
BO_sstmax = c(14, NA, 18)
)
radius_km <- 10
iter <- 3
data_x <- ec_impute_env_values(data_x,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
radius_km, iter
)
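Radius-based imputation can be sketched as a single pass in base R: for each record with a missing value, average the observed values of neighbors within radius_km. The helpers haversine_km() and impute_once() and the pts table are hypothetical, and this is not the package's implementation of ec_impute_env_values().

```r
# Hypothetical helper: great-circle distance in km on a sphere of radius r.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# One imputation pass for a single variable.
impute_once <- function(df, var, radius_km = 10) {
  out <- df[[var]]
  for (i in which(is.na(df[[var]]))) {
    d <- haversine_km(
      df$decimalLongitude[i], df$decimalLatitude[i],
      df$decimalLongitude, df$decimalLatitude
    )
    # Neighbors must be inside the radius and have an observed value.
    near <- d <= radius_km & !is.na(df[[var]])
    if (any(near)) out[i] <- mean(df[[var]][near])
  }
  df[[var]] <- out
  df
}

pts <- data.frame(
  decimalLongitude = c(-117.00, -117.05, -117.02, -116.00),
  decimalLatitude = c(32.90, 32.92, 32.88, 31.00),
  BO_sstmean = c(12, NA, 14, NA)
)

imputed <- impute_once(pts, "BO_sstmean", radius_km = 10)
imputed
```

The second point has two observed neighbors within 10 km and receives their mean, while the isolated fourth point stays NA, which illustrates why sparse data cannot be imputed this way. Repeating the pass lets values imputed in one round serve as neighbors in the next.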
Merge the Updated Georeferenced Occurrence Points Back into the Main Data File
Description
Merge the updated georeferenced occurrence points back into the main data file.
Usage
ec_merge_corrected_coordinates(
data_corrected,
data,
catalog = "cleaned_catalog",
latitude = "decimalLatitude",
longitude = "decimalLongitude",
uncertainty_col = "coordinateUncertaintyInMeters"
)
Arguments
data_corrected |
data frame of corrected coordinates. After assigning coordinate values with an online georeferencing tool such as GEOLocate, load the CSV file back into R under the name data_corrected. Its column names are hardcoded as "corrected_longitude", "corrected_latitude", "corrected_uncertainty" and "cleaned_catalog"; these are merged with "decimalLongitude", "decimalLatitude", "coordinateUncertaintyInMeters" and "cleaned_catalog" of the data table. |
data |
data table to be updated with the assigned coordinates |
catalog |
name of the catalog column used to match the records back to the main data file; default "cleaned_catalog" |
latitude |
name of the latitude column in data; default "decimalLatitude" |
longitude |
name of the longitude column in data; default "decimalLongitude" |
uncertainty_col |
name of the coordinate uncertainty column in data; default "coordinateUncertaintyInMeters" |
Value
A data frame with updated coordinate information
Examples
data <- data.frame(
species = "A",
decimalLongitude = c(-120, -119.8, NA, NA),
decimalLatitude = c(20, 34, NA, NA),
cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
locality = c(NA, NA, "Los Angeles, CA", "San Pedro, CA"),
coordinateUncertaintyInMeters = c(9999, NA, NA, NA)
)
data_corrected <- data.frame(
corrected_longitude = c(-120, -119.8, -118, -118.3),
corrected_latitude = c(20, 34, 33, 32.9),
cleaned_catalog = c("12345", "89888", "LACM8898", "SDNHM6767"),
corrected_uncertainty = c(9999, NA, 5000, 1000)
)
data <- ec_merge_corrected_coordinates(data_corrected, data,
catalog = "cleaned_catalog",
latitude = "decimalLatitude",
longitude = "decimalLongitude",
uncertainty_col = "coordinateUncertaintyInMeters" )
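Matching corrected coordinates back by catalog number can be sketched with match() in base R. The tables main and corrected are made up, and filling only records whose coordinates are missing is an assumption of this sketch; ec_merge_corrected_coordinates() may handle existing coordinates differently.

```r
main <- data.frame(
  cleaned_catalog = c("12345", "LACM8898", "SDNHM6767"),
  decimalLongitude = c(-120, NA, NA),
  decimalLatitude = c(20, NA, NA),
  coordinateUncertaintyInMeters = c(9999, NA, NA)
)
corrected <- data.frame(
  cleaned_catalog = c("LACM8898", "SDNHM6767"),
  corrected_longitude = c(-118, -118.3),
  corrected_latitude = c(33, 32.9),
  corrected_uncertainty = c(5000, 1000)
)

# Row of `corrected` that belongs to each row of `main` (NA when there is no match).
idx <- match(main$cleaned_catalog, corrected$cleaned_catalog)

# Assumption: only fill records that have a match and no coordinates yet.
fill <- !is.na(idx) & is.na(main$decimalLongitude) & is.na(main$decimalLatitude)

main$decimalLongitude[fill] <- corrected$corrected_longitude[idx[fill]]
main$decimalLatitude[fill] <- corrected$corrected_latitude[idx[fill]]
main$coordinateUncertaintyInMeters[fill] <- corrected$corrected_uncertainty[idx[fill]]
main
```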
Scatter Plot of geo_distance vs maha_distance with Quantile Thresholds to Show the Outliers Beyond Those Thresholds
Description
Scatter plot of geo_distance vs maha_distance with the geo- and maha- quantile thresholds, to show the outliers that fall beyond those thresholds.
Usage
ec_plot_distance(
x,
geo_quantile = 0.99,
maha_quantile = 0.99,
iterative = TRUE,
geo_distance = "geo_distance",
maha_distance = "maha_distance"
)
Arguments
x |
iteration_list derived from ec_flag_outlier, used to draw the scatter plots of geo_distance vs maha_distance |
geo_quantile |
quantile of geo_distance used as the threshold for deriving outliers; default 0.99 |
maha_quantile |
quantile of maha_distance used as the threshold for deriving outliers; default 0.99 |
iterative |
TRUE/FALSE, default TRUE. If TRUE, loops through the plots for every iteration in the list of outlier probability results; if FALSE, the loop exits after the first iteration. |
geo_distance |
name of the column holding the calculated geographic distance, an output of ec_flag_outlier; default "geo_distance" |
maha_distance |
name of the column holding the calculated Mahalanobis distance, an output of ec_flag_outlier; default "maha_distance" |
Value
A list of plots, one for each iteration
Examples
df1 <- data.frame(
latitude = runif(5, 30, 35),
longitude = runif(5, -120, -115),
temperature = rnorm(5, 15, 2),
pH = rnorm(5, 8, 0.1),
geo_distance = runif(5, 0, 100),
maha_distance = runif(5, 0, 10)
)
df2 <- data.frame(
latitude = runif(5, 30, 35),
longitude = runif(5, -120, -115),
temperature = rnorm(5, 16, 2),
pH = rnorm(5, 7.9, 0.1),
geo_distance = runif(5, 0, 100),
maha_distance = runif(5, 0, 10)
)
iteration_list <- list(df1, df2) # store both data frames in a list
plot <- ec_plot_distance(iteration_list, geo_quantile = 0.99, maha_quantile = 0.99,
iterative = TRUE)
Plot cleaned data overlaid on all occurrence data to show the accepted ranges of spatial and non-spatial attributes
Description
Plot cleaned data overlaid on all occurrence data to show the accepted ranges of spatial and non-spatial attributes
Usage
ec_plot_var_range(
data,
summary_df,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers
)
Arguments
data |
data table that still includes the outlier data points |
summary_df |
summary output of the final cleaned data, produced by ec_var_summary |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
list of environmental variables |
Value
A plot showing the spatial and environmental variables with the acceptable range for species habitability
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9, -116.5),
decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
temperature_mean = c(12, 13, 14, 11),
temperature_min = c(9, 6, 10, 10),
temperature_max = c(14, 16, 18, 17),
flag_outlier = c(0, 0.5, 1, 0.7)
) # this data table still contains data points that were considered outliers
data_x <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.5),
decimalLatitude = c(32.9, 33.5, 32.4),
temperature_mean = c(12, 13, 11),
temperature_min = c(9, 6, 10),
temperature_max = c(14, 16, 17),
flag_outlier = c(0, 0.5, 0.7)
)
# cleaned data after removing outliers above a chosen probability.
# In this example, data points with an outlier probability > 0.7
# were removed.
env_layers <- c("temperature_mean", "temperature_min", "temperature_max")
summary_df <- ec_var_summary(data_x,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers
)
# data_x is the final cleaned data table, used above
# to derive the summary of the acceptable niche
ec_plot_var_range(data,
summary_df,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers
)
Remove Duplicate Records from the Merged Data
Description
Remove Duplicate Records from the Merged Data
Usage
ec_rm_duplicate(data, catalogNumber = "catalogNumber", abundance = "abundance")
Arguments
data |
merged data frame, the output of ec_db_merge |
catalogNumber |
name of the catalog number column; a mandatory field that is considered unique for each occurrence record. |
abundance |
name of the abundance column; a mandatory field created during data extraction by combining the individual count and quantity fields (these vary from one source to another and are standardized here as "abundance"). |
Details
This function adds a cleaned_catalog column holding standardized catalog numbers, then removes duplicates based on the cleaned_catalog and abundance columns of data. Mandatory fields are catalogNumber, source and abundance.
Value
A data frame with unique catalog numbers. The output has a cleaned_catalog field in place of catalogNumber. Where duplicates exist, the record with an abundance value, if there is one, is kept.
Examples
db1 <- data.frame(
species = "A",
decimalLongitude = c(-120.2, -117.1, NA, NA),
decimalLatitude = c(20.2, 34.1, NA, NA),
catalogNumber = c("12345", "89888", "LACM8898", "SDNHM6767"),
occurrenceStatus = c("present", "", "ABSENT", "Present"),
basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
source = "db1",
abundance = c(1, NA, 8, 23)
)
db2 <- data.frame(
species = "A",
decimalLongitude = c(-120.2, -117.1, NA, NA),
decimalLatitude = c(20.2, 34.1, NA, NA),
catalogNumber = c("123452", "898828", "LACM82898", "SDNHM62767"),
occurrenceStatus = c("present", "", "ABSENT", "Present"),
basisOfRecord = c("preserved_specimen", "", "fossilspecimen", "material_sample"),
source = "db2",
abundance = c(1, 2, 3, 19)
)
db_list <- list(db1, db2)
merge_modern_data <- ec_db_merge(db_list = db_list, "modern")
ecodata <- ec_rm_duplicate(merge_modern_data,
catalogNumber = "catalogNumber",
abundance = "abundance"
)
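De-duplication on a standardized catalog number can be sketched in base R. The standardization used below (uppercase, with every non-alphanumeric character removed) is an assumption for illustration, the recs table is made up, and the package's own rules in ec_rm_duplicate() may differ.

```r
recs <- data.frame(
  catalogNumber = c("LACM 8898", "lacm-8898", "12345", "12345"),
  abundance = c(NA, 8, 1, NA),
  source = c("gbif", "obis", "gbif", "idigbio")
)

# Assumed standardization: strip punctuation and spaces, then uppercase.
recs$cleaned_catalog <- toupper(gsub("[^[:alnum:]]", "", recs$catalogNumber))

# Sort so that, within each catalog number, records WITH an abundance come first.
ord <- order(recs$cleaned_catalog, is.na(recs$abundance))
recs_sorted <- recs[ord, ]

# Keep the first record per cleaned catalog number.
unique_recs <- recs_sorted[!duplicated(recs_sorted$cleaned_catalog), ]
unique_recs
```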
Trim Trailing Zeros from the Coordinate Values
Description
Trim trailing zeros from the coordinate values
Usage
ec_trail_zero(coord)
Arguments
coord |
A coordinate value in the numeric format of decimal degree |
Value
A numeric coordinate value with trailing zeros removed.
Examples
ec_trail_zero(12.7000000)
ec_trail_zero(45.000000)
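Trailing zeros only exist in the text form of a coordinate, since R numerics do not store them. The sketch below therefore works on character input; trim_zeros() is a hypothetical helper written for illustration and is not the package's ec_trail_zero().

```r
# Hypothetical helper: remove trailing zeros after the decimal point in coordinate text.
trim_zeros <- function(txt) {
  # Only touch values with a decimal point, so "120" is not shortened to "12".
  has_point <- grepl(".", txt, fixed = TRUE)
  # Remove the run of zeros at the end, and the point itself if nothing follows it.
  txt[has_point] <- sub("\\.?0+$", "", txt[has_point])
  txt
}

trimmed <- trim_zeros(c("12.7000000", "45.000000", "120", "10.50"))
as.numeric(trimmed)
```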
A Summary Table of Final Cleaned Spatial and Environmental Variables
Description
A Summary Table of Final Cleaned Spatial and Environmental Variables
Usage
ec_var_summary(
data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers
)
Arguments
data |
data table after cleaning the records |
latitude |
default set to "decimalLatitude" |
longitude |
default set to "decimalLongitude" |
env_layers |
a vector of column names of the environmental layers |
Value
A summary table with the mean, min and max values of final cleaned spatial and environmental variables
Examples
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9, -116.5),
decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
BO_sstmean = c(12, 13, 14, 11),
BO_sstmin = c(9, 6, 10, 10),
BO_sstmax = c(14, 16, 18, 17)
)
env_layers <- c("BO_sstmean", "BO_sstmin", "BO_sstmax")
ec_var_summary(data,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
env_layers
)
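A summary of this kind (mean, min and max per variable) can be reproduced in base R for comparison. The table layout below is an assumption; the object returned by ec_var_summary() may be arranged differently.

```r
occ <- data.frame(
  decimalLongitude = c(-117, -117.8, -116.9, -116.5),
  decimalLatitude = c(32.9, 33.5, 31.9, 32.4),
  BO_sstmean = c(12, 13, 14, 11),
  BO_sstmin = c(9, 6, 10, 10),
  BO_sstmax = c(14, 16, 18, 17)
)
vars <- c("decimalLatitude", "decimalLongitude", "BO_sstmean", "BO_sstmin", "BO_sstmax")

# Apply one statistic to every variable and return a plain numeric vector.
stat <- function(f) {
  vapply(vars, function(v) f(occ[[v]], na.rm = TRUE), numeric(1), USE.NAMES = FALSE)
}

# One row per variable, one column per statistic (layout is an assumption).
summary_sketch <- data.frame(
  variable = vars,
  mean = stat(mean),
  min = stat(min),
  max = stat(max)
)
summary_sketch
```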
Check Accepted Synonyms from WoRMS Taxonomy
Description
Check Accepted Synonyms from WoRMS Taxonomy
Usage
ec_worms_synonym(species_name, data, scientificName = "scientificName")
Arguments
species_name |
input species name, e.g. Mexacanthina lugubris |
data |
data table which has information of all occurrence data of the selected species |
scientificName |
name of the scientific name column; default "scientificName". This column comes from the data extracted from online sources and may hold various synonyms of species_name. |
Value
A table with two columns: the first lists the accepted synonyms, and the second lists the unique species names found in the occurrence data with the number of records under each name.
Examples
species_name <- "Mexacanthina lugubris"
data <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-120, -78, -110, -60, -75, -130, -10, 5),
decimalLatitude = c(20, 34, 30, 10, 40, 25, 15, 35)
)
comparison <- ec_worms_synonym(species_name, data, scientificName = "scientificName")
print(comparison)
dataset1: Documentation of data file - ecodata.rda
Description
This data file is the raw data file obtained after merging all data sources and removing duplicate records. It holds the occurrence records of the mollusc species "Mexacanthina lugubris", with all modern records extracted from GBIF, OBIS, iDigBio and InvertEBase.
Usage
ecodata
Format
A data frame with 1115 rows and 19 variables:
- X
index
- basisOfRecord
Type of record (e.g., preserved specimen, fossil)
- occurrenceStatus
Presence or absence of the organism
- institutionCode
Code of the institution that holds the record
- verbatimEventDate
Original recorded date of the event
- scientificName
Full scientific name of the organism
- individualCount
Number of individuals observed
- organismQuantity
Reported quantity of the organism
- abundance
Calculated or standardized abundance value
- decimalLatitude
Latitude in decimal degrees
- decimalLongitude
Longitude in decimal degrees
- coordinateUncertaintyInMeters
Uncertainty in coordinates (meters)
- locality
Named place where the occurrence was recorded
- verbatimLocality
Original text for locality description
- municipality
Municipality or town of the occurrence
- county
County where the record was observed
- stateProvince
State or province name
- country
Country name
- cleaned_catalog
Standardized catalog number for de-duplication
Source
Records were retrieved using rgbif for GBIF, ridigbio for iDigBio, robis for OBIS and rsymbiota for InvertEBase
dataset4: Documentation of data file - ecodata_cleaned.rda
Description
This dataset contains the final cleaned occurrence records
Usage
ecodata_cleaned
Format
A data frame with 698 rows and 35 variables:
- X
Index
- basisOfRecord
Type of occurrence record (e.g., preserved specimen, fossil)
- occurrenceStatus
Indicates presence or absence of the species
- institutionCode
Code of the institution that provided the record
- verbatimEventDate
Original text for the event or collection date
- scientificName
Scientific name of the organism
- individualCount
Number of individuals recorded
- organismQuantity
Reported quantity (unit may vary)
- abundance
Standardized or calculated abundance value
- decimalLatitude
Latitude in decimal degrees
- decimalLongitude
Longitude in decimal degrees
- coordinateUncertaintyInMeters
Spatial uncertainty of coordinates in meters
- locality
Named location where the record was collected
- verbatimLocality
Original locality text as provided by the source
- municipality
Municipality or town of occurrence
- county
County of occurrence
- stateProvince
State or province of occurrence
- country
Country of occurrence
- cleaned_catalog
Standardized catalog number used for de-duplication
- lat_precision
Number of decimal places in the latitude coordinate
- lon_precision
Number of decimal places in the longitude coordinate
- flag_cordinate_precision
Flag for low coordinate precision
- flag_cc_val
Flag for invalid or impossible coordinates
- flag_cc_equal
Flag for identical latitude and longitude (likely erroneous)
- flag_cc_zero
Flag for coordinates at (0,0)
- flag_cc_cent
Flag for coordinates placed at a country or region centroid
- flag_cc_gbif
Flag for coordinates matching GBIF headquarters (artifact)
- flag_cc_inst
Flag for coordinates matching institution location
- flag_non_region
Flag for coordinates outside the study region
- outliers
Flag for outliers based on clustering of spatial and environmental variables
- BO_sstmean
Mean sea surface temperature from Bio-ORACLE
- BO_sstmax
Maximum sea surface temperature from Bio-ORACLE
- BO_sstmin
Minimum sea surface temperature from Bio-ORACLE
- BO_chloro
Chlorophyll concentration from Bio-ORACLE
- BO_dissox
Dissolved oxygen level from Bio-ORACLE
Source
Generated after filtering outlier data points
dataset2: Documentation of data file - ecodata_corrected.rda
Description
This data file was created using the GEOLocate tool; only four columns were kept. The georeferenced information is merged back into the main data file, ecodata.
Usage
ecodata_corrected
Format
A data frame with 433 rows and 4 variables:
- cleaned_catalog
catalog number
- corrected_latitude
latitude values assigned by GEOLocate
- corrected_longitude
longitude values assigned by GEOLocate
- corrected_uncertainty
uncertainty values assigned by GEOLocate
Source
This file was created manually from the CSV file exported from the GEOLocate online software, which was used to assign coordinate and uncertainty values to records that have locality information
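Merging the georeferenced values back into the main table by cleaned_catalog can be sketched in base R as follows. This is a minimal illustration on toy data, not the package's own procedure: the catalog numbers and coordinates are invented, and the rule of filling only missing coordinates is an assumption.

```r
# Toy stand-in for ecodata: two records lack coordinates.
ecodata_toy <- data.frame(
  cleaned_catalog = c("A1", "A2", "A3"),
  decimalLatitude = c(NA, 32.9, NA),
  decimalLongitude = c(NA, -117.0, NA),
  stringsAsFactors = FALSE
)

# Toy stand-in for ecodata_corrected (values invented for illustration).
corrected_toy <- data.frame(
  cleaned_catalog = c("A1", "A3"),
  corrected_latitude = c(31.9, 33.5),
  corrected_longitude = c(-116.9, -117.8),
  corrected_uncertainty = c(500, 1200),
  stringsAsFactors = FALSE
)

# Left join on the catalog number so every original record is kept.
merged <- merge(ecodata_toy, corrected_toy, by = "cleaned_catalog", all.x = TRUE)

# Assumed rule: fill coordinates only where the original ones are missing.
needs_fix <- is.na(merged$decimalLatitude) & !is.na(merged$corrected_latitude)
merged$decimalLatitude[needs_fix] <- merged$corrected_latitude[needs_fix]
merged$decimalLongitude[needs_fix] <- merged$corrected_longitude[needs_fix]
print(merged)
```

Using all.x = TRUE keeps records that have no GEOLocate match, so the merge never drops rows from the main table.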
dataset3: Documentation of data file - ecodata_with_outliers.rda
Description
This data file was created by running the ec_flag_outlier function. It contains the records together with their outlier probability.
Usage
ecodata_with_outliers
Format
A data frame with 713 rows and 35 variables:
- X
index
- basisOfRecord
Type of occurrence record (e.g., preserved specimen, fossil)
- occurrenceStatus
Indicates presence or absence of the species
- institutionCode
Code of the institution that provided the record
- verbatimEventDate
Original text for the event or collection date
- scientificName
Scientific name of the organism
- individualCount
Number of individuals recorded
- organismQuantity
Reported quantity (unit may vary)
- abundance
Standardized or calculated abundance value
- decimalLatitude
Latitude in decimal degrees
- decimalLongitude
Longitude in decimal degrees
- coordinateUncertaintyInMeters
Spatial uncertainty of coordinates in meters
- locality
Named location where the record was collected
- verbatimLocality
Original locality text as provided by the source
- municipality
Municipality or town of occurrence
- county
County of occurrence
- stateProvince
State or province of occurrence
- country
Country of occurrence
- cleaned_catalog
Standardized catalog number used for de-duplication
- lat_precision
Number of decimal places in the latitude coordinate
- lon_precision
Number of decimal places in the longitude coordinate
- flag_cordinate_precision
Flag for low coordinate precision
- flag_cc_val
Flag for invalid or impossible coordinates
- flag_cc_equal
Flag for identical latitude and longitude (likely erroneous)
- flag_cc_zero
Flag for coordinates at (0,0)
- flag_cc_cent
Flag for coordinates placed at a country or region centroid
- flag_cc_gbif
Flag for coordinates matching GBIF headquarters (artifact)
- flag_cc_inst
Flag for coordinates matching institution location
- flag_non_region
Flag for coordinates outside the study region
- outliers
Flag for outliers based on clustering of spatial and environmental variables
- BO_sstmean
Mean sea surface temperature from Bio-ORACLE
- BO_sstmax
Maximum sea surface temperature from Bio-ORACLE
- BO_sstmin
Minimum sea surface temperature from Bio-ORACLE
- BO_chloro
Chlorophyll concentration from Bio-ORACLE
- BO_dissox
Dissolved oxygen level from Bio-ORACLE
Source
This file was created manually from the CSV file exported from the GEOLocate online software, which was used to assign coordinate and uncertainty values to records that have locality information
dataset5: Documentation of data file - ecodata_x.rda
Description
This dataset was created to obtain the unique combinations of coordinate values, which are used to extract environmental variables from Bio-ORACLE and merge them back into the main data table, ecodata
Usage
ecodata_x
Format
A data frame with 705 rows and 6 variables:
- species
species name
- decimalLatitude
Latitude in decimal degrees
- decimalLongitude
Longitude in decimal degrees
- temperature_mean_BO
Mean sea surface temperature from Bio-ORACLE
- temperature_max_BO
Maximum sea surface temperature from Bio-ORACLE
- temperature_min_BO
Minimum sea surface temperature from Bio-ORACLE
Source
This file contains unique coordinate information with the corresponding values of the environmental variables
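The idea of extracting environmental values once per unique coordinate pair and then merging them back can be sketched in base R as follows. The temperature values are made-up placeholders rather than Bio-ORACLE output, and the package's own extraction function is not shown here.

```r
# Toy occurrence table with one repeated coordinate pair.
occ <- data.frame(
  species = "Mexacanthina lugubris",
  decimalLatitude = c(32.9, 32.9, 33.5, 31.9),
  decimalLongitude = c(-117, -117, -117.8, -116.9),
  stringsAsFactors = FALSE
)

# Keep one row per unique species/coordinate combination.
key_cols <- c("species", "decimalLatitude", "decimalLongitude")
coords <- unique(occ[, key_cols])

# Placeholder values standing in for an environmental layer lookup.
coords$temperature_mean_BO <- c(17.1, 16.8, 17.5)

# Merge the per-coordinate values back onto every occurrence record.
occ_env <- merge(occ, coords, by = key_cols, all.x = TRUE)
print(occ_env)
```

Working on the unique coordinates keeps the number of raster lookups equal to the number of distinct locations rather than the number of records.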
dataset6: Documentation of data file - example_sp_invertebase.rda
Description
This is a data dump downloaded from InvertEBase. Because the R package that links to InvertEBase is currently archived and no longer maintained, an example file is provided.
Usage
example_sp_invertebase
Format
A data frame with 710 rows and 20 variables:
- source
Data source (InvertEBase)
- catalogNumber
Catalog number
- basisOfRecord
Type of record
- occurrenceStatus
Presence or absence of the organism
- institutionCode
Institution code
- verbatimEventDate
Original recorded date of the occurrence
- scientificName
Species name
- individualCount
Number of individuals recorded
- organismQuantity
Reported quantity of the organism
- abundance
Abundance value
- decimalLatitude
Latitude in decimal degrees
- decimalLongitude
Longitude in decimal degrees
- coordinateUncertaintyInMeters
Uncertainty of coordinates
- locality
location information
- verbatimLocality
verbatim location information
- municipality
municipality
- country
country
- stateProvince
State or province
- county
county
- countryCode
country code
Source
This file was downloaded from InvertEBase for the species "Mexacanthina lugubris", with field names modified to follow the TDWG standard.
Calculate Haversine distance
Description
Calculate Haversine distance
Usage
haversine_kmeans(data, latitude, longitude, k)
Arguments
data |
A data frame with spatial attributes (latitude and longitude) |
latitude |
nested input from ec_flag_outlier |
longitude |
nested input from ec_flag_outlier |
k |
Number of clusters required for the dataset. Visual inspection usually gives a sense of the number of clusters. Be cautious about using more clusters than expected just to fit all data points, as overfitting can end up including bad data points in the analysis. e.g. k = 3 |
Value
A data frame with centroids and clusters computed using a Haversine distance matrix
Examples
data_x <- data.frame(
scientificName = "Mexacanthina lugubris",
decimalLongitude = c(-117, -117.8, -116.9),
decimalLatitude = c(32.9, 33.5, 31.9),
BO_sstmean = c(12, 13, 14),
BO_sstmin = c(9, 6, 10),
BO_sstmax = c(14, 16, 18)
)
result <- haversine_kmeans(
data_x,
latitude = "decimalLatitude",
longitude = "decimalLongitude",
k = 3
)
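For readers unfamiliar with the Haversine distance, the following base R sketch computes a pairwise great-circle distance matrix for the three example points. It is an illustration only: haversine_km and its radius_km default (a mean Earth radius of 6371 km) are hypothetical names and assumptions, not the internals of haversine_kmeans. The geosphere package listed in Imports provides distHaversine for the same calculation.

```r
# Great-circle distance in kilometres between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- (lat2 - lat1) * to_rad
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Coordinates taken from the example above.
lat <- c(32.9, 33.5, 31.9)
lon <- c(-117, -117.8, -116.9)

# Pairwise distance matrix: entry [i, j] is the distance from point i to point j.
idx <- seq_along(lat)
dist_mat <- outer(idx, idx, function(i, j) {
  haversine_km(lat[i], lon[i], lat[j], lon[j])
})
print(round(dist_mat, 1))
```

A matrix of this kind is the usual input when clustering on geographic distance rather than on raw latitude and longitude values.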