scholid is a lightweight, dependency-free (base R only)
toolkit for working with scholarly and academic identifiers. It provides
small, well-tested helpers to detect, normalize, classify, and extract
common identifier strings.
This vignette introduces the interface and typical workflows for mixed, messy identifier data.
scholid exposes a small set of user-facing functions
that operate consistently across identifier types:
- scholid_types() lists supported identifier types.
- is_scholid(x, type) checks whether values match the identifier type.
- normalize_scholid(x, type) returns canonical identifier strings.
- extract_scholid(text, type) extracts identifiers from free text.
- classify_scholid(x) guesses the identifier type per element.
- detect_scholid_type(x) detects identifier types from canonical or wrapped input values (e.g., URLs or labels).

These generic helpers dispatch internally to type-specific implementations such as is_doi(), normalize_orcid(), and extract_isbn().
## [1] "arxiv" "doi" "isbn" "issn" "orcid" "pmcid" "pmid"
is_scholid()

is_scholid() checks whether each value matches a specific identifier type. It is vectorized and preserves missing values.
## [1] TRUE FALSE NA
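A call of roughly the following shape yields one logical value per input element. The input vector is an illustrative assumption, and the guard simply skips the call when scholid is not installed.

```r
# Illustrative input (an assumption): a canonical DOI, a non-identifier,
# and a missing value.
x <- c("10.1000/182", "not a doi", NA)

if (requireNamespace("scholid", quietly = TRUE)) {
  # One logical per element; the NA input is expected to stay NA.
  res <- scholid::is_scholid(x = x, type = "doi")
  print(res)
}
```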
normalize_scholid()

Normalization removes common wrappers and enforces a canonical representation. This is particularly useful when identifiers are stored as URLs or prefixed labels.
x <- c(
"https://doi.org/10.1000/182.",
"doi:10.1000/182",
" 10.1000/182 "
)
scholid::normalize_scholid(
x = x,
type = "doi"
)
## [1] "10.1000/182" "10.1000/182" "10.1000/182"
For ORCID iDs, normalization removes URL prefixes and enforces hyphenated grouping.
x <- c(
"https://orcid.org/0000-0002-1825-0097",
"0000000218250097"
)
scholid::normalize_scholid(
x = x,
type = "orcid"
)
## [1] "0000-0002-1825-0097" "0000-0002-1825-0097"
Normalization is designed to be predictable:
- NA input stays NA.
- Invalid inputs typically become NA_character_.
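To make these rules concrete, here is a minimal base R sketch of a DOI normalizer that follows the same contract. The function name normalize_doi_sketch() and its simplified regular expressions are assumptions made for illustration; this is not scholid's implementation, and the real validity rules may differ.

```r
# Hypothetical helper, NOT scholid's implementation: a simplified DOI
# normalizer that strips wrappers and maps invalid input to NA_character_.
normalize_doi_sketch <- function(x) {
  out <- trimws(as.character(x))
  # Strip common wrappers: resolver URLs and a "doi:" label.
  out <- sub("^https?://(dx\\.)?doi\\.org/", "", out, ignore.case = TRUE)
  out <- sub("^doi:[[:space:]]*", "", out, ignore.case = TRUE)
  # Drop trailing sentence punctuation picked up from surrounding text.
  out <- sub("[.,;:]+$", "", out)
  # Simplified validity check: "10.", a 4-9 digit registrant code, "/", a suffix.
  ok <- !is.na(out) & grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", out)
  out[!ok] <- NA_character_
  out
}

normalize_doi_sketch(c(
  "https://doi.org/10.1000/182.",
  "doi:10.1000/182",
  "  10.1000/182  ",
  "not a doi",
  NA
))
```

In this sketch, the three wrapped forms collapse to the same canonical string, while the non-identifier and the missing value both come back as NA_character_.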
extract_scholid()

Extraction is for harvesting identifiers from unstructured text. The result is a list with one element per input element. Each element is a character vector of matches (possibly empty).
txt <- c(
"See https://doi.org/10.1000/182 and doi:10.5555/12345678.",
"No identifier here.",
NA
)
scholid::extract_scholid(
text = txt,
type = "doi"
)
## [[1]]
## [1] "10.1000/182" "10.5555/12345678."
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
The list return type is intentional: a single text string can contain multiple identifiers.
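Because the result is a list, downstream code usually needs to count or flatten the matches. The sketch below uses only base R on a list shaped like the output shown above; the object names res, n, and long are illustrative.

```r
# A list shaped like the extraction result shown above
# (values copied from that output).
res <- list(
  c("10.1000/182", "10.5555/12345678."),
  character(0),
  character(0)
)

# Number of matches per input element.
n <- lengths(res)

# Long format: one row per match, keeping the index of the source text.
long <- data.frame(
  text_id = rep(seq_along(res), times = n),
  match = unlist(res),
  stringsAsFactors = FALSE
)
long
```

Keeping text_id alongside each match makes it possible to join the identifiers back to the original rows after normalization.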
classify_scholid()

classify_scholid() returns the best-guess identifier type per element for mixed identifier columns. Classification is based on the set of available is_<type>() checks and the precedence order defined by scholid_types().
x <- c(
"10.1000/182",
"0000-0002-1825-0097",
"PMC12345",
"2101.00001v2",
"not an id",
NA
)
scholid::classify_scholid(x = x)
## [1] "doi" "orcid" "pmcid" "arxiv" NA NA
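The idea of classifying by running per-type checks in a fixed precedence order can be sketched in base R as follows. The checks list, its simplified regular expressions, its ordering, and the name classify_sketch() are all assumptions made for illustration; scholid's own is_<type>() rules and precedence may differ.

```r
# Hypothetical, heavily simplified checks; scholid's own is_<type>()
# rules are stricter. The list order stands in for the precedence order.
checks <- list(
  doi   = function(x) grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x),
  orcid = function(x) grepl("^([0-9]{4}-){3}[0-9]{3}[0-9X]$", x),
  pmcid = function(x) grepl("^PMC[0-9]+$", x)
)

classify_sketch <- function(x, checks) {
  out <- rep(NA_character_, length(x))
  # Walk the types in precedence order; the first matching check wins.
  for (type in names(checks)) {
    hit <- is.na(out) & !is.na(x) & checks[[type]](x)
    out[hit] <- type
  }
  out
}

classify_sketch(
  c("10.1000/182", "0000-0002-1825-0097", "PMC12345", "not an id", NA),
  checks
)
```

Values that match no check, and missing values, stay NA, which mirrors the behavior shown in the output above.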
Many identifiers appear wrapped (URLs, prefixes, trailing punctuation), but classification is strict and expects canonical strings. A common pattern is to extract, normalize, and then classify:
txt <- "Read https://doi.org/10.1000/182 (and ORCID 0000-0002-1825-0097)."
dois <- scholid::extract_scholid(txt, "doi")[[1]]
orcids <- scholid::extract_scholid(txt, "orcid")[[1]]
dois_n <- scholid::normalize_scholid(dois, "doi")
orcids_n <- scholid::normalize_scholid(orcids, "orcid")
scholid::classify_scholid(c(dois_n, orcids_n))
## [1] "doi" "orcid"
## [1] TRUE
## [1] TRUE
detect_scholid_type()

detect_scholid_type() performs best-effort type detection for mixed, messy identifier input. In contrast to classify_scholid(), detection also recognizes common wrapped forms such as URLs and prefixed labels (e.g., doi:, https://orcid.org/, arXiv:, PMID:).
Detection is useful when working with raw data where identifiers may not yet be normalized.
For example, strict classification returns NA for wrapped identifiers:
x <- c(
"https://doi.org/10.1000/182",
"ORCID: 0000-0002-1825-0097",
"arXiv:2101.00001",
"PMID: 12345",
"not an id"
)
scholid::classify_scholid(x)
## [1] NA NA NA NA NA
However, they can be detected directly:
## [1] "doi" "orcid" "arxiv" "pmid" NA
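A detection call on the same wrapped values might look like the sketch below. It relies only on the detect_scholid_type(x) signature listed earlier in this vignette, and the guard skips the call when scholid is not installed.

```r
# The same wrapped values that strict classification rejected.
x <- c(
  "https://doi.org/10.1000/182",
  "ORCID: 0000-0002-1825-0097",
  "arXiv:2101.00001",
  "PMID: 12345",
  "not an id"
)

if (requireNamespace("scholid", quietly = TRUE)) {
  # One detected type (or NA) per input element; values are not modified.
  detected <- scholid::detect_scholid_type(x)
  print(detected)
}
```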
Whitespace and minor formatting irregularities are handled conservatively:
## [1] "orcid" "doi" "issn"
detect_scholid_type() does not modify values. Once the
identifier type is known, use normalize_scholid() to
convert to canonical form and is_scholid() for strict
validation.
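A typical workflow for messy data is to detect the identifier type with detect_scholid_type(), convert to canonical form with normalize_scholid(), and validate strictly with is_scholid().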
This separation keeps detection permissive and normalization predictable, while preserving strict validation where needed.
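The detect, normalize, and validate sequence can be sketched as follows. The input vector and the variable names raw, type, canonical, and valid are illustrative assumptions; only the scholid functions and arguments named in this vignette are used, and the guard skips the calls when the package is not installed.

```r
# Illustrative raw input (an assumption): two wrapped identifiers and a
# value that is not an identifier.
raw <- c(
  "https://doi.org/10.1000/182",
  "https://orcid.org/0000-0002-1825-0097",
  "not an id"
)

if (requireNamespace("scholid", quietly = TRUE)) {
  # 1. Permissive detection on the raw values.
  type <- scholid::detect_scholid_type(raw)

  canonical <- rep(NA_character_, length(raw))
  valid <- rep(NA, length(raw))

  for (tp in unique(type[!is.na(type)])) {
    idx <- which(type == tp)
    # 2. Normalize each value with the type detected for it.
    canonical[idx] <- scholid::normalize_scholid(x = raw[idx], type = tp)
    # 3. Strictly validate the canonical strings.
    valid[idx] <- scholid::is_scholid(x = canonical[idx], type = tp)
  }

  print(data.frame(raw, type, canonical, valid))
}
```

Values with no detected type are left as NA in every derived column, so they can be filtered or reviewed separately.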
scholid is intentionally small and conservative:
is_*(), normalize_*(), and extract_*() helpers.

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Vienna
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.56
## [5] cachem_1.1.0 knitr_1.51 htmltools_0.5.9 rmarkdown_2.30
## [9] lifecycle_1.0.5 cli_3.6.5 scholid_0.1.0 sass_0.4.10
## [13] jquerylib_0.1.4 compiler_4.5.2 rstudioapi_0.18.0 tools_4.5.2
## [17] evaluate_1.0.5 bslib_0.10.0 yaml_2.3.12 otel_0.2.0
## [21] jsonlite_2.0.0 rlang_1.1.7