The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
rtransparency automatically identifies and extracts
indicators of research transparency from the full text
of biomedical articles, in both PubMed Central (PMC) JATS XML and
plain-text (PDF-derived) form. Every prediction comes with the exact
statement that triggered it, so results are auditable rather than a
black box. Detection is rule-based (curated regular expressions over the
relevant article sections), self-contained (no GitHub-only or AGPL
dependencies), and ships with reproducible accuracy benchmarks.
| Indicator | Detects | XML function | Text function |
|---|---|---|---|
| Conflicts of interest | A COI disclosure is present (including “no competing interests”) | rt_coi_pmc |
rt_coi |
| Funding | A statement that funding was received | rt_fund_pmc |
rt_fund |
| Protocol registration | A trial/protocol registration identifier or statement (NCT, ISRCTN, PROSPERO, OSF, CHiCTR, DRKS, ANZCTR, IRCT, UMIN, …) | rt_register_pmc |
rt_register |
| Novelty | The article claims its own work is novel or first | rt_novelty_pmc |
rt_novelty |
| Replication | A replication or external/independent validation was performed | rt_replication_pmc |
rt_replication |
| Data sharing | The authors’ own data are made available (repository, accession, or in-article) | rt_data_code_pmc |
rt_data_code |
| Code sharing | The authors’ own analysis code is shared | rt_data_code_pmc |
rt_data_code |
| AI disclosure | A statement discloses generative-AI use in manuscript preparation (2023+) | rt_ai_pmc |
rt_ai |
Conflicts of interest and AI disclosure are disclosure-based: a statement on the topic counts whether the disclosure is positive or negative. Conflict-of- interest and funding statements are detected not only in English but also in Spanish, Portuguese, French, German and Italian.
# From CRAN (when available)
install.packages("rtransparency")
# Development version from GitHub
# install.packages("remotes")
remotes::install_github("choxos/rtransparency", build_vignettes = TRUE)No GitHub-only or AGPL dependencies are required; data and code
detection is native (it no longer wraps oddpub).
rt_read_pdf() (PDF to text) additionally needs the poppler
pdftotext utility on your system. The optional
furrr and future packages enable parallel
corpus processing; ggplot2 enables plotting.
library(rtransparency)
xml <- system.file("extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency")
res <- rt_all_pmc(xml, remove_ns = TRUE)
# The predictions, one column per indicator:
res[, c("is_coi_pred", "is_fund_pred", "is_register_pred", "is_novelty_pred",
"is_replication_pred", "is_open_data", "is_open_code", "is_ai_pred")]
# Each prediction is paired with the text that triggered it, e.g.:
res$coi_text
res$fund_text
res$open_data_statementsrt_all_pmc() returns one row with the eight predictions,
the extracted statement for each, article identifiers and metadata, the
year, and is_success. is_ai_pred is
NA for articles published before 2023.
Each indicator can be run on its own, for a PMC XML file or a plain-text file:
rt_coi_pmc(xml, remove_ns = TRUE) # conflicts of interest
rt_fund_pmc(xml, remove_ns = TRUE) # funding
rt_register_pmc(xml, remove_ns = TRUE) # protocol registration
rt_novelty_pmc(xml, remove_ns = TRUE) # novelty claims
rt_replication_pmc(xml, remove_ns = TRUE)# replication / external validation
rt_data_code_pmc(xml, remove_ns = TRUE) # data AND code sharing (+ extracted links)
rt_ai_pmc(xml, remove_ns = TRUE) # generative-AI-use disclosure (2023+)
rt_meta_pmc(xml, remove_ns = TRUE) # article metadatart_all_pmc_dir() runs all eight indicators over an
entire directory (or a vector of paths). It is built for large
corpora:
res <- rt_all_pmc_dir(
"path/to/xml", # a directory, or a character vector of file paths
remove_ns = TRUE,
output = "results.csv", # resumable: re-running skips files already recorded
parallel = TRUE, # via furrr + an active future::plan()
progress = TRUE
)output, results are
written to a CSV in chunks; a re-run skips files already recorded and
appends only the new ones.is_success = FALSE row instead of aborting the run.future::plan("multisession") and
parallel = TRUE.The same detectors run on plain-text (PDF-derived) articles.
rt_read_pdf() returns the extracted text as a character
string; write it to a .txt file, then point the text
detectors (which share the PMC detection logic) at that file:
article_txt <- rt_read_pdf("article.pdf") # needs poppler's pdftotext; returns text
writeLines(article_txt, "article.txt") # the detectors take a file path
rt_all("article.txt") # COI, funding, registration, novelty, replication
rt_coi("article.txt") # or one indicator at a time
rt_ai("article.txt") # generative-AI-use disclosurert_ai() is the plain-text counterpart of
rt_ai_pmc(). Because a text file carries no reliable
publication date, it applies no 2023 year gate (it
returns TRUE/FALSE, never NA) and
cannot confine the scan to back-matter sections, so restrict its use to
2023-or-later articles and expect a slightly higher false-positive rate
on papers that use AI as a research method.
Once you have one row per article, summarize the corpus:
data(rt_demo) # a small simulated example shipped with the package
rt_summary(rt_demo) # per-indicator prevalence with a Wilson confidence
# interval and a sensitivity/specificity-corrected
# (Rogan-Gladen) prevalence
rt_summary(rt_demo, by = "year") # subgroup summaries
rt_score(rt_demo) # add a per-article count of openness practices met
rt_plot(rt_demo) # prevalence bar chart
rt_plot(rt_demo, type = "trend", year = "year") # prevalence over timeThe accuracy correction uses the bundled rt_accuracy
table (detector sensitivity and specificity for seven indicators).
Supply your own estimates:
rt_accuracy # the bundled estimates
my_acc <- data.frame(variable = "is_open_data", sensitivity = 0.84, specificity = 0.97)
rt_summary(rt_demo, accuracy = my_acc) # correct with your own valuesThe data- and code-availability links the detector extracts
(open_data_links, open_code_links) can be
passed to FAIR-assessment tooling such as rfair to score
the findability and accessibility of the shared resources.
Benchmarked against the human-labeled XML benchmark of Serghiou et
al. (2021), reproducible under data-raw/benchmark/, with
results in inst/benchmark/:
| Indicator | Sensitivity | Specificity |
|---|---|---|
| Conflicts of interest | 94.0% | 100% |
| Funding | 100% | 95.7% |
| Protocol registration | 99.2% | 96.9% |
| Data sharing | 76.5% | 99.0% |
| Code sharing | 88.1% | 99.5% |
Registration and code in the table above are labeled independently of
the detector; COI, funding and data labels in the 1000-article 2023
sample were reconciled against detector-extracted statements
(detector-adjudicated), so their agreement is not a fully independent
estimate. Data sharing is deliberately precision-favoring: its 76.5%
sensitivity trades recall for 99.0% specificity (the original
oddpub algorithm scores about 84%/97% on this set).
The newer indicators are validated against maintainer-built,
hand-labeled benchmarks in inst/benchmark/:
| Indicator | Sensitivity | Specificity | Basis |
|---|---|---|---|
| Novelty | 83.8% | 95.2% | hand-labeled novelty/replication gold set |
| Replication | 92.8% | 98.5% | replication-enriched sample (111 positives); correction is approximate |
| AI-use disclosure | not accuracy-corrected | — | experimental; only 9 positives in the 2023 sample |
Replication’s correction mixes designs (sensitivity from the enriched
sample, specificity from the representative 2023 sample), so it is less
clean than the single-design corrections above. AI-use disclosure is
reported uncorrected and is excluded from rt_accuracy until
a larger labeled post-2022 sample exists. Two further benchmarks live in
inst/benchmark/: a five-language sample
for multilingual COI and funding, and a TXT-parity
benchmark comparing the text and XML detectors.
See vignette("rtransparency") for the methodology and
vignette("scope-and-limitations") for what each indicator
does and does not capture.
vignette("rtransparency") — introduction and
methodologyvignette("transparency-summary") — corpus prevalence,
scoring and plottingvignette("ai-disclosure") — the AI-use disclosure
indicator in depthvignette("scope-and-limitations") — indicator
semantics, limitations, output schemaThis package builds on the original
rtransparent tool of Stylianos (Stelios)
Serghiou, an enhanced, renamed fork maintained by Ahmad Sofi-Mahmudi (ORCID
0000-0001-6829-0823, GitHub @choxos). It adds four indicators
(novelty, replication, AI disclosure, and a natively re-implemented
data/code detector), multilingual COI and funding detection, plain-text
parity, and corpus-scale batch processing. Serghiou is credited as an
author.
The foundational paper: Serghiou et al., Assessment of
transparency indicators across the biomedical literature: How open is
open? PLOS Biology, 2021, doi:10.1371/journal.pbio.3001107.
Run citation("rtransparency") for both references.
Please file bugs or questions as issues at https://github.com/choxos/rtransparency/issues with a minimal reproducible example.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.