
rolescry

Name-blind variable-role detection by data signature. Data inspice, non nomen – inspect the data, not the name.

rolescry assigns statistical roles to the columns of a tabular dataset: group variable, continuous or binary outcome, survival time and event, paired and agreement measurement pairs, repeated measures, scale items, subject identifier, and covariates. It uses only each column’s information-theoretic signature (Shannon entropy, normalized mutual information, distributional shape, and inter-column structure), never the column names, so renaming every column to col_1, col_2, ... does not change the result. It uses no large language models and transmits no data externally; detection is deterministic.

Extracted from the MDStatR biostatistics engine.

Installation

From r-universe:

install.packages("rolescry", repos = "https://canboynukara.r-universe.dev")

From GitHub:

# install.packages("remotes")
remotes::install_github("canboynukara/rolescry")

The package needs only base R + stats. Optional packages (readxl/openxlsx/haven for file reading; moments/diptest/stringdist for extra refinements) are used only if installed.
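
The "used only if installed" behaviour is commonly written in R with requireNamespace(). The sketch below shows one way such a guard might look; it is not taken from the rolescry source, and skewness_or_fallback() is a hypothetical helper name.

```r
# Hypothetical helper (not a rolescry function): use 'moments' when it is
# installed, otherwise fall back to a base-R computation of the same quantity.
skewness_or_fallback <- function(x) {
  x <- x[!is.na(x)]
  if (requireNamespace("moments", quietly = TRUE)) {
    return(moments::skewness(x))
  }
  # Base-R fallback: third standardized moment (population form).
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewness_or_fallback(c(1, 2, 3, 4, 10))   # right-skewed sample, positive value
```

Either branch computes the same statistic, so results do not depend on which optional packages happen to be present.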

Quick start

library(rolescry)

set.seed(1)
d <- data.frame(
  arm  = rep(c(0, 1), each = 50),   # group
  pre  = rnorm(100, 10, 2),         # paired with post
  post = rnorm(100, 11, 2),
  resp = rbinom(100, 1, 0.4)        # binary outcome
)

res <- detect_roles(d)
res
res$roles$group_var$columns
summary(res)

The name-blindness guarantee

Detection is purely mathematical by default (name_bonus = NULL):

pos <- function(res, dat) match(res$roles$paired_pairs$columns, names(dat))
d_blind <- setNames(d, paste0("col_", seq_along(d)))
identical(pos(detect_roles(d), d), pos(detect_roles(d_blind), d_blind))
#> TRUE  -- the SAME columns (by position) are detected, named or col_N
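
Renaming cannot matter for any signature that is computed from values alone. The following is a minimal base-R illustration using Shannon entropy; column_entropy() and the toy data are assumptions made for this example and are not part of the rolescry API.

```r
# Hypothetical helper: Shannon entropy (in bits) of a column's observed values.
column_entropy <- function(x) {
  counts <- as.numeric(table(x))   # table() drops NA by default
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

toy <- data.frame(
  arm  = rep(c(0, 1), each = 50),  # two balanced levels
  site = rep(1:4, times = 25)      # four balanced levels
)
toy_blind <- setNames(toy, paste0("col_", seq_along(toy)))

# Drop the names so only the values (by position) are compared.
sig       <- unname(vapply(toy, column_entropy, numeric(1)))
sig_blind <- unname(vapply(toy_blind, column_entropy, numeric(1)))

sig                        # 1 bit for arm, 2 bits for site
identical(sig, sig_blind)  # TRUE: same values, same signature
```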

Column names can optionally act as a small, capped tie-breaker when you pass a keyword dictionary. The bonus is at most +10 points (i.e. <= 10%), so the mathematical signature still dominates:

detect_roles(d, name_bonus = rolescry_default_name_bonus())
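
To make the idea of a capped nudge concrete, here is a small sketch of how a bounded bonus could be combined with a signature score. The function apply_name_bonus(), its arguments, and the point scale are all assumptions for illustration; the README states only that the nudge is at most +10 points.

```r
# Hypothetical scoring helper, not rolescry internals.
# signature_score: score derived from the data alone.
# keyword_points:  raw points suggested by a name match, or NULL when name-blind.
apply_name_bonus <- function(signature_score, keyword_points = NULL, cap = 10) {
  if (is.null(keyword_points)) {
    return(signature_score)                  # name-blind default: no change
  }
  bonus <- min(max(keyword_points, 0), cap)  # clamp the bonus to [0, cap]
  signature_score + bonus
}

apply_name_bonus(72)       # 72: no dictionary, score untouched
apply_name_bonus(72, 4)    # 76: small bonus passes through
apply_name_bonus(72, 25)   # 82: large bonus is cut to the cap
```

With a cap like this, a name match can separate two near-tied candidates but cannot overturn a clear difference in the data-driven scores.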

Header-aware loading

df <- read_data("messy_export.xlsx")   # auto-detects the header row
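
The README does not describe how read_data() finds the header row. As a rough illustration of the problem, the sketch below uses a deliberately simple heuristic (the first row in which every cell is filled) on invented toy data; find_header_row() is a hypothetical helper and should not be read as the package's actual logic.

```r
# Hypothetical heuristic, not read_data()'s implementation: treat the first
# row in which every cell is non-empty as the header row.
find_header_row <- function(raw) {
  filled <- apply(raw, 1, function(r) all(!is.na(r) & nzchar(trimws(r))))
  which(filled)[1]
}

# Toy export: a title row and a blank row come before the real header.
raw <- data.frame(
  V1 = c("Export 2024", "", "id", "1", "2"),
  V2 = c("", "", "score", "3.5", "4.1"),
  stringsAsFactors = FALSE
)

h <- find_header_row(raw)   # 3

# Promote the header row to column names and keep only the rows below it.
clean <- raw[-seq_len(h), , drop = FALSE]
names(clean) <- unlist(raw[h, ], use.names = FALSE)
rownames(clean) <- NULL
clean <- utils::type.convert(clean, as.is = TRUE)   # "1" -> 1L, "3.5" -> 3.5
str(clean)
```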

How it works

detect_roles() first types each column from its values (.build_var_info). It then scores candidate roles with information-theoretic and distributional signatures; compute_nmi() exposes the normalized mutual information directly. The result is a structured role_detection object with per-role confidence and a component breakdown. See vignette("rolescry") for the method and the name-blind guarantee.
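
For readers unfamiliar with normalized mutual information, the following base-R sketch computes it for two discrete columns. It assumes the normalization I(X;Y) / sqrt(H(X) H(Y)); compute_nmi() may use a different normalization or handle continuous columns differently, so treat nmi_sketch() and entropy_bits() as illustrative names only.

```r
# Illustrative only; not the implementation behind compute_nmi().
entropy_bits <- function(p) {
  p <- p[p > 0]                  # 0 * log(0) is taken as 0
  -sum(p * log2(p))
}

nmi_sketch <- function(x, y) {
  tab   <- table(x, y)           # pairs with NA in either column are dropped
  joint <- tab / sum(tab)        # joint distribution
  hx <- entropy_bits(rowSums(joint))
  hy <- entropy_bits(colSums(joint))
  if (hx == 0 || hy == 0) return(0)   # a constant column carries no information
  mi <- hx + hy - entropy_bits(as.vector(joint))
  mi / sqrt(hx * hy)
}

g <- rep(c("a", "b"), each = 50)
nmi_sketch(g, g)                             # identical columns: 1
nmi_sketch(g, rep(c("u", "v"), times = 50))  # balanced and independent: 0
```

Values near 1 indicate that one column almost determines the other, while values near 0 indicate little shared structure.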

Citation & attribution

Derived from Boynukara, C. (2026). MDStatR (v2.1.0 Veritas). Zenodo. https://doi.org/10.5281/zenodo.20707791

Run citation("rolescry") to cite the package and its parent engine.

License

Apache License 2.0 (inherited from the parent MDStatR project). See LICENSE.md.
