splitGraph ends at a split_spec object. It deliberately knows nothing about rsample, tidymodels, or any other resampling engine. The handoff contract is the sample_data table inside the spec plus a few scalar fields (group_var, block_vars, time_var, ordering_required, recommended_resampling).

This cookbook shows three small, self-contained adapters that turn a split_spec into something a downstream workflow can use:

1. Plain (train, test) row-index pairs: runnable here, with no extra dependencies.
2. An rsample::group_vfold_cv() adapter for grouped cross-validation keyed to group_id.
3. An rsample::rolling_origin() adapter for ordered evaluation keyed to order_rank.

Adapters 2 and 3 show idiomatic glue but are not evaluated in this vignette, so that splitGraph does not pick up rsample as a build-time dependency. The same pattern works for any other resampling library you happen to use.
meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
  timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
  time_index   = c(0, 1, 0, 1, 0, 1),
  outcome_id   = c("ctrl", "case", "ctrl", "case", "case", "ctrl"),
  stringsAsFactors = FALSE
)
g <- graph_from_metadata(meta, graph_name = "cookbook")
subject_constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(subject_constraint, graph = g)
spec
#> <split_spec> subject
#> Samples: 6
#> Groups: 3
#> Recommended resampling: grouped_cv

The sample_data table is the contract.

This is the simplest meaningful adapter. It groups by whatever split_spec$group_var says is the split unit, and returns one held-out group per fold.
logo_folds <- function(spec, observation_data, sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!sample_id_col %in% names(observation_data)) {
    stop("`observation_data` must contain a `", sample_id_col, "` column.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  joined$.row <- seq_len(nrow(joined))
  groups <- split(joined$.row, joined[[spec$group_var]])
  lapply(names(groups), function(g) {
    list(
      group = g,
      train = unlist(groups[setdiff(names(groups), g)], use.names = FALSE),
      assess = groups[[g]]
    )
  })
}
# Pretend we have an observation frame keyed by sample_id.
obs <- data.frame(
  sample_id = meta$sample_id,
  x = rnorm(nrow(meta)),
  y = rbinom(nrow(meta), 1, 0.5)
)
folds <- logo_folds(spec, obs)
length(folds)
#> [1] 3
folds[[1]]
#> $group
#> [1] "subject:P1"
#>
#> $train
#> [1] 3 4 5 6
#>
#> $assess
#> [1] 1 2

That is the entire downstream contract: take spec, take an observation frame, return train/assess index lists. Anything more complicated is specific to a resampling library.
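To make that contract concrete, here is a minimal sketch of a consumer. evaluate_folds() is an illustrative name, not part of splitGraph; it assumes an observation frame with x and y columns like obs above, and uses only base R:

```r
# Sketch only (illustrative, not part of splitGraph): one way to consume the
# (train, assess) index lists produced by logo_folds(). Base R throughout.
evaluate_folds <- function(folds, data) {
  acc <- vapply(folds, function(f) {
    # Fit on the training rows, score on the held-out group.
    fit <- glm(y ~ x, data = data[f$train, ], family = binomial())
    p <- predict(fit, newdata = data[f$assess, ], type = "response")
    mean((p > 0.5) == (data$y[f$assess] == 1))
  }, numeric(1))
  names(acc) <- vapply(folds, `[[`, character(1), "group")
  acc
}
```

With the folds and obs objects from above, evaluate_folds(folds, obs) returns one held-out accuracy per group, named by group id.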
rsample::group_vfold_cv()

Grouped CV keyed to group_id. The downstream package would typically ship something like this; the adapter is short enough that you can paste it into your own analysis script.
spec_to_group_vfold <- function(spec, observation_data,
                                v = NULL,
                                sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  n_groups <- length(unique(joined[[spec$group_var]]))
  if (is.null(v)) v <- n_groups
  rsample::group_vfold_cv(
    data = joined,
    group = spec$group_var,  # group_vfold_cv() accepts a character column name
    v = v
  )
}

v = NULL (the default above) gives leave-one-group-out, which is the right default when splitGraph has already grouped samples by their deepest leakage-relevant unit (e.g. subject). Pick a smaller v for k-fold-style grouped CV.
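A usage sketch, not evaluated here per the no-rsample-dependency policy above. With the 3-subject spec from this vignette:

```r
# Not evaluated in this vignette (requires rsample installed).
rs_logo  <- spec_to_group_vfold(spec, obs)         # v = NULL: leave-one-subject-out
rs_kfold <- spec_to_group_vfold(spec, obs, v = 2)  # subjects pooled into 2 folds
```

Either way, every split keeps whole groups together, so no subject appears on both sides of a split.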
rsample::rolling_origin()

When spec$ordering_required is TRUE (or spec$time_var is set), the right downstream object is an ordered split rather than a grouped one.
spec_to_rolling_origin <- function(spec, observation_data,
                                   sample_id_col = "sample_id",
                                   initial = NULL,
                                   assess = 1L) {
  stopifnot(inherits(spec, "split_spec"))
  if (is.null(spec$time_var)) {
    stop("This split_spec has no `time_var`; ordered evaluation is not available.")
  }
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$time_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  ordered <- joined[order(joined[[spec$time_var]]), , drop = FALSE]
  if (is.null(initial)) initial <- max(1L, floor(nrow(ordered) * 0.6))
  rsample::rolling_origin(ordered, initial = initial, assess = assess)
}

The key idea: splitGraph puts ordering information on the spec; the adapter is just a thin shim that consumes it.
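A usage sketch, again not evaluated here. It assumes spec$sample_data carries the order_rank column, as the serialized spec below suggests:

```r
# Not evaluated in this vignette (requires rsample installed).
# With time_var = "order_rank", ~60% of the ordered rows seed the
# initial window by default; override `initial` to change that.
ro <- spec_to_rolling_origin(spec, obs, assess = 1L)
```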
If the downstream consumer is not in R, write the spec to JSON and let the consumer (Python, Julia, a CLI) interpret it.
tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)
# Inspect the first ~30 lines so the on-disk format is visible.
cat(readLines(tmp, n = 30), sep = "\n")
#> {
#> "splitGraph_object": "split_spec",
#> "schema_version": "0.1.0",
#> "group_var": "group_id",
#> "block_vars": [
#> "batch_group"
#> ],
#> "time_var": "order_rank",
#> "ordering_required": false,
#> "constraint_mode": "subject",
#> "constraint_strategy": "subject",
#> "recommended_resampling": "grouped_cv",
#> "metadata": {
#> "graph_name": "cookbook",
#> "dataset_name": null,
#> "source_mode": "subject",
#> "source_strategy": "subject",
#> "relations_used": "sample_belongs_to_subject",
#> "n_samples": 6,
#> "n_groups": 3,
#> "warnings": [],
#> "enriched_from_graph": true
#> },
#> "sample_data": [
#> {
#> "sample_id": "S1",
#> "sample_node_id": "sample:S1",
#> "group_id": "subject:P1",
#> "primary_group": "subject:P1",
#> "batch_group": "B1",
# And read it back exactly.
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)
#> [1] TRUE
unlink(tmp)

The same pair exists for dependency_graph (write_dependency_graph() / read_dependency_graph()). Both formats are documented under ?write_split_spec and ?write_dependency_graph, and include a schema_version field so consumers can detect drift.
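To illustrate the consumer side, here is a sketch that reads the raw JSON directly, without loading splitGraph. jsonlite stands in for any JSON reader (Python's json module, Julia's JSON3, a CLI's jq); read_spec_fields() is an illustrative name, and jsonlite is an assumption of this sketch, not a splitGraph dependency:

```r
# Sketch only: a "foreign" consumer of the on-disk format shown above.
# jsonlite stands in for any JSON reader; field names follow the format above.
read_spec_fields <- function(path) {
  raw <- jsonlite::read_json(path, simplifyVector = TRUE)
  list(
    schema_version = raw$schema_version,         # detect format drift
    group_var      = raw$group_var,              # which column to split on
    groups         = unique(raw$sample_data$group_id)
  )
}
```

Applied to the tmp file written above, read_spec_fields(tmp) recovers the grouping with no splitGraph code on the consumer side.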
The only assumptions an adapter has to honor:

- split_spec$sample_data is keyed by sample_id (character).
- split_spec$group_var is the column that holds the splitting unit.
- split_spec$block_vars are present-but-coarser blocking columns.
- split_spec$time_var, when non-NULL, defines the ordering.
- split_spec$recommended_resampling is a hint, not a contract; your adapter is free to ignore it.

That is the whole interface. As long as those five fields are honored, anything is a valid downstream consumer.
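As a closing sketch, those assumptions can be checked up front. validate_spec_contract() is a hypothetical helper (the name and checks are illustrative, not part of splitGraph) that an adapter could run before doing anything else:

```r
# Hypothetical helper (illustrative, not part of splitGraph): fail fast if an
# object does not honor the five-field contract described above.
validate_spec_contract <- function(spec) {
  stopifnot(
    is.data.frame(spec$sample_data),
    is.character(spec$sample_data$sample_id),
    !anyDuplicated(spec$sample_data$sample_id),   # sample_id is the key
    is.character(spec$group_var),
    spec$group_var %in% names(spec$sample_data)
  )
  for (bv in spec$block_vars) {                   # block_vars may be NULL
    stopifnot(bv %in% names(spec$sample_data))
  }
  if (!is.null(spec$time_var)) {                  # time_var is optional
    stopifnot(spec$time_var %in% names(spec$sample_data))
  }
  invisible(spec)
}
```

Running it at the top of every adapter turns a subtle downstream leak into an immediate, localized error.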