
1 What this vignette is for

splitGraph ends at a split_spec object. It deliberately knows nothing about rsample, tidymodels, or any other resampling engine. The handoff contract is the sample_data table inside the spec plus a few scalar fields (group_var, block_vars, time_var, ordering_required, recommended_resampling).

This cookbook shows three small, self-contained adapters that turn a split_spec into something a downstream workflow can use:

  1. A base-R adapter that returns a list of (train, test) row-index pairs — runnable here, no extra dependencies.
  2. An rsample::group_vfold_cv() adapter for grouped cross-validation keyed to group_id.
  3. An rsample::rolling_origin() adapter for ordered evaluation keyed to order_rank.

Adapters 2 and 3 show idiomatic glue but are not evaluated in this vignette so that splitGraph does not pick up rsample as a build-time dependency.

The same pattern works for any other resampling library.

2 Build a split_spec to work with

meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
  timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
  time_index   = c(0, 1, 0, 1, 0, 1),
  outcome_id   = c("ctrl", "case", "ctrl", "case", "case", "ctrl"),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "cookbook")
subject_constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(subject_constraint, graph = g)
spec
#> <split_spec> subject 
#>   Samples: 6 
#>   Groups: 3 
#>   Recommended resampling: grouped_cv

The sample_data table is the contract:

as.data.frame(spec)[, c("sample_id", "group_id", "batch_group", "order_rank")]
#>   sample_id   group_id batch_group order_rank
#> 1        S1 subject:P1          B1          1
#> 2        S2 subject:P1          B2          2
#> 3        S3 subject:P2          B1          1
#> 4        S4 subject:P2          B2          2
#> 5        S5 subject:P3          B1          1
#> 6        S6 subject:P3          B2          2

3 Adapter 1 — base R: leave-one-group-out folds

This is the simplest meaningful adapter. It groups by whatever split_spec$group_var says is the split unit, and returns one held-out group per fold.

logo_folds <- function(spec, observation_data, sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!sample_id_col %in% names(observation_data)) {
    stop("`observation_data` must contain a `", sample_id_col, "` column.")
  }

  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  joined$.row <- seq_len(nrow(joined))
  groups <- split(joined$.row, joined[[spec$group_var]])

  lapply(names(groups), function(grp) {
    list(
      group   = grp,
      train   = unlist(groups[setdiff(names(groups), grp)], use.names = FALSE),
      assess  = groups[[grp]]
    )
  })
}

# Pretend we have an observation frame keyed by sample_id.
obs <- data.frame(
  sample_id = meta$sample_id,
  x = rnorm(nrow(meta)),
  y = rbinom(nrow(meta), 1, 0.5)
)

folds <- logo_folds(spec, obs)
length(folds)
#> [1] 3
folds[[1]]
#> $group
#> [1] "subject:P1"
#> 
#> $train
#> [1] 3 4 5 6
#> 
#> $assess
#> [1] 1 2

That is the entire downstream contract: take a spec and an observation frame, return train/assess index lists. Anything more complicated is specific to a resampling library.
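To show what a consumer does with that contract, here is a minimal sketch of a fold loop. It hardcodes toy data and folds in the shape logo_folds() returns above (the fold indices mirror the printed output), and uses a plain lm() stand-in for whatever model you actually fit:

```r
# Toy observation frame and folds in the shape returned by logo_folds():
# one held-out group per fold, indices into `obs`.
obs <- data.frame(
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7),
  y = c(0, 0, 0, 1, 1, 1)
)
folds <- list(
  list(group = "subject:P1", train = 3:6,            assess = 1:2),
  list(group = "subject:P2", train = c(1, 2, 5, 6),  assess = 3:4),
  list(group = "subject:P3", train = 1:4,            assess = 5:6)
)

# Fit on the training rows, score on the held-out group, one value per fold.
fold_accuracy <- vapply(folds, function(f) {
  fit  <- lm(y ~ x, data = obs[f$train, ])
  pred <- as.integer(predict(fit, newdata = obs[f$assess, ]) > 0.5)
  mean(pred == obs$y[f$assess])
}, numeric(1))
fold_accuracy
```

Because each fold holds out a whole group, the per-fold scores estimate performance on unseen subjects rather than unseen samples.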

4 Adapter 2 — rsample::group_vfold_cv()

Grouped CV keyed to group_id. The downstream package would typically ship something like this; the adapter is short enough that you can paste it into your own analysis script.

spec_to_group_vfold <- function(spec, observation_data,
                                v = NULL,
                                sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }

  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )

  n_groups <- length(unique(joined[[spec$group_var]]))
  if (is.null(v)) v <- n_groups

  rsample::group_vfold_cv(
    data  = joined,
    group = spec$group_var,
    v     = v
  )
}

Leaving v = NULL (the default above) gives leave-one-group-out CV, which is the right default when splitGraph has already grouped samples by their deepest leakage-relevant unit (e.g. subject). Pass a smaller v for k-fold-style grouped CV.
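As with the other rsample code in this vignette, a usage sketch is shown but not evaluated; it assumes rsample is installed and reuses the spec and obs objects built above:

```r
# Not evaluated: requires rsample. `spec` and `obs` are the objects above.
rs_logo <- spec_to_group_vfold(spec, obs)         # v = NULL: one fold per group (3 here)
rs_2    <- spec_to_group_vfold(spec, obs, v = 2)  # coarser 2-fold grouped CV
# Standard rsample accessors then apply to each split:
# rsample::analysis(rs_logo$splits[[1]])
# rsample::assessment(rs_logo$splits[[1]])
```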

5 Adapter 3 — rsample::rolling_origin()

When spec$ordering_required is TRUE (or spec$time_var is set), the right downstream object is an ordered split rather than a grouped one.

spec_to_rolling_origin <- function(spec, observation_data,
                                   sample_id_col = "sample_id",
                                   initial = NULL,
                                   assess = 1L) {
  stopifnot(inherits(spec, "split_spec"))
  if (is.null(spec$time_var)) {
    stop("This split_spec has no `time_var`; ordered evaluation is not available.")
  }
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }

  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$time_var)],
    by.x = sample_id_col, by.y = "sample_id", sort = FALSE
  )
  ordered <- joined[order(joined[[spec$time_var]]), , drop = FALSE]

  if (is.null(initial)) initial <- max(1L, floor(nrow(ordered) * 0.6))
  rsample::rolling_origin(ordered, initial = initial, assess = assess)
}

The key idea: splitGraph puts ordering information on the spec; the adapter is just a thin shim that consumes it.
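A matching usage sketch, again not evaluated (it requires rsample). Because the example spec carries time_var = "order_rank", the adapter applies to it directly:

```r
# Not evaluated: requires rsample. `spec` and `obs` are the objects above.
ro <- spec_to_rolling_origin(spec, obs, initial = 4, assess = 1)
# Each successive split trains on an expanding window of early rows and
# assesses the next row in time order, so later samples never leak backwards.
```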

6 Going across language boundaries via JSON

If the downstream consumer is not in R, write the spec to JSON and let the consumer (Python, Julia, a CLI) interpret it.

tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)

# Inspect the first ~30 lines so the on-disk format is visible.
cat(readLines(tmp, n = 30), sep = "\n")
#> {
#>   "splitGraph_object": "split_spec",
#>   "schema_version": "0.1.0",
#>   "group_var": "group_id",
#>   "block_vars": [
#>     "batch_group"
#>   ],
#>   "time_var": "order_rank",
#>   "ordering_required": false,
#>   "constraint_mode": "subject",
#>   "constraint_strategy": "subject",
#>   "recommended_resampling": "grouped_cv",
#>   "metadata": {
#>     "graph_name": "cookbook",
#>     "dataset_name": null,
#>     "source_mode": "subject",
#>     "source_strategy": "subject",
#>     "relations_used": "sample_belongs_to_subject",
#>     "n_samples": 6,
#>     "n_groups": 3,
#>     "warnings": [],
#>     "enriched_from_graph": true
#>   },
#>   "sample_data": [
#>     {
#>       "sample_id": "S1",
#>       "sample_node_id": "sample:S1",
#>       "group_id": "subject:P1",
#>       "primary_group": "subject:P1",
#>       "batch_group": "B1",

# And read it back exactly.
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)
#> [1] TRUE

unlink(tmp)

The same pair exists for dependency_graph (write_dependency_graph() / read_dependency_graph()). Both formats are documented under ?write_split_spec and ?write_dependency_graph and include a schema_version field so consumers can detect drift.
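To make the cross-language claim concrete without leaving R, here is a minimal sketch of a consumer that knows nothing about splitGraph: it parses the JSON with plain jsonlite (assumed installed) instead of read_split_spec() and touches only the contract fields. The inline JSON is a trimmed copy of the file shown above:

```r
# A splitGraph-agnostic consumer: parse the JSON, check the schema version,
# and recover the sample -> group assignment from the contract fields alone.
library(jsonlite)

json <- '{
  "splitGraph_object": "split_spec",
  "schema_version": "0.1.0",
  "group_var": "group_id",
  "sample_data": [
    {"sample_id": "S1", "group_id": "subject:P1"},
    {"sample_id": "S2", "group_id": "subject:P1"},
    {"sample_id": "S3", "group_id": "subject:P2"}
  ]
}'

spec_min <- fromJSON(json)                      # sample_data becomes a data.frame
stopifnot(spec_min$schema_version == "0.1.0")   # detect schema drift before consuming
split(spec_min$sample_data$sample_id,
      spec_min$sample_data[[spec_min$group_var]])
```

A Python or Julia consumer would do the same three things: parse, check schema_version, and group sample_id by the column named in group_var.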

7 When you need a custom adapter

The only assumptions an adapter has to honor:

  • split_spec$sample_data is keyed by sample_id (character).
  • split_spec$group_var is the column that holds the splitting unit.
  • split_spec$block_vars names additional blocking columns, coarser than the split unit (e.g. batch).
  • split_spec$time_var, when non-NULL, defines the ordering.
  • split_spec$recommended_resampling is a hint, not a contract — your adapter is free to ignore it.

That is the whole interface. As long as those five fields are honored, anything is a valid downstream consumer.
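Those assumptions are easy to check mechanically before an adapter does any work. A minimal base-R sketch (validate_spec_contract() is hypothetical, not part of splitGraph):

```r
# Defensive check of the five-field contract on any list-like spec.
validate_spec_contract <- function(spec) {
  stopifnot(
    is.data.frame(spec$sample_data),
    is.character(spec$sample_data$sample_id),
    is.character(spec$group_var),
    spec$group_var %in% names(spec$sample_data),
    all(spec$block_vars %in% names(spec$sample_data)),
    is.null(spec$time_var) || spec$time_var %in% names(spec$sample_data)
  )
  invisible(spec)
}

# Works on anything honoring the interface, not just split_spec objects:
toy <- list(
  sample_data = data.frame(sample_id = c("S1", "S2"),
                           group_id  = c("g1", "g2"),
                           stringsAsFactors = FALSE),
  group_var   = "group_id",
  block_vars  = character(0),
  time_var    = NULL
)
validate_spec_contract(toy)
```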
