The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

transitiontrees

Predictive pathway discovery in categorical sequence data.

transitiontrees fits a variable-depth pathway tree (a prediction suffix tree; Ron, Singer & Tishby 1996) from sequences and exposes a tidy, pathway-centric API: most common pathways, most predictively divergent pathways, modal-flip diagnostics, and next-state predictions.

Why this package

Sequence-analysis tooling tends to either count k-grams (descriptive) or fit fixed-order Markov chains (one-size-fits-all memory). Real trajectories — student engagement weeks, patient pathways, clickstreams, play sequences — usually have memory that varies by context: some states predict the next state on their own, others only when paired with a longer history. transitiontrees makes that variable-depth structure the central object and reports it as a ranked list of pathways the data actually supports.

It is a self-contained, independent implementation of the variable-order prediction-suffix-tree model:

Installation

# From CRAN (once released)
install.packages("transitiontrees")

# Development version from GitHub
# install.packages("remotes")
remotes::install_github("mohsaqr/transitiontrees")

Usage

library(transitiontrees)

# --- Fit ------------------------------------------------------------
# From a wide character matrix / data.frame, or a list of character
# vectors. `min_count` is the minimum occurrences a context needs to
# get its own node.
tree <- context_tree(seqs, max_depth = 4, min_count = 5, smoothing = "floor")

# Smoothing as an explicit list (hyperparameters), or a different scheme:
# context_tree(seqs, max_depth = 4, smoothing = list("floor", ymin = 0.001))
# context_tree(seqs, max_depth = 4, smoothing = "kneser_ney")

# From a long event log directly (no manual reshaping):
# context_tree(events, actor = "id", action = "move", order = "step",
#              max_depth = 3)
# or build the wide frame yourself:
# wide <- prepare_input(events, actor = "id", time = "timestamp",
#                       action = "move")

# --- Inspect --------------------------------------------------------
print(tree)
summary(tree)
model_fit(tree)                          # logLik, df, nobs, AIC, BIC, perplexity

# --- Prune ----------------------------------------------------------
# Collapse contexts whose extra depth is not a significant gain over
# their parent (likelihood-ratio G^2 test at familywise alpha = 0.05).
pruned <- prune_tree(tree, criterion = "G2", alpha = 0.05)

# --- The pathway-centric API ---------------------------------------
tree_pathways(pruned)                       # all pathways, sorted by count
common_pathways(pruned,    top = 8)         # top by frequency
divergent_pathways(pruned, top = 6)         # top by KL from shorter history
divergent_pathways(pruned, flips_only = TRUE)
sharp_pathways(pruned,     top = 5)         # most deterministic continuations
tree_dependence(pruned)                     # per-context entropy/KL diagnostic

# --- Predict & evaluate --------------------------------------------
predict(pruned, newdata = list(c("A","B","B"), c("A","A","C")))
predict(pruned, c("A","B"), type = "class")   # modal next state
logLik(pruned); AIC(pruned); BIC(pruned)
perplexity(pruned, newdata = test_seqs)
score_sequences(pruned, newdata = test_seqs)

# Generate sequences from the fitted tree
generate_sequences(pruned, n = 100, length = 20)
simulate(pruned, nsim = 100, length = 20)     # S3 alias

# --- Cross-validated tuning ----------------------------------------
tg <- tune_tree(seqs, max_depth = 1:4, min_count = c(3, 5),
                smoothing = c("floor", "kneser_ney"),
                prune = c(FALSE, TRUE), folds = 5)
attr(tg, "best")

# --- Two-tree comparison with a permutation test -------------------
tree_a <- context_tree(group_a)
tree_b <- context_tree(group_b)
compare_trees(tree_a, tree_b, iter = 200)

# --- Bootstrap pathway reliability ---------------------------------
boot <- bootstrap_pathways(pruned, iter = 1000, stat = "count")
summary(boot)        # tidy per-pathway stability + informativeness table
plot(boot)

# --- Tree introspection --------------------------------------------
query_pathway(pruned, "A -> B")             # full predicted distribution
pathway_exists(pruned, c("A","B","C"))
sub <- subtree(pruned, "A")                 # restrict to descendants of "A"

# --- Impute & mine --------------------------------------------------
impute_sequences(pruned, gappy_seqs)                  # fill internal NA gaps
mine_contexts(pruned, state = "B", min_prob = 0.6)    # contexts where B is likely
mine_sequences(pruned, test_seqs, which = "surprising")  # worst-fit sequences

# --- Visualisation --------------------------------------------------
plot(pruned)                                # horizontal tree (default)
plot(pruned, style = "dendrogram")          # pure-ggplot dendrogram
plot(pruned, style = "icicle")              # partition (sunburst) diagram
plot(pruned, style = "interactive")         # interactive HTML widget
plot_pathways(pruned)                       # next-move probability heatmap
plot_divergence(pruned)                     # per-context KL lollipop
plot_distributions(pruned)                  # per-context next-state bars
plot_pruning(pruned, "A -> B -> C")         # suffix-chain memory view
plot_trajectories(pruned, "frequency")      # forward trajectory tree
plot_predictive(pruned, test_seqs)          # held-out confidence diagnostics

Bundled datasets

Four datasets ship with the package for examples and tests:

Dataset Shape Notes
trajectories 138 × 15 wide character matrix engagement states over time
group_regulation_long long event log, POSIXct time regulation-of-learning events
ai_long long event log, Unix time + session id AI-prompting moves
engagement state-sequence (stslist) object weekly engagement sequences

Pathway notation

Pathways are reported in arrow notation (A -> B -> C). The leftmost state is the oldest; the next-state prediction is conditional on the trajectory ending at the rightmost state. The root context (the marginal next-state distribution) is shown as (start).

Bootstrap interpretation

bootstrap_pathways() reports stability and informativeness separately. p_stability is the bootstrap-estimated probability that a pathway statistic falls outside the chosen consistency band under sequence-level resampling. Small values mean the pathway rarely fails the reproducibility criterion. informative_rate is the fraction of resamples where the pathway’s empirical G² statistic exceeds the chi-square reference cutoff against its suffix-parent context.

Read the two flags together:

stable informative Interpretation
TRUE TRUE reproducible and predictively distinctive pathway
TRUE FALSE reproducible pathway, but not predictively distinctive
FALSE TRUE sharp/divergent pathway carried by an unstable subset of sequences
FALSE FALSE weak or sample-fragile pathway

References

Begleiter, R., El-Yaniv, R., Yona, G. (2004). On prediction using variable-order Markov models. Journal of Artificial Intelligence Research, 22, 385–421.

Ron, D., Singer, Y., Tishby, N. (1996). The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J. (1995). The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41, 653–664.

License

GPL-3.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.