README.md now carries the standard status badges
(CRAN version, R-CMD-check, Codecov test coverage, lifecycle, and
license).
The README “Included datasets” table lists the
rj_map dataset (added in 0.2.2) alongside
london_boroughs_map and chicago_map in the
polygon-boundaries row, and the “Visualization” section now points to
the bundled sf boundary datasets (rj_map,
london_boroughs_map, chicago_map) instead of
the removed geobr download for the Brazil example.
example_brazil_rj.R and the introduction
vignette now map clusters using the bundled rj_map dataset
(added in 0.2.2) instead of downloading boundaries with
geobr, so the Rio de Janeiro material runs without
geobr/arrow, consistent with the Chicago and
London examples. All example scripts use the data-frame/column scan
interface.
rj_map dataset: sf polygon
boundaries for the 92 municipalities of Rio de Janeiro state (IBGE
Malhas Municipais), with an ibge_code join key to
rj_mortality. The Rio de Janeiro example now maps clusters
with data(rj_map), mirroring chicago_map and
london_boroughs_map, so it no longer requires an external
polygon download (previously
geobr::read_municipality()).
Column arguments now also accept a variable that holds the column name, e.g.
pop <- "live_births"; treespatial_scan(data, population = pop, ...).
Previously only a bare name (population) or a literal
string ("live_births") worked; a variable holding the name
was mis-resolved as a length-1 value, which failed with a confusing
“must have the same length as ‘cases’” error. This makes programmatic
use (looping over datasets/denominators) work as expected. Bare names,
literal strings, and expressions over columns continue to work
unchanged.
The scan functions no longer take parallel vectors. Each now takes a
data data.frame as its first argument and
refers to its columns by (unquoted) name. This makes calls shorter,
pipe-friendly (data |> treespatial_scan(...)), and
removes the repeated df$column boilerplate.
treespatial_scan(), circular_scan(),
sequential_scan() and aggregate_tree() gained
a leading data argument; their cases,
population, region_id, x,
y and node_id arguments now name columns of
data rather than being vectors. Column arguments accept an
unquoted name (cases), a string ("cases"), or
an expression on columns (raw_count * weight).
tree_scan() is now keyed by node_id: it
takes
tree_scan(data, cases, node_id, tree, population = NULL)
with one row per leaf. Rows are matched to the tree by
node_id (so they no longer need to be pre-ordered to the
tree’s leaf order), and counts are summed within leaf.
sequential_scan() tree-only mode now also requires a
node_id column (consistent with tree_scan());
previously it relied on the row order.
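The node_id keying can be pictured with a small base-R sketch. The leaf vector, the data frame, and all variable names below are made up for illustration, and this is not the package's internal code; it only shows rows arriving in arbitrary order, being summed within leaf, and then being aligned to a tree's leaf order.

```r
# Hypothetical leaf order of a tree (assumption for illustration).
leaves <- c("A01", "A02", "B01")

# Rows are NOT in leaf order, and leaf "B01" appears twice.
d <- data.frame(
  node_id = c("B01", "A01", "B01", "A02"),
  cases   = c(3, 1, 2, 5),
  stringsAsFactors = FALSE
)

# Sum counts within each node_id; rowsum() returns a one-column matrix
# whose row names are the distinct node_id values.
sums <- rowsum(d$cases, d$node_id)

# Align the sums to the tree's leaf order; leaves absent from the data get 0.
leaf_cases <- sums[match(leaves, rownames(sums)), 1]
leaf_cases[is.na(leaf_cases)] <- 0
names(leaf_cases) <- leaves

leaf_cases
```

Because the alignment goes through match() on node_id, reordering the rows of d leaves leaf_cases unchanged.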
Migration. Wrap your vectors in a
data.frame and pass column names:
# before (<= 0.1.x)
treespatial_scan(cases = d$cases, population = d$population,
                 region_id = d$region_id, x = d$x, y = d$y,
                 node_id = d$node_id, tree = tree)
# now (>= 0.2.0)
treespatial_scan(d, cases, population, region_id, x, y, node_id,
                 tree = tree)
The tree argument is unchanged (still a separate
node_id/parent_id data.frame, or the
tree_node_id/tree_parent_id vectors). The
returned objects, their classes, and all
print/summary/filter_clusters()/
get_cluster_regions() behaviour are unchanged.
New internal helper .resolve_col() performs the
(base-R, dependency-free) non-standard evaluation that maps column
arguments to vectors. The C++ Monte Carlo core and the statistical
results are unchanged.
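As a rough illustration of this kind of base-R non-standard evaluation, the sketch below resolves a column argument given as a bare name, a string, a variable holding the name, or an expression over columns. The functions resolve_col() and scan_stub() are hypothetical stand-ins written for this note; the actual .resolve_col() may be implemented differently.

```r
# Hypothetical stand-in for an internal column resolver (not the package's code).
resolve_col <- function(data, arg_expr, env = parent.frame()) {
  # Case 1: a bare symbol that names a column of `data`.
  if (is.symbol(arg_expr) && as.character(arg_expr) %in% names(data)) {
    return(data[[as.character(arg_expr)]])
  }
  # Otherwise evaluate with the columns visible, falling back to the caller's
  # environment for anything that is not a column.
  val <- eval(arg_expr, envir = data, enclos = env)
  # Case 2: a single string naming a column, either a literal or the value
  # of a variable in the caller's environment.
  if (is.character(val) && length(val) == 1L && val %in% names(data)) {
    return(data[[val]])
  }
  # Case 3: an expression over columns, already evaluated to a vector.
  val
}

# Hypothetical wrapper showing how a scan function could capture its argument.
scan_stub <- function(data, cases) {
  resolve_col(data, substitute(cases), parent.frame())
}

d <- data.frame(raw_count = c(1, 2, 3), weight = c(2, 2, 2), cases = c(2, 4, 6))
nm <- "cases"

by_name     <- scan_stub(d, cases)               # bare name
by_string   <- scan_stub(d, "cases")             # literal string
by_variable <- scan_stub(d, nm)                  # variable holding the name
by_expr     <- scan_stub(d, raw_count * weight)  # expression over columns
```

All four calls return the same vector here, which is the behaviour a loop over datasets or denominators relies on.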
Bundled example scripts (inst/examples/) and
vignettes were updated to the new interface.
.Rd files that were previously maintained by hand (the
three map/raw datasets chicago_map,
london_boroughs_map, fl_deaths,
filter_clusters(), and the eight
print/summary methods) have been consolidated
into the roxygen blocks in R/data.R,
R/filter_clusters.R, and R/print.R, so
devtools::document() no longer skips them. Content from the
hand-written pages (the st_simplify note and merge tip for
london_boroughs_map, the Crown-copyright attribution, and
the fuller fl_deaths examples) was preserved. No
user-visible change to the rendered documentation.
n_cores
The three Monte Carlo routines (mc_treespatial_cpp,
mc_spatial_cpp, mc_treescan_cpp) now use a
single native C++ implementation for every
n_cores >= 1. Previously
n_cores = 1 took a separate code path that used R’s
rmultinom() over NumericMatrix objects, while
n_cores > 1 used a native std::mt19937
sampler over flat arrays.
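One common way to make a Monte Carlo result independent of how the work is split is to give every simulation its own seed, derived once from the master seed. The plain-R sketch below illustrates that idea only; simulate_null() and its arguments are hypothetical, the statistic is a toy, and the package itself does this in C++.

```r
# Toy Monte Carlo with one deterministic seed per simulation (illustration only).
simulate_null <- function(nsim, seed, order = seq_len(nsim)) {
  set.seed(seed)
  # Draw all per-simulation seeds up front from R's RNG.
  sim_seeds <- sample.int(.Machine$integer.max, nsim)
  stat <- numeric(nsim)
  # `order` mimics simulations being executed in a different sequence,
  # as would happen when they are spread over several workers.
  for (i in order) {
    set.seed(sim_seeds[i])
    counts <- rmultinom(1, size = 50, prob = rep(1 / 5, 5))
    stat[i] <- max(counts)  # toy statistic standing in for the scan statistic
  }
  stat
}

a <- simulate_null(20, seed = 1)
b <- simulate_null(20, seed = 1, order = rev(seq_len(20)))
same_null <- identical(a, b)
```

Since simulation i always uses seed i, the simulated values do not depend on execution order, so only the run time changes with the number of workers.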
n_cores.
Each simulation draws from a deterministic per-simulation seed (from R’s
RNG when seed is set), so the simulated null distribution
and the resulting p-value are identical for any thread
count given a fixed seed. n_cores changes only
wall-clock time.
p-values
at n_cores = 1 are no longer bit-identical to the
pre-0.1.50 serial path (which used R’s rmultinom). Observed
statistics, most-likely clusters, and secondary-cluster extraction are
unaffected. Fix your seed to reproduce results.
aggregate_up() and max_llr_all_pairs().
example_chicago.R now uses the compositional
population denominator (total incidents per area) rather
than pop_residential. With the residential denominator the
most likely cluster is a broad-spectrum spatial hotspot reported at the
tree root; the compositional denominator asks which (crime category,
area) combinations are over-represented and returns a specific branch,
which is the tree-spatial use the method is designed for.
Small adjustments to the DESCRIPTION file
Small adjustments to the vignettes
The package now ships two vignettes:
vignette("introduction", package = "treeSS") — Rio
de Janeiro end-to-end, reproducing Section 5.2 of Cançado et al. (2025).
This was the previous introduction vignette, trimmed to RJ
only.
vignette("florida", package = "treeSS")
(new) — a pedagogical walk-through of building the tree-spatial
scan inputs from raw data using the bundled fl_deaths
dataset: building the ICD-10 tree from the codes that actually appear in
the data, downloading county polygons + centroids from
tigris, and assembling the parallel-vector input contract
that treespatial_scan() expects.
The Chicago and London datasets, previously discussed inline in the
introduction vignette, are now reserved for the companion
software paper.
The three bundled plotting examples for sequential_scan()
(example_brazil_rj.R, example_chicago.R,
example_florida.R) previously did a left join from the full
map polygon set onto the cluster table. When the shapefile contained
polygons not present in the analysis dataset (3 RJ municipalities
missing from the DATASUS/IBGE 89-municipality subset, for instance),
those polygons emerged with panel = NA, which
facet_wrap rendered as an extra empty panel labelled
“NA”.
The examples now cross-join the polygon set with the panel labels
first and then left-join the cluster information by
(id, panel), so every map polygon is drawn in every
iteration panel — those that fall outside the analysis dataset get the
na.value colour (a light grey), exactly as intended. No
extra “NA” panel is produced.
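The difference between the two joins can be reproduced with toy data in base R. The data frames below are invented for illustration, and merge() stands in for whatever join functions the bundled examples actually use.

```r
# Three map polygons; polygon "c" is absent from the analysis dataset.
polys <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)

# Cluster table: one detected region per iteration panel.
clusters <- data.frame(
  id         = c("a", "b"),
  panel      = c("Iteration 1", "Iteration 2"),
  in_cluster = TRUE,
  stringsAsFactors = FALSE
)

# Old approach: plain left join. Polygon "c" ends up with panel = NA,
# which a faceted plot would show as an extra "NA" panel.
old <- merge(polys, clusters, by = "id", all.x = TRUE)

# New approach: cross-join polygons with the panel labels first
# (merge() with by = NULL gives the Cartesian product) ...
grid <- merge(polys, data.frame(panel = unique(clusters$panel)), by = NULL)

# ... then left-join the cluster information by (id, panel).
new <- merge(grid, clusters, by = c("id", "panel"), all.x = TRUE)
```

In new, every polygon appears once per panel and panel is never NA; rows without a match only have in_cluster = NA, which a fill scale can draw with its na.value colour.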
The london example uses leaflet rather than
facet_wrap and was not affected.
multicluster_scan()
multicluster_scan() (added in 0.1.45 as an adaptation of
Li, Wang, Yang, Li and Lai 2011 to the tree-spatial setting) has been
removed. The function is gone, along with its C++ backend
(mc_multicluster_treespatial_cpp,
mc_multicluster_spatial_cpp), the
get_cluster_regions.multicluster_scan S3 method, the
corresponding print / summary methods, all
examples, and the vignette subsection.
Rationale:
On real datasets with a concentrated signal (e.g. infant
mortality in Rio de Janeiro: 622 tree nodes, 5358 zones), the top-K
candidate pool was dominated by overlapping variants of a single
geographic neighbourhood, so the fast top-K disjoint-pair search could
not find a valid pair. The full-pool rescue path was too slow to be
practical (timing out on nsim = 999 with 4 cores).
The factorisation of the joint LLR used by Li et al. (2011) is exact under the Poisson model for circular scans; its extension to the tree-spatial setting was not formally established.
filter_clusters() (Cançado et al. 2025) and
sequential_scan() (Zhang, Assunção and Kulldorff 2010)
together already cover the practical secondary-cluster use cases with
published, well-studied statistical properties.
Users who want joint-cluster detection in the circular case can use the original implementation from Li et al. (2011) outside this package.
The package now offers two clearly-bounded approaches:
filter_clusters() — paper-faithful non-overlap
criterion of Cançado et al. (2025), Sec. 5.1.1, applied to the
single-pass candidate pool.
sequential_scan() — sequential adjustment of Zhang,
Assunção and Kulldorff (2010): detect MLC, remove its regions (with
optional buffer of nearest neighbours), re-run the scan on the reduced
data with a fresh Monte Carlo simulation; iterate until the current MLC
is no longer significant. Each iteration’s p-value is correct under the
conditional argument in the paper, so no multiple-testing correction is
required.
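The control flow of the sequential procedure (detect, remove, re-scan, stop when no longer significant) can be sketched as below. This is a toy under stated assumptions: toy_scan() is a made-up single-region "scan" with a Monte Carlo p-value, not the package's likelihood-ratio scan, the names sequential_sketch() and toy_scan() are hypothetical, and the optional neighbour buffer is omitted.

```r
# Toy stand-in for a scan: picks the single region with the highest rate and
# attaches a Monte Carlo p-value under a multinomial null (illustration only).
toy_scan <- function(data, nsim = 199L) {
  rate <- data$cases / data$population
  best <- which.max(rate)
  prob <- data$population / sum(data$population)
  sims <- replicate(nsim, {
    x <- rmultinom(1, sum(data$cases), prob)[, 1]
    max(x / data$population)
  })
  list(region_ids = data$region_id[best],
       p_value    = (1 + sum(sims >= rate[best])) / (nsim + 1))
}

# Sequential loop: each iteration re-runs the scan, with a fresh Monte Carlo
# simulation, on the data left after removing the previous cluster's regions.
sequential_sketch <- function(data, scan_fun, alpha = 0.05, max_iter = 5L) {
  found <- list()
  for (k in seq_len(max_iter)) {
    if (nrow(data) == 0L) break
    res <- scan_fun(data)
    if (res$p_value > alpha) break          # current MLC not significant: stop
    found[[length(found) + 1L]] <- res
    data <- data[!data$region_id %in% res$region_ids, , drop = FALSE]
  }
  found
}

set.seed(1)
d <- data.frame(region_id = paste0("r", 1:4), cases = c(40, 1, 1, 1),
                population = rep(100, 4), stringsAsFactors = FALSE)
found <- sequential_sketch(d, toy_scan)
```

With this toy input the first iteration flags r1, and the second iteration stops because the remaining regions are indistinguishable from the null.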
Replaced the ad-hoc Holm-Bonferroni iterative_scan()
with two methods drawn directly from the published literature on
multi-cluster spatial scan statistics, adapted to the tree-spatial
setting. The package now offers three approaches to secondary-cluster
detection, with the choice driven by which type of shadowing the user
wants to remove:
filter_clusters() (unchanged) – the original
non-overlap criterion of Cançado et al. (2025) Sec. 5.1.1, applied to
the single-pass candidate pool.
sequential_scan() (new) – the sequential adjustment
of Zhang, Assunção and Kulldorff (2010), adapted to tree-spatial /
circular / tree-only inputs. Detects the MLC, removes its regions (and
an optional buffer_size of nearest neighbours) from the
dataset, and re-runs the scan on the reduced data with a fresh Monte
Carlo simulation. Iterates until the MLC of the current reduced data is
no longer significant or max_iter is reached. Each
iteration’s p-value is correct under the conditional argument of Section
3 of the paper – no post-hoc multiple-testing correction is applied or
required.
multicluster_scan() (new) – the two-cluster joint
statistic of Li, Wang, Yang, Li and Lai (2011), adapted to tree-spatial
and circular scans. Builds the alternative as a joint presence of two
region-disjoint clusters; the joint LLR factorises into the sum of the
two single-cluster LLRs under Poisson, so the observed maximum is found
by sweeping the candidate pool. The Monte Carlo for the joint statistic
runs in C++ (new exports mc_multicluster_treespatial_cpp
and mc_multicluster_spatial_cpp) with the same OpenMP
backend as the other scans, so performance is on par with
treespatial_scan(). The decision rule of Table 2 of the
paper is applied: 0, 1, or 2 significant clusters are reported based on
the joint p-value and a re-evaluation of the weaker cluster on the
reduced dataset.
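The candidate-pool sweep described for the joint statistic amounts to finding the region-disjoint pair with the largest sum of single-cluster LLRs. A brute-force sketch follows; best_disjoint_pair() is a hypothetical name, the input values are invented, and the package's C++ backend is not reproduced here.

```r
# Exhaustive search for the region-disjoint pair maximising llr[i] + llr[j].
# `llr` is a numeric vector of single-cluster LLRs; `regions` is a list of
# region-id vectors, one per candidate (both assumed inputs).
best_disjoint_pair <- function(llr, regions) {
  stopifnot(length(llr) == length(regions), length(llr) >= 2L)
  best <- list(pair = c(NA_integer_, NA_integer_), joint_llr = -Inf)
  n <- length(llr)
  for (i in seq_len(n - 1L)) {
    for (j in (i + 1L):n) {
      disjoint <- !any(regions[[i]] %in% regions[[j]])
      joint <- llr[i] + llr[j]
      if (disjoint && joint > best$joint_llr) {
        best <- list(pair = c(i, j), joint_llr = joint)
      }
    }
  }
  best
}

# Candidates 1 and 2 share region 2, so the best valid pair is (1, 3).
llr     <- c(10, 9, 4)
regions <- list(c(1, 2), c(2, 3), 4)
res <- best_disjoint_pair(llr, regions)
```

The search is quadratic in the pool size, which is why restricting it to a top-K subset is tempting and why heavily overlapping candidates can leave such a subset without any valid pair.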
iterative_scan() and its
print/summary/get_cluster_regions methods have been
removed. The Holm-Bonferroni “scan + zero cases + re-scan” procedure is
not part of the published methods we wanted to offer; the sequential and
multi-cluster scans above cover the intended use cases and are grounded
in the literature.
Internal helper .matrix_to_vectors() (previously
used only by iterative_scan) has been removed.
print.sequential_scan(), summary.sequential_scan()
print.multicluster_scan(), summary.multicluster_scan()
get_cluster_regions.sequential_scan(),
get_cluster_regions.multicluster_scan()
filter_clusters(), treespatial_scan(), and
circular_scan() cross-reference the new methods in
@seealso.
inst/examples/
(Brazil/RJ, Chicago, Florida, London) use sequential_scan()
in place of the removed iterative_scan() block.
tests/testthat/test-sequential-scan.R covering
structure, the max_iter stopping rule, the buffer
mechanism, behaviour under H0, and printing.
tests/testthat/test-multicluster-scan.R covering
structure, the stronger-versus-weaker ordering, region disjointness of
the returned pair, the significance decision rule, and printing.
tests/testthat/test-get-cluster-regions.R and
tests/testthat/test-binomial.R updated to drop their
references to iterative_scan().
Address the four items requested in the first-round CRAN review.
Single-quote software/API names per the CRAN cookbook:
OpenMP is now written as 'OpenMP' in the
package description. Reference: https://contributor.r-project.org/cran-cookbook/description_issues.html#formatting-software-names
Add DOI links to the two references that were previously cited
without a link, using the CRAN-mandated
authors (year) <doi:...> form (no space after
doi:, no space inside the angle brackets):
Add \value tags (and the corresponding
@return roxygen blocks) to the seven
print()/summary() method Rd files flagged by
CRAN. Each documents that the method invisibly returns its input object
unchanged and is called for its printing side effect, with a description
of the fields written to the console (and, for summary()
methods, the additional fields beyond those of the matching
print() method):
print.circular_scan.Rd
print.iterative_scan.Rd
print.tree_scan.Rd
print.treespatial_scan.Rd
summary.circular_scan.Rd
summary.tree_scan.Rd
summary.treespatial_scan.Rd
Reference: https://contributor.r-project.org/cran-cookbook/docs_issues.html#missing-value-tags-in-.rd-files
generate_example_data() no longer sets a hardcoded seed
within the function: the default of the seed argument is
now NULL (previously 123L). When the user does
not pass a seed, the function draws from the user’s session-level RNG
state without modifying it; when the user passes an explicit integer,
the existing save-and-restore logic (introduced in 0.1.43) still
applies. The \usage{} block and the
\item{seed}{...} description of the corresponding Rd file
have been updated to match. The roxygen example
(ex <- generate_example_data(seed = 42)) is unchanged:
it passes an explicit seed and so remains reproducible. Reference: https://contributor.r-project.org/cran-cookbook/code_issues.html#setting-a-specific-seed
Testing a clean R CMD check --as-cran.
Added \source{} blocks to all three tree datasets,
pointing at the corresponding leaf-level dataset and at the
data-raw/ build script in the GitHub repo.
get_cluster_regions(). Added @examples block.
@examples block to the roxygen comments.
Functions that take a seed = ... argument no
longer silently overwrite the user’s session-level RNG state.
Previously, calling treespatial_scan(..., seed = 42) after
a set.seed(2026) in the user’s session would leave the RNG
in a state determined by the internal Monte Carlo loop, so any
subsequent runif(), sample(), etc. was no
longer reproducible from the user’s set.seed(2026). Now the
user’s pre-existing RNG state is saved on entry and restored on exit
(whether the function returns normally or via an error), so the
seed argument affects only the result of the call.
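The save-on-entry, restore-on-exit behaviour can be sketched in base R with on.exit(). The function with_local_seed() below is a hypothetical illustration written for this note; it is not the package's .seed_save_and_set() / .seed_restore() pair.

```r
# Run `fun` under a fixed seed without disturbing the caller's RNG stream.
with_local_seed <- function(seed, fun) {
  # Save the session-level RNG state, if one exists yet.
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_state <- if (had_seed) get(".Random.seed", envir = globalenv()) else NULL

  # Restore on exit, whether `fun` returns normally or signals an error.
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)

  set.seed(seed)
  fun()
}

# The user's stream after set.seed(2026) is unaffected by the seeded call.
set.seed(2026)
expected <- runif(3)

set.seed(2026)
inner <- with_local_seed(42, function() runif(2))
after <- runif(3)
```

Here after equals expected, so draws made by the user after the seeded call are still reproducible from their own set.seed(2026).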
Implementation is in two new internal helpers
.seed_save_and_set() and .seed_restore() in
R/utils.R.
print.iterative_scan() now accepts
max_show for API consistency with the other three print
methods. The default behavior is unchanged (the table is printed without
the region_ids and leaf_ids columns to keep it
compact); pass max_show = -1L to include both columns.
cran-comments.md file.
remotes::install_github("allanvc/treeSS").
summary() methods for circular_scan,
tree_scan, and treespatial_scan now have
proper roxygen descriptions and explicitly document that the
max_show argument added in 0.1.39 is forwarded to the
corresponding print() method. Each summary doc points
to the matching print doc for the full details.
The print methods now truncate long Leaf IDs and
Regions lists by default, in the style of
tibble. The motivation is the Chicago example: the most
likely cluster turns out to be the root of the FBI
crime taxonomy (1900+ leaves), which under the previous policy printed
every single leaf, producing more than 10 pages of console output in the
rendered PDF.
New argument max_show on
print.treespatial_scan(), print.tree_scan()
and print.circular_scan(). Default is 10L.
When a vector field exceeds this length, only the first
max_show values are shown and a tail of
... and N more is appended. Pass
max_show = -1L (or any value at least as large as the
field) to recover the previous full-output behavior.
The internal .cat_wrapped() helper gained the same
max_show argument (default 10L) and propagates
it through the print methods.
No changes to the underlying scan results: only the console / PDF
rendering of the result objects is affected. The full leaf and region
IDs are always available on
result$most_likely_cluster$leaf_ids and
result$most_likely_cluster$region_ids for programmatic
use.
The choice of default mirrors tibble’s behavior: enough
to give the reader a sense of the cluster contents, but not so much that
a single print() call dominates the document.
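The truncation rule can be sketched as a small formatter. format_ids() is a hypothetical helper written for this note; the exact separators and wording printed by the package's .cat_wrapped() may differ.

```r
# Show at most `max_show` values, then a "... and N more" tail.
# A negative max_show, or one at least as large as the field, prints everything.
format_ids <- function(ids, max_show = 10L) {
  n <- length(ids)
  if (max_show < 0L || n <= max_show) {
    return(paste(ids, collapse = ", "))
  }
  shown <- paste(ids[seq_len(max_show)], collapse = ", ")
  paste0(shown, ", ... and ", n - max_show, " more")
}

short <- format_ids(1:3)         # below the limit: printed in full
long  <- format_ids(1:13)        # above the limit: truncated with a tail
full  <- format_ids(1:13, -1L)   # truncation disabled
```

Only the displayed text is shortened in this sketch; the input vector is left untouched, matching the note that the full IDs stay available on the result object.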
treespatial_scan() for combined spatial and
hierarchical cluster detection.
circular_scan() for Kulldorff’s circular
spatial scan statistic.
tree_scan() for the tree-based scan
statistic.
build_zones(),
aggregate_tree(), filter_clusters().
print() and summary() methods for all
scan result classes.