| Title: | Sample Provenance Quality Resolver in Proteomics |
| Version: | 0.1.0 |
| Description: | Detect sample-provenance inconsistencies and potential mix-ups in mass-spectrometry-based plasma-proteome cohorts. Provides a clustering-based approach (build a nearest-neighbour graph in a dimensionality-reduced space and iteratively split large components by edge weight), a threshold-based approach (classify sample pairs as belonging or not-belonging from a pairwise distance cutoff), parameter optimization over distance metrics and cutoffs, and a pairwise random-forest classifier for protein importance ranking. This is a native R port of the author's Python package 'spqrp' (https://github.com/fhradilak/spqrp), implementing methods from an associated manuscript currently in preparation. |
| License: | GPL-3 |
| URL: | https://github.com/fhradilak/spqrp_r |
| BugReports: | https://github.com/fhradilak/spqrp_r/issues |
| Encoding: | UTF-8 |
| LazyData: | true |
| Depends: | R (≥ 4.5.0) |
| Imports: | cli, dplyr (≥ 1.1.0), ggplot2 (≥ 3.5.0), igraph (≥ 2.0.0), lgr, pROC, randomForest, ranger (≥ 0.16.0), rlang, solitude (≥ 1.1.3), stats, tibble, tidyr (≥ 1.3.0), utils, withr |
| Suggests: | knitr, plotly (≥ 4.10.0), recipes, rmarkdown, smacof, testthat (≥ 3.0.0), themis, uwot (≥ 0.2.0), vdiffr |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-09 14:02:19 UTC; franziska |
| Author: | Franziska Hradilak [aut, cre] |
| Maintainer: | Franziska Hradilak <Franziska.Hradilak@student.hpi.uni-potsdam.de> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-17 13:20:02 UTC |
spqrp: Sample Provenance Quality Resolver in Proteomics
Description
Detects sample-provenance inconsistencies in MS-based plasma-proteome cohorts via pairwise distance, threshold-based classification, iterative clustering, and a pairwise random-forest classifier for protein importance ranking. Native R port of the Python package of the same name.
Author(s)
Maintainer: Franziska Hradilak Franziska.Hradilak@student.hpi.uni-potsdam.de
Authors:
Franziska Hradilak Franziska.Hradilak@student.hpi.uni-potsdam.de
See Also
Useful links:
https://github.com/fhradilak/spqrp_r
Report bugs at https://github.com/fhradilak/spqrp_r/issues
Isolation Forest outlier detection
Description
Pivots to a (sample x protein) matrix, runs an Isolation Forest via the 'solitude' package (a pure-R port of the same Liu et al. 2008 algorithm that scikit-learn's 'IsolationForest' uses), and returns the data frame with outlier rows removed plus a tibble of per-sample anomaly scores.
Usage
by_isolation_forest(
protein_df,
peptide_df = NULL,
n_estimators = 100L,
impute_zero = FALSE,
impute_median = FALSE,
outlier_threshold = 0.6,
contamination = "auto",
quiet = TRUE
)
Arguments
protein_df |
Long-format intensity data frame. |
peptide_df |
Optional peptide-level data frame; subset alongside 'protein_df' using the same outlier list. |
n_estimators |
Number of trees. |
impute_zero |
Replace NA intensities with 0 before fitting. |
impute_median |
Replace NA intensities with column-wise median. |
outlier_threshold |
Used when 'contamination = "auto"'. Anomaly score above which a sample is flagged. Default '0.6' (calibrated for solitude's score scale; on sklearn's scale this would be '0.5'). |
contamination |
Either '"auto"' (default; use 'outlier_threshold') or a numeric in '[0, 1]' specifying the fraction of the data to flag as outliers (top-by-score). Mirrors sklearn's 'IsolationForest(contamination=...)' API. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Details
Two ways to decide which samples are outliers, mirroring sklearn's 'IsolationForest' API:
* 'contamination = "auto"' (default) – flag every sample whose anomaly score exceeds 'outlier_threshold'.
* 'contamination' set to a numeric in '[0, 1]' – flag exactly the top 'contamination * 100' percent of samples by anomaly score, without consulting 'outlier_threshold'. Mirrors sklearn's 'IsolationForest(contamination = 0.1)'.
On the **sklearn score scale**, 'contamination = "auto"' corresponds to a threshold of '0.5'. solitude's scores, however, are systematically shifted upward because 'solitude' (via 'ranger') uses 'mtry = ncol - 1' and 'extratrees' split bounds drawn from the full dataset rather than from the per-tree subsample. As a result, inlier scores typically sit between '0.55' and '0.60' even on clean data, so the sklearn-calibrated '0.5' cutoff would flag everything. The default 'outlier_threshold = 0.6' is calibrated empirically for solitude's distribution and reproduces sklearn's "few-to-zero outliers on clean data" behaviour. Lower it (e.g. '0.55') for more aggressive flagging, or use 'contamination' for a percentile-based rule.
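The two rules can be illustrated on a made-up score vector. This is a minimal base-R sketch, not the package's code: 'flag_outliers' is a hypothetical helper, the scores are invented, and rounding the flagged count down is an assumption.

```r
# Hypothetical helper illustrating the two flagging rules; not part of spqrp.
flag_outliers <- function(scores, contamination = "auto", outlier_threshold = 0.6) {
  if (identical(contamination, "auto")) {
    # Absolute rule: flag every score above the threshold.
    return(scores > outlier_threshold)
  }
  # Fraction rule: flag the top `contamination` share of samples by score.
  # Rounding down is an assumption made for this sketch.
  n_flag <- floor(contamination * length(scores))
  flagged <- rep(FALSE, length(scores))
  if (n_flag > 0) {
    flagged[order(scores, decreasing = TRUE)[seq_len(n_flag)]] <- TRUE
  }
  flagged
}

# Invented anomaly scores on solitude's scale.
scores <- c(s1 = 0.56, s2 = 0.58, s3 = 0.71, s4 = 0.57, s5 = 0.62)
auto_flags <- flag_outliers(scores)                       # threshold rule
frac_flags <- flag_outliers(scores, contamination = 0.2)  # top-20-percent rule
names(scores)[auto_flags]
names(scores)[frac_flags]
```

With the threshold rule the number of flagged samples depends on the score distribution; with the fraction rule it depends only on the cohort size.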
Value
Invisibly returns a named list with 'protein_df', 'peptide_df', 'outlier_list', 'anomaly_df', and possibly 'messages' on failure. 'invisible()' keeps the REPL silent on unassigned calls; assign the result to a name and inspect with 'result$protein_df' etc.
Examples
df <- spqrp_example_data("input_cohort_df")
res <- by_isolation_forest(df, impute_median = TRUE)
res$outlier_list
Plot per-sample anomaly scores from the isolation forest
Description
Plot per-sample anomaly scores from the isolation forest
Usage
by_isolation_forest_plot(output_anomaly_df, title = "")
Arguments
output_anomaly_df |
Tibble returned in 'anomaly_df' from [by_isolation_forest()]. |
title |
Plot title. |
Value
A 'plotly' figure (printed when invoked at top level).
Pairwise distances on the top-n ranked proteins
Description
Sub-procedure used by [perform_distance_evaluation_on_ranked_proteins()] and [run_clustering()]. Selects the top-'n' proteins from 'top_importance' (after optionally dropping 'remove_list' and after restricting the ranking to proteins actually present in 'df'), pivots 'df' to a wide matrix, and computes pairwise distances.
Usage
calculate_pairwise_distances(
top_importance,
n,
df,
metric = "correlation",
fractional_p = 0.5,
remove_list = NULL,
number_display_neighbours = 1L,
quiet = TRUE
)
Arguments
top_importance |
Data frame with 'Protein' and 'Importance' columns. |
n |
Number of top-ranked proteins to keep. |
df |
Long-format cohort data frame. |
metric |
See [get_distances()]. |
fractional_p |
Fractional/Minkowski exponent. |
remove_list |
Optional character vector of proteins to drop. |
number_display_neighbours |
Number of nearest neighbours to return. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Details
The restriction to proteins present in 'df' happens **before** the top-n cut, so a ranking whose highest-importance entries are absent from 'df' still yields 'n' usable proteins (the next-best ones). This matches [optimize_parameters()]'s behaviour and avoids producing near-empty distance matrices when the ranking and 'df' were built from different protein universes.
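The effect of the ordering can be seen on a toy ranking. This base-R sketch uses invented protein names and does not call the package; it only contrasts the two orders of operation.

```r
# Toy ranking, already sorted from most to least important (invented values).
ranking <- data.frame(
  Protein = c("P1", "P2", "P3", "P4", "P5"),
  Importance = c(0.40, 0.25, 0.15, 0.12, 0.08)
)
proteins_in_df <- c("P2", "P4", "P5")  # P1 and P3 were not measured
n <- 2

# Cut first, filter second: only one usable protein survives.
cut_then_filter <- intersect(head(ranking$Protein, n), proteins_in_df)

# Filter first, cut second (the order described above): n proteins survive.
present <- ranking[ranking$Protein %in% proteins_in_df, ]
filter_then_cut <- head(present$Protein, n)

cut_then_filter
filter_then_cut
```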
Value
List with 'sample_order', 'distance_matrix', 'df_dist', and 'nearest_neighbours'.
Validate required columns of a cohort and (optionally) a ranking
Description
Throws an informative error when required columns are missing.
Usage
check_input_data_format(df, importance_ranking = NULL)
Arguments
df |
Cohort data frame; must contain 'Sample_ID', 'Patient_ID', 'Protein', 'Intensity'. |
importance_ranking |
Optional ranking data frame; must contain 'Protein', 'Importance' if supplied. |
Value
Invisible 'TRUE'.
Examples
df <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")
check_input_data_format(df, ranking)
Iterative-clustering primitive: build a kNN graph + 2D coords
Description
Builds a 2D embedding via PCA/UMAP/MDS, connects each sample to its 'n_neighbors' nearest neighbours in distance space, and optionally splits components larger than 'max_component_size' by repeatedly removing the largest-weight edge.
Usage
cluster_samples_iteratively(
result,
df,
method = "UMAP",
random_state = 42L,
n_neighbors = 1L,
max_component_size = NULL,
n_umap_neighbors = 15L,
precomputed_graph = NULL,
mds_backend = c("cmdscale", "smacof"),
quiet = TRUE
)
Arguments
result |
Output of [calculate_pairwise_distances()] (uses 'distance_matrix'). |
df |
Long-format cohort data frame. |
method |
'"UMAP"', '"PCA"', or '"MDS"'. |
random_state |
Seed for the dimensionality reduction. |
n_neighbors |
Number of nearest-neighbour edges per sample. |
max_component_size |
If non-NULL, iteratively split clusters above this size. |
n_umap_neighbors |
UMAP's 'n_neighbors' parameter. |
precomputed_graph |
Optional precomputed igraph object. |
mds_backend |
'"cmdscale"' (default) or '"smacof"' (Suggests). |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
List with 'G' (igraph) and 'coords_2d' (matrix).
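The graph-building and splitting steps can be sketched in base R. This is an illustration under assumptions, not the package's implementation: it skips the 2D embedding, uses a plain edge list instead of igraph, and the helpers 'components' and 'knn_split' are hypothetical names.

```r
# Connected components of an undirected edge list via label propagation.
components <- function(n_nodes, edges) {
  label <- seq_len(n_nodes)
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      a <- edges$from[i]
      b <- edges$to[i]
      m <- min(label[a], label[b])
      if (label[a] != m || label[b] != m) {
        label[c(a, b)] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  label
}

knn_split <- function(d, n_neighbors = 1, max_component_size = NULL) {
  n <- nrow(d)
  # One edge from each sample to each of its n_neighbors nearest samples.
  edges <- do.call(rbind, lapply(seq_len(n), function(i) {
    nb <- setdiff(order(d[i, ]), i)[seq_len(n_neighbors)]
    data.frame(from = pmin(i, nb), to = pmax(i, nb), weight = unname(d[i, nb]))
  }))
  edges <- unique(edges)  # mutual neighbours produce the same edge twice
  label <- components(n, edges)
  while (!is.null(max_component_size) && nrow(edges) > 0) {
    sizes <- table(label)
    too_big <- as.integer(names(sizes)[sizes > max_component_size])
    if (length(too_big) == 0) break
    in_big <- which(label[edges$from] %in% too_big)
    # Remove the heaviest (largest-distance) edge inside an oversized component.
    edges <- edges[-in_big[which.max(edges$weight[in_big])], ]
    label <- components(n, edges)
  }
  list(edges = edges, component = label)
}

# Five samples on a line: {1, 2, 3} chain together, {4, 5} pair up.
pos <- c(0, 1, 2.5, 10, 11)
d <- as.matrix(dist(pos))
res <- knn_split(d, n_neighbors = 1, max_component_size = 2)
res$component
```

With 'max_component_size = 2' the three-sample chain loses its longest edge (between samples 2 and 3), leaving sample 3 as a singleton.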
Protein-importance ranking for plasma cohort "A"
Description
A pre-computed protein-importance ranking produced by the pairwise balanced random-forest classifier ([train_pairwise_balanced_rand_forest()]) on a real mass-spectrometry plasma-proteome cohort. It serves as the built-in default ranking for [perform_distance_evaluation_on_ranked_proteins()] and [optimize_parameters()] when the caller supplies neither 'top_importance_df' nor 'top_importance_path'.
Usage
cohort_a_ranking
Format
A [tibble][tibble::tibble] with one row per protein and two columns:
- Protein
Character. Protein identifier (UniProt accession with gene suffix, e.g. '"P01861_IGHG4"').
- Importance
Numeric. Random-forest importance score; higher means more discriminative. Rows are ordered from most to least important.
Source
Pairwise balanced random-forest importances computed on plasma cohort "A", derived from mass-spectrometry plasma-proteome measurements.
See Also
[perform_distance_evaluation_on_ranked_proteins()], [optimize_parameters()], [retrieve_ranking()]
Examples
head(cohort_a_ranking)
Keep proteins present in at least a given fraction of samples
Description
Keep proteins present in at least a given fraction of samples
Usage
filter_by_occurrence(df, cutoff = 0.7)
Arguments
df |
Long-format intensity data frame. |
cutoff |
Fraction in '[0, 1]'. A protein is kept when its non-NA intensity covers at least 'cutoff' of the samples in 'df'. |
Value
Filtered tibble in the same shape as 'df'.
Examples
df <- spqrp_example_data("input_cohort_df")
kept <- filter_by_occurrence(df, cutoff = 0.7)
length(unique(kept$Protein))
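The occurrence rule itself can be reproduced on a toy table without the package. This base-R sketch assumes coverage is counted as the number of non-NA intensities per protein divided by the number of distinct samples; the data are invented.

```r
# Toy long-format table: protein A is always measured, protein B only once.
toy <- data.frame(
  Sample_ID = rep(c("s1", "s2", "s3", "s4"), each = 2),
  Protein = rep(c("A", "B"), times = 4),
  Intensity = c(1, NA, 2, NA, 3, 5, 4, NA)
)

# Fraction of samples with a non-NA intensity, per protein.
coverage <- tapply(!is.na(toy$Intensity), toy$Protein, sum) /
  length(unique(toy$Sample_ID))

# Keep proteins whose coverage reaches the cutoff.
kept_toy <- toy[toy$Protein %in% names(coverage)[coverage >= 0.7], ]
unique(kept_toy$Protein)
```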
Pairwise distance matrix on a long-format intensity table
Description
Pivots the long-format 'df_dist' to a (sample x protein) wide matrix (missing values filled with 0) and computes pairwise distances using the requested metric.
Usage
get_distances(
df_dist,
metric = "correlation",
intensity = "Intensity",
fractional_p = 0.5,
index = SAMPLE
)
Arguments
df_dist |
Long-format data frame with 'Sample_ID', 'Protein', and the 'intensity' column. |
metric |
One of '"correlation"', '"euclidean"', '"manhattan"', '"minkowski"', or '"fractional"' (Minkowski with 'p = fractional_p'). |
intensity |
Name of the intensity column. Defaults to '"Intensity"'. |
fractional_p |
Exponent for the fractional / Minkowski metric. |
index |
Column used as the sample identifier. Defaults to '"Sample_ID"'. |
Value
A list with 'distance_matrix' (numeric matrix with row/col names set to sample IDs) and 'df_pivot' (the wide tibble used to compute it).
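The metric choices can be sketched on a small wide matrix. The helper 'pairwise_distance' below is hypothetical and not the package's code; in particular, defining the correlation distance as one minus the Pearson correlation is an assumption borrowed from common practice.

```r
# Hypothetical stand-in for the metric choices; rows of x are samples.
pairwise_distance <- function(x, metric = "correlation", p = 0.5) {
  x[is.na(x)] <- 0  # missing intensities are filled with 0, as described above
  if (metric == "correlation") {
    return(1 - cor(t(x)))  # assumed convention: 1 - Pearson correlation
  }
  # All remaining metrics are Minkowski distances with different exponents.
  p <- switch(metric, euclidean = 2, manhattan = 1, minkowski = p, fractional = p)
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sum(abs(x[i, ] - x[j, ])^p)^(1 / p)
  }
  d
}

x <- rbind(s1 = c(1, 2, 3), s2 = c(2, 4, 6), s3 = c(3, 1, NA))
d_man <- pairwise_distance(x, "manhattan")
d_cor <- pairwise_distance(x, "correlation")
d_frac <- pairwise_distance(x, "fractional", p = 0.5)
d_man["s1", "s2"]
```

Samples 's1' and 's2' are perfectly correlated, so their correlation distance is zero even though their Manhattan distance is not.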
Evaluate pairwise belonging/not-belonging classification
Description
Computes TP/FP/FN/TN, precision, recall (sensitivity), F1, accuracy, balanced accuracy from the result of [get_sample_relations_by_cutoff()].
Usage
get_evaluation_metrics(belonging, not_belonging, quiet = TRUE)
Arguments
belonging |
Tibble of pairs flagged as belonging (same cluster). |
not_belonging |
Tibble of pairs flagged as not belonging. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
Named list of metrics; preserves the keys used by the Python port (e.g. 'False_Negative_Pairs', 'False_Positive_Distances').
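For reference, the listed metrics follow from the four confusion-matrix counts by their standard definitions. The helper 'classification_metrics' below is a hypothetical base-R sketch with invented counts, and its element names are not the keys returned by the package.

```r
# Standard definitions from confusion-matrix counts; names are illustrative.
classification_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)          # sensitivity
  specificity <- tn / (tn + fp)
  list(
    precision = precision,
    recall = recall,
    F1 = 2 * precision * recall / (precision + recall),
    accuracy = (tp + tn) / (tp + fp + fn + tn),
    balanced_accuracy = (recall + specificity) / 2
  )
}

m <- classification_metrics(tp = 8, fp = 2, fn = 2, tn = 88)
str(m)
```

Because same-patient pairs are usually far rarer than cross-patient pairs, accuracy alone tends to look high; balanced accuracy and F1 are more informative here.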
k nearest neighbours of every sample
Description
k nearest neighbours of every sample
Usage
get_nearest_neighbours(df_pivot, distance_matrix, k = 4)
Arguments
df_pivot |
Wide tibble (rows = samples, columns = proteins). |
distance_matrix |
Symmetric distance matrix in the same row order as 'df_pivot'. |
k |
Number of neighbours per sample. |
Value
Tibble with columns 'Neighbor_1..k' and 'Distance_1..k', row-aligned with 'df_pivot'.
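The neighbour lookup amounts to ordering each row of the distance matrix and skipping the sample itself. The helper 'nearest_neighbours' below is a hypothetical base-R sketch that returns a plain data frame rather than a tibble and takes only the distance matrix.

```r
# Hypothetical helper: k nearest neighbours per row of a distance matrix.
nearest_neighbours <- function(distance_matrix, k = 2) {
  ids <- rownames(distance_matrix)
  # n x k matrix of neighbour indices, nearest first, excluding the sample itself.
  idx <- t(matrix(vapply(seq_along(ids), function(i) {
    setdiff(order(distance_matrix[i, ]), i)[seq_len(k)]
  }, integer(k)), nrow = k))
  # Matching n x k matrix of distances.
  dist_k <- matrix(
    distance_matrix[cbind(rep(seq_along(ids), k), as.vector(idx))],
    ncol = k
  )
  out <- data.frame(matrix(ids[idx], ncol = k), dist_k, row.names = ids)
  names(out) <- c(paste0("Neighbor_", seq_len(k)), paste0("Distance_", seq_len(k)))
  out
}

# Four samples on a line; distances are absolute position differences.
pos <- c(a = 0, b = 1, c = 3, d = 7)
dm <- abs(outer(pos, pos, "-"))
nn <- nearest_neighbours(dm, k = 2)
nn
```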
Split pairwise distances into belonging / not-belonging by cutoff
Description
Pairs with 'distance <= cutoff' go into 'belonging', others into 'not_belonging'. Patient IDs are looked up via 'sample_patient_mapping'.
Usage
get_sample_relations_by_cutoff(
distance_matrix,
cutoff,
sample_patient_mapping,
sample_order,
quiet = TRUE
)
Arguments
distance_matrix |
Symmetric numeric matrix. |
cutoff |
Numeric cutoff. |
sample_patient_mapping |
Named character vector (names = 'Sample_ID', values = 'Patient_ID'). |
sample_order |
Row/col order of 'distance_matrix' as a character vector of sample IDs. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
List with 'belonging' and 'not_belonging' tibbles.
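The split can be sketched in base R by listing each sample pair once and comparing its distance to the cutoff. The helper 'split_pairs_by_cutoff' and its column names are hypothetical; the package's output columns may differ.

```r
# Hypothetical helper: classify each unordered sample pair by a distance cutoff.
split_pairs_by_cutoff <- function(distance_matrix, cutoff, sample_patient_mapping) {
  ids <- rownames(distance_matrix)
  upper <- which(upper.tri(distance_matrix), arr.ind = TRUE)  # each pair once
  pairs <- data.frame(
    Sample_1 = ids[upper[, "row"]],
    Sample_2 = ids[upper[, "col"]],
    Distance = distance_matrix[upper]
  )
  # Ground truth: do both samples map to the same patient?
  pairs$Same_Patient <- unname(
    sample_patient_mapping[pairs$Sample_1] == sample_patient_mapping[pairs$Sample_2]
  )
  list(
    belonging = pairs[pairs$Distance <= cutoff, ],
    not_belonging = pairs[pairs$Distance > cutoff, ]
  )
}

pos <- c(a = 0, b = 0.2, c = 5, d = 5.1)
dm <- abs(outer(pos, pos, "-"))
mapping <- c(a = "P1", b = "P1", c = "P2", d = "P3")
rel <- split_pairs_by_cutoff(dm, cutoff = 1, sample_patient_mapping = mapping)
rel$belonging
```

In this toy data the pair 'c'/'d' falls under the cutoff although the samples come from different patients, which is what a false positive looks like downstream.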
Pick a binary-classifier threshold from probabilities
Description
Supports '"ROC"' (Youden's J via 'pROC'), '"F1"' (best F1 via grid), '"J"' (alias for ROC), and '"MinFP"' (largest threshold with FPR <= max_fpr).
Usage
get_threshold(
y_test,
y_prob,
method = c("ROC", "F1", "J", "MinFP"),
max_fpr = 0.01
)
Arguments
y_test |
Binary 0/1 vector. |
y_prob |
Predicted probability of class 1. |
method |
One of '"ROC"', '"F1"', '"J"', '"MinFP"'. |
max_fpr |
Used only for '"MinFP"'. |
Value
List with 'y_pred_adjusted' and 'threshold'.
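Youden's J picks the threshold that maximises the true-positive rate minus the false-positive rate. The sketch below is a hypothetical base-R version that scans the observed probabilities as candidate thresholds; it does not use 'pROC', whose candidate thresholds may differ, and the data are invented.

```r
# Hypothetical helper: threshold maximising Youden's J = TPR - FPR.
youden_threshold <- function(y_test, y_prob) {
  candidates <- sort(unique(y_prob))
  j <- vapply(candidates, function(th) {
    pred <- y_prob >= th
    mean(pred[y_test == 1]) - mean(pred[y_test == 0])  # TPR - FPR
  }, numeric(1))
  threshold <- candidates[which.max(j)]
  list(y_pred_adjusted = as.integer(y_prob >= threshold), threshold = threshold)
}

y_test <- c(0, 0, 0, 1, 1, 0, 1, 1)
y_prob <- c(0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9)
thr <- youden_threshold(y_test, y_prob)
thr$threshold
thr$y_pred_adjusted
```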
Log2-transform the intensity column in place (long format)
Description
Log2-transform the intensity column in place (long format)
Usage
log_transform(df)
Arguments
df |
Long-format intensity data frame. |
Value
'df' with 'Intensity = log2(Intensity)'.
Examples
df <- spqrp_example_data("input_cohort_df")
head(log_transform(df))
Long-to-wide pivot (samples in rows, proteins in columns)
Description
Pivots a long-format intensity table into a (sample x protein) wide tibble. Rows are sorted by 'Sample_ID' and columns by protein name (both via codepoint / radix sort) so the matrix layout matches pandas' 'pivot_table' output in the Python port – a prerequisite for reproducible IsolationForest outputs across the two implementations.
Usage
long_to_wide(intensity_df, value_name = NULL)
Arguments
intensity_df |
Long-format data frame with 'Sample_ID', 'Protein', and intensity values. |
value_name |
Optional name of the intensity column. |
Value
Wide tibble.
Per-sample median normalisation (log-space subtraction)
Description
Subtracts each sample's median intensity from its intensities, then re-centers on the dataset's overall median. By default 'dataset' is assumed to already be in log space. If 'revert_log = TRUE', the function reverts the log transform first, then divides by the per-sample median ratio.
Usage
normalize_medianintensity(
dataset,
string_of_pool = "",
revert_log = FALSE,
sample = SAMPLE,
plot = TRUE
)
Arguments
dataset |
Long-format intensity data frame. |
string_of_pool |
If non-empty, samples whose ID contains this substring are excluded from normalization (kept out of the post-normalization data). |
revert_log |
If 'TRUE', run [revert_log_transform()] first. |
sample |
Column to group by (defaults to '"Sample_ID"'). |
plot |
If 'TRUE', attach a before/after boxplot. |
Details
Returns a list with 'data' (the normalized tibble) and 'plot' (a ggplot showing before/after boxplots).
Value
List with 'data' and (optionally) 'plot'.
Examples
df <- spqrp_example_data("input_cohort_df")
norm <- normalize_medianintensity(log_transform(df), plot = FALSE)
head(norm$data)
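The log-space arithmetic can be written out by hand on a toy table. This base-R sketch assumes the "overall median" is the median of all intensities in the table; the package may define the re-centering constant differently, and the column 'Normalized' is introduced here for illustration only.

```r
# Toy table, already on the log2 scale (invented values).
toy <- data.frame(
  Sample_ID = rep(c("s1", "s2"), each = 3),
  Protein = rep(c("A", "B", "C"), times = 2),
  Intensity = c(10, 12, 14, 11, 13, 15)
)

# Per-sample median, repeated for every row of that sample.
sample_median <- ave(toy$Intensity, toy$Sample_ID, FUN = median)
# Assumed re-centering constant: median over all intensities.
overall_median <- median(toy$Intensity)

toy$Normalized <- toy$Intensity - sample_median + overall_median
tapply(toy$Normalized, toy$Sample_ID, median)
```

After normalisation every sample has the same median, so remaining differences between samples reflect relative protein abundance rather than loading.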
Grid-search the cutoff that optimises a chosen performance metric
Description
For each value of 'n' (the number of top-ranked proteins) and each fractional-p (only used when 'metric = "fractional"'), sweeps a fixed grid of percentile cutoffs and records the parameters that optimize 'optimization_strategy'.
Usage
optimize_parameters(
df,
metric = "correlation",
log_file = NULL,
top_importance_path = NULL,
top_importance_df = NULL,
range = 2:49,
optimization_strategy = default_strategies(),
remove_list = character(),
quiet = TRUE
)
Arguments
df |
Long-format cohort data frame. |
metric |
Distance metric. '"fractional"' enables a sweep over 'fractional_p_values'. |
log_file |
Optional path; if non-NULL the optimization log is written there. Default 'NULL' (no log). |
top_importance_path |
Optional CSV path with 'Protein', 'Importance'. Used only when 'top_importance_df' is 'NULL'. |
top_importance_df |
Optional pre-loaded ranking. When both this and 'top_importance_path' are 'NULL' (the default) the bundled [cohort_a_ranking] dataset is used. |
range |
Integer vector of 'n' values to evaluate. |
optimization_strategy |
One of '"fp+fn"', '"fp"', '"fn"', '"F1"', '"precision"', '"sensitivity"'. The first three minimise the number of false positives plus false negatives, false positives (fp), or false negatives (fn); the last three maximise F1, precision, or sensitivity.
remove_list |
Proteins to drop from the ranking. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
Tibble with one row per 'n', listing the best parameters and their classification metrics.
Examples
df <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")
best <- optimize_parameters(
df = df, top_importance_df = ranking,
metric = "manhattan", range = 2:4
)
best
Percentile cutoff (numpy.percentile-equivalent)
Description
Returns the 'percentile'-th percentile of 'distances' using linear interpolation ('stats::quantile' type 7), matching the default of 'numpy.percentile' for the values used in the package.
Usage
percentile_cutoff(distances, percentile = 25)
Arguments
distances |
Numeric vector of pairwise distances. |
percentile |
Percentile in '[0, 100]'. |
Value
A single numeric.
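Type-7 interpolation can be checked by hand against 'stats::quantile'. In this base-R sketch 'percentile_type7' is a hypothetical helper written only to spell out the formula, and the distances are invented.

```r
# Hypothetical helper spelling out type-7 linear interpolation.
percentile_type7 <- function(x, percentile) {
  sorted <- sort(x)
  h <- (length(sorted) - 1) * percentile / 100 + 1  # fractional 1-based rank
  lo <- floor(h)
  hi <- ceiling(h)
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
}

distances <- c(0.1, 0.4, 0.2, 0.9, 0.5)
builtin <- quantile(distances, probs = 10 / 100, type = 7, names = FALSE)
manual <- percentile_type7(distances, 10)
c(builtin = builtin, manual = manual)
```

Note that 'stats::quantile' takes a probability in '[0, 1]', so a percentile in '[0, 100]' has to be divided by 100 first.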
Threshold-based pairwise distance evaluation
Description
Computes pairwise distances on the top-'n' proteins, splits sample pairs by a percentile cutoff ('p') on the distance distribution, and computes classification metrics against the patient ID ground truth.
Usage
perform_distance_evaluation_on_ranked_proteins(
df,
top_importance_path = NULL,
top_importance_df = NULL,
n = 10L,
p = 0.5,
remove_list = NULL,
metric = "correlation",
fractional_p = 0.5,
threshold_based = TRUE,
quiet = TRUE,
number_display_neighbours = 4L,
name = "",
plot = TRUE,
save_path = NULL
)
Arguments
df |
Long-format cohort data frame. |
top_importance_path |
Optional path to a CSV with 'Protein' and 'Importance'. Used only when 'top_importance_df' is 'NULL'. |
top_importance_df |
Optional pre-loaded ranking data frame. If supplied, 'top_importance_path' is ignored. When both are 'NULL' (the default) the bundled [cohort_a_ranking] dataset is used. |
n |
Number of top-ranked proteins. |
p |
Percentile (0-100) for the distance cutoff. |
remove_list |
Proteins to exclude from the ranking. |
metric |
Distance metric (see [get_distances()]). |
fractional_p |
Fractional/Minkowski exponent. |
threshold_based |
If 'FALSE', only return distances and skip classification. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
number_display_neighbours |
Number of nearest neighbours to report. |
name |
Plot title suffix; appended to "Distribution of Pairwise Distances". Set this to a cohort label (e.g. 'name = "Cohort A"') so saved plots are self-documenting. |
plot |
If 'TRUE', draw the distance histogram with FN/FP overlays and a legend matching the Python figure. |
save_path |
Where to save a high-resolution render of the distance-distribution plot. Accepts 'NULL' (default, don't save), 'TRUE' (auto-save to a timestamped file in 'tempdir()'), or a character path (e.g. '"distances.png"'). Same semantics as [run_clustering()]'s 'save_path'. Only used when 'plot = TRUE'.
Value
Invisibly returns a list with 'top_importance', 'nearest_neighbours', 'cutoff', 'belonging', 'not_belonging', 'eval_metrics', 'distance_matrix', and 'plot' (the ggplot built when 'plot = TRUE'; 'NULL' otherwise). 'invisible()' keeps the REPL silent on unassigned calls. Assign to a name and use 'result$plot', 'result$eval_metrics', etc. To render the distance-distribution histogram on demand: 'print(result$plot)'.
Examples
df <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")
result <- perform_distance_evaluation_on_ranked_proteins(
df = df, top_importance_df = ranking,
metric = "manhattan", p = 0.989, n = 4L
)
result$eval_metrics[c("TP", "FP", "FN", "TN", "F1")]
result$plot
Regress intensity on plate and replace it with OLS residuals
Description
Identifies a 'plate' (or 'Plate') column, encodes it as integers, fits 'lm(Intensity ~ plate)', and replaces 'Intensity' with the model's residuals. If no plate column is present, returns the input unchanged with a message.
Usage
plate_correct_residuals_by_protein(
group_data,
individual = PATIENT,
sample = SAMPLE,
impute = FALSE,
verbose = FALSE
)
Arguments
group_data |
Long-format intensity data frame. |
individual |
Patient identifier column (default '"Patient_ID"'). |
sample |
Sample identifier column (default '"Sample_ID"'). |
impute |
If 'TRUE', impute missing intensities by patient/protein median before regression; otherwise drop NA rows. |
verbose |
If 'TRUE', also build before/after boxplots. |
Value
Tibble with corrected 'Intensity'. Attribute '"plot"' carries the diagnostic ggplot when 'verbose = TRUE'.
Examples
df <- spqrp_example_data("input_cohort_df")
df$plate <- rep(c("A", "B"), length.out = nrow(df))
corrected <- plate_correct_residuals_by_protein(df)
head(corrected)
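The residual replacement can be sketched with 'stats::lm' directly. This base-R illustration assumes the regression is fitted separately for each protein, as the function name suggests, and uses an invented two-plate table; it is not the package's code.

```r
# Toy table: protein A shows a strong plate shift, protein B a mild one.
toy <- data.frame(
  Sample_ID = rep(paste0("s", 1:4), times = 2),
  Protein = rep(c("A", "B"), each = 4),
  plate = rep(c("p1", "p1", "p2", "p2"), times = 2),
  Intensity = c(10, 12, 20, 22, 5, 7, 6, 8)
)

# Per protein (an assumption): encode the plate as integers, regress
# intensity on it, and keep only the residuals.
toy_corrected <- do.call(rbind, lapply(split(toy, toy$Protein), function(g) {
  code <- as.integer(factor(g$plate))
  g$Intensity <- unname(residuals(lm(g$Intensity ~ code)))
  g
}))
toy_corrected$Intensity
```

With two plates the residuals are simply deviations from each plate's mean, so the plate shift disappears while within-plate differences are kept.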
Heavy clustering visualisation (TP hulls, FP edges, singleton markers)
Description
Builds the canonical SPQRP cluster plot: convex hulls around same-patient true-positive clusters, dotted edges for cross-patient false-positive edges, blue square markers for true-positive singletons, pink circles for uncertain (isolated but should-be-connected) samples. Returns the ggplot plus cluster bookkeeping.
Usage
plot_distances_neighbours_with_coloring_hue(
df,
G,
coords_2d,
method = "UMAP",
subset_samples = NULL,
highlight_singletons = TRUE,
highlight_single_samples_missing_connections = TRUE,
figsize = c(14, 14),
dpi = 150L,
label_patient_only = FALSE,
label_offset_x = 0.01,
label_offset_y = 0.01,
label_font = NULL,
df_name = "DF_NAME",
save_path = NULL,
print = TRUE,
quiet = TRUE
)
Arguments
df |
Long-format cohort data frame. |
G |
igraph object from [cluster_samples_iteratively()]. |
coords_2d |
2D coordinates from [cluster_samples_iteratively()]. |
method |
Reduction method ('"UMAP"', '"PCA"', '"MDS"'). |
subset_samples |
Optional sample subset to visualise. |
highlight_singletons |
Mark same-patient-singletons with blue squares. |
highlight_single_samples_missing_connections |
Mark uncertain samples with pink circles. |
figsize |
Numeric vector of length 2: width and height in inches. Drives both the 'ggsave' output dimensions (when 'save_path' is set) and the auto-scaling of point sizes, line widths, fonts, and theme 'base_size'. Default 'c(14, 14)'. Use 'c(20, 20)' or larger for publication-quality renders. |
dpi |
Resolution (dots per inch) for the saved file. Default '150' matches the Python package's matplotlib default. Use '300' for print-quality.
label_patient_only |
Label nodes by patient instead of sample. |
label_offset_x, label_offset_y |
Label nudge offsets. Values <= 0.05 are interpreted as a fraction of the coord range (auto-scaled to the data); larger values are absolute.
label_font |
Override the auto-scaled label font size. 'NULL' (default) lets the function pick a size from 'figsize'. |
df_name |
Plot title. |
save_path |
If non-NULL, save plot to this file via 'ggsave'. |
print |
If 'TRUE', print the plot. |
quiet |
If 'TRUE' (default), suppress informational status messages (save-path hints, cluster summaries, transitive performance metrics). Set 'FALSE' to print them. Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
List with 'plot', 'G', 'cluster_assignments', 'transitive_results', 'uncertain_nodes', 'error_candidates'.
Histogram of pairwise distances with optional percentile lines
Description
Histogram of pairwise distances with optional percentile lines
Usage
plot_distribution_of_pairwise_dist(
distances,
percentiles = c(1, 2, 5, 10),
print = TRUE,
quiet = TRUE
)
Arguments
distances |
Numeric vector. |
percentiles |
Vector of percentile values (0-100) to mark with vertical lines. |
print |
If 'TRUE', auto-render the plot (gated by 'quiet'). |
quiet |
If 'TRUE' (default), suppress informational status messages and skip auto-rendering of the returned ggplot. Set 'FALSE' to render. Warnings about genuine data issues are emitted regardless. |
Value
Invisible ggplot.
Histogram with FN/FP/percentile overlays and a legend
Description
Mirrors Python's 'plot_distribution_with_highlights' (helpers.py): a grey histogram of all pairwise distances, overlaid with vertical lines for false-negative pairs (blue), false-positive pairs (orange), and percentile cutoffs (magenta). The legend names each category, matching the Python figure key.
Usage
plot_distribution_with_highlights(
distances,
fn_distances,
fp_distances,
percentiles = c(1, 2, 5, 10),
name = "",
print = TRUE,
figsize = c(8, 5),
dpi = 150L,
save_path = NULL,
quiet = TRUE
)
Arguments
distances |
Numeric vector of all pairwise distances. |
fn_distances |
Distances of false-negative pairs. Pass 'numeric(0)' or 'NULL' to omit the FN legend entry. |
fp_distances |
Distances of false-positive pairs. Pass 'numeric(0)' or 'NULL' to omit the FP legend entry. |
percentiles |
Percentiles (0-100) to draw as vertical lines. Each percentile becomes its own legend entry. |
name |
Title suffix appended to "Distribution of Pairwise Distances". Set this to a cohort label so saved plots are self-documenting. |
print |
If 'TRUE', print the plot. |
figsize |
Numeric vector of length 2: width and height in inches for 'ggsave' output. Default 'c(8, 5)' matches Python's 'plt.figure(figsize=(8, 5))'.
dpi |
Resolution (dots per inch) for the saved file. Default '150'. |
save_path |
Where to save a high-resolution PNG/SVG/PDF render. Accepts:
* 'NULL' (default) – don't save; only return the ggplot object. The function prints a hint about how to save.
* a character path (e.g. '"distances.png"') – save there via 'ggsave()'. Extension chooses the format.
quiet |
If 'TRUE' (default), suppress the informational 'save_path' hints. Warnings about genuine data issues are emitted regardless. |
Value
Invisible ggplot.
Print a one-line summary of an spqrp_train object
Description
Displays the classifier backend, the number of training/test pairs, and the feature count for the pairwise random-forest model returned by [train_with_normalise()] and [train_pairwise_balanced_rand_forest()].
Usage
## S3 method for class 'spqrp_train'
print(x, ...)
Arguments
x |
A 'spqrp_train' object. |
... |
Unused; present for S3 generic compatibility. |
Value
'x', invisibly.
Remove samples flagged as outliers by Isolation Forest
Description
Convenience wrapper around [by_isolation_forest()] with median imputation. Removes samples (not proteins) whose intensity profile looks anomalous compared to the rest of the cohort.
Usage
remove_outlier_samples(
df,
sample = SAMPLE,
contamination = "auto",
outlier_threshold = 0.6,
quiet = TRUE
)
Arguments
df |
Long-format intensity data frame. |
sample |
Sample column (defaults to '"Sample_ID"'). |
contamination |
'"auto"' (default) or a numeric in '[0, 1]'. See [by_isolation_forest()] for details. |
outlier_threshold |
Anomaly-score cutoff used when 'contamination = "auto"'. Default '0.6', calibrated empirically for solitude's anomaly-score distribution. See [by_isolation_forest()] for the rationale. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Details
Pass 'contamination = 0.1' (or any fraction) to mimic sklearn's 'IsolationForest(contamination = 0.1)' behaviour, or keep the default 'contamination = "auto"' to use the conservative absolute threshold.
The returned list includes 'anomaly_plot', a 'plotly' bar chart of per-sample anomaly scores coloured by outlier flag. Printing the object at the R REPL (or 'print(result$anomaly_plot)' inside a script) renders the chart – mirroring the Python wrapper's auto-shown bar plot, but without surprising side effects when the function is called non-interactively.
Value
Invisibly returns a named list with components:
* 'df' – filtered tibble (same shape as 'df', fewer rows)
* 'anomaly_df' – per-sample tibble of 'Sample_ID', 'Anomaly Score', 'Outlier'
* 'outlier_list' – character vector of flagged 'Sample_ID's
* 'anomaly_plot' – a 'plotly' figure; 'print(result$anomaly_plot)' to view the bar chart. 'NULL' if the optional 'plotly' package is not installed (a message explains how to enable it).
The return is wrapped in 'invisible()' so unassigned REPL calls stay silent (matches 'quiet = TRUE'). Assign to a name to inspect.
Examples
df <- spqrp_example_data("input_cohort_df")
filtered <- remove_outlier_samples(df, contamination = "auto")
filtered$outlier_list
head(filtered$df)
Convert classifier output to a Protein / Importance ranking
Description
Strips the 'diff_' prefix the pairwise model adds to feature names. Importance values are normalised to sum to 1.0 across features at training time (matching sklearn's 'clf.feature_importances_' convention), so the numbers in the returned tibble are directly comparable to Python output. Rank order is preserved across the normalisation.
Usage
retrieve_ranking(results)
Arguments
results |
Output of [train_with_normalise()]. |
Value
Tibble with 'Protein' and 'Importance' columns; 'Importance' sums to ~1.0.
Examples
df <- spqrp_example_data("input_cohort_df")
results <- train_with_normalise(df, plate_corrected = FALSE,
outlier_removal = FALSE)
retrieve_ranking(results)
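The prefix stripping and sum-to-one scaling can be illustrated on a named vector. This base-R sketch uses invented raw importances and performs the normalisation inline, whereas the package is described as normalising at training time.

```r
# Invented raw importances with the 'diff_' prefix added by the pairwise model.
raw <- c(diff_P01861_IGHG4 = 6, diff_P02768_ALB = 3, diff_P00738_HP = 1)

toy_ranking <- data.frame(
  Protein = sub("^diff_", "", names(raw)),   # strip the feature prefix
  Importance = unname(raw) / sum(raw)        # scale so importances sum to 1
)
# Order from most to least important.
toy_ranking <- toy_ranking[order(toy_ranking$Importance, decreasing = TRUE), ]
toy_ranking
```

Dividing by a positive constant does not change the ordering, which is why rank order survives the normalisation.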
Inverse of log2-transform: raise intensities to the power of 2
Description
Inverse of log2-transform: raise intensities to the power of 2
Usage
revert_log_transform(df)
Arguments
df |
Long-format intensity data frame. |
Value
'df' with 'Intensity = 2^Intensity'.
End-to-end clustering pipeline
Description
Computes pairwise distances on the top-'n' ranked proteins, builds a k-nearest-neighbour graph in a 2D embedding (default UMAP), iteratively splits big components by max-weight edge, and visualises the result.
Usage
run_clustering(
df,
ranking,
n_neighbors,
max_component_size,
metric = "manhattan",
n = 20L,
fractional_p = 0.98,
plot_name = "DF_Ranking_X on DF_Y",
method = "UMAP",
figsize = c(16, 16),
dpi = 200L,
save_path = NULL,
quiet = TRUE
)
Arguments
df |
Long-format cohort data frame. |
ranking |
Data frame with 'Protein' and 'Importance'. |
n_neighbors |
Number of nearest-neighbour edges per sample. |
max_component_size |
Maximum allowed connected component size. |
metric |
Distance metric. |
n |
Number of top-ranked proteins to use. |
fractional_p |
Fractional/Minkowski exponent. |
plot_name |
Plot title. |
method |
Dimensionality reduction method ('"UMAP"', '"PCA"', '"MDS"'). |
figsize |
Numeric vector of length 2: width and height in inches. Used both for 'ggsave' (when 'save_path' is set) and to auto-scale point sizes, line widths, and text on the plot. Larger values produce more readable plots. Default 'c(16, 16)'. |
dpi |
Resolution (dots per inch) for the saved file. Default '200' (chosen to approximate the Python package's matplotlib output; increase to 300 for print). |
save_path |
Where to save a high-resolution PNG/SVG/PDF render. Accepts:
* 'NULL' (default) – don't save; only return the ggplot object. The function still prints a hint about how to download the plot.
* a character path (e.g. '"out.png"' or '"figs/cluster.svg"') – save there via 'ggsave()'. The file extension determines the format. |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
Invisibly returns a list with 'result_filtered', 'G' (the igraph object), 'cluster_assignments', 'transitive_results', 'uncertain_samples', 'error_candidate_samples', 'plot', and 'saved_path' (the path passed in via 'save_path', or 'NULL'). 'invisible()' keeps the REPL silent on unassigned calls. Assign to a name to inspect; render the cluster plot on demand via 'print(result$plot)'.
Examples
df <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")
res <- run_clustering(
df = df, ranking = ranking,
n_neighbors = 1L, max_component_size = 2L,
metric = "manhattan", method = "PCA"
)
head(res$cluster_assignments)
res$transitive_results
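The graph step described above (k-nearest-neighbour edges, then repeatedly cutting the heaviest edge of any oversized component) can be sketched in base R. This is a simplified illustration, not the package's code: the embedding 'emb', the helper 'component_labels()', and the use of Manhattan distance in the embedding as edge weight are all assumptions made for the example.

```r
# Hypothetical 2D embedding of six samples.
emb <- rbind(c(0, 0), c(0, 1), c(2, 1.5), c(2, 2.2), c(10, 10), c(10, 10.5))
d <- as.matrix(dist(emb, method = "manhattan"))
diag(d) <- Inf  # a sample is never its own neighbour

# k-nearest-neighbour edges, stored once per undirected pair.
k <- 2L
edges <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
  nn <- order(d[i, ])[seq_len(k)]
  data.frame(from = pmin(i, nn), to = pmax(i, nn),
             weight = unname(d[i, nn]), row.names = NULL)
}))
edges <- unique(edges)

# Label connected components by propagating the smallest node index
# along edges until nothing changes.
component_labels <- function(n, edges) {
  lab <- seq_len(n)
  repeat {
    old <- lab
    for (e in seq_len(nrow(edges))) {
      m <- min(lab[edges$from[e]], lab[edges$to[e]])
      lab[edges$from[e]] <- m
      lab[edges$to[e]] <- m
    }
    if (identical(lab, old)) break
  }
  lab
}

# While any component is too large, remove its maximum-weight edge.
max_component_size <- 2L
repeat {
  lab <- component_labels(nrow(emb), edges)
  sizes <- table(lab)
  big <- as.integer(names(sizes)[sizes > max_component_size])
  if (length(big) == 0L) break
  in_big <- which(lab[edges$from] == big[1])
  edges <- edges[-in_big[which.max(edges$weight[in_big])], ]
}
print(split(seq_along(lab), lab))
```

With 'max_component_size = 2L' the loop stops once no component contains more than two samples, which is the setting used in the example call above.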
Load a bundled example data file as a tibble
Description
Load a bundled example data file as a tibble
Usage
spqrp_example_data(which = c("input_cohort_df", "protein_ranking"))
Arguments
which |
One of '"input_cohort_df"', '"protein_ranking"'. |
Details
The package ships two example CSV files in 'inst/extdata/', both describing a small synthetic cohort intended only for runnable examples and tests:
* 'example_input_cohort_df.csv' – mock cohort (30 patients x 2 samples x 5 proteins) in long format with the required columns 'Sample_ID', 'Patient_ID', 'Protein', 'Intensity'.
* 'example_protein_ranking.csv' – protein importance ranking aligned with the mock cohort.
The real-cohort protein-importance ranking is provided separately as the lazy-loaded [cohort_a_ranking] dataset: a tibble of 'Protein' / 'Importance' computed by the pairwise balanced random-forest classifier on plasma cohort "A". It is the built-in default ranking for [perform_distance_evaluation_on_ranked_proteins()] and [optimize_parameters()], and is accessed with 'data(cohort_a_ranking)' or 'spqrp::cohort_a_ranking' rather than through this function.
Use [spqrp_example_path()] if you need the file path instead of the loaded data.
Value
A tibble.
Examples
spqrp_example_data("input_cohort_df")
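To make the required long format concrete, the sketch below builds a stand-in frame with the same dimensions as the bundled mock cohort (30 patients x 2 samples x 5 proteins) and checks the four required columns. The identifiers ('PAT01', 'P1', the 'Visit' helper column) and the random intensities are invented for illustration and are not the contents of the shipped CSV.

```r
# One row per (patient, sample, protein) combination.
grid <- expand.grid(
  Protein = paste0("P", 1:5),
  Visit = 1:2,
  Patient_ID = sprintf("PAT%02d", 1:30),
  stringsAsFactors = FALSE
)

set.seed(1)  # made-up intensities, fixed for reproducibility
cohort <- data.frame(
  Sample_ID = paste(grid$Patient_ID, grid$Visit, sep = "_"),
  Patient_ID = grid$Patient_ID,
  Protein = grid$Protein,
  Intensity = rnorm(nrow(grid), mean = 20, sd = 2)
)

# Check the required columns and the expected number of rows.
required <- c("Sample_ID", "Patient_ID", "Protein", "Intensity")
stopifnot(all(required %in% names(cohort)), nrow(cohort) == 30 * 2 * 5)
head(cohort)
```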
Filesystem path to a bundled example CSV
Description
Filesystem path to a bundled example CSV
Usage
spqrp_example_path(which = c("input_cohort_df", "protein_ranking"))
Arguments
which |
One of '"input_cohort_df"', '"protein_ranking"'. |
Value
Absolute character path inside 'inst/extdata/'.
Examples
spqrp_example_path("input_cohort_df")
Pairwise balanced random-forest classifier
Description
Builds a pairwise design matrix (feature-wise differences of every sample pair, optionally augmented with the Euclidean distance), labels each pair 1 if the two samples share a patient ID, then trains a class-balanced random forest. The classifier backend is selectable.
Usage
train_pairwise_balanced_rand_forest(
X_train,
y_train,
X_test,
y_test,
df_pivot_test,
compute_euclid = TRUE,
method = "F1",
classifier_backend = c("randomForest", "ranger", "themis_smote"),
k = 0L,
plots_per_sample = FALSE,
sample_decision_curve = FALSE,
absolute = FALSE,
quiet = TRUE
)
Arguments
X_train, X_test |
Sample x feature matrices. |
y_train, y_test |
Patient labels (vectors with one entry per row). |
df_pivot_test |
Wide test frame including 'Sample_ID' column – used to label misclassified pairs by sample. |
compute_euclid |
Add a NaN-aware Euclidean distance feature. |
method |
Threshold selection (see [get_threshold()]). |
classifier_backend |
'"randomForest"' (default – closest behaviour to Python's 'imblearn.BalancedRandomForestClassifier' via per-tree balanced bootstrap), '"ranger"' (faster; class-weighted impurity), or '"themis_smote"' (SMOTE oversampling). See <https://github.com/fhradilak/spqrp_r/blob/main/articles/numerical-divergence.md> for the tradeoffs. Importance values returned in the results are normalised to sum to 1.0 across features (matching sklearn's 'clf.feature_importances_' convention) regardless of backend. |
k |
Fold number for diagnostic printing. |
plots_per_sample |
Per-sample probability plots. |
sample_decision_curve |
If 'TRUE', draw ROC + PR + threshold plots. |
absolute |
Take absolute value of feature differences before passing to the model. (Stored after training is complete.) |
quiet |
If 'TRUE' (default), suppress informational status messages. Set 'FALSE' to print progress and per-call summaries (sample counts, chosen cutoff, etc.). Warnings about genuine data issues – e.g. samples dropped from the analysis – are emitted regardless. |
Value
Named list as described in the package docs.
Examples
df <- spqrp_example_data("input_cohort_df")
# In practice, call the high-level [train_with_normalise()] instead:
# it handles the train/test split, normalisation, and pivoting for you.
res <- train_with_normalise(df, plate_corrected = FALSE,
outlier_removal = FALSE)
res$classifier_backend
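The pairwise design matrix from the Description can be sketched in base R as follows. This is a simplified illustration with a made-up matrix 'X' and patient vector; the column names 'sample_1', 'sample_2', 'euclid', and 'same_patient' are hypothetical, and the simple 'na.rm = TRUE' handling of missing values is an assumption about what "NaN-aware" means here.

```r
# Hypothetical sample x feature matrix (4 samples, 2 proteins) and patient labels.
X <- matrix(c(1.0, 1.2, 3.0, 3.1,
              5.0, 5.5, 2.0, 2.4),
            ncol = 2, dimnames = list(paste0("S", 1:4), c("P1", "P2")))
patient <- c("A", "A", "B", "B")

# Enumerate every unordered sample pair (one row per pair).
pairs <- t(combn(nrow(X), 2))

# Feature-wise differences, named with the 'diff_' prefix.
diffs <- X[pairs[, 1], , drop = FALSE] - X[pairs[, 2], , drop = FALSE]
colnames(diffs) <- paste0("diff_", colnames(X))
rownames(diffs) <- NULL

design <- data.frame(
  sample_1 = rownames(X)[pairs[, 1]],
  sample_2 = rownames(X)[pairs[, 2]],
  diffs,
  # Optional extra feature: Euclidean distance, skipping missing values.
  euclid = sqrt(rowSums(diffs^2, na.rm = TRUE)),
  # Label 1 if both samples come from the same patient, else 0.
  same_patient = as.integer(patient[pairs[, 1]] == patient[pairs[, 2]])
)
print(design)
```

Because only same-patient pairs are labelled 1, the positive class is rare (2 of 6 pairs in this toy case), which is why a class-balanced forest is used.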
End-to-end ranking pipeline: filter, normalise, optionally plate-correct, train RF
Description
Mirrors 'protein_selection.train_with_normalise' from the Python package but exposes 'classifier_backend' so users can compare three RF variants ('"ranger"', '"randomForest"', '"themis_smote"'). See <https://github.com/fhradilak/spqrp_r/blob/main/articles/numerical-divergence.md> for the tradeoffs.
Usage
train_with_normalise(
df,
threshold = 0.7,
test_size = 0.3,
plate_corrected = TRUE,
individual = PATIENT,
sample = SAMPLE,
compute_euclid = FALSE,
method = "F1",
outlier_removal = TRUE,
train_individuals = NULL,
test_individuals = NULL,
sample_decision_curve = FALSE,
classifier_backend = c("randomForest", "ranger", "themis_smote"),
importance_method = "impurity",
plot_per_sample = FALSE,
absolute = FALSE,
quiet = TRUE
)
Arguments
df |
Long-format cohort data frame. |
threshold |
Occurrence-filter threshold. |
test_size |
Patient-level test fraction. |
plate_corrected |
If 'TRUE', run plate-effect residualisation. |
individual |
Patient column. |
sample |
Sample column. |
compute_euclid |
Add NaN-aware Euclidean distance feature. |
method |
Threshold-selection strategy. |
outlier_removal |
Run [by_isolation_forest()] on each split. |
train_individuals, test_individuals |
Explicit split overrides. |
sample_decision_curve |
Draw ROC/PR curves. |
classifier_backend |
'"randomForest"' (default – closest behaviour to Python's 'imblearn.BalancedRandomForestClassifier'), '"ranger"' (faster), or '"themis_smote"'. The default was changed from '"ranger"' to '"randomForest"' to bring R rankings closer to those of the original Python package. See <https://github.com/fhradilak/spqrp_r/blob/main/articles/numerical-divergence.md>. |
importance_method |
Unused placeholder (kept for API parity). |
plot_per_sample |
Per-sample probability plots. |
absolute |
Use absolute pairwise differences. |
quiet |
If 'TRUE' (default), suppress informational status messages (train/test split listing, "Proteins only in test set", outliers removed, fold headers, per-fold metrics, top-importance list, and per-misclassified-pair prints) and skip auto-rendering of the ROC / PR / probability plots. Set 'FALSE' to print everything. Warnings about genuine data issues are emitted regardless. |
Value
'spqrp_train' S3 object (a named list with classifier, pair indices, feature importances, misclassified pairs).
Examples
df <- spqrp_example_data("input_cohort_df")
res <- train_with_normalise(df, plate_corrected = FALSE,
outlier_removal = FALSE)
retrieve_ranking(res)
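A patient-level split, as implied by 'test_size', keeps all samples of a patient on the same side of the split so that no same-patient pair straddles train and test. The base-R sketch below illustrates the idea with a made-up sample table; the rounding rule and the seed are assumptions, not the package's documented behaviour.

```r
# Hypothetical sample table: 10 patients with 2 samples each.
samples <- data.frame(
  Sample_ID = paste0("S", 1:20),
  Patient_ID = rep(paste0("PAT", 1:10), each = 2)
)

# Draw test patients (not test samples) according to the test fraction.
test_size <- 0.3
set.seed(42)
patients <- unique(samples$Patient_ID)
test_patients <- sample(patients, size = round(test_size * length(patients)))

# Every sample follows its patient into the train or test set.
train <- samples[!samples$Patient_ID %in% test_patients, ]
test  <- samples[samples$Patient_ID %in% test_patients, ]
table(split = ifelse(samples$Patient_ID %in% test_patients, "test", "train"))
```

Vectors of patient IDs such as 'test_patients' are presumably what 'train_individuals' and 'test_individuals' accept when the split is set explicitly.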