
spqrp

Sample Provenance Quality Resolver in Proteomics — native R port of the Python spqrp package.

Recent advances in mass spectrometry (MS) technology and lab methods have opened the door to large-scale proteomics, but they have also raised concern about sample mix-ups. spqrp helps you evaluate whether sample data is safe for further analysis by clustering samples and flagging probable mix-ups, uncertain assignments, and outliers.

Install

# install.packages("remotes")
remotes::install_github("fhradilak/spqrp_r")

No Python install needed — this is a native R port.

Input data format

A long-format data frame with these columns:

Column	Description
`Sample_ID`	Unique sample identifier
`Patient_ID`	Patient identifier
`Protein`	Protein name/identifier
`Intensity`	Numeric intensity value

Optionally, supply a protein ranking with Protein and Importance columns. If you don't supply one, the package uses a precomputed ranking from a plasma cohort (spqrp_example_data("ranking_cohort_a")).
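To make the expected shape concrete, here is a minimal base-R sketch of a long-format input and an optional ranking. The toy values and the helper check_columns_sketch() are hypothetical and are not part of spqrp; the package ships its own check_input_data_format(), whose exact behaviour is not shown here.

```r
# Toy long-format input: one row per (sample, protein) measurement.
toy_df <- data.frame(
  Sample_ID  = c("S1", "S1", "S2", "S2"),
  Patient_ID = c("P1", "P1", "P1", "P1"),
  Protein    = c("ALB", "APOA1", "ALB", "APOA1"),
  Intensity  = c(1200.5, 310.2, 1180.0, 298.7),
  stringsAsFactors = FALSE
)

# Optional ranking: one row per protein with an importance score.
toy_ranking <- data.frame(
  Protein    = c("ALB", "APOA1"),
  Importance = c(0.7, 0.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of spqrp): report which required columns
# are missing and whether Intensity is numeric.
check_columns_sketch <- function(df,
                                 required = c("Sample_ID", "Patient_ID",
                                              "Protein", "Intensity")) {
  missing_cols <- setdiff(required, names(df))
  list(
    missing = missing_cols,
    ok      = length(missing_cols) == 0 && is.numeric(df$Intensity)
  )
}

chk <- check_columns_sketch(toy_df)
```

Here chk$ok is TRUE for the toy frame; dropping a required column would list it in chk$missing.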

Quick start

library(spqrp)

df      <- spqrp_example_data("input_cohort_df")
ranking <- spqrp_example_data("protein_ranking")

# Clustering: build kNN graph, split big components, visualise
res <- run_clustering(
  df = df, ranking = ranking,
  n_neighbors = 1L,
  max_component_size = 2L,
  metric = "manhattan",
  method = "UMAP"   # or "PCA" / "MDS"
)

res$cluster_assignments       # sample -> cluster ID
res$uncertain_samples         # likely missing connections
res$error_candidate_samples   # likely sample mix-ups
res$plot                      # ggplot object
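As an illustration of the idea behind a 1-nearest-neighbour graph under the Manhattan metric, the following base-R sketch finds each sample's closest neighbour in a toy matrix. It is not the package's implementation, and the sample names and values are made up.

```r
# Toy wide matrix: rows are samples, columns are proteins (log intensities).
x <- rbind(
  A1 = c(10.0, 5.0, 7.0),
  A2 = c(10.2, 5.1, 6.9),
  B1 = c( 8.0, 9.0, 3.0),
  B2 = c( 8.1, 8.8, 3.2)
)

# Pairwise Manhattan (L1) distances between samples.
d <- as.matrix(dist(x, method = "manhattan"))
diag(d) <- Inf  # a sample must not count as its own neighbour

# Nearest neighbour of each sample (n_neighbors = 1).
nearest <- setNames(colnames(d)[apply(d, 1, which.min)], rownames(d))
```

In this toy case A1 and A2 pick each other, as do B1 and B2, so the graph splits into two components of size two. A sample whose nearest neighbour belongs to a different patient is the kind of case a mix-up check would want to surface.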

Verbose output

All spqrp functions are silent by default — no progress messages, no per-call summaries. If you want progress and diagnostic prints (which sample IDs got flagged, what cutoff was picked, how many proteins overlapped between ranking and data, etc.) pass quiet = FALSE to any function that emits status output:

# Prints the flagged Sample_IDs
remove_outlier_samples(df, quiet = FALSE)

# Prints a save-path hint, the cluster listing, and transitive metrics
run_clustering(df, ranking, n_neighbors = 1,
               max_component_size = 3, quiet = FALSE)

# Prints the real-protein count
perform_distance_evaluation_on_ranked_proteins(
  df, top_importance_df = ranking, quiet = FALSE
)

Warnings about genuine data issues — e.g. samples dropped because they lack measurements for any of the top-ranked proteins — fire regardless of quiet, because they signal a real problem you need to see.
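The split between optional status output and unconditional warnings can be sketched with base R's message() and warning(). The function drop_incomplete_sketch() below is hypothetical and only illustrates the pattern; it is not how spqrp is necessarily implemented.

```r
# Hypothetical function (not part of spqrp) showing the quiet/warning split.
drop_incomplete_sketch <- function(values, quiet = TRUE) {
  keep <- !is.na(values)
  if (!quiet) {
    # Status output: only shown when the caller asks for it.
    message("Kept ", sum(keep), " of ", length(values), " values")
  }
  if (any(!keep)) {
    # Data problem: reported regardless of `quiet`.
    warning(sum(!keep), " value(s) dropped because they were NA",
            call. = FALSE)
  }
  values[keep]
}
```

Calling drop_incomplete_sketch(c(1, NA, 3)) stays silent apart from the warning, while quiet = FALSE additionally prints the status line.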

Three random-forest backends for protein ranking

If you don’t have a precomputed ranking, train one. The Python package uses imblearn.BalancedRandomForestClassifier; this R port exposes three substitute backends so you can pick the tradeoff that fits:

results <- train_with_normalise(
  df,
  classifier_backend = "randomForest"  # default — closest to imblearn's BalancedRF
  # classifier_backend = "ranger"        # faster, class.weights on impurity
  # classifier_backend = "themis_smote"  # SMOTE rebalance + ranger
)

new_ranking <- retrieve_ranking(results)

See articles/numerical-divergence.md for when to pick each.
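For context, a balanced random forest in the imblearn sense under-samples each bootstrap sample so that classes are equally represented. A minimal base-R sketch of drawing one such balanced bootstrap sample follows; the helper name and the toy labels are hypothetical, and how each spqrp backend actually rebalances is not shown here.

```r
set.seed(1)

# Toy, heavily imbalanced labels.
labels <- c(rep("minority", 5), rep("majority", 95))

# Hypothetical helper: draw the same number of rows (the size of the
# smallest class) from every class, with replacement.
balanced_bootstrap_idx <- function(y) {
  n_min <- min(table(y))
  by_class <- split(seq_along(y), y)
  unlist(lapply(by_class, function(i) {
    i[sample.int(length(i), n_min, replace = TRUE)]
  }), use.names = FALSE)
}

idx <- balanced_bootstrap_idx(labels)
class_counts <- table(labels[idx])
```

Each tree would then be grown on a sample like labels[idx], in which both classes contribute five rows.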

Threshold-based evaluation

result <- perform_distance_evaluation_on_ranked_proteins(
  df = df,
  top_importance_df = ranking,
  metric = "manhattan",
  p = 0.989,
  n = 20L
)
result$cutoff
result$eval_metrics[c("TP", "FP", "FN", "TN", "Precision", "Sensitivity", "F1")]

optimize_parameters() sweeps n and the percentile cutoff to find optimal values for your dataset.
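The logic of a percentile-based cutoff can be sketched in base R: compute pairwise distances, take a percentile of them as the cutoff, call pairs at or below the cutoff "same patient", and score the calls against the known Patient_ID. This is an illustration under stated assumptions, not the package's code: the toy data are made up, the percentage is on a 0 to 100 scale as in numpy.percentile, and how spqrp interprets its p argument is not confirmed here. R's quantile(type = 7) uses the same linear interpolation as numpy.percentile's default.

```r
# Toy data: four samples, two patients.
x <- rbind(
  S1 = c(10.0, 5.0, 7.0),
  S2 = c(10.2, 5.1, 6.9),
  S3 = c( 8.0, 9.0, 3.0),
  S4 = c( 8.1, 8.8, 3.2)
)
patient <- c(S1 = "P1", S2 = "P1", S3 = "P2", S4 = "P2")

# One row per unordered sample pair.
d   <- as.matrix(dist(x, method = "manhattan"))
idx <- which(upper.tri(d), arr.ind = TRUE)
pair_df <- data.frame(
  a        = rownames(d)[idx[, "row"]],
  b        = colnames(d)[idx[, "col"]],
  distance = d[idx]
)
pair_df$same_patient <- unname(patient[pair_df$a] == patient[pair_df$b])

# Percentile cutoff (30th percentile chosen arbitrarily for the toy data).
pct    <- 30
cutoff <- unname(quantile(pair_df$distance, probs = pct / 100, type = 7))

# Confusion counts for "distance <= cutoff means same patient".
pred <- pair_df$distance <= cutoff
TP <- sum(pred & pair_df$same_patient)
FP <- sum(pred & !pair_df$same_patient)
FN <- sum(!pred & pair_df$same_patient)
TN <- sum(!pred & !pair_df$same_patient)

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
```

On real data the two distance populations overlap, which is why sweeping the percentile and the number of proteins is worthwhile.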

Preprocessing

Optional helpers that mirror the Python pipeline:

df_pp <- df |>
  log_transform() |>
  filter_by_occurrence(cutoff = 0.7)

norm <- normalize_medianintensity(df_pp, plot = FALSE)
df_pp <- norm$data

# If your data has a `plate` column:
df_pp <- plate_correct_residuals_by_protein(df_pp)
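To show what these preprocessing steps do conceptually, here is a base-R sketch on toy data. Every detail in it is an assumption made for illustration: the log base (2), the reading of the occurrence cutoff as the minimum fraction of samples in which a protein is observed, and median normalisation as a per-sample shift on the log scale. The package helpers may differ on each point.

```r
toy <- data.frame(
  Sample_ID = rep(c("S1", "S2", "S3"), each = 3),
  Protein   = rep(c("ALB", "APOA1", "CRP"), times = 3),
  Intensity = c(1024, 256, NA,
                2048, 512, 64,
                 512, 128, NA),
  stringsAsFactors = FALSE
)

# 1. Log-transform (base 2 is an assumption).
toy$LogIntensity <- log2(toy$Intensity)

# 2. Occurrence filter: keep proteins observed in at least 70% of samples.
n_samples  <- length(unique(toy$Sample_ID))
occurrence <- tapply(!is.na(toy$LogIntensity), toy$Protein, sum) / n_samples
keep       <- names(occurrence)[occurrence >= 0.7]
toy_f      <- toy[toy$Protein %in% keep, ]

# 3. Median normalisation: shift each sample so that its median
#    matches the global median.
sample_median <- tapply(toy_f$LogIntensity, toy_f$Sample_ID,
                        median, na.rm = TRUE)
global_median <- median(toy_f$LogIntensity, na.rm = TRUE)
toy_f$Normalised <- toy_f$LogIntensity -
  as.numeric(sample_median[toy_f$Sample_ID]) + global_median
```

In the toy data CRP is observed in only one of three samples and is filtered out, and after normalisation the three samples share the same median.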

Function reference

Function	Purpose
`run_clustering()`	End-to-end clustering pipeline
`cluster_samples_iteratively()`	Build kNN graph + 2D embedding
`plot_distances_neighbours_with_coloring_hue()`	Heavy clustering visualization
`perform_distance_evaluation_on_ranked_proteins()`	Threshold-based pairwise classification
`optimize_parameters()`	Grid-search optimal `n` and percentile
`calculate_pairwise_distances()`	Distance matrix on top-`n` proteins
`train_with_normalise()`	Full ranking pipeline (filter → normalize → RF)
`retrieve_ranking()`	Extract ranked proteins from a trained model
`train_pairwise_balanced_rand_forest()`	Pairwise RF (3 backends)
`get_threshold()`	ROC/F1/Youden/MinFP threshold selection
`get_distances()`, `get_nearest_neighbours()`	Distance + kNN helpers
`get_sample_relations_by_cutoff()`, `get_evaluation_metrics()`	Cutoff → metrics
`percentile_cutoff()`	numpy.percentile-equivalent
`filter_by_occurrence()`, `log_transform()`, `revert_log_transform()`, `normalize_medianintensity()`, `plate_correct_residuals_by_protein()`	Preprocessing
`by_isolation_forest()`, `by_isolation_forest_plot()`, `remove_outlier_samples()`	Outlier detection (Isolation Forest). `contamination = 0.1` for sklearn-like behaviour.
`spqrp_example_data()`	Access bundled example CSVs
`check_input_data_format()`	Validate required columns

Migrating from the Python version

The R API mirrors the Python one: function names are identical snake_case. Outputs are R named lists, whose elements are accessed by name much like Python dict entries (for example, res$cluster_assignments).

Because the underlying numerical libraries differ (uwot vs umap-learn, ranger vs imblearn, solitude (wrapping ranger) vs sklearn's IsolationForest), exact numbers can differ between the R and Python versions even with matched seeds. See articles/numerical-divergence.md for which outputs are bit-exact, which match up to rotation/reflection, and which are only equivalent in expectation.
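What "match up to rotation/reflection" means for a 2D embedding can be checked with a small base-R sketch: rotating and mirroring the coordinates changes every number, yet the Euclidean distances between points stay the same. The toy embedding below is made up and stands in for the output of any embedding method.

```r
# Toy 2D embedding: one row per sample.
emb <- cbind(x = c(0, 1, 0, 2), y = c(0, 0, 1, 2))

# Rotate by 60 degrees, then mirror the first axis.
theta <- pi / 3
rot   <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
flip  <- diag(c(-1, 1))
emb2  <- emb %*% rot %*% flip

# Coordinates differ, but pairwise Euclidean distances agree.
coords_equal    <- isTRUE(all.equal(emb, emb2, check.attributes = FALSE))
distances_equal <- isTRUE(all.equal(as.vector(dist(emb)),
                                    as.vector(dist(emb2))))
```

Comparing two embeddings by their pairwise distances rather than by raw coordinates is therefore a fairer test of agreement. Note that this invariance holds for Euclidean distance; Manhattan distance is not preserved under rotation.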

License

GPL-3
