This vignette provides a conceptual overview of the statistical
methods implemented in valytics. The goal is to help you
understand what the numbers mean and how to think
about them, not to prescribe specific acceptance criteria or make
decisions for you.
Whether your analysis “passes” or “fails” depends entirely on your specific application, regulatory requirements, and clinical context. This package provides the tools; you and your organization define what constitutes acceptable agreement.
The bias (mean difference) quantifies the average systematic offset between two methods. It answers: “On average, how much higher or lower does method Y read compared to method X?”
```r
data("creatinine_serum")

ba <- ba_analysis(
  x = creatinine_serum$enzymatic,
  y = creatinine_serum$jaffe
)

cat("Bias:", round(ba$results$bias, 3), "mg/dL\n")
#> Bias: 0.174 mg/dL
cat("95% CI:", round(ba$results$bias_ci["lower"], 3), "to",
    round(ba$results$bias_ci["upper"], 3), "\n")
#> 95% CI: 0.127 to 0.22
```

What this tells you:

- On average, the Jaffe method reads about 0.17 mg/dL higher than the enzymatic method.
- The 95% CI excludes zero, so this systematic offset is unlikely to be a chance finding.

What this does NOT tell you:

- How much individual measurements can disagree; that is described by the limits of agreement below.
- Whether the offset is acceptable for your application.
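Because the bias is, by definition, just the mean of the paired differences, you can verify it by hand. A minimal check, assuming the same `creatinine_serum` columns used above:

```r
# The bias is simply the mean of the paired differences (y - x)
d <- creatinine_serum$jaffe - creatinine_serum$enzymatic
mean(d)  # should match ba$results$bias (about 0.174)
```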
The limits of agreement (LoA) define an interval expected to contain 95% of the differences between methods. They answer: “For a randomly selected sample, how much could the two methods disagree?”
```r
cat("Lower LoA:", round(ba$results$loa_lower, 3), "\n")
#> Lower LoA: -0.236
cat("Upper LoA:", round(ba$results$loa_upper, 3), "\n")
#> Upper LoA: 0.584
cat("Width:", round(ba$results$loa_upper - ba$results$loa_lower, 3), "\n")
#> Width: 0.82
```

The LoA represent the range of disagreement you can expect in practice. A narrow LoA indicates consistent agreement; a wide LoA indicates variable differences.
Key insight: The LoA are often more informative than the bias alone. Two methods might have negligible average bias but wide limits of agreement, meaning individual measurements could differ substantially.
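A minimal simulated sketch illustrates this (illustrative data only, not from the package); here the LoA are computed directly as mean ± 1.96 × SD of the differences:

```r
# Simulated example: negligible average bias, yet wide limits of agreement
set.seed(42)
x <- rnorm(100, mean = 5, sd = 1)        # method X
y <- x + rnorm(100, mean = 0, sd = 0.5)  # method Y: no systematic offset, noisy
d <- y - x
cat("Bias:", round(mean(d), 3), "\n")    # close to zero
cat("LoA:", round(mean(d) - 1.96 * sd(d), 3), "to",
    round(mean(d) + 1.96 * sd(d), 3), "\n")  # roughly -1 to 1: wide
```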
The Bland-Altman plot provides a visual assessment:
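It is presumably produced with the plot method, the same call used in the workflow example later in this vignette:

```r
plot(ba)  # Bland-Altman plot of differences vs. averages
```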
Bland-Altman plot showing differences vs. averages.
What to look for:

- Points scattered randomly around the bias line, with no trend as the average increases (a trend suggests proportional bias).
- Roughly constant spread across the measurement range (a funnel shape suggests concentration-dependent variability).
- Points outside the limits of agreement (a few are expected by construction; clusters or extreme values warrant investigation).
Bland-Altman analysis assumes normally distributed differences. The summary provides a Shapiro-Wilk test:
```r
summ <- summary(ba)
if (!is.null(summ$normality_test)) {
  cat("Shapiro-Wilk p-value:", round(summ$normality_test$p.value, 4), "\n")
}
#> Shapiro-Wilk p-value: 0
```

A low p-value suggests non-normality. Consider:

- inspecting the distribution of differences visually (see the histogram below),
- transforming the data (for example, log-transforming before computing differences), or
- interpreting the limits of agreement cautiously, since their 95% coverage assumes normally distributed differences.
```r
library(ggplot2)

ggplot(data.frame(diff = ba$results$differences), aes(x = diff)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "steelblue", alpha = 0.7) +
  geom_density(linewidth = 1) +
  labs(x = "Difference (Jaffe - Enzymatic)", y = "Density") +
  theme_minimal()
```

Distribution of differences.
Passing-Bablok regression fits a line:
Y = intercept + slope * X
The parameters have direct interpretations:

- Slope: proportional bias. A slope of 1 means the methods scale identically; a slope below 1 means Y increases more slowly than X.
- Intercept: constant bias. An intercept of 0 means no fixed offset between the methods.
```r
cat("Slope:", round(pb$results$slope, 4), "\n")
#> Slope: 0.9711
cat("  95% CI:", round(pb$results$slope_ci["lower"], 4), "to",
    round(pb$results$slope_ci["upper"], 4), "\n")
#>   95% CI: 0.9661 to 0.9741
cat("Intercept:", round(pb$results$intercept, 4), "\n")
#> Intercept: 0.2339
cat("  95% CI:", round(pb$results$intercept_ci["lower"], 4), "to",
    round(pb$results$intercept_ci["upper"], 4), "\n")
#>   95% CI: 0.2288 to 0.2387
```

How to read the confidence intervals:

- If the slope CI includes 1, there is no statistically detectable proportional bias.
- If the intercept CI includes 0, there is no statistically detectable constant bias.
- Here, both intervals exclude their null values, indicating small but statistically detectable proportional and constant bias.
You can use the regression equation to estimate expected differences at specific concentrations:
```r
# At various concentrations, what's the expected difference?
concentrations <- c(0.8, 1.3, 3.0, 6.0)
for (conc in concentrations) {
  expected_y <- pb$results$intercept + pb$results$slope * conc
  difference <- expected_y - conc
  cat(sprintf("At X = %.1f: expected Y = %.3f, difference = %.3f\n",
              conc, expected_y, difference))
}
#> At X = 0.8: expected Y = 1.011, difference = 0.211
#> At X = 1.3: expected Y = 1.496, difference = 0.196
#> At X = 3.0: expected Y = 3.147, difference = 0.147
#> At X = 6.0: expected Y = 6.060, difference = 0.060
```

This helps translate abstract regression parameters into concrete, application-specific terms.
The CUSUM test evaluates whether a linear model is appropriate:
```r
cat("CUSUM statistic:", round(pb$cusum$statistic, 4), "\n")
#> CUSUM statistic: 0.97
cat("p-value:", round(pb$cusum$p_value, 4), "\n")
#> p-value: 0.3036
```

A significant result (conventionally p < 0.05) suggests the relationship may not be linear across the measurement range. If non-linearity is detected:

- restrict the comparison to the sub-range where the relationship appears linear (a brief sketch follows the figure below),
- consider transforming the data, or
- interpret the slope and intercept cautiously, since a single straight line does not describe the whole relationship.
CUSUM plot for linearity assessment.
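As a sketch of the first remedy listed above: refit on a restricted range. The 3.0 mg/dL cutoff is purely illustrative, and the formula interface (with the reference method on the left-hand side) is assumed to behave as in the workflow example later in this vignette:

```r
# Hypothetical range restriction: keep only lower concentrations and refit.
# Orientation mirrors the workflow example (reference ~ test).
low_range <- subset(creatinine_serum, enzymatic <= 3.0)
pb_low <- pb_regression(enzymatic ~ jaffe, data = low_range)
cat("Restricted-range slope:", pb_low$results$slope, "\n")
```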
High correlation between methods is often reported but can be misleading:
```r
r <- cor(creatinine_serum$enzymatic, creatinine_serum$jaffe)
cat("Correlation coefficient:", round(r, 4), "\n")
#> Correlation coefficient: 0.9952
```

Correlation measures whether methods rank samples similarly, not whether they give the same values. Two methods with r = 1 but different calibrations would show systematic bias that correlation fails to detect.
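A quick sketch makes this concrete, using a hypothetical miscalibrated version of the enzymatic values:

```r
# Perfect correlation despite large systematic disagreement
x <- creatinine_serum$enzymatic
y_miscal <- 1.5 * x + 0.5           # exact linear rescaling of x (hypothetical)
cat("r =", cor(x, y_miscal), "\n")  # exactly 1
cat("Mean difference =", round(mean(y_miscal - x), 3), "\n")  # large bias
```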
Your results depend on:

- the concentration range covered by your samples,
- the population your samples were drawn from, and
- the measurement conditions (instruments, reagent lots, operators) during the study.
Be cautious about extrapolating beyond the conditions of your study.
A statistically significant bias (CI excludes zero) may or may not be practically important. Consider:
```r
# Example: Is a bias of X clinically meaningful?
# This depends entirely on YOUR application
bias_value <- ba$results$bias
cat("Observed bias:", round(bias_value, 3), "mg/dL\n")
#> Observed bias: 0.174 mg/dL

cat("\nWhether this is 'acceptable' depends on:\n")
#>
#> Whether this is 'acceptable' depends on:
cat("- Your specific clinical decision thresholds\n")
#> - Your specific clinical decision thresholds
cat("- Regulatory requirements for your application\n")
#> - Regulatory requirements for your application
cat("- Intended use of the measurement\n")
#> - Intended use of the measurement
cat("- Established performance goals (CLIA, biological variation, etc.)\n")
#> - Established performance goals (CLIA, biological variation, etc.)
```

Here’s how to extract key statistics for reporting:
```r
# Bland-Altman summary
cat("=== Bland-Altman Analysis ===\n")
#> === Bland-Altman Analysis ===
cat(sprintf("n = %d\n", ba$input$n))
#> n = 80
cat(sprintf("Bias: %.3f (95%% CI: %.3f to %.3f)\n",
            ba$results$bias,
            ba$results$bias_ci["lower"],
            ba$results$bias_ci["upper"]))
#> Bias: 0.174 (95% CI: 0.127 to 0.220)
cat(sprintf("SD of differences: %.3f\n", ba$results$sd_diff))
#> SD of differences: 0.209
cat(sprintf("LoA: %.3f to %.3f\n\n",
            ba$results$loa_lower,
            ba$results$loa_upper))
#> LoA: -0.236 to 0.584

# Passing-Bablok summary
cat("=== Passing-Bablok Regression ===\n")
#> === Passing-Bablok Regression ===
cat(sprintf("Slope: %.4f (95%% CI: %.4f to %.4f)\n",
            pb$results$slope,
            pb$results$slope_ci["lower"],
            pb$results$slope_ci["upper"]))
#> Slope: 0.9711 (95% CI: 0.9661 to 0.9741)
cat(sprintf("Intercept: %.4f (95%% CI: %.4f to %.4f)\n",
            pb$results$intercept,
            pb$results$intercept_ci["lower"],
            pb$results$intercept_ci["upper"]))
#> Intercept: 0.2339 (95% CI: 0.2288 to 0.2387)
cat(sprintf("CUSUM p-value: %.4f\n", pb$cusum$p_value))
#> CUSUM p-value: 0.3036
```

The valytics package provides three complementary approaches for method comparison. Each has strengths suited to different scenarios.
| Aspect | Bland-Altman | Passing-Bablok | Deming |
|---|---|---|---|
| Primary question | How well do methods agree? | Is there systematic bias? | Is there systematic bias? |
| Statistical approach | Descriptive statistics | Non-parametric regression | Parametric regression |
| Error assumption | Differences ~ Normal | Distribution-free | Errors ~ Normal |
| Outlier handling | Sensitive | Robust | Sensitive |
| Output focus | Bias, limits of agreement | Slope, intercept CIs | Slope, intercept, SEs |
| Sample size | n >= 30 recommended | n >= 30 for stable CIs | n >= 10 feasible |
| Best when | Defining acceptable agreement | Outliers present, unknown error | Known error ratio, small n |
In practice, using multiple methods provides a more complete picture:
```r
# Complete method comparison workflow
ba <- ba_analysis(reference ~ test, data = mydata)
pb <- pb_regression(reference ~ test, data = mydata)
dm <- deming_regression(reference ~ test, data = mydata)

# Bland-Altman for agreement assessment
summary(ba)
plot(ba)

# Compare regression methods
cat("Passing-Bablok slope:", pb$results$slope, "\n")
cat("Deming slope:", dm$results$slope, "\n")
```

If Passing-Bablok and Deming give similar results, you can be more confident in the conclusions. If they differ substantially, investigate why (outliers? non-normality? heteroscedasticity?).
The valytics package provides statistical tools for method comparison. It calculates:

- bias and limits of agreement (Bland-Altman),
- slope and intercept estimates with confidence intervals (Passing-Bablok and Deming regression), and
- supporting diagnostics such as the Shapiro-Wilk normality test and the CUSUM linearity test.

These statistics describe the relationship between methods. Whether that relationship is “acceptable” for your purpose is a separate question that depends on:

- your clinical decision thresholds,
- regulatory requirements for your application,
- the intended use of the measurement, and
- established performance goals (CLIA, biological variation, etc.).
The package reports what the data show. You decide what it means for your application.
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-310.
Bland JM, Altman DG. Measuring agreement in method comparison studies. Statistical Methods in Medical Research. 1999;8(2):135-160.
Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Journal of Clinical Chemistry and Clinical Biochemistry. 1983;21(11):709-720.
Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clinical Chemistry. 1973;19(1):49-57.