---
title: "A Tutorial on Mixtures of Quantile Regressions"
subtitle: "Finding hidden subgroups and their effects, for applied researchers"
author: "Kailas Venkitasubramanian, University of North Carolina at Charlotte"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    fig_caption: true
bibliography: mixqr.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{A Tutorial on Mixtures of Quantile Regressions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE,
  fig.width = 7, fig.height = 4.4, dpi = 150, fig.align = "center"
)
set.seed(2026)
```

This tutorial is written for applied researchers across education, economics,
sociology, political science, public health, and adjacent fields, who suspect
that one regression line is hiding more than one story, and who care about more
than the average. You do not need prior exposure to quantile regression or
to mixture models. Each idea is introduced in words before any equation, every
analysis step is shown in runnable code, and every result is visualized.

By the end you will be able to: recognize when a *mixture of quantile
regressions* is the right tool; fit one with **mixqr**; read and visualize the
estimates; classify observations into the recovered subgroups; check whether the
answer can be trusted; describe effects across the whole outcome distribution;
and report the analysis reproducibly.

This tutorial covers the **core** of the framework. **mixqr** is built on a single
EM platform with an engine/extension contract, so the same machinery also supports
expectile and M-quantile component families, component-specific penalized
selection, and non-crossing multi-quantile estimation (see the *Extensions*
article), while the companion package **mixqrgate** adds location-varying gating.

```{r libs}
library(mixqr)
library(ggplot2)
library(quantreg)
```

```{r theme, include = FALSE}
# A clean, consistent look used throughout.
theme_mixqr <- function() {
  theme_minimal(base_size = 12) +
    theme(panel.grid.minor = element_blank(),
          plot.title = element_text(face = "bold"),
          plot.subtitle = element_text(colour = "grey35"),
          legend.position = "top")
}
pal_seq <- c("#1b6ca8", "#e07b39", "#5aae61", "#9b59b6")
```

# 1. The problem: one line, two stories

Most regression models a single relationship: "a one-unit change in *x* moves
*y* by β." But populations are often **mixtures of unobserved subgroups**, each
with its own relationship. A few examples from the social sciences:

* **Schools.** The link between family socioeconomic status (SES) and a student's
  test score is steep in some schools and shallow in others, and you may not
  have a clean label for "which kind of school."
* **Labor markets.** Returns to education differ between a primary, credentialed
  segment and a secondary, casualized one.
* **Program evaluation.** A policy helps a "responder" subgroup and barely moves
  a "non-responder" subgroup. The average effect describes neither.

When you fit one line to a mixture, you get a number that averages away the
groups and mirrors none of them. A *finite mixture of regressions* instead
assumes the data come from a small number of latent groups (called *components*
or *regimes*) and estimates a separate regression in each, discovering the
grouping and the group-specific effects *jointly*, without a pre-existing label
[@mclachlanpeel2000].

`mixqr` does this for **quantile** regressions rather than mean regressions.
That buys two things, explained next.

# 2. Why quantiles, and why a mixture

**Why quantiles.** A quantile regression at level $\tau$ models the conditional
$\tau$-th quantile of the outcome (the median at $\tau = 0.5$, a lower tail at
$\tau = 0.1$, an upper tail at $\tau = 0.9$) instead of the mean
[@koenker1978; @koenker2005]. The median is far more robust to skew and outliers
than the mean, and fitting several quantiles reveals effects that a mean model
cannot: a predictor may lift the bottom of the distribution while leaving the top
alone.

**Why a mixture.** Within each latent regime, `mixqr` fits a quantile
regression; across regimes, a *mixing probability* says how prevalent each regime
is. Formally, with a latent class $Z \in \{1, \dots, m\}$ and
$\Pr(Z = j) = \pi_j$, the conditional $\tau$-th quantile in regime $j$ is
$$ Q_\tau(Y \mid x, Z = j) = x'\beta_j , $$
and the components are estimated by an EM algorithm so that the grouping and the
$\beta_j$ are found together [@wuyao2016].

**Why not simpler tools?**

| Approach | What it misses |
|---|---|
| OLS / single mean regression | averages the regimes away; sensitive to skew |
| A single quantile regression | robust and distributional, but still **one** line |
| Cluster first (k-means), then regress | two-stage; clusters ignore the *relationship*, and the second-stage SEs ignore the clustering uncertainty |
| `mixqr` | finds regimes **and** their quantile-specific effects jointly, with calibrated uncertainty |

# 3. An illustrative dataset

To keep everything reproducible and transparent, we *simulate* a student-
achievement example with a known structure, then let `mixqr` recover it. (You
would use your own data; the workflow is identical.)

We generate 600 students. Each belongs to one of two unobserved school regimes:

* **Supportive schools** (60% of students): a high baseline and a strong SES
  gradient.
* **Under-resourced schools** (40%): a lower baseline and a weaker gradient.

Two features make this a job for quantile methods. Scores are **right-skewed**
(a few unusually high achievers), so the median is a steadier summary than the
mean. Within each regime the **spread of scores grows with SES**, so SES affects
the top of the distribution differently from the bottom, a pattern a mean model
cannot express.

```{r simulate}
simulate_schools <- function(n = 600, pi_A = 0.6) {
  ses <- rnorm(n)                                   # standardized SES
  regime <- ifelse(runif(n) < pi_A, "A", "B")
  err <- exp(rnorm(n, 0, 0.45)) - 1                 # right-skewed, median 0
  spread <- ifelse(regime == "A", 6, 9) * exp(0.30 * ses)  # grows with SES
  score <- ifelse(regime == "A",
                  56 + 8.0 * ses,                   # supportive schools
                  46 + 3.0 * ses) + spread * err
  data.frame(score = score, ses = ses, regime = factor(regime))
}

schools <- simulate_schools()
head(schools)
```

The raw scatter already hints at two fans of points, but a single median line
(dashed) splits the difference and represents neither regime:

```{r raw, fig.height = 4.2, fig.alt = "Scatter of score against SES with a single pooled median line."}
single <- rq(score ~ ses, tau = 0.5, data = schools)

ggplot(schools, aes(ses, score)) +
  geom_point(alpha = 0.35, colour = "grey30") +
  geom_abline(intercept = coef(single)[1], slope = coef(single)[2],
              linewidth = 1.1, linetype = "dashed") +
  labs(x = "Family SES (standardized)", y = "Math score",
       title = "One median line hides the subgroups",
       subtitle = "A single quantile regression averages two regimes together") +
  theme_mixqr()
```

# 4. Fitting your first model

We fit a two-component mixture of **median** regressions. We ask for
stochastic-EM standard errors (the recommended, calibrated method; see
Section 8) and several restarts because the mixture likelihood is multimodal.

```{r fit}
fit <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2,
             nstart = 25, variance = "stochEM",
             vcontrol = mixqr_vcontrol(B = 200, seed = 1))
fit
```

Read the output as two regressions plus the mixing probabilities. `summary()`
adds standard errors, $z$-values, and the diagnostics discussed later:

```{r summary}
summary(fit)
```

The estimated components recover the design: one regime with a high intercept and
a steep SES slope (the supportive schools), and one with a lower intercept and a
shallower slope (the under-resourced schools). Components are ordered by slope, so
component 1 is always the shallower-gradient regime.

# 5. Visualizing the estimates

## 5.1 The two regimes

The single most useful picture overlays the component median lines on the data,
colored by each student's most likely regime.

```{r est-plot, fig.alt = "Score against SES, points coloured by recovered regime, with the two component median lines."}
# Components are ordered by SES slope, so component 1 is the shallower-gradient
# (under-resourced) regime and component 2 the steeper (supportive) one.
comp_label <- c("Under-resourced schools", "Supportive schools")
pal2 <- c("Under-resourced schools" = "#e07b39",
          "Supportive schools"      = "#1b6ca8")

schools$regime_hat <- factor(comp_label[predict(fit, type = "class")],
                             levels = comp_label)

xseq <- seq(min(schools$ses), max(schools$ses), length.out = 100)
lines_df <- do.call(rbind, lapply(1:2, function(j) {
  data.frame(ses = xseq,
             score = as.numeric(cbind(1, xseq) %*% fit$beta[, j]),
             regime = factor(comp_label[j], levels = comp_label))
}))

ggplot(schools, aes(ses, score)) +
  geom_point(aes(colour = regime_hat), alpha = 0.45, size = 1.8) +
  geom_line(data = lines_df, aes(colour = regime), linewidth = 1.3) +
  scale_colour_manual(values = pal2, name = "Recovered regime") +
  labs(x = "Family SES (standardized)", y = "Math score",
       title = "Two SES–achievement gradients, recovered jointly",
       subtitle = "Component median regressions from mixqr") +
  theme_mixqr()
```

## 5.2 A coefficient comparison

For reporting, a coefficient plot with confidence intervals communicates the
substantive contrast between regimes at a glance. We pull the component
coefficients and their intervals straight from `confint()`.

```{r coef-plot, fig.height = 3.6, fig.alt = "Coefficient estimates with 95% confidence intervals for each regime, by term."}
ci <- confint(fit)                      # Wald intervals from stochastic-EM SEs
keep <- grep("^comp", rownames(ci))
coef_df <- data.frame(
  label  = rownames(ci)[keep],
  est    = c(as.numeric(fit$beta)),
  lo     = ci[keep, 1], hi = ci[keep, 2]
)
comp_idx <- as.integer(sub("comp(\\d+):.*", "\\1", coef_df$label))
coef_df$regime <- factor(comp_label[comp_idx], levels = comp_label)
coef_df$term <- sub("comp\\d+:", "", coef_df$label)
coef_df$term <- factor(coef_df$term, labels = c("Intercept", "SES slope"))

ggplot(coef_df, aes(est, regime, colour = regime)) +
  geom_pointrange(aes(xmin = lo, xmax = hi), linewidth = 0.9, size = 0.7) +
  facet_wrap(~ term, scales = "free_x") +
  scale_colour_manual(values = pal2, guide = "none") +
  labs(x = "Estimate (95% CI)", y = NULL,
       title = "Regime-specific effects with uncertainty",
       subtitle = "The SES gradient is far steeper in the supportive regime") +
  theme_mixqr()
```

# 6. Who is in which group?

Mixture models give **soft** memberships: each observation has a posterior
probability of belonging to each regime. Hard labels (`type = "class"`) take the
most likely regime, but the probabilities themselves are informative: many
points are assigned with near-certainty, a few are ambiguous.

```{r posterior, fig.height = 3.6, fig.alt = "Histogram of the probability of belonging to regime A."}
post <- predict(fit, type = "posterior")
schools$p_supportive <- post[, 2]       # component 2 = supportive (steeper) regime

ggplot(schools, aes(p_supportive)) +
  geom_histogram(bins = 30, fill = "#1b6ca8", colour = "white") +
  labs(x = "Posterior probability of the supportive regime",
       y = "Students",
       title = "Most students are classified with confidence",
       subtitle = "Mass near 0 and 1 means confident assignments") +
  theme_mixqr()
```

Here about `r round(100 * mean(apply(post, 1, max) > 0.9))`% of students are
assigned to a regime with probability above 0.9. Only a few sit in the ambiguous
middle. The package summarizes the separation in one number, the
**responsibility overlap** (0 = perfectly separated, 1 = fully overlapping),
moderate in this example:

```{r overlap}
fit$diagnostics$overlap
```

# 7. Diagnostics: can you trust it?

Two checks and a caveat matter for a mixture of quantile regressions.

## 7.1 The component error densities

`mixqr`'s semiparametric engine estimates each regime's error distribution
non-parametrically, constrained so its $\tau$-th quantile is exactly zero
[@hallpresnell1999]. Plotting these densities shows the shape of each regime's
residuals and confirms the constraint (the mass below 0 equals $\tau$).

```{r densities, fig.height = 3.8, fig.alt = "Estimated component error densities with the constraint marker at zero."}
fit_kd <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2,
                engine = "kdEM", init = "ald", nstart = 15)

rng <- range(vapply(fit_kd$density, function(d) range(d$grid), numeric(2)))
tt <- seq(rng[1], rng[2], length.out = 400)
dens_df <- do.call(rbind, lapply(1:2, function(j) {
  data.frame(resid = tt, density = fit_kd$density[[j]]$eval(tt),
             regime = factor(comp_label[j], levels = comp_label))
}))

ggplot(dens_df, aes(resid, density, colour = regime)) +
  geom_line(linewidth = 1.1) +
  geom_vline(xintercept = 0, linetype = 3, colour = "grey40") +
  scale_colour_manual(values = pal2, name = "Regime") +
  labs(x = "Residual", y = "Density",
       title = "Each regime has its own (skewed) error distribution",
       subtitle = "Estimated by the Wu & Yao kernel-density EM; tau-quantile fixed at 0") +
  theme_mixqr()
```

## 7.2 Two engines agree

The fast `"ald"` engine assumes a parametric (asymmetric Laplace) error; the
`"kdEM"` engine estimates the error shape. When they give similar component
lines, you can trust the parametric default; when they diverge, the data have
structure the parametric model misses. Here they agree closely.

```{r engines}
rbind(ALD = as.numeric(fit$beta), kdEM = as.numeric(fit_kd$beta))
```

## 7.3 A caveat worth knowing

The semiparametric estimator can be **biased when the regimes overlap heavily and
the errors are strongly asymmetric** (a property of the estimating equations, not
a bug; see @wuyao2016, sec. 6). The overlap diagnostic above is your early
warning, and cross-checking the two engines is the practical guard. With
well-separated regimes (the common, useful case), the estimates are reliable.

# 8. Inference done right

`mixqr` offers two standard-error methods:

* `variance = "sparsity"` — fast, but **conditional on the classification**: it
  treats group membership as known and therefore *understates* uncertainty.
  `summary()` flags this.
* `variance = "stochEM"` — a stochastic-EM multiple-imputation estimator
  [@wuyao2016, Alg. 3.1] that propagates the classification and mixing
  uncertainty. Use this method for inference. A Monte-Carlo benchmark (see the
  [Validation article](https://kvenkita.github.io/mixqr/articles/mixqr-validation.html))
  shows its intervals reach roughly nominal 95% coverage for the regression coefficients.

```{r confint}
confint(fit)
```

One caveat: the **mixing-probability** estimate carries a finite-sample bias, so
its interval is correctly sized but can sit off-center. Here the true split is
60/40, yet the estimate is about
`r round(100 * fit$pi[2])`/`r round(100 * fit$pi[1])` — a live reminder to report
$\pi$ as an approximate split, not to the decimal.

# 9. Beyond the median: the whole distribution

The real payoff of *quantile* mixtures is describing each regime across the
outcome distribution, beyond its median. We refit at $\tau = 0.1, 0.5, 0.9$
and draw each regime's conditional-quantile lines. (Components are slope-ordered,
so they align across $\tau$.)

```{r multitau, fig.height = 4.6, fig.alt = "Conditional quantile lines at tau = 0.1, 0.5, 0.9 for each regime."}
taus <- c(0.1, 0.5, 0.9)
multi <- do.call(rbind, lapply(taus, function(tt) {
  f <- mixqr(score ~ ses, data = schools, tau = tt, m = 2, nstart = 15)
  do.call(rbind, lapply(1:2, function(j) {
    data.frame(ses = xseq,
               score = as.numeric(cbind(1, xseq) %*% f$beta[, j]),
               regime = factor(comp_label[j], levels = comp_label), tau = tt)
  }))
}))

ggplot(schools, aes(ses, score)) +
  geom_point(alpha = 0.18, colour = "grey45", size = 1.2) +
  geom_line(data = multi,
            aes(colour = regime, group = interaction(regime, tau),
                linewidth = factor(tau))) +
  scale_colour_manual(values = pal2, name = "Regime") +
  scale_linewidth_manual(values = c("0.1" = 0.5, "0.5" = 1.2, "0.9" = 0.5),
                         name = "Quantile") +
  labs(x = "Family SES (standardized)", y = "Math score",
       title = "Conditional quantile bands by regime",
       subtitle = "Wider bands at high SES reveal growing within-regime inequality") +
  theme_mixqr()
```

The vertical spread between a regime's 10th- and 90th-percentile lines is its
*within-regime inequality* at a given SES. Where that gap widens with SES, the
top and bottom of the regime pull apart — a distributional finding invisible to a
mean model. (Fitting several $\tau$ separately can in principle let a regime's
quantile lines cross; jointly enforcing non-crossing is the role of a companion
extension to `mixqr`.)

# 10. How many groups?

We assumed two regimes. In practice you choose `m`. `mixqr_select()` compares
component counts. Cross-validated predictive log-likelihood (`"cv"`) is engine-
agnostic and penalizes complexity; `AIC`/`BIC` are available too (with the usual
caveat that mixture-component counts sit on a parameter-space boundary).

```{r select, fig.height = 3.6, fig.alt = "Cross-validated score against number of components."}
sel <- mixqr_select(score ~ ses, data = schools, tau = 0.5, m = 1:4,
                    criterion = "cv", folds = 5, nstart = 8,
                    control = mixqr_control(seed = 1))
sel$table

ggplot(sel$table, aes(m, cv_negloglik)) +
  geom_line(colour = "#1b6ca8") +
  geom_point(size = 3, colour = "#1b6ca8") +
  geom_point(data = sel$table[sel$table$m == 2, ],
             size = 5, shape = 21, fill = "#e07b39", colour = "#e07b39") +
  labs(x = "Number of components (m)", y = "CV held-out negative log-likelihood",
       title = "Choosing the number of regimes",
       subtitle = "Lower is better; the elbow at two regimes is the parsimonious choice") +
  theme_mixqr()
```

The sharp drop from one to two components, then the near-flat stretch beyond, is
the signature of two genuine regimes. The strict cross-validation minimum here
falls at m = `r sel$best` (`r round(min(sel$table$cv_negloglik))` versus
`r round(sel$table$cv_negloglik[2])` at m = 2 — a rounding-error apart, while
m = 1 is far worse). Read the **elbow**, not the bare minimum: two regimes is the
parsimonious, defensible choice.

# 11. Reporting and reproducibility

A complete, reproducible write-up needs little:

* the quantile level $\tau$, the number of components `m` and how it was chosen,
  the engine, and the number of restarts;
* component coefficients with **stochastic-EM** intervals (Section 8);
* the overlap / separability diagnostic (Section 6) and, ideally, the two-engine
  cross-check (Section 7);
* a seed.

```{r report, eval = FALSE}
# everything needed to reproduce the headline fit
fit <- mixqr(score ~ ses, data = schools, tau = 0.5, m = 2,
             nstart = 25, variance = "stochEM",
             vcontrol = mixqr_vcontrol(B = 200, seed = 1))
summary(fit)
```

That is the whole arc. One model call turned "one line, two stories" into two
regimes, their quantile-specific effects, calibrated intervals, and diagnostics
that say when to trust them. The same few lines of code scale from this teaching
example to your own data.

# Citation

If `mixqr` supports your research, please cite the package and the underlying
method:

```{r cite, eval = FALSE}
citation("mixqr")
```

> Venkitasubramanian, K. (2026). *mixqr: Finite Mixtures of Quantile
> Regressions*. R package version 0.1.0.9000.
> <https://github.com/kvenkita/mixqr>
>
> Wu, Q. & Yao, W. (2016). Mixtures of quantile regressions. *Computational
> Statistics & Data Analysis*, 93, 162–176.

# References
