The package quantregGrowth
aims to estimate the smooth,
but unspecified, effect of numerical covariate(s) on one or more
quantiles of the numerical response variable. The quantile regression
equation can also include linear effect(s) of numerical or categorical
covariates. Each flexible and unspecified relationship is expressed via
a B-spline basis whose coefficients are somehow shrunken to control the
wiggliness of the curve(s). The amount of smoothness is estimated as
part of the model fitting as described in Muggeo
et al. (2021).
The package relies on the well known quantreg
package by
Roger Koenker by exploiting full functionality and efficiency of its
fitter functions.
When focus is on multiple response quantiles (1%, 5%,…, 95%, say) depending on a single variable, the resulting fitted quantile curves are also referred as growth charts. Growth charts are smooth quantile trajectories at different probability values, ‘tau’, with the noncrossing property, namely the quantile curve at any specified tau value has to be always higher then the quantile curve at any lower probability value. Growth charts are used to build reference values of an outcome of interest with respect to another variable, usually age or time. Typical examples include height (or weight) or vocabulary size development versus the children’s age as investigated in wordbank website.
quantregGrowth
uses B-splines with a penalty on the
coefficients and relevant smoothing parameter `selected’ within the
estimation algorithm, and imposes proper constraints to prevent quantile
crossing.
We build growth charts of height versus age using the data set
SiChildren
shipped with the package. The main function is
gcrq()
which requires the usual model formula including the
ps()
function on the right hand side. gcrq()
returns quantile curves at (default) probability values
c(.1, .25, .5, .75, .9)
, but in this example we choose
different values via the argument tau
, and display the
result by means of the plot
method function:
library(quantregGrowth)
data(SiChildren)
o <-gcrq(height ~ ps(age), tau=seq(.05,.95,l=7), data=SiChildren)
plot(o, res=TRUE, col=-1) #res=TRUE displays data too; col<0 for the default palette
In addition to the graphical output, we could look at the estimated
quantile values via the function charts()
which is a simple
wrapper of predict.gcrq()
to facilitate production of
growth charts based on the fitted model. We display the quantiles just
at some age values
charts(o, k=c(10,10.5,11,16,17)) #the quantile at the specified k values
> 0.05 0.2 0.35 0.5 0.65 0.8 0.95
> age=10 135.24 138.35 141.05 143.00 144.23 146.29 149.83
> age=10.5 133.02 136.86 140.29 142.25 144.00 146.85 151.83
> age=11 133.41 137.75 141.17 143.14 145.24 148.52 154.43
> age=16 154.36 160.47 164.16 168.39 170.16 172.18 176.94
> age=17 159.20 163.33 165.38 171.00 171.64 172.47 175.38
Results look reasonable, but there are some annoying features which need to be stressed: at around age=16 the highest quantile curve (the 95th) decreases, while some quantile curves (the 5th to the 65th) drop at early ages around 11 years and then increase again afterwards.
Broadly speaking, non monotonic growth charts could be biologically
sound (for instance when modelling the body mass index), but in this
example they are not. Likely, the sparseness of data on the left and
right tails of age distribution, causes the drops in the fitted curves.
Hence we re-fit the model by imposing positive monotonicity contraints
via the argument monotone=1
in ps()
,
oM<-gcrq(height ~ ps(age, monotone=1), data=SiChildren, tau=seq(.05,.95,l=7))
In the following, we report two plots: the first one focuses on the
early ages by stressing differences between the fitted curves, and the
second figure could be possibly closer to the usual growth charts, with
legend (set at the age value overlap=15
), underlying grid
(with 15 vertical and 10 horizontal lines) and `customized’ axis
labels.
par(mfrow=c(1,2))
#the 1st plot..
plot(o, xlim=c(10.2,12.5), col=1, lty=3, lwd=2)
plot(oM, add=TRUE, col=2, lty=1, lwd=2)
#the 2nd..
plot(oM, legend=TRUE, overlap=15, grid=list(x=15,y=10), col=2, lty=1,
ylab="Height (cm)", xlab="Age (years)")
Several options are available when plotting a gcrq
object, see ?plot.gcrq
.
quantregGrowth
can deal with multiple smooth terms
straightforwardly. We use data simulated via mgcv::gamSim
function (example 1). We are interested in modelling nonparametrically
the effect of four covariates on the median, say. At this aim, we call
gcrq()
with several ps()
terms and a single
tau
value,
set.seed(1515)
d<-mgcv::gamSim(n=200, eg=1, verbose=FALSE) #verbose=FALSE just suppresses the message..
o <- gcrq(y ~ ps(x0) + ps(x1) + ps(x2) + ps(x3), data=d, tau=.5)
The plots including the fitted relationships for all 4 terms are ready obtained by
# Plot the fit
par(mfrow=c(2,2))
plot(o, res=TRUE, col=2, conf.level=.95, shade=TRUE, cex.p=.6) #cex.p<1 to reduce the points..
where the y-label on each plot displays the edf of each smooth term.
Note, when there are multiple covariates,
res=TRUE
portrays
the partial residuals (and not the observed values).
gcrq()
can also include standard linear terms: for
instance, the above plots suggest that a simple linear term would
suffice to capture the relationships for x1
and
x3
. Therefore in the next model formula we include these
variables outside the ps()
function. We also display the
model output via summary.gcrq()
.
o1<-gcrq(y ~ ps(x0) + x1 + ps(x2) + x3, data=d, tau=.5)
#the summary method
summary(o1)
> Call:
> gcrq(formula = y ~ ps(x0) + x1 + ps(x2) + x3, tau = 0.5, data = d)
>
> -------- Percentile: 0.50 check function: 157.69 -----
>
> parametric terms:
> Est StErr |z| p-value
> (Intercept) 4.0077 0.6471 6.19 5.88e-10 ***
> x1 6.7935 0.8178 8.31 < 2e-16 ***
> x3 0.6163 0.8100 0.76 0.447
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> smooth terms:
> edf
> ps(x0) 2.000
> ps(x2) 4.999
> =========================
>
> No. of obs = 200 No. of params = 25 (25 for each quantile)
> Overall check = 157.69 SIC = -0.105 on edf = 10 (ncross constr: FALSE)
For the parametric terms, the summary method returns point estimates, standard errors based on the sandwich formula, and \(p\)-values coming from the Wald statistic. Degrees of freedom for the smooth terms are printed next, along with some useful information of the fit.
Of course we could estimate the same ‘semiparametric’ model at several probability values
o2<-gcrq(y ~ x1 + x3 + ps(x0) + ps(x2), data=d, tau=c(.12,.25,.5,.7,.8,.9))
with the non-crossing constraints. Summary and plots can be obtained straightforwardly.
A particular case of ‘semiparametric’ model is given by the so-called
varying coefficient (VC) model, wherein the linear effect of a (even
categorical) covariate, \(x_1\) say,
depends flexibly on a continuous variable \(x_2\). The relevant term in the linear
predictor can be written \(\beta(x_2)x_1\), where \(\beta(\cdot)\) is the smooth expressed via
B-splines. We simulate data via the gamSim()
function
again.
d<-mgcv::gamSim(n=200, eg=3, scale=1) #simulated data with VC effect
> Continuous `by' variable example
and the fit is obtained by means of the argument by
in
ps()
,
o <- gcrq(y~ x1 + ps(x2, by=x1), data=d, tau=.5)
It should be noted we also specify the linear term x1
,
since ps()
makes the basis identifiable. The fitted curve
can be displayed as usual via plot.gcrq()
.
When the linear covariate x1
is categorical, e.g. the
gender, fitting VC terms means to fit the smooth effect of
x2
into each category of x1
. We illustrate
that with a simulated dataset
n=50
x0<-x1<-x<-1:n/n
y0<-10+sin(2*pi*x) #sinusoidal relationship + intercept
y1<-seq(7,11,l=n) #linear relationship + intercept
sigma<- .2
y<-c(y0,y1)+rnorm(2*n)*sigma
x<-c(x,x)
z<-rep(0:1, each=n) #numeric variable
g <-factor(z) #factor
o<-gcrq(y ~ g + ps(x, by=g), tau=.5)
Again, the same model could be obtained using a full B-spline in each
group,
i.e. gcrq(y ~ 0 + ps(x, by=g, dropc=FALSE, center=FALSE), tau=.5)
.
We compare the true and fitted quantile curves in the next plots
par(mfrow=c(1,2))
plot(x0, y[1:50]);lines(x0,y0)
plot(o, term=1, add=TRUE, col=2, lwd=2)
plot(x1, y[51:100]);lines(x1,y1)
plot(o, term=2, add=TRUE, col=3, lwd=2, shift=coef(o)["g1",1])
To overlap the lines in the second group, we add the estimated
coefficient of the corresponding category (
g1
) to the
fitted quantile curve (via the argument shift
of
plot.gcrq()
. When the variable in by
is
categorical, gcrq()
automatically knows that
category-specific curves have to be fitted. However it could be probably
instructive to fit the same model using the numerical covariate
z
, namely
o1<-gcrq(y ~ z + ps(x) + ps(x, by=z), tau=.5)
Note the plot of the VC term ps(x, by=z)
represents the
difference between the smooth relationships y1-y0
, bar a
constant depending on the intercepts.
gcrq()
can fit purely linear models, i.e. with no
ps()
term in the formula. Here a simple example.
n=100
x<-1:n/n
y<-2*x+rnorm(n)*rev(x)
o3 <-gcrq(y ~ x, tau=seq(.05,.95,l=10)) #fit 10 (noncrossing) regression quantiles
By means of plot.gcrq()
with proper arguments, we
display in the first plot the linear growth charts, and in the 2nd
figure, we portray the tau-varying coefficient by setting
axis.tau=TRUE
,
par(mfrow=c(1,2))
plot(x,y)
plot(o3, add=TRUE)
plot(o3, term="x", axis.tau=TRUE, conf.level = .95, col=2)
The 2nd plot could be useful to assess if and how the specified coefficient estimate changes across the quantile curves.
A possibly useful feature of quantregGrowth
is supplying
a user-defined (multiplicative) penalty via the argument
pen.matrix
in ps()
. The penalty matrix \(A\), say, should be a matrix such that
\(\lambda||A\beta||_1\) is the
penalization in the objective to be minimized. \(\beta\) is the vector of spline
coefficients and \(\lambda\) is the
smoothing parameter fixed or to be estimated. The example below
illustrates how a proper penalty matrix can be set to force a zero slope
in the fitted curve at large values of the covariate. At this aim we
consider weighted first-order differences by assigning a very large
weight to the rightmost differences to fulfill the zero slope
constraint. Overall, the penalty is \(\sum_j^{J-1} |\beta_j-\beta_{j-1}|w_j=||diag(w)
D^{(1)}\beta||_1= ||A\beta||_1\), where \(D^{(1)}\) is the first-order differences
matrix and \(w\) is the weight vector
as defined below. To build the customized penalty matrix with
appropriate dimension, it is probably helpful to set explicitly the
number of the basis coefficients via the arguments ndx
(number of covariate intervals which defaults to \(min(n/4,12)\)) and deg
(spline
degree which defaults to 3): Each basis has ndx+deg-1
identifiable coefficients.
data(growthData)
D1 <- diff(diag(23),dif=1)[,-1] #the 1st order diff matrix
w <- c(rep(1,15),rep(1000,7)) #the weight vector s.t. length(w)=nrow(D1)
A <- diag(w) %*% D1 #the penalty matrix
o4 <-gcrq(y~ps(x, ndx=20, pen.matrix=A), data=growthData, tau=.5)
The length of zero slope depends on number of large values in \(w\), in the above example
rep(1000,7)
. Increasing (decreasing) the number of large
values would lead to increase (decrease) the zero slope interval.
Figure below reports results of the aforementioned constrained fit, along with the fits obtained using unweighted first order-differences without and with additional monotonicity constraints
o5 <-gcrq(y~ps(x, d=1), data=growthData, tau=.5)
o6 <-gcrq(y~ps(x, d=1, monotone = 1), data=growthData, tau=.5)
par(mfrow=c(1,3))
plot(o4, res=TRUE, col=2, lwd=3, ylim=c(0,12))
plot(o5, res=TRUE, col=3, lwd=3, ylim=c(0,12))
plot(o6, res=TRUE, col=4, lwd=3, ylim=c(0,12))
We conclude with a simple example illustrating how
quantregGrowth
can be used to perform variable selection,
namely when we are interested in spotting, typically few, important
covariates among a large number of candidates. Thus the response below
depends on the variables in the columns 3, 5, and 11 of the matrix
X
and on the linear covariate z
which enter
the model unpenalized
n <- 50
z <-1:n/n
p <-20
X<-matrix(runif(n*p),n,p)
true.coef <-rep(0,p)
true.coef[c(3,5,11)] <- c(.7,1.5,-1)
y<-5 + 1.5*z+ drop(X%*%true.coef) + rnorm(n)*.25
In order to discard noisy covariates we use the lasso penalty on
corresponding coefficients. More specifically, the penalized objective
to be minimized is \(\sum_i\rho_\tau
(y_i-\beta_0-\beta_1z_i-\sum_j\gamma_jx_{ij})+\lambda
\sum_j|\gamma_j|\), where \(\rho_\tau(\cdot)\) is the usual check
function. Note the lasso penalty refers to the coefficients of the \(x_j\)’s only; \(\beta_0\) and \(\beta_1\) are unpenalized. The model is
fitted straightforwardly by including the matrix X
in
ps()
, and the unpenalized covariate in the main
formula.
o <-gcrq(y~ z + ps(X), tau=.5)
plot(o, term=1)
The
plot
method returns the estimated coefficients relevant
to the columns of X
.
It is also possible to fit multiple quantile curves and to display to results accordingly
o <-gcrq(y ~ z + ps(X))
par(mfrow=c(1,2))
plot(o, term=1, legend=TRUE)
plot(o, term=2)
Note in the left plot, each symbol refers to a specific quantile curve.
Muggeo V.M.R., Torretta F, Eilers P.H.C., Sciandra M., Attanasio M. (2021). Multiple smoothing parameters selection in additive regression quantiles, Statistical Modelling, 21, 428 - 448. https://journals.sagepub.com/doi/full/10.1177/1471082X20929802
Muggeo V.M.R. (2021). Additive Quantile regression with automatic smoothness selection: the R package quantregGrowth. https://www.researchgate.net/publication/350844895_Additive_Quantile_regression_with_automatic_smoothness_selection_the_R_package_quantregGrowth