A linear model with non-constant variances

Posthuma Partners

2017-02-16

The package

The ‘lmvar’ package fits a linear model in which the assumption of homoscedasticity (i.e., the variance is the same for all observations) is dropped. Instead, the variance has its own model, comparable to the model for the expectation value.

The fit results in an ‘lmvar’ object, which is a list of class ‘lmvar’. Accessor functions are provided to extract the list members such as the fitted betas and the log-likelihood for the model. Various utility functions such as residuals to calculate residuals, AIC to calculate the AIC, fitted to obtain expected values and standard deviations, etc., are also provided by the package.

The package lacks much of the sophistication of the functions ‘lm’ and ‘glm’. On the bright side, this means it is simple to use. It is intended for people who run a classical linear model and want to see what happens if the restriction of a constant variance is dropped. Questions in this context are: does allowing for heteroscedasticity result in a better fit, lower values for the AIC or BIC, smaller errors from a cross-validation, etc.?

The model

The package fits the following model. A vector \(Y\) of observations (sometimes called ‘responses’) of length \(n\) is a stochastic vector. It is distributed according to a multivariate Gaussian distribution:

\[\begin{equation} Y \sim \mathcal{N}_n( \mu, \Sigma), \end{equation}\] where \(\mu\) is the vector of expectation values and \(\Sigma\) the covariance matrix (also called the ‘variance-covariance matrix’). Just like in the standard linear model, the covariance matrix is taken to be an \(n \times n\) diagonal matrix but, contrary to the standard linear model, the diagonal entries need not all be the same: \[\begin{equation} \Sigma_{ij} = \begin{cases} 0 & i \neq j\\ \sigma_i^2 & i=j \end{cases} \end{equation}\]

Model for the expectation values

As in the classical linear model, the vector of expectation values \(\mu\) is given by \[\begin{equation} \mu = X_\mu \beta_\mu \end{equation}\]

where \(X_\mu\) is the ‘model matrix’ or ‘design matrix’ for \(\mu\) and \(\beta_\mu\) the parameter vector for \(\mu\). \(X_\mu\) is an \(n \times k_\mu\) matrix and \(\beta_\mu\) a vector of length \(k_\mu\).

Model for the variances

Let \(\sigma\) denote the vector \((\sigma_1, \dots, \sigma_n)\). The model for \(\sigma\) is \[\begin{equation} \log \sigma = X_\sigma \beta_\sigma \end{equation}\]

where \(\log \sigma\) stands for the vector \((\log\sigma_1, \dots, \log\sigma_n)\), \(X_\sigma\) is the ‘model matrix’ or ‘design matrix’ for \(\sigma\) and \(\beta_\sigma\) the parameter vector for \(\sigma\). The logarithm is taken to be the ‘natural logarithm’ with base \(e\). The dimensions of \(X_\sigma\) are \(n \times k_\sigma\) and \(\beta_\sigma\) is a vector of length \(k_\sigma\).

Also know that…

The vector of observations \(Y\) and the matrices \(X_\mu\) and \(X_\sigma\) are specified by the user. They must contain real values. The fit returns the maximum-likelihood estimators for \(\beta_\mu\) and \(\beta_\sigma\). They are also real-valued vectors.
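For reference, the log-likelihood that these estimators maximize follows directly from the Gaussian density with the diagonal covariance matrix defined above. Writing \(y_i\) for the observed value of the \(i\)-th element of \(Y\) (a notation introduced here for illustration only), it reads

\[\begin{equation} \log \mathcal{L}(\beta_\mu, \beta_\sigma) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n \log \sigma_i - \frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{\sigma_i^2}, \end{equation}\]

with \(\mu = X_\mu \beta_\mu\) and \(\log \sigma = X_\sigma \beta_\sigma\).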

The model for both \(\mu\) and \(\sigma\) contains an intercept term. That means that the first column of both matrices is a column in which each matrix element equals 1. The package will add this column to the user-supplied matrices to ensure that the intercept term is always present. There is no need for a user to include such a column in a user-supplied model matrix.

After adding the intercept column, the package will check whether the resulting matrices are full rank. If not, columns will be removed from each matrix until it is full rank.

The addition of an intercept column and, possibly, the removal of columns to obtain a full-rank matrix, imply that the actual matrices used in the fit can be different from the user-specified matrices. The matrices that are actually used in the fit are returned as members of the lmvar object.
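The idea of adding an intercept column and reducing a matrix to full rank can be illustrated with base R. The sketch below is only an illustration on a made-up matrix `X_user`; the text above does not specify which columns lmvar removes, so the column selection via the pivoting of `qr` is an assumption and not necessarily what the package does.

```r
# Hypothetical user-supplied matrix: column 'b' is exactly twice column 'a'
X_user <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8), c = c(1, 0, 1, 0))

# Prepend the intercept column in which every element equals 1
X_full <- cbind("(Intercept)" = 1, X_user)

# The QR decomposition gives the rank; here it is smaller than the
# number of columns, so the matrix is not full rank
qr_full <- qr(X_full)
qr_full$rank < ncol(X_full)

# Keep a set of linearly independent columns, in their original order
keep <- sort(qr_full$pivot[seq_len(qr_full$rank)])
X_used <- X_full[, keep, drop = FALSE]

# The reduced matrix is full rank
qr(X_used)$rank == ncol(X_used)
colnames(X_used)
```

In an actual fit there is no need to do this by hand: the matrices that lmvar ends up using are stored in the lmvar object.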

Carrying out the fit boils down to solving a set of non-linear equations. This is carried out by the function nleqslv from the package with the same name.
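To get a feeling for what such a fit involves, the following base-R sketch simulates data from the model and maximizes the log-likelihood numerically. Note the assumptions: it uses the general-purpose optimizer `optim` instead of `nleqslv`, and the data, the 'true' parameter values and all variable names are made up for this illustration. It is not the implementation used by lmvar.

```r
set.seed(1)

# Simulate data with log(sigma) linear in x (values chosen for illustration)
n <- 500
x <- runif(n)
X_mu <- cbind(1, x)       # model matrix for mu, intercept column included
X_sigma <- cbind(1, x)    # model matrix for log(sigma), intercept column included
beta_mu_true <- c(1, 2)
beta_sigma_true <- c(-1, 1)
mu_true <- drop(X_mu %*% beta_mu_true)
sigma_true <- exp(drop(X_sigma %*% beta_sigma_true))
y <- rnorm(n, mean = mu_true, sd = sigma_true)

# Negative log-likelihood as a function of c(beta_mu, beta_sigma)
k_mu <- ncol(X_mu)
negloglik <- function(par) {
  beta_mu <- par[seq_len(k_mu)]
  beta_sigma <- par[-seq_len(k_mu)]
  mu <- drop(X_mu %*% beta_mu)
  sigma <- exp(drop(X_sigma %*% beta_sigma))
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Start from the classical linear model: lm coefficients and a constant sigma
start <- c(coef(lm(y ~ x)), log(sd(y)), 0)
opt <- optim(start, negloglik, method = "BFGS")

# Estimates in the order beta_mu, beta_sigma; compare with the true values
opt$par
```

Starting from the classical linear model is a natural choice here, because that model is the special case in which all elements of \(\beta_\sigma\) except the intercept are zero.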

More mathematical details about the model can be found in the vignette ‘Math’ which comes with this package. It can be viewed with vignette("Math") or vignette("Math", package="lmvar").

Using the package

The main function in the package is lmvar. It carries out a fit and returns an lmvar object.

The user must specify a vector of observations and two model-matrices when calling lmvar. They must meet the following conditions:

Each column in \(X_\mu\) corresponds to an element in \(\beta_\mu\). The name of that element is the corresponding column name. The same is true for \(X_\sigma\) and \(\beta_\sigma\).

It can happen that nleqslv fails to solve the maximum-likelihood equations. Sometimes the problem can be traced back to columns in \(X_\sigma\) with many zeros, i.e., covariates (or factor levels) for \(\sigma\) which affect only a few observations. Removal of these columns may remedy the issue.

An lmvar object is a list whose members are intended to be extracted with the supplied accessor and utility functions. The only members for which no such functions have been implemented are:

Once lmvar has run and an lmvar object has been created, one can obtain \(\beta_\mu\) and \(\beta_\sigma\) with the function coef. The function fitted allows one to obtain \(\mu\) and \(\sigma\). We refer to the package documentation (in particular the package index, which can be viewed with help(package = "lmvar")) for a list of all available functions and function details.

Demonstration

We demonstrate the package with the help of the dataframe attenu which can be found in the datasets package.

library(lmvar)

# As an example we use the dataset 'attenu' from the package 'datasets'. The dataset contains
# the response variable 'accel' and two explanatory variables 'mag' and 'dist'.
library(datasets)

# Create the model matrix for the expected values
X = cbind(attenu$mag, attenu$dist)
colnames(X) = c("mag", "dist")

# Create the model matrix for the standard deviations.
X_s = cbind(attenu$mag, 1 / attenu$dist)
colnames(X_s) = c("mag", "dist_inv")

# Carry out the fit
fit = lmvar(attenu$accel, X, X_s)

We have now created the object fit which is our object of class lmvar. To obtain a first impression of the fit, we look at the summary

summary(fit)
## Call: 
##  lmvar(y = attenu$accel, X_mu = X, X_sigma = X_s)
## 
## Standardized residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4679 -0.6615 -0.1132  0.5736  2.9969 
## 
## Coefficients:
##                  Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)   -0.14518878  0.06367414 -2.2802    0.0226 *  
## mag            0.05436925  0.01137133  4.7813 1.742e-06 ***
## dist          -0.00129047  0.00014701 -8.7779 < 2.2e-16 ***
## (Intercept_s) -4.33049620  0.44747494 -9.6776 < 2.2e-16 ***
## mag_s          0.28918112  0.07286834  3.9685 7.231e-05 ***
## dist_inv_s     3.14861277  0.23985317 13.1273 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Standard deviations: 
##     Min      1Q  Median      3Q     Max 
##  0.0634  0.0775  0.0920  0.1095 46.8245 
## 
## Comparison to model with constant variance (i.e. classical linear model)
## Log likelihood-ratio: 31.38197 
## Additional degrees of freedom: 2 
## p-value for difference in deviance: 2.35e-14

The first line shows the call that created fit. Next comes a summary of the distribution of the standardized residuals. If the model is adequate, they follow approximately a standard normal distribution, so ideally the first quartile (1Q) is approximately -0.67, the median 0 and the third quartile (3Q) 0.67.
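The reference values for the quartiles are those of the standard normal distribution. They can be checked with base R:

```r
# Quartiles of the standard normal distribution
q <- qnorm(c(0.25, 0.5, 0.75))
q
## [1] -0.6744898  0.0000000  0.6744898
```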

Next, the summary shows the matrix with the coefficients \(\beta_\mu\) and \(\beta_\sigma\). The coefficients \(\beta_\mu\) are (Intercept), mag and dist. The coefficients \(\beta_\sigma\) are (Intercept_s), mag_s and dist_inv_s. They are called this way to distinguish them from the coefficients for \(\beta_\mu\). In cases where there is no risk of confusion, their true names will be used, which are (Intercept_s), mag and dist_inv.

The matrix with coefficients shows that all coefficients are statistically significant at the 5% level.

The next piece of information gives an impression of the distribution of the standard deviations. They range from 0.0634 to 46.8245.
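These standard deviations follow from the model \(\log \sigma = X_\sigma \beta_\sigma\). As a rough cross-check that does not need the fit object, the sketch below rebuilds them from the coefficients printed in the summary. The variable names are chosen for this illustration, and the coefficients are copied by hand, so small rounding differences with the summary are to be expected.

```r
library(datasets)

# Coefficients (Intercept_s), mag_s and dist_inv_s, copied from the summary
beta_sigma <- c(-4.33049620, 0.28918112, 3.14861277)

# Model matrix for sigma as used in the fit, including the intercept column
X_sigma <- cbind(1, attenu$mag, 1 / attenu$dist)

# Invert the logarithm to obtain the standard deviations
sigma_hat <- exp(drop(X_sigma %*% beta_sigma))

# Should be comparable with the 'Standard deviations' block in the summary
range(sigma_hat)
quantile(sigma_hat)
```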

Finally, the model is compared to a classical linear model with the same model matrix \(X_\mu\) but a fixed standard deviation. The summary shows the difference in log-likelihood between the two models and the difference in degrees of freedom. Twice the difference in log-likelihood is the difference in deviance, for which a p-value is calculated. The fact that the p-value in the summary is nearly zero indicates that the lmvar fit is a better fit than the classical linear model, i.e., it makes sense to let the variance vary instead of keeping it fixed.
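The p-value can be reproduced by hand from the numbers in the summary, assuming the usual chi-squared reference distribution for the difference in deviance, with the additional degrees of freedom as its parameter. The variable names below are chosen for this illustration.

```r
# Numbers copied from the summary
log_lik_ratio <- 31.38197   # 'Log likelihood-ratio'
extra_df <- 2               # 'Additional degrees of freedom'

# Twice the difference in log-likelihood is the difference in deviance
deviance_diff <- 2 * log_lik_ratio

# Upper tail probability of the chi-squared distribution,
# approximately 2.35e-14 as reported in the summary
p_value <- pchisq(deviance_diff, df = extra_df, lower.tail = FALSE)
signif(p_value, 3)
```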

Let’s see how the standard deviations are distributed

sigma = fitted(fit, mu = FALSE)
hist(sigma)

To check the distribution of the residuals, we make another histogram.

hist(residuals(fit))

The rank of the matrix \(X_\mu\) used in the fit is

dfree(fit, sigma = FALSE)
## [1] 3

The value 3 is correct: the user-supplied matrix had 2 columns and lmvar added an intercept column. Apparently all columns are linearly independent so no column had to be removed.

To see the number of observations in the fit, the log-likelihood and the AIC value, we run

nobs(fit)
## [1] 182
logLik(fit)
## 'log Lik.' 154.7313 (df=6)
AIC(fit)
## [1] -297.4625
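The AIC is related to the log-likelihood and its degrees of freedom by the standard definition: twice the number of parameters minus twice the log-likelihood. A quick check with the printed values (variable names chosen for this illustration):

```r
# Values copied from the output of logLik(fit) and nobs(fit)
log_lik <- 154.7313
k <- 6
n_obs <- 182

# AIC = 2k - 2 log L; the last digit differs slightly from AIC(fit)
# because the printed log-likelihood is rounded
aic_by_hand <- 2 * k - 2 * log_lik
aic_by_hand

# The BIC uses log(n) instead of 2 as the penalty per parameter
bic_by_hand <- log(n_obs) * k - 2 * log_lik
bic_by_hand
```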

The coefficients \(\beta_\mu\) and \(\beta_\sigma\) that were displayed in the summary-overview are obtained by

coef(fit)
##   (Intercept)           mag          dist (Intercept_s)         mag_s 
##  -0.145188779   0.054369245  -0.001290472  -4.330496198   0.289181123 
##    dist_inv_s 
##   3.148612765

If we only ask for \(\beta_\sigma\), we see their real names

coef(fit, mu = FALSE)
## (Intercept_s)           mag      dist_inv 
##    -4.3304962     0.2891811     3.1486128

We conclude this demonstration with the covariance matrix for the coefficients \(\beta_\mu\)

vcov(fit, sigma = FALSE)
##               (Intercept)           mag          dist
## (Intercept)  4.054396e-03 -7.174215e-04  4.262834e-06
## mag         -7.174215e-04  1.293072e-04 -8.916107e-07
## dist         4.262834e-06 -8.916107e-07  2.161296e-08

Hopefully this demonstration has given an idea of how to work with the package. The documentation of the individual functions contains further examples.

Functions in the package

We refer to the package index for a list of all available functions. The index can be viewed with help(package="lmvar").

Other packages

The function remlscore in the package statmod fits precisely the same model as lmvar. However, statmod does not provide any utility functions, which we believe are important to foster the acceptance of this model as a step beyond classical linear regression.

Other functions that allow for a model of the dispersion are, e.g., hglm in the package hglm and geese in the package geepack. These models are more complicated though, and require a level of expertise not required by lmvar.

Acknowledgements

We thank prof. dr. Eric Cator for his valuable comments and suggestions.