The ‘lmvar’ package fits a linear model in which the assumption of homoscedasticity (i.e., the variance is independent of the expectation value) is dropped. Instead, the variance has its own model, comparable to the model for the expectation value.
The fit results in an ‘lmvar’ object, which is a list of class ‘lmvar’. Accessor functions are provided to extract the list members such as the fitted betas and the log-likelihood for the model. Various utility functions such as residuals
to calculate residuals, AIC
to calculate the AIC, fitted
to obtain expected values and standard deviations, etc., are also provided by the package.
The package lacks much of the sophistication of the ‘lm’ and ‘glm’ packages. On the bright side, this means it is simple to use. It is intended for people who run a classical linear model and want to see what happens if the restriction of a constant variance is dropped. Questions in this context are: does the allowance of heteroscedasticity result in a better fit, lower values for the AIC or BIC, smaller errors from a cross-validation, etc.?
The package fits the following model. A vector \(Y\) of observations (sometimes called ‘responses’) of length \(n\) is a stochastic vector. It is distributed according to a multivariate Gaussian distribution:
\[\begin{equation} Y \sim \mathcal{N}_n( \mu, \Sigma), \end{equation}\] where \(\mu\) is the vector of expectation values and \(\Sigma\) the covariance matrix (also called the ‘variance-covariance matrix’). Just like in the standard linear model, the covariance is taken to be a \(n \times n\) diagonal matrix but contrary to the standard linear model, the diagonal entries need not be all the same: \[\begin{equation} \Sigma_{ij} = \begin{cases} 0 & i \neq j\\ \sigma_i^2 & i=j \end{cases} \end{equation}\]where \(X_\mu\) is the ‘model matrix’ or ‘design matrix’ for \(\mu\) and \(\beta_\mu\) the parameter vector for \(\mu\). \(X_\mu\) is a \(n \times k_\mu\) matrix and \(\beta_\mu\) a vector of length \(k_\mu\).
where \(\log \sigma\) stands for the vector \((\log\sigma_1, \dots, \log\sigma_n)\), \(X_\sigma\) is the ‘model matrix’ or ‘design matrix’ for \(\sigma\) and \(\beta_\sigma\) the parameter vector for \(\sigma\). The logarithm is taken to be the ‘natural logarithm’ with base \(e\). The dimensions of \(X_\sigma\) are \(n \times k_\sigma\) and \(\beta_\sigma\) is a vector of length \(k_\sigma\).
The vector of observations \(Y\) and the matrices \(X_\mu\) and \(X_\sigma\) are specified by the user. They must contain real values. The fit returns the maximum-likelihood estimators for \(\beta_\mu\) and \(\beta_\sigma\). They are also real-valued vectors.
The model for both \(\mu\) and \(\sigma\) contains an intercept term. That means that the first column of both matrices is a column in which each matrix-element equals 1. The package will add this column to the user-suppplied matrices to ensure that the intercept term is always present. There is no need for a user to include such a column in a user-supplied model-matrix.
After adding the intercept column, the package will check whether the resulting matrices are full rank. If not, columns will be removed from each matrix until it is full rank.
The addition of an intercept column and, possibly, the removal of columns to obtain a full-rank matrix, imply that the actual matrices used in the fit can be different from the user-specified matrices. The matrices that are actually used in the fit are returned as members of the lmvar
object.
Carrying out the fit boils down to solving a set of non-linear equations. This is carried out by the function nleqslv
from the package with the same name.
More mathematical details about the model can be found in the vignette ‘Math’ which comes with this package. It can be viewed with vignette("Math")
or vignette("Math", package="lmvar")
.
The main function in the package is lmvar
. It carries out a fit and returns an lmvar
object.
The user must specify a vector of observations and two model-matrices when calling lmvar
. They must meet the following conditions:
NaN
etc., are not allowed.lmvar
is called (Intercept)
for \(X_\mu\) and (Intercept_s)
for \(X_\sigma\). In case no column-names are specified, the second, third, etc. columns are called v1
, v2
etc. for \(X_\mu\) and v1_s
, v2_s
etc. for \(X_\sigma\).(Intercept)
. This is a reserved name, as we explained above. Likewise, a user-supplied matrix \(X_\sigma\) must not have a column named (Intercept_s)
.With each column in \(X_\mu\) corresponds an element in \(\beta_\mu\). The name of that element is the corresponding column name. The same is true for \(X_\sigma\) and \(\beta_\sigma\).
It can happen that nleqslv
fails to solve the maximum-likelihood equations. Sometimes the problem can be traced back to columns in \(X_\sigma\) with many zero’s. I.e., covariates (or factor-levels) for \(\sigma\) which affect only few observations. Removal of these columns may remedy the issue.
An lmvar
object is a list whose members are intended to be extracted with the supplied accessor and utility functions. The only members for which no such functions have been implemented are:
y
: the user-supplied vector of observationsX_mu
: the actual, full-rank matrix \(X_\mu\) including the intercept term that is used in the fitX_sigma
: the actual, full-rank matrix \(X_\sigma\) including the intercept term that is used in the fitOnce lmvar
has run and an lmvar
-object created, one can obtain \(\beta_\mu\) and \(\beta_\sigma\) with the function coef
. The function fitted
allows one to obtain \(\mu\) and \(\sigma\). We refer to the package documentation (in particular the package index which can be viewed with help(package = "lmvar")
) for a list of all available functions and function details.
We demonstrate the package with the help of the dataframe attenu
which can be found in the datasets
package.
library(lmvar)
# As example we use the dataset 'attenu' from the library 'datasets'. The dataset contains
# the response variable 'accel' and two explanatory variables 'mag' and 'dist'.
library(datasets)
# Create the model matrix for the expected values
X = cbind(attenu$mag, attenu$dist)
colnames(X) = c("mag", "dist")
# Create the model matrix for the standard deviations.
X_s = cbind(attenu$mag, 1 / attenu$dist)
colnames(X_s) = c("mag", "dist_inv")
# Carry out the fit
fit = lmvar(attenu$accel, X, X_s)
We have now created the object fit
which is our object of class lmvar
. To obtain a first impression of the fit, we look at the summary
summary(fit)
## Call:
## lmvar(y = attenu$accel, X_mu = X, X_sigma = X_s)
##
## Standardized residuals:
## Min 1Q Median 3Q Max
## -1.4679 -0.6615 -0.1132 0.5736 2.9969
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14518878 0.06367414 -2.2802 0.0226 *
## mag 0.05436925 0.01137133 4.7813 1.742e-06 ***
## dist -0.00129047 0.00014701 -8.7779 < 2.2e-16 ***
## (Intercept_s) -4.33049620 0.44747494 -9.6776 < 2.2e-16 ***
## mag_s 0.28918112 0.07286834 3.9685 7.231e-05 ***
## dist_inv_s 3.14861277 0.23985317 13.1273 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Standard deviations:
## Min 1Q Median 3Q Max
## 0.0634 0.0775 0.0920 0.1095 46.8245
##
## Comparison to model with constant variance (i.e. classical linear model)
## Log likelihood-ratio: 31.38197
## Additional degrees of freedom: 2
## p-value for difference in deviance: 2.35e-14
The first line shows the call that created fit
. Next, we are told something about the distribution of the standardized residuals. Ideally, the first quarter (1Q
) must be approximately -0.67, the median
0 and the third quarter (3Q
) 0.67.
Next, the summary shows the matrix with the coefficients \(\beta_\mu\) and \(\beta_\sigma\). The coefficients \(\beta_\mu\) are (Intercept)
, mag
and dist
. The coefficients \(\beta_\sigma\) are (Intercept_s)
, mag_s
and dist_inv_s
. They are called this way to distinguish them from the coefficients for \(\beta_\mu\). In cases where there is no risk of confusion, their true names will be used, which are (Intercept_s)
, mag
and dist_inv
.
The matrix with coefficients shows that all coefficients are statistically significant at the 5% level.
The next piece of information gives an impression of the distribution of the standard deviations. They range from 0.0631 to 42.0983.
Finally the model is compared to a classical linear model with the same model matrix \(X_\mu\) but a fixed standard deviation. The summary shows the difference in log-likelihood between the two models and the difference in degrees of freedom. Twice the difference in log-likelihood is the difference in deviance, for which a p-value is calculated. The fact that the p-value in the summary is nearly zero, indicates that the lmvar
fit is a better fit than the classical linear model. I.e., it makes sense to let the variance vary instead of keeping it fixed.
Let’s see how the standard deviations are distributed
sigma = fitted(fit, mu = FALSE)
hist(sigma)
To check the distribution of the residuals, we make another histogram.
hist(residuals(fit))
The rank of the matrix \(X_\mu\) used in the fit is
dfree(fit, sigma = FALSE)
## [1] 3
The value 3 is correct: the user-supplied matrix had 2 columns and lmvar
added an intercept column. Apparently all columns are linearly independent so no column had to be removed.
To see the number of observations in the fit, the log-likelihood and the AIC value, we run
nobs(fit)
## [1] 182
logLik(fit)
## 'log Lik.' 154.7313 (df=6)
AIC(fit)
## [1] -297.4625
The coefficients \(\beta_\mu\) and \(\beta_\sigma\) that were displayed in the summary-overview are obtained by
coef(fit)
## (Intercept) mag dist (Intercept_s) mag_s
## -0.145188779 0.054369245 -0.001290472 -4.330496198 0.289181123
## dist_inv_s
## 3.148612765
If we only ask for \(\beta_\sigma\),we see their real names
coef(fit, mu = FALSE)
## (Intercept_s) mag dist_inv
## -4.3304962 0.2891811 3.1486128
We conclude this demonstration with the covariance matrix for the coefficients \(\beta_\mu\)
vcov(fit, sigma = FALSE)
## (Intercept) mag dist
## (Intercept) 4.054396e-03 -7.174215e-04 4.262834e-06
## mag -7.174215e-04 1.293072e-04 -8.916107e-07
## dist 4.262834e-06 -8.916107e-07 2.161296e-08
Hopefully this demonstration has given an idea of how to work with the package. The documentation of the individual fuctions contains further examples.
We refer to the package index for a list of all available functions. The index can be viewed with help(package="lmvar")
.
The function remlscore
in the package statmod
fits precisely the same model as lmvar
. However, statmod
does not provide any utility function, which we believe are important to foster the acceptance of this model as a step beyond classical linear regression.
Other functions that allow for a model of the dispersion are, e.g., hglm
in the package hglm
and geese
in the package geepack
. These models are more complicated though, and require a level of expertise not required by lmvar
.
We thank prof. dr. Eric Cator for his valuable comments and suggestions.