Greybox
Ivan Svetunkov
2018-09-10
There are three well-known notions of “boxes” in modelling:
1. Whitebox - a model that is completely transparent and does not have any randomness. One can see how the inputs are transformed into the specific outputs.
2. Blackbox - a model which does not have an apparent structure. One can only observe the inputs and outputs but does not know what happens inside.
3. Greybox - a model that lies between the first two. We observe the inputs and outputs and have some information about the structure of the model, but a part of it remains unknown.
Whiteboxes are usually used in optimisation (e.g. linear programming), while blackboxes are popular in machine learning. As for greybox models, they are more often used in analysis and forecasting. So the greybox package contains models that are used for these purposes.
At the moment the package contains an advanced linear model function and several basic functions that implement model selection and combination using information criteria (IC). You won’t find statistical tests in this package - there are plenty of them in other packages. Here we try to use modern techniques and methods that do not rely on hypothesis testing. This is the main philosophical point of greybox.
Main functions
The package includes the following functions:
- alm() - Advanced Linear Model. This is something similar to GLM, but with a focus on forecasting and the use of information criteria;
- stepwise() - select the linear model with the lowest IC from all the possible ones in the provided data. Uses partial correlations. Works fast;
- lmCombine() - combine the linear models into one using IC weights;
- lmDynamic() - produce a model with dynamic weights and time varying parameters based on IC weights;
- xregExpander() - expand the provided data by including leads and lags of the variables;
- ro() - produce forecasts with a specified function using rolling origin;
- rmc() - regression for multiple comparison of methods. This is a parametric counterpart of nemenyi(), which should work better than nemenyi() on large samples.
stepwise() and lmCombine() construct a model of class lm that can be used for the purposes of analysis or forecasting, while xregExpander() expands the exogenous variables into a matrix with lags and leads. Let’s see how these functions work, starting with xregExpander().
xregExpander
The function xregExpander() is useful in cases when the exogenous variable may influence the response variable via some lags or leads. As an example, consider the BJsales.lead series from the datasets package. Let’s assume that the BJsales variable is driven by today’s value of the indicator and its values five and ten days ago. This means that we need to produce lags of BJsales.lead. This can be done using xregExpander():
BJxreg <- xregExpander(BJsales.lead,lags=c(-5,-10))
BJxreg is a matrix which contains the original data, the data with lag 5 and the data with lag 10. However, if we just shift the original data several observations forwards or backwards, we will have missing values at the beginning / end of the series, so xregExpander() fills those values in with forecasts from the es() and iss() functions of the smooth package (depending on the type of variable we are dealing with). This also means that in the case of binary variables you may get weird averaged values as forecasts (e.g. 0.7812), so beware and look at the produced matrix. Maybe in your case it makes sense to just substitute these weird numbers with zeroes.
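For instance, one could round the filled-in values back to zeroes and ones after the expansion. Here is a small illustration with a simulated binary variable (not part of the Box-Jenkins example):
set.seed(41)
binaryVariable <- rbinom(150, 1, 0.3)
binaryXreg <- xregExpander(binaryVariable, lags=-3)
# Round the filled-in fractional values back to zeroes and ones,
# keeping the original variable in the first column as is
binaryXreg[,-1] <- round(binaryXreg[,-1])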
You may also need leads instead of lags. This is regulated with the same lags parameter, but with positive values:
BJxreg <- xregExpander(BJsales.lead,lags=c(7,-5,-10))
Once again the values are shifted, and the last 7 values of the lead variable are filled in with forecasts. In order to simplify things, we can produce all the values from 10 lags to 10 leads, which returns a matrix with 21 variables:
BJxreg <- xregExpander(BJsales.lead,lags=c(-10:10))
stepwise
The function stepwise() does the selection based on an information criterion (specified by the user) and partial correlations. In order to run this function, the response variable needs to be in the first column of the provided matrix. The idea of the function is simple; it works iteratively in the following way:
1. The basic model of the first variable and the constant is constructed (this corresponds to the simple mean). An information criterion is calculated;
2. The correlations of the residuals of the model with all the original exogenous variables are calculated;
3. The regression model of the response variable and all the variables in the previous model plus the new most correlated variable from (2) is constructed using the lm() function;
4. An information criterion is calculated and compared with the one from the previous model. If it is greater than or equal to the previous one, then we stop and use the previous model. Otherwise we go to step 2.
This way we do not do a blind search going forwards or backwards, but we follow a sort of “trace” of a good model: if the residuals contain a significant part of the variance that can be explained by one of the exogenous variables, then that variable is included in the model. Following partial correlations makes sure that we include only meaningful (from a technical point of view) variables in the model. In general the function guarantees that you will have the model with the lowest information criterion. However, this does not guarantee that you will end up with a meaningful model or with a model that produces the most accurate forecasts. So analyse what you get as a result.
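To make the logic more transparent, here is a minimal sketch of this iterative procedure written with plain lm() and AIC(). This is only an illustration of the idea, not the actual implementation of stepwise(); the data frame ourData and the stopping rule are simplified assumptions:
ourData <- as.data.frame(xregExpander(BJsales.lead, lags=c(-5,-10)))
ourData <- cbind(y=as.numeric(BJsales), ourData)
included <- character(0)
bestModel <- lm(y~1, data=ourData)   # start from the constant only
repeat{
    candidates <- setdiff(colnames(ourData)[-1], included)
    if(length(candidates)==0) break
    # Correlations of the current residuals with the remaining variables
    correlations <- abs(cor(residuals(bestModel), ourData[,candidates]))
    newVariable <- candidates[which.max(correlations)]
    newModel <- lm(as.formula(paste0("y~", paste(c(included, newVariable), collapse="+"))),
                   data=ourData)
    # Stop if the IC does not improve; otherwise accept the new variable
    if(AIC(newModel) >= AIC(bestModel)) break
    included <- c(included, newVariable)
    bestModel <- newModel
}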
Let’s see how the function works with the Box-Jenkins data. First we expand the data and form the matrix with all the variables:
BJxreg <- as.data.frame(xregExpander(BJsales.lead,lags=c(-10:10)))
BJxreg <- cbind(as.matrix(BJsales),BJxreg)
colnames(BJxreg)[1] <- "y"
This way we have a nice data frame with nice names, not something weird with strange long names. It is important to note that the response variable should be in the first column of the resulting matrix. After that we use the stepwise() function:
ourModel <- stepwise(BJxreg)
And here’s what it returns (an object of class lm):
ourModel
#> Call:
#> alm(formula = y ~ xLag4 + xLag9 + xLag3 + xLag10 + xLag5 + xLag6 +
#> xLead9 + xLag7 + xLag8, data = data, distribution = "dnorm")
#>
#> Coefficients:
#> (Intercept) xLag4 xLag9 xLag3 xLag10 xLag5
#> 17.6448130 3.3712172 1.3724166 4.6781047 1.5412073 2.3213096
#> xLag6 xLead9 xLag7 xLag8
#> 1.7075130 0.3766691 1.4024773 1.3370199
The variables in the resulting model are listed in order from the most correlated with the response variable to the least correlated one. The function works very fast because it does not need to go through all the variables and their combinations in the dataset.
All the basic methods can be used together with the final model (e.g. predict(), forecast(), summary(), etc.).
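For example, the following (purely illustrative) lines produce the summary of the selected model and a plot of the predictions for the last ten observations of the data used above:
summary(ourModel)
ourPrediction <- predict(ourModel, tail(BJxreg, 10))
plot(ourPrediction)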
lmCombine
The lmCombine() function creates a pool of linear models using lm(), writes down the parameters, standard errors and information criteria, and then combines the models using IC weights. The resulting model is of the class “lm.combined”. The speed of the function deteriorates exponentially with the increase of the number of variables \(k\) in the dataset, because the number of combined models is equal to \(2^k\). An advanced mechanism that uses stepwise() and removes a large chunk of the redundant models is also implemented in the function and can be switched on and off using the bruteForce parameter.
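IC weights of this kind are typically calculated as Akaike-type weights (Burnham and Anderson, 2002). A small illustration with hypothetical AIC values of three models:
AICs <- c(100.2, 101.5, 104.8)                    # hypothetical AIC values
deltas <- AICs - min(AICs)                        # differences from the best model
ICWeights <- exp(-0.5*deltas) / sum(exp(-0.5*deltas))
round(ICWeights, 3)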
Here’s an example on the reduced data with the combined model and the parameter bruteForce=TRUE:
ourModel <- lmCombine(BJxreg[,-c(3:7,18:22)],bruteForce=TRUE)
summary(ourModel)
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Importance Lower 2.5% Upper 97.5%
#> (Intercept) 20.90312 1.81649 1.000 17.31243 24.49380
#> x -0.04283 0.20934 0.256 -0.45663 0.37097
#> xLag5 6.39707 0.65301 1.000 5.10626 7.68788
#> xLag4 5.84667 0.70751 1.000 4.44812 7.24522
#> xLag3 5.68545 0.72261 1.000 4.25704 7.11385
#> xLag2 0.12328 0.31266 0.284 -0.49476 0.74132
#> xLag1 -0.08344 0.26044 0.269 -0.59825 0.43138
#> xLead1 -0.08953 0.25863 0.275 -0.60077 0.42170
#> xLead2 -0.03508 0.19264 0.257 -0.41587 0.34570
#> xLead3 -0.11763 0.28383 0.293 -0.67868 0.44341
#> xLead4 -0.00672 0.15873 0.256 -0.32048 0.30704
#> xLead5 0.11405 0.26333 0.300 -0.40647 0.63457
#> ---
#> Residual standard error: 2.20758 on 142.81 degrees of freedom:
#> Combined ICs:
#> AIC AICc BIC BICc
#> 670.4403 671.2708 692.0868 694.1674
The summary() function provides a table with the parameters, their standard errors, their relative importance and the 95% confidence intervals. Relative importance indicates in how many cases a variable was included in a model with a high weight. So, in the example above, the variables xLag5, xLag4 and xLag3 were included in the models with the highest weights, while all the others were in models with lower ones. This may indicate that only these variables are needed for the purposes of analysis and forecasting.
A more realistic situation is when the number of variables is high. In the following example we use the data with 21 variables. If we used brute force and estimated every model in the dataset, we would end up with \(2^{21} = 2097152\) combinations of models, which is not possible to estimate in adequate time. That is why we use bruteForce=FALSE:
ourModel <- lmCombine(BJxreg,bruteForce=FALSE)
summary(ourModel)
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Importance Lower 2.5% Upper 97.5%
#> (Intercept) 17.64791 0.81183 1.000 16.04277 19.25305
#> xLag4 3.38331 0.31375 1.000 2.76296 4.00366
#> xLag9 1.35972 0.31507 1.000 0.73677 1.98266
#> xLag3 4.68465 0.29265 1.000 4.10602 5.26328
#> xLag10 1.53859 0.28559 1.000 0.97391 2.10326
#> xLag5 2.32391 0.32371 1.000 1.68387 2.96394
#> xLag6 1.70272 0.32661 1.000 1.05695 2.34848
#> xLead9 0.22300 0.21966 0.647 -0.21132 0.65732
#> xLag7 1.40028 0.32759 1.000 0.75256 2.04799
#> xLag8 1.35128 0.32653 0.999 0.70567 1.99689
#> xLead10 0.14021 0.19464 0.506 -0.24463 0.52505
#> ---
#> Residual standard error: 0.93677 on 138.848 degrees of freedom:
#> Combined ICs:
#> AIC AICc BIC BICc
#> 417.2391 419.2053 450.8137 455.7397
In this case the stepwise() function is used first, which finds the best model in the pool. Then each variable that is not in that model is added to it and then removed iteratively. The IC, the parameter values and the standard errors are all written down for each of these expanded models. Finally, in a similar manner, each variable is removed from the optimal model and then added back. As a result, the pool of combined models becomes much smaller than it would be in the case of brute force, but it contains only meaningful models that are close to the optimal one. The rationale for this is that the marginal contribution of variables deteriorates with the increase of the number of parameters in the stepwise function, and the IC weights become close to each other around the optimal model. So, whenever the models are combined, there are a lot of redundant models with very low weights. The mechanism described above removes those redundant models.
There are several methods for the lm.combined class, including:
- predict.greybox() - returns the point and interval predictions;
- forecast.greybox() - a wrapper around predict(). The forecast horizon is defined by the length of the provided sample of newdata;
- plot.lm.combined() - plots actuals and fitted values;
- plot.predict.greybox() - uses the graphmaker() function from smooth in order to produce graphs of actuals and forecasts.
As an example, let’s split the whole sample with Box-Jenkins data into in-sample and the holdout:
BJInsample <- BJxreg[1:130,];
BJHoldout <- BJxreg[-(1:130),];
ourModel <- lmCombine(BJInsample,bruteForce=FALSE)
A summary and a plot of the model:
summary(ourModel)
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Importance Lower 2.5% Upper 97.5%
#> (Intercept) 19.38815 0.97649 1.000 17.45457 21.32173
#> xLag4 3.35394 0.33444 1.000 2.69171 4.01616
#> xLag9 1.33030 0.33663 0.999 0.66373 1.99686
#> xLag3 4.77118 0.32282 1.000 4.13196 5.41040
#> xLag10 1.53916 0.30400 1.000 0.93721 2.14111
#> xLag5 2.32533 0.34457 1.000 1.64305 3.00762
#> xLag6 1.66071 0.34692 1.000 0.97376 2.34766
#> xLead9 0.30264 0.16541 0.881 -0.02490 0.63018
#> xLag8 1.36582 0.34798 0.999 0.67679 2.05486
#> xLag7 1.32787 0.34905 0.998 0.63671 2.01902
#> xLead1 -0.02706 0.10015 0.268 -0.22536 0.17124
#> ---
#> Residual standard error: 0.95603 on 118.855 degrees of freedom:
#> Combined ICs:
#> AIC AICc BIC BICc
#> 368.3783 370.6753 400.3370 405.9273
plot(ourModel)
Importance tells us how important the respective variable is in the combination. 1 means 100% important, 0 means not important at all.
And the forecast using the holdout sample:
ourForecast <- predict(ourModel,BJHoldout)
#> Warning: The covariance matrix for combined models is approximate. Don't
#> rely too much on that.
plot(ourForecast)

These are the main functions implemented in the package for now. If you want to read more about IC model selection and combination, I would recommend the Burnham and Anderson (2002) textbook.
lmDynamic
This function is based on the principles of lmCombine() and point ICs. It allows not only combining the models, but also capturing the dynamics of their parameters. So, in a way, this corresponds to a time varying parameters model, but based on information criteria.
Continuing the example from lmCombine(), let’s construct the dynamic model:
ourModel <- lmDynamic(BJInsample,bruteForce=FALSE)
We can plot the model and ask for the summary in a similar way as with lmCombine():
ourSummary <- summary(ourModel)
ourSummary
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Importance Lower 2.5% Upper 97.5%
#> (Intercept) 19.73016 1.11842 1.00000 17.51580 21.94451
#> xLag4 3.48072 0.52158 0.97509 2.44804 4.51340
#> xLag9 1.34411 0.71688 0.80320 -0.07523 2.76344
#> xLag3 4.87629 0.49356 0.98462 3.89910 5.85349
#> xLag10 1.53842 0.68090 0.83647 0.19030 2.88653
#> xLag5 2.29900 0.66051 0.90478 0.99127 3.60674
#> xLag6 1.62412 0.69498 0.84238 0.24813 3.00011
#> xLead9 0.20102 0.20760 0.58669 -0.21001 0.61205
#> xLag8 1.30230 0.73550 0.77456 -0.15391 2.75851
#> xLag7 1.23575 0.71706 0.76861 -0.18394 2.65545
#> xLead1 0.01928 0.13365 0.28143 -0.24534 0.28390
#> ---
#> Residual standard error: 0.29199 on 120.24218 degrees of freedom:
#> Combined ICs:
#> AIC AICc BIC BICc
#> 49.53297 50.01684 63.87064 65.04827
plot(ourModel)

The coefficients in the summary are averaged over the whole sample. The more interesting elements are the time varying parameters, their standard errors (and the respective confidence intervals) and the time varying importance of the parameters.
# Coefficients in dynamics
head(ourModel$dynamic)
#> NULL
# Standard errors of the coefficients in dynamics
head(ourModel$se)
#> (Intercept) xLag4 xLag9 xLag3 xLag10 xLag5
#> [1,] 1.006227 0.3490706 0.3430449 0.3299650 0.3073849 0.3495045
#> [2,] 1.040583 0.3574727 0.3738482 0.3405528 0.3209881 0.3542456
#> [3,] 0.960554 0.3541460 0.3545871 0.3528779 0.3275815 0.3579489
#> [4,] 1.095473 0.4358911 0.8209166 0.4484011 0.6658296 0.6140531
#> [5,] 1.091252 0.4840342 0.7031024 0.4296465 0.6863125 0.6251528
#> [6,] 1.579978 1.4249578 0.7934378 4.8755976 0.9347289 1.1362565
#> xLag6 xLead9 xLag8 xLag7 xLead1
#> [1,] 0.3520294 0.1993513 0.3586997 0.3632377 0.08033929
#> [2,] 0.3590884 0.2116489 0.3994733 0.4056219 0.15123442
#> [3,] 0.3645103 0.2010113 0.3807811 0.4223361 0.29385078
#> [4,] 0.8502370 0.2010228 0.8602877 1.0642949 0.03618039
#> [5,] 0.9488466 0.2010205 1.0464571 0.9796686 0.18304177
#> [6,] 0.9889303 0.2552196 0.7530162 1.0151752 1.50713618
# Importance of parameters in dynamics
head(ourModel$importance)
#> (Intercept) xLag4 xLag9 xLag3 xLag10 xLag5
#> [1,] 1 1.0000000 0.9973533 1.000000000 0.9999769 1.0000000
#> [2,] 1 1.0000000 0.9708993 1.000000000 0.9997493 0.9999996
#> [3,] 1 1.0000000 0.9991394 1.000000000 0.9886956 0.9998766
#> [4,] 1 0.9999998 0.6391405 1.000000000 0.9932798 0.9907556
#> [5,] 1 1.0000000 0.9012306 1.000000000 0.9408163 0.9669927
#> [6,] 1 1.0000000 0.9669370 0.000159121 0.7044398 1.0000000
#> xLag6 xLead9 xLag8 xLag7 xLead1
#> [1,] 0.9999181 4.316350e-01 0.9980106 0.9963564 0.21491922
#> [2,] 0.9981788 8.543845e-01 0.9619174 0.9600163 0.43415732
#> [3,] 0.9987263 5.417124e-04 0.9971251 0.9306122 0.99865721
#> [4,] 0.7061153 1.796136e-04 0.7385491 0.2760648 0.06622019
#> [5,] 0.6508966 2.009093e-08 0.4044998 0.4414783 0.58301383
#> [6,] 0.9993530 3.813437e-01 0.7842385 0.9923411 0.97652417
The importance can also be plotted using the plot() and coef() functions, which might produce a lot of images:
The plots show how the importance of each parameter changes over time. The values do not look as smooth as we would like them to, but nothing can be done with this at the moment. If you want something smooth, then smooth these values out using, for example, the cma() function from the smooth package.
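As a hedged illustration (assuming that the smooth package is installed and that cma() accepts a numeric vector and an order), the importance of a single parameter could be smoothed like this:
library(smooth)
# Centred moving average of order 5 over the importance of xLead9
importanceSmoothed <- cma(ourModel$importance[,"xLead9"], order=5)
plot(importanceSmoothed)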
In fact, even degrees of freedom are now also time varying:
ourModel$dfDynamic
#> [1] 10.638169 11.179302 10.913374 8.410305 8.888928 9.805336 9.692229
#> [8] 10.108928 9.596108 8.355333 8.110091 8.325276 8.147258 9.260638
#> [15] 8.211777 8.019886 8.947718 7.923381 9.713389 9.042951 9.003586
#> [22] 9.993941 10.074492 10.082990 10.742267 11.401828 10.004429 10.192848
#> [29] 9.245310 10.079378 10.257277 9.163669 9.466405 8.789201 9.455050
#> [36] 10.488498 11.132939 9.611156 10.327637 9.688387 9.679958 10.855423
#> [43] 10.240383 10.123653 10.214254 10.065120 8.985523 8.998608 9.971901
#> [50] 10.012036 10.288743 11.130018 10.030527 10.611253 10.323163 9.873998
#> [57] 8.782753 8.979447 9.215197 8.815235 8.291630 8.569913 8.495248
#> [64] 9.098965 10.023286 8.811119 9.712579 8.675171 8.805526 8.852387
#> [71] 8.937664 9.692961 9.149996 8.898936 8.106896 9.531607 9.009735
#> [78] 9.525549 10.690720 9.560784 9.511517 9.295723 10.080932 10.076916
#> [85] 10.755791 10.050309 10.030572 9.129278 9.992773 9.452569 10.387848
#> [92] 10.883425 11.224420 11.195366 11.078483 10.713917 8.188753 9.893770
#> [99] 9.207828 8.422913 9.036704 10.416710 10.690833 11.047595 10.279752
#> [106] 10.083729 9.996557 9.600129 10.133174 10.665134 10.957068 10.227205
#> [113] 9.516802 10.010265 11.069592 10.596234 11.193334 10.121109 10.816862
#> [120] 11.208848 11.069212 9.998647 9.923869 10.784109 10.181479 10.051873
#> [127] 8.744937 9.752157 9.582358 9.057583
ourModel$df.residualDynamic
#> [1] 119.3618 118.8207 119.0866 121.5897 121.1111 120.1947 120.3078
#> [8] 119.8911 120.4039 121.6447 121.8899 121.6747 121.8527 120.7394
#> [15] 121.7882 121.9801 121.0523 122.0766 120.2866 120.9570 120.9964
#> [22] 120.0061 119.9255 119.9170 119.2577 118.5982 119.9956 119.8072
#> [29] 120.7547 119.9206 119.7427 120.8363 120.5336 121.2108 120.5450
#> [36] 119.5115 118.8671 120.3888 119.6724 120.3116 120.3200 119.1446
#> [43] 119.7596 119.8763 119.7857 119.9349 121.0145 121.0014 120.0281
#> [50] 119.9880 119.7113 118.8700 119.9695 119.3887 119.6768 120.1260
#> [57] 121.2172 121.0206 120.7848 121.1848 121.7084 121.4301 121.5048
#> [64] 120.9010 119.9767 121.1889 120.2874 121.3248 121.1945 121.1476
#> [71] 121.0623 120.3070 120.8500 121.1011 121.8931 120.4684 120.9903
#> [78] 120.4745 119.3093 120.4392 120.4885 120.7043 119.9191 119.9231
#> [85] 119.2442 119.9497 119.9694 120.8707 120.0072 120.5474 119.6122
#> [92] 119.1166 118.7756 118.8046 118.9215 119.2861 121.8112 120.1062
#> [99] 120.7922 121.5771 120.9633 119.5833 119.3092 118.9524 119.7202
#> [106] 119.9163 120.0034 120.3999 119.8668 119.3349 119.0429 119.7728
#> [113] 120.4832 119.9897 118.9304 119.4038 118.8067 119.8789 119.1831
#> [120] 118.7912 118.9308 120.0014 120.0761 119.2159 119.8185 119.9481
#> [127] 121.2551 120.2478 120.4176 120.9424
And, as usual, we can produce a forecast from this model; the mean parameters are used in this case:
ourForecast <- predict(ourModel,BJHoldout)
#> Warning: The covariance matrix for combined models is approximate. Don't
#> rely too much on that.
plot(ourForecast)
This function is currently under development, so stay tuned.
References
- Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag New York.