In this vignette, the theoretical considerations are skipped and
simple examples of several GCEstim
functions are given.
Consider dataGCE
(see “Generalized Maximum Entropy
framework”).
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> X001 X002 X003 X004
#> Min. :-0.367479 Min. :-0.34228 Min. :-0.226104 Min. :-0.232254
#> 1st Qu.:-0.124128 1st Qu.:-0.06117 1st Qu.:-0.065646 1st Qu.:-0.057236
#> Median :-0.007875 Median : 0.02085 Median :-0.013335 Median :-0.011069
#> Mean :-0.004117 Mean : 0.01087 Mean :-0.003149 Mean :-0.009777
#> 3rd Qu.: 0.112612 3rd Qu.: 0.09638 3rd Qu.: 0.074539 3rd Qu.: 0.039632
#> Max. : 0.357509 Max. : 0.26828 Max. : 0.255017 Max. : 0.168009
#> X005 y
#> Min. :-0.240180 Min. :-2.0085
#> 1st Qu.:-0.074220 1st Qu.: 0.2753
#> Median : 0.009687 Median : 1.0415
#> Mean : 0.002912 Mean : 0.9871
#> 3rd Qu.: 0.074166 3rd Qu.: 1.5678
#> Max. : 0.220218 Max. : 3.4560
Suppose we want to obtain the estimated model
\[\begin{equation} \widehat{\mathbf{y}}=\widehat{\beta_0} + \widehat{\beta_1}\mathbf{X001} + \widehat{\beta_2}\mathbf{X002} + \widehat{\beta_3}\mathbf{X003} + \widehat{\beta_4}\mathbf{X004} + \widehat{\beta_5}\mathbf{X005}. \end{equation}\]
Using the function lmgce()
and setting:
formula = y ~ X001 + X002 + X003 + X004 + X005
data = dataGCE
the desired model is computed.
res.lmgce.v01 <-
lmgce(
formula = y ~ X001 + X002 + X003 + X004 + X005,
data = dataGCE)
#> Warning in lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE):
#>
#> The minimum error was found for the highest upper limit of the support. Confirm if higher values should be tested.
If we would like to see a concise summary of the result, we can simply type,
res.lmgce.v01
#>
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE)
#>
#> Coefficients:
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0149 -0.2202 2.5209 3.7025 9.2585 12.3076
To obtain a more detailed evaluation of the fitted model, we can use
summary()
summary(res.lmgce.v01)
#>
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.14311 -0.28521 0.00022 0.33284 0.94757
#>
#> Coefficients:
#> Estimate Std. Deviation z value Pr(>|t|)
#> (Intercept) 1.01494 0.02058 49.312 < 2e-16 ***
#> X001 -0.22020 0.26574 -0.829 0.40732
#> X002 2.52086 2.51212 1.003 0.31563
#> X003 3.70245 1.39718 2.650 0.00805 **
#> X004 9.25849 3.13278 2.955 0.00312 **
#> X005 12.30761 2.99930 4.103 4.07e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Normalized Entropy:
#> NormEnt SupportLL SupportUL
#> (Intercept) 0.613120 -1.385564 1.385564
#> X001 0.999981 -40.044509 40.044509
#> X002 0.998628 -53.653819 53.653819
#> X003 0.997924 -64.082185 64.082185
#> X004 0.992078 -82.107203 82.107203
#> X005 0.977198 -64.503624 64.503624
#>
#> Residual standard error: 0.4193 on 94 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 6.623, Chosen Error: 1se
#> Multiple R-squared: 0.8289, Adjusted R-squared: 0.8198
#> NormEnt: 0.9298, CV-NormEnt: 0.9314 (0.001405)
#> RMSE: 0.4065, CV-RMSE: 0.4245 (0.07841)
The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{GCE}=\) (1.015, -0.22, 2.521, 3.702, 9.258, 12.308).
(coef.res.lmgce.v01 <- coef(res.lmgce.v01))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0149399 -0.2201957 2.5208625 3.7024518 9.2584894 12.3076121
The prediction root mean squared error (RMSE) is \(RMSE_{\mathbf{\hat y}}^{GCE} \approx\) 0.407.
The prediction 5-fold cross-validation (CV) error is \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE} \approx\) 0.424.
Z-score confidence intervals (CIs) are obtained with the
confint
S3 method for class lmgce
confint(res.lmgce.v01, level = 0.95)
#> 2.5% 97.5%
#> (Intercept) 0.9745995 1.0552803
#> X001 -0.7410327 0.3006412
#> X002 -2.4028096 7.4445345
#> X003 0.9640248 6.4408787
#> X004 3.1183608 15.3986181
#> X005 6.4290868 18.1861374
and can be visualized with plot.
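A sketch of a plausible call (a plot method for lmgce objects is implied by the text; the exact panel returned may differ):

plot(res.lmgce.v01)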
Color-filled dots represent the GCE estimates, while non-filled dots represent the OLS estimates (lmgce uses OLS = TRUE by default).
Bootstrap CIs are available by setting method = "percentile" or method = "basic", defining the number of replicates (boot.B = 1000) and the bootstrap method (three methods are available; see help("confint.lmgce")):
res.lmgce.v01.confint <-
confint(
res.lmgce.v01,
level = 0.95,
method = "percentile",
boot.B = 1000,
boot.method = "residuals"
)
res.lmgce.v01.confint
#> 2.5% 97.5%
#> (Intercept) 0.9466904 1.0613189
#> X001 -0.4420402 0.7366755
#> X002 -3.2157042 0.2189214
#> X003 0.2355981 2.5550294
#> X004 1.8239425 6.3393070
#> X005 5.2718289 9.3762228
or
res.lmgce.v02 <-
update(res.lmgce.v01,
boot.B = 1000,
boot.method = "residuals")
#> Warning in lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE, :
#>
#> The minimum error was found for the highest upper limit of the support. Confirm if higher values should be tested.
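The intervals printed below imply an intermediate assignment along these lines (an assumption; the arguments mirror the earlier confint() call):

res.lmgce.v02.confint <-
  confint(res.lmgce.v02, level = 0.95, method = "percentile")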
res.lmgce.v02.confint
#> 2.5% 97.5%
#> (Intercept) 0.9441818 1.0773997
#> X001 -0.7361146 0.5095636
#> X002 -2.1082789 4.1808036
#> X003 0.8410829 4.7016152
#> X004 3.4094424 11.1024973
#> X005 6.6052740 13.9711447
From the previous summary we can see that the standardized upper limit for all support spaces was 6.623, which corresponds to the support spaces whose cross-validation RMSE (CV-RMSE) is not greater than the minimum CV-RMSE plus one standard error (1se). During estimation, standardized limits are converted back to the original scale:
res.lmgce.v01$support.matrix #original scale
#> SupportLL SupportUL
#> (Intercept) -1.385564 1.385564
#> X001 -40.044509 40.044509
#> X002 -53.653819 53.653819
#> X003 -64.082185 64.082185
#> X004 -82.107203 82.107203
#> X005 -64.503624 64.503624
lmgce uses by default support.signal = NULL and support.signal.vector = NULL, meaning that it will evaluate the errormeasure = "RMSE" with cross-validation (cv = TRUE, cv.nfolds = 5) on a set of support.signal.vector.n = 20 support spaces logarithmically spaced between support.signal.vector.min = 0.3 and support.signal.vector.max = 20. It will then choose the minimum, the one standard error (1se), or the elbow error (see help("lmgce")).
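For reference, a sketch of the first call with those defaults spelled out (support.signal.vector.max is the presumed name of the upper-limit argument; see help("lmgce") for the authoritative list):

res.lmgce.v01.explicit <-
  lmgce(
    formula = y ~ X001 + X002 + X003 + X004 + X005,
    data = dataGCE,
    errormeasure = "RMSE",
    errormeasure.which = "1se",
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = NULL,
    support.signal.vector = NULL,
    support.signal.vector.n = 20,
    support.signal.vector.min = 0.3,
    support.signal.vector.max = 20)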
Red dots represent the CV-error and green dots the CV-Normalized entropy. Whiskers have the length of two standard errors for each of the 20 support spaces. The dotted horizontal line is the OLS CV-error. The black vertical dotted line corresponds to the support spaces that produced the lowest CV-error. The black vertical dashed line corresponds to the support spaces that produced the 1se CV-error. The red vertical dotted line corresponds to the support spaces that produced the elbow CV-error.
With plot it is possible to obtain the trace of the estimates.
The last two plots depict the final solutions. That is to say that, after choosing the support space limits based on the defined error, the number of points of the support spaces, and their probabilities (support.signal.points = c(1/5, 1/5, 1/5, 1/5, 1/5)), twosteps.n = 1 extra estimation(s) is (are) performed. This estimation uses the GCE framework even if the previous steps were performed, by default, in the GME framework. The distribution of probabilities used is the one estimated for the chosen support spaces.
res.lmgce.v01$p0
#> p_1 p_2 p_3 p_4 p_5
#> (Intercept) 0.0164583 0.04043601 0.09934627 0.2440815 0.5996780
#> X001 0.1999969 0.19999843 0.20000000 0.2000016 0.2000031
#> X002 0.1989669 0.19948212 0.19999866 0.2005165 0.2010358
#> X003 0.1856222 0.19254787 0.19973189 0.2071840 0.2149140
#> X004 0.1711658 0.18450259 0.19887859 0.2143747 0.2310783
#> X005 0.1457798 0.16891704 0.19572648 0.2267909 0.2627857
The trace of the CV-error can be obtained with plot. The pre-reestimation CV-error is depicted by the red dot and the final/reestimated CV-error corresponds to the dark red dot. The horizontal dotted line again represents the OLS CV-error. The final estimated vector of probabilities is
res.lmgce.v01$p
#> p_1 p_2 p_3 p_4 p_5
#> (Intercept) 0.01096511 0.03033771 0.08393684 0.2322322 0.6425282
#> X001 0.20220556 0.20109672 0.19999395 0.1988972 0.1978065
#> X002 0.18164405 0.19039062 0.19955835 0.2091675 0.2192394
#> X003 0.17754981 0.18812593 0.19933204 0.2112057 0.2237866
#> X004 0.15738899 0.17628610 0.19745213 0.2211595 0.2477133
#> X005 0.13074317 0.15871806 0.19267870 0.2339058 0.2839543
After looking at the results and traces we may feel the need to select a different support, let's say the minimum. That can be done using, for instance,
lmgce(y ~ X001 + X002 + X003 + X004 + X005,
data = dataGCE,
errormeasure.which = "min")
#or
update(res.lmgce.v01, errormeasure.which = "min")
but that implies a complete reestimation and can be very time-consuming. Since the results of all the evaluated support spaces are stored, choosing a different support should instead be done with changesupport:
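The summary below implies a call along these lines (the exact argument form is an assumption; see help("changesupport")):

res.lmgce.v01.min <- changesupport(res.lmgce.v01, "min")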
summary(res.lmgce.v01.min)
#>
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.14044 -0.32087 -0.00145 0.30294 0.98079
#>
#> Coefficients:
#> Estimate Std. Deviation z value Pr(>|t|)
#> (Intercept) 1.0214 0.0205 49.819 < 2e-16 ***
#> X001 -0.5732 0.2647 -2.166 0.03035 *
#> X002 6.5726 2.5023 2.627 0.00862 **
#> X003 5.9458 1.3917 4.272 1.94e-05 ***
#> X004 14.3297 3.1206 4.592 4.39e-06 ***
#> X005 17.1517 2.9876 5.741 9.42e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Normalized Entropy:
#> NormEnt SupportLL SupportUL
#> (Intercept) 0.857780 -2.190354 2.190354
#> X001 0.999986 -120.925588 120.925588
#> X002 0.998977 -162.022706 162.022706
#> X003 0.999413 -193.514073 193.514073
#> X004 0.997923 -247.945652 247.945652
#> X005 0.995174 -194.786726 194.786726
#>
#> Residual standard error: 0.4176 on 94 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 20, Chosen Error: min
#> Multiple R-squared: 0.831, Adjusted R-squared: 0.822
#> NormEnt: 0.9749, CV-NormEnt: 0.9748 (0.000247)
#> RMSE: 0.4049, CV-RMSE: 0.428 (0.07535)
Note that the GCE estimates are now "closer" to the OLS estimates. In this example the true coefficients are known, and we can also observe that the min approach gives estimates "closer" to the OLS estimates than to those coefficients, which in ill-conditioned problems is not recommended.
data.frame("Supp_1se" = coef(res.lmgce.v01),
"Supp_min" = coef(res.lmgce.v01.min),
"OLS" = coef(res.lmgce.v01$results$OLS$res),
"TRUE" = coef.dataGCE)
#> Supp_1se Supp_min OLS TRUE.
#> (Intercept) 1.0149399 1.0213867 1.0219926 1
#> X001 -0.2201957 -0.5732211 -0.5888888 0
#> X002 2.5208625 6.5726180 6.8461549 0
#> X003 3.7024518 5.9458002 6.0887435 3
#> X004 9.2584894 14.3296693 14.6678589 6
#> X005 12.3076121 17.1516725 17.4690058 9
Several generic functions can be used with lmgce class objects, for instance
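For example (a sketch; fitted() and residuals() are assumed to be available, consistent with the residual summaries shown earlier):

fitted(res.lmgce.v01)[1:5]    # first five fitted values
residuals(res.lmgce.v01)[1:5] # first five residuals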
An add-in to easily generate the R code for an lmgce analysis when support.method = "standardized", and to perform that analysis, is also available. Data can be imported from a file or called from the Environment.
The parameters of lmgce()
can be changed.
Several outputs can be visualized.
The expression for lmgce()
can be exported.
lmgce()
allows for the evaluation of several support
spaces and, given a certain criterion, selects one of them. But GCE
estimation is also sensitive to the number of points of the support and
noise spaces and to the weight that each of the spaces has in the loss
function. cv.lmgce() tests combinations of these parameters.
res.cv.lmgce <-
cv.lmgce(
y ~ X001 + X002 + X003 + X004 + X005,
data = dataGCE,
support.signal.points = c(3, 5, 7, 9, 11),
support.noise.points = c(3, 5, 7, 9, 11),
weight = c(0.1, 0.3, 0.5, 0.7, 0.9))
res.cv.lmgce
#>
#> Call:
#> cv.lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE,
#> support.signal.points = c(3, 5, 7, 9, 11), support.noise.points = c(3,
#> 5, 7, 9, 11), weight = c(0.1, 0.3, 0.5, 0.7, 0.9))
#>
#>
#> Best combination:
#> 5 points for the signal support; 7 points for the noise support; a weight of 0.3 for the noise support.
#>
#> Coefficients:
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0053 -0.1054 1.1877 2.9615 7.5766 10.7126
It returns the CV-errors, the convergence status of the optimization algorithm, and the computation time.
res.cv.lmgce$results[order(res.cv.lmgce$results$error.measure.cv.mean),][1:10,-6]
#> support.signal.points support.noise.points weight error.measure
#> 37 5 7 0.3 0.4079956
#> 48 7 11 0.3 0.4081282
#> 75 11 11 0.5 0.4084240
#> 69 9 9 0.5 0.4083783
#> 53 7 3 0.5 0.4083569
#> 70 11 9 0.5 0.4085954
#> 43 7 9 0.3 0.4083187
#> 64 9 7 0.5 0.4086463
#> 36 3 7 0.3 0.4083031
#> 49 9 11 0.3 0.4085277
#> error.measure.cv.mean convergence time
#> 37 0.4342384 0 2.969143
#> 48 0.4343499 0 3.386481
#> 75 0.4344170 0 4.381215
#> 69 0.4344356 0 3.828571
#> 53 0.4344939 0 2.478632
#> 70 0.4345810 0 3.994712
#> 43 0.4346196 0 3.165467
#> 64 0.4346713 0 3.713387
#> 36 0.4346806 0 3.180539
#> 49 0.4347227 0 3.316488
The model with the combination that produces the lowest CV-error is kept.
summary(res.cv.lmgce$best)
#>
#> Call:
#> cv.lmgce(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE,
#> support.signal.points = c(3, 5, 7, 9, 11), support.noise.points = c(3,
#> 5, 7, 9, 11), weight = c(0.1, 0.3, 0.5, 0.7, 0.9))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.13620 -0.25946 0.00574 0.34428 0.95410
#>
#> Coefficients:
#> Estimate Std. Deviation z value Pr(>|t|)
#> (Intercept) 1.00530 0.01239 81.163 < 2e-16 ***
#> X001 -0.10536 0.15992 -0.659 0.509991
#> X002 1.18769 1.51177 0.786 0.432084
#> X003 2.96152 0.84081 3.522 0.000428 ***
#> X004 7.57661 1.88527 4.019 5.85e-05 ***
#> X005 10.71264 1.80495 5.935 2.94e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Normalized Entropy:
#> NormEnt SupportLL SupportUL
#> (Intercept) 0.621836 -1.385564 1.385564
#> X001 0.999996 -40.044509 40.044509
#> X002 0.999696 -53.653819 53.653819
#> X003 0.998672 -64.082185 64.082185
#> X004 0.994700 -82.107203 82.107203
#> X005 0.982759 -64.503624 64.503624
#>
#> Residual standard error: 0.4208 on 94 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 6.623, Chosen Error: 1se
#> Multiple R-squared: 0.8275, Adjusted R-squared: 0.8183
#> NormEnt: 0.9329, CV-NormEnt: 0.9351 (0.002509)
#> RMSE: 0.408, CV-RMSE: 0.4342 (0.05257)
Using plot we obtain a graphical overview of these results.
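A minimal sketch (a plot method for cv.lmgce objects is assumed):

plot(res.cv.lmgce)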
Consider the time series in the ts
object
moz_ts
.
Let:
\(CO2_t\) be the emissions of \(CO_2\) at time \(t\);
\(GDP_{t-1}\) be the gross domestic product at time \(t-1\);
\(EPC_{t-1}\) be the energy per capita at time \(t-1\);
\(EU_{t-1}\) be the energy use at time \(t-1\).
Let us obtain the estimated model
\[\begin{equation} \widehat{CO2_t}=\widehat{\beta_0} + \widehat{\beta_1}GDP_{t-1} + \widehat{\beta_2}EPC_{t-1} + \widehat{\beta_3}EU_{t-1} \end{equation}\]
Using the function tsbootgce()
and setting:
formula = CO2 ~ 1 + L(GDP, 1) + L(EPC, 1) + L(EU, 1)
data = moz_ts
the desired model is computed. formula accepts the additional functions available in the R package dynlm (such as L() for lags).
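The fit used below is obtained with the following call (the assignment to res.tsbootgce is inferred from the objects printed later; the call itself matches the Call field in the output):

res.tsbootgce <-
  tsbootgce(
    formula = CO2 ~ 1 + L(GDP, 1) + L(EPC, 1) + L(EU, 1),
    data = moz_ts)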
Note that by default tsbootgce
uses
reps = 1000
replicates as elements of an ensemble, which
retain the shape of the original time series, as well as the time
dependence structure of the autocorrelation and the partial
autocorrelation functions. The plot
function can give us
percentile confidence intervals (red, green and blue areas) and the
median distribution (dashed yellow line) of the generated time series as
well as the original data (black line).
A concise summary of the result can be obtained simply by typing,
res.tsbootgce
#>
#> Call:
#> tsbootgce(formula = CO2 ~ 1 + L(GDP, 1) + L(EPC, 1) + L(EU, 1),
#> data = moz_ts)
#>
#> Coefficients:
#> (Intercept) L(GDP, 1) L(EPC, 1) L(EU, 1)
#> tsbootgce -33.40547 0.03909 0.02290 0.26097
#> lmgce -70.38222 0.05716 0.03043 0.32684
#> lm -154.46129 0.13826 -0.02110 0.48219
By default, the tsbootgce coefficients displayed are obtained with coef.method = "mode".
coef(res.tsbootgce)
#> (Intercept) L(GDP, 1) L(EPC, 1) L(EU, 1)
#> -33.40547358 0.03908692 0.02289881 0.26097442
and the \(95\%\) highest density regions are
confint(res.tsbootgce)
#> 2.5% 97.5%
#> (Intercept) -49.98354439 -16.83952784
#> L(GDP, 1) 0.03435963 0.04463217
#> L(EPC, 1) 0.01606767 0.02749484
#> L(EU, 1) 0.21988135 0.30194119
The empirical distribution, confidence intervals and measure of central tendency of each estimate can be visualized by typing
plot(res.tsbootgce,
ci.levels = c(0.90, 0.95, 0.99),
ci.method = c("hdr" #,"basic" #,"percentile"
))[[2]]
The estimates obtained in each bootstrap repetition have an associated normalized entropy that can be used as a weight in an aggregation procedure named neagging.
lmgce objects keep the bootstrap results in object$results$bootstrap when the argument boot.B is greater than or equal to 10. When these results exist, we can call the function neagging, setting only the object parameter.
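For instance, using the earlier bootstrap fit res.lmgce.v02 (fitted with boot.B = 1000); the assignment name matches the object used below:

res.neagging.lmgce <- neagging(res.lmgce.v02)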
The trace of the in-sample error can be obtained with plot. The dotted horizontal line represents the in-sample error of the lmgce model. The neagging minimum in-sample error, represented by the vertical dashed line, was obtained by aggregating 2 models.
The trace of the estimates can also be visualized with
plot
, where the dotted horizontal lines represent the
estimates from the lmgce
model.
The estimated coefficients that produce the lowest in sample error are \(\widehat{\boldsymbol{\beta}}^{neagging}=\) (1.039, -0.159, 0.64, 2.808, 7.236, 10.193).
coef(res.neagging.lmgce)
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0391383 -0.1591743 0.6400220 2.8079923 7.2358547 10.1933880
The estimated coefficients, when \(1000\) models are aggregated, are \(\widehat{\boldsymbol{\beta}}^{neagging}=\) (1.016, -0.265, 2.184, 3.321, 8.112, 10.912).
coef(res.neagging.lmgce, which = ncol(res.neagging.lmgce$matrix))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0156715 -0.2654051 2.1841604 3.3214449 8.1118787 10.9122498
Although the in-sample error is higher in the neagging approach, in some scenarios an out-of-sample improvement can be observed. Consider a data set with the same structure as dataGCE, generated by setting seed = 240863. The out-of-sample RMSE is smaller using the neagging model.
tsbootgce objects always keep the bootstrap results in object$results$bootstrap, so we can call the function neagging setting only the object parameter.
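A minimal sketch (the assignment name res.neagging.tsbootgce is illustrative):

res.neagging.tsbootgce <- neagging(res.tsbootgce)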
In this case the in-sample error is lower in the neagging approach.
This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.