Although the most common situation is the absence of prior information on \(\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})\), in some particular cases pre-sample information exists in the form of \(\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})\). This distribution \(\mathbf{q}\) can be used as an initial hypothesis and incorporated into the consistency relations of the maximum entropy formalism. Kullback and Leibler [1] defined the cross-entropy (CE) between \(\mathbf{p}\) and \(\mathbf{q}\) as
\[\begin{align} I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p_k} \ln \left(\mathbf{p_k}/\mathbf{q_k}\right). \end{align}\]
\(I(\mathbf{p},\mathbf{q})\) measures the discrepancy between the \(\mathbf{p}\) and \(\mathbf{q}\) distributions. It is non-negative and equals zero when \(\mathbf{p}=\mathbf{q}\). Thus, according to the principle of minimum cross-entropy [2,3], one should choose probabilities that are as close as possible to the prior probabilities.
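As a quick numerical illustration (the helper cross_entropy below is a hypothetical sketch, not a GCEstim function), \(I(\mathbf{p},\mathbf{q})\) is positive whenever \(\mathbf{p}\) deviates from the prior and zero when the two distributions coincide:
# Hypothetical helper implementing the cross-entropy definition above
cross_entropy <- function(p, q) sum(p * log(p / q))

p <- c(0.1, 0.2, 0.7)
q <- rep(1/3, 3) # uniform prior
cross_entropy(p, q)
#> [1] 0.2967937
cross_entropy(q, q)
#> [1] 0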
Given the above, for the reparameterized linear regression
model, \[\begin{equation}
\mathbf{y}=\mathbf{XZp} + \mathbf{Vw},
\end{equation}\]
the Generalized Cross Entropy (GCE) estimator is given by
\[\begin{equation}
\hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) =
\underset{\mathbf{p},\mathbf{w}}{\operatorname{argmin}}
\left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\},
\end{equation}\]
subject to the same model constraints as the GME estimator (see “Generalized Maximum Entropy
framework”).
Using summation notation, the minimization problem can be rewritten as follows: \[\begin{align} &\text{minimize} & I(\mathbf{p},\mathbf{q},\mathbf{w},\mathbf{u}) &=\sum_{k=0}^{K}\sum_{m=1}^M p_{km}\ln(p_{km}/q_{km}) +\sum_{n=1}^N\sum_{j=1}^J w_{nj}\ln(w_{nj}/u_{nj}) \\ &\text{subject to} & y_n &= \sum_{k=0}^{K}\sum_{m=1}^M x_{nk}z_{km}p_{km} + \sum_{j=1}^J v_{nj}w_{nj}, \forall n \\ & & \sum_{m=1}^M p_{km} &= 1, \forall k\\ & & \sum_{j=1}^J w_{nj} &= 1, \forall n. \end{align}\]
The Lagrangian equation \[\begin{equation}
\mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) +
\mathbf{w}' \ln \left(\mathbf{w/u}\right) +
\boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} -
\mathbf{Vw} \right) + \boldsymbol{\theta}'\left(
\mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p}
\right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N
\otimes \mathbf{1}'_J)\mathbf{w}\right)
\end{equation}\]
can be used to find the interior solution, where \(\boldsymbol{\lambda}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\tau}\) are the associated \((N\times 1)\), \(((K+1)\times 1)\), and \((N\times 1)\) vectors of Lagrange multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order
conditions yields the solutions for \(\mathbf{\hat p}\) and \(\mathbf{\hat w}\)
\[\begin{equation} \hat p_{km} = \frac{q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}{\sum_{m'=1}^M q_{km'}\exp\left(-z_{km'}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)} \end{equation}\] and \[\begin{equation} \hat w_{nj} = \frac{u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}{\sum_{j'=1}^J u_{nj'}\exp\left(-\hat\lambda_n v_{nj'}\right)}. \end{equation}\]
Note that when the prior distributions \(\mathbf{q}\) and \(\mathbf{u}\) are uniform, the priors cancel in the expressions above, and maximum entropy and minimum cross-entropy produce the same results.
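The solutions above amount to a prior-weighted softmax. A minimal sketch follows (the function gce_probs and the value of eta are assumptions for illustration, not part of GCEstim); it forms \(\mathbf{\hat p_k}\) for a single \(k\) and shows that a uniform prior reduces to the plain (GME) softmax:
# Illustrative only: prior-weighted softmax over the support points of one coefficient
# z: support points z_{km}; eta: the inner product sum_n lambda_n x_{nk}; q: prior q_k
gce_probs <- function(z, eta, q) {
  num <- q * exp(-z * eta)
  num / sum(num)
}
z <- seq(-100, 100, length.out = 5)
eta <- 0.01 # assumed value of the inner product
gce_probs(z, eta, rep(1/5, 5)) # uniform prior: the GME solution
gce_probs(z, eta, c(0.1, 0.1, 0.6, 0.1, 0.1)) # spiked prior pulls mass toward the center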
Consider dataGCE (see “Generalized Maximum Entropy framework”).
Again, under a “no a priori information” scenario for the parameters, one can assume that \(z_k^{upper}=100\), \(k\in\left\lbrace 0,\dots,5\right\rbrace\), is a “wide upper bound” for the signal support space. Using lmgce, a model can be fitted under either the GME or the GCE framework. If support.signal.points is an integer, a constant vector, or a constant matrix, a uniform distribution is assumed for \(\mathbf{q}\), and therefore the GME framework is being considered.
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GME coefficients are \(\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=\) (1.026, -0.155, 1.822, 3.319, 8.393, 11.467).
(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.0255630 -0.1552375 1.8221235 3.3194530 8.3932055 11.4670530
But if there is some information, for instance, on \(\beta_1\) and \(\beta_2\), it can be reflected in support.signal.points. Let us suppose that one suspects that \(\beta_1=\beta_2=0\). Since the support spaces are centered at zero, one can assign a higher probability to the support point at or around the center. One can set, for instance, \(\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1, 0.1)'\). support.signal.points accepts information on the distribution of probabilities in the form of a \((K+1)\times M\) matrix, where the first row corresponds to \(\mathbf{q_0}\), the second to \(\mathbf{q_1}\), and so on.
(support.signal.points.matrix <-
   matrix(
     c(rep(1/5, 5),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       c(0.1, 0.1, 0.6, 0.1, 0.1),
       rep(1/5, 5),
       rep(1/5, 5),
       rep(1/5, 5)),
     ncol = 5,
     byrow = TRUE))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.2 0.2 0.2 0.2 0.2
#> [2,] 0.1 0.1 0.6 0.1 0.1
#> [3,] 0.1 0.1 0.6 0.1 0.1
#> [4,] 0.2 0.2 0.2 0.2 0.2
#> [5,] 0.2 0.2 0.2 0.2 0.2
#> [6,] 0.2 0.2 0.2 0.2 0.2
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )
The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=\) (1.026, -0.143, 1.655, 3.228, 8.189, 11.269).
(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
#> (Intercept) X001 X002 X003 X004 X005
#> 1.026345 -0.143421 1.654828 3.227839 8.189040 11.269391
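For a side-by-side comparison of the two fits, the coefficient vectors can simply be stacked with base R (the values are those printed above):
rbind(GME = coef.res.lmgce.100.GME, GCE = coef.res.lmgce.100.GCE)
# Under the spiked prior, the estimates of beta_1 and beta_2 shrink toward zero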
The prediction errors are approximately equal (\(RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.407 and \(RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.407), as are the prediction cross-validation errors (\(CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.428 and \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.427).
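These prediction errors can be recomputed from the fits with GCEstim::accmeasure, assuming (as for most R regression classes, though not shown above) that lmgce objects support fitted():
# Assumes fitted() works on lmgce objects; both values should be close to 0.407
GCEstim::accmeasure(fitted(res.lmgce.100.GME), dataGCE$y, which = "RMSE")
GCEstim::accmeasure(fitted(res.lmgce.100.GCE), dataGCE$y, which = "RMSE")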
The precision error is lower for the GCE approach: \(RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}} \approx\) 1.595 and \(RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}} \approx\) 1.458.
(RMSE_beta.lmgce.100.GME <-
   GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
#> [1] 1.594821
(RMSE_beta.lmgce.100.GCE <-
   GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
#> [1] 1.457947
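The relative precision gain can be computed directly from the two values above:
# GCE reduces the coefficient RMSE by about 8.6% relative to GME
round(1 - RMSE_beta.lmgce.100.GCE / RMSE_beta.lmgce.100.GME, 3)
#> [1] 0.086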
If there were some information on the distribution of \(\mathbf{w}\), a similar analysis could be done using noise.signal.points.
In short, the minimum cross-entropy formalism allows prior weights to be specified, and, as illustrated above, a well-chosen prior can improve the precision of the estimates.
This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.