Advanced Linear Model

Ivan Svetunkov

2018-09-10

ALM stands for “Advanced Linear Model”. It is not as advanced as it sounds, but it has some advantages over the basic LM, while retaining its core features. In some sense alm() resembles the glm() function from stats package, but with a higher focus on forecasting rather than on hypothesis testing. You will not get p-values anywhere from the alm() function and won’t see \(R^2\) in the outputs. The most you can count on is confidence intervals for the parameters or for the regression line. The other important difference from glm() is the availability of distributions that are not supported by glm() (for example, Folded Normal or Chi Squared distributions).

The core of the function is the likelihood approach: the parameters of the model are estimated via the maximisation of the likelihood function of a selected distribution. The standard errors are calculated based on the Hessian of the respective distribution function. And at the centre of all of that are the information criteria, which can be used for model comparison.

All the supported distributions have specific functions which form the following four groups for the distribution parameter in alm():

  1. General continuous density functions,
  2. Continuous density functions for positive data,
  3. Discrete density functions,
  4. Cumulative functions for binary variables.

All of them rely on respective d- and p- functions in R. For example, Log Normal distribution uses dlnorm() function from stats package.
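
For example, a minimal call can be sketched as follows (the mtcars dataset is used here purely for illustration; mpg is strictly positive, so Log Normal distribution is applicable):

library(greybox)
# Fit a model with Log Normal distribution, which relies on dlnorm() under the hood
almModel <- alm(mpg~wt+hp, mtcars, distribution="dlnorm")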

The alm() function also supports the occurrence parameter, which allows modelling the non-zero values and the occurrence of non-zeroes as two different models. The combination of any distribution from (1)–(3) for the non-zero values and a distribution from (4) for the occurrence will result in a mixture distribution model, e.g. a mixture of Log-Normal and Cumulative Logistic, or a Hurdle Poisson (with Cumulative Normal for the occurrence part).

Every model produced using alm() can be represented as: \[\begin{equation} \label{eq:basicALM} y_t = f(\mu_t, \epsilon_t) = f(x_t' B, \epsilon_t) , \end{equation}\] where \(y_t\) is a value of the response variable, \(x_t\) is a vector of exogenous variables, \(B\) is a vector of the parameters, \(\mu_t\) is the conditional mean (produced based on the exogenous variables and the parameters of the model), \(\epsilon_t\) is the error term on the observation \(t\) and \(f(\cdot)\) is a distribution function that does a transformation of the inputs into the output. In case of a mixture distribution the model becomes slightly more complicated: \[\begin{equation} \label{eq:basicALMMixture} \begin{matrix} y_t = o_t f(x_t' B, \epsilon_t) \\ o_t \sim \text{Bernoulli}(p_t) \\ p_t = g(z_t' A, \eta_t) \end{matrix}, \end{equation}\]

where \(o_t\) is a binary variable, \(p_t\) is the probability of occurrence, \(z_t\) is a vector of exogenous variables, \(A\) is a vector of parameters and \(\eta_t\) is the error term for \(p_t\).

The alm() function returns, along with the set of variables common for lm() (such as coefficients and fitted.values), the variable mu, which corresponds to the conditional mean used inside the distribution, and scale – the second parameter, which usually corresponds to the standard error or the dispersion parameter. The values of these two variables vary from distribution to distribution. Note, however, that the model variable returned by the lm() function was renamed into data in alm(), and that alm() does not return terms and the QR decomposition.

Given that the parameters of any model in alm() are estimated via likelihood, it can be assumed that they are asymptotically normally distributed; thus the confidence intervals for any model rely on normality and are constructed based on the unbiased estimate of the variance, extracted using the sigma() function.
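
For instance, these elements can be extracted with the standard methods (continuing the illustrative mtcars example from above):

# Bias-corrected estimate of the standard deviation
sigma(almModel)
# Covariance matrix of the parameters
vcov(almModel)
# Confidence intervals for the parameters, relying on asymptotic normality
confint(almModel)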

The covariance matrix of the parameters is in almost all cases calculated as the inverse of the Hessian of the respective distribution function. The exceptions are the Normal, Log-Normal, Cumulative Logistic and Cumulative Normal distributions, which use analytical solutions.

Although the basic principles of estimating the models and producing predictions from them are the same for all the distributions, each of the distributions has its own features. So it makes sense to discuss them individually. We discuss the distributions in the four groups mentioned above.

General continuous density distributions

This group of functions includes:

  1. Normal distribution,
  2. Laplace distribution,
  3. Logistic distribution,
  4. S distribution.

Normal distribution

The density of normal distribution is: \[\begin{equation} \label{eq:Normal} f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) , \end{equation}\]

where \(\sigma^2\) is the variance of the error term.

alm() with Normal distribution (distribution="dnorm") is equivalent to the lm() function from stats package and returns roughly the same estimates of parameters, so if you are concerned with the time of calculation, I would recommend reverting to lm().
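
As a quick sanity check, here is a sketch comparing the two functions on hypothetical simulated data; the estimates should roughly coincide:

# Simulate data from a linear model with Normal error term
set.seed(42)
normData <- data.frame(x1=rnorm(100,10,2), x2=rnorm(100,20,5))
normData$y <- 5 + 2*normData$x1 - 0.5*normData$x2 + rnorm(100,0,3)

modelNormal <- alm(y~x1+x2, normData, distribution="dnorm")
modelLM <- lm(y~x1+x2, normData)
coef(modelNormal)
coef(modelLM)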

Maximising the likelihood of the model is equivalent to the estimation of the basic linear regression using Least Squares method: \[\begin{equation} \label{eq:linearModel} y_t = \mu_t + \epsilon_t = x_t' B + \epsilon_t, \end{equation}\]

where \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\).

The variance \(\sigma^2\) is estimated in alm() based on likelihood: \[\begin{equation} \label{eq:sigmaNormal} \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 , \end{equation}\]

where \(T\) is the sample size. Its square root (the standard deviation) is used in the calculations of the dnorm() function, and the value is then returned via the scale variable. This value does not include the bias correction. However, the sigma() method applied to the resulting model returns the bias-corrected version of the standard deviation. And vcov(), confint(), summary() and predict() rely on the value extracted by sigma().
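
To illustrate the difference, here is a sketch using the hypothetical modelNormal from above; the exact degrees of freedom correction used by sigma() is an assumption here:

# Likelihood-based estimate of the standard deviation (no bias correction)
modelNormal$scale
# Bias-corrected standard deviation, used by vcov(), confint() and predict()
sigma(modelNormal)
# This should correspond to the correction by the number of estimated parameters
sqrt(sum(resid(modelNormal)^2) / (nobs(modelNormal) - nparam(modelNormal)))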

\(\mu_t\) is returned as is in the mu variable, and the fitted values are set equal to mu.

Laplace distribution

Laplace distribution has some similarities with the Normal one: \[\begin{equation} \label{eq:Laplace} f(y_t) = \frac{1}{2 b} \exp \left( -\frac{\left| y_t - \mu_t \right|}{b} \right) , \end{equation}\] where \(b\) is the scale parameter, which, when estimated via likelihood, is equal to the mean absolute error: \[\begin{equation} \label{eq:bLaplace} b = \frac{1}{T} \sum_{t=1}^T \left| y_t - \mu_t \right| . \end{equation}\]

Thus maximising the likelihood is equivalent to estimating the linear regression via the minimisation of \(b\). When estimating a model this way, the assumption imposed on the error term is \(\epsilon_t \sim \text{Laplace}(0, b)\). The main difference of Laplace from the Normal distribution is its fatter tails.

alm() function with distribution="dlaplace" returns mu equal to \(\mu_t\) and the fitted values equal to mu. \(b\) is returned in the scale variable. The prediction intervals are derived from the quantiles of Laplace distribution after transforming the conditional variance into the conditional scale parameter \(b\) using the connection between the two in Laplace distribution: \[\begin{equation} \label{eq:bLaplaceAndSigma} b = \sqrt{\frac{\sigma^2}{2}}. \end{equation}\]

Logistic distribution

The density function of Logistic distribution is: \[\begin{equation} \label{eq:Logistic} f(y_t) = \frac{\exp \left(- \frac{y_t - \mu_t}{s} \right)} {s \left( 1 + \exp \left(- \frac{y_t - \mu_t}{s} \right) \right)^{2}}, \end{equation}\] where \(s\) is a scale parameter, which is estimated in alm() based on the connection between the parameter and the variance in the logistic distribution: \[\begin{equation} \label{eq:sLogisticAndSigma} s = \sigma \sqrt{\frac{3}{\pi^2}}. \end{equation}\]

Once again, the maximisation of the likelihood implies the estimation of the linear model, where \(\epsilon_t \sim \text{Logistic}(0, s)\).

Logistic is considered a fat tailed distribution, but its tails are not as fat as those of Laplace: the kurtosis of standard Logistic is 4.2, while in the case of Laplace it is 6.

alm() function with distribution="dlogis" returns \(\mu_t\) in mu and in fitted.values variables, and \(s\) in the scale variable. Similar to Laplace distribution, the prediction interevals use the connection between the variance and scale, and rely on the qlogis function.

S distribution

The S distribution has the following density function: \[\begin{equation} \label{eq:S} f(y_t) = \frac{1}{4b^2} \exp \left( -\frac{\sqrt{|y_t - \mu_t|}}{b} \right) , \end{equation}\] where \(b\) is a scale parameter. If estimated via maximum likelihood, the scale parameter is equal to: \[\begin{equation} \label{eq:bS} b = \frac{1}{T} \sum_{t=1}^T \sqrt{\left| y_t - \mu_t \right|} , \end{equation}\]

which corresponds to the minimisation of “Half Absolute Error” or “Half Absolute Moment”, which is equal to \(2b\).

S distribution has a kurtosis of 25.2, which makes it an “extreme excess” distribution. It might be useful in cases of randomly occurring incidents and extreme values.

alm() function with distribution="ds" returns \(\mu_t\) in the same variables mu and fitted.values, and \(b\) in the scale variable. Similarly to the previous functions, the prediction intervals are based on the qs() function from greybox package and use the connection between the scale and the variance: \[\begin{equation} \label{eq:bSAndSigma} b = \left( \frac{\sigma^2}{120} \right) ^{\frac{1}{4}}. \end{equation}\]

For all the functions in this category the resid() method returns \(e_t = y_t - \mu_t\).

Continuous density functions for positive data

This group includes:

  1. Log Normal distribution,
  2. Folded Normal distribution,
  3. Noncentral Chi Squared distribution.

Although (2) and (3) in theory allow having zeroes in the data, the density function is equal to zero at any specific point, so the likelihood becomes zero in these cases as well. The alm() function will still return some solutions for these distributions, but don’t expect anything good. As for (1), it supports strictly positive data only.

Log Normal distribution

Log Normal distribution appears when a normally distributed variable is exponentiated. This means that if \(x \sim \mathcal{N}(\mu, \sigma^2)\), then \(\exp x \sim \text{log}\mathcal{N}(\mu, \sigma^2)\). The density function of Log Normal distribution is: \[\begin{equation} \label{eq:LogNormal} f(y_t) = \frac{1}{y_t \sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(\log y_t - \mu_t \right)^2}{2 \sigma^2} \right) , \end{equation}\] where the variance estimated using likelihood is: \[\begin{equation} \label{eq:sigmaLogNormal} \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\log y_t - \mu_t \right)^2 . \end{equation}\] Estimating the model with Log Normal distribution is equivalent to estimating the parameters of the log-linear model: \[\begin{equation} \label{eq:logLinearModel} \log y_t = \mu_t + \epsilon_t, \end{equation}\] where \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) or: \[\begin{equation} \label{eq:logLinearModelExp} y_t = \exp(\mu_t + \epsilon_t). \end{equation}\]

alm() with distribution="dlnorm" does not transform the provided data and estimates the density directly using dlnorm() function with the estimated mean \(\mu_t\) and the variance . If you need a log-log model, then you would need to take logarithms of the external variables. The \(\mu_t\) is returned in the variable mu, the \(\sigma^2\) is in the variable scale, while the fitted.values contains the exponent of \(\mu_t\), which, given the connection between the Normal and Log Normal distributions, corresponds to median of distribution rather than mean. Finally, resid() method returns \(e_t = \log y_t - \mu_t\).

Folded Normal distribution

Folded Normal distribution is obtained when the absolute value of a normally distributed variable is taken: if \(x \sim \mathcal{N}(\mu, \sigma^2)\), then \(|x| \sim \text{folded }\mathcal{N}(\mu, \sigma^2)\). The density function is: \[\begin{equation} \label{eq:foldNormal} f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \left( \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) + \exp \left( -\frac{\left(y_t + \mu_t \right)^2}{2 \sigma^2} \right) \right). \end{equation}\] The conditional mean and variance of Folded Normal are estimated in alm() (with distribution="dfnorm") similarly to how this is done for the Normal distribution. They are returned in the variables mu and scale respectively. In order to produce the fitted value (which is returned in fitted.values), the following correction is done: \[\begin{equation} \label{eq:foldNormalFitted} \hat{y}_t = \sqrt{\frac{2}{\pi}} \sigma \exp \left( -\frac{\mu_t^2}{2 \sigma^2} \right) + \mu_t \left(1 - 2 \Phi \left(-\frac{\mu_t}{\sigma} \right) \right), \end{equation}\]

where \(\Phi(\cdot)\) is the CDF of Normal distribution.

The conditional variance of the forecasts is calculated based on the elements of vcov() (as in all the other functions), the predicted values are corrected in the same way as the fitted values above, and the prediction intervals are generated from the qfnorm() function of greybox package. As for the residuals, the resid() method returns \(e_t = y_t - \mu_t\).
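
A sketch verifying the correction of the fitted values on hypothetical data; it assumes that scale holds the standard deviation \(\sigma\) (take the square root first if it holds the variance):

# Simulate Folded Normal data by taking absolute values
set.seed(42)
fnData <- data.frame(x=rnorm(100,10,2))
fnData$y <- abs(5 - 0.5*fnData$x + rnorm(100,0,3))

modelFN <- alm(y~x, fnData, distribution="dfnorm")
mu <- modelFN$mu
sigmaFN <- modelFN$scale # assumed to be the standard deviation
# Apply the correction from the equation above and compare with fitted()
yHat <- sqrt(2/pi)*sigmaFN*exp(-mu^2/(2*sigmaFN^2)) + mu*(1-2*pnorm(-mu/sigmaFN))
head(cbind(fitted(modelFN), yHat))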

Noncentral Chi Squared distribution

Noncentral Chi Squared distribution arises when a normally distributed variable with a unity variance is squared and summed up: if \(x_i \sim \mathcal{N}(\mu_i, 1)\), then \(\sum_{i=1}^k x_i^2 \sim \chi^2(k, \lambda)\), where \(k\) is the number of degrees of freedom and \(\lambda = \sum_{i=1}^k \mu_i^2\). In the case of non-unity variance, with \(z_i \sim \mathcal{N}(\mu_i, \sigma^2)\), the variable can also be represented as \(z_i = \sigma x_i\), and then it can be assumed that \(\sum_{i=1}^k z_i^2 \sim \chi^2(k \sigma, \lambda)\). In the perfect world, \(\lambda_t\) would correspond to the location of the original distribution of \(z_i\), while \(k\) would need to be time varying and would need to include both the number of elements \(k\) and the individual variances \(\sigma^2_i\) for each of the elements, depending on the values of the external variables. However, given that the squares of the normal data are used, it is not possible to disaggregate the values into the original two parts. Thus we assume that the variance is constant for all the cases, and estimate it using likelihood. As a result, the non-centrality parameter covers the two parts that would be split in the ideal world.

The density function of Noncentral Chi Squared distribution is quite complicated. alm() uses the dchisq() function from stats package, assuming a constant number of degrees of freedom \(k\) and a time varying noncentrality parameter \(\lambda_t\): \[\begin{equation} \label{eq:NCChiSquared} f(y_t) = \frac{1}{2} \exp \left( -\frac{y_t + \lambda_t}{2} \right) \left(\frac{y_t}{\lambda_t} \right)^{\frac{k}{4}-0.5} I_{\frac{k}{2}-1}(\sqrt{\lambda_t y_t}), \end{equation}\] where \(I_k(x)\) is a Bessel function of the first kind. The \(\lambda_t\) parameter is estimated from a regression with exogenous variables: \[\begin{equation} \label{eq:lambdaValue} \lambda_t = \exp ( x_t' B ) , \end{equation}\]

where \(\exp\) is taken in order to make \(\lambda_t\) strictly positive, while \(k\) is estimated directly by maximising the likelihood. In order to avoid negative values of \(k\), its absolute value is used. \(\lambda_t\) is then returned in the variable mu, while \(k\) is returned in scale. Finally, fitted.values returns \(\lambda_t + k\). A similar correction is done in the predict() function. As for the prediction intervals, they are generated using the qchisq() function from stats package. Last but not least, the resid() method returns \(e_t = \log y_t - \log \mu_t\).
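
A sketch of fitting this model to simulated Noncentral Chi Squared data:

# Simulate data with a time varying noncentrality parameter
set.seed(42)
chisqData <- data.frame(x=rnorm(100,3,1))
chisqData$y <- rchisq(100, df=2, ncp=exp(0.2+0.3*chisqData$x))

modelChisq <- alm(y~x, chisqData, distribution="dchisq")
# lambda_t is returned in mu, the estimated degrees of freedom in scale
head(modelChisq$mu)
modelChisq$scale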

Discrete density functions

This group includes:

  1. Poisson distribution,
  2. Negative Binomial distribution.

These distributions should be used in cases of count data.

Poisson distribution

Poisson distribution used in ALM has the following standard probability mass function: \[\begin{equation} \label{eq:Poisson} P(X=y_t) = \frac{\lambda_t^{y_t} \exp(-\lambda_t)}{y_t!}, \end{equation}\]

where \(\lambda_t = \mu_t = \sigma^2_t = \exp(x_t' B)\). As can be noticed, here we assume that the variance of the model varies in time and depends on the values of the exogenous variables, which is a specific case of heteroscedasticity. The exponent of \(x_t' B\) is needed in order to avoid negative values of \(\lambda_t\).

alm() with distribution="dpois" returns mu, fitted.values and scale equal to \(\lambda_t\). The qunatiles of distribution in predict() method are generated using qpois() function from stats package. Finally, the returned residuals correspond to \(\log y_t - \log \mu_t\), which is not really helpful or meaningful…

Negative Binomial distribution

Negative Binomial distribution implemented in alm() is parameterised in terms of mean and variance: \[\begin{equation} \label{eq:NegBin} P(X=y_t) = \binom{y_t+\frac{\mu_t^2}{\sigma^2-\mu_t}-1}{y_t} \left( \frac{\sigma^2 - \mu_t}{\sigma^2} \right)^{y_t} \left( \frac{\mu_t}{\sigma^2} \right)^\frac{\mu_t^2}{\sigma^2 - \mu_t}, \end{equation}\]

where \(\mu_t = \exp(x_t' B)\) and \(\sigma^2\) is estimated separately in the optimisation process. These values are then used in the dnbinom() function in order to calculate the log-likelihood based on the distribution function.

alm() with distribution="dnbinom" returns \(\mu_t\) in mu and fitted.values and \(\sigma^2\) in scale. The prediction intervals are produces using qnbinom() function. Similarly to Poisson distribution, resid() method returns \(\log y_t - \log \mu_t\).

Cumulative functions for binary variables

The final class of models includes two cases:

  1. Logistic distribution (logit model),
  2. Normal distribution (probit model).

In both of them it is assumed that the response variable is binary and can be either zero or one. The main idea for this class of models is to use a transformation of the original data and link a continuous latent variable with a binary one. As a reminder, all the models eventually assume that: \[\begin{equation} \label{eq:basicALMCumulative} \begin{matrix} o_t \sim \text{Bernoulli}(p_t) \\ p_t = g(x_t' A, \eta_t) \end{matrix}, \end{equation}\]

where \(o_t\) is the binary response variable and \(g(\cdot)\) is the cumulative distribution function. Given that we work with the probability of occurrence, the predict() method produces forecasts for the probability of occurrence rather than the binary variable itself. Finally, although many other cumulative distribution functions can be used for this transformation (e.g. plaplace() or plnorm()), the most popular ones are logistic and normal CDFs.

Given that the binary variable has Bernoulli distribution, its log-likelihood is: \[\begin{equation} \label{eq:BernoulliLikelihood} \ell(p_t | o_t) = \sum_{o_t=1} \log p_t + \sum_{o_t=0} \log(1 - p_t), \end{equation}\]

So the estimation of parameters for all the CDFs can be done by maximising this likelihood.

In all the functions it is assumed that there is an actual level \(q_t\) that underlies the probability \(p_t\). This level can be modelled as: \[\begin{equation} \label{eq:CDFLevelALM} q_t = \nu_t + \eta_t , \end{equation}\]

and it can be transformed to the probability with \(p_t = g(q_t)\). So the aim of all the functions is to estimate the expectation \(\nu_t\) and transform it to the estimate of the probability \(\hat{p}_t\).

In order to estimate the error \(\eta_t\), we assume that \(o_t=1\) happens mainly when the respective estimated probability \(\hat{p}_t\) is close to one as well. Based on that, the error can be calculated as: \[\begin{equation} \label{eq:BinaryError} u_t' = o_t - \hat{p}_t . \end{equation}\] However, this error is not useful on its own and should be transformed into the scale of the underlying unobserved variable \(q_t\). Given that \(o_t \in \{0, 1\}\) and \(\hat{p}_t \in (0, 1)\), the error will lie in \((-1, 1)\). We therefore standardise it so that it lies in the region of \((0, 1)\): \[\begin{equation} \label{eq:BinaryErrorBounded} u_t = \frac{u_t' + 1}{2} = \frac{o_t - \hat{p}_t + 1}{2}. \end{equation}\]

This transformation means that, when \(o_t=\hat{p}_t\), then the error \(u_t=0.5\), when \(o_t=1\) and \(\hat{p}_t=0\) then \(u_t=1\) and finally, in the opposite case of \(o_t=0\) and \(\hat{p}_t=1\), it is \(u_t=0\). After that this error is transformed using either Logistic or Normal quantile generation function into the scale of \(q_t\), making sure that the case of \(u_t=0.5\) corresponds to zero, the \(u_t>0.5\) corresponds to the positive and \(u_t<0.5\) corresponds to the negative errors.
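
A small numerical illustration of this transformation (the values of \(o_t\) and \(\hat{p}_t\) here are arbitrary):

ot <- c(1, 0, 1)       # observed binary values
pt <- c(0.8, 0.3, 0.5) # hypothetical estimated probabilities
ut <- (ot - pt + 1)/2  # errors standardised into (0, 1)
# Transform into the scale of q_t; u_t = 0.5 maps to zero in both cases
qlogis(ut) # for the logit model
qnorm(ut)  # for the probit model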

Cumulative Logistic distribution

We have previously discussed the density function of logistic distribution. The standardised cumulative distribution function used in alm() is: \[\begin{equation} \label{eq:LogisticCDFALM} \hat{p}_t = \frac{1}{1+\exp(-\nu_t)}, \end{equation}\] where \(\nu_t = x_t' A\) is the conditional mean of the level, underlying the probability. This value is then used in the likelihood in order to estimate the parameters of the model. The error term of the model is calculated using the formula: \[\begin{equation} \label{eq:LogisticError} e_t = \log \left( \frac{u_t}{1 - u_t} \right) = \log \left( \frac{1 + o_t (1 + \exp(\nu_t))}{1 + \exp(\nu_t) (2 - o_t) - o_t} \right). \end{equation}\]

This way the error varies from \(-\infty\) to \(\infty\) and is equal to zero, when \(u_t=0.5\). The error is assumed to be normally distributed (because… why not?).

The alm() function with distribution="plogis" returns \(\nu_t\) in mu, standard deviation, calculated using the respective errors in scale and the probability \(\hat{p}_t\) based on in fitted.values. resid() method returns the errors discussed above. predict() method produces point forecasts and the intervals for the probability of occurrence. The intervals use the assumption of normality of the error term, generating respective quantiles (based on the estimated \(\nu_t\) and variance of the error) and then transforming them into the scale of probabitliy using Logistic CDF.

Cumulative Normal distribution

The case of cumulative Normal distribution is quite similar to the cumulative Logistic one. The transformation is done using the standard Normal CDF: \[\begin{equation} \label{eq:NormalCDFALM} \hat{p}_t = \Phi(\nu_t) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\nu_t} \exp \left(-\frac{1}{2}x^2 \right) dx , \end{equation}\] where \(\nu_t = x_t' A\). Similarly to the Logistic CDF, the estimated probability is used in the likelihood in order to estimate the parameters of the model. The error term is calculated using the standardised quantile function of the Normal distribution: \[\begin{equation} \label{eq:NormalError} e_t = \Phi^{-1} \left(\frac{o_t - \hat{p}_t + 1}{2}\right) . \end{equation}\]

It acts similarly to the error from the Logistic distribution, but is based on a different function. Once again, we assume that the error follows the Normal distribution.

Similar to the Logistic CDF, the alm() function with distribution="pnorm" returns \(\nu_t\) in mu, the standard deviation, calculated based on the errors, in scale, and the probability \(\hat{p}_t\) in fitted.values. The resid() method returns the errors discussed above. The predict() method produces point forecasts and the intervals for the probability of occurrence. The intervals use the assumption of normality of the error term and are based on the same idea as in the Logistic CDF: the quantiles of the Normal distribution are generated (using the estimated mean and standard deviation) and then transformed using the standard Normal CDF.
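
The probit case can be sketched in the same way, reusing the hypothetical binData from above:

modelProbit <- alm(y~x, binData, distribution="pnorm")
modelProbitGLM <- glm(y~x, binData, family=binomial(link="probit"))
coef(modelProbit)
coef(modelProbitGLM)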

Mixture distribution models

Finally, mixture distribution models can be used in alm() by defining both the distribution and occurrence parameters. Currently only plogis() and pnorm() are supported for the occurrence variable, but all the other distributions discussed above can be used for the modelling of the non-zero values. If occurrence="plogis" or occurrence="pnorm", then alm() is fit two times: first, on the non-zero data only (defining the subset), and second, using the same data, substituting the response variable by the binary occurrence variable and specifying distribution=occurrence. As an alternative option, the occurrence alm() model can be estimated separately and then provided via the occurrence parameter.

As an example of mixture model, let’s generate some data:

# Generate two exogenous variables and a response with Laplace noise
# (rlaplace() comes from the greybox package)
library(greybox)
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")

# Zero out part of the response via a logistic transformation, imitating intermittent data
xreg[,1] <- round(exp(xreg[,1]-70) / (1 + exp(xreg[,1]-70)),0) * round(xreg[,1]-70)
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]

First, we estimate the occurrence model (it will complain that the response variable is not binary, but it will work):

modelOccurrence <- alm(y~x1+x2+Noise, inSample, distribution="plogis")
#> Warning: You have defined CDF 'plogis' as a distribution.
#> This means that the response variable needs to be binary with values of 0 and 1.
#> Don't worry, we will encode it for you. But, please, be careful next time!

And then use it for the mixture model:

modelMixture <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence)

The occurrence model is then returned in the respective variable:

summary(modelMixture)
#> Distribution used in the estimation: Mixture of Log Normal and Cumulative logistic
#> Coefficients:
#>             Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept)  9.10335    4.24971    0.13725    18.06945
#> x1           0.02334    0.03666   -0.05400     0.10069
#> x2          -0.04879    0.03134   -0.11491     0.01734
#> Noise       -0.01889    0.01354   -0.04745     0.00967
#> ICs:
#>      AIC     AICc      BIC     BICc 
#> 194.2757 185.0865 218.0959 197.9623
summary(modelMixture$occurrence)
#> Distribution used in the estimation: Cumulative logistic
#> Coefficients:
#>             Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 10.81879    3.55457    3.73773    17.89986
#> x1           0.20918    0.02673    0.15592     0.26243
#> x2          -0.26730    0.02108   -0.30930    -0.22530
#> Noise       -0.00268    0.01123   -0.02506     0.01970
#> ICs:
#>      AIC     AICc      BIC     BICc 
#> 76.33316 77.14397 88.24329 90.01979

After that we can produce forecasts using the data from the holdout sample:

predict(modelMixture,outSample,interval="p")
#>             Mean Lower 6.2% Upper 2.7%
#>  [1,] 1.60934145  1.3762149  11.417080
#>  [2,] 6.99625692  1.8245301  31.357194
#>  [3,] 0.79469044  1.3085726   7.913651
#>  [4,] 2.41790620  1.3622464  13.757168
#>  [5,] 1.34296141  1.4301730  10.671984
#>  [6,] 0.24397578  1.6037315   4.067957
#>  [7,] 0.80530373  1.4088255   8.085185
#>  [8,] 1.67616715  1.0238369  10.875520
#>  [9,] 0.88641058  1.8796485  10.230027
#> [10,] 1.01106647  1.4831829   9.944270
#> [11,] 0.54538151  1.5305443   6.738692
#> [12,] 1.33202263  1.4035909  10.409994
#> [13,] 3.55667405  1.5219621  18.217167
#> [14,] 0.89969756  0.9502512   7.622868
#> [15,] 0.04624123  0.0000000   0.000000
#> [16,] 0.10118268  0.0000000   0.000000
#> [17,] 1.17190079  1.3500388   9.655933
#> [18,] 0.20781397  1.7900268   3.704615
#> [19,] 3.47090721  1.2673846  16.424860
#> [20,] 1.96304825  1.7003650  14.751814