The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Sample Selection Models for Panel Data

Jing Peng

School of Business, University of Connecticut
jing.peng@uconn.edu

2025-10-21

1. Introduction

The PanelSelect package supports a family of panel sample selection models. The models extend the classic Heckman selection framework to panel data with individual random effects, enabling more robust and flexible correction for non-random participation and attrition in panel settings.

The general structure involves a two-stage setup:

The package allows two distinct sources of selection bias:

These two types of selection can be activiated separately or jointly, providing flexibility for empirical applications.

The likelihood functions for these models are detailed in Bailey and Peng (2025) and Peng and Van den Bulte (2024). Because they lack closed-form solutions, likelihoods are approximated numerically via Gaussian–Hermite quadrature. Two types of standard errors are available: (1) Hessian-based standard errors and (2) BHHH (Berndt–Hall–Hall–Hausman) standard errors.

Table 1 lists the supported models. Panel count models are supported through a companion package PanelCount.

Table 1. Panel Sample Selection Models
Model First Stage Second Stage Outcome Variable Package
probitRE_linearRE probitRE linearRE linear PanelSelect
probitRE_probitRE probitRE ProbitRE binary PanelSelect
probitRE_PoissonRE probitRE PoissonRE count PanelCount
probitRE_PLNRE probitRE PLN_RE count PanelCount

2. Models

Let \(\mathbf{w}_{it}\) and \(\mathbf{x}_{it}\) represent the row vectors of covariates in the selection and outcome equations, respectively, with \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) denoting the corresponding column vectors of parameters. Define \(d_{it}\) as a binary participation or response indicator, and \(y_{it}\) as the outcome variable, which is observed only when \(d_{it} = 1\). Individuals are indexed by \(i = 1, \ldots, N\), and time periods by \(t = 1, \ldots, T\). The models supported by the package are summarized below, where “RE” denotes the inclusion of individual random effects.

2.1. probitRE_linearRE

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (LinearRE):

\[y_{it}^* = \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \sigma\epsilon_{it}\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \]

Here, \(\delta\), \(\lambda\), and \(\sigma\) are positive parameters capturing the standard deviation of random effects and the second-stage error. The variance of the first-stage error is normalized to one to ensure unique identification of model parameters, consistent with the convention in Probit models.

2.2. probitRE_probitRE

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (LinearRE):

\[y_{it}^* = 1( \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \epsilon_{it} > 0 )\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \]

Since both stages are Probit models, the variances of error terms in both stages are fixed to one to ensure model identification.

2.3. probitRE_PoissonRE (imported from PanelCount)

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (PoissonRE): \[E[y_{it}|x_{it},v_i] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i)\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \] In the second stage, the random effect \(v_i\) accounts for unobserved heterogeneity and potential overdispersion in the count outcome. Because the Poisson distribution intrinsically models conditional variance, no additional idiosyncratic error term is introduced in this specification.

2.4. probitRE_PLNRE (imported from PanelCount)

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (PLN_RE): \[E[y_{it}|x_{it},v_i,\epsilon_{it}] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i + \gamma\epsilon_{it})\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \] This model introduces an additional random effect at the individual–time level, thereby allowing selection to occur at both the individual and individual–time levels. However, \(\epsilon_{it}\) should be interpreted as a random effect rather than as an idiosyncratic error term.

3. Examples

After loading the PanelSelect package, type “example(model_name)” to see sample code for each model. The code below runs the probitRE_linearRE model on a simulated dataset with the following data generating process (DGP):

\[d_{it} = 1( x_{it} + u_i + \varepsilon_{it} > 0 )\]

\[y_{it}^* = x_{it} + v_i + \epsilon_{it}\]

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \right). \]

library(PanelSelect)
library(MASS)
N = 500
period = 5
obs = N*period
rho = -0.5
tau = 0.5
set.seed(100)

re = mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1), nrow=2))
u = rep(re[,1], each=period)
v = rep(re[,2], each=period)
e = mvrnorm(obs, mu=c(0,0), Sigma=matrix(c(1,tau,tau,1), nrow=2))
e1 = e[,1]
e2 = e[,2]

t = rep(1:period, N)
id = rep(1:N, each=period)
w = rnorm(obs)
z = rnorm(obs)
x = rnorm(obs)
d = as.numeric(x + w + u + e1 > 0)
y = x + w + v + e2
y[d==0] = NA
dt = data.frame(id, t, y, x, w, z, d)

# As N increases, the parameter estimates will be more accurate
m = probitRE_linearRE(d~x+w, y~x+w, 'id', dt, H=10, verbose=-1)
print(m$estimates, digits=4)
#>                    estimate      se       z         p      lci      uci
#> probit.(Intercept) -0.01675 0.05586 -0.2998 7.643e-01 -0.12623  0.09274
#> probit.x            0.96137 0.05128 18.7461 0.000e+00  0.86086  1.06188
#> probit.w            1.04892 0.05324 19.7033 0.000e+00  0.94458  1.15326
#> linear.(Intercept)  0.03536 0.15517  0.2279 8.197e-01 -0.26877  0.33950
#> linear.x            0.94444 0.06366 14.8362 0.000e+00  0.81967  1.06921
#> linear.w            0.92165 0.06963 13.2359 0.000e+00  0.78518  1.05813
#> delta               0.97825 0.07005 13.9653 0.000e+00  0.85015  1.12564
#> lambda              1.01392 0.05469 18.5410 0.000e+00  0.91221  1.12697
#> sigma               1.02086 0.03570 28.5923 0.000e+00  0.95322  1.09329
#> rho                -0.46251 0.07913 -5.8452 5.060e-09 -0.60296 -0.29425
#> tau                 0.44739 0.15957  2.8038 5.051e-03  0.09018  0.70261

The resulting estimates closely match the true parameter values, confirming the model’s effectiveness in recovering the underlying parameters.

4. Notes

This section highlights several practical considerations while using this package.

Citations

Bailey, M., & Peng, J. (2025). A Random Effects Model of Non-Ignorable Nonresponse in Panel Survey Data. Available at SSRN https://www.ssrn.com/abstract=5475626

Peng, J., & Van den Bulte, C. (2024). Participation vs. Effectiveness in Sponsored Tweet Campaigns: A Quality-Quantity Conundrum. Management Science, 70(11):7345-8215. Available at https://doi.org/10.1287/mnsc.2019.01897

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.