Sample Selection Models for Panel Data

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Jing Peng

2025-10-21

1. Introduction

The PanelSelect package supports a family of panel sample selection models. The models extend the classic Heckman selection framework to panel data with individual random effects, enabling more robust and flexible correction for non-random participation and attrition in panel settings.

The general structure involves a two-stage setup:

The first stage is a panel Probit model describing the participation or response decision.
The second stage can be a panel linear, Probit, Poisson, or Poisson log-normal model for the outcome.

The package allows two distinct sources of selection bias:

Selection on individual effects (i.e., correlation between random effects across equations)
Selection on idiosyncratic errors (i.e., contemporaneous correlation of error terms)

These two types of selection can be activiated separately or jointly, providing flexibility for empirical applications.

The likelihood functions for these models are detailed in Bailey and Peng (2025) and Peng and Van den Bulte (2024). Because they lack closed-form solutions, likelihoods are approximated numerically via Gaussian–Hermite quadrature. Two types of standard errors are available: (1) Hessian-based standard errors and (2) BHHH (Berndt–Hall–Hall–Hausman) standard errors.

Table 1 lists the supported models. Panel count models are supported through a companion package PanelCount.

**Table 1. Panel Sample Selection Models**
Model	First Stage	Second Stage	Outcome Variable	Package
probitRE_linearRE	probitRE	linearRE	linear	PanelSelect
probitRE_probitRE	probitRE	ProbitRE	binary	PanelSelect
probitRE_PoissonRE	probitRE	PoissonRE	count	PanelCount
probitRE_PLNRE	probitRE	PLN_RE	count	PanelCount

2. Models

Let \(\mathbf{w}_{it}\) and \(\mathbf{x}_{it}\) represent the row vectors of covariates in the selection and outcome equations, respectively, with \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) denoting the corresponding column vectors of parameters. Define \(d_{it}\) as a binary participation or response indicator, and \(y_{it}\) as the outcome variable, which is observed only when \(d_{it} = 1\). Individuals are indexed by \(i = 1, \ldots, N\), and time periods by \(t = 1, \ldots, T\). The models supported by the package are summarized below, where “RE” denotes the inclusion of individual random effects.

2.1. probitRE_linearRE

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (LinearRE):

\[y_{it}^* = \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \sigma\epsilon_{it}\]

Correlation structure:

Here, \(\delta\), \(\lambda\), and \(\sigma\) are positive parameters capturing the standard deviation of random effects and the second-stage error. The variance of the first-stage error is normalized to one to ensure unique identification of model parameters, consistent with the convention in Probit models.

2.2. probitRE_probitRE

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (LinearRE):

\[y_{it}^* = 1( \mathbf{x}_{it} \boldsymbol{\beta} + \lambda v_i + \epsilon_{it} > 0 )\]

Correlation structure:

Since both stages are Probit models, the variances of error terms in both stages are fixed to one to ensure model identification.

2.3. probitRE_PoissonRE (imported from PanelCount)

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (PoissonRE): \[E[y_{it}|x_{it},v_i] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i)\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \] In the second stage, the random effect \(v_i\) accounts for unobserved heterogeneity and potential overdispersion in the count outcome. Because the Poisson distribution intrinsically models conditional variance, no additional idiosyncratic error term is introduced in this specification.

2.4. probitRE_PLNRE (imported from PanelCount)

First stage (ProbitRE): \[d_{it} = 1( \mathbf{w}_{it} \boldsymbol{\alpha} + \delta u_i + \varepsilon_{it} > 0 )\]

Second stage (PLN_RE): \[E[y_{it}|x_{it},v_i,\epsilon_{it}] = exp(\mathbf{x}_{it} \boldsymbol{\beta} + \sigma v_i + \gamma\epsilon_{it})\]

Correlation structure:

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \tau \\ \tau & 1 \end{pmatrix} \right). \] This model introduces an additional random effect at the individual–time level, thereby allowing selection to occur at both the individual and individual–time levels. However, \(\epsilon_{it}\) should be interpreted as a random effect rather than as an idiosyncratic error term.

3. Examples

After loading the PanelSelect package, type “example(model_name)” to see sample code for each model. The code below runs the probitRE_linearRE model on a simulated dataset with the following data generating process (DGP):

\[d_{it} = 1( x_{it} + u_i + \varepsilon_{it} > 0 )\]

\[y_{it}^* = x_{it} + v_i + \epsilon_{it}\]

\[ \begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} \right), \quad \begin{pmatrix} \varepsilon_{it} \\ \epsilon_{it} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \right). \]

library(PanelSelect)
library(MASS)
N = 500
period = 5
obs = N*period
rho = -0.5
tau = 0.5
set.seed(100)

re = mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1), nrow=2))
u = rep(re[,1], each=period)
v = rep(re[,2], each=period)
e = mvrnorm(obs, mu=c(0,0), Sigma=matrix(c(1,tau,tau,1), nrow=2))
e1 = e[,1]
e2 = e[,2]

t = rep(1:period, N)
id = rep(1:N, each=period)
w = rnorm(obs)
z = rnorm(obs)
x = rnorm(obs)
d = as.numeric(x + w + u + e1 > 0)
y = x + w + v + e2
y[d==0] = NA
dt = data.frame(id, t, y, x, w, z, d)

# As N increases, the parameter estimates will be more accurate
m = probitRE_linearRE(d~x+w, y~x+w, 'id', dt, H=10, verbose=-1)
print(m$estimates, digits=4)
#>                    estimate      se       z         p      lci      uci
#> probit.(Intercept) -0.01675 0.05586 -0.2998 7.643e-01 -0.12623  0.09274
#> probit.x            0.96137 0.05128 18.7461 0.000e+00  0.86086  1.06188
#> probit.w            1.04892 0.05324 19.7033 0.000e+00  0.94458  1.15326
#> linear.(Intercept)  0.03536 0.15517  0.2279 8.197e-01 -0.26877  0.33950
#> linear.x            0.94444 0.06366 14.8362 0.000e+00  0.81967  1.06921
#> linear.w            0.92165 0.06963 13.2359 0.000e+00  0.78518  1.05813
#> delta               0.97825 0.07005 13.9653 0.000e+00  0.85015  1.12564
#> lambda              1.01392 0.05469 18.5410 0.000e+00  0.91221  1.12697
#> sigma               1.02086 0.03570 28.5923 0.000e+00  0.95322  1.09329
#> rho                -0.46251 0.07913 -5.8452 5.060e-09 -0.60296 -0.29425
#> tau                 0.44739 0.15957  2.8038 5.051e-03  0.09018  0.70261

The resulting estimates closely match the true parameter values, confirming the model’s effectiveness in recovering the underlying parameters.

4. Notes

This section highlights several practical considerations while using this package.

As shown in Bailey and Peng (2025), panel selection models typically achieve stronger identification than their cross-sectional counterparts.
The panel need not be balanced. Also, individuals are allowed to make different participation decisions at different time periods.
The covariates used in the first-stage and second-stage equations may differ, although excluding a variable from one stage typically requires theoretical justification. When in doubt, it is advisable to use the same set of covariates in both stages.
The model allows the two correlation parameters \(\rho\) and \(\tau\) to be turned off separately or jointly.
Inputs and output formats for models imported from the PanelCount package are somewhat different.
When the outcome variable is observed regardless of participation, the endogeneity package may be more appropriate, although it currently focuses on cross-sectional settings.

Citations

Bailey, M., & Peng, J. (2025). A Random Effects Model of Non-Ignorable Nonresponse in Panel Survey Data. Available at SSRN https://www.ssrn.com/abstract=5475626

Peng, J., & Van den Bulte, C. (2024). Participation vs. Effectiveness in Sponsored Tweet Campaigns: A Quality-Quantity Conundrum. Management Science, 70(11):7345-8215. Available at https://doi.org/10.1287/mnsc.2019.01897

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.