collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
This vignette focuses on the integration of collapse and the popular plm (‘Linear Models for Panel Data’) package by Yves Croissant, Giovanni Millo and Kevin Tappe. It will demonstrate the utility of the pseries and pdata.frame classes introduced in plm together with the corresponding methods for fast collapse functions (implemented in C or C++), to extend and facilitate extremely fast computations on panel-vectors and panel data frames (20-100 times faster than native plm). The collapse package should enable R programmers to - with very little effort - write high-performance code in the domain of panel data exploration and panel data econometrics.
Notes:
To learn more about collapse, see the ‘Introduction to collapse’ vignette or the built-in structured documentation available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start.
Documentation and vignettes can also be viewed online.
The vignette is structured as follows:
Part 1 introduces collapse’s fast functions and associated transformation operators to compute various transformations on panel data, and delivers some benchmarks.
Part 2 uses these functions to explore panel data a bit and introduce additional functions for summary statistics, panel-autocorrelations and testing fixed effects.
Part 3 finally provides an example programming application by coding a slightly extended and very efficient Hausman and Taylor (1981) estimator.
For this vignette we will use a dataset (wlddev) supplied with collapse containing a panel of 4 key development indicators taken from the World Bank Development Indicators Database:
library(collapse)
head(wlddev)
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
# 2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
# 3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
# 4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
# 5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
# 6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
fNobs(wlddev) # This column-wise counts the number of observations
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 12744 12744 12744 12744 12744 12744 12744 12744 8995 11068 1356 8336
fNdistinct(wlddev) # This counts the number of distinct values
# country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
# 216 216 59 59 7 7 4 2 8995 10048 363 7564
First let us convert this data to a plm panel data.frame (class pdata.frame):
library(plm)
# This creates a panel data frame
pwlddev <- pdata.frame(wlddev, index = c("iso3c", "year"))
str(pwlddev, give.attr = FALSE)
# Classes 'pdata.frame' and 'data.frame': 12744 obs. of 12 variables:
# $ country: 'pseries' Named chr "Aruba" "Aruba" "Aruba" "Aruba" ...
# $ iso3c : Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ date : pseries, format: "1961-01-01" "1962-01-01" "1963-01-01" ...
# $ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ decade : 'pseries' Named num 1960 1960 1960 1960 1960 1960 1970 1970 1970 1970 ...
# $ region : Factor w/ 7 levels "East Asia & Pacific",..: 3 3 3 3 3 3 3 3 3 3 ...
# $ income : Factor w/ 4 levels "High income",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ OECD : 'pseries' Named logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ PCGDP : 'pseries' Named num NA NA NA NA NA NA NA NA NA NA ...
# $ LIFEEX : 'pseries' Named num 65.7 66.1 66.4 66.8 67.1 ...
# $ GINI : 'pseries' Named num NA NA NA NA NA NA NA NA NA NA ...
# $ ODA : 'pseries' Named num NA NA NA NA NA NA NA NA NA NA ...
# A pdata.frame has an index attribute attached [retrieved using index(pwlddev) or attr(pwlddev, "index")]
str(index(pwlddev))
# Classes 'pindex' and 'data.frame': 12744 obs. of 2 variables:
# $ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...
# This shows the individual and time dimensions
pdim(pwlddev)
# Balanced Panel: n = 216, T = 59, N = 12744
A plm::pdata.frame is a data.frame with panel identifiers attached as a list of factors in an index attribute (non-factor index variables are converted to factor). Each column in that data.frame is a Panel Series (plm::pseries), which also has the panel identifiers attached:
# Panel Series of GDP per Capita and Life-Expectancy at Birth
PCGDP <- pwlddev$PCGDP
LIFEEX <- pwlddev$LIFEEX
str(LIFEEX)
# 'pseries' Named num [1:12744] 65.7 66.1 66.4 66.8 67.1 ...
# - attr(*, "names")= chr [1:12744] "ABW-1960" "ABW-1961" "ABW-1962" "ABW-1963" ...
# - attr(*, "index")=Classes 'pindex' and 'data.frame': 12744 obs. of 2 variables:
# ..$ iso3c: Factor w/ 216 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...
Now that we have explored the basic data structures provided in the plm package, let’s compute some transformations on them:
The functions fbetween and fwithin can be used to compute efficient between and within transformations on panel vectors and panel data.frames:
# Between-Transformations
head(fbetween(LIFEEX)) # Between individual (default)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 72.20935 72.20935 72.20935 72.20935 72.20935 72.20935
head(fbetween(LIFEEX, effect = "year")) # Between time
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 53.90349 54.46588 54.85032 55.19844 55.66677 56.13145
# Within-Transformations
head(fwithin(LIFEEX)) # Within individuals (default)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# -6.547351 -6.135351 -5.765351 -5.422351 -5.096351 -4.774351
head(fwithin(LIFEEX, effect = "year")) # Within time
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 11.75851 11.60812 11.59368 11.58856 11.44623 11.30355
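To make the two transformations concrete, here is a minimal base R sketch on a small made-up panel. The vectors id and x are hypothetical toy data (not taken from wlddev), and ave() merely stands in for the idea; it is not how collapse implements these functions.

```r
# Toy unbalanced panel (hypothetical data)
id <- c("A", "A", "A", "B", "B")
x  <- c(1, 2, 6, 10, 20)

# Between transformation: replace each value by the mean of its group
x_between <- ave(x, id, FUN = mean)

# Within transformation: subtract the group mean from each value
x_within <- x - x_between

x_between  # 3 3 3 15 15
x_within   # -2 -1 3 -5 5
```

By construction the two pieces add up to the original series, which is why between and within transformations are often described as a decomposition of the data.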
By default na.rm = TRUE, thus both functions skip (preserve) missing values in the data (which is the default for all collapse functions). For fbetween the output behavior can be altered with the option fill: Setting fill = TRUE will compute the group means on the complete cases in each group (as long as na.rm = TRUE), but replace all values in each group with the group mean (hence overwriting or ‘filling up’ missing values):
# This preserves missing values in the output
head(fbetween(PCGDP), 30)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970
# NA NA NA NA NA NA NA NA NA NA NA
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981
# NA NA NA NA NA NA NA NA NA NA NA
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-1986 ABW-1987 ABW-1988 ABW-1989
# NA NA NA NA 25247.8 25247.8 25247.8 25247.8
# This replaces all individuals with the group mean
head(fbetween(PCGDP, fill = TRUE), 30)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970
# 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981
# 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-1986 ABW-1987 ABW-1988 ABW-1989
# 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8 25247.8
In fwithin the mean argument allows setting an arbitrary data mean (different from 0) after the data is centered. In grouped centering tasks, a sensible choice for such an added mean would be the overall mean of the data series, enabled by the option mean = "overall.mean". This will add the overall mean of the series back to the data after subtracting out group means, and thus preserve the level of the data (and will only change the intercept when employed in a regression):
# This performs standard grouped centering
head(fwithin(LIFEEX))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# -6.547351 -6.135351 -5.765351 -5.422351 -5.096351 -4.774351
# This adds the overall average Life-Expectancy (across countries) to the country-demeaned series
head(fwithin(LIFEEX, mean = "overall.mean"))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 57.29374 57.70574 58.07574 58.41874 58.74474 59.06674
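The claim that adding back the overall mean only changes the intercept in a regression can be checked with a small base R simulation. Everything below (the simulated data and the helper center()) is a hypothetical sketch and not part of collapse.

```r
set.seed(1)
# Simulated panel with three groups that have different levels
id <- rep(c("A", "B", "C"), each = 4)
x  <- rnorm(12) + rep(c(0, 5, 10), each = 4)
y  <- 2 * x + rep(c(3, -1, 7), each = 4) + rnorm(12)

# Grouped centering, optionally adding the overall mean back
center <- function(v, g, add_overall = FALSE) {
  out <- v - ave(v, g)
  if (add_overall) out + mean(v) else out
}

fit_plain   <- lm(center(y, id) ~ center(x, id))
fit_overall <- lm(center(y, id, TRUE) ~ center(x, id, TRUE))

# The slopes coincide; only the intercepts differ
coef(fit_plain)
coef(fit_overall)
```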
fbetween and fwithin can also be applied to pdata.frame’s, where they will perform these computations variable by variable:
head(fbetween(num_vars(pwlddev)), 3)
# decade PCGDP LIFEEX GINI ODA
# ABW-1960 1988.983 NA 72.20935 NA NA
# ABW-1961 1988.983 NA 72.20935 NA NA
# ABW-1962 1988.983 NA 72.20935 NA NA
head(fbetween(num_vars(pwlddev), fill = TRUE), 3)
# decade PCGDP LIFEEX GINI ODA
# ABW-1960 1988.983 25247.8 72.20935 NA 30313500
# ABW-1961 1988.983 25247.8 72.20935 NA 30313500
# ABW-1962 1988.983 25247.8 72.20935 NA 30313500
head(fwithin(num_vars(pwlddev)), 3)
# decade PCGDP LIFEEX GINI ODA
# ABW-1960 -28.98305 NA -6.547351 NA NA
# ABW-1961 -28.98305 NA -6.135351 NA NA
# ABW-1962 -28.98305 NA -5.765351 NA NA
head(fwithin(num_vars(pwlddev), mean = "overall.mean"), 3)
# decade PCGDP LIFEEX GINI ODA
# ABW-1960 1960 NA 57.29374 NA NA
# ABW-1961 1960 NA 57.70574 NA NA
# ABW-1962 1960 NA 58.07574 NA NA
Next to fbetween and fwithin there also exist short versions B and W, which are referred to as transformation operators. These are essentially wrappers around fbetween and fwithin and provide the same functionality, but are more parsimonious to employ in regression formulas and also offer additional features when applied to panel data.frames. For panel series, B and W are exact analogues to fbetween and fwithin, just under a shorter name:
identical(fbetween(PCGDP), B(PCGDP))
# [1] TRUE
identical(fbetween(PCGDP, fill = TRUE), B(PCGDP, fill = TRUE))
# [1] TRUE
identical(fwithin(PCGDP), W(PCGDP))
# [1] TRUE
identical(fwithin(PCGDP, mean = "overall.mean"), W(PCGDP, mean = "overall.mean"))
# [1] TRUE
When applied to panel data.frames, B and W offer some additional utility by (a) allowing you to select columns to transform using the cols argument (default is cols = is.numeric, so by default all numeric columns will be selected for transformation), (b) allowing you to add a prefix to the transformed columns with the stub argument (default is stub = "B." for B and stub = "W." for W) and (c) preserving the panel-id’s with the keep.ids argument (default keep.ids = TRUE):
head(B(pwlddev), 3)
# iso3c year B.decade B.PCGDP B.LIFEEX B.GINI B.ODA
# ABW-1960 ABW 1960 1988.983 NA 72.20935 NA NA
# ABW-1961 ABW 1961 1988.983 NA 72.20935 NA NA
# ABW-1962 ABW 1962 1988.983 NA 72.20935 NA NA
head(W(pwlddev, cols = 9:12), 3) # Here using the cols argument
# iso3c year W.PCGDP W.LIFEEX W.GINI W.ODA
# ABW-1960 ABW 1960 NA -6.547351 NA NA
# ABW-1961 ABW 1961 NA -6.135351 NA NA
# ABW-1962 ABW 1962 NA -5.765351 NA NA
fbetween / B and fwithin / W also support weighted computations. This of course applies more to panel-survey settings, but for the sake of illustration suppose we wanted to weight our between and within transformations by the amount of ODA these countries received:
# This replaces values by the ODA-weighted group mean and also preserves the weight variable (ODA, argument keep.w = TRUE)
head(B(pwlddev, w = ~ ODA), 3)
# iso3c year ODA B.decade B.PCGDP B.LIFEEX B.GINI
# ABW-1960 ABW 1960 NA 1992.721 NA 73.54196 NA
# ABW-1961 ABW 1961 NA 1992.721 NA 73.54196 NA
# ABW-1962 ABW 1962 NA 1992.721 NA 73.54196 NA
# This centers values on the ODA-weighted group mean
head(W(pwlddev, w = ~ ODA, cols = c("PCGDP","LIFEEX","GINI")), 3)
# iso3c year ODA W.PCGDP W.LIFEEX W.GINI
# ABW-1960 ABW 1960 NA NA -7.879958 NA
# ABW-1961 ABW 1961 NA NA -7.467958 NA
# ABW-1962 ABW 1962 NA NA -7.097958 NA
# This centers values on the ODA-weighted group mean and also adds the overall ODA-weighted mean of the data
head(W(pwlddev, w = ~ ODA, cols = c("PCGDP","LIFEEX","GINI"), mean = "overall.mean"), 3)
# iso3c year ODA W.PCGDP W.LIFEEX W.GINI
# ABW-1960 ABW 1960 NA NA 52.41778 NA
# ABW-1961 ABW 1961 NA NA 52.82978 NA
# ABW-1962 ABW 1962 NA NA 53.19978 NA
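What ‘weighted’ means here can be sketched in base R: the group mean becomes a weighted mean, and centering subtracts that weighted mean. The toy vectors id, x and w are hypothetical and contain no missing values; this is an illustration of the idea rather than collapse’s implementation.

```r
# Toy panel with a weight vector (hypothetical data)
id <- c("A", "A", "A", "B", "B")
x  <- c(1, 2, 6, 10, 20)
w  <- c(1, 1, 2, 3, 1)

# Weighted group mean: sum(w * x) / sum(w) within each group
x_wbetween <- ave(w * x, id, FUN = sum) / ave(w, id, FUN = sum)

# Weighted within transformation: center on the weighted group mean
x_wwithin <- x - x_wbetween

x_wbetween  # 3.75 3.75 3.75 12.50 12.50
x_wwithin   # -2.75 -1.75 2.25 -2.50 7.50
```

The weighted sum of the centered values is zero in every group, which is the weighted analogue of group-centered data having a group mean of zero.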
As shown above, with B and W the weight column can also be passed as a formula or character string, whereas fbetween and fwithin require all inputs to be passed directly in terms of data (e.g. fbetween(get_vars(pwlddev, 9:11), w = pwlddev$ODA)), and the weight vector or id columns are never preserved in the output. Therefore in most applications B and W are probably more convenient for quick use, whereas fbetween and fwithin are the preferred programmer’s choice, also because they have a little less R-overhead which makes them a tiny bit faster.
Analogous to fbetween / B and fwithin / W, collapse provides a duo of functions and operators fHDbetween / HDB and fHDwithin / HDW to efficiently average and center data on multiple groups. Credit for this goes to Simen Gaure, the author of the lfe package, who wrote an efficient C implementation of the alternating-projections algorithm to perform this task. fHDbetween / HDB and fHDwithin / HDW enrich this implementation (available in the function lfe::demeanlist) by providing more options regarding missing values, and also allowing continuous covariates and (full) interactions to be projected out alongside factors. The methods for pseries and pdata.frame’s are however rather simple, as they simply simultaneously center panel-vectors on all panel-identifiers in the index (which can be more than 2):
# This simultaneously averages Life-Expectancy across countries and years
head(HDB(LIFEEX)) # (same as running a regression on country and year dummies and taking the fitted values)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 62.59819 63.09571 63.48015 63.89314 64.36147 64.77122
# This simultaneously centers Life-Expectancy on countries and years
head(HDW(LIFEEX)) # (same as running a regression on country and year dummies and taking the residuals)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 3.063807 2.978285 2.963845 2.893861 2.751525 2.663777
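The equivalence with a dummy-variable regression noted in the comments above can be illustrated with a naive base R version of alternating projections. The simulated unbalanced panel d and the helper hd_within() are hypothetical, and the loop is only a sketch of the idea; it is not the C implementation from lfe that collapse relies on.

```r
set.seed(42)
# Hypothetical panel: 6 individuals, 5 periods, a few rows dropped to unbalance it
d <- expand.grid(id = factor(1:6), time = factor(1:5))
d <- d[-c(2, 9, 17, 23), ]
d$x <- rnorm(nrow(d)) + as.integer(d$id) + 0.5 * as.integer(d$time)

# Alternating projections: sweep out id means, then time means, until nothing changes
hd_within <- function(x, f1, f2, tol = 1e-10, max_iter = 1000L) {
  for (i in seq_len(max_iter)) {
    x_old <- x
    x <- x - ave(x, f1)
    x <- x - ave(x, f2)
    if (max(abs(x - x_old)) < tol) break
  }
  x
}

x_hdw <- hd_within(d$x, d$id, d$time)  # analogue of the centered series
x_hdb <- d$x - x_hdw                   # analogue of the averaged series

# Compare with residuals and fitted values of a regression on both sets of dummies
fit <- lm(x ~ id + time, data = d)
all.equal(x_hdw, unname(residuals(fit)))
all.equal(x_hdb, unname(fitted(fit)))
```

On a balanced panel a single pass of the loop is already exact; the iteration only matters once the panel is unbalanced.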
The architecture of fHDbetween / HDB and fHDwithin / HDW differs a bit from fbetween / B and fwithin / W. This is essentially a consequence of the underlying C-implementation (accessed through lfe::demeanlist), which was not built to accommodate missing values. fHDbetween / HDB and fHDwithin / HDW therefore both have an argument fill = TRUE (the default), which stipulates that missing values in the data are preserved in the output. The collapse default na.rm = TRUE again ensures that only complete cases are used for the computation:
# Missing values are preserved in the output when fill = TRUE (the default)
head(HDB(PCGDP), 30)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970
# NA NA NA NA NA NA NA NA NA NA NA
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981
# NA NA NA NA NA NA NA NA NA NA NA
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-1986 ABW-1987 ABW-1988 ABW-1989
# NA NA NA NA 21750.50 22024.44 22371.47 22670.55
# When fill = FALSE, only the complete cases are returned
nofill <- HDB(PCGDP, fill = FALSE)
head(nofill, 30)
# ABW-1986 ABW-1987 ABW-1988 ABW-1989 ABW-1990 ABW-1991 ABW-1992 ABW-1993 ABW-1994 ABW-1995 ABW-1996
# 21750.50 22024.44 22371.47 22670.55 22990.95 23001.82 23042.98 23085.61 23307.28 23506.84 23690.18
# ABW-1997 ABW-1998 ABW-1999 ABW-2000 ABW-2001 ABW-2002 ABW-2003 ABW-2004 ABW-2005 ABW-2006 ABW-2007
# 24025.68 24305.15 24611.12 25073.75 25255.17 25445.18 25693.93 26195.16 26517.71 27017.07 27535.56
# ABW-2008 ABW-2009 ABW-2010 ABW-2011 ABW-2012 ABW-2013 ABW-2014 ABW-2015
# 27560.67 26822.40 27049.76 27246.63 27290.13 27465.78 27646.39 27839.22
# This results in a shorter panel-vector
length(nofill)
# [1] 8995
length(PCGDP)
# [1] 12744
# The cases that were missing and removed from the output are available as an attribute
head(attr(nofill, "na.rm"), 30)
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965 ABW-1966 ABW-1967 ABW-1968 ABW-1969 ABW-1970
# 1 2 3 4 5 6 7 8 9 10 11
# ABW-1971 ABW-1972 ABW-1973 ABW-1974 ABW-1975 ABW-1976 ABW-1977 ABW-1978 ABW-1979 ABW-1980 ABW-1981
# 12 13 14 15 16 17 18 19 20 21 22
# ABW-1982 ABW-1983 ABW-1984 ABW-1985 ABW-2018 AFG-1960 AFG-1961 AFG-1962
# 23 24 25 26 59 60 61 62
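Such a vector of removed positions is enough to restore a shortened result to its original length. The following base R sketch uses a hypothetical toy vector, and removed only plays the role of the "na.rm" attribute shown above.

```r
# Toy series with missing values (hypothetical data)
x <- c(NA, 5, 7, NA, 9)

# Positions of the removed cases (stand-in for attr(nofill, "na.rm"))
removed <- which(is.na(x))

# Some transformation computed on the complete cases only
short <- x[-removed] - mean(x[-removed])

# Re-expand to full length, putting NA back at the removed positions
full <- rep(NA_real_, length(short) + length(removed))
full[-removed] <- short

full  # NA -2 0 NA 2
```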
In the pdata.frame methods there are 3 different choices for how to deal with missing values. The default for the plm classes is variable.wise = TRUE, which will essentially sequentially apply fHDbetween.pseries and fHDwithin.pseries (with the default fill = TRUE) to all columns. This is the same behavior as in fbetween / B and fwithin / W, which also consider the column-wise complete obs:
# This column-wise centers the data on countries and years
tail(HDW(pwlddev), 10)
# HDW.decade HDW.PCGDP HDW.LIFEEX HDW.GINI HDW.ODA
# ZWE-2009 -1.262177e-29 -4599.857 -9.166656 NA 200109393
# ZWE-2010 -1.262177e-29 -4700.931 -7.661442 NA 151705524
# ZWE-2011 -1.262177e-29 -4796.847 -6.212781 8.550597e-10 119746204
# ZWE-2012 -1.262177e-29 -4705.630 -4.797836 NA 384959776
# ZWE-2013 -1.262177e-29 -4884.977 -3.577774 NA 157816348
# ZWE-2014 -1.262177e-29 -5065.539 -2.575553 NA 106350944
# ZWE-2015 3.660315e-28 -5264.664 -1.758456 NA 160576556
# ZWE-2016 3.660315e-28 -5526.032 -1.213358 NA -6204739
# ZWE-2017 3.660315e-28 -5224.886 NA NA 32144901
# ZWE-2018 3.660315e-28 NA NA NA NA
If variable.wise = FALSE, fHDbetween / HDB and fHDwithin / HDW will only consider the complete cases in the dataset, but still return a dataset of the same dimensions (as long as fill = TRUE), resulting in some rows all-missing:
# This centers the complete cases of the data on countries and years and keeps missing cases
tail(HDW(pwlddev, variable.wise = FALSE), 10)
# HDW.decade HDW.PCGDP HDW.LIFEEX HDW.GINI HDW.ODA
# ZWE-2009 NA NA NA NA NA
# ZWE-2010 NA NA NA NA NA
# ZWE-2011 -4.654654e-11 -2.813378e-06 6.104804e-09 2.834694e-09 -0.5804762
# ZWE-2012 NA NA NA NA NA
# ZWE-2013 NA NA NA NA NA
# ZWE-2014 NA NA NA NA NA
# ZWE-2015 NA NA NA NA NA
# ZWE-2016 NA NA NA NA NA
# ZWE-2017 NA NA NA NA NA
# ZWE-2018 NA NA NA NA NA
Finally, if also fill = FALSE, the behavior is the same as in the pseries method: missing cases are removed from the data:
# This centers the complete cases of the data on countries and years, and removes missing cases
res <- HDW(pwlddev, fill = FALSE)
tail(res, 10)
# HDW.decade HDW.PCGDP HDW.LIFEEX HDW.GINI HDW.ODA
# ZMB-1991 -5.927868e-12 5.984333e+02 -1.053314e+00 5.723837e+00 5.658344e+07
# ZMB-1993 -1.124560e-11 5.270411e+02 -3.390080e+00 -1.228276e+00 1.371350e+08
# ZMB-1996 -3.497479e-12 5.583191e+02 -3.872223e+00 -5.004679e+00 -9.803759e+07
# ZMB-1998 -9.889811e-13 1.347908e+02 -3.859783e+00 -5.391717e+00 -4.321414e+08
# ZMB-2002 7.410698e-13 2.241507e+02 -1.681762e+00 -1.075309e+01 1.063993e+08
# ZMB-2004 2.709906e-12 -2.725672e+02 -7.773085e-01 1.681942e+00 3.124522e+08
# ZMB-2006 5.747394e-12 -3.032551e+02 8.826697e-01 2.609441e+00 4.480060e+08
# ZMB-2010 3.789429e-12 -3.528718e+02 5.271867e+00 4.600907e+00 -1.432501e+08
# ZMB-2015 6.390825e-12 -1.114041e+03 8.479933e+00 7.761636e+00 -3.871469e+08
# ZWE-2011 -4.654654e-11 -2.813378e-06 6.104804e-09 2.834694e-09 -5.804762e-01
tail(attr(res, "na.rm"))
# [1] 12739 12740 12741 12742 12743 12744
Notes: (1) Because of the different missing case options and associated challenges, panel-identifiers are not preserved in HDB and HDW. (2) The defaults variable.wise = TRUE and fill = TRUE were only set for the pseries and pdata.frame methods, to harmonize the default implementations with fbetween / B and fwithin / W for these classes. In the standard default, matrix and data.frame methods, the defaults are variable.wise = FALSE and fill = FALSE (i.e. missing cases are removed beforehand), which is generally more efficient.
Next to the above functions for grouped centering and averaging, the function / operator pair fscale / STD can be used to efficiently standardize (i.e. scale and center) panel data along an arbitrary dimension. The architecture is identical to that of fwithin / W or fbetween / B.
# This standardizes GDP per capita in each country
STD_PCGDP <- STD(PCGDP)
# Checks:
head(fmean(STD_PCGDP, index(STD_PCGDP, 1)))
# ABW AFG AGO ALB AND ARE
# -9.436896e-16 -1.318390e-15 -6.296133e-16 3.798131e-16 -6.522560e-16 -1.858978e-16
head(fsd(STD_PCGDP, index(STD_PCGDP, 1)))
# ABW AFG AGO ALB AND ARE
# 1 1 1 1 1 1
# This standardizes GDP per capita in each year
STD_PCGDP_T <- STD(PCGDP, effect = "year")
# Checks:
head(fmean(STD_PCGDP_T, index(STD_PCGDP_T, 2)))
# 1960 1961 1962 1963 1964 1965
# -2.359224e-16 3.808184e-16 1.522080e-17 2.993517e-16 4.938553e-16 -3.615378e-16
head(fsd(STD_PCGDP_T, index(STD_PCGDP_T, 2)))
# 1960 1961 1962 1963 1964 1965
# 1 1 1 1 1 1
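The same checks can be reproduced on toy data with a few lines of base R. The helper std_by() and its arguments new_mean and new_sd are hypothetical names chosen for this sketch; it only mirrors the idea of grouped standardization and is not the fscale implementation.

```r
# Toy panel with three groups of different size (hypothetical data)
id <- rep(c("A", "B", "C"), times = c(4, 3, 5))
x  <- c(1, 3, 5, 7, 10, 20, 60, 2, 4, 4, 8, 12)

# Standardize within groups, then optionally rescale to a new mean and sd
std_by <- function(x, g, new_mean = 0, new_sd = 1) {
  z <- (x - ave(x, g)) / ave(x, g, FUN = sd)
  z * new_sd + new_mean
}

z <- std_by(x, id)
tapply(z, id, mean)  # numerically zero in every group
tapply(z, id, sd)    # one in every group

# Custom target moments: mean 5 and sd 3 within each group
z53 <- std_by(x, id, new_mean = 5, new_sd = 3)
```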
And similarly for pdata.frame’s:
head(STD(pwlddev, cols = 9:12))
# iso3c year STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
# ABW-1960 ABW 1960 NA -2.356240 NA NA
# ABW-1961 ABW 1961 NA -2.207971 NA NA
# ABW-1962 ABW 1962 NA -2.074817 NA NA
# ABW-1963 ABW 1963 NA -1.951379 NA NA
# ABW-1964 ABW 1964 NA -1.834059 NA NA
# ABW-1965 ABW 1965 NA -1.718179 NA NA
head(STD(pwlddev, cols = 9:12, effect = "year"))
# iso3c year STD.PCGDP STD.LIFEEX STD.GINI STD.ODA
# ABW-1960 ABW 1960 NA 0.9653371 NA NA
# ABW-1961 ABW 1961 NA 0.9521446 NA NA
# ABW-1962 ABW 1962 NA 0.9613612 NA NA
# ABW-1963 ABW 1963 NA 0.9690544 NA NA
# ABW-1964 ABW 1964 NA 0.9592609 NA NA
# ABW-1965 ABW 1965 NA 0.9563056 NA NA
More customized scaling can be done with the help of the mean and sd arguments to fscale / STD. By default mean = 0 and sd = 1, but these could be assigned any numeric values:
# This will scale the data such that the mean within each country is 5 and the standard deviation is 3
qsu(fscale(pwlddev$PCGDP, mean = 5, sd = 3))
# N/T Mean SD Min Max
# Overall 8992 5 2.9666 -6.0094 16.0054
# Between 200 5 0 5 5
# Within 44.96 5 2.9666 -6.0094 16.0054
Even further customization (i.e. setting means and standard deviations for each group and / or each column) can of course be achieved by calling collapse::TRA on the result of fscale to sweep out an appropriate set of means and standard deviations.
Scaling without centering can be done with the option mean = FALSE. This will also preserve the mean of the data overall and within each group:
# Scaling without centering: Mean preserving with fscale / STD
qsu(fscale(pwlddev$PCGDP, mean = FALSE, sd = 3))
# N/T Mean SD Min Max
# Overall 8992 11546.3933 17164.1598 249.9883 127809.611
# Between 200 11726.7457 17336.1848 255.3999 127802.226
# Within 44.96 11546.3933 2.9666 11535.3839 11557.3987
# Scaling without centering can also be done using fsd, but this does not preserve the mean
qsu(fsd(pwlddev$PCGDP, index(pwlddev, 1), TRA = "/"))
# N/T Mean SD Min Max
# Overall 8992 4.2785 3.0025 0.0659 22.9048
# Between 200 4.6461 3.3846 0.8296 21.8908
# Within 44.96 4.2785 0.9889 0.6087 7.9469
Finally, a special kind of data harmonization in the first two moments can be done by setting mean = "overall.mean" and sd = "within.sd" in a grouped scaling task. This will harmonize the data across groups such that the mean of each group is equal to the overall data mean and the standard deviation equal to the within standard deviation (= the standard deviation calculated on the group-centered series):
fmean(pwlddev$PCGDP) # Overall mean
# [1] 11563.65
fsd(W(pwlddev$PCGDP)) # Within sd
# [1] 6334.952
# Scaling and centering such that the mean of each country is the overall mean, and the sd of each country is the within sd
qsu(fscale(pwlddev$PCGDP, mean = "overall.mean", sd = "within.sd"))
# N/T Mean SD Min Max
# Overall 8992 11563.6529 6264.4535 -11684.3802 34803.1888
# Between 200 11563.6529 0 11563.6529 11563.6529
# Within 44.96 11563.6529 6264.4535 -11684.3802 34803.1888
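In formula terms, and assuming the harmonization works as described above (the symbols are our own notation, not taken from the collapse documentation), each observation \(x_{it}\) of group \(i\) would be transformed as

\[ z_{it} = \frac{x_{it} - \bar{x}_i}{s_i} \, s_W + \bar{x} \]

where \(\bar{x}_i\) and \(s_i\) are the mean and standard deviation of group \(i\), \(\bar{x}\) is the overall mean and \(s_W\) is the within standard deviation. Every group thus ends up with mean \(\bar{x}\) and standard deviation \(s_W\), which is consistent with the zero Between SD in the summary above.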
All of this seamlessly generalizes to weighted scaling and centering, using the w argument to add a weight vector.
With flag / L / F, fdiff / D and fgrowth / G, collapse provides a fast and comprehensive C++ based solution to the computation of (sequences of) lags / leads and (sequences of) lagged / leaded and suitably iterated (quasi-, log-) differences and growth rates on panel data. The pseries and pdata.frame methods to these functions and associated transformation operators use the panel-identifiers in the ‘index’ attached to these objects (where the last variable in the ‘index’ is taken as the time-variable and the variables before that are taken as individual identifiers) to perform fast fully-identified time-dependent operations on panel data, without the need to sort the data.
With flag / L / F, it is easy to lag or lead pseries:
# A panel-lag
head(flag(LIFEEX))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA 65.662 66.074 66.444 66.787 67.113
# A panel-lead
head(flag(LIFEEX, -1))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# 66.074 66.444 66.787 67.113 67.435 67.762
# The lag and lead operators are even more parsimonious to employ:
all_identical(L(LIFEEX), flag(LIFEEX), plm::lag(LIFEEX))
# [1] TRUE
all_identical(F(LIFEEX), flag(LIFEEX, -1), plm::lead(LIFEEX))
# [1] TRUE
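What a fully-identified lag on unsorted data amounts to can be sketched in base R by matching each observation to the observation of the same individual one period earlier. The toy vectors and the helper panel_lag() are hypothetical, and the sketch says nothing about how the C++ code in collapse is organized; in particular, treating a missing period as a gap (returning NA) is an assumption of this illustration.

```r
# Hypothetical unsorted panel; individual B has no observation for 2001
id   <- c("B", "A", "A", "B", "A", "B")
time <- c(2002, 2000, 2001, 2000, 2002, 2003)
x    <- c(22, 10, 11, 20, 12, 23)

panel_lag <- function(x, id, time, n = 1) {
  # Position of the observation of the same individual n periods earlier
  pos <- match(paste(id, time - n), paste(id, time))
  x[pos]
}

panel_lag(x, id, time)      # lag:  NA NA 10 NA 11 22
panel_lag(x, id, time, -1)  # lead: 23 11 12 NA NA NA
```

Because the lookup uses both identifiers, the result does not depend on the row order, and a value is never carried across individuals or across a gap in time.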
It is also possible to compute a sequence of lags / leads using flag or one of the operators:
# Sequence of panel-lags and leads
head(flag(LIFEEX, -1:3))
# F1 -- L1 L2 L3
# ABW-1960 66.074 65.662 NA NA NA
# ABW-1961 66.444 66.074 65.662 NA NA
# ABW-1962 66.787 66.444 66.074 65.662 NA
# ABW-1963 67.113 66.787 66.444 66.074 65.662
# ABW-1964 67.435 67.113 66.787 66.444 66.074
# ABW-1965 67.762 67.435 67.113 66.787 66.444
all_identical(L(LIFEEX, -1:3), F(LIFEEX, 1:-3), flag(LIFEEX, -1:3))
# [1] TRUE
# The native plm implementation also returns a matrix of lags but with different column names
head(plm::lag(LIFEEX, -1:3), 4)
# -1 0 1 2 3
# ABW-1960 66.074 65.662 NA NA NA
# ABW-1961 66.444 66.074 65.662 NA NA
# ABW-1962 66.787 66.444 66.074 65.662 NA
# ABW-1963 67.113 66.787 66.444 66.074 65.662
Of course the lag orders may be unevenly spaced, e.g. L(x, -1:3*12) would compute seasonal lags on monthly data. On pdata.frame’s, the effects of flag and L / F differ insofar as flag will just lag the entire dataset without preserving identifiers (although the index attribute is always preserved), whereas L / F by default (cols = is.numeric) select the numeric variables and add the panel-id’s on the left (default keep.ids = TRUE):
# This lags the entire data
head(flag(pwlddev))
# country iso3c date year decade region income OECD PCGDP
# ABW-1960 <NA> <NA> <NA> <NA> NA <NA> <NA> NA NA
# ABW-1961 Aruba ABW 1961-01-01 1960 1960 Latin America & Caribbean High income FALSE NA
# ABW-1962 Aruba ABW 1962-01-01 1961 1960 Latin America & Caribbean High income FALSE NA
# ABW-1963 Aruba ABW 1963-01-01 1962 1960 Latin America & Caribbean High income FALSE NA
# ABW-1964 Aruba ABW 1964-01-01 1963 1960 Latin America & Caribbean High income FALSE NA
# ABW-1965 Aruba ABW 1965-01-01 1964 1960 Latin America & Caribbean High income FALSE NA
# LIFEEX GINI ODA
# ABW-1960 NA NA NA
# ABW-1961 65.662 NA NA
# ABW-1962 66.074 NA NA
# ABW-1963 66.444 NA NA
# ABW-1964 66.787 NA NA
# ABW-1965 67.113 NA NA
# This lags only numeric columns and preserves panel-id's
head(L(pwlddev))
# iso3c year L1.decade L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# ABW-1960 ABW 1960 NA NA NA NA NA
# ABW-1961 ABW 1961 1960 NA 65.662 NA NA
# ABW-1962 ABW 1962 1960 NA 66.074 NA NA
# ABW-1963 ABW 1963 1960 NA 66.444 NA NA
# ABW-1964 ABW 1964 1960 NA 66.787 NA NA
# ABW-1965 ABW 1965 1960 NA 67.113 NA NA
# This lags only columns 9 through 12 and preserves panel-id's
head(L(pwlddev, cols = 9:12))
# iso3c year L1.PCGDP L1.LIFEEX L1.GINI L1.ODA
# ABW-1960 ABW 1960 NA NA NA NA
# ABW-1961 ABW 1961 NA 65.662 NA NA
# ABW-1962 ABW 1962 NA 66.074 NA NA
# ABW-1963 ABW 1963 NA 66.444 NA NA
# ABW-1964 ABW 1964 NA 66.787 NA NA
# ABW-1965 ABW 1965 NA 67.113 NA NA
We can also easily compute a sequence of lags / leads on a panel data.frame:
# This computes one lead, the level and three lags of columns 9 through 12 and preserves panel-id's
head(L(pwlddev, -1:3, cols = 9:12))
# iso3c year F1.PCGDP PCGDP L1.PCGDP L2.PCGDP L3.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX L2.LIFEEX
# ABW-1960 ABW 1960 NA NA NA NA NA 66.074 65.662 NA NA
# ABW-1961 ABW 1961 NA NA NA NA NA 66.444 66.074 65.662 NA
# ABW-1962 ABW 1962 NA NA NA NA NA 66.787 66.444 66.074 65.662
# ABW-1963 ABW 1963 NA NA NA NA NA 67.113 66.787 66.444 66.074
# ABW-1964 ABW 1964 NA NA NA NA NA 67.435 67.113 66.787 66.444
# ABW-1965 ABW 1965 NA NA NA NA NA 67.762 67.435 67.113 66.787
# L3.LIFEEX F1.GINI GINI L1.GINI L2.GINI L3.GINI F1.ODA ODA L1.ODA L2.ODA L3.ODA
# ABW-1960 NA NA NA NA NA NA NA NA NA NA NA
# ABW-1961 NA NA NA NA NA NA NA NA NA NA NA
# ABW-1962 NA NA NA NA NA NA NA NA NA NA NA
# ABW-1963 65.662 NA NA NA NA NA NA NA NA NA NA
# ABW-1964 66.074 NA NA NA NA NA NA NA NA NA NA
# ABW-1965 66.444 NA NA NA NA NA NA NA NA NA NA
Essentially the same functionality applies to fdiff / D and fgrowth / G, with the main difference that these functions also have a diff argument to determine the number of iterations:
# Panel-difference of Life Expectancy
head(fdiff(LIFEEX))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA 0.412 0.370 0.343 0.326 0.322
# Second panel-difference
head(fdiff(LIFEEX, diff = 2))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA NA -0.042 -0.027 -0.017 -0.004
# Panel-growth rate of Life Expectancy
head(fgrowth(LIFEEX))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA 0.6274558 0.5599782 0.5162242 0.4881189 0.4797878
# Growth rate of growth rate of Life Expectancy
head(fgrowth(LIFEEX, diff = 2))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA NA -10.754153 -7.813521 -5.444387 -1.706782
identical(D(LIFEEX), fdiff(LIFEEX))
# [1] TRUE
identical(G(LIFEEX), fgrowth(LIFEEX))
# [1] TRUE
identical(fdiff(LIFEEX), diff(LIFEEX)) # Same as plm::diff.pseries (which does not compute iterated panel-differences)
# [1] TRUE
By default, growth rates are calculated in percentage terms, which is set by the default argument scale = 100. It is also possible to compute log-differences with fdiff(.., log = TRUE) or the Dlog operator, and growth rates in percentage terms based on log-differences using fgrowth(.., logdiff = TRUE).
# Panel log-difference of Life Expectancy
head(Dlog(LIFEEX))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA 0.006254955 0.005584162 0.005148963 0.004869315 0.004786405
# Panel log-difference growth rate (in percentage terms) of Life Expectancy
head(G(LIFEEX, logdiff = TRUE))
# ABW-1960 ABW-1961 ABW-1962 ABW-1963 ABW-1964 ABW-1965
# NA 0.6254955 0.5584162 0.5148963 0.4869315 0.4786405
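The arithmetic behind these outputs can be retraced in base R on a single individual. The three values below are the first Life Expectancy observations for Aruba as printed earlier; the variable names are our own, and the snippet is a plain time-series calculation rather than a panel method.

```r
# First three Life Expectancy values for Aruba, copied from the output above
x <- c(65.662, 66.074, 66.444)

d1   <- diff(x)                         # first difference
d2   <- diff(x, differences = 2)        # second (iterated) difference
g1   <- diff(x) / head(x, -1) * 100     # exact growth rate in percent
g2   <- diff(g1) / head(g1, -1) * 100   # growth rate of the growth rate
dlog <- diff(log(x))                    # log-difference

d1          # compare with fdiff(LIFEEX)
d2          # compare with fdiff(LIFEEX, diff = 2)
g1          # compare with fgrowth(LIFEEX)
g2          # compare with fgrowth(LIFEEX, diff = 2)
dlog        # compare with Dlog(LIFEEX)
dlog * 100  # compare with G(LIFEEX, logdiff = TRUE)
```

For small changes such as these, the log-difference growth rate is very close to the exact growth rate, which is why the two sets of numbers differ only from the third significant digit onwards.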
It is also possible to compute sequences of lagged / leaded and iterated differences, log-differences and growth rates:
# first and second forward-difference and first and second difference of lags 1-3 of Life-Expectancy
head(D(LIFEEX, -1:3, 1:2))
# FD1 FD2 -- D1 D2 L2D1 L2D2 L3D1 L3D2
# ABW-1960 -0.412 -0.042 65.662 NA NA NA NA NA NA
# ABW-1961 -0.370 -0.027 66.074 0.412 NA NA NA NA NA
# ABW-1962 -0.343 -0.017 66.444 0.370 -0.042 0.782 NA NA NA
# ABW-1963 -0.326 -0.004 66.787 0.343 -0.027 0.713 NA 1.125 NA
# ABW-1964 -0.322 0.005 67.113 0.326 -0.017 0.669 -0.113 1.039 NA
# ABW-1965 -0.327 0.006 67.435 0.322 -0.004 0.648 -0.065 0.991 NA
# Same with Log-differences
head(Dlog(LIFEEX, -1:3, 1:2))
# FDlog1 FDlog2 -- Dlog1 Dlog2 L2Dlog1 L2Dlog2
# ABW-1960 -0.006254955 -6.707929e-04 4.184520 NA NA NA NA
# ABW-1961 -0.005584162 -4.351984e-04 4.190775 0.006254955 NA NA NA
# ABW-1962 -0.005148963 -2.796481e-04 4.196359 0.005584162 -0.0006707929 0.01183912 NA
# ABW-1963 -0.004869315 -8.291000e-05 4.201508 0.005148963 -0.0004351984 0.01073312 NA
# ABW-1964 -0.004786405 5.098981e-05 4.206378 0.004869315 -0.0002796481 0.01001828 -0.001820838
# ABW-1965 -0.004837395 6.482830e-05 4.211164 0.004786405 -0.0000829100 0.00965572 -0.001077405
# L3Dlog1 L3Dlog2
# ABW-1960 NA NA
# ABW-1961 NA NA
# ABW-1962 NA NA
# ABW-1963 0.01698808 NA
# ABW-1964 0.01560244 NA
# ABW-1965 0.01480468 NA
# Same with (exact) growth rates
head(G(LIFEEX, -1:3, 1:2))
# FG1 FG2 -- G1 G2 L2G1 L2G2 L3G1 L3G2
# ABW-1960 -0.6235433 11.974895 65.662 NA NA NA NA NA NA
# ABW-1961 -0.5568599 8.428580 66.074 0.6274558 NA NA NA NA NA
# ABW-1962 -0.5135730 5.728297 66.444 0.5599782 -10.754153 1.1909476 NA NA NA
# ABW-1963 -0.4857479 1.727984 66.787 0.5162242 -7.813521 1.0790931 NA 1.713320 NA
# ABW-1964 -0.4774968 -1.051555 67.113 0.4881189 -5.444387 1.0068629 -15.45699 1.572479 NA
# ABW-1965 -0.4825714 -1.319230 67.435 0.4797878 -1.706782 0.9702487 -10.08666 1.491482 NA
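The column names above encode the lag and the number of iterations: L2D1 is the first difference at lag 2, L2D2 is that difference taken twice, and FD1 is a first forward-difference. As a rough check, the following base R sketch reproduces a few of these columns for a single country; it uses stats::diff on the six ABW values printed above and does not handle panel structure, which is what the collapse functions add.

```r
# LIFEEX for ABW, 1960-1965, copied from the output above
x <- c(65.662, 66.074, 66.444, 66.787, 67.113, 67.435)

# Lag-2 difference x_t - x_{t-2} (column L2D1, available from 1962)
d_l2 <- diff(x, lag = 2)

# Lag-2 difference applied twice (column L2D2, available from 1964)
d_l2_it <- diff(x, lag = 2, differences = 2)

# Forward difference x_t - x_{t+1} (column FD1, available until 1964)
fd1 <- x[-length(x)] - x[-1]

round(d_l2, 3)
round(d_l2_it, 3)
round(fd1, 3)
```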
A further possibility is to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(\log(x_t) - \rho \log(x_{t-s})\). These are useful for panel regressions suffering from serial correlation, following Cochrane & Orcutt (1949), and can be specified with the rho argument to fdiff, D and Dlog.
# Regression of GDP on Life Expectancy with country and time FE
mod <- lm(PCGDP ~ LIFEEX, data = fHDwithin(fselect(pwlddev, PCGDP, LIFEEX), fill = FALSE))
mod
#
# Call:
# lm(formula = PCGDP ~ LIFEEX, data = fHDwithin(fselect(pwlddev,
# PCGDP, LIFEEX), fill = FALSE))
#
# Coefficients:
# (Intercept) LIFEEX
# 7.219e-14 -3.179e+02
# Computing autocorrelation of residuals
r <- residuals(mod)
r <- pwcor(r, L(r, 1, substr(names(r), 1, 3))) # Need this to compute a panel-lag
r
# [1] .98
# Running the regression again quasi-differencing the transformed data
modCO <- lm(PCGDP ~ LIFEEX, data = fdiff(fHDwithin(fselect(pwlddev, PCGDP, LIFEEX), variable.wise = FALSE), rho = r, stubs = FALSE))
modCO
#
# Call:
# lm(formula = PCGDP ~ LIFEEX, data = fdiff(fHDwithin(fselect(pwlddev,
# PCGDP, LIFEEX), variable.wise = FALSE), rho = r, stubs = FALSE))
#
# Coefficients:
# (Intercept) LIFEEX
# -8.033 -86.407
# In this case rho is almost 1, so we might as well just difference the untransformed data and go with that
# We also need to bootstrap this for proper standard errors.
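For readers who want to see the quasi-difference \(x_t - \rho x_{t-1}\) spelled out, here is a small base R sketch on toy data. The helper quasi_diff is hypothetical (it is not part of collapse) and assumes the rows are already sorted by time within each group, with no gaps.

```r
# Hypothetical helper: quasi-difference x_t - rho * x_{t-1} within groups.
# Assumes x is sorted by time within each group g and has no time gaps.
quasi_diff <- function(x, g, rho) {
  # Lag by one position within each group; the first row of a group gets NA
  lagged <- ave(x, g, FUN = function(v) c(NA, head(v, -1)))
  x - rho * lagged
}

x <- c(1, 2, 4, 10, 20, 40)
g <- c("a", "a", "a", "b", "b", "b")

qd <- quasi_diff(x, g, rho = 0.5)
qd
# With rho = 1 this reduces to the ordinary first difference within groups
fd <- quasi_diff(x, g, rho = 1)
fd
```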
A final important advantage of the collapse functions is that the panel-identifiers are preserved, even if a matrix of lags / leads / differences or growth rates is returned. This allows for nested panel-computations; for example, we can compute shifted sequences of lagged / leaded and iterated panel differences:
# Sequence of differences (same as above), adding one extra lag of the whole sequence
head(L(D(LIFEEX, -1:3, 1:2), 0:1))
# FD1 L1.FD1 FD2 L1.FD2 -- L1.-- D1 L1.D1 D2 L1.D2 L2D1 L1.L2D1 L2D2
# ABW-1960 -0.412 NA -0.042 NA 65.662 NA NA NA NA NA NA NA NA
# ABW-1961 -0.370 -0.412 -0.027 -0.042 66.074 65.662 0.412 NA NA NA NA NA NA
# ABW-1962 -0.343 -0.370 -0.017 -0.027 66.444 66.074 0.370 0.412 -0.042 NA 0.782 NA NA
# ABW-1963 -0.326 -0.343 -0.004 -0.017 66.787 66.444 0.343 0.370 -0.027 -0.042 0.713 0.782 NA
# ABW-1964 -0.322 -0.326 0.005 -0.004 67.113 66.787 0.326 0.343 -0.017 -0.027 0.669 0.713 -0.113
# ABW-1965 -0.327 -0.322 0.006 0.005 67.435 67.113 0.322 0.326 -0.004 -0.017 0.648 0.669 -0.065
# L1.L2D2 L3D1 L1.L3D1 L3D2 L1.L3D2
# ABW-1960 NA NA NA NA NA
# ABW-1961 NA NA NA NA NA
# ABW-1962 NA NA NA NA NA
# ABW-1963 NA 1.125 NA NA NA
# ABW-1964 NA 1.039 1.125 NA NA
# ABW-1965 -0.113 0.991 1.039 NA NA
All of this naturally generalizes to computations on pdata.frames:
head(D(pwlddev, -1:3, 1:2, cols = 9:10), 3)
# iso3c year FD1.PCGDP FD2.PCGDP PCGDP D1.PCGDP D2.PCGDP L2D1.PCGDP L2D2.PCGDP L3D1.PCGDP
# ABW-1960 ABW 1960 NA NA NA NA NA NA NA NA
# ABW-1961 ABW 1961 NA NA NA NA NA NA NA NA
# ABW-1962 ABW 1962 NA NA NA NA NA NA NA NA
# L3D2.PCGDP FD1.LIFEEX FD2.LIFEEX LIFEEX D1.LIFEEX D2.LIFEEX L2D1.LIFEEX L2D2.LIFEEX
# ABW-1960 NA -0.412 -0.042 65.662 NA NA NA NA
# ABW-1961 NA -0.370 -0.027 66.074 0.412 NA NA NA
# ABW-1962 NA -0.343 -0.017 66.444 0.370 -0.042 0.782 NA
# L3D1.LIFEEX L3D2.LIFEEX
# ABW-1960 NA NA
# ABW-1961 NA NA
# ABW-1962 NA NA
head(L(D(pwlddev, -1:3, 1:2, cols = 9:10), 0:1), 3)
# iso3c year FD1.PCGDP L1.FD1.PCGDP FD2.PCGDP L1.FD2.PCGDP PCGDP L1.PCGDP D1.PCGDP
# ABW-1960 ABW 1960 NA NA NA NA NA NA NA
# ABW-1961 ABW 1961 NA NA NA NA NA NA NA
# ABW-1962 ABW 1962 NA NA NA NA NA NA NA
# L1.D1.PCGDP D2.PCGDP L1.D2.PCGDP L2D1.PCGDP L1.L2D1.PCGDP L2D2.PCGDP L1.L2D2.PCGDP
# ABW-1960 NA NA NA NA NA NA NA
# ABW-1961 NA NA NA NA NA NA NA
# ABW-1962 NA NA NA NA NA NA NA
# L3D1.PCGDP L1.L3D1.PCGDP L3D2.PCGDP L1.L3D2.PCGDP FD1.LIFEEX L1.FD1.LIFEEX FD2.LIFEEX
# ABW-1960 NA NA NA NA -0.412 NA -0.042
# ABW-1961 NA NA NA NA -0.370 -0.412 -0.027
# ABW-1962 NA NA NA NA -0.343 -0.370 -0.017
# L1.FD2.LIFEEX LIFEEX L1.LIFEEX D1.LIFEEX L1.D1.LIFEEX D2.LIFEEX L1.D2.LIFEEX L2D1.LIFEEX
# ABW-1960 NA 65.662 NA NA NA NA NA NA
# ABW-1961 -0.042 66.074 65.662 0.412 NA NA NA NA
# ABW-1962 -0.027 66.444 66.074 0.370 0.412 -0.042 NA 0.782
# L1.L2D1.LIFEEX L2D2.LIFEEX L1.L2D2.LIFEEX L3D1.LIFEEX L1.L3D1.LIFEEX L3D2.LIFEEX
# ABW-1960 NA NA NA NA NA NA
# ABW-1961 NA NA NA NA NA NA
# ABW-1962 NA NA NA NA NA NA
# L1.L3D2.LIFEEX
# ABW-1960 NA
# ABW-1961 NA
# ABW-1962 NA
Viewing and transforming panel data stored in an array can be a powerful strategy, especially as it provides much more direct access to the different dimensions of the data. The function psmat can be used to efficiently transform a pseries into a 2D matrix, and a pdata.frame into a 3D array:
# Converting the panel series to array, individual rows (default)
str(psmat(LIFEEX))
# 'psmat' num [1:216, 1:59] 65.7 32.3 33.3 62.3 NA ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# - attr(*, "transpose")= logi FALSE
# Converting the panel series to array, individual columns
str(psmat(LIFEEX, transpose = TRUE))
# 'psmat' num [1:59, 1:216] 65.7 66.1 66.4 66.8 67.1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# - attr(*, "transpose")= logi TRUE
# Same as plm::as.matrix.pseries, apart from attributes
identical(unattrib(psmat(LIFEEX)),
unattrib(as.matrix(LIFEEX)))
# [1] TRUE
identical(unattrib(psmat(LIFEEX, transpose = TRUE)),
unattrib(as.matrix(LIFEEX, idbyrow = FALSE)))
# [1] TRUE
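Conceptually, the conversion places each observation in the cell given by its individual and time identifiers, leaving NA where a combination is not observed. The following base R sketch shows that idea on a small unbalanced toy panel; the data frame long and its column names are made up for illustration, and this is not how psmat is implemented internally.

```r
# Toy unbalanced panel in long format (illustrative data)
long <- data.frame(id   = c("A", "A", "B", "B", "B"),
                   year = c(2000, 2001, 2000, 2001, 2002),
                   x    = c(1, 2, 3, 4, 5))

ids <- sort(unique(long$id))
yrs <- sort(unique(long$year))

# Individuals in rows, time periods in columns, NA for unobserved cells
m <- matrix(NA_real_, nrow = length(ids), ncol = length(yrs),
            dimnames = list(ids, as.character(yrs)))

# Fill by (row, column) index pairs
m[cbind(match(long$id, ids), match(long$year, yrs))] <- long$x
m

# Individuals in columns instead (analogous to transpose = TRUE)
tm <- t(m)
```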
Applying psmat to a pdata.frame yields a 3D array:
psar <- psmat(pwlddev, cols = 9:12)
str(psar)
# 'psmat' num [1:216, 1:59, 1:4] NA NA NA NA NA ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..$ : chr [1:4] "PCGDP" "LIFEEX" "GINI" "ODA"
# - attr(*, "transpose")= logi FALSE
str(psmat(pwlddev, cols = 9:12, transpose = TRUE))
# 'psmat' num [1:59, 1:216, 1:4] NA NA NA NA NA NA NA NA NA NA ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# ..$ : chr [1:4] "PCGDP" "LIFEEX" "GINI" "ODA"
# - attr(*, "transpose")= logi TRUE
This format can be very convenient to quickly and freely access data for different countries, variables and time-periods:
# Looking at wealth, health and inequality in Brazil and Argentina, 1990-1999
aperm(psar[c("BRA","ARG"), as.character(1990:1999), c("PCGDP", "LIFEEX", "GINI")])
# , , BRA
#
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
# PCGDP 7987.1 7967.8 7797.8 8028.2 8320.3 8549.0 8599.1 8751.4 8645.5 8555.9
# LIFEEX 65.3 65.7 66.1 66.6 67.1 67.6 68.1 68.6 69.1 69.6
# GINI 60.5 NA 53.2 60.1 NA 59.6 59.9 59.8 59.6 59.0
#
# , , ARG
#
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
# PCGDP 6224.5 6698.0 7130.6 7612.7 7952.7 7630.0 7955.1 8500.9 8728.9 8339.9
# LIFEEX 71.6 71.8 72.0 72.3 72.5 72.7 72.9 73.2 73.4 73.6
# GINI NA 46.8 45.5 44.9 45.9 48.9 49.5 49.1 50.7 49.8
psmat can also return the output as a list of panel series matrices:
pslist <- psmat(pwlddev, cols = 9:12, array = FALSE)
str(pslist)
# List of 4
# $ PCGDP : 'psmat' num [1:216, 1:59] NA NA NA NA NA ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# .. ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..- attr(*, "transpose")= logi FALSE
# $ LIFEEX: 'psmat' num [1:216, 1:59] 65.7 32.3 33.3 62.3 NA ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# .. ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..- attr(*, "transpose")= logi FALSE
# $ GINI : 'psmat' num [1:216, 1:59] NA NA NA NA NA NA NA NA NA NA ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# .. ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..- attr(*, "transpose")= logi FALSE
# $ ODA : 'psmat' num [1:216, 1:59] NA 114440000 -380000 NA NA ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:216] "ABW" "AFG" "AGO" "ALB" ...
# .. ..$ : chr [1:59] "1960" "1961" "1962" "1963" ...
# ..- attr(*, "transpose")= logi FALSE
This list can then be unlisted using the function unlist2d (for unlisting in 2 dimensions), to yield a reshaped data.frame:
head(unlist2d(pslist, idcols = "Variable", row.names = "Country Code"), 3)
# Variable Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
# 1 PCGDP ABW NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 PCGDP AFG NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 PCGDP AGO NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
# 1 NA NA NA NA NA NA NA NA NA NA NA 15669.616 18427.61
# 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA 2969.96 2742.656 2646.013 2660.145 2724.889 2732.077 2730.993 2767.18
# 1988 1989 1990 1991 1992 1993 1994 1995 1996
# 1 22134.017 24837.951 25357.787 26329.313 26401.969 26663.21 27272.310 26705.181 26087.776
# 2 NA NA NA NA NA NA NA NA NA
# 3 2861.356 2786.726 2614.493 2560.063 2333.477 1716.21 1684.215 1878.793 2073.215
# 1997 1998 1999 2000 2001 2002 2003 2004 2005
# 1 27190.501 27151.92 26954.405 28417.384 26966.055 25508.3025 25469.287 27005.5295 26979.8854
# 2 NA NA NA NA NA 339.6333 352.244 341.6125 365.5487
# 3 2164.082 2204.91 2190.087 2189.561 2208.792 2426.4318 2412.393 2582.6465 2866.4347
# 2006 2007 2008 2009 2010 2011 2012 2013 2014
# 1 27046.7604 27428.1202 27367.2810 24464.1745 23512.603 24231.3389 23777.3161 24629.0800 24692.4972
# 2 372.8967 412.9196 418.4788 495.1089 550.515 536.0125 584.9074 597.5252 594.5741
# 3 3085.4248 3394.5123 3641.4475 3544.0266 3585.906 3580.2699 3750.2091 3799.4296 3846.2409
# 2015 2016 2017 2018
# 1 24452.6066 24288.9871 24508.8091 NA
# 2 585.7083 583.0551 583.8696 NA
# 3 3751.6945 3533.8652 3413.6564 NA
Of course we could also have applied some transformation (like computing pairwise correlations) to each matrix before unlisting. In any case this kind of programming provides lots of possibilities to explore and manipulate panel data (as we will see in Part 2).
Below, the collapse implementation is benchmarked against native plm. To do this, the dataset used so far is extended to approximately 1.27 million observations:
wlddevsmall <- get_vars(wlddev, c("iso3c","year","OECD","PCGDP","LIFEEX","GINI","ODA"))
wlddevsmall$iso3c <- as.character(wlddevsmall$iso3c)
data <- replicate(100, wlddevsmall, simplify = FALSE)
rm(wlddevsmall)
uniquify <- function(x, i) {
x$iso3c <- paste0(x$iso3c, i)
x
}
data <- unlist2d(Map(uniquify, data, as.list(1:100)), idcols = FALSE)
data <- pdata.frame(data, index = c("iso3c", "year"))
pdim(data)
# Balanced Panel: n = 21600, T = 59, N = 1274400
The data has 21600 individuals (countries), each observed for 59 years; the total number of rows is 1274400. We can pull out a series of life expectancy and run some benchmarks. The Windows laptop on which these benchmarks were run has a 2x 2.2 GHz Intel i5 processor, 8GB DDR3 RAM and a Samsung SSD hard drive.
library(microbenchmark)
# Creating the extended panel series for Life Expectancy (l for large)
LIFEEX_l <- data$LIFEEX
str(LIFEEX_l)
# 'pseries' Named num [1:1274400] 65.7 66.1 66.4 66.8 67.1 ...
# - attr(*, "names")= chr [1:1274400] "ABW1-1960" "ABW1-1961" "ABW1-1962" "ABW1-1963" ...
# - attr(*, "index")=Classes 'pindex' and 'data.frame': 1274400 obs. of 2 variables:
# ..$ iso3c: Factor w/ 21600 levels "ABW1","ABW10",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ year : Factor w/ 59 levels "1960","1961",..: 1 2 3 4 5 6 7 8 9 10 ...
# Between Transformations
microbenchmark(Between(LIFEEX_l, na.rm = TRUE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Between(LIFEEX_l, na.rm = TRUE) 253.4244 281.1413 352.8482 305.6462 348.1692 714.1521 10
microbenchmark(fbetween(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fbetween(LIFEEX_l) 11.54712 11.76043 23.03592 15.02094 26.31883 74.92062 10
# Within Transformations
microbenchmark(Within(LIFEEX_l, na.rm = TRUE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Within(LIFEEX_l, na.rm = TRUE) 414.255 624.1813 812.2177 764.9711 992.1353 1540.421 10
microbenchmark(fwithin(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(LIFEEX_l) 16.24745 16.86462 31.49424 19.68624 26.72849 119.8043 10
# Higher-Dimensional Between and Within Transformations
microbenchmark(fHDbetween(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fHDbetween(LIFEEX_l) 173.67 186.1395 226.1447 239.0539 257.6977 269.5982 10
microbenchmark(fHDwithin(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fHDwithin(LIFEEX_l) 127.9251 168.9224 190.0325 191.7788 224.84 235.0113 10
# Single Lag
microbenchmark(plm::lag(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# plm::lag(LIFEEX_l) 586.7393 622.3213 635.9714 629.3364 638.7477 733.4367 10
microbenchmark(flag(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(LIFEEX_l) 14.88595 16.41167 32.32346 26.30254 43.28341 83.0834 10
# Sequence of Lags / Leads
microbenchmark(plm::lag(LIFEEX_l, -1:3), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# plm::lag(LIFEEX_l, -1:3) 2.734188 2.777919 2.984943 2.955093 3.182653 3.289805 10
microbenchmark(flag(LIFEEX_l, -1:3), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(LIFEEX_l, -1:3) 43.04779 44.10896 61.28462 49.28723 73.56224 110.3804 10
# Single difference
microbenchmark(diff(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# diff(LIFEEX_l) 662.7563 704.7706 781.4277 727.1863 783.5312 1235.036 10
microbenchmark(fdiff(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(LIFEEX_l) 14.51646 16.7709 28.16077 19.37766 33.40035 70.43806 10
# Iterated Difference
microbenchmark(fdiff(LIFEEX_l, diff = 2), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(LIFEEX_l, diff = 2) 19.52069 20.63497 30.96798 22.57815 34.89528 81.04047 10
# Sequence of Lagged / Leaded and iterated differences
microbenchmark(fdiff(LIFEEX_l, -1:3, 1:2), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(LIFEEX_l, -1:3, 1:2) 88.19339 89.33266 120.1491 102.852 156.211 166.0311 10
# Single Growth Rate
microbenchmark(fgrowth(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fgrowth(LIFEEX_l) 18.94458 19.4716 30.52137 23.25824 33.57661 80.24615 10
# Single Log-Difference
microbenchmark(fdiff(LIFEEX_l, log = TRUE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(LIFEEX_l, log = TRUE) 57.45045 57.84404 92.37 61.64318 80.24213 329.8118 10
# Panel Series to Matrix Conversion
# system.time(as.matrix(LIFEEX_l)) This takes about 3 minutes to compute
microbenchmark(psmat(LIFEEX_l), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# psmat(LIFEEX_l) 4.361185 4.535221 4.770573 4.576054 4.956034 5.827111 10
This shows a comparison between flag and data.table’s shift:
microbenchmark(L(data, cols = 3:6), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# L(data, cols = 3:6) 17.32693 18.31983 30.20824 19.69918 38.44563 88.94621 10
library(data.table)
setDT(data)
# 'Improper' panel-lag
microbenchmark(data[, shift(.SD), by = iso3c, .SDcols = 3:6], times = 10)
# Unit: milliseconds
# expr min lq mean median uq max
# data[, shift(.SD), by = iso3c, .SDcols = 3:6] 575.845 736.8273 909.7106 761.8924 872.9928 2085.917
# neval
# 10
# This does what L is actually doing (without sorting the data)
microbenchmark(data[order(year), shift(.SD), by = iso3c, .SDcols = 3:6], times = 10)
# Unit: milliseconds
# expr min lq mean median
# data[order(year), shift(.SD), by = iso3c, .SDcols = 3:6] 514.4172 525.0187 581.6693 551.3714
# uq max neval
# 579.5685 783.2063 10
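The distinction between a purely positional lag within groups and a lag that respects the time variable matters whenever the panel has gaps or is not sorted. The base R sketch below illustrates one reasonable definition of a time-aware lag on toy data; the helper panel_lag is hypothetical and is not taken from collapse or data.table.

```r
# Hypothetical helper: the value of the same individual at time t - 1,
# or NA if that period is not observed. Does not require sorted data.
panel_lag <- function(x, id, t) {
  idx <- match(paste(id, t - 1), paste(id, t))
  x[idx]
}

id <- c("A", "A", "A", "B", "B")
t  <- c(2000, 2001, 2003, 2000, 2001)   # individual A has no record for 2002
x  <- c(10, 11, 13, 20, 21)

# Positional lag within groups: treats 2001 as the predecessor of 2003
naive <- ave(x, id, FUN = function(v) c(NA, head(v, -1)))

# Time-aware lag: returns NA for 2003 because 2002 is missing
proper <- panel_lag(x, id, t)

cbind(t, x, naive, proper)
```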
The above dataset has roughly 1 million obs in 20 thousand groups, but what about 10 million obs and 1 million groups? Do collapse functions scale efficiently as the data and the number of groups grow large? Here is a simple benchmark:
x <- rnorm(1e7) # 10 million obs
g <- qF(rep(1:1e6, each = 10), na.exclude = FALSE) # 1 million individuals
t <- qF(rep(1:10, 1e6), na.exclude = FALSE) # 10 time-periods per individual
microbenchmark(fbetween(x, g), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fbetween(x, g) 78.24919 79.56785 103.0975 84.16131 133.3149 140.1389 10
microbenchmark(fwithin(x, g), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fwithin(x, g) 77.72976 78.0631 103.5636 83.4578 136.2619 142.6785 10
microbenchmark(flag(x, 1, g, t), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(x, 1, g, t) 178.855 181.8761 186.2392 184.4096 187.0459 208.4528 10
microbenchmark(flag(x, -1:1, g, t), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# flag(x, -1:1, g, t) 265.7819 298.8645 391.0726 313.9243 570.7605 636.5245 10
microbenchmark(fdiff(x, 1, 1, g, t), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(x, 1, 1, g, t) 139.9095 153.8303 182.8425 159.1669 236.9462 243.1348 10
microbenchmark(fdiff(x, 1, 2, g, t), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(x, 1, 2, g, t) 169.8394 181.8649 205.7652 185.0371 260.1993 265.8337 10
microbenchmark(fdiff(x, -1:1, 1:2, g, t), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fdiff(x, -1:1, 1:2, g, t) 510.9244 559.6444 628.1527 578.668 696.8296 884.5181 10
The results show that collapse functions perform very well even as the number of groups grows large.
The conclusion of this benchmark analysis is that collapse’s fast functions, with or without the help of plm classes, allow for very fast transformations of panel data, and should enable R programmers and econometricians to implement high-performance panel data estimators without having to dive into C/C++ themselves or resort to data.table metaprogramming.
collapse also provides some essential functions to summarize and explore panel data, such as a fast check of variation over different dimensions, fast summary-statistics for panel data, panel-auto, partial-auto and cross-correlation functions, and a fast F-test to test fixed effects and other exclusion restrictions on (large) panel data models. Panel data to matrix conversion further allows the application of some correlational and unsupervised learning tools such as PCA, clustering or dynamic factor analysis.
The function varying can be used to check over which panel-dimensions different variables have variation. When passed a pdata.frame, varying by default takes the first identifier and checks for variation within that dimension.
# This checks for any variation within "iso3c", the first index variable: TRUE means data vary within country i.e. over time.
varying(pwlddev)
# country date year decade region income OECD PCGDP LIFEEX GINI ODA
# FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Alternatively any index variable or combination of index variables can be specified:
# This checks any variation within time variable, i.e. cross-sectional variation.
varying(pwlddev, effect = "year")
# country iso3c date decade region income OECD PCGDP LIFEEX GINI ODA
# TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Another possibility is checking for variation within each group:
# This checks cross-sectional variation within each year for the 4 indicators.
head(varying(pwlddev, effect = "year", cols = 9:12, any_group = FALSE))
# PCGDP LIFEEX GINI ODA
# 1960 TRUE TRUE NA TRUE
# 1961 TRUE TRUE NA TRUE
# 1962 TRUE TRUE NA TRUE
# 1963 TRUE TRUE NA TRUE
# 1964 TRUE TRUE NA TRUE
# 1965 TRUE TRUE NA TRUE
varying also has a pseries method. The code below checks for time-variation of the GINI index within each country. An NA is returned when there are no observations within a particular country.
head(varying(pwlddev$GINI, any_group = FALSE), 20)
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR
# NA NA TRUE TRUE NA NA TRUE TRUE NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA
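The logic of this check is simple enough to sketch in base R: within each group, drop missing values and test whether more than one distinct value remains. The helper varies_within below is hypothetical and only meant to illustrate the semantics described above, including the NA for groups without observations.

```r
# Hypothetical helper mirroring the semantics described above
varies_within <- function(x, g) {
  vapply(split(x, g), function(v) {
    v <- v[!is.na(v)]
    if (length(v) == 0L) NA else length(unique(v)) > 1L
  }, logical(1))
}

g <- c("A", "A", "B", "B", "C", "C")
x <- c(1, 2, 5, 5, NA, NA)

# Group-by-group result (comparable to any_group = FALSE)
res <- varies_within(x, g)
res

# Single summary: does x vary within at least one group?
any(res, na.rm = TRUE)
```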
If we would like to have more information about this variation, we could also invoke the functions fNdistinct and fsd, which do not have pseries methods:
head(fNdistinct(pwlddev$GINI, index(pwlddev, "iso3c")), 20)
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR
# 0 0 2 5 0 0 27 17 0 0 7 10 6 4 8 3 5 9 9 0
head(round(fsd(pwlddev$GINI, index(pwlddev, "iso3c")), 2), 20)
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR
# NA NA 6.58 1.78 NA NA 3.78 2.88 NA NA 1.29 0.71 9.53 4.37 0.82 4.60 5.98 3.02 1.94 NA
Efficient summary statistics for panel data have long been implemented in other statistical software. The command qsu, shorthand for ‘quick-summary’, is a very efficient summary statistics command inspired by the xtsummarize command in the Stata statistical software. It computes a default set of 5 statistics (N, mean, sd, min and max) and can also compute higher moments (skewness and kurtosis) in a single pass through the data (using a numerically stable online algorithm generalized from Welford’s Algorithm for variance computations). With panel data, qsu computes these statistics not just on the raw data, but also on the between-transformed and within-transformed data:
qsu(pwlddev, cols = 9:12, higher = TRUE)
# , , PCGDP
#
# N/T Mean SD Min Max Skew Kurt
# Overall 8995 11563.6529 18348.4052 131.6464 191586.64 3.1121 16.9585
# Between 203 12488.8577 19628.3668 255.3999 141165.083 3.214 17.2533
# Within 44.3103 11563.6529 6334.9523 -30529.0928 75348.067 0.696 17.0534
#
# , , LIFEEX
#
# N/T Mean SD Min Max Skew Kurt
# Overall 11068 63.8411 11.4497 18.907 85.4171 -0.6692 2.6458
# Between 207 64.5285 10.0235 39.349 85.4171 -0.5253 2.2298
# Within 53.4686 63.8411 5.8292 33.4671 83.8595 -0.2508 3.7497
#
# , , GINI
#
# N/T Mean SD Min Max Skew Kurt
# Overall 1356 39.3976 9.6764 16.2 65.8 0.4613 2.2932
# Between 161 39.5799 8.3679 23.3667 61.7143 0.5169 2.6715
# Within 8.4224 39.3976 3.0406 23.9576 54.7976 0.1421 5.7781
#
# , , ODA
#
# N/T Mean SD Min Max Skew Kurt
# Overall 8336 428,746468 819,868971 -1.08038000e+09 2.45521800e+10 7.1918 122.9003
# Between 178 418,026522 548,293709 423846.154 3.53258914e+09 2.4742 10.6503
# Within 46.8315 428,746468 607,024040 -2.47969577e+09 2.35093916e+10 10.3024 298.1213
Key statistics to look at in this summary are the sample size and the standard deviation, decomposed into the between-individuals and the within-individuals standard deviation. For GDP per Capita we have 8995 observations in the panel series for 203 countries, with on average 44.31 observations (time-periods T) per country. The between-country standard deviation is about 19600 USD, around 3 times larger than the within-country (over-time) standard deviation of about 6300 USD. Regarding the mean, the between-mean, computed as a cross-sectional average of country averages, usually differs slightly from the overall average taken across all data points. The within-transformed data is computed and summarized with the overall mean added back (i.e. as in fwithin(PCGDP, mean = "overall.mean")).
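The decomposition described above can be reproduced by hand on a toy panel. The following base R sketch is only an illustration with made-up numbers, not the single-pass algorithm qsu uses: the between series consists of one mean per individual, and the within series is the deviation from the individual mean with the overall mean added back.

```r
# Toy unbalanced panel: individual A has 2 observations, B has 3
x <- c(1, 3, 10, 14, 18)
g <- c("A", "A", "B", "B", "B")

# Between-transformed data: one mean per individual
between <- vapply(split(x, g), mean, numeric(1))

# Within-transformed data: deviations from individual means,
# with the overall mean added back
within <- x - ave(x, g) + mean(x)

# Means: the within mean equals the overall mean by construction,
# while the between mean weights each individual equally
c(overall = mean(x), between = mean(between), within = mean(within))

# Standard deviations of the three series
c(overall = sd(x), between = sd(between), within = sd(within))
```

Because B contributes more observations than A, the overall mean (9.2) is pulled towards B, whereas the between mean (8) is not; this is why the two differ in unbalanced panels.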
We can also do groupwise panel-statistics, and qsu also supports weights (not shown):
qsu(pwlddev, ~ income, cols = 9:12, higher = TRUE)
# , , Overall, PCGDP
#
# N/T Mean SD Min Max Skew Kurt
# High income 3038 28974.7264 22910.7155 944.2924 191586.64 2.1549 10.2511
# Low income 1405 596.7977 308.2129 131.6464 1506.3002 1.1497 3.5874
# Lower middle income 2120 1583.371 890.7439 150.2214 4662.8838 0.829 3.2752
# Upper middle income 2432 4849.7499 2959.2271 131.9634 20333.9404 1.3181 5.2091
#
# , , Between, PCGDP
#
# N/T Mean SD Min Max Skew Kurt
# High income 70 28974.7264 20222.5425 5191.5912 141165.083 2.1381 10.2783
# Low income 30 596.7977 276.0001 255.3999 1340.7236 1.2822 3.8003
# Lower middle income 47 1583.371 702.7388 410.2004 3120.4375 0.3045 2.1268
# Upper middle income 56 4849.7499 2325.3376 1662.0344 13171.5265 1.3496 5.0979
#
# , , Within, PCGDP
#
# N/T Mean SD Min Max Skew Kurt
# High income 43.4 11563.6529 10767.9925 -30529.0928 75348.067 0.4168 6.0456
# Low income 46.8333 11563.6529 137.1828 11020.597 12234.6404 0.3925 4.9092
# Lower middle income 45.1064 11563.6529 547.3416 9717.2022 14037.9041 0.6503 4.9802
# Upper middle income 43.4286 11563.6529 1830.254 4528.6387 24375.5944 0.7237 8.4739
#
# , , Overall, LIFEEX
#
# N/T Mean SD Min Max Skew Kurt
# High income 3682 73.2157 5.5133 42.672 85.4171 -1.0372 5.8088
# Low income 1881 49.6189 8.8925 27.61 74.43 0.2409 2.6436
# Lower middle income 2628 58.555 9.3854 18.907 76.253 -0.4329 2.7685
# Upper middle income 2877 65.9705 7.6509 36.74 79.831 -1.0301 3.9779
#
# , , Between, LIFEEX
#
# N/T Mean SD Min Max Skew Kurt
# High income 74 73.2157 3.3446 63.3102 85.4171 -0.6454 3.1733
# Low income 33 49.6189 5.2483 39.349 66.6884 1.267 5.6728
# Lower middle income 47 58.555 6.6336 44.2881 71.1231 -0.1694 2.2668
# Upper middle income 53 65.9705 5.1299 47.2945 73.9854 -1.1867 4.9499
#
# , , Within, LIFEEX
#
# N/T Mean SD Min Max Skew Kurt
# High income 49.7568 63.8411 4.3829 43.2028 77.5598 -0.4732 4.0705
# Low income 57 63.8411 7.1785 43.7422 83.2612 0.0029 2.5546
# Lower middle income 55.9149 63.8411 6.6395 33.4671 83.8595 -0.1981 3.5523
# Upper middle income 54.283 63.8411 5.6763 41.2874 81.9514 -0.4808 3.8563
#
# , , Overall, GINI
#
# N/T Mean SD Min Max Skew Kurt
# High income 478 34.3188 7.8637 21 58.9 1.3029 4.1506
# Low income 109 41.4743 6.7878 28.9 65.8 0.6488 3.9304
# Lower middle income 330 40.0652 9.3641 24 63.2 0.4795 2.2733
# Upper middle income 439 43.91 9.7535 16.2 64.8 -0.1703 2.4102
#
# , , Between, GINI
#
# N/T Mean SD Min Max Skew Kurt
# High income 40 34.3188 7.6207 25.2769 54.2208 1.2832 3.8605
# Low income 30 41.4743 4.9098 32.1333 53.7 0.2579 3.058
# Lower middle income 45 40.0652 8.675 27.9263 56.25 0.419 1.8827
# Upper middle income 46 43.91 9.24 23.3667 61.7143 -0.1577 2.1158
#
# , , Within, GINI
#
# N/T Mean SD Min Max Skew Kurt
# High income 11.95 39.3976 1.9394 31.2226 46.8583 -0.1926 5.4996
# Low income 3.6333 39.3976 4.687 23.9576 54.7976 0.0331 4.1693
# Lower middle income 7.3333 39.3976 3.5256 28.8087 54.4976 0.441 4.35
# Upper middle income 9.5435 39.3976 3.1229 26.3076 52.5309 -0.0475 4.7149
#
# , , Overall, ODA
#
# N/T Mean SD Min Max Skew
# High income 1627 151,154554 415,406000 -512,730000 4.64666000e+09 5.2927
# Low income 1798 544,223382 792,312970 -450000 1.11545600e+10 4.796
# Lower middle income 2378 680,100029 1.00278593e+09 -486,220000 1.12780600e+10 3.7602
# Upper middle income 2533 289,108010 757,988522 -1.08038000e+09 2.45521800e+10 16.1195
# Kurt
# High income 37.4383
# Low income 40.1389
# Lower middle income 24.5671
# Upper middle income 445.0067
#
# , , Between, ODA
#
# N/T Mean SD Min Max Skew Kurt
# High income 43 151,154554 335,970871 423846.154 2.16970133e+09 4.159 21.1813
# Low income 33 544,223382 399,556253 59,763076.9 1.41753857e+09 1.0153 2.8419
# Lower middle income 47 680,100029 753,840926 26,981379.3 3.53258914e+09 2.041 7.1017
# Upper middle income 55 289,108010 377,699701 10,907561 1.96011067e+09 2.1651 7.3722
#
# , , Within, ODA
#
# N/T Mean SD Min Max Skew
# High income 37.8372 428,746468 244,306608 -923,883087 2.90570513e+09 2.3015
# Low income 54.4848 428,746468 684,189040 -944,301290 1.01926687e+10 4.3134
# Lower middle income 50.5957 428,746468 661,289258 -2.47969577e+09 1.07855444e+10 3.9138
# Upper middle income 46.0545 428,746468 657,183031 -2.18778866e+09 2.35093916e+10 19.4564
# Kurt
# High income 30.2378
# Low income 44.8548
# Lower middle income 48.0143
# Upper middle income 630.5758
Here it should be noted that any grouping is applied independently from the data-transformation, i.e. the data is first transformed, and then grouped statistics are calculated on the transformed data. The computation of statistics is very efficient:
qsu(LIFEEX_l)
# N/T Mean SD Min Max
# Overall 1,106800 63.8411 11.4492 18.907 85.4171
# Between 20700 64.5285 9.9995 39.349 85.4171
# Within 53.4686 63.8411 5.829 33.4671 83.8595
microbenchmark(qsu(LIFEEX_l))
# Unit: milliseconds
# expr min lq mean median uq max neval
# qsu(LIFEEX_l) 19.35022 21.45562 24.50617 22.94073 25.32615 136.8108 100
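The remark that grouping is applied after the transformation explains why, in the grouped output further above, the Within mean equals the overall mean in every income group. A small base R sketch with made-up data (the variables id and grp are illustrative stand-ins for country and income group) shows the same effect:

```r
x   <- c(1, 3, 10, 14, 18, 30, 34)
id  <- c("A", "A", "B", "B", "B", "C", "C")               # individuals
grp <- c("low", "low", "high", "high", "high", "high", "high")  # groups of individuals

# Step 1: within-transform by individual, adding back the overall mean
within <- x - ave(x, id) + mean(x)

# Step 2: compute grouped statistics on the transformed data
grp_mean <- as.vector(tapply(within, grp, mean))
grp_sd   <- as.vector(tapply(within, grp, sd))

grp_mean   # each group mean equals mean(x)
grp_sd
```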
Using the transformation functions and the functions pwcor and pwcov, we can also easily explore the correlation structure of the data:
# Overall pairwise correlations with pairwise observation count and significance testing (* = significant at 5% level)
pwcor(get_vars(pwlddev, 9:12), N = TRUE, P = TRUE)
# PCGDP LIFEEX GINI ODA
# PCGDP 1 (8995) .57* (8398) -.42* (1342) -.16* (6852)
# LIFEEX .57* (8398) 1 (11068) -.34* (1353) -.02* (7746)
# GINI -.42* (1342) -.34* (1353) 1 (1356) -.17* (951)
# ODA -.16* (6852) -.02* (7746) -.17* (951) 1 (8336)
# Between correlations
pwcor(fmean(get_vars(pwlddev, 9:12), pwlddev$iso3c), N = TRUE, P = TRUE)
# PCGDP LIFEEX GINI ODA
# PCGDP 1 (203) .60* (197) -.41* (159) -.24* (169)
# LIFEEX .60* (197) 1 (207) -.41* (160) -.18* (172)
# GINI -.41* (159) -.41* (160) 1 (161) -.17 (139)
# ODA -.24* (169) -.18* (172) -.17 (139) 1 (178)
# Within correlations
pwcor(W(pwlddev, cols = 9:12, keep.ids = FALSE), N = TRUE, P = TRUE)
# W.PCGDP W.LIFEEX W.GINI W.ODA
# W.PCGDP 1 (8995) .30* (8398) -.03 (1342) -.01 (6852)
# W.LIFEEX .30* (8398) 1 (11068) -.15* (1353) .14* (7746)
# W.GINI -.03 (1342) -.15* (1353) 1 (1356) -.02 (951)
# W.ODA -.01 (6852) .14* (7746) -.02 (951) 1 (8336)
The correlations show that the between (cross-country) relationships of these macro-variables are quite strong, but within countries the relationships are much weaker. For example, there seems to be no significant relationship between GDP per Capita and either inequality or ODA received within countries over time.
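That between and within correlations can tell very different stories is easy to demonstrate with constructed data. In the base R sketch below (toy numbers chosen to make the point, not drawn from wlddev), the two variables move together across individuals but in opposite directions within each individual:

```r
id <- rep(c("A", "B", "C"), each = 4)
x  <- c(1, 2, 3, 4,  11, 12, 13, 14,  21, 22, 23, 24)
y  <- c(4, 3, 2, 1,  14, 13, 12, 11,  24, 23, 22, 21)

# Overall correlation on the raw data
overall <- cor(x, y)

# Between correlation: correlate the individual means
between <- cor(as.vector(tapply(x, id, mean)), as.vector(tapply(y, id, mean)))

# Within correlation: correlate deviations from individual means
within <- cor(x - ave(x, id), y - ave(y, id))

round(c(overall = overall, between = between, within = within), 3)
```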
We can take a single panel series such as GDP per Capita and explore it further:
# Generating a (transposed) matrix of country GDPs per capita
tGDPmat <- psmat(PCGDP, transpose = TRUE)
tGDPmat[1:10, 1:10]
# ABW AFG AGO ALB AND ARE ARG ARM ASM ATG
# 1960 NA NA NA NA NA NA 5605 NA NA NA
# 1961 NA NA NA NA NA NA 5815 NA NA NA
# 1962 NA NA NA NA NA NA 5675 NA NA NA
# 1963 NA NA NA NA NA NA 5291 NA NA NA
# 1964 NA NA NA NA NA NA 5739 NA NA NA
# 1965 NA NA NA NA NA NA 6251 NA NA NA
# 1966 NA NA NA NA NA NA 6121 NA NA NA
# 1967 NA NA NA NA NA NA 6227 NA NA NA
# 1968 NA NA NA NA NA NA 6435 NA NA NA
# 1969 NA NA NA NA NA NA 6955 NA NA NA
# plot the matrix (it will plot correctly no matter how the matrix is transposed)
plot(tGDPmat, main = "GDP per Capita")
# Taking series with more than 20 observations
suffsamp <- tGDPmat[, fNobs(tGDPmat) > 20]
# Minimum pairwise observations between any two series:
min(pwNobs(suffsamp))
# [1] 17
# We can use the pairwise-correlations of the annual growth rates to hierarchically cluster the economies:
plot(hclust(as.dist(1-pwcor(G(suffsamp)))))
# Finally we could do PCA on the growth rates:
eig <- eigen(pwcor(G(suffsamp)))
plot(seq_col(suffsamp), eig$values/sum(eig$values)*100, xlab = "Number of Principal Components", ylab = "% Variance Explained", main = "Screeplot")
There is also a nice plot-method applied to panel series arrays returned when psmat is applied to a panel data.frame:
Above we have explored the cross-sectional relationship between the different national GDP series. Now we explore the time-dependence of the panel-vectors as a whole:
The functions psacf, pspacf and psccf mimic stats::acf, stats::pacf and stats::ccf for panel-vectors and panel data.frames. Below we compute the panel series autocorrelation function of the data:
The computation is conducted by first scaling and centering (i.e. standardizing) the panel-vectors by groups (using fscale, default argument gscale = TRUE), and then taking the covariance of each series with a matrix of properly computed panel-lags of itself (using flag), and dividing that by the variance of the overall series (using fvar).
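The three steps just described can be sketched in base R as follows. This is only an illustration under stated assumptions, not the internals of psacf: the toy data and the helper names `panel_lag` and `panel_acf` are hypothetical, and the data are assumed to be sorted by time within each group.

```r
# Hypothetical toy panel: 2 groups, 6 periods each, sorted by time within group
id <- rep(c("A", "B"), each = 6)
x  <- c(1, 2, 4, 3, 5, 6,   50, 40, 45, 30, 35, 20)

# Step 1: standardize (center and scale) within each group
xs <- ave(x, id, FUN = function(v) (v - mean(v)) / sd(v))

# Step 2: panel lag; the first k observations of each group become NA,
# so values never leak from one group into the next
panel_lag <- function(v, g, k = 1) {
  ave(v, g, FUN = function(z) c(rep(NA, k), head(z, -k)))
}

# Step 3: covariance with the lagged series, divided by the overall variance
panel_acf <- function(v, g, k) {
  cov(v, panel_lag(v, g, k), use = "complete.obs") / var(v)
}

acf_1_to_3 <- sapply(1:3, function(k) panel_acf(xs, id, k))
```

The point of the group-wise lag is that a plain `c(NA, head(x, -1))` would pair the first observation of group B with the last observation of group A, which contaminates the autocorrelation estimate.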
In a similar way we can compute the Partial-ACF (using a multivariate Yule-Walker decomposition on the ACF, as in stats::pacf), and the panel-cross-correlation function between GDP per capita and life expectancy (which is already contained in the ACF plot above):
As a final step of exploration, we could analyze our series and simple models for the significance and explanatory power of individual or time-fixed effects, without going all the way to running a Hausman Test of fixed vs. random effects on a fully specified model. The main function here is fFtest, which efficiently computes a fast R-Squared based F-test of exclusion restrictions on models potentially involving many factors. By default (argument full.df = TRUE) the degrees of freedom of the test are adjusted to make it identical to the F-statistic from regressing the series on a set of country and time dummies [1].
# Testing GDP per Capita
fFtest(PCGDP, index(PCGDP)) # Testing individual and time-fixed effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.907 259 8735 330.778 0.000
fFtest(PCGDP, index(PCGDP, 1)) # Testing individual effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.881 215 8779 301.712 0.000
fFtest(PCGDP, index(PCGDP, 2)) # Testing time effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.026 58 8936 4.112 0.000
# Same for Life-Expectancy
fFtest(LIFEEX, index(LIFEEX)) # Testing individual and time-fixed effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.929 262 10805 536.797 0.000
fFtest(LIFEEX, index(LIFEEX, 1)) # Testing individual effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.741 215 10852 144.257 0.000
fFtest(LIFEEX, index(LIFEEX, 2)) # Testing time effects
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.201 58 11009 47.740 0.000
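The idea behind an R-Squared based F-test of a single factor can be sketched in base R. This is a hedged illustration on simulated data (the objects `g`, `y` and the other names are hypothetical), not the implementation of fFtest: the R-Squared is obtained by centering on group means rather than by running a regression, and the resulting F-statistic matches the one from an explicit dummy regression with `lm`.

```r
set.seed(1)
g <- factor(rep(1:5, each = 8))      # 5 groups, 8 observations each
y <- rnorm(40) + as.integer(g)       # outcome with group-specific means

# R-squared of the dummy regression, obtained by centering on group means
r2 <- 1 - sum((y - ave(y, g))^2) / sum((y - mean(y))^2)

# F-statistic from the R-squared and the degrees of freedom
df1  <- nlevels(g) - 1               # number of exclusion restrictions
df2  <- length(y) - nlevels(g)       # residual degrees of freedom
f_r2 <- (r2 / df1) / ((1 - r2) / df2)
p_r2 <- pf(f_r2, df1, df2, lower.tail = FALSE)

# Reference: the same test from an explicit regression on group dummies
f_lm <- summary(lm(y ~ g))$fstatistic
```

The degrees of freedom follow the same logic as in the output above, where testing individual effects for 216 countries gives DF1 = 215.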
Below we test the correlation between the country and time-means of GDP and Life-Expectancy:
cor.test(B(PCGDP), B(LIFEEX)) # Testing correlation of country means
#
# Pearson's product-moment correlation
#
# data: B(PCGDP) and B(LIFEEX)
# t = 75.595, df = 8396, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.6234848 0.6489418
# sample estimates:
# cor
# 0.6363865
cor.test(B(PCGDP, effect = 2), B(LIFEEX, effect = 2)) # Same for time-means
#
# Pearson's product-moment correlation
#
# data: B(PCGDP, effect = 2) and B(LIFEEX, effect = 2)
# t = 346.09, df = 8396, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.9652616 0.9680649
# sample estimates:
# cor
# 0.9666922
We can also test for the significance of individual and time-fixed effects (or both) in the regression of GDP on life expectancy and ODA received:
fFtest(PCGDP, index(PCGDP), get_vars(pwlddev, c("LIFEEX","ODA"))) # Testing individual and time-fixed effects
# R-Sq. DF1 DF2 F-Stat. P-Value
# Full Model 0.917 223 6294 310.740 0.000
# Restricted Model 0.173 2 6515 681.334 0.000
# Exclusion Rest. 0.744 221 6294 254.388 0.000
fFtest(PCGDP, index(PCGDP, 2), get_vars(pwlddev, c("iso3c","LIFEEX","ODA"))) # Testing time-fixed effects
# R-Sq. DF1 DF2 F-Stat. P-Value
# Full Model 0.917 223 6294 310.740 0.000
# Restricted Model 0.912 167 6350 393.500 0.000
# Exclusion Rest. 0.005 56 6294 6.546 0.000
As can be expected in this cross-country data, individual and time-fixed effects play a large role in explaining the data, and these effects are correlated across series, suggesting that a fixed-effects model with both types of fixed-effects would be appropriate. To round things off, below we compute the Hausman test of Fixed vs. Random effects, which confirms this conclusion:
A central goal of the collapse package is to facilitate advanced and fast programming with data. A primary field of application for the fast functions introduced above is to program efficient panel data estimators. In this section we walk through a short example of how this can be done. The application will be an implementation of the Hausman and Taylor (1981) estimator, considering a more general case than currently implemented in the plm package.
In Hausman and Taylor (1981), in a more general scenario, we have a linear panel-model of the form \[y_{it} = \beta_1X_{1it} + \beta_2X_{2it} + \beta_3Z_{1i} + \beta_4Z_{2i} + \alpha_i + \gamma_t + \epsilon_{it}\] where \(\alpha_i\) denotes unobserved individual specific effects and \(\gamma_t\) denotes unobserved global events. This model has up to 4 kinds of covariates:
Time-Varying covariates \(X_{1it}\) that are uncorrelated with the individual specific effect \(\alpha_i\), such that \(E[X_{1it}\alpha_i] = 0\). It may be the case that \(E[X_{1it}\gamma_t] \neq 0\)
Time-Varying covariates \(X_{2it}\) with \(E[X_{2it}\alpha_i] \neq 0\) and possibly \(E[X_{2it}\gamma_t] \neq 0\)
Time-Invariant covariates \(Z_{1i}\) with \(E[Z_{1i}\alpha_i] = 0\)
Time-Invariant covariates \(Z_{2i}\) with \(E[Z_{2i}\alpha_i] \neq 0\)
The main estimation problem arises from \(E[Z_{2i}\alpha_i] \neq 0\), which would usually prevent us from estimating \(\beta_4\) since taking a within-transformation (fixed effects) would remove \(Z_{2i}\) from the equation. Hausman and Taylor (1981) stipulated that since \(E[X_{1it}\alpha_i] = 0\), one could use \(X_{1i.}\), i.e. the between-transformed \(X_{1it}\), to instrument for \(Z_{2i}\). They propose an IV/2SLS estimation of the whole equation where the within-transformed covariates \(\tilde{X}_{1it}\) and \(\tilde{X}_{2it}\) are used to instrument \(X_{1it}\) and \(X_{2it}\), and \(X_{1i.}\) instruments \(Z_{2i}\). Assuming that missing values have been removed beforehand, and also taking into account the possibility that \(E[X_{1it}\gamma_t] \neq 0\) and \(E[X_{2it}\gamma_t] \neq 0\) (i.e. accounting for time fixed-effects), this estimator can be coded as follows:
HT_est <- function(y, X1, Z2, X2 = NULL, Z1 = NULL, time.FE = FALSE) {
  # Create matrix of independent variables
  X <- cbind(Intercept = 1, do.call(cbind, c(X1, X2, Z1, Z2)))
  # Create instrument matrix: if time.FE, higher-order demean X1 and X2, else normal demeaning
  IVS <- cbind(Intercept = 1, do.call(cbind,
           c(if(time.FE) fHDwithin(X1, na.rm = FALSE) else fwithin(X1, na.rm = FALSE),
             if(is.null(X2)) X2 else if(time.FE) fHDwithin(X2, na.rm = FALSE) else fwithin(X2, na.rm = FALSE),
             Z1, fbetween(X1, na.rm = FALSE))))
  if(length(IVS) == length(X)) { # The IV estimator case
    return(drop(solve(crossprod(IVS, X), crossprod(IVS, y))))
  } else { # The 2SLS case
    Xhat <- qr.fitted(qr(IVS), X) # First stage
    return(drop(qr.coef(qr(Xhat), y))) # Second stage
  }
}
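For reference, the two branches of the function correspond to the standard IV and 2SLS formulas. The notation below is an assumption introduced here for illustration: \(X\) is the regressor matrix built in the first step and \(W\) stands for the instrument matrix called IVS in the code.

```latex
% Instrument matrix (the time.FE = FALSE case); W is notation assumed here
W = \begin{bmatrix} 1 & \tilde{X}_{1it} & \tilde{X}_{2it} & Z_{1i} & X_{1i.} \end{bmatrix}

% Exactly identified case (W and X have the same number of columns): IV estimator
\hat{\beta}_{IV} = (W'X)^{-1} W'y

% Overidentified case: 2SLS, with first-stage fitted values \hat{X}
\hat{X} = W (W'W)^{-1} W'X, \qquad
\hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1} \hat{X}'y
```

When \(W\) and \(X\) have the same number of columns and \(W'X\) is invertible, the 2SLS expression reduces to the IV expression, which is why the code can take the cheaper `solve(crossprod(IVS, X), crossprod(IVS, y))` route in that case.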
The estimator is written in such a way that variables of the type \(X_{2it}\) and \(Z_{1i}\) are optional, and it includes an option to project out time-FE as well. The expected inputs for \(X_{1it}\) (X1) and \(X_{2it}\) (X2) are column-subsets of a pdata.frame.
Having coded the estimator, it would be good to have an example to run it on. I have tried to squeeze an example out of the wlddev data used so far in this vignette. It is quite crappy and suffers from a weak-IV problem, but for the sake of illustration let's do it:
We want to estimate the panel-regression of life-expectancy on GDP per Capita, ODA received, the GINI index and a time-invariant dummy indicating whether the country is an OECD member. All variables except the dummy enter in logs, so this is an elasticity regression.
dat <- get_vars(wlddev, c("iso3c","year","OECD","PCGDP","LIFEEX","GINI","ODA"))
get_vars(dat, 4:7) <- lapply(get_vars(dat, 4:7), log) # Taking logs of the data
dat$OECD <- as.numeric(dat$OECD) # Creating OECD dummy
dat <- pdata.frame(droplevels(na_omit(dat)), # Creating Panel data.frame, after removing missing values
index = c("iso3c", "year")) # and dropping unused factor levels
pdim(dat)
# Unbalanced Panel: n = 132, T = 1-30, N = 918
varying(dat)
# year OECD PCGDP LIFEEX GINI ODA
# TRUE FALSE TRUE TRUE TRUE TRUE
Using the GINI index cost a lot of observations and brought the sample size down to 918, but the GINI index will be a key variable in what follows. Clearly the OECD dummy is time-invariant. Below we run Hausman-tests of fixed vs. random effects to determine which covariates might be correlated with the unobserved individual effects, and which model would be most appropriate.
# This tests whether each of the covariates is correlated with alpha_i
phtest(LIFEEX ~ PCGDP, dat) # Likely correlated
#
# Hausman Test
#
# data: LIFEEX ~ PCGDP
# chisq = 13.085, df = 1, p-value = 0.0002977
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ ODA, dat) # Likely correlated
#
# Hausman Test
#
# data: LIFEEX ~ ODA
# chisq = 41.803, df = 1, p-value = 1.009e-10
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ GINI, dat) # Likely not correlated !
#
# Hausman Test
#
# data: LIFEEX ~ GINI
# chisq = 1.3343, df = 1, p-value = 0.248
# alternative hypothesis: one model is inconsistent
phtest(LIFEEX ~ PCGDP + ODA + GINI, dat) # Fixed Effects is the appropriate model for this regression
#
# Hausman Test
#
# data: LIFEEX ~ PCGDP + ODA + GINI
# chisq = 20.652, df = 3, p-value = 0.0001244
# alternative hypothesis: one model is inconsistent
The tests suggest that both GDP per Capita and ODA are correlated with country-specific unobservables affecting life-expectancy, and overall a fixed-effects model would be appropriate. However, the Hausman test on the GINI index fails to reject: Country specific unobservables affecting average life-expectancy are not necessarily correlated with the level of inequality across countries.
Now if we want to include the OECD dummy in the regression, we cannot use fixed-effects as this would wipe out the dummy as well. If the dummy is uncorrelated with the country-specific unobservables affecting life-expectancy (the \(\alpha_i\)), then we could use a solution suggested by Mundlak (1978) and simply add between-transformed versions of PCGDP and ODA in the regression (in addition to PCGDP and ODA in levels), and so ‘control’ for the part of PCGDP and ODA correlated with the \(\alpha_i\) (in the IV literature this is known as the control-function approach). If, however, the OECD dummy is correlated with the \(\alpha_i\), then we need to use the Hausman and Taylor (1981) estimator. Below I suggest 2 methods of testing this correlation:
# Testing the correlation between OECD dummy and the Between-transformed Life-Expectancy (i.e. not accounting for other covariates)
cor.test(dat$OECD, B(dat$LIFEEX)) # -> Significant correlation of 0.21
#
# Pearson's product-moment correlation
#
# data: dat$OECD and B(dat$LIFEEX)
# t = 6.4945, df = 916, p-value = 1.364e-10
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.1471020 0.2708361
# sample estimates:
# cor
# 0.2098089
# Getting the fixed-effects (estimates of alpha_i) from the model (i.e. accounting for the other covariates)
fe <- fixef(plm(LIFEEX ~ PCGDP + ODA + GINI, dat, model = "within"))
mODA <- fmean(dat$ODA, dat$iso3c)
# Again testing the correlation
cor.test(fe, mODA[match(names(fe), names(mODA))]) # -> Not significant, but probably due to the small sample size; the correlation is still 0.13
#
# Pearson's product-moment correlation
#
# data: fe and mODA[match(names(fe), names(mODA))]
# t = 1.4906, df = 130, p-value = 0.1385
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.04217488 0.29399213
# sample estimates:
# cor
# 0.1296318
I interpret the test results as rejecting the hypothesis that the dummy is uncorrelated with \(\alpha_i\), thus we do have a case for Hausman and Taylor (1981) here: the OECD dummy is a \(Z_{2i}\) with \(E[Z_{2i}\alpha_i]\neq 0\). The Hausman tests above suggested that the GINI index is the only variable uncorrelated with \(\alpha_i\), thus GINI is \(X_{1it}\) with \(E[X_{1it}\alpha_i] = 0\). Finally PCGDP and ODA jointly constitute \(X_{2it}\), where the Hausman tests strongly suggested that \(E[X_{2it}\alpha_i] \neq 0\). We do not have a \(Z_{1i}\) in this setup, i.e. a time-invariant variable uncorrelated with the \(\alpha_i\).
The Hausman and Taylor (1981) estimator stipulates that we should instrument the OECD dummy with \(X_{1i.}\), the between-transformed GINI index. Let us therefore test the regression of the dummy on this instrument to see if it would be a good (i.e. relevant) instrument:
# This computes the regression of OECD on the GINI instrument: Weak IV problem !!
fFtest(dat$OECD, B(dat$GINI))
# R-Sq. DF1 DF2 F-Stat. P-value
# 0.000 1 916 0.212 0.645
The 0 R-Squared and the F-Statistic of 0.21 suggest that the instrument is very weak indeed, rubbish to be precise, thus the implementation of the HT estimator below is also a rubbish example, but it is still good for illustration purposes:
HT_est(y = dat$LIFEEX,
X1 = get_vars(dat, "GINI"),
Z2 = get_vars(dat, "OECD"),
X2 = get_vars(dat, c("PCGDP","ODA")))
# Intercept GINI PCGDP ODA OECD
# 2.844195534 -0.021283719 0.119913000 0.004333494 5.950412450
A central question is of course: how computationally efficient is this estimator? Let us try to re-run it on the data generated for the benchmark in Part 1:
dat <- get_vars(data, c("iso3c","year","OECD","PCGDP","LIFEEX","GINI","ODA"))
get_vars(dat, 4:7) <- lapply(get_vars(dat, 4:7), log) # Taking logs of the data
dat$OECD <- as.numeric(dat$OECD) # Creating OECD dummy
dat <- pdata.frame(droplevels(na_omit(dat)), # Creating Panel data.frame, after removing missing values
index = c("iso3c", "year")) # and dropping unused factor levels
pdim(dat)
# Unbalanced Panel: n = 13200, T = 1-30, N = 91800
varying(dat)
# year OECD PCGDP LIFEEX GINI ODA
# TRUE FALSE TRUE TRUE TRUE TRUE
library(microbenchmark)
microbenchmark(HT_est = HT_est(y = dat$LIFEEX, # The estimator as before
X1 = get_vars(dat, "GINI"),
Z2 = get_vars(dat, "OECD"),
X2 = get_vars(dat, c("PCGDP","ODA"))),
HT_est_TFE = HT_est(y = dat$LIFEEX, # Also Projecting out Time-FE
X1 = get_vars(dat, "GINI"),
Z2 = get_vars(dat, "OECD"),
X2 = get_vars(dat, c("PCGDP","ODA")),
time.FE = TRUE))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# HT_est 9.908943 10.98641 18.40251 16.64038 17.57393 135.4984 100 a
# HT_est_TFE 37.908345 46.04992 50.87138 47.46408 50.06727 153.8423 100 b
At around 92,000 observations and 13,200 groups in an unbalanced panel, the computation involving 3 grouped centering tasks and 1 grouped averaging task, as well as 2 list-to-matrix conversions and an IV procedure, took about 10 to 17 milliseconds (lower quartile to median) with only individual effects, and about 46 to 47 milliseconds with individual and time-fixed effects (projected out iteratively). This should leave some room for running this on much larger data.
Hausman, J. & Taylor, W. (1981). “Panel Data and Unobservable Individual Effects”. Econometrica. 49: 1377–1398.
Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data”. Econometrica. 46 (1): 69–85.
Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.
Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.
[1] In fact, factors are projected out using lfe::demeanlist and no regression is run at all.