Overview
The TwoStepSDFM package provides a fast and
memory-efficient implementation of the two-step estimation,
cross-validation, and prediction routines for mixed-frequency sparse
dynamic factor models (SDFMs) outlined in Franjic and
Schweikert (2024). The
routines in this package also directly handle heterogeneous publication
lags in the data, which lead to missing observations at the end of
the observation period (so-called ragged edges).
We estimate the sparse loading matrix via sparse principal
component analysis (SPCA) (Zou, Hastie, and Tibshirani 2006).
The latent states are then filtered via a univariate representation of
the Kalman Filter and Smoother (KFS) as described in Koopman and Durbin (2000).
Our approach is a direct generalisation of the classical two-step
estimator for dense DFMs described in Giannone,
Reichlin, and Small (2008) and Doz, Giannone, and Reichlin (2011). Differing from the
classical model, however, we allow regularisation to shrink some of the
factor loadings towards or directly to zero. This is achieved by
elastic-net type size constraints introduced to the PCA’s least squares
problem according to Zou, Hastie, and Tibshirani
(2006). We also explicitly
account for cross-sectional correlation in the measurement error. For
more details see Franjic and Schweikert (2024).
Example Usage
Below, we will outline the usage of the main features of this
package. To this end, we start by simulating what we refer to as a
pseudo mixed-frequency data vintage. This vintage includes
a set of stationary mixed-frequency variables, a vector indicating the
variables’ frequencies, and a vector indicating the variables’ delays.
Afterwards, we go over a step-by-step example on how to validate the
hyper-parameters, estimate the model parameters, and predict using the
data vintage.
Simulate Data
We start our exposition by simulating data from an approximate DFM with
measurement equation \[
\boldsymbol{x}_t = \boldsymbol{\Lambda} \boldsymbol{f}_{t} +
\boldsymbol{\xi}_t,\quad \boldsymbol{\xi}_t \sim
\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\xi}),
\] and transition equation \[
\boldsymbol{f}_t = \sum_{p=1}^P\boldsymbol{\Phi}_p \boldsymbol{f}_{t-p}
+ \boldsymbol{\epsilon}_t,\quad \boldsymbol{\epsilon}_t \sim
\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_{f}),
\] for \(t = 1, \dots,T\). The
vector \(\boldsymbol{x}_t :=
(x_{n,t})_{n=1}^{N}\) represents the vector of all
monthly variables, including those which will be aggregated
into a quarterly representation (quarterfied) later. \(\boldsymbol{\Lambda} :=
(\lambda_{n,r})_{n,r=1}^{N,R}\) is the factor loading matrix that
distributes the dynamics of the \(R\)
factors, represented by \(\boldsymbol{f}_t :=
(f_{r,t})_{r=1}^R\), onto the data. \(\boldsymbol{\Phi}_p:=(\phi_{r,s,p})_{r,s=1}^{R,R}\)
is the \(p\)th transition VAR
coefficient matrix. The vectors \(\boldsymbol{\xi}_t := (\xi_{n,t})_{n=1}^N\)
and \(\boldsymbol{\epsilon}_t :=
(\epsilon_{r,t})_{r=1}^R\) represent the measurement and
transition error, respectively. \(\boldsymbol{\mu} := (\mu_{n})_{n=1}^N\) and
\(\boldsymbol{\Sigma}_{\xi}:=(\sigma_{\xi;n,m})_{n,m=1}^{N,N}\)
represent the measurement error mean and measurement error
variance-covariance matrix, respectively, and \(\boldsymbol{\Sigma}_{f}:=(\sigma_{f;r,s})_{r,s=1}^{R,R}\)
represents the transition error variance-covariance matrix.
As our goal is to simulate mixed-frequency data, we will quarterfy
five variables, i.e., 10% of all variables simulated below. We do so
using the geometric mean aggregation described in Mariano
and Murasawa (2003):
\[
\boldsymbol{x}^*_{t}:= \frac{1}{3}\left(\boldsymbol{x}_{t} +
2\boldsymbol{x}_{t-1} + 3\boldsymbol{x}_{t-2} + 2\boldsymbol{x}_{t-3} +
\boldsymbol{x}_{t-4}\right)
\] for \(t=3, 6, \dots,T -
\mathrm{mod}(T,3)\). Note that this implies that the simulated
data will always start in the first month of the corresponding quarter.
The simFM() function will therefore check the
starting_date when data gets quarterfied and turned into a
zoo object.
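The weighting scheme above can be sketched in a few lines of base R. This is an illustrative sketch only, not the internal implementation of simFM(); the helper name quarterfy_series is hypothetical.

```r
# Illustrative sketch: aggregate a monthly series x into quarterly
# observations using the weights from the formula above.
quarterfy_series <- function(x) {
  w <- c(1, 2, 3, 2, 1) / 3           # weights for x_t, x_{t-1}, ..., x_{t-4}
  idx <- seq(6, length(x), by = 3)    # third month of each complete quarter
  sapply(idx, function(t) sum(w * x[t:(t - 4)]))
}
quarterfy_series(1:12)
#> [1] 12 21 30
```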
Lastly, we also want to impose a ragged edge structure onto the data.
The ragged edges are set via the delay parameter. Note that
the delay parameter differs from the delay
parameter of the other estimation functions below. In the case of
simFM(), delay refers to the number of months
for monthly data and the number of quarters for quarterly data that will
be missing. For example, consider delay <- c(1, 1) and
assume the variable with index 1 will be quarterfied. In
that case, the variable with index 1 will be delayed by 1
quarter, i.e., it will be missing 3 observations at the end of the
panel. The variable with index 2 will be delayed by 1
month, i.e., it will be missing 1 observation at the end of the panel.
In all other instances, including the delay object returned
by simFM(), delay represents the number of
months since the most recent publication.
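The two delay conventions can be contrasted in a short sketch. The frequency coding below (12 for monthly, 4 for quarterly variables) is an assumption for illustration, as is the object missing_at_end; only the monthly coding of 12 appears explicitly later in this vignette.

```r
# Hypothetical sketch: trailing missing monthly observations implied by
# simFM()'s delay argument (quarters for quarterly, months for monthly data).
delay <- c(1, 1)
frequency <- c(4, 12)  # assumption: variable 1 quarterfied, variable 2 monthly
missing_at_end <- delay * ifelse(frequency == 4, 3, 1)
missing_at_end
#> [1] 3 1
```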
As a final remark, note that all RNG calls happen inside
C++. Therefore, set.seed() will not affect the
results of simFM(). Rather all random processes are
governed by the seed argument provided to the function. The
set.seed() call in the chunk below is only used for the
reproducibility of the generation of the DFM parameters.
seed <- 02102025
set.seed(seed)
no_of_obs <- 300
no_of_vars <- 50
no_of_factors <- 3
trans_error_var_cov <- diag(1, no_of_factors)
loading_matrix <- matrix(rnorm(no_of_vars * no_of_factors), no_of_vars, no_of_factors)
meas_error_mean <- rep(0, no_of_vars)
meas_error_var_cov <- diag(1, no_of_vars)
trans_var_coeff <- cbind(diag(0.5, no_of_factors), -diag(0.25, no_of_factors))
factor_lag_order <- 2
delay <- c(2, 1, floor(rexp(no_of_vars - 2, 1)))
quarterfy <- TRUE
quarterly_variable_ratio <- 0.1
corr <- TRUE
beta_param <- 2
burn_in <- 999
starting_date <- "1970-01-01"
rescale <- FALSE
check_stationarity <- TRUE
stationarity_check_threshold <- 1e-10
factor_model <- simFM(no_of_obs = no_of_obs, no_of_vars = no_of_vars,
no_of_factors = no_of_factors, loading_matrix = loading_matrix,
meas_error_mean = meas_error_mean, meas_error_var_cov = meas_error_var_cov,
trans_error_var_cov = trans_error_var_cov, trans_var_coeff = trans_var_coeff,
factor_lag_order = factor_lag_order, delay = delay, quarterfy = quarterfy,
quarterly_variable_ratio = quarterly_variable_ratio, corr = corr,
beta_param = beta_param, seed = seed, burn_in = burn_in,
starting_date = starting_date, rescale = rescale,
check_stationarity = check_stationarity,
stationarity_check_threshold = stationarity_check_threshold)
spca_plots <- plot(factor_model, axis_text_size = 10, legend_title_text_size = 5)
spca_plots$`Data Plots`
#> Warning: Removed 12 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 27 rows containing missing values or values outside the scale range
#> (`geom_line()`).

spca_plots$`Factor Time Series Plots`

spca_plots$`Loading Matrix Heatmap`

Estimate the Number of Factors
In the next step, we need to estimate the number of factors. We do so
by using the procedure described in Onatski (2009). Very loosely, this test
infers the number of factors by looking at the maximum slopes between
the eigenvalues of a special data Gram matrix. For more information on
the procedure, see the literature cited above. Note that the data
should be standardised prior to the analysis. Also, provide only
monthly observations to noOfFactors(), as results might
otherwise be skewed.
The Onatski (2009) procedure is a repeated
testing procedure with non-standard distributions of the slope
coefficients of the eigenvalues. This distribution depends on the number
of eigenvalues considered in the test. Therefore, a minimum and maximum
number of factors must be provided. These values should be close enough
such that the test has enough power, but also far enough apart such that
we do not reject too early due to hitting the maximum number of factors.
When working with this test, we found that using
max_no_factors - min_no_factors = 1 works quite well.
We also need to set a confidence threshold as a stopping rule of the
testing procedure. In general, for finite samples, the Onatski (2009) procedure works quite
well and in our experience better than information-based criteria,
especially when working with data simulated from an SDFM with a high
degree of sparsity. However, we also found that in some cases, the
testing procedure seems to underestimate the number of factors if the
threshold is set too strictly. Users should therefore set
confidence_threshold comparatively large if the cost of
underestimating the number of factors, i.e., including too few factors,
is large.
mtly_data <- scale(factor_model$data[, which(factor_model$frequency == 12)])
no_of_fact_test <- noOfFactors(mtly_data, min_no_factors = 1, max_no_factors = 7,
confidence_threshold = 0.2)
print(no_of_fact_test)
#> The estimated no. of factors is 3 with a p-value of 0.375 and a critical value of alpha = 0.2
Cross-Validate the Hyper-Parameters
With an estimate of the number of factors at hand, we proceed to
infer the hyper-parameters. Just as above, we standardise the data prior
to the analysis. Here, however, we provide the full mixed-frequency data
set, as the function handles the issues of the mixed-frequency structure
internally.
Our cross-validation routine combines a random hyper-parameter search
(Bergstra and Bengio 2012) with time-series cross-validation (Hyndman
and Athanasopoulos 2018). This means that the candidate values for the
hyper-parameters are drawn randomly. Therefore, we do not provide a grid
of parameter candidates but rather the bounds of the distribution
governing the hyper-parameter candidates. For the global ridge penalty,
candidates are drawn as \(\exp(u)\), where \(u\) is uniformly
distributed between the logarithms of min_ridge_penalty and
max_ridge_penalty, i.e., candidates are drawn log-uniformly.
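One way to implement such a draw can be sketched in base R, under the assumption that the two bounds are interpreted on the log scale. This is our reading of the sampling scheme, not the package internals, and draw_ridge_candidates is a hypothetical helper.

```r
# Hypothetical sketch: draw ridge-penalty candidates log-uniformly
# between a lower and an upper bound.
draw_ridge_candidates <- function(n, min_pen, max_pen) {
  exp(runif(n, min = log(min_pen), max = log(max_pen)))
}
set.seed(1)
candidates <- draw_ridge_candidates(100, 1e-6, 1e+5)
range(candidates)  # all candidates lie within [1e-6, 1e+5]
```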
For the factor-specific lasso penalties, it is necessary to first
choose a stopping criterion. The internal SPCA problem, which is used to
regularise the loading matrix, is solved via a LARS-EN type algorithm
found in Zou and Hastie (2020). This algorithm has the
advantage of providing three different stopping criteria:
"selected", "penalty", and
"steps". These criteria represent different conceptual
approaches, but for cross-validation purposes, the choice between them
is typically non-critical as they serve the same predictive goal. In the
example below, we use "selected". When using
"selected", the \(\ell_1\)
size constraints are expressed in terms of the number of non-zero
loadings of the corresponding loading matrix column. Candidates are
therefore drawn from a discrete uniform distribution over the natural
numbers. The lower and upper bounds of this distribution are given by
the first and second entry of min_max_penalty,
respectively.
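Under the "selected" criterion, the candidate draw can be pictured as follows. This is a hypothetical sketch of the sampling, not the package internals; the bounds of 5 and 45 simply mirror the example call below.

```r
# Hypothetical sketch: draw one "selected" candidate per factor from a
# discrete uniform over {min_max_penalty[1], ..., min_max_penalty[2]}.
min_max_penalty <- c(5, 45)
no_of_factors <- 3
set.seed(1)
sample(seq(min_max_penalty[1], min_max_penalty[2]), no_of_factors, replace = TRUE)
```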
no_of_mtly_vars <- sum(factor_model$frequency == 12)
all_data <- scale(factor_model$data)
cross_val_res <- crossVal(data = all_data, variable_of_interest = 1, fcast_horizon = 0,
delay = factor_model$delay, frequency = factor_model$frequency,
no_of_factors = no_of_fact_test$no_of_factors, seed = seed,
min_ridge_penalty = 1e-6, max_ridge_penalty = 1e+5, cv_repetitions = 5,
cv_size = 100, lasso_penalty_type = "selected", min_max_penalty = c(5, no_of_mtly_vars),
verbose = FALSE)
print(cross_val_res)
#> Cross-Validation Results
#> =========================================================================
#> Cross-Validation Error: 0.472523
#> Optimum Ridge Penalty : 1547.582
#>
#> # of non-zero Loadings per Factor
#> Factor 1 : 40
#> Factor 2 : 16
#> Factor 3 : 19
#> =========================================================================
#>
#> BIC Results
#> =========================================================================
#> BIC : 0.3747073
#> Optimum Ridge Penalty : 0
#>
#> # of non-zero Loadings per Factor
#> Factor 1 : 45
#> Factor 2 : 45
#> Factor 3 : 45
#> =========================================================================
cross_val_plot <- plot(cross_val_res)
cross_val_plot$`CV Results`
#> Warning in scale_x_continuous(trans = "log10"): log-10 transformation
#> introduced infinite values.

Estimate the Parameters of the SDFM
Before turning to the main prediction routine, we want to focus on
the estimation of the model parameters first. While written with
prediction as main focus, our package also provides the means to
estimate the model parameters. For an SDFM, this is handled by the
twoStepSDFM() function below. Its usage is straightforward.
Note, however, that the function only accepts monthly data.
mtly_data <- scale(factor_model$data[, which(factor_model$frequency == 12)])
mtly_delay <- factor_model$delay[which(factor_model$frequency == 12)]
inferred_no_of_selected <- cross_val_res$CV$`Min. CV`[3:(3 + no_of_fact_test$no_of_factors - 1)]
inferred_ridge_penalty <- cross_val_res$CV$`Min. CV`[1]
sdfm_fit <- twoStepSDFM(data = mtly_data, delay = mtly_delay, selected = inferred_no_of_selected,
no_of_factors = no_of_fact_test$no_of_factors, ridge_penalty = inferred_ridge_penalty)
print(sdfm_fit)
#> Simulated Dynamic Factor Model
#> =========================================================================
#> No. of Observations : 300
#> No. of Variables : 45
#> No. of Factors : 3
#> Factor Lag Order : 2
#> No. of zero elements in the loading matrix : 60
#> =========================================================================
#> Head of the factors :
#> Factor 1 Factor 2 Factor 3
#> Jan 1970 -10.2876529 3.302636 3.737877
#> Feb 1970 -6.5705960 -1.752246 3.513527
#> Mar 1970 -0.6138966 -6.679130 4.198856
#> Apr 1970 -1.3973223 -6.893427 8.838713
#> May 1970 -4.8644665 -5.378115 7.070411
#> Tail of the factors :
#> Factor 1 Factor 2 Factor 3
#> Aug 1994 0.6358366 1.667165 -1.8261503
#> Sep 1994 -4.1323004 -6.608666 -1.4170338
#> Oct 1994 -0.9269175 -2.136796 -1.1725775
#> Nov 1994 6.3287872 -1.251908 0.5935398
#> Dec 1994 -1.6773825 -1.270249 1.4509413
#> Head of the loading matrix :
#> [,1] [,2] [,3]
#> [1,] 0.26945380 0.0000000 -0.2652319
#> [2,] 0.11117513 0.0000000 0.0000000
#> [3,] -0.05484467 -0.9727244 0.0000000
#> [4,] 0.24172759 0.2761174 0.0000000
#> [5,] 0.00000000 -0.1632593 -0.2216994
#> Tail of the loading matrix :
#> [,1] [,2] [,3]
#> [41,] 0.216726824 0.00000000 0.0000000
#> [42,] -0.297284626 0.00000000 0.2309382
#> [43,] 0.115015121 -0.62532497 0.0000000
#> [44,] 0.003274602 -0.65080152 0.3161383
#> [45,] 0.000000000 0.06522026 0.0000000
#> =========================================================================
sdfm_fit_plot <- plot(sdfm_fit)
sdfm_fit_plot$`Factor Time Series Plots`

sdfm_fit_plot$`Loading Matrix Heatmap`

Predict Target Variables
Finally, we turn to the nowcasting routine. Fundamentally, this
routine is an unrestricted factor-augmented MIDAS model as described in
Marcellino and Schumacher (2010). For each target, we
use every other target, every additional quarterly predictor, and every
quarterfied factor, and predict the target via individual (AR)DL
models. The exact autoregressive and predictor lag orders are inferred
internally via the BIC for each equation. The final forecast is then
given by the unweighted mean of all individual (AR)DL model
forecasts.
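The per-equation idea can be illustrated with a small base-R sketch. It is purely illustrative: the simulated y and z and the maximum lag order of 4 are our own choices, not the package's.

```r
# Illustrative (AR)DL equation with BIC-based lag selection for one
# target y and one predictor z (both quarterly).
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 80))  # simulated target
z <- 0.8 * y + rnorm(80)                            # simulated predictor
best_fit <- NULL
for (p in 1:4) {
  dat <- embed(cbind(y, z), p + 1)  # columns: y_t, z_t, y_{t-1}, z_{t-1}, ...
  fit <- lm(dat[, 1] ~ dat[, -1])
  if (is.null(best_fit) || BIC(fit) < BIC(best_fit)) best_fit <- fit
}
# best_fit now holds the BIC-optimal (AR)DL equation; the final forecast
# is then the unweighted mean over all such equation forecasts.
```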
After cross-validation, we are set to directly predict the target
variables. Here, just as with crossVal(), we are also able
to provide the complete mixed-frequency data set. Note that, in contrast
to the cross-validation function, it is possible to predict multiple
targets across multiple horizons at once. One should keep in mind,
however, that the optimal hyper-parameters are retrieved for a specific
target and horizon (in our case the nowcast of the first quarterly
variable). Therefore, in a serious prediction exercise, one should
always tune the hyper-parameters for each target and horizon
specifically.
To showcase the abilities of this function, however, we still predict
multiple targets and horizons. The first target is lagged by two
quarters, while the second is lagged by only one month. The maximum
forecasting horizon is set to two. This means that for both targets we
will forecast two steps ahead. For target one we will additionally
compute a backcast and a nowcast; for target two only a nowcast will be
computed. In general, max_fcast_horizon governs the number
of step-ahead forecasts computed, while nowcasts and backcasts are
computed automatically depending on the delay of the targets.
all_data <- scale(factor_model$data)
inferred_no_of_selected <- cross_val_res$CV$`Min. CV`[3:(3 + no_of_fact_test$no_of_factors - 1)]
inferred_ridge_penalty <- cross_val_res$CV$`Min. CV`[1]
nc_results <- nowcast(data = all_data, variables_of_interest = c(1, 2), max_fcast_horizon = 2,
delay = factor_model$delay, ridge_penalty = inferred_ridge_penalty,
selected = inferred_no_of_selected, frequency = factor_model$frequency,
no_of_factors = no_of_fact_test$no_of_factors)
nc_plot <- plot(nc_results)
nc_plot$`Single Pred. Fcast Density Plots Series 1`

nc_plot$`Single Pred. Fcast Density Plots Series 2`

References
Bergstra, James, and Yoshua Bengio. 2012. “Random Search for
Hyper-Parameter Optimization.” Journal of Machine Learning
Research 13 (2).
Doz, Catherine, Domenico Giannone, and Lucrezia Reichlin. 2011.
“A two-step estimator for large approximate
dynamic factor models based on Kalman filtering.”
Journal of Econometrics 164 (1): 188–205.
https://doi.org/10.1016/j.jeconom.2011.02.012.
Franjic, Domenic, and Karsten Schweikert. 2024. “Nowcasting
Macroeconomic Variables with a Sparse Mixed Frequency Dynamic Factor
Model.” Available at SSRN 4733872.
Giannone, Domenico, Lucrezia Reichlin, and David Small. 2008.
“Nowcasting: The Real-Time Informational Content of Macroeconomic
Data.” Journal of Monetary Economics 55 (4): 665–76.
https://doi.org/10.1016/j.jmoneco.2008.05.010.
Hyndman, Rob J., and George Athanasopoulos. 2018. Forecasting:
Principles and Practice. 2nd ed. Melbourne, Australia: OTexts.
Koopman, Siem Jan, and James Durbin. 2000. “Fast Filtering and
Smoothing for Multivariate State Space Models.” Journal of
Time Series Analysis 21 (3): 281–96.
Marcellino, Massimiliano, and Christian Schumacher. 2010. “Factor MIDAS for nowcasting and forecasting with
ragged-edge data: A model comparison for German GDP.”
Oxford Bulletin of Economics and Statistics 72 (4): 518–50.
Mariano, Roberto S., and Yasutomo Murasawa. 2003.
“A New
Coincident Index of Business Cycles Based on Monthly and Quarterly
Series.” Journal of Applied Econometrics 18 (4): 427–43.
https://doi.org/10.1002/jae.695.
Onatski, Alexei. 2009. “Testing Hypotheses about the Number of
Factors in Large Factor Models.” Econometrica 77 (5):
1447–79.
Zou, Hui, and Trevor Hastie. 2020.
Elasticnet: Elastic-Net for
Sparse Estimation and Sparse PCA.
https://CRAN.R-project.org/package=elasticnet.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2006. “Sparse
Principal Component Analysis.” Journal of Computational and
Graphical Statistics 15 (2): 265–86.