Overview. Data transformations are a useful companion for parametric regression models. A well-chosen or learned transformation can greatly enhance the applicability of a given model, especially for data with irregular marginal features (e.g., multimodality, skewness) or various data domains (e.g., real-valued, positive, or compactly-supported data).
Given paired data \((x_i,y_i)\) for \(i=1,\ldots,n\), SeBR implements efficient and fully Bayesian inference for semiparametric regression models that incorporate (1) an unknown data transformation
\[ g(y_i) = z_i \]
and (2) a useful parametric regression model
\[ z_i \stackrel{indep}{\sim} P_{Z \mid \theta, X = x_i} \]
with unknown parameters \(\theta\).
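As a minimal sketch of this two-part model (not code from the package), the following base R snippet simulates data under an assumed transformation \(g = \log\) and a Gaussian linear model for \(z\). The sample size, coefficients, error scale, and choice of \(g\) are all illustrative assumptions.

```r
set.seed(123)                      # for reproducibility

n <- 200                           # illustrative sample size
x <- cbind(1, rnorm(n))            # design matrix: intercept plus one covariate
theta <- c(0.5, 1.0)               # hypothetical regression coefficients
sigma_eps <- 0.5                   # hypothetical error standard deviation

# (2) parametric regression on the latent scale: z_i ~ N(x_i' theta, sigma_eps^2)
z <- as.numeric(x %*% theta) + rnorm(n, sd = sigma_eps)

# (1) transformation g(y_i) = z_i; with g = log (an assumption), y_i = exp(z_i)
g <- log
g_inv <- exp
y <- g_inv(z)

# y is positive and right-skewed, while g(y) recovers the Gaussian-scale z
range(y)
all.equal(g(y), z)
```

In practice \(g\) is unknown and is inferred jointly with \(\theta\); fixing it here only makes the roles of \(y\), \(z\), and \(g\) concrete.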
Examples. We focus on the following important special cases of \(P_{Z \mid \theta, X}\):
The linear model:
\[ z_i = x_i'\theta + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2) \]
Here \(P_{Z \mid \theta, X=x} = N(x'\theta, \sigma_\epsilon^2)\), and the transformation \(g\) broadens the applicability of this useful class of models, including for positive or compactly-supported data.
The quantile regression model, which replaces the Gaussian errors with an asymmetric Laplace distribution (ALD):
\[ z_i = x_i'\theta + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} ALD(\tau) \]
to target the \(\tau\)th quantile of \(z\) at \(x\), or equivalently (for a monotone increasing \(g\)), the \(\tau\)th quantile of \(y\) at \(x\), which is \(g^{-1}(x'\theta)\). The ALD is often a poor model for real data, especially when \(\tau\) is near zero or one. The transformation \(g\) offers a pathway to significantly improve the model adequacy, while still targeting the desired quantile of the data.
The Gaussian process (GP) model:
\[ z_i = f_\theta(x_i) + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2) \]
where \(f_\theta\) is a GP and \(\theta\) parameterizes the mean and covariance functions. Although GPs offer substantial flexibility for the regression function \(f_\theta\), this model may be inadequate when \(y\) has irregular marginal features or a restricted domain (e.g., positive or compact).
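The quantile regression case relies on the fact that quantiles commute with monotone increasing maps, so a quantile targeted on the \(z\) scale carries over to the \(y\) scale through \(g^{-1}\). The base R check below is a sketch of that fact only; it assumes \(g = \log\) and uses simulated Gaussian draws in place of any fitted model.

```r
set.seed(1)
z <- rnorm(500)        # latent-scale draws (illustrative)
g_inv <- exp           # inverse of an assumed transformation g = log
tau <- 0.9

# type = 1 returns an order statistic (inverse empirical CDF), so a monotone
# increasing map commutes with the sample quantile
q_z <- unname(quantile(z, probs = tau, type = 1))
q_y <- unname(quantile(g_inv(z), probs = tau, type = 1))

all.equal(q_y, g_inv(q_z))
```

The same identity does not hold for the mean, which is one reason the transformation pairs naturally with quantile targets.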
Challenges: The goal is to provide fully Bayesian posterior inference for the unknowns \((g, \theta)\) and posterior predictive inference for future/unobserved data \(\tilde y(x)\). We prefer a model and algorithm that offer both (i) flexible modeling of \(g\) and (ii) efficient posterior and predictive computations.
Innovations: Our approach (https://arxiv.org/abs/2306.05498) specifies a nonparametric model for \(g\), yet also provides Monte Carlo (not MCMC) sampling for the posterior and predictive distributions. As a result, we control the approximation accuracy via the number of simulations, but do not require the lengthy runs, burn-in periods, convergence diagnostics, or inefficiency factors that accompany MCMC. The Monte Carlo sampling is typically quite fast.
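To illustrate what controlling the approximation accuracy via the number of simulations means, here is a small base R sketch. It uses independent normal draws as a stand-in for independent posterior draws (an assumption for illustration; it is not the sampler used by the package) and reports the Monte Carlo standard error of the sample mean for two simulation sizes.

```r
set.seed(42)

# Monte Carlo standard error of a posterior-mean estimate from S independent draws
mc_se <- function(S) {
  draws <- rnorm(S, mean = 2, sd = 1)   # stand-in for independent posterior draws
  sd(draws) / sqrt(S)
}

se_small <- mc_se(100)
se_large <- mc_se(10000)
c(S_100 = se_small, S_10000 = se_large)  # error shrinks at roughly 1/sqrt(S)
```

Because the draws are independent, no burn-in or autocorrelation adjustment enters this calculation, unlike with MCMC output.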
SeBR
The package SeBR is installed and loaded as follows:
# install.packages("devtools")
# devtools::install_github("drkowal/SeBR")
library(SeBR)
The main functions in SeBR are:
- sblm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian linear model;
- sbsm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian spline model, which replaces the linear model with a spline for nonlinear modeling of \(x \in \mathbb{R}\);
- sbqr(): blocked Gibbs sampling for posterior and predictive inference with the semiparametric Bayesian quantile regression; and
- sbgp(): Monte Carlo sampling for predictive inference with the semiparametric Bayesian Gaussian process model.
Each function returns a point estimate of \(\theta\) (coefficients), point predictions at some specified testing points (fitted.values), posterior samples of the transformation \(g\) (post_g), and posterior predictive samples of \(\tilde y(x)\) at the testing points (post_ypred), as well as other function-specific quantities (e.g., posterior draws of \(\theta\), post_theta). The calls coef() and fitted() extract the point estimates and point predictions, respectively.
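A hedged usage sketch for sblm() follows. The simulated data are illustrative, and the argument names y, X, and X_test are assumptions that should be checked against the package documentation (?sblm); only the returned components named above are taken from this page. The call is guarded so the snippet still runs when SeBR is not installed.

```r
set.seed(123)
n <- 100
X <- cbind(rnorm(n), rnorm(n))     # illustrative covariates
# positive, right-skewed response generated on a log scale (an assumption)
y <- exp(0.5 + X[, 1] - 0.5 * X[, 2] + rnorm(n, sd = 0.5))

if (requireNamespace("SeBR", quietly = TRUE)) {
  # Argument names (y, X, X_test) are assumptions; confirm with ?SeBR::sblm
  fit <- SeBR::sblm(y = y, X = X, X_test = X)

  print(coef(fit))             # point estimate of theta
  print(head(fitted(fit)))     # point predictions at the testing points
  print(dim(fit$post_ypred))   # posterior predictive draws of y-tilde(x)
}
```

Posterior predictive intervals can then be summarized from the draws in post_ypred, for example with column-wise quantiles, assuming each column corresponds to one testing point.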
Note: The package also includes Box-Cox variants of these functions, i.e., restricting \(g\) to the (signed) Box-Cox parametric family \(g(t; \lambda) = \{\mbox{sign}(t) \vert t \vert^\lambda - 1\}/\lambda\) with known or unknown \(\lambda\). The parametric transformation is less flexible, especially for irregular marginals or restricted domains, and requires MCMC sampling. These functions (e.g., blm_bc()) are primarily for benchmarking.
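For reference, the signed Box-Cox formula can be written directly in base R. This is a sketch of the formula as stated, not the package's internal implementation; the function names g_bc and g_inv_bc are hypothetical, and only \(\lambda \neq 0\) is handled.

```r
# Signed Box-Cox transformation from the formula above (lambda != 0)
g_bc <- function(t, lambda) {
  stopifnot(lambda != 0)
  (sign(t) * abs(t)^lambda - 1) / lambda
}

# Its inverse: solve z = g(t; lambda) for t
g_inv_bc <- function(z, lambda) {
  stopifnot(lambda != 0)
  s <- lambda * z + 1
  sign(s) * abs(s)^(1 / lambda)
}

t_vals <- c(-2, 0.5, 4)
z_vals <- g_bc(t_vals, lambda = 0.5)
z_vals
all.equal(g_inv_bc(z_vals, lambda = 0.5), t_vals)  # round trip recovers t
```

With a single parameter \(\lambda\), this family can only bend the data in one way, which is the sense in which it is less flexible than a nonparametric \(g\).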
Detailed documentation and examples are available at https://drkowal.github.io/SeBR/.