nonprobsvy: an R package for modern statistical inference methods based on non-probability samples
The goal of this package is to give R users access to modern methods for inference from non-probability samples when auxiliary information from the population or a probability sample is available.
The package allows for:
- variable selection (via the ncvreg, Rcpp and RcppArmadillo packages),
- integration with the survey and srvyr packages when a probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024),
- different models for selection (logit, probit and cloglog) and outcome (gaussian, binomial and poisson) variables.

Details on the use of the package can be found:
You can install the latest version of the nonprobsvy package from the main branch on GitHub with:
remotes::install_github("ncn-foreigners/nonprobsvy")
or install the stable version from CRAN:
install.packages("nonprobsvy")
or the development version from the dev branch:
remotes::install_github("ncn-foreigners/nonprobsvy@dev")
Consider the following setting where two samples are available: a non-probability sample (denoted as \(S_A\)) and a probability sample (denoted as \(S_B\)). A set of auxiliary variables (denoted as \(\boldsymbol{X}\)) is available for both sources, while \(Y\) is observed only in the non-probability sample and \(\boldsymbol{d}\) (or \(\boldsymbol{w}\)) is available only in the probability sample.
Sample | Unit | Auxiliary variables \(\boldsymbol{X}\) | Target variable \(Y\) | Design (\(\boldsymbol{d}\)) or calibrated (\(\boldsymbol{w}\)) weights |
---|---|---|---|---|
\(S_A\) (non-probability) | 1 | \(\checkmark\) | \(\checkmark\) | ? |
 | … | \(\checkmark\) | \(\checkmark\) | ? |
 | \(n_A\) | \(\checkmark\) | \(\checkmark\) | ? |
\(S_B\) (probability) | \(n_A+1\) | \(\checkmark\) | ? | \(\checkmark\) |
 | … | \(\checkmark\) | ? | \(\checkmark\) |
 | \(n_A+n_B\) | \(\checkmark\) | ? | \(\checkmark\) |
Suppose \(Y\) is the target variable, \(\boldsymbol{X}\) is a matrix of auxiliary variables, and \(R\) is the inclusion indicator. If we are interested in estimating the mean \(\bar{\tau}_Y\) or the sum \(\tau_Y\) of the target variable given the observed data set \((y_k, \boldsymbol{x}_k, R_k)\), we can approach this problem under the following scenarios:
Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Doubly robust estimator | |
Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Mass imputation based on nearest neighbour imputation | |
Mass imputation based on predictive mean matching | |
Mass imputation based on regression imputation with variable selection (LASSO) | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Inverse probability weighting with calibration constraint with variable selection (SCAD) | |
Doubly robust estimator | |
Doubly robust estimator with variable selection (SCAD) and bias minimization | |
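For orientation, the three estimator families named in the tables are often written as follows when both samples are available. These are commonly used textbook forms, not necessarily the exact expressions implemented in nonprobsvy. The symbols \(\hat{\pi}_k\) (estimated propensity of unit \(k\) to be in \(S_A\)), \(\hat{y}_k\) (prediction from an outcome model fitted on \(S_A\)), \(\hat{N}_A\) and \(\hat{N}_B\) are notation assumed here for illustration.

\[
\hat{\bar{\tau}}_{Y}^{\mathrm{MI}} = \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k,
\qquad
\hat{\bar{\tau}}_{Y}^{\mathrm{IPW}} = \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k}{\hat{\pi}_k},
\]

\[
\hat{\bar{\tau}}_{Y}^{\mathrm{DR}} = \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k - \hat{y}_k}{\hat{\pi}_k} + \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k,
\qquad
\hat{N}_A = \sum_{k \in S_A} \frac{1}{\hat{\pi}_k},
\quad
\hat{N}_B = \sum_{k \in S_B} d_k.
\]

The doubly robust form combines the other two: it remains consistent if either the propensity model or the outcome model is correctly specified.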
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000 (population size)
n <- 1000 # size of the probability sample
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
# inclusion probabilities for the non-probability (big data) samples
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
# simple random sample of size n and its design weight
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs, pop_size = N)
base_w_bd <- N/sum(population$flag_bd1)
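To see why a correction is needed at all, the following base R sketch compares the uncorrected mean of y1 in the non-probability sample with the population mean and the mean in the simple random sample. It re-simulates the same model for y1 with a smaller population (N = 1e5, an assumption made here only for speed) and a different seed, so the numbers will not match the output shown elsewhere in this README.

```r
# Sketch: bias of the naive (uncorrected) mean in the non-probability sample.
# Smaller population than in the text (assumption, for speed); same model for y1.
set.seed(2024)
N <- 1e5
n <- 1000
x1 <- rnorm(N, mean = 1, sd = 1)
x2 <- rexp(N, rate = 1)
y1 <- 1 + x1 + x2 + rnorm(N)

# Non-probability ("big data") sample: inclusion probability increases with x2.
flag_bd1 <- rbinom(N, size = 1, prob = plogis(x2))
# Simple random sample without replacement of size n.
flag_srs <- as.numeric(seq_len(N) %in% sample.int(N, size = n))

pop_mean   <- mean(y1)
naive_mean <- mean(y1[flag_bd1 == 1])
srs_mean   <- mean(y1[flag_srs == 1])

round(c(population = pop_mean, naive = naive_mean, srs = srs_mean), 3)
```

Because units with larger x2 are more likely to be selected and y1 increases with x2, the naive mean overstates the population mean, while the much smaller random sample does not share this systematic error.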
Declare svydesign object with the survey package
sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs,
                         data = subset(population, flag_srs == 1),
                         fpc = ~ pop_size)
sample_prob
#> Independent Sampling design
#> svydesign(ids = ~1, weights = ~base_w_srs, data = subset(population,
#>     flag_srs == 1), fpc = ~pop_size)
or with the srvyr package
sample_prob <- srvyr::as_survey_design(.data = subset(population, flag_srs == 1),
                                       weights = base_w_srs)
sample_prob
#> Independent Sampling design (with replacement)
#> Called via srvyr
#> Sampling variables:
#> Data variables:
#>   - x1 (dbl), x2 (dbl), y1 (dbl), y2 (dbl), p1 (dbl), p2 (dbl), base_w_srs (dbl), flag_bd1 (int), flag_srs (dbl)
Estimate the population means of y1 and y2 with the doubly robust estimator using IPW with calibration constraints; we specify that auxiliary variables should not be combined for the inference.
result_dr <- nonprob(
  selection = ~ x2,
  outcome = y1 + y2 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)
Results
result_dr
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9500 (se=0.0414, ci=(2.8689, 3.0312))
#> - variable y2: 1.5762 (se=0.0498, ci=(1.4786, 1.6739))
Mass imputation estimator
result_mi <- nonprob(
  outcome = y1 + y2 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)
Results
result_mi
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9498 (se=0.0420, ci=(2.8675, 3.0321))
#> - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6398))
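The idea behind regression-based mass imputation can be sketched in a few lines of base R: fit the outcome model on the non-probability sample, predict the outcome for every unit of the probability sample, and take the design-weighted mean of the predictions. This is an illustration under stated assumptions (smaller population, different seed, hypothetical object names such as S_A, S_B and d_B), not the code nonprobsvy runs internally, and it does not compute standard errors.

```r
# Sketch: mass imputation by hand with a linear outcome model.
set.seed(2024)
N <- 1e5                      # smaller population than in the text (assumption)
n <- 1000
x1 <- rnorm(N, mean = 1, sd = 1)
x2 <- rexp(N, rate = 1)
y1 <- 1 + x1 + x2 + rnorm(N)
pop <- data.frame(x1, x2, y1)

in_A <- rbinom(N, size = 1, prob = plogis(x2)) == 1  # non-probability sample
in_B <- seq_len(N) %in% sample.int(N, size = n)      # probability sample (SRS)

S_A <- pop[in_A, ]                  # y1 observed, no weights
S_B <- pop[in_B, c("x1", "x2")]     # y1 treated as unobserved
d_B <- rep(N / n, nrow(S_B))        # design weights of the SRS

# Step 1: outcome model fitted on the non-probability sample.
fit <- lm(y1 ~ x1 + x2, data = S_A)
# Step 2: impute y1 for every unit of the probability sample.
y_hat_B <- predict(fit, newdata = S_B)
# Step 3: design-weighted mean of the imputed values.
mi_mean <- sum(d_B * y_hat_B) / sum(d_B)

naive_mean <- mean(S_A$y1)
round(c(population = mean(y1), naive = naive_mean, mass_imputation = mi_mean), 3)
```

The approach relies on the outcome model holding in both samples; here selection depends only on x2, which is included in the model.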
Inverse probability weighting estimator
result_ipw <- nonprob(
  selection = ~ x2,
  target = ~y1+y2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob)
Results
result_ipw
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9981 (se=0.0137, ci=(2.9713, 3.0249))
#> - variable y2: 1.5906 (se=0.0137, ci=(1.5639, 1.6174))
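A rough base R sketch of inverse probability weighting with a logit selection model is given below. Since membership in the non-probability sample is only known for the sampled units, the propensity parameters are estimated here by maximizing a pseudo-log-likelihood in which the population sum is replaced by a design-weighted sum over the probability sample. This is one common formulation and an assumption of this sketch; it is not claimed to be what nonprobsvy does internally. The object names (X_A, X_B, d_B, softplus) are hypothetical, the population is smaller than in the text, and no standard errors are computed.

```r
# Sketch: IPW by hand with a logit propensity model for selection ~ x2.
set.seed(2024)
N <- 1e5                      # smaller population than in the text (assumption)
n <- 1000
x1 <- rnorm(N, mean = 1, sd = 1)
x2 <- rexp(N, rate = 1)
y1 <- 1 + x1 + x2 + rnorm(N)

in_A <- rbinom(N, size = 1, prob = plogis(x2)) == 1  # non-probability sample
in_B <- seq_len(N) %in% sample.int(N, size = n)      # probability sample (SRS)
d_B <- rep(N / n, sum(in_B))                         # design weights

X_A <- cbind(1, x2[in_A])     # design matrix (intercept, x2) in S_A
X_B <- cbind(1, x2[in_B])     # design matrix (intercept, x2) in S_B

# Numerically stable log(1 + exp(z)).
softplus <- function(z) ifelse(z > 30, z, log1p(exp(z)))

# Negative pseudo-log-likelihood, scaled by N:
# sum over S_A of x'theta minus weighted sum over S_B of log(1 + exp(x'theta)).
neg_pseudo_loglik <- function(theta) {
  -(sum(X_A %*% theta) - sum(d_B * softplus(drop(X_B %*% theta)))) / N
}

fit <- optim(par = c(0, 0), fn = neg_pseudo_loglik, method = "BFGS")

# Estimated propensities and the ratio-type (Hajek) IPW mean.
pi_A <- plogis(drop(X_A %*% fit$par))
ipw_mean <- sum(y1[in_A] / pi_A) / sum(1 / pi_A)

naive_mean <- mean(y1[in_A])
round(c(population = mean(y1), naive = naive_mean, ipw = ipw_mean), 3)
```

Unlike mass imputation, this estimator does not need a model for y1; it only requires that the selection model (here a logit in x2) is adequate.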
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.