The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
nonprobsvy
: an R package for modern statistical inference methods based on non-probability samplesThe goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:
The package allows for:
ncvreg
, Rcpp
, RcppArmadillo
packages),survey
package when probability sample is available Lumley (2023),logit
, probit
and cloglog
) and outcome (gaussian
, binomial
and poisson
) variables.Details on use of the package be found:
You can install the recent version of nonprobsvy
package from Github with:
or development version from the dev
branch
Consider the following setting where two samples are available: non-probability (denoted as \(S_A\) ) and probability (denoted as \(S_B\)) where set of auxiliary variables (denoted as \(\boldsymbol{X}\)) is available for both sources while \(Y\) and \(\boldsymbol{d}\) (or \(\boldsymbol{w}\)) is present only in probability sample.
Sample | Auxiliary variables \(\boldsymbol{X}\) | Target variable \(Y\) | Design (\(\boldsymbol{d}\)) or calibrated (\(\boldsymbol{w}\)) weights | |
---|---|---|---|---|
\(S_A\) (non-probability) | 1 | \(\checkmark\) | \(\checkmark\) | ? |
… | \(\checkmark\) | \(\checkmark\) | ? | |
\(n_A\) | \(\checkmark\) | \(\checkmark\) | ? | |
\(S_B\) (probability) | \(n_A+1\) | \(\checkmark\) | ? | \(\checkmark\) |
… | \(\checkmark\) | ? | \(\checkmark\) | |
\(n_A+n_B\) | \(\checkmark\) | ? | \(\checkmark\) |
Suppose \(Y\) is the target variable, \(\boldsymbol{X}\) is a matrix of auxiliary variables, \(R\) is the inclusion indicator. Then, if we are interested in estimating the mean \(\bar{\tau}_Y\) or the sum \(\tau_Y\) of the of the target variable given the observed data set \((y_k, \boldsymbol{x}_k, R_k)\), we can approach this problem with the possible scenarios:
Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Doubly robust estimator | |
Estimator | Example code |
---|---|
Mass imputation based on regression imputation | |
Mass imputation based on nearest neighbour imputation | |
Mass imputation based on predictive mean matching | |
Mass imputation based on regression imputation with variable selection (LASSO) | |
Inverse probability weighting | |
Inverse probability weighting with calibration constraint | |
Inverse probability weighting with calibration constraint with variable selection (SCAD) | |
Doubly robust estimator | |
Doubly robust estimator with variable selection (SCAD) and bias minimization | |
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs)
base_w_bd <- N/sum(population$flag_bd1)
Declare svydesign
object with survey
package
Estimate population mean of y1
based on doubly robust estimator using IPW with calibration constraints.
result_dr <- nonprob(
selection = ~ x2,
outcome = y1 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)
Results
summary(result_dr)
#>
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), selection = ~x2,
#> outcome = y1 ~ x1 + x2, svydesign = sample_prob)
#>
#> -------------------------
#> Estimated population mean: 2.95 with overall std.err of: 0.04195
#> And std.err for nonprobability and probability samples being respectively:
#> 0.000783 and 0.04195
#>
#> 95% Confidence inverval for popualtion mean:
#> lower_bound upper_bound
#> y1 2.867789 3.03224
#>
#>
#> Based on: Doubly-Robust method
#> For a population of estimate size: 1025063
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#>
#> Regression coefficients:
#> -----------------------
#> For glm regression on outcome variable:
#> Estimate Std. Error z value P(>|z|)
#> (Intercept) 0.996282 0.002139 465.8 <2e-16 ***
#> x1 1.001931 0.001200 835.3 <2e-16 ***
#> x2 0.999125 0.001098 910.2 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> -----------------------
#> For glm regression on selection variable:
#> Estimate Std. Error z value P(>|z|)
#> (Intercept) -0.498997 0.003702 -134.8 <2e-16 ***
#> x2 1.885629 0.005303 355.6 <2e-16 ***
#> -------------------------
#>
#> Weights:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 1.071 1.313 1.479 1.798 2.647
#> -------------------------
#>
#> Residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.99999 0.06603 0.23778 0.26046 0.44358 0.62222
#>
#> AIC: 1010622
#> BIC: 1010645
#> Log-Likelihood: -505309 on 694009 Degrees of freedom
Mass imputation estimator
result_mi <- nonprob(
outcome = y1 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)
Results
summary(result_mi)
#>
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 ~
#> x1 + x2, svydesign = sample_prob)
#>
#> -------------------------
#> Estimated population mean: 2.95 with overall std.err of: 0.04203
#> And std.err for nonprobability and probability samples being respectively:
#> 0.001227 and 0.04201
#>
#> 95% Confidence inverval for popualtion mean:
#> lower_bound upper_bound
#> y1 2.867433 3.032186
#>
#>
#> Based on: Mass Imputation method
#> For a population of estimate size: 1e+06
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#>
#> Regression coefficients:
#> -----------------------
#> For glm regression on outcome variable:
#> Estimate Std. Error z value P(>|z|)
#> (Intercept) 0.996282 0.002139 465.8 <2e-16 ***
#> x1 1.001931 0.001200 835.3 <2e-16 ***
#> x2 0.999125 0.001098 910.2 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> -------------------------
Inverse probability weighting estimator
result_ipw <- nonprob(
selection = ~ x2,
target = ~y1,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob)
Results
summary(result_ipw)
#>
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), selection = ~x2,
#> target = ~y1, svydesign = sample_prob)
#>
#> -------------------------
#> Estimated population mean: 2.925 with overall std.err of: 0.05
#> And std.err for nonprobability and probability samples being respectively:
#> 0.001586 and 0.04997
#>
#> 95% Confidence inverval for popualtion mean:
#> lower_bound upper_bound
#> y1 2.82679 3.022776
#>
#>
#> Based on: Inverse probability weighted method
#> For a population of estimate size: 1025063
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#>
#> Regression coefficients:
#> -----------------------
#> For glm regression on selection variable:
#> Estimate Std. Error z value P(>|z|)
#> (Intercept) -0.498997 0.003702 -134.8 <2e-16 ***
#> x2 1.885629 0.005303 355.6 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> -------------------------
#>
#> Weights:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 1.071 1.313 1.479 1.798 2.647
#> -------------------------
#>
#> Residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -0.99999 0.06603 0.23778 0.26046 0.44358 0.62222
#>
#> AIC: 1010622
#> BIC: 1010645
#> Log-Likelihood: -505309 on 694009 Degrees of freedom
Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.
Chen, Yilin, Pengfei Li, and Changbao Wu. 2020. “Doubly Robust Inference With Nonprobability Survey Samples.” Journal of the American Statistical Association 115 (532): 2011–21. https://doi.org/10.1080/01621459.2019.1677241.
Kim, Jae Kwang, Seho Park, Yilin Chen, and Changbao Wu. 2021. “Combining Non-Probability and Probability Survey Samples Through Mass Imputation.” Journal of the Royal Statistical Society Series A: Statistics in Society 184 (3): 941–63. https://doi.org/10.1111/rssa.12696.
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.
———. 2023. “Survey: Analysis of Complex Survey Samples.”
Wu, Changbao. 2023. “Statistical Inference with Non-Probability Survey Samples.” Survey Methodology 48 (2): 283–311. https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm.
Yang, Shu, Jae Kwang Kim, and Youngdeok Hwang. 2021. “Integration of Data from Probability Surveys and Big Found Data for Finite Population Inference Using Mass Imputation.” Survey Methodology 47 (1): 29–58. https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm.
Yang, Shu, Jae Kwang Kim, and Rui Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (2): 445–65. https://doi.org/10.1111/rssb.12354.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.