The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

SLEEV: Semiparametric Likelihood Estimation with Errors in Variables

The complete R package logreg2ph and code for the simulation settings included in this paper.

Part of the R package TwoPhaseReg as described in this paper.

This package combines elements of the R packages logreg2ph and TwoPhaseReg

Install

To install the package, run the following in your R console:

devtools::install_github("Epic-Doughnut/sleev")

Simulation settings

Inside the simulations subdirectory, you will find the following:

Generating Data

Inside each of the files above, you will find code to generate the appropriate data for that simulation setting, e.g.,

```{r, eval = F, tidy = TRUE} set.seed(918)

Set sample sizes —————————————-

N <- 1000 # Phase I size = N n <- 250 # Phase II/audit size = n

Generate true values Y, Xb, Xa —————————-

Xa <- rbinom(n = N, size = 1, prob = 0.25) Xb <- rbinom(n = N, size = 1, prob = 0.5) Y <- rbinom(n = N, size = 1, prob = (1 + exp(-(- 0.65 - 0.2 * Xb - 0.1 * Xa))) ^ (- 1))

Generate error-prone Xb* from error model P(Xb*|Xb,Xa) ——

sensX <- specX <- 0.75 delta0 <- - log(specX / (1 - specX)) delta1 <- - delta0 - log((1 - sensX) / sensX) Xbstar <- rbinom(n = N, size = 1, prob = (1 + exp(- (delta0 + delta1 * Xb + 0.5 * Xa))) ^ (- 1))

Generate error-prone Y* from error model P(Y|Xb,Y,Xb,Xa) —

sensY <- 0.95 specY <- 0.90 theta0 <- - log(specY / (1 - specY)) theta1 <- - theta0 - log((1 - sensY) / sensY) Ystar <- rbinom(n = N, size = 1, prob = (1 + exp(- (theta0 - 0.2 * Xbstar + theta1 * Y - 0.2 * Xb - 0.1 * Xa))) ^ (- 1))


Then, the user has the option of two audit designs: simple random sampling (SRS) or 1:1 case-control sampling based on $Y^*$ (naive case-control). Based on these designs, the validation indicators V are generated as follows: 

```{r, eval = FALSE}
# Choose audit design: SRS or -----------------------------
## Naive case-control: case-control based on Y^* ----
audit <- "SRS" #or "Naive case-control"

# Draw audit of size n based on design --------------------
## V is a TRUE/FALSE vector where TRUE = validated --------
if(audit == "SRS")
{
    V <- seq(1, N) %in% sample(x = seq(1, N), size = n, replace = FALSE)
}
if(audit == "Naive case-control")
{
    V <- seq(1, N) %in% c(sample(x = which(Ystar == 0), size = 0.5 * n, replace = FALSE),
                          sample(x = which(Ystar == 1), size = 0.5 * n, replace = FALSE))
}

Finally, combine the generated data and validation indicators into an analytical dataset:

{r, eval = F, tidy = T} # Build dataset -------------------------------------------- sdat <- cbind(Y, Xb, Ystar, Xbstar, Xa, V) # Make Phase-II variables Y, Xb NA for unaudited subjects --- sdat[!V, c("Y", "Xb")] <- NA

Running Estimator Code

The R scripts each contain implementations for the estimators discussed in the paper. Examples of each are provided below:

1. Naive Analysis

{r, eval = F, tidy = T} naive <- glm(Ystar ~ Xbstar + Xa, family = "binomial", data = data.frame(sdat)) beta_naive <- naive$coefficients[2] se_naive <- sqrt(diag(vcov(naive)))[2]

2. Complete-Case Analysis

{r, eval = F, tidy = T} cc <- glm(Y[V] ~ Xb[V] + Xa[V], family = "binomial") beta_cc <- cc$coefficients[2] se_cc <- sqrt(diag(vcov(cc)))[2]

3. Horvitz–Thompson Estimator (for Naive Case-Control Audit Only)

{r, eval = F, tidy = T} library(sandwich) if (audit == "Naive case-control") { sample_wts <- ifelse(Ystar[V] == 0, 1 / ((0.5 * n) / (table(Ystar)[1])), 1 / ((0.5 * n) / (table(Ystar)[2]))) ht <- glm(Y[V] ~ Xb[V] + Xa[V], family = "binomial", weights = sample_wts) beta_ht <- ht$coefficients[2] se_ht <- sqrt(diag(sandwich(ht)))[2] }

4. Generalized Raking Estimator

```{r, eval = F, tidy = T} ### Influence function for logistic regression ### Taken from: https://github.com/T0ngChen/multiwave/blob/master/sim.r inf.fun <- function(fit) { dm <- model.matrix(fit) Ihat <- (t(dm) %% (dm fit\(fitted.values * (1 - fit\)fitted.values))) / nrow(dm) ## influence function infl <- (dm * resid(fit, type = “response”)) %*% solve(Ihat) infl }

naive_infl <- inf.fun(naive) # error-prone influence functions based on naive model colnames(naive_infl) <- paste0(“if”, 1:3)

Add naive influence functions to sdat ———————————————–

sdat <- cbind(id = 1:N, sdat, naive_infl)

Calibrate raking weights to the sum of the naive influence functions —————-

library(survey) if (audit == “SRS”) { sstudy <- twophase(id = list(~id, ~id), data = data.frame(sdat), subset = ~V) } else if (audit == “Naive case-control”) { sstudy <- twophase(id = list(~id, ~id), data = data.frame(sdat), strat = list(NULL, ~Ystar), subset = ~V) } scal <- calibrate(sstudy, ~ if1 + if2 + if3, phase = 2, calfun = “raking”)

Fit analysis model using calibrated weights —————————————–

rake <- svyglm(Y ~ Xb + Xa, family = “binomial”, design = scal) beta_rake <- rake$coefficients[2] se_rake <- sqrt(diag(vcov(rake)))[2]


#### 5. Maximum Likelihood Estimator (MLE) (for Binary Xb* Only)

```{r, eval = F, tidy = T}
# Script: two-phase log-likelihood specification adapted from Tang et al. (2015) named ~/code/Tang_twophase_loglik_binaryX.R
source("Tang_twophase_loglik_binaryX.R")
fit_Tang <- nlm(f = Tang_twophase_loglik,
                p = rep(0, 12),
                hessian = TRUE,
                Val = "V",
                Y_unval = "Ystar",
                Y_val="Y",
                X_unval = "Xbstar",
                X_val = "Xb",
                C = "Xa",
                data = sdat)
beta_mle <- fit_Tang$estimate[10]
se_mle <- sqrt(diag(solve(fit_Tang$hessian)))[10]

6. Sieve Maximum Likelihood Estimator (SMLE)

```{r, eval = F, tidy = T} # Construct B-spline basis ——————————- nsieve <- 4 B <- matrix(0, nrow = N, ncol = nsieve) B[which(Xa == 0 & Xbstar == 0), 1] <- 1 B[which(Xa == 0 & Xbstar == 1), 2] <- 1 B[which(Xa == 1 & Xbstar == 0), 3] <- 1 B[which(Xa == 1 & Xbstar == 1), 4] <- 1 colnames(B) <- paste0(“bs”, seq(1, nsieve)) sdat <- cbind(sdat, B)

library(“logreg2ph”) smle <- logreg2ph(Y_unval = “Ystar”, Y_val = “Y”, X_unval = “Xbstar”, X_val = “Xb”, C = “Xa”, Validated = “V”, Bspline = colnames(B), data = sdat, noSE = FALSE, MAX_ITER = 1000, TOL = 1E-4) beta_smle <- smle\(Coefficients\)Coefficient[2] se_smle <- smle\(Coefficients\)SE[2] ```

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.