| Type: | Package | 
| Title: | Controlled Difference Estimation for Complex Surveys | 
| Version: | 0.2.0 | 
| Maintainer: | Stephen Salerno <ssalerno@fredhutch.org> | 
| Description: | Estimates the population average controlled difference for a given outcome between levels of a binary treatment, exposure, or other group membership variable of interest for clustered, stratified survey samples where sample selection depends on the comparison group. Provides three methods for estimation, namely outcome modeling and two factorizations of inverse probability weighting. Under stronger assumptions, these methods estimate the causal population average treatment effect. Salerno et al., (2024) <doi:10.48550/arXiv.2406.19597>. | 
| License: | GPL (≥ 3) | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| RoxygenNote: | 7.3.2 | 
| VignetteBuilder: | knitr | 
| Imports: | MASS, betareg, numDeriv, stats, survey | 
| Depends: | R (≥ 4.1.0) | 
| Suggests: | knitr, rmarkdown, markdown, spelling | 
| URL: | https://github.com/salernos/svycdiff, https://salernos.github.io/svycdiff/ | 
| BugReports: | https://github.com/salernos/svycdiff/issues | 
| Language: | en-US | 
| NeedsCompilation: | no | 
| Packaged: | 2025-07-22 20:03:45 UTC; saler | 
| Author: | Stephen Salerno | 
| Repository: | CRAN | 
| Date/Publication: | 2025-07-22 20:40:12 UTC | 
Race, SES, and Telomere Length Data
Description
National Health and Nutrition Examination Survey (NHANES) data on race, socioeconomic status, and leukocyte telomere length from the 1999-2000 and 2001-2002 survey waves.
Usage
data(NHANES)
Format
A dataset with 5,298 observations (rows) of 29 variables (columns):
- SEQN
- Numeric: Respondent Sequence Number 
- iWTMEC4YR
- Numeric: 1/WTMEC4YR (Full Sample 4 Year Probability of Selection) 
- WTMEC4YR
- Numeric: Full Sample 4 Year Interview Weight 
- SDMVPSU
- Numeric: Masked Variance Pseudo-PSU 
- SDMVSTRA
- Numeric: Masked Variance Pseudo-Stratum 
- TELOMEAN
- Numeric: Mean T/S Ratio (See Details) 
- lTELOMEAN
- Numeric: log(TELOMEAN) 
- RACE_2CAT
- Numeric: 0 = Non-Hispanic White, 1 = Non-Hispanic Black (0/1 Coded for Current Functionality) 
- AGE
- Numeric: Age at Screening (Years) 
- SEX
- Factor w/ 2 Levels: Self-Reported Sex - Male, Female 
- EDUC_3CAT
- Factor w/ 3 Levels: Education - High School or GED, Some College, College Graduate 
- MARTL_3CAT
- Factor w/ 3 Levels: Marital Status - Never Married, Widowed/Divorced/Separated, Married/Living with Partner 
- HHSIZE_3CAT
- Factor w/ 5 Levels: Household Size - 1 Person, 2 People, 3 People, 4 People, 5+ People 
- HHINC_5CAT
- Factor w/ 5 Levels: Annual Household Income - $0 - $20,000, $20,000 - $35,000, $35,000 - $55,000, $55,000 - $75,000, $75,000+ 
- PIR
- Factor w/ 3 Levels: Family Poverty-Income Ratio Category - < 1.3, 1.3 <= PIR < 3.5, >= 3.5 
- EMPSTAT_4CAT
- Factor w/ 4 Levels: Employment Status - Full-Time, Part-Time, Retired, Not Working 
- OCC_5CAT
- Factor w/ 5 Levels: Occupation Category - No Work, Low Blue Collar, High Blue Collar, Low White Collar, High White Collar 
- WIC_2CAT
- Factor w/ 2 Levels: WIC Utilization - No WIC, Received WIC 
- FDSEC_3CAT
- Factor w/ 3 Levels: Food Security Status - Food Secure, Marginally Food Secure, Food Insecure 
- HOD_4CAT
- Factor w/ 4 Levels: Home Type - Family Home Detached, Family Home Attached, Apartment, Other 
- OWNHOME_2CAT
- Factor w/ 2 Levels: Home Ownership - Does Not Own Home, Owns Home 
- HIQ_2CAT
- Factor w/ 2 Levels: Insurance Status - Not Insured, Insured 
- LBXWBCSI
- Numeric: White Blood Cell Count (SI) 
- LBXLYPCT
- Numeric: Lymphocyte Percent (%) 
- LBXMOPCT
- Numeric: Monocyte Percent (%) 
- LBXNEPCT
- Numeric: Segmented Neutrophils Percent (%) 
- LBXEOPCT
- Numeric: Eosinophils Percent (%) 
- LBXBAPCT
- Numeric: Basophils Percent (%) 
- LBXBPB_LOD
- Numeric: Blood Lead Concentration (ug/dL; LOD = 0.3 ug/dL; Imputed by LOD / sqrt(2)) 
Details
Our initial sample consisted of 7,839 participants in the 1999-2002 NHANES
waves with laboratory measures recorded, including telomere length,
lTELOMEAN, which was assayed via quantitative polymerase chain
reaction (PCR; see Cawthorn, 2002). Our primary endpoint is the
log-transformed mean ratio of an individual's telomere length to a standard
reference DNA sample across all leukocyte cell types (mean T/S),
TELOMEAN. We focus on the 1999-2002 NHANES waves, as they featured
4-year adjusted survey weights, WTMEC4YR, designed for aggregating
data across cohorts. Among the initial 7,839 participants, 5,308 (67.7%)
self-identified as either non-Hispanic White or non-Hispanic Black.
Excluding those participants without our outcome of interest, our final
analytic sample contained 5,298 Non-Hispanic White or Non-Hispanic Black
identifying participants with measured telomere length. Race,
RACE_2CAT, is our variable of interest. We further included study
participant age, sex, and blood cell composition to account for known
differences in these factors, as well as twelve indicators of SES. Ten of
these, namely marital status, education level, household income, insurance
status, Special Supplemental Nutrition Program for Women, Infants, and
Children (WIC) usage, household size, home ownership, home type, food
security status, and an individual’s poverty income ratio (PIR), were
extracted directly from the NHANES demographic and occupation questionnaires.
Occupation category was constructed by mapping occupation group codes in the
NHANES 1999-2002 occupation questionnaire to the national statistics
socioeconomic job classifications, and employment status was derived from
three occupational measures: type of work done last week, hours worked last
week at all jobs, and main reason for not working last week (see Rehkopf et
al., 2008, Rose et al., 2005).
Source
<https://www.cdc.gov/nchs/nhanes/index.htm>
References
Richard M Cawthon. Telomere measurement by quantitative pcr. Nucleic acids research, 30(10):e47–e47, 2002.
David H Rehkopf, Lisa F Berkman, Brent Coull, and Nancy Krieger. The non-linear risk of mortality by income level in a healthy population: Us national health and nutrition examination survey mortality follow-up cohort, 1988–2001. BMC Public Health, 8(1):1–11, 2008.
David Rose, David J Pevalin, and Karen O’Reilly. The National Statistics Socio-economic Classification: origins, development and use. Palgrave Macmillan, 2005.
Examples
data(NHANES)
Simulate data with varying degrees of selection and confounding bias
Description
Function to simulate data based on specified relationships between the generated outcome, group variable, confounder(s), and selection mechanism.
Usage
simdat(
  N = 1e+06,
  p = 1,
  q = 0,
  n_strat = 1,
  n_clust = 1,
  sigma_strat = 1,
  sigma_clust = 1,
  X_fam = c("gaussian", "binary"),
  tau_0 = 0,
  tau_A = 1,
  tau_X = rep(1, p),
  tau_X12 = 0,
  beta_0 = 0,
  beta_A = 1,
  beta_X = rep(1, p),
  beta_U = rep(1, q),
  Y_fam = c("gaussian", "binary", "poisson"),
  alpha_0 = 0,
  alpha_A = 1,
  alpha_X = rep(1, p),
  alpha_AX = 0
)
Arguments
| N | int - Number of observations to be generated. Defaults to 1000000. | 
| p | int - Number of covariates to be generated. Defaults to 1. | 
| q | int - Number of additional covariates that affect selection to be generated. Defaults to 0. | 
| n_strat | int - Number of strata in the population to be generated. Defaults to 1. | 
| n_clust | int - Number of clusters within each stratum in the population to be generated. Defaults to 1. | 
| sigma_strat | double - Standard deviation of covariate means across strata. Defaults to 1. | 
| sigma_clust | double - Standard deviation of covariate means across clusters. Defaults to 1. | 
| X_fam | string - Distribution of the covariates,  | 
| tau_0 | double - Intercept for propensity model. Defaults to 0. | 
| tau_A | double - Scaling factor for group assignment. Defaults to 1. | 
| tau_X | double - Coefficients for  | 
| tau_X12 | double - Interaction term coefficient for  | 
| beta_0 | double - Intercept for selection model. Defaults to 0. | 
| beta_A | double - Coefficient for  | 
| beta_X | double - Coefficients for  | 
| beta_U | double - Coefficients for  | 
| Y_fam | string - Distribution of the outcome variable,  | 
| alpha_0 | double - Intercept for outcome model. Defaults to 0. | 
| alpha_A | double - Coefficient for  | 
| alpha_X | double - Coefficients for  | 
| alpha_AX | double - Coefficient for interaction between  | 
Details
The function generates data in a hierarchical structure with stratified clusters. The data generation process follows these steps:
1. Stratum and Cluster Means: For each of the n_strat
strata, a matrix of stratum-level means for p covariates is
generated from a normal distribution with standard deviation
sigma_strat. Similarly, for each of the n_clust clusters
within each stratum, cluster-level means are generated from a normal
distribution with standard deviation sigma_clust.
2. Covariate Generation: Within each cluster, covariates,
X, for N / (n_strat * n_clust) individuals are generated
from a multivariate normal distribution with mean equal to the sum of
the cluster and stratum means, and an identity covariance matrix.
3. Covariate Transformation: If X_fam is "binary",
each covariate is discretized at its median, otherwise it remains
continuous.
4. Propensity Model: The group variable, A, is generated
using a logistic regression model with intercept tau_0, covariate
effects tau_X, and an interaction effect between the first two
covariates with coefficient tau_X12. The group membership
probability, pA, is defined by the logistic model.
5. Selection Model: The probability of selection, pS, is
generated using a logistic regression model with intercept beta_0,
group effect beta_A, and covariate effects beta_X. Gaussian
noise is added to the linear predictor.
6. Outcome Model: The outcome, Y, is generated based on a
chosen outcome distribution, Y_fam. The linear predictor includes
an intercept, alpha_0, group effect, alpha_A, covariate
effects, alpha_X, and an optional interaction effect,
alpha_AX, between the group variable and covariates.
7. Controlled Difference: The true controlled difference in the
outcome between groups is calculated as CDIFF.
The output is a data frame containing the generated outcome, group variable, covariates, and selection probabilities.
Value
A data.frame with N observations and the following variables:
- Strata
- Stratum index (integer) 
- Cluster
- Cluster index (integer) 
- X1, X2, ..., Xp
- Confounding covariates (continuous or binary, depending on - X_fam)
- pA
- True probability of A = 1 conditional on X (continuous) 
- A
- Group assignment (binary) 
- pS
- True probability of selection conditional on A and X (continuous) 
- Y0
- Potential outcome under A = 0 (continuous, binary, or count depending on - Y_fam)
- Y1
- Potential outcome under A = 1 (continuous, binary, or count depending on - Y_fam)
- Y
- Observed outcome, based on treatment assignment (continuous, binary, or count depending on - Y_fam)
- CDIFF
- True controlled difference in outcomes by comparison group (double, computed as mean(Y1 - Y0)) 
Examples
N <- 100000
dat <- simdat(N)
head(dat)
Controlled Difference Estimation for Complex Surveys
Description
This is the main function to estimate population average controlled difference (ACD), or under stronger assumptions, the population average treatment effect (PATE), for a given outcome between levels of a binary treatment, exposure, or other group membership variable of interest for clustered, stratified survey samples where sample selection depends on the comparison group.
Usage
svycdiff(
  df,
  id_form,
  a_form,
  s_form,
  y_form,
  y_fam = NULL,
  strata = NULL,
  cluster = NULL
)
Arguments
| df | a 'data.frame' or 'tibble' containing the variables in the models. | 
| id_form | a 'string' indicating which identification formula to be used.
Options include  | 
| a_form | an object of class 'formula' which describes the propensity score model to be fit. | 
| s_form | an object of class 'formula' which describes the selection model to be fit. | 
| y_form | an object of class 'formula' which describes the outcome model
to be fit. Only used if  | 
| y_fam | a 'family' function. Only used if  | 
| strata | a 'string' indicating strata, else  | 
| cluster | a 'string' indicating cluster IDs, else  | 
Details
The argument id_form takes possible values "OM",
"IPW1", "IPW2", or "DR", corresponding to the four
formulas presented in Salerno et al. "OM" refers to the method that
uses outcome modeling and direct standardization to estimate the controlled
difference, while "IPW1" and "IPW2" are inverse probability
weighted methods. "IPW1" and "IPW2" differ with respect to how
the joint propensity and selection mechanisms are factored (see Salerno et
al. for additional details). "DR" refers to the doubly robust form of
estimator, which essentially combines "OM" and "IPW2".
For id_form = "IPW1" or id_form = "IPW2", y_form should
be of the form Y ~ 1.
For known selection mechanisms, s_form should be of the form
pS ~ 1, where pS is the variable corresponding to the
probability of selection (e.g., inverse of the selection weight), and there
should be two additional variables in the dataset: P_S_cond_A1X and
P_S_cond_A0X, corresponding to the known probability of selection
conditional on A = 1 or 0 and X = x, respectively. If
these quantities are not known, s_form should contain the variables
which affect sample selection on the right hand side of the equation,
including the comparison group variable of interest.
Value
'svycdiff' returns an object of class "svycdiff" which contains:
- id_form
- A string denoting Which method was selected for estimation 
- cdiff
- A named vector containing the point estimate (est), standard error (err), lower confidence limit (lcl), upper confidence limit (ucl), and p-value (pval) for the estimated controlled difference 
- fit_y
- An object of class inheriting from "glm" corresponding to the outcome model fit, or NULL for IPW1 and IPW2 
- fit_a
- An object of class inheriting from "glm" corresponding to the propensity model fit 
- wtd_fit_a
- An object of class inheriting from "glm" corresponding to the weighted propensity model fit 
- fit_s
- An object of class "betareg" corresponding to the selection model fit, or NULL if the selection mechanism is known 
Examples
N <- 1000
dat <- simdat(N)
S <- rbinom(N, 1, dat$pS)
samp <- dat[S == 1,]
y_mod <- Y ~ A * X1
a_mod <- A ~ X1
s_mod <- pS ~ A + X1
fit <- svycdiff(samp, "DR", a_mod, s_mod, y_mod, "gaussian")
fit
summary(fit)