Title: | Weighting All of Us |
Version: | 0.1.0 |
Description: | Utilities for using a probability sample to reweight prevalence estimates calculated from the All of Us research program. Weighted estimates will still not be representative of the general U.S. population. However, they provide an early indication of how unweighted estimates may be affected by sampling bias in the All of Us sample. |
License: | AGPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Suggests: | testthat (≥ 3.0.0) |
Imports: | glmnet, dplyr, stringr, stats, glue, mice, nonprobsvy, survey, ggplot2, purrr |
Config/testthat/edition: | 3 |
Depends: | R (≥ 3.5) |
LazyData: | true |
NeedsCompilation: | no |
Packaged: | 2025-09-10 20:51:53 UTC; mbrannock |
Author: | Daniel Brannock |
Maintainer: | Daniel Brannock <mbrannock@rti.org> |
Repository: | CRAN |
Date/Publication: | 2025-09-15 09:10:02 UTC |
NHIS Adult Data 2023
Description
Raw survey results from adults for the 2023 National Health Interview Survey (NHIS). These are public-use data. Documentation for the dataset can be found at the source link. NHIS is conducted by the National Center for Health Statistics within the Centers for Disease Control and Prevention.
Usage
adult2023
Format
adult2023
A data frame with 29,522 rows and 647 columns.
Source
https://www.cdc.gov/nchs/nhis/documentation/2023-nhis.html
Synthetic All of Us Data
Description
Synthetic data intended to show how NHIS survey results can be used to generate weights for All of Us.
Usage
aou_synthetic
Format
Data frame with columns
- SEX_A_R_I
Sex: 0 (female), 1 (male)
- AGEP_A_R_I
Age in years: 1 (18-29), 2 (30-39), 3 (40-49), ..., 6 (70+)
- HISPALLP_A_R_I
Race/ethnicity: 1 (Hispanic), 2 (White), 3 (Black/African American), 4 (Other)
- ORIENT_A_R_I
Sexual orientation: 0 (Bisexual, Gay, or Lesbian), 1 (Straight)
- HICOV_A_R_I
Health insurance: 0 (Not insured), 1 (Insured)
- EDUCP_A_R_I
Education: 1 (Less than HS), 2 (HS or GED), 3 (Some college), 4 (College graduate), 5 (Advanced degree)
- REGION_R_I
Region: 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
- EMPLASTWK_A_R_I
Employment: 0 (Unemployed), 1 (Employed)
- HOUTENURE_A_R_I
Home ownership: 0 (Does not own home), 1 (Owns home)
- MARITAL_A_R_I
Marital status: 0 (Not married), 1 (Married)
- DEPEV_A_R_I
Depression: 0 (No diagnosis of depression), 1 (Has diagnosis of depression)
- DEMENEV_A_R_I
Dementia: 0 (No diagnosis of dementia), 1 (Has diagnosis of dementia)
- DIBTYPE_A_R_I
Type 2 diabetes: 0 (No diagnosis of type 2 diabetes), 1 (Has diagnosis of type 2 diabetes)
Source
Generated from data-raw/aou_synthetic.R.
Calculate Weights
Description
Calculate weights using three methods: IPW, Calibration, and Calibration+IPW
Usage
calculate_weights(
  sample_a,
  sample_b,
  method,
  aux_variables,
  study_variables,
  weight,
  strata,
  psu
)
Arguments
sample_a |
data.frame with representative sample |
sample_b |
data.frame with All of Us sample |
method |
string or string vector specifying the weighting method(s) to use: "ipw", "cal", or "ipw+cal" |
aux_variables |
character vector with names of calibration variables |
study_variables |
character vector with names of study variables |
weight |
character vector with name of the weight variable in sample_a |
strata |
character vector with name of the strata variable in sample_a |
psu |
character vector with name of the primary sampling units variable in sample_a |
Details
Calculates weights intended to reduce the sampling bias present in All of Us. Three versions of weights are calculated from different reweighting strategies: IPW, Calibration, and Calibration+IPW.
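The calibration idea mentioned above can be illustrated with its simplest special case, post-stratification on a single auxiliary variable. This is a minimal sketch under stated assumptions, not the package's implementation: the population totals are made up, and the names pop_totals, toy_b, and cal_weight are hypothetical.

```r
# Post-stratification: the simplest form of calibration (illustrative only).
# Hypothetical population totals per level of one auxiliary variable,
# e.g. as estimated from a weighted reference survey.
pop_totals <- c("0" = 600, "1" = 400)

# Toy convenience sample (plays the role of sample_b)
toy_b <- data.frame(sex = c(0, 0, 0, 1))

# Number of sample rows in each level
n_by_level <- table(toy_b$sex)

# Each row's weight is the population total of its level divided by the
# number of sample rows in that level
key <- as.character(toy_b$sex)
toy_b$cal_weight <- as.numeric(pop_totals[key]) / as.numeric(n_by_level[key])

# After weighting, the sample reproduces the population totals
weighted_totals <- tapply(toy_b$cal_weight, toy_b$sex, sum)
weighted_totals
```

With several auxiliary variables, calibration generally adjusts weights so that all marginal totals match at once, which needs an iterative or model-based procedure rather than this closed form.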
Value
list of data.frame with added (or replaced) weight columns and survey designs
Examples
# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)
# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)
# Calculate IPW weights using NHIS data and apply them to All of Us
weights_df <- calculate_weights(
  nhis_dummied,
  aou_dummied,
  'ipw',
  paste0(calVars, '_I'),
  paste0(stuVars, '_I'),
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)
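To make the IPW method concrete, the following base R sketch shows one common way to build inverse-probability-style pseudo-weights. It is an illustration under stated assumptions rather than the package's actual procedure: the two toy samples, the single auxiliary variable, and the names ref, conv, stacked, and ipw are all hypothetical.

```r
# Toy illustration of inverse probability weighting (IPW).
# All object names here are hypothetical and not part of the package.
set.seed(1)

# Reference probability sample with design weights (plays the role of sample_a)
ref <- data.frame(female = rbinom(500, 1, 0.5), w = 200)

# Convenience sample that over-represents women (plays the role of sample_b)
conv <- data.frame(female = rbinom(1000, 1, 0.7))

# Stack both samples; in_conv flags rows from the convenience sample.
# Reference rows keep their design weights, convenience rows get weight 1.
stacked <- rbind(
  data.frame(female = ref$female, in_conv = 0, w = ref$w),
  data.frame(female = conv$female, in_conv = 1, w = 1)
)

# Weighted logistic regression of sample membership on the auxiliary variable
fit <- glm(in_conv ~ female, family = quasibinomial(), data = stacked, weights = w)

# Estimated propensity for each convenience-sample row, converted to
# inverse-odds pseudo-weights
p <- predict(fit, newdata = conv, type = "response")
ipw <- (1 - p) / p

# Compare the share of women before and after weighting
c(
  unweighted = mean(conv$female),
  ipw_weighted = weighted.mean(conv$female, ipw),
  reference = weighted.mean(ref$female, ref$w)
)
```

Because the toy model has a single binary predictor, the weighted share in the convenience sample should closely match the weighted share in the reference sample. With many auxiliary variables the match is only approximate and depends on how well the propensity model is specified.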
Create Dummy Variables
Description
Create dummy variables of factors and character vectors in a data frame
Usage
dummies(input, vars)
Arguments
input |
data.frame with calibration variables |
vars |
character vector with names of variables requiring dummy encoding |
Value
data.frame with the new dummy variables
Examples
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
# First impute
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
# Then create dummy variables
nhis_vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_dummied <- dummies(nhis_imputed, vars=paste0(nhis_vars_dummies, '_I'))
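For readers unfamiliar with dummy encoding, the sketch below shows the general technique in base R using model.matrix(). It is only an illustration: whether dummies() is built on model.matrix(), and exactly how it names its output columns, are assumptions, and the toy data frame and column names are hypothetical.

```r
# Toy data frame with one multi-level categorical column (hypothetical names)
toy <- data.frame(
  id = 1:4,
  REGION = factor(c(1, 2, 4, 2), levels = 1:4)
)

# One 0/1 indicator column per level; "- 1" drops the intercept so that
# no level is left out as a reference category
indicators <- model.matrix(~ REGION - 1, data = toy)

# Column names are the variable name followed by the level
colnames(indicators)

# Append the indicator columns to the original data
toy_dummied <- cbind(toy, as.data.frame(indicators))
toy_dummied
```

Level 3 does not occur in the toy data, so its indicator column is all zeros; every row has exactly one indicator set to 1.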
Extract population totals
Description
Extract weighted population totals for the specified calibration variables from a representative sample
Usage
extract_totals(sample, vars, weight)
Arguments
sample |
data.frame with representative sample |
vars |
character vector with names of calibration variables |
weight |
character vector with name of the weight variable |
Details
Calculates weights intended to reduce the sampling bias present in All of Us. Three versions of weights are calculated from different reweighting strategies: IPW, Calibration, and Calibration+IPW.
Value
list of data.frame with added (or replaced) weight columns and survey designs
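As a rough illustration of what "population totals" means here, the sketch below computes weighted totals for 0/1 indicator columns in base R. It is a sketch of the general idea only: the exact shape of the object returned by extract_totals() is not confirmed by this page, and the toy data and names are hypothetical.

```r
# Toy representative sample with a survey weight column (hypothetical names)
toy <- data.frame(
  insured = c(1, 0, 1, 1),
  employed = c(1, 1, 0, 1),
  wt = c(100, 250, 150, 500)
)
vars <- c("insured", "employed")

# Weighted total per variable: the sum of the weights of rows where the
# indicator equals 1
totals <- sapply(toy[vars], function(x) sum(x * toy$wt))
totals

# Estimated population size is the sum of all weights
pop_size <- sum(toy$wt)
pop_size
```

Totals like these are the usual targets that calibration weights are adjusted to reproduce.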
Impute Data
Description
Add imputed data columns to existing data.frame
Usage
impute_data(
  input,
  vars,
  keep_vars = c(),
  return_mice = FALSE,
  impute_constant = NULL
)
Arguments
input |
data.frame with calibration variables |
vars |
character vector with names of variables to be imputed |
keep_vars |
character vector with names of additional variables that should be retained |
return_mice |
boolean for whether to return mice object (for looking at logged events) |
impute_constant |
numeric; if not NULL, missing values are imputed with the provided constant |
Details
For each of the specified variables, use all variables to predict missing values. Populate actual (when available) and imputed values into new columns whose names are suffixed with _I.
If you choose to return the mice object with return_mice, the function output will be a list that includes the final data.frame and the mice output.
Value
data.frame with imputed versions of variables
Examples
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
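The _I naming convention and the constant-imputation option can be pictured with a small base R sketch. It mimics only the behavior described for impute_constant; model-based imputation is not shown, and keeping the original columns alongside the _I columns is an assumption. The toy data and names are hypothetical.

```r
# Toy data with missing values (hypothetical column names)
toy <- data.frame(AGE = c(1, NA, 3), SEX = c(0, 1, NA))
vars <- c("AGE", "SEX")
constant <- 0

# For each variable, copy observed values and fill gaps with the constant,
# storing the result in a new column with an "_I" suffix
imputed <- toy
for (v in vars) {
  x <- toy[[v]]
  x[is.na(x)] <- constant
  imputed[[paste0(v, "_I")]] <- x
}
imputed
```

Downstream steps in the examples refer to the suffixed names, which is why they build variable lists with paste0(calVars, '_I').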
Processed NHIS Data
Description
Survey data from NHIS that has been sampled down, recoded, and subsetted.
Usage
nhis_processed
Format
Data frame with columns
- SEX_A_R_I
Sex: 0 (female), 1 (male)
- AGEP_A_R_I
Age in years: 1 (18-29), 2 (30-39), 3 (40-49), ..., 6 (70+)
- HISPALLP_A_R_I
Race/ethnicity: 1 (Hispanic), 2 (White), 3 (Black/African American), 4 (Other)
- ORIENT_A_R_I
Sexual orientation: 0 (Bisexual, Gay, or Lesbian), 1 (Straight)
- HICOV_A_R_I
Health insurance: 0 (Not insured), 1 (Insured)
- EDUCP_A_R_I
Education: 1 (Less than HS), 2 (HS or GED), 3 (Some college), 4 (College graduate), 5 (Advanced degree)
- REGION_R_I
Region: 1 (Northeast), 2 (Midwest), 3 (South), 4 (West)
- EMPLASTWK_A_R_I
Employment: 0 (Unemployed), 1 (Employed)
- HOUTENURE_A_R_I
Home ownership: 0 (Does not own home), 1 (Owns home)
- MARITAL_A_R_I
Marital status: 0 (Not married), 1 (Married)
- DEPEV_A_R_I
Depression: 0 (No self-reported depression), 1 (Has self-reported depression)
- DEMENEV_A_R_I
Dementia: 0 (No self-reported dementia), 1 (Has self-reported dementia)
- DIBTYPE_A_R_I
Type 2 diabetes: 0 (No self-reported type 2 diabetes), 1 (Has self-reported type 2 diabetes)
- PPSU
Person-level ID
- PSTRAT
Stratification to be used as part of the survey design
- WTFA_A
Weights used to assure representativeness of U.S. population (may not be valid for sampled data)
Source
Generated from data-raw/nhis_processed.
Visualize Prevalence Estimates
Description
Visualize prevalence estimates for calibration or outcome variables using different weighting methods.
Usage
plot_prevalence(df, mean, mean_se, method, cal_vars, cal_levels)
Arguments
df |
data.frame with prevalence estimates to plot (for example, the output of summarize_results_by_group) |
mean |
character name of mean prevalence estimate variable |
mean_se |
character name of the variable holding the standard error of the mean prevalence estimate |
method |
character name of the weighting method variable |
cal_vars |
character name of the variable with calibration variable names |
cal_levels |
character name of the variable with calibration variable levels |
Details
Specify columns and weighting methodologies of interest to visualize.
Value
ggplot object
Examples
library(dplyr)
library(stringr)
# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)
# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)
# Calculate IPW weights using NHIS data and apply them to All of Us
weights_df <- calculate_weights(
  nhis_dummied,
  aou_dummied,
  'ipw',
  paste0(calVars, '_I'),
  paste0(stuVars, '_I'),
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)
# Get IPW results by group
ipw_outcome_df <- summarize_results_by_group(
  weights_df,
  paste0(stuVars, '_I'),
  paste0(calVars, '_I'),
  weight_col='ipw_weight',
  label='AoU: IPW'
)
# Process data prior to plotting to make labels more readable
plot_df <- ipw_outcome_df %>%
  mutate(
    Name = case_when(
      group_var == 'SEX_A_R_I' & level_var == 1 ~ 'Sex: Male',
      group_var == 'SEX_A_R_I' & level_var == 0 ~ 'Sex: Female',
      group_var == 'AGEP_A_R_I1' & level_var == 1 ~ 'Age: 18-29',
      group_var == 'AGEP_A_R_I2' & level_var == 1 ~ 'Age: 30-39',
      group_var == 'AGEP_A_R_I3' & level_var == 1 ~ 'Age: 40-49',
      group_var == 'AGEP_A_R_I4' & level_var == 1 ~ 'Age: 50-59',
      group_var == 'AGEP_A_R_I5' & level_var == 1 ~ 'Age: 60-69',
      group_var == 'AGEP_A_R_I6' & level_var == 1 ~ 'Age: 70+',
      group_var == 'HISPALLP_A_R_I1' & level_var == 1 ~ 'Race/Eth: Hispanic',
      group_var == 'HISPALLP_A_R_I2' & level_var == 1 ~ 'Race/Eth: White',
      group_var == 'HISPALLP_A_R_I3' & level_var == 1 ~ 'Race/Eth: Black',
      group_var == 'HISPALLP_A_R_I4' & level_var == 1 ~ 'Race/Eth: Other',
      TRUE ~ group_var
    )
  ) %>%
  filter(str_detect(group_var, "SEX|AGEP|HISPALLP")) %>%
  filter(!str_detect(Name, "_")) %>%
  mutate(
    condition = case_when(
      outcome_var == 'DIBTYPE_A_R_I' ~ "Diabetes"
    ),
    VAR = case_when(
      str_detect(group_var, "SEX") ~ "Sex",
      str_detect(group_var, "AGE") ~ "Age",
      str_detect(group_var, "HISPALL") ~ "Race",
      str_detect(group_var, "EDUC") ~ "Educ"
    )
  )
# Plot
plot_prevalence(
  plot_df,
  'WMEAN',
  'SEMEAN',
  'Method',
  'VAR',
  'Name'
)
Select Variables
Description
Select variables relevant to propensity for inclusion in All of Us
Usage
select_variables(sample_a, sample_b, aux_variables)
Arguments
sample_a |
data.frame of the reference probability sample (i.e., NHIS) |
sample_b |
data.frame of the All of Us sample |
aux_variables |
character vector with names of auxiliary variables |
Details
Chooses which variables are meaningful in modeling propensity for inclusion in All of Us (sample_b) as compared to the general US population as represented by a reference probability sample (sample_a). This function assumes that variable names in both sample_a and sample_b are harmonized (i.e., definitions and names are the same across the two sources).
Value
character vector with selected variable names
Examples
# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)
# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)
# Define base variable names of auxiliary variables
aux_variables <- c(
  "SEX_A_R_I","AGEP_A_R_I", "HISPALLP_A_R_I","EDUCP_A_R_I",
  "REGION_R_I","ORIENT_A_R_I","HICOV_A_R_I",
  "EMPLASTWK_A_R_I","HOUTENURE_A_R_I","MARITAL_A_R_I"
)
# Provide All of Us and NHIS data to select variables
selected_base_vars <- select_variables(nhis_dummied, aou_dummied, aux_variables)
Summarize Results
Description
Get adjusted totals and prevalence for provided variables.
Usage
summarize_results(
  df,
  vars,
  weight_col = NULL,
  id_col = 1,
  strata_col = NULL,
  label = NULL
)
Arguments
df |
data.frame with sample and weights (if using a survey design) |
vars |
string vector of variables to calculate prevalences for |
weight_col |
string specifying the column with weights or NULL for unweighted |
id_col |
string specifying the column with IDs for cluster-aware standard error (SE) calculations |
strata_col |
string specifying the column with strata for cluster-aware SE calculations |
label |
string label for weighting method |
Value
data.frame with totals, means, and standard errors (if using a survey design)
Examples
# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)
# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)
# Calculate IPW weights using NHIS data and apply them to All of Us
weights_df <- calculate_weights(
  nhis_dummied,
  aou_dummied,
  'ipw',
  paste0(calVars, '_I'),
  paste0(stuVars, '_I'),
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)
results_ipw <- summarize_results(
  weights_df,
  c(paste0(calVars, '_I'), paste0(stuVars, '_I')),
  weight_col='ipw_weight',
  label='AoU: IPW'
)
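The quantities reported by this kind of summary (totals and mean prevalence) can be illustrated with base R on a toy data frame. This sketch shows only the point estimates; design-based standard errors need a survey design and are not covered. The data and column names are hypothetical and the code is not the package's implementation.

```r
# Toy sample with a binary outcome and a weight column (hypothetical names)
toy <- data.frame(
  diabetes = c(1, 0, 0, 1, 0),
  ipw_weight = c(2, 1, 1, 4, 2)
)

# Unweighted prevalence: plain share of rows with the outcome
unweighted_mean <- mean(toy$diabetes)

# Weighted total: sum of weights among rows with the outcome
weighted_total <- sum(toy$diabetes * toy$ipw_weight)

# Weighted prevalence: weighted total divided by the sum of all weights
weighted_mean <- weighted.mean(toy$diabetes, toy$ipw_weight)

c(unweighted = unweighted_mean, weighted = weighted_mean, total = weighted_total)
```

Here the two rows with the outcome carry more weight than average, so the weighted prevalence (0.6) is higher than the unweighted one (0.4).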
Summarize Results by Group
Description
Get adjusted totals and prevalences for provided variables, grouped by specified variables.
Usage
summarize_results_by_group(
  df,
  vars,
  group_vars,
  weight_col = NULL,
  id_col = NULL,
  strata_col = NULL,
  label = NULL
)
Arguments
df |
data.frame with sample and weights (if using a survey design) |
vars |
string vector of variables to calculate prevalences for |
group_vars |
string vector of variables to group by |
weight_col |
string specifying the column with weights, "nhis" or nhis survey design, or NULL for unweighted |
id_col |
string specifying the column with IDs for cluster-aware standard error (SE) calculations |
strata_col |
string specifying the column with strata for cluster-aware SE calculations |
label |
string label for weighting method |
Details
TODO: Merge into regular summarize_results function
Value
data.frame with totals, means, and standard errors (if using a survey design)
Examples
# Prepare the NHIS data
calVars <- c(
  "SEX_A_R", "AGEP_A_R", "HISPALLP_A_R", "ORIENT_A_R", "HICOV_A_R", "EDUCP_A_R", "REGION_R",
  "EMPLASTWK_A_R", "HOUTENURE_A_R", "MARITAL_A_R"
)
stuVars <- "DIBTYPE_A_R"
vars_dummies <- c("AGEP_A_R","HISPALLP_A_R","EDUCP_A_R","REGION_R")
nhis_keep_vars <- c("PPSU","PSTRAT","WTFA_A")
nhis_imputed <- impute_data(nhis_processed, c(calVars, stuVars), nhis_keep_vars)
nhis_dummied <- dummies(nhis_imputed, vars=paste0(vars_dummies, '_I'))
factor_vars <- setdiff(names(nhis_dummied), nhis_keep_vars)
nhis_dummied[factor_vars] <- lapply(nhis_dummied[factor_vars], as.factor)
# Prepare the synthetic All of Us data
aou_imputed <- impute_data(aou_synthetic, c(calVars, stuVars))
aou_dummied <- dummies(aou_imputed, vars=paste0(vars_dummies, '_I'))
aou_dummied[] <- lapply(aou_dummied, as.factor)
# Calculate IPW weights using NHIS data and apply them to All of Us
weights_df <- calculate_weights(
  nhis_dummied,
  aou_dummied,
  'ipw',
  paste0(calVars, '_I'),
  paste0(stuVars, '_I'),
  weight='WTFA_A',
  strata='PSTRAT',
  psu='PPSU'
)
# Get IPW results by group
ipw_outcome_df <- summarize_results_by_group(
  weights_df,
  paste0(stuVars, '_I'),
  paste0(calVars, '_I'),
  weight_col='ipw_weight',
  label='AoU: IPW'
)
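Grouped prevalence is the same weighted mean computed separately within each level of a grouping variable. The base R sketch below illustrates that idea on toy data; it is not the package's implementation, it omits standard errors, and the column names are hypothetical.

```r
# Toy sample with an outcome, a grouping variable, and weights (hypothetical names)
toy <- data.frame(
  diabetes = c(1, 0, 0, 1, 0, 1),
  sex = c(0, 0, 0, 1, 1, 1),
  w = c(2, 1, 1, 4, 2, 2)
)

# Split rows by group, then take the weighted mean of the outcome in each group
by_sex <- sapply(
  split(toy, toy$sex),
  function(g) weighted.mean(g$diabetes, g$w)
)
by_sex
```

The result is a named vector with one weighted prevalence per level of the grouping variable (0.5 for sex = 0 and 0.75 for sex = 1 in this toy data).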