Weighted Quantile Sum (WQS) regression is a statistical model for multivariate regression in high-dimensional datasets commonly encountered in environmental exposures, epi/genomics, and metabolomic studies, among others. The model constructs a weighted index estimating the mixed effect of all predictor variables on an outcome, which may then be used in a regression model with relevant covariates to test the association of the index with a dependent variable or outcome. The contribution of each individual predictor to the overall index effect may then be assessed by the relative strength of the weights the model assigns to each variable.
The gWQS package extends WQS regression to applications with continuous and categorical outcomes. In practical terms, the primary outputs of an analysis will be the parameter estimates and significance tests for the overall index effect of predictor variables, and the estimated weights assigned to each predictor, which identify the relevant contribution of each variable to the relationship between the WQS index and the outcome variable.
For additional theoretical background on WQS regression, see the references provided below.
gWQS
packageThe main function of the gWQS
package is gwqs
, which allows the implementation of Weighted Quantile Sum Regression for linear and a logistic regression. We created the wqs_data
dataset (available once the package is installed and loaded) to show how to use this function. These data reflect 34 exposure concentrations simulated from a distribution of PCB exposures measured in subjects participating in the NHANES study (2001-2002). Additionally, an end-point meaure, simulated from a distribution of leukocyte telomere length (LTL), a biomarker of chronic disease, is provided as well (variable name: y), as well as simulated covariates, e.g. sex, and a dichotomous outcome variable (variable name: disease_state). This dataset can thus be used to test the gWQS
package by analyzing the mixed effect of the 34 simulated PCBs on the continuous or binary outcomes, with adjustments for covariates.
Here is an overview of the dataset:
y | disease_state | sex | log_LBX074LA | log_LBX099LA | log_LBX105LA | log_LBX118LA | log_LBX138LA | log_LBX153LA | log_LBX156LA | log_LBX157LA | log_LBX167LA | log_LBX170LA | log_LBX180LA | log_LBX187LA | log_LBX189LA | log_LBX194LA | log_LBX196LA | log_LBX199LA | log_LBXD01LA | log_LBXD02LA | log_LBXD03LA | log_LBXD04LA | log_LBXD05LA | log_LBXD07LA | log_LBXF01LA | log_LBXF02LA | log_LBXF03LA | log_LBXF04LA | log_LBXF05LA | log_LBXF06LA | log_LBXF07LA | log_LBXF08LA | log_LBXF09LA | log_LBXPCBLA | log_LBXTCDLA | log_LBXHXCLA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.4224332 | 1 | 0 | 0.0680716 | -0.1608614 | 0.9648013 | -0.1815684 | 0.7506797 | 0.9363278 | -0.0364088 | 0.1971499 | 1.1192873 | 0.2972489 | 0.1920384 | -0.5257044 | -0.1659253 | 0.4017490 | -0.0143133 | 0.8692975 | 0.0846309 | 0.0012511 | -0.2847784 | 0.6327454 | 0.4078422 | 0.1661874 | -1.2905580 | -1.0359049 | 1.6893966 | -0.1942655 | 0.4322813 | -1.8049708 | -0.7816081 | 0.2459645 | -1.0917383 | -0.3476487 | -0.0358878 | 1.2711118 |
0.7613784 | 0 | 0 | 0.5276676 | 1.4403102 | 0.7753093 | 0.1055764 | -0.9015101 | 0.1677693 | 0.7643587 | -0.6001477 | 0.6175614 | -0.3577213 | 0.2879546 | -0.1557095 | -0.6755189 | 0.0519114 | 0.3351334 | -1.1385368 | 0.5852100 | -1.4539998 | 1.1116691 | 0.0767604 | -0.0295292 | -1.2578515 | -0.0246641 | 0.5600564 | 0.1143720 | 1.1365974 | 0.8777775 | -0.1214907 | 0.7902525 | 0.2255480 | -1.0510728 | 0.5891812 | -0.6304355 | 0.7265865 |
-2.1113324 | 1 | 1 | -1.8650139 | 1.4885628 | -1.5420050 | -0.5412270 | -1.3624492 | -0.4177779 | 0.1414001 | 0.1333590 | -0.9901708 | 0.1024100 | -0.0622932 | 1.0824292 | -0.6920183 | -1.7204760 | -1.2079225 | 0.3543356 | -0.8116185 | -0.8790250 | 0.4616429 | -2.0101411 | -1.2859585 | 0.8281122 | 0.0921212 | -0.6980090 | 0.2987547 | -0.0450820 | 0.4917081 | 0.3754392 | -1.6193861 | 0.0812163 | 0.2236327 | -0.5211774 | -1.0174183 | -0.4092153 |
2.2222394 | 0 | 0 | 1.2585209 | 1.6110700 | 2.2069536 | 1.3286183 | 1.8871201 | 1.8393058 | 2.2751535 | 1.1307767 | 1.2204913 | 0.1958512 | 0.9461497 | -0.0325942 | -0.3940764 | 0.2984753 | 0.7837346 | 1.9352485 | 0.1973576 | 0.8376586 | -1.2715081 | -1.2577145 | -0.5786619 | 0.6304414 | 1.3148279 | 0.2485757 | -0.4723713 | 0.7557946 | 0.3307601 | -0.0342781 | -0.2875164 | 0.4176680 | 0.0977815 | -1.1273511 | 1.0095760 | 0.6639172 |
-0.2190805 | 1 | 0 | -1.0138713 | -1.9903436 | -0.6009229 | -0.8907467 | -0.4789443 | -2.8190387 | -0.0095402 | -1.5477137 | 0.5180853 | -2.1491031 | -2.7951863 | -1.5755779 | -0.5270997 | -0.2450904 | -1.1242993 | -0.6902691 | -0.5791061 | -1.9495504 | -1.3426006 | 0.6178038 | -1.2587666 | -0.3310535 | -1.5882632 | -1.3967154 | -1.2128921 | -0.8581706 | -0.6420270 | -1.6946821 | -0.2143440 | -1.4720725 | -0.5997930 | -1.0896972 | -0.3474791 | -0.7175427 |
-1.7282770 | 0 | 0 | -0.9752573 | -0.6864727 | 0.3389021 | 0.5984246 | 0.0270300 | -0.4798327 | -0.3867123 | -1.0377681 | -0.7329465 | -0.7148301 | 0.4060471 | -0.5076881 | 0.2215813 | 0.3212146 | -0.1809994 | 1.4622912 | 0.0165252 | -0.6026058 | -0.0865081 | -0.7574191 | -0.0918236 | -0.3804055 | -0.0150627 | 0.5944020 | 0.4030363 | -0.1233755 | -0.7816779 | -0.3795485 | 0.4316160 | 1.7915524 | -0.9154308 | 0.2882417 | 0.6230951 | 1.4571064 |
0.4219639 | 0 | 1 | -0.6225983 | 1.0113598 | -0.2162429 | 0.7575008 | -0.2571729 | 0.9804511 | 0.4458043 | 1.3962681 | -0.1979610 | 0.1995122 | 1.4669089 | 1.0634414 | 0.8339600 | 1.5867141 | 0.7249490 | 0.7584681 | 0.2701363 | 0.9317193 | 0.0963273 | 0.3944993 | 0.6827900 | -0.5437556 | 0.9504820 | 0.5142011 | 0.5474847 | 0.4452620 | -0.4822166 | 0.7318831 | 0.0323481 | -0.9661976 | 0.7297474 | -0.1620114 | 0.5754746 | 0.9274046 |
-0.4762419 | 0 | 1 | -0.1801051 | -0.5805092 | -0.8038376 | -0.7786034 | -1.5457539 | -1.3117169 | -0.9734300 | -0.0261900 | 0.5308421 | -0.2485865 | 0.2103118 | -0.0912344 | -0.1011264 | -0.4217051 | 0.4682131 | -0.9008740 | -0.4390236 | -1.0384994 | -1.0510064 | -0.1617107 | -1.5014766 | -1.1384429 | 1.7696655 | 1.9844850 | -0.4584083 | -1.8996102 | -1.0736679 | 1.2719959 | 1.2918048 | -0.0684242 | 0.4754351 | -1.0453685 | 0.0445513 | -1.8944198 |
-0.5126990 | 1 | 0 | 0.4502187 | 0.7892304 | 0.7107290 | 0.2754549 | 0.0613806 | 0.3399312 | 1.0176518 | 1.1807174 | 1.2440091 | 1.5589123 | -0.8405278 | -0.2002233 | 0.2201809 | 0.2284441 | -0.1742271 | -0.1887278 | 0.3027553 | -0.3797491 | -0.4089264 | -0.5428011 | 1.2964970 | -0.2374201 | 0.1206623 | 0.4179228 | 1.3594435 | 1.6920543 | -0.3431321 | -0.2684203 | 0.9664860 | -0.0108648 | -0.9556556 | 2.0269212 | 0.0600171 | 0.4541168 |
-0.8479847 | 1 | 0 | -0.4461221 | -1.0556249 | 1.0676569 | 0.0681564 | -0.9582167 | -0.3967903 | 0.0196100 | 0.0467119 | 0.5172333 | 0.2781621 | 0.1674821 | 1.1879319 | -1.0221745 | -1.3879976 | -0.7295127 | -1.7526492 | -0.7668207 | 0.5923798 | 1.0463819 | 0.9856947 | -1.1673793 | -0.7855519 | -0.8638427 | -0.7379376 | -0.0984674 | -0.7446359 | -0.2253229 | -0.9085362 | -0.8710364 | 1.6918778 | -0.8597897 | -0.4343029 | 0.4885121 | -1.5592026 |
WQS with a continuous outcome: This script calls a wqs model for a continuous outcome using the function gwqs
.
# we save the names of the mixture variables in the variable "mix_name"
toxic_chems = c("log_LBX074LA", "log_LBX099LA", "log_LBX105LA", "log_LBX118LA",
"log_LBX138LA", "log_LBX153LA", "log_LBX156LA", "log_LBX157LA",
"log_LBX167LA", "log_LBX170LA", "log_LBX180LA", "log_LBX187LA",
"log_LBX189LA", "log_LBX194LA", "log_LBX196LA", "log_LBX199LA",
"log_LBXD01LA", "log_LBXD02LA", "log_LBXD03LA", "log_LBXD04LA",
"log_LBXD05LA", "log_LBXD07LA", "log_LBXF01LA", "log_LBXF02LA",
"log_LBXF03LA", "log_LBXF04LA", "log_LBXF05LA", "log_LBXF06LA",
"log_LBXF07LA", "log_LBXF08LA", "log_LBXF09LA", "log_LBXPCBLA",
"log_LBXTCDLA", "log_LBXHXCLA")
# we run the model and save the results in the variable "results"
results = gwqs(y ~ NULL, mix_name = toxic_chems, data = wqs_data, q = 4, validation = 0.6,
valid_var = NULL, b = 3, b1_pos = T, family = "gaussian", seed = 2016,
wqs2 = T, plots = T, tables = T)
This WQS model tests the relationship between our dependent variable, y, and a WQS index estimated from ranking exposure concentrations in quartiles (q = 4
). It also divided the data for training and validation, with 40% of the dataset for training and 60% for validation (validation = 0.6
), and assigned 3 bootstrapping steps (b = 3
) for parameter estimation. We chose to let the function create the training and validation dataset by itself (valid_var = NULL
). Because WQS provides an unidirectional evaluation of mixture effects, we first examined weights derived from bootstrapped models where \(\beta_1\) was positive (b1_pos = T
); we could test for negative associations by testig that parameter to false (b1_pos = F
). We linked our model to a gaussian distribution to test for relationships between the continuous outcome and exposures (family = "gaussian"
), and fixed the seed to 2016 for reproducible results (seed = 2016
). Since we suspected a non-linear dynamic we added the wqs2
parameter (wqs2 = T
) to include a quadratic term in the model. We plotted both a summary model with loess fit and a summary of each variables relative weight (plots = T
). Finally, in the directory we saved summaries (tables = T
) for the linear (Summary_results.html
) and the quadratic model (Summary_results_quadratic.html
), the results of ANOVA (Aov_results.html
) and the table with the weights (Weights_table.html
). A table with the regression results is printed automatically in the Viewer Pane.
The first plot is a barplot showing the weights assigned to each variable ordered from the highest weight to the lowest. These results indicate that the variables log_LBXF06LA
, log_LBXD02LA
, and log_LBXF04LA
are the largest contributors to this mixture effect, with the first 7 chemicals explaining over the 70% of the total weights. The same information is contained in the table results$final_weights
:
mix_name | mean_weight |
---|---|
log_LBXF06LA | 0.160 |
log_LBXD02LA | 0.151 |
log_LBXF04LA | 0.099 |
log_LBXF07LA | 0.092 |
log_LBX167LA | 0.078 |
log_LBXD04LA | 0.070 |
log_LBX138LA | 0.068 |
log_LBX157LA | 0.059 |
log_LBX196LA | 0.052 |
log_LBXD01LA | 0.030 |
log_LBX189LA | 0.028 |
log_LBX118LA | 0.025 |
log_LBX199LA | 0.022 |
log_LBX105LA | 0.015 |
log_LBX156LA | 0.014 |
log_LBXHXCLA | 0.012 |
log_LBX170LA | 0.009 |
log_LBXTCDLA | 0.006 |
log_LBXD05LA | 0.006 |
log_LBXF08LA | 0.004 |
log_LBXF05LA | 0.000 |
log_LBXD07LA | 0.000 |
log_LBX153LA | 0.000 |
log_LBXF02LA | 0.000 |
log_LBXPCBLA | 0.000 |
log_LBXF01LA | 0.000 |
log_LBX074LA | 0.000 |
log_LBX187LA | 0.000 |
log_LBX194LA | 0.000 |
log_LBXF09LA | 0.000 |
log_LBXD03LA | 0.000 |
log_LBX099LA | 0.000 |
log_LBX180LA | 0.000 |
log_LBXF03LA | 0.000 |
In the second plot we have a representation of the wqs index vs the outcome (adjusted for the model residual when covariates are included in the model) that show the direction and the shape of the association between the exposure and the outcome. For example in this case we can observe a linear and positive relationship between the mixture and the y
variable.
To test the statistical significance of the association between the variables in the model, the following code has to be run:
summary(results$fit)
These are the results for our example:
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
wqs | 1.492 | 0.1176 | 12.69 | 8.729e-30 |
(Intercept) | -2.304 | 0.1875 | -12.29 | 2.394e-28 |
This last table tells us that the association is positive and statistically significant (p<0.001
).
Since we decided to add also the wqs
quadratic term (wqs2 = TRUE
), the gwqs
function fits a second model adding the wqs quadratic term. The code to view the results of the test is the following:
summary(results$fit_2)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
wqs | 1.345 | 0.5 | 2.689 | 0.007574 |
wqs_2 | 0.0508 | 0.167 | 0.3042 | 0.7612 |
(Intercept) | -2.212 | 0.356 | -6.214 | 1.741e-09 |
The quadratic term is not significant in our model confirming that the relationship between the outcome and the exposure is linear (as shown by the previous plot).
There is also the option to compare the two models with the analysis of variance:
results$aov
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
298 | 376.9 | NA | NA | NA |
297 | 376.7 | 1 | 0.1174 | 0.761 |
Where the first line refers to the simple model and the second to the model with the wqs quadratic term. The results confirm that the linear model best explains the relationship between the outcome and the exposure.
The gwqs
function gives back other outputs like the vector of the estimated \(\beta_1\) in each bootstrap sample (results$b1
), the vector of the values that indicate whether the solver has converged (0) or not (1 or 2) (results$conv
), the matrix with all the estimated weights, \(\beta_1\) and p-values for each bootstrap sample (results$wb1pm
), the vector containing the y values adjusted for the residuals of the fitted model when it is covariates adjusted (results$y_adj
), the list of vectors containing the rownames
of the subjects included in each bootstrap dataset (results$index_b
), the data frame containing the subjects used to estimate the weights in each bootstrap (results$data_t
), the data frame containing the subjects used to estimate the parameters of the final model (results$data_v
) and the data frame containing the final weights associated to each chemical (results$final_weights
).
In the following code we run a logistic regression (family = "binomial"
) to test the association between the exposure to the mixture and the outcome disease_state
and we also add the covariate sex
. Since the mixture concentrations are already standardized we can also run a model without categorizing for quantiles (q = NULL
). We chose to create the training and validation dataset and assign to valid_var
the name of the variable that identifies the two datasets (valid_var = "group"
). Furthermore we examined weights derived from bootstrapped models where \(\beta_1\) was negative (b1_pos = F
) since (as we can see in the following plot) there is a negative association between the exposure and the outcome:
# we create the variable "group" in the dataset to identify the training and validation dataset:
# we choose 300 observations for the validation dataset and the remaining 200 for the training dataset
set.seed(2016)
wqs_data$group = 0
wqs_data$group[rownames(wqs_data) %in% sample(rownames(wqs_data), 300)] = 1
# we run the logistic model and save the results in the variable "results2"
results2 = gwqs(disease_state ~ sex, mix_name = toxic_chems, data = wqs_data, q = NULL,
validation = 0, valid_var = "group", b = 3, b1_pos = F,
family = "binomial", seed = 1959, wqs2 = F, plots = T, tables = T)
From the first plot we see the per-variable calculated weights, ordered by relative magnitude. As above, the second plot shows us a negative relationship between the mixture and the state of disease, but as we can see from the following table it is not statistically significant (p=0.132
):
summary(results2$fit)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
wqs | -0.3129 | 0.2079 | -1.505 | 0.1323 |
sex | 0.3674 | 0.2469 | 1.488 | 0.1368 |
(Intercept) | -0.877 | 0.1782 | -4.922 | 8.556e-07 |
Carrico C, Gennings C, Wheeler D, Factor-Litvak P. Characterization of a weighted quantile sum regression for highly correlated data in a risk analysis setting. J Biol Agricul Environ Stat. 2014:1-21. ISSN: 1085-7117. DOI: 10.1007/ s13253-014-0180-3. http://dx.doi.org/10.1007/s13253-014-0180-3.
Czarnota J, Gennings C, Colt JS, De Roos AJ, Cerhan JR, Severson RK, Hartge P, Ward MH, Wheeler D. 2015. Analysis of environmental chemical mixtures and non-Hodgkin lymphoma risk in the NCI-SEER NHL study. Environmental Health Perspectives.
Czarnota J, Gennings C, Wheeler D. 2015. Assessment of weighted quantile sum regression for modeling chemical mixtures and cancer risk. Cancer Informatics, 2015:14(S2) 159-171.
This package was developed at the CHEAR Data Center (Dept. of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai) with funding and support from NIEHS (U2C ES026555-01) with additional support from the Empire State Development Corporation.