Relative risks and odds ratios are often confused or used interchangeably. In particular, the exponentiated coefficients of a logistic regression are (adjusted) odds ratios, yet many public health studies unwittingly report them as (adjusted) relative risks. Because the relative risk, rather than the odds ratio, is the quantity of interest in many applications, we propose a software tool that converts odds ratios to relative risks under logistic regression. Unlike the adjusted odds ratio, which does not depend on the values of the other confounders, the adjusted relative risk may vary with the other confounders in the logistic model, so we also examine analytically how those confounders affect the adjusted relative risk.
Let us first define the adjusted relative risk of a binary exposure \(E\) on a binary outcome \(D\) conditional on \(\mathbf{X}\):
\[\frac{P(D = 1 \mid E = 1, \mathbf{X} )}{P(D = 1 \mid E = 0, \mathbf{X})}\]
More generally, when the exposure variable \(E\) is continuous or ordinal, we define the adjusted relative risk as the ratio of the probability of observing \(D = 1\) when \(E = z+1\) to that when \(E = z\), conditional on \(\mathbf{X}\). Unlike the adjusted odds ratio, this ratio depends on the baseline exposure value \(z\) under logistic regression.
\[\frac{P(D = 1 \mid E = z+1, \mathbf{X} )}{P(D = 1 \mid E = z, \mathbf{X})}\]
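To see this dependence numerically, the following sketch evaluates both ratios for a one-unit increase in exposure at several baseline values \(z\). The coefficients b0 and b1 are assumed values chosen only for illustration; they are not estimates from any data.

```r
# Assumed coefficients, for illustration only (not estimated from data)
b0 <- -2
b1 <- 0.5
z <- 0:4                             # baseline exposure values

p_base <- plogis(b0 + b1 * z)        # P(D = 1 | E = z)
p_next <- plogis(b0 + b1 * (z + 1))  # P(D = 1 | E = z + 1)

rr <- p_next / p_base                                    # relative risk
or <- (p_next / (1 - p_next)) / (p_base / (1 - p_base))  # odds ratio

round(data.frame(z = z, RR = rr, OR = or), 3)
```

The odds ratio equals exp(b1) at every \(z\), whereas the relative risk shrinks toward 1 as the baseline probability grows.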
On the other hand, when the exposure variable is nominal, a one-unit change is not meaningful. Therefore, when the exposure variable is a factor, we let users specify two levels of the exposure variable, a baseline level (\(z_{0}\)) and a comparative level (\(z_{1}\)), and derive the relative risk for those two exposure levels.
\[\frac{P(D = 1 \mid E = z_{1}, \mathbf{X} )}{P(D = 1 \mid E = z_{0}, \mathbf{X})}\] This is the most general form: setting \(z_{1} = 1\) and \(z_{0} = 0\) recovers the binary case.
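As a minimal sketch of this definition, the ratio can be computed from any fitted logistic model with two calls to predict. The helper adjusted_rr and the simulated data below are hypothetical and are not part of logisticRR; they only illustrate the quantity being defined.

```r
# Hypothetical helper (not part of logisticRR): adjusted relative risk
# P(D = 1 | E = z1, X) / P(D = 1 | E = z0, X) from a fitted logistic model.
adjusted_rr <- function(fit, exposure, z0, z1, fixcov) {
  base <- fixcov
  comp <- fixcov
  base[[exposure]] <- z0   # baseline exposure level
  comp[[exposure]] <- z1   # comparative exposure level
  p0 <- predict(fit, newdata = base, type = "response")
  p1 <- predict(fit, newdata = comp, type = "response")
  unname(p1 / p0)
}

# Simulated data, for illustration only
set.seed(1)
n <- 1000
sim <- data.frame(E = rbinom(n, 1, 0.4), X = rnorm(n))
sim$D <- rbinom(n, 1, plogis(-1 + 0.8 * sim$E + 0.5 * sim$X))
fit <- glm(D ~ E + X, data = sim, family = binomial)

# Same exposure contrast, two different values of the conditioning covariate
rr_low  <- adjusted_rr(fit, "E", z0 = 0, z1 = 1, fixcov = data.frame(X = -1))
rr_high <- adjusted_rr(fit, "E", z0 = 0, z1 = 1, fixcov = data.frame(X = 1))
or_hat  <- unname(exp(coef(fit)["E"]))  # adjusted odds ratio, free of X

c(RR_at_X_minus1 = rr_low, RR_at_X_plus1 = rr_high, OR = or_hat)
```

The two relative risks differ because they condition on different values of X, while the odds ratio is a single number for the whole model.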
For (adjusted) odds ratios derived from logistic regression, the variance-covariance matrix of the coefficients is available directly from the glm function in R. However, deriving the variance of the adjusted relative risk, which is a nonlinear function of those coefficients, is more challenging.
We provide two estimators. The first applies the Delta method to the estimated variance-covariance matrix of the coefficients from glm. The second uses the sampling variance of the relative risk across bootstrap samples.
Let \(\boldsymbol{\beta}\) be the vector of coefficients of the logistic regression, where \(\beta_{1}\) is the coefficient of the exposure variable of interest, which takes the value \(z_{0}\) at the baseline level and \(z_{1}\) at the comparative level. Then we can write the adjusted relative risk as a function of \(\boldsymbol{\beta}\) conditional on \(\mathbf{X}\):
\[g(\boldsymbol{\beta}) = \frac{1 + \exp(-\beta_{0} - \beta_{1} z_{0} - \boldsymbol{\beta}^{T}_{2:p} \mathbf{X}) }{ 1 + \exp (-\beta_{0} - \beta_{1} z_{1} - \boldsymbol{\beta}^{T}_{2:p} \mathbf{X}) }\]
Then by the Delta method, \[var[g(\boldsymbol{\beta})] = \left\{\frac{\partial g(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right\}^{T} var(\boldsymbol{\beta}) \left\{\frac{\partial g(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \right\}\] Note that, since \(\boldsymbol{\beta} = (\beta_{0}, \beta_{1}, \ldots, \beta_{p})\), \(\frac{\partial g(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\) is \((p+1) \times 1\) and \(var(\boldsymbol{\beta})\) is \((p+1) \times (p+1)\), so \(var[g(\boldsymbol{\beta})]\) is a scalar.
The \((p+1) \times (p+1)\) matrix \(var{(\boldsymbol{\beta})}\) is obtained by summary(fit)$cov.unscaled when fit is a glm object. For a binomial family the dispersion is fixed at 1, so this matrix equals vcov(fit).
Let \(e_{0} = \exp(-\beta_{0} - \beta_{1} z_{0} - \boldsymbol{\beta}^{T}_{2:p} \mathbf{X})\) and \(e_{1} = \exp (-\beta_{0} - \beta_{1} z_{1} - \boldsymbol{\beta}^{T}_{2:p} \mathbf{X})\).
With this notation \(g(\boldsymbol{\beta}) = (1 + e_{0})/(1 + e_{1})\), and since \(\partial e_{k} / \partial \beta_{0} = -e_{k}\) and \(\partial e_{k} / \partial \beta_{1} = -z_{k} e_{k}\) for \(k = 0, 1\), the quotient rule gives \[\frac{\partial g(\boldsymbol{\beta})}{\partial \beta_{0}} = \frac{e_{1} - e_{0}}{(1 + e_{1})^2 } = \frac{e_{0}(\exp(-\beta_{1}(z_{1} - z_{0}) ) - 1) }{(1 + e_{1})^2}\] \[\frac{\partial g(\boldsymbol{\beta})}{\partial \beta_{1}} = \frac{z_{1} e_{1}( 1 + e_{0}) - z_{0} e_{0}(1 + e_{1}) }{(1 + e_{1})^2 }\] For any \(j = 2,3,\ldots, p\), where \(X_{j}\) is the covariate whose effect is \(\beta_{j}\) and \(x_{j}\) is the value at which it is fixed: \[\frac{\partial g(\boldsymbol{\beta})}{\partial \beta_{j}} = \frac{x_{j}(e_{1} - e_{0} ) }{ (1+e_{1})^2} = \frac{x_{j} e_{0} (\exp(-(z_{1} - z_{0})\beta_{1}) - 1) }{(1 + e_{1})^2}\]
By combining the estimated \(var{(\boldsymbol{\beta})}\) with the gradient \(\frac{\partial g(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\) evaluated at \(\boldsymbol{\hat{\beta}}\), we obtain the estimated variance of \(g(\boldsymbol{\hat{\beta}})\).
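The Delta-method calculation can be sketched in a few lines of base R. The helper delta_var_rr, the simulated data, and the chosen covariate value are all hypothetical; this is not the code used inside logisticRR. The gradient is obtained by differentiating \(g = (1 + e_{0})/(1 + e_{1})\) directly and is checked against a finite-difference approximation.

```r
# Simulated data, for illustration only
set.seed(1)
n <- 1000
sim <- data.frame(E = rbinom(n, 1, 0.4), X = rnorm(n))
sim$D <- rbinom(n, 1, plogis(-1 + 0.8 * sim$E + 0.5 * sim$X))
fit <- glm(D ~ E + X, data = sim, family = binomial)

# Hypothetical helper: x0 and x1 are full design rows
# (intercept, exposure, covariates) at the baseline and comparative levels.
delta_var_rr <- function(fit, x0, x1) {
  beta <- unname(coef(fit))
  V <- vcov(fit)                       # estimated var(beta)
  e0 <- exp(-sum(x0 * beta))
  e1 <- exp(-sum(x1 * beta))
  # gradient of g = (1 + e0) / (1 + e1) with respect to beta
  grad <- (x1 * e1 * (1 + e0) - x0 * e0 * (1 + e1)) / (1 + e1)^2
  list(rr = (1 + e0) / (1 + e1),
       grad = grad,
       var = drop(t(grad) %*% V %*% grad))
}

x0 <- c(1, 0, 0.5)  # intercept, E = z0 = 0, X fixed at 0.5
x1 <- c(1, 1, 0.5)  # intercept, E = z1 = 1, X fixed at 0.5
res <- delta_var_rr(fit, x0, x1)

# Central finite-difference check of the analytic gradient
g <- function(b) (1 + exp(-sum(x0 * b))) / (1 + exp(-sum(x1 * b)))
beta_hat <- unname(coef(fit))
h <- 1e-6
num_grad <- sapply(seq_along(beta_hat), function(k) {
  step <- replace(numeric(length(beta_hat)), k, h)
  (g(beta_hat + step) - g(beta_hat - step)) / (2 * h)
})

res$var
cbind(analytic = res$grad, numeric = num_grad)
```

Because the variance is a quadratic form in the gradient, flipping the sign of every gradient component leaves the result unchanged; only the relative signs of the components matter.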
Both logisticRR and nominalRR take a logical argument boot: with boot = TRUE, the functions additionally return a vector of n.boot bootstrap estimates of the (adjusted) relative risk.
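For intuition, a bootstrap variance of this kind can be sketched in base R as follows. This assumes a plain nonparametric bootstrap that resamples rows with replacement and refits the model each time; the resampling scheme used inside logisticRR may differ, and the helper rr_at and the simulated data are hypothetical.

```r
# Simulated data, for illustration only
set.seed(2)
n <- 500
sim <- data.frame(E = rbinom(n, 1, 0.4), X = rnorm(n))
sim$D <- rbinom(n, 1, plogis(-1 + 0.8 * sim$E + 0.5 * sim$X))

# Hypothetical helper: refit the model and return the adjusted relative risk
# of E = 1 versus E = 0 with X fixed at x_fixed.
rr_at <- function(data, x_fixed = 0) {
  fit <- glm(D ~ E + X, data = data, family = binomial)
  p1 <- predict(fit, data.frame(E = 1, X = x_fixed), type = "response")
  p0 <- predict(fit, data.frame(E = 0, X = x_fixed), type = "response")
  unname(p1 / p0)
}

n_boot <- 200
boot_rr <- replicate(n_boot, {
  idx <- sample(nrow(sim), replace = TRUE)  # resample rows with replacement
  rr_at(sim[idx, ])
})

rr_at(sim)     # point estimate on the original data
var(boot_rr)   # bootstrap sampling variance of the relative risk
```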
As a first example, we generate hypothetical data of size \(n=500\).
library(logisticRR)
n <- 500
set.seed(1234)
X <- rbinom(n, 1, 0.3)
W <- rbinom(n, 1, 0.3); W[sample(1:n, n/3)] = 2
Z <- rep(0, n)
Z[sample(1:n, n/2)] <- "female"; Z <- ifelse(Z == 0, "male", Z)
dummyZ <- ifelse(Z == "female", 1, 0)
Y <- rbinom(n, 1, plogis(X - W + 2*dummyZ))
# build the data frame directly so that Y and X stay numeric 0/1;
# cbind() with the character vector Z coerces every column to character,
# which breaks the later recoding in R >= 4.0
dat <- data.frame(Y = Y, X = X, W = as.factor(W), Z = as.factor(Z))
head(dat)
#> Y X W Z
#> 1 0 0 2 male
#> 2 0 0 2 female
#> 3 1 0 1 female
#> 4 0 0 0 male
#> 5 0 1 2 female
#> 6 0 0 2 male
The code below estimates the variance of the adjusted relative risk of binary \(X\) on the binary outcome \(Y\) by generating n.boot = 200 bootstrap samples. Because we specify neither the baseline level of the exposure variable (basecov) nor the values of the conditioning covariates W and Z (fixcov), the baseline exposure level is set to 0 by default. Since W and Z are both factors, they are fixed at their first levels, which are 0 and female.
simresult200 <- logisticRR(Y ~ X + W + Z, data = dat, boot = TRUE, n.boot = 200)
simresult200$RR
#> 1
#> 1.076041
var(simresult200$boot.rr)
#> [1] 0.0005360965
simresult200$delta.var
#> [1] 0.0004270613
## print out conditioning
simresult200$fix.cov
#> W Z
#> 1 0 female
This time we increase the number of bootstrap samples to n.boot = 1000. Note that the bootstrap sampling variance moves closer to the variance estimated by the Delta method (delta.var).
simresult1000 <- logisticRR(Y ~ X + W + Z, data = dat, boot = TRUE, n.boot = 1000)
var(simresult1000$boot.rr)
#> [1] 0.0004876841
simresult1000$delta.var
#> [1] 0.0004270613
We have a total of six combinations of the confounder variables. Under the logistic regression model, the adjusted odds ratio is constant across these combinations, but the adjusted relative risk is not.
levels(dat$W)
#> [1] "0" "1" "2"
levels(dat$Z)
#> [1] "female" "male"
adjusted <- cbind(rep(levels(dat$W), 2), rep(levels(dat$Z), each = 3))
adjusted <- as.data.frame(adjusted)
names(adjusted) <- c("W", "Z")
The adjusted relative risk tends to be higher for male and for higher levels of W.
## compare with odds ratio
results <- list()
for(i in 1:nrow(adjusted)){
results[[i]] <- logisticRR(Y ~ X + W + Z, data = dat, fixcov = adjusted[i,], boot = FALSE)
}
## adjusted relative risk
# female
print(c(results[[1]]$RR, results[[2]]$RR, results[[3]]$RR))
#> 1 1 1
#> 1.076041 1.206734 1.494543
# male
print(c(results[[4]]$RR, results[[5]]$RR, results[[6]]$RR))
#> 1 1 1
#> 1.650889 2.248270 2.811120
## adjusted odds ratio
## all the same : by the assumption of logistic regression
print(exp(coefficients(results[[1]]$fit)[2]))
#> X
#> 3.441811
# betas <- coefficients(fit)
# exposed <- exp(-predict(fit, expose.cov, type = "link"))
# unexposed <- exp(-predict(fit, unexpose.cov, type = "link"))
# RR <- (1 + unexposed) / (1 + exposed)
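The pattern in these outputs follows from the model itself. As a short derivation in the notation used for the Delta method, write \(P_{k} = 1/(1 + e_{k})\) for the outcome probability at exposure level \(z_{k}\), so that the odds are \(P_{k}/(1 - P_{k}) = 1/e_{k}\). Then
\[OR = \frac{1/e_{1}}{1/e_{0}} = \frac{e_{0}}{e_{1}} = \exp(\beta_{1}(z_{1} - z_{0}))\]
which does not involve \(\mathbf{X}\), while
\[RR = \frac{1 + e_{0}}{1 + e_{1}} = \frac{1 + e_{0}}{1 + e_{0}/OR}\]
depends on \(\mathbf{X}\) through \(e_{0}\). As the baseline risk goes to 0 (\(e_{0} \to \infty\)), \(RR \to OR\); as the baseline risk goes to 1 (\(e_{0} \to 0\)), \(RR \to 1\). In the simulated data, Y was generated from plogis(X - W + 2*dummyZ) with dummyZ = 1 for female, so male and higher levels of W correspond to a lower baseline risk, which is why the adjusted relative risk moves toward the odds ratio for those groups.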
Let us lower the prevalence of the outcome variable (\(Y\)).
dat2 <- dat
dat2$Y <- ifelse(dat$Y == 1, rbinom(n, 1, 0.2), rbinom(n, 1, 0.01))
## compare with odds ratio
results2 <- list()
for(i in 1:nrow(adjusted)){
results2[[i]] <- logisticRR(Y ~ X + W + Z, data = dat2, fixcov = adjusted[i,], boot = TRUE, n.boot = 1000)
}
## adjusted relative risk
# female
print(c(results2[[1]]$RR, results2[[2]]$RR, results2[[3]]$RR))
#> 1 1 1
#> 1.173466 1.177487 1.186902
# male
print(c(results2[[4]]$RR, results2[[5]]$RR, results2[[6]]$RR))
#> 1 1 1
#> 1.184617 1.187547 1.194245
## adjusted odds ratio
## all the same : by the assumption of logistic regression
print(exp(coefficients(results2[[1]]$fit)[2]))
#> X
#> 1.209375
# betas <- coefficients(fit)
# exposed <- exp(-predict(fit, expose.cov, type = "link"))
# unexposed <- exp(-predict(fit, unexpose.cov, type = "link"))
# RR <- (1 + unexposed) / (1 + exposed)
Figure: distribution of adjusted relative risks for each combination of confounders. Each black dot denotes the point estimate of the adjusted relative risk, and the red horizontal line denotes the adjusted odds ratio, which does not depend on the levels of the confounders.
We use the airquality data to illustrate the use of logisticRR. The data set ships with R in the datasets package and can be loaded with data(airquality).
Because the outcome variable Ozone is continuous, we binarize it into ozone1 (1 for values below the 10th percentile and 0 otherwise), ozone2 (1 for values below the 20th percentile and 0 otherwise), and ozone3 (1 for values below the 30th percentile and 0 otherwise).
# delete observations having NAs
ozonedat <- na.omit(airquality)
# define binary ozone level
ozonedat$ozone1 <- ifelse(ozonedat$Ozone < quantile(ozonedat$Ozone, prob = 0.1), 1, 0)
ozonedat$ozone2 <- ifelse(ozonedat$Ozone < quantile(ozonedat$Ozone, prob = 0.2), 1, 0)
ozonedat$ozone3 <- ifelse(ozonedat$Ozone < quantile(ozonedat$Ozone, prob = 0.3), 1, 0)
As the exposure variable of main interest, we use the numeric variable Temp (temperature). Because the range of Temp is wide, we divide it by 10 and use Temp2 in the model, so that a one-unit change in Temp2 corresponds to a change of 10 degrees Fahrenheit in Temp.
summary(ozonedat$Temp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 57.00 71.00 79.00 77.79 84.50 97.00
ozonedat$Temp2 <- ozonedat$Temp / 10
summary(ozonedat$Temp2)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 5.700 7.100 7.900 7.779 8.450 9.700
As confounding variables, we choose solar radiation (Solar.R) and average wind speed (Wind), so that the formula passed to glm is, for example, ozone1 ~ Temp2 + Solar.R + Wind. We condition on the mean of each confounder by setting fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)).
ozone.fit1 <- logisticRR(ozone1 ~ Temp2 + Solar.R + Wind, data = ozonedat, basecov = min(ozonedat$Temp2),
fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)),
boot = FALSE)
ozone.fit2 <- logisticRR(ozone2 ~ Temp2 + Solar.R + Wind, data = ozonedat, basecov = min(ozonedat$Temp2),
fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)),
boot = FALSE)
ozone.fit3 <- logisticRR(ozone3 ~ Temp2 + Solar.R + Wind, data = ozonedat, basecov = min(ozonedat$Temp2),
fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)),
boot = FALSE)
The smaller the prevalence of the outcome (ozone1 < ozone2 < ozone3 in prevalence), the closer the estimated adjusted relative risk is to the adjusted odds ratio of the same model.
print(c(ozone.fit1$RR, ozone.fit2$RR, ozone.fit3$RR))
#> 1 1 1
#> 0.7644533 0.5527379 0.6431837
## adjusted odds ratio (ozone1 model)
exp(ozone.fit1$fit$coefficients[2])
#> Temp2
#> 0.7600622
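The link between outcome prevalence and the gap between the two measures can be made explicit. Under the logistic model, RR = OR / (1 - p0 + p0 * OR), where p0 is the baseline risk \(P(D = 1 \mid E = z_{0}, \mathbf{X})\). The sketch below evaluates this identity for an odds ratio of 0.76, roughly the value reported above, at a few hypothetical baseline risks that are not taken from the airquality fits.

```r
or <- 0.76                      # roughly the adjusted odds ratio reported above
p0 <- c(0.01, 0.1, 0.3, 0.6)    # hypothetical baseline risks

# Relative risk implied by the odds ratio at each baseline risk
rr_formula <- or / (1 - p0 + p0 * or)

# Same quantity computed directly on the logit scale, as a cross-check
p1 <- plogis(qlogis(p0) + log(or))
rr_direct <- p1 / p0

round(data.frame(p0 = p0, RR = rr_formula, gap = abs(rr_formula - or)), 4)
```

The gap between the relative risk and the odds ratio grows with the baseline risk, so a rarer outcome brings the two measures closer together.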
Next we use nominalRR, with the exposure variable converted into a nominal variable Temp.factor that has three categories: low, medium, and high.
Note that the adjusted relative risk with basecov = "low", comparecov = "medium" is the reciprocal of that with basecov = "medium", comparecov = "low".
# define binary ozone level
ozonedat$ozone1 <- ifelse(ozonedat$Ozone < quantile(ozonedat$Ozone, prob = 0.1), 1, 0)
ozonedat$Temp.factor <- ifelse(ozonedat$Temp <= quantile(ozonedat$Temp, prob = 0.25), "low",
ifelse(ozonedat$Temp > quantile(ozonedat$Temp, prob = 0.8), "high", "medium"))
ozonedat$Temp.factor <- as.factor(ozonedat$Temp.factor)
ozone.fit.factor <- nominalRR(ozone1 ~ Temp.factor + Solar.R + Wind, data = ozonedat,
basecov = "low", comparecov = "medium",
fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)), boot = FALSE)
ozone.fit.factor2 <- nominalRR(ozone1 ~ Temp.factor + Solar.R + Wind, data = ozonedat,
basecov = "medium", comparecov = "low",
fixcov = data.frame(Solar.R = mean(ozonedat$Solar.R), Wind = mean(ozonedat$Wind)), boot = FALSE)
print(c(ozone.fit.factor$RR, ozone.fit.factor2$RR))
#> 1 1
#> 0.8622041 1.1598181
print(1/ozone.fit.factor2$RR)
#> 1
#> 0.8622041