Compared to the linear model, one advantage of the generalized linear model is its ability to model different relationships between the response variable and the predictors. One challenge is knowing which link function to use. In this vignette, we explore how different relationships affect correlation and the visual appearance of scatter plots.
For the identity link, the underlying model is \[Y = \beta_2X_2 + \beta_1X_1 + \beta_0\] Note there is no need to rearrange for Y because the link is the identity function.
Using the inverse link function, the underlying model is \[1/Y = \beta_2X_2 + \beta_1X_1 + \beta_0\]
Rearranging for Y, we get \[Y = 1 / (\beta_2X_2 + \beta_1X_1 + \beta_0)\] We see that the relationship between Y and X differs between the two models. This is the beauty of the glm framework: it can handle many different relationships between Y and X.
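In R, each link is carried by the family object itself, so we can inspect the transformation directly. A quick sketch using only the stats package:
gaussian(link = "identity")$linkfun(4)   # identity link: returns 4
gaussian(link = "inverse")$linkfun(4)    # inverse link: returns 1/4 = 0.25
gaussian(link = "inverse")$linkinv(0.25) # back-transform: returns 4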
First, let’s generate some data.
library(GlmSimulatoR)
library(ggplot2)
library(stats)
set.seed(1)
simdata <- simulate_gaussian(N = 1000, weights = c(1, 3), link = "inverse",
                             unrelated = 1, ancillary = 0.005)
Next, let’s do some basic data exploration. We see the response is gaussian.
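A histogram is the quickest check. The vignette’s original plotting code is not shown here, but a minimal ggplot2 version looks like this:
# Histogram of the simulated response.
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)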
The connection between Y and X1 is not obvious. There is only a slight downward trend, and one might mistake X1 for an unrelated variable.
There is a connection between Y and X2. No surprise, as the true weight is three.
The scatter plot between the unrelated variable and Y looks like random noise. Interestingly, X1’s scatter plot looks more like this one than like X2’s, even though X1 is included in the model.
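The scatter plots described above can be recreated along these lines (a sketch; the original plotting code may differ):
# Y against each predictor and against the unrelated variable.
ggplot(simdata, aes(x = X1, y = Y)) + geom_point()
ggplot(simdata, aes(x = X2, y = Y)) + geom_point()
ggplot(simdata, aes(x = Unrelated1, y = Y)) + geom_point()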
We see the correlation between Y and X2 is very strong; no surprise considering the above graph. The correlation between Y and X1 is somewhat larger in absolute value than the unrelated variable’s. However, if I did not know the correct model, I would not see this as particularly good news for predicting Y.
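These correlations come straight from the simulated data frame:
# Pairwise correlations; the Y row/column is the one of interest.
round(cor(simdata), 3)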
Pretending we don’t know the correct answer, let’s see if we can find the correct model. We will try three models: one with just X2, one with X1 and X2, and one with everything. Will the correct model stand out?
glmInverseX2 <- glm(Y ~ X2, data = simdata,
                    family = gaussian(link = "inverse"))
glmInverseX1X2 <- glm(Y ~ X1 + X2, data = simdata,
                      family = gaussian(link = "inverse"))
glmInverseX1X2U1 <- glm(Y ~ X1 + X2 + Unrelated1, data = simdata,
                        family = gaussian(link = "inverse"))
summary(glmInverseX2)
#>
#> Call:
#> glm(formula = Y ~ X2, family = gaussian(link = "inverse"), data = simdata)
#>
#> Deviance Residuals:
#>        Min          1Q      Median          3Q         Max
#> -0.0178626  -0.0044753   0.0001566   0.0044814   0.0213965
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  4.52643    0.07782   58.16   <2e-16 ***
#> X2           2.97858    0.05540   53.77   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 4.160438e-05)
#>
#>     Null deviance: 0.166947  on 999  degrees of freedom
#> Residual deviance: 0.041521  on 998  degrees of freedom
#> AIC: -7245.4
#>
#> Number of Fisher Scoring iterations: 4
summary(glmInverseX1X2)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2, family = gaussian(link = "inverse"),
#> data = simdata)
#>
#> Deviance Residuals:
#>        Min          1Q      Median          3Q         Max
#> -0.0175791  -0.0033415  -0.0000879   0.0037864   0.0157698
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.00730    0.09212   32.65   <2e-16 ***
#> X1           0.98424    0.04429   22.22   <2e-16 ***
#> X2           3.01378    0.04528   66.56   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 2.780264e-05)
#>
#>     Null deviance: 0.166947  on 999  degrees of freedom
#> Residual deviance: 0.027719  on 997  degrees of freedom
#> AIC: -7647.5
#>
#> Number of Fisher Scoring iterations: 4
summary(glmInverseX1X2U1)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + Unrelated1, family = gaussian(link = "inverse"),
#> data = simdata)
#>
#> Deviance Residuals:
#>        Min          1Q      Median          3Q         Max
#> -0.0173348  -0.0034416  -0.0000618   0.0037611   0.0157805
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.05119    0.11473  26.593   <2e-16 ***
#> X1           0.98359    0.04431  22.196   <2e-16 ***
#> X2           3.01314    0.04531  66.507   <2e-16 ***
#> Unrelated1  -0.02815    0.04368  -0.644    0.519
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 2.781895e-05)
#>
#>     Null deviance: 0.166947  on 999  degrees of freedom
#> Residual deviance: 0.027708  on 996  degrees of freedom
#> AIC: -7645.9
#>
#> Number of Fisher Scoring iterations: 4
Inspecting the three summaries, we see a few things. First, the correct model has the lowest AIC. Second, the slope estimates are very close to the true weights of 1 and 3, while the coefficient on Unrelated1 is not statistically significant. Overall, the mathematics behind the glm framework worked very well.
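The AIC comparison can be collected in a single call rather than read off three separate summaries:
# Lower is better; the X1 + X2 model wins (-7647.5, versus -7245.4 for
# the X2-only model and -7645.9 for the model including Unrelated1).
AIC(glmInverseX2, glmInverseX1X2, glmInverseX1X2U1)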
Above, we saw that the identity link assumes an additive relationship between Y and X. \[Y = \beta_2X_2 + \beta_1X_1 + \beta_0\]
For the log link, the underlying model is \[\ln(Y) = \beta_2X_2 + \beta_1X_1 + \beta_0\]
Rearranging for Y, we get \[Y = \exp(\beta_2X_2 + \beta_1X_1 + \beta_0)\] Splitting up the exponent, we get \[Y = \exp(\beta_2X_2)\exp(\beta_1X_1)\exp(\beta_0)\] Thus, for the log link, the relationship between Y and X is multiplicative rather than additive.
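A small numeric check of the multiplicative form, using arbitrary illustrative coefficients (1, 0.3, and 0.8) rather than anything estimated:
# The exponential of a sum equals the product of the exponentials.
exp(0.8 * 2 + 0.3 * 1 + 1)           # evaluated in one step
exp(0.8 * 2) * exp(0.3 * 1) * exp(1) # same value, factor by factor
# So a one-unit increase in X1 multiplies the mean of Y by exp(0.3), about 1.35.
exp(0.3)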
First, let’s generate some data.
library(GlmSimulatoR)
library(ggplot2)
library(stats)
set.seed(1)
simdata <- simulate_gaussian(N = 1000, weights = c(0.3, 0.8), link = "log",
                             unrelated = 1, ancillary = 1)
We see the response is somewhat gaussian.
As before, the connection between Y and X1 is not obvious.
There is a connection between Y and X2. No surprise, as the true weight is 0.8 on the log scale.
The scatter plot between the unrelated variable and Y looks like random noise.
Again, X2’s correlation with Y is large, X1’s is in the gray area, and the unrelated variable’s is near zero.
Pretending we don’t know the correct answer, let’s see if we can find the correct model. For a change, we will compare the three link functions for the gaussian family.
glmIdentity <- glm(Y ~ X1 + X2, data = simdata,
                   family = gaussian(link = "identity"))
glmInverse <- glm(Y ~ X1 + X2, data = simdata,
                  family = gaussian(link = "inverse"))
glmLog <- glm(Y ~ X1 + X2, data = simdata,
              family = gaussian(link = "log"))
summary(glmIdentity)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2, family = gaussian(link = "identity"),
#> data = simdata)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -3.2107  -0.7304  -0.0192   0.8116   3.5724
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -7.5429     0.2576  -29.28   <2e-16 ***
#> X1            3.6047     0.1207   29.88   <2e-16 ***
#> X2            9.3869     0.1174   79.94   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.208497)
#>
#>     Null deviance: 9850.0  on 999  degrees of freedom
#> Residual deviance: 1204.9  on 997  degrees of freedom
#> AIC: 3032.3
#>
#> Number of Fisher Scoring iterations: 2
summary(glmInverse)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2, family = gaussian(link = "inverse"),
#> data = simdata)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -4.3959  -0.7346  -0.0333   0.7515   3.4044
#>
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.2143035  0.0018807  113.95   <2e-16 ***
#> X1          -0.0216138  0.0007682  -28.14   <2e-16 ***
#> X2          -0.0624986  0.0008634  -72.39   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.25387)
#>
#>     Null deviance: 9850.0  on 999  degrees of freedom
#> Residual deviance: 1250.1  on 997  degrees of freedom
#> AIC: 3069.1
#>
#> Number of Fisher Scoring iterations: 4
summary(glmLog)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2, family = gaussian(link = "log"), data = simdata)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -3.5647  -0.6737  -0.0253   0.7407   3.1460
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.813052   0.021927   37.08   <2e-16 ***
#> X1          0.300177   0.009585   31.32   <2e-16 ***
#> X2          0.790980   0.009714   81.43   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.111197)
#>
#>     Null deviance: 9850.0  on 999  degrees of freedom
#> Residual deviance: 1107.9  on 997  degrees of freedom
#> AIC: 2948.3
#>
#> Number of Fisher Scoring iterations: 4
Again, the correct model has the lowest AIC, and the estimated weights are very close to the true values of 0.3 and 0.8.
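As before, the AICs can be gathered in one call and the estimates pulled out directly:
# The log link has the lowest AIC (2948.3, versus 3032.3 for identity and
# 3069.1 for inverse), and its weight estimates are near 0.3 and 0.8.
AIC(glmIdentity, glmInverse, glmLog)
coef(glmLog)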
We have explored different links for the gaussian distribution, but the gaussian distribution is not a special case. Everything done here could be done for any distribution in the glm framework. Once you understand one distribution, you are well on your way to understanding the others. The glm framework can handle categorical response variables (binomial), count response variables (poisson), and right-skewed response variables (gamma and inverse gaussian).
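For example, the same workflow carries over to a gamma response with only the family swapped out. A sketch, assuming GlmSimulatoR’s simulate_gamma() shares simulate_gaussian()’s interface (the weights and ancillary values here are illustrative, not taken from this vignette):
# Simulate a gamma response and fit it with the canonical inverse link.
set.seed(1)
simdata_gamma <- simulate_gamma(N = 1000, weights = c(1, 2), link = "inverse",
                                ancillary = 0.05)
glm_gamma <- glm(Y ~ X1 + X2, data = simdata_gamma,
                 family = Gamma(link = "inverse"))
summary(glm_gamma)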