Often the first obstacle to understanding the generalized linear model in a practical way is finding good data. Common problems include data sets with too few rows, a response variable that does not follow a family in the glm framework, and messy data that needs a lot of work before statistical analysis can begin. This package sidesteps all of these by letting you create the data you want. With data in hand, you can answer your questions empirically.
The goal of this package is to strike a balance between mathematical flexibility and simplicity of use. You can control the sample size, link function, number of unrelated variables, and ancillary parameter when applicable. Default values are carefully chosen so data can be generated without thinking about mathematical connections between weights, links, and distributions.
All functions return a tibble. The only thing that changes between functions is the distribution of Y. In simulate_gaussian, Y follows a gaussian distribution. In simulate_gamma, Y follows a gamma distribution. Think of these functions as helpers that create data. With this data, you can test questions and get familiar with the generalized linear model.
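As a quick illustration of the shared interface, the sketch below generates one gaussian and one gamma data set and inspects them. It assumes simulate_gamma accepts N in the same way simulate_gaussian does and that its default link and weights are usable as-is; the object names gaussian_data and gamma_data are only for illustration.

```r
library(GlmSimulatoR)

set.seed(1)

# Same interface, different response distribution
# (assumption: simulate_gamma takes N just like simulate_gaussian)
gaussian_data <- simulate_gaussian(N = 100)
gamma_data <- simulate_gamma(N = 100)

# Both results are tibbles holding the response Y and the predictors
class(gaussian_data)
names(gamma_data)

# A gamma response is strictly positive; a gaussian response need not be
range(gaussian_data$Y)
range(gamma_data$Y)
```

Comparing the two ranges is a simple way to see that only the distribution of Y differs between the helpers.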
library(GlmSimulatoR)
library(ggplot2)
set.seed(1)
simdata <- simulate_gaussian(N = 200, weights = c(1, 2, 3))
glmModel <- glm(Y ~ X1 + X2 + X3, data = simdata, family = gaussian(link = "identity"))
summary(glmModel)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + X3, family = gaussian(link = "identity"),
#> data = simdata)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -2.9437  -0.7212  -0.0368   0.7070   3.6459
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   2.9138     0.7012   4.156 4.84e-05 ***
#> X1            0.9834     0.2868   3.428  0.00074 ***
#> X2            1.7882     0.2702   6.619 3.39e-10 ***
#> X3            3.2822     0.2640  12.430  < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.178874)
#>
#> Null deviance: 479.50 on 199 degrees of freedom
#> Residual deviance: 231.06 on 196 degrees of freedom
#> AIC: 606.45
#>
#> Number of Fisher Scoring iterations: 2
rm(glmModel)
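In the above, the estimated coefficients for X1, X2, and X3 (0.98, 1.79, and 3.28) are close to the values passed to the weights argument (1, 2, and 3), even with only 200 rows. The mathematics behind the generalized linear model worked well.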
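To make that comparison less dependent on eyeballing the summary, the estimates can be lined up against the weights directly. The sketch below refits the same model and adds Wald confidence intervals from confint.default; the interval check is an addition for illustration, and true_weights, estimates, and wald_ci are hypothetical names.

```r
library(GlmSimulatoR)

set.seed(1)
true_weights <- c(1, 2, 3)
simdata <- simulate_gaussian(N = 200, weights = true_weights)
glmModel <- glm(Y ~ X1 + X2 + X3, data = simdata, family = gaussian(link = "identity"))

# Drop the intercept and keep only the slopes that correspond to the weights
estimates <- coef(glmModel)[c("X1", "X2", "X3")]

# Wald intervals (estimate +/- normal quantile * standard error)
wald_ci <- confint.default(glmModel)[c("X1", "X2", "X3"), ]

# One row per predictor: the weight used to simulate, the estimate, and its interval
data.frame(
  weight = true_weights,
  estimate = estimates,
  lower = wald_ci[, 1],
  upper = wald_ci[, 2]
)
rm(simdata, glmModel)
```

Rerunning this with a larger N is an easy way to watch the intervals tighten around the weights.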
library(GlmSimulatoR)
library(ggplot2)
set.seed(1)
simdata <- simulate_gaussian(link = "identity")
linearModel <- lm(Y ~ X1 + X2 + X3, data = simdata)
glmModel <- glm(Y ~ X1 + X2 + X3, data = simdata, family = gaussian(link = "identity"))
summary(linearModel)
#>
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3, data = simdata)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.6961 -0.6711  0.0049  0.6534  3.6232
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.06105    0.08961   34.16   <2e-16 ***
#> X1           0.99941    0.03428   29.15   <2e-16 ***
#> X2           1.98930    0.03456   57.56   <2e-16 ***
#> X3           2.98383    0.03471   85.97   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9976 on 9996 degrees of freedom
#> Multiple R-squared: 0.5377, Adjusted R-squared: 0.5375
#> F-statistic: 3875 on 3 and 9996 DF, p-value: < 2.2e-16
summary(glmModel)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + X3, family = gaussian(link = "identity"),
#> data = simdata)
#>
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max
#> -3.6961 -0.6711  0.0049  0.6534  3.6232
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.06105    0.08961   34.16   <2e-16 ***
#> X1           0.99941    0.03428   29.15   <2e-16 ***
#> X2           1.98930    0.03456   57.56   <2e-16 ***
#> X3           2.98383    0.03471   85.97   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.9952888)
#>
#> Null deviance: 21518.1 on 9999 degrees of freedom
#> Residual deviance: 9948.9 on 9996 degrees of freedom
#> AIC: 28338
#>
#> Number of Fisher Scoring iterations: 2
rm(simdata, linearModel, glmModel)
In the above, the coefficients and standard errors are the same for the linear model and the generalized linear model, and the glm dispersion parameter (0.9953) matches the square of the lm residual standard error (0.9976). This confirms that the linear model is a special case of the generalized linear model: the one with gaussian family and identity link.
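The same agreement can be checked programmatically rather than by reading two summaries side by side. This is a minimal sketch that refits both models on a smaller simulated data set (N = 1000 is an arbitrary choice) and compares them with all.equal, which tests equality up to numerical tolerance.

```r
library(GlmSimulatoR)

set.seed(1)
simdata <- simulate_gaussian(N = 1000, link = "identity", weights = c(1, 2, 3))

linearModel <- lm(Y ~ X1 + X2 + X3, data = simdata)
glmModel <- glm(Y ~ X1 + X2 + X3, data = simdata, family = gaussian(link = "identity"))

# Coefficients from both fits
same_coef <- all.equal(coef(linearModel), coef(glmModel))

# Standard errors from both fits
same_se <- all.equal(
  summary(linearModel)$coefficients[, "Std. Error"],
  summary(glmModel)$coefficients[, "Std. Error"]
)

# The glm dispersion estimate against the squared residual standard error from lm
same_dispersion <- all.equal(summary(glmModel)$dispersion, summary(linearModel)$sigma^2)

c(coef = isTRUE(same_coef), se = isTRUE(same_se), dispersion = isTRUE(same_dispersion))
rm(simdata, linearModel, glmModel)
```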