OddsPlotty creates odds plots for the results of a logistic regression.
The model is trained with caret, and the fitted model stored in its finalModel element is passed to OddsPlotty to generate the plot.
First we load the required packages. The example dataset used throughout is the BreastCancer data from the mlbench package:
#install.packages("mlbench")
#install.packages("caret")
library(mlbench)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(tibble)
library(ggplot2)
library(OddsPlotty)
library(e1071)
library(ggthemes)
Next we load the breast cancer data and keep only the complete cases:
data("BreastCancer", package = "mlbench")
# Keep only the complete cases of the breast cancer data (this works on a copy)
breast <- BreastCancer[complete.cases(BreastCancer), ]
# Drop the first column (the sample Id), which is not a predictor
breast <- breast[, -1]
head(breast, 10)
#> Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> Bl.cromatin Normal.nucleoli Mitoses Class
#> 1 3 1 1 benign
#> 2 3 2 1 benign
#> 3 3 1 1 benign
#> 4 3 7 1 benign
#> 5 3 1 1 benign
#> 6 9 7 1 malignant
#> 7 3 1 1 benign
#> 8 3 1 1 benign
#> 9 1 1 5 benign
#> 10 2 1 1 benign
# Make sure Class is a factor: benign is the first (reference) level, malignant the second
breast$Class <- factor(breast$Class)
str(breast)
#> 'data.frame': 683 obs. of 10 variables:
#> $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
#> $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
#> $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
#> $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
#> $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
#> $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
#> $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
#> $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
#> $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
#> $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
This takes care of the class encoding, but the predictors are still factors and need to be converted to numeric:
# Loop through the predictor columns (1 to 9) and convert each factor to its numeric value
for (i in 1:9) {
  breast[, i] <- as.numeric(as.character(breast[, i]))
}
str(breast)
#> 'data.frame': 683 obs. of 10 variables:
#> $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
#> $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
#> $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
#> $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
#> $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
#> $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
#> $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
#> $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
#> $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
#> $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
The predictors are now numeric and can be used in the GLM.
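The conversion goes through `as.character()` because calling `as.numeric()` directly on a factor returns the internal level codes rather than the labels. The toy factor below is invented for illustration and is not part of the breast cancer data. The same risk applies to any column whose levels do not run 1, 2, 3, ... without gaps, which may be the case for Mitoses, since it has only 9 levels.

```r
# A toy factor whose labels are numbers but whose levels sort as text
f <- factor(c("10", "2", "5", "2"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(levels(f))
print(codes)
print(values)

# The same conversion applied to every column of a data frame without a loop
toy <- data.frame(a = factor(c("3", "1")), b = factor(c("10", "7")))
toy[] <- lapply(toy, function(col) as.numeric(as.character(col)))
str(toy)
```

The `lapply()` form is only an alternative to the `for` loop used above; both produce the same result.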
I will use caret to train a Generalised Linear Model (GLM) with a binomial family, i.e. a logistic regression, as caret is the package OddsPlotty was built to work with. Please note: I am training on the full dataset rather than splitting it into training and test partitions, as would normally be done when building a predictive logistic regression model.
library(caret)
glm_model <- train(Class ~ .,
                   data = breast,
                   method = "glm",
                   family = "binomial")
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Once the model is trained, we can inspect the results with OddsPlotty.
The odds_plot function returns a list; the plot is stored in its odds_plot element and can be displayed from there:
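A minimal sketch of that call, modelled on the `odds_plot()` call shown later in this vignette. The title text is an assumption. The first lines rebuild the data and model from the steps above so the block runs on its own, and the whole block is skipped when the required packages are not installed.

```r
# Run only when every required package is available
needed <- c("mlbench", "caret", "e1071", "OddsPlotty")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  # Rebuild the numeric breast cancer data and the model from the steps above
  data("BreastCancer", package = "mlbench")
  breast <- BreastCancer[complete.cases(BreastCancer), -1]
  breast[1:9] <- lapply(breast[1:9], function(x) as.numeric(as.character(x)))
  glm_model <- caret::train(Class ~ ., data = breast,
                            method = "glm", family = "binomial")

  # Pass the underlying glm object to odds_plot(); the title is illustrative
  plotty <- OddsPlotty::odds_plot(glm_model$finalModel,
                                  title = "Odds Plot",
                                  subtitle = "Showing odds of cancer based on various factors")

  # The plot is stored in the returned list
  print(plotty$odds_plot)
}
```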
#> Waiting for profiling to be done...
Each odds plot has an associated tibble under the hood for querying. To access the tibble use:
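A minimal sketch of that lookup, assuming the returned list stores the tibble in an element named `odds_data`. That name is an assumption; `names(plotty)` shows the elements actually present. As before, the block rebuilds the model so it runs on its own, and it is skipped when the packages are missing.

```r
# Run only when every required package is available
needed <- c("mlbench", "caret", "e1071", "OddsPlotty")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  # Rebuild the numeric breast cancer data and the model from the steps above
  data("BreastCancer", package = "mlbench")
  breast <- BreastCancer[complete.cases(BreastCancer), -1]
  breast[1:9] <- lapply(breast[1:9], function(x) as.numeric(as.character(x)))
  glm_model <- caret::train(Class ~ ., data = breast,
                            method = "glm", family = "binomial")
  plotty <- OddsPlotty::odds_plot(glm_model$finalModel)

  # List the elements of the returned object
  print(names(plotty))

  # Assumed element name: print the tibble only if it exists under that name
  if ("odds_data" %in% names(plotty)) {
    print(plotty$odds_data)
  }
}
```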
#> # A tibble: 9 x 4
#> OR lower upper vars
#> <dbl> <dbl> <dbl> <chr>
#> 1 1.71 1.32 2.31 Cl.thickness
#> 2 0.994 0.674 1.55 Cell.size
#> 3 1.38 0.862 2.16 Cell.shape
#> 4 1.39 1.10 1.80 Marg.adhesion
#> 5 1.10 0.805 1.50 Epith.c.size
#> 6 1.47 1.23 1.78 Bare.nuclei
#> 7 1.56 1.13 2.23 Bl.cromatin
#> 8 1.24 0.998 1.56 Normal.nucleoli
#> 9 1.71 0.993 3.02 Mitoses
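To make the columns concrete: an odds ratio is the exponentiated coefficient of a logistic regression, and `lower` and `upper` bound its confidence interval. The sketch below builds a table of the same shape by hand in base R. It is an illustration only, not the package's implementation: it uses the built-in mtcars data instead of the breast cancer data, and Wald intervals from `confint.default()`. The "Waiting for profiling to be done..." message above suggests the package uses profile-likelihood intervals, so its numbers would differ slightly from a Wald calculation.

```r
# Illustrative logistic regression on the built-in mtcars data
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Odds ratios are the exponentiated coefficients
or <- exp(coef(fit))

# Wald confidence limits on the log-odds scale, exponentiated to the odds scale
ci <- exp(confint.default(fit))

odds_table <- data.frame(
  OR    = or,
  lower = ci[, 1],
  upper = ci[, 2],
  vars  = names(or),
  row.names = NULL
)

# Drop the intercept, which is not an odds ratio for a predictor
odds_table <- odds_table[odds_table$vars != "(Intercept)", ]
print(odds_table)
```

An interval that excludes 1 indicates a predictor whose association with the outcome is distinguishable from no effect at the chosen confidence level. In the tibble above this holds for Cl.thickness, Marg.adhesion, Bare.nuclei and Bl.cromatin.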
Additional parameters control the appearance of the plot: the point colour and size, the error bar colour and width, and the colour of the reference line. The example below sets them and then applies a different theme from ggthemes:
library(OddsPlotty)
library(ggthemes)
plotty <- OddsPlotty::odds_plot(glm_model$finalModel,
                                title = "Odds Plot with ggthemes Tufte Theme",
                                subtitle = "Showing odds of cancer based on various factors",
                                point_col = "#00f2ff",
                                error_bar_colour = "black",
                                point_size = .5,
                                error_bar_width = .8,
                                h_line_color = "red")
#> Waiting for profiling to be done...
plotty$odds_plot + ggthemes::theme_tufte()
This package was created by Gary Hutson (https://twitter.com/StatsGary).