
LogisticEnsembles-vignette

Welcome to LogisticEnsembles! This is an all-in-one solution for analyzing logistic data, which is commonly used in fields such as human resources, sports analytics, and talent analytics, among many others.

What is logistic data? It is binary data: 1 or 0, up or down, on or off, and so on.

The Lebron data set is logistic, and it will serve as our example.

The Lebron data set was originally posted at kaggle.com.

It was filtered for Lebron, but other players are also in the data set.

The head of the Lebron data set looks like this:

Head of the Lebron data set

The sixth column, result, is the target column. 1 indicates Lebron made the shot, 0 indicates he missed.

How to run LogisticEnsembles (using the Lebron data as an example):

Logistic(data = Lebron,
         colnum = 6,
         numresamples = 25,
         save_all_trained_models = "Y",
         how_to_handle_strings = 1,
         do_you_have_new_data = "N",
         remove_ensemble_correlations_greater_than = 1.00,
         use_parallel = "Y",
         train_amount = 0.60,
         test_amount = 0.20,
         validation_amount = 0.20)

What does LogisticEnsembles do?

The goal of LogisticEnsembles is to automatically conduct a thorough analysis of logistic data. The user only needs to provide the data and answer a few questions, such as which column is the target column, and LogisticEnsembles does the rest.

LogisticEnsembles builds 23 individual logistic models. For each model it converts the results into a probability between 0 and 1, and then makes predictions: if the probability is greater than 0.5, it assigns a result of 1, otherwise 0.
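As a language-agnostic illustration (a Python sketch, not the package's R internals), the probability-to-class conversion works like this:

```python
# Illustrative sketch: converting model probabilities into 0/1 class
# predictions with a 0.5 cutoff, as described above. The probability
# values here are made up for the example.
probabilities = [0.91, 0.12, 0.50, 0.73]

# A probability strictly greater than 0.5 becomes class 1, otherwise class 0.
predictions = [1 if p > 0.5 else 0 for p in probabilities]
print(predictions)  # [1, 0, 0, 1]
```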

Among the many results returned are ROC curves for all 36 models, with specificity on the x-axis and sensitivity on the y-axis. Here are nine of the 36 ROC curves automatically provided by the function:

Example of nine of the ROC curves. Note that two have an AUC = 1.

The 23 individual models are:

ADA Boost

Bagged Random Forest

Bayes GLM

Bayes RNN

C50

Cubist

Flexible Discriminant Analysis

Generalized Additive Models

Gradient Boosted

Linear Discriminant Analysis

Linear Model

Mixed Discriminant Analysis

Naive Bayes

Penalized Discriminant Analysis

Quadratic Discriminant Analysis

Random Forest

Ranger

RPart

Support Vector Machines

Trees

XGBoost

The 13 ensembles of models are:

Ensemble ADA Boost

Ensemble Bagging

Ensemble C50

Ensemble Gradient Boosted

Ensemble Partial Least Squares

Ensemble Penalized Discriminant Analysis

Ensemble Random Forest

Ensemble Ranger

Ensemble Regularized Discriminant Analysis

Ensemble RPart

Ensemble Support Vector Machines

Ensemble Trees

Ensemble XGBoost

Installation

You can install the development version of LogisticEnsembles like so:

devtools::install_github("InfiniteCuriosity/LogisticEnsembles")

Example

We will analyze the data on Lebron James. The LogisticEnsembles package automatically splits the data into train (60% in this case), test (20%), and validation (20%) sets, fits each model on the training data, makes predictions, and tracks accuracy on the test and validation (holdout) data.

Logistic(data = Lebron,
         colnum = 6,
         numresamples = 25,
         save_all_trained_models = "Y",
         how_to_handle_strings = 1,
         do_you_have_new_data = "N",
         remove_ensemble_correlations_greater_than = 1.00,
         use_parallel = "Y",
         train_amount = 0.60,
         test_amount = 0.20,
         validation_amount = 0.20)
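The 60/20/20 split can be sketched language-agnostically as follows (a Python illustration of the idea, not the package's R implementation; the function name and seed are hypothetical):

```python
import random

def split_60_20_20(rows, seed=42):
    """Shuffle row indices and split 60% train / 20% test / 20% validation."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.60 * len(rows))
    n_test = int(0.20 * len(rows))
    train = [rows[i] for i in idx[:n_train]]
    test = [rows[i] for i in idx[n_train:n_train + n_test]]
    validation = [rows[i] for i in idx[n_train + n_test:]]
    return train, test, validation

train, test, validation = split_60_20_20(list(range(100)))
print(len(train), len(test), len(validation))  # 60 20 20
```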

Here are a few of the 13 plots, all of which are produced automatically:

Accuracy by model and resample

Accuracy including train and holdout by model and resample

Boxplots of the numeric data

Duration barchart

Model accuracy barchart

Over or underfitting barchart

ROC curves

Target vs each predictor

36 summary tables (three are shown):

Logistic summary table

The function sums the tables across all of the resamples. This table shows that the BayesRNN model produced zero false negatives and zero false positives when the data was randomly resampled 25 times.
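The summing of per-resample tables can be sketched as follows (a Python illustration with made-up counts, not output from the package; each 2x2 table is laid out as [[TN, FP], [FN, TP]]):

```python
# Sketch: each resample yields a 2x2 confusion matrix [[TN, FP], [FN, TP]];
# the summary table is their element-wise sum. Counts here are illustrative.
resample_tables = [
    [[40, 0], [0, 35]],
    [[38, 0], [0, 37]],
    [[41, 0], [0, 34]],
]

total = [[0, 0], [0, 0]]
for table in resample_tables:
    for i in range(2):
        for j in range(2):
            total[i][j] += table[i][j]

print(total)  # [[119, 0], [0, 106]] -- zero false positives and false negatives
```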

Summary Report

The function automatically creates a summary report which includes the following:

Model name

Accuracy

True Positive Rate (also known as Sensitivity)

True Negative Rate (also known as Specificity)

False Positive Rate (also known as Type I Error)

False Negative Rate (also known as Type II Error)

Positive Predictive Value (also known as Precision)

Negative Predictive Value

F1 Score

Area Under the Curve

Overfitting Min

Overfitting Mean

Overfitting Max

Duration
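The metrics above all derive from the four confusion-matrix counts. As a sketch (Python, with illustrative counts rather than package output):

```python
# How the summary-report metrics relate to confusion-matrix counts.
# TP, TN, FP, FN are made-up values for illustration.
TP, TN, FP, FN = 90, 80, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)
tpr = TP / (TP + FN)   # True Positive Rate (sensitivity)
tnr = TN / (TN + FP)   # True Negative Rate (specificity)
fpr = FP / (FP + TN)   # False Positive Rate (Type I error)
fnr = FN / (FN + TP)   # False Negative Rate (Type II error)
ppv = TP / (TP + FP)   # Positive Predictive Value (precision)
npv = TN / (TN + FN)   # Negative Predictive Value
f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and sensitivity

print(round(accuracy, 3), round(tpr, 3), round(f1, 3))  # 0.85 0.818 0.857
```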

Finding the strongest predictor from the most accurate model(s)

If the trained models are saved to the Environment, they may be used to find the strongest predictors. Not all models report variable importance, but Random Forest does. For the Lebron model results, we would find the strongest predictors as follows:

rf_train_fit$importance

The result is:

Strongest predictors for Lebron making a basket

Grand Summary

The LogisticEnsembles package took the Lebron data and split it into train, test, and validation sets. From there it automatically built 23 individual models and 13 ensembles of models, which were randomly resampled 25 times. The function automatically returned the plots, a summary table for each model, and a summary report. From the most accurate models (in particular Random Forest), we were able to identify the strongest predictors of Lebron making a basket.
