The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Welcome to LogisticEnsembles! This is an all-in-one solution to logistic data, which is very commonly used in areas such as human resources, sports analytics, and talent analytics, among many other areas.
What is logistic data? This is data that is binary, such as 1 or 0, up or down, on or off, etc.
Lebron data is logistic, and will be our example.
The Lebron data set was originally posted at kaggle.com.
It was filtered for Lebron, but others players are also in the data set.
The head of the Lebron data set looks like this:
The sixth column, result, is the target column. 1 indicates Lebron made the shot, 0 indicates he missed.
How to run LogisticEnsembles (using the Lebron data as an example):
Logistic(data = Lebron,
colnum = 6,
numresamples = 25,
save_all_trained_models = "Y",
how_to_handle_strings = 1,
do_you_have_new_data = "N",
remove_ensemble_correlations_greater_than = 1.00,
use_parallel = "Y",
train_amount = 0.60,
test_amount = 0.20,
validation_amount = 0.20)
What does LogisticEnsembles do?
The goal of LogisticEnsembles is to automatically conduct a thorough analysis of data which includes logistic data. The user only needs to provide the data and answer a few questions, such as which column is the target column, and LogisticEnsembles will do all the rest.
LogisticEnsembles builds 23 individual logistic models. It does this by building a model, and then converting the results into a probability between 0 and 1. The function then makes predictions. If the probability is >0.5, it assigns a result of 1, otherwise 0.
One of the many results that are returned are ROC curves for all 36 models. Specificity is on the x-axis and sensitivity on the y-axis. Here are nine of the 36 total ROC curves automatically provided by the function:
The 23 individual models are:
ADA Boost
Bagged Random Forest
Bayes GLM
Bayes RNN
C50
Cubist
Flexible Discriminant Analysis
Generalized Additive Models
Gradient Boosted
Linear Discriminant Analysis
Linear Model
Mixed Discriminant Analysis
Naive Bayes
Penalized Discriminant Analysis
Quadratic Discriminant Analysis
Random Forest
Ranger
RPart
Support Vector Machines
Trees
XGBoost
The 13 ensembles of models are:
Ensemble ADA Boost
Ensemble Bagging
Ensemble C50
Ensemble Gradient Boosted
Ensemble Partial Least Squares
Ensemble Penalized Discriminant Analysis
Ensemble Random Forest
Ensemble Ranger
Ensemble Regularized Discriminant Analysis
Ensemble RPart
Ensemble Support Vector Machines
Ensemble Trees
Ensemble XGBoost
You can install the development version of LogisticEnsembles like so:
devtools::install_github("InfiniteCuriosity/LogisticEnsembles")
We will analyze the data on Lebron James. The LogisticEnsembles package will automatically split the data into train (60% in this case), test (20% in this case) and validation (20% in this case), fit each model on the training data, make predictions and track accuracy on the test and holdout data.
Logistic(data = Lebron,
colnum = 6,
numresamples = 25,
save_all_trained_models = "Y",
how_to_handle_strings = 1,
do_you_have_new_data = "N",
remove_ensemble_correlations_greater_than = 1.00,
use_parallel = "Y",
train_amount = 0.60,
test_amount = 0.20,
validation_amount = 0.20)
Here are a few of the 13 plots which are all produced automatically are:
Accuracy by model and resample
Accuracy including train and holdout by model and resample
Boxplots of the numeric data
Duration barchart
Model accuracy barchart
Over or underfitting barchart
ROC curves
Target vs each predictor
The function adds up each of the tables for each of the resamples. What this shows is there were zero false negatives or positives for the BayesRNN model when the data was randomly resampled 25 times.
The function automatically creates a summary report which includes the following:
Model name
Accuracy
True Positive Rate (also known as Sensitivity)
True Negative Rate (also known as Specificity)
False Positive Rate (also known as Type I Error)
False Negative Rate (also known as Type II Error)
Positive Predictive Value (also known as Precision)
Negative Predictive Value
F1 Score
Area Under the Curve
Overfitting Min
Overfitting Mean
Overfitting Max
Duration
If the trained models are saved to the Environment, then the models may be used to find the strongest predictor. Not all models have this, but Random Forest does. In the case of Lebron model results, we would find the strongest predictors for Lebron as follows:
rf_train_fit$importance
The result is:
The LogisticEnsembles package was able to use the data about Lebron, split it into train, test and validation. From there it automatically built 23 individual models and 13 ensembles of models, which were randomly resampled 25 times. The function automatically returned 12 plots, a summary table for each model, and a summary report. We are able to identify the strongest predictors of Lebron making a basket from the most accurate models (in particular Random Forest).
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.