Various utilities meant to aid in speeding up common statistical operations, such as:
- removing outliers and extremes
- generating probability density and cumulative distribution graphs with ggplot2
- running one-sample Kolmogorov-Smirnov tests against multiple distributions at once
- generating prediction plots with ggplot2
- scaling data and performing principal component analysis (PCA)
- plotting PCA with ggplot2
To install from CRAN:
install.packages("ztils")
To install the development version:
remotes::install_github("zachpeagler/ztils")
This function keeps only the rows of the dataframe whose variable values fall between the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the IQR.
This function has no defaults, as it is entirely dependent on the user input.
no_outliers(data,
var
)
Returns the specified dataframe data minus the rows containing outliers in the var variable.
no_outliers(iris, Sepal.Length)
This isn’t a great example because the iris dataset does not contain any statistical outliers.
This function keeps only the rows of the dataframe whose variable values fall between the first quartile minus 3.0 times the interquartile range (IQR) and the third quartile plus 3.0 times the IQR.
This function has no defaults, as it is entirely dependent on the user input.
no_extremes(data,
var
)
Returns the specified dataframe data minus the rows containing extremes in the var variable.
no_extremes(iris, Sepal.Length)
This isn’t a great example because the iris dataset does not contain any statistical extremes.
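Because iris has nothing to remove, a toy dataset makes the difference between the two rules easier to see. The following is a minimal base R sketch of the IQR fences described above, not the ztils implementation: `iqr_filter` is a hypothetical helper, it takes the column name as a string, and it assumes the default quantile method of `quantile()`.

```r
# Hypothetical helper (not part of ztils): keep rows whose value lies inside
# [Q1 - k * IQR, Q3 + k * IQR].
iqr_filter <- function(data, column, k = 1.5) {
  x <- data[[column]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  lower <- q[[1]] - k * spread
  upper <- q[[2]] + k * spread
  data[!is.na(x) & x >= lower & x <= upper, , drop = FALSE]
}

# Toy data: 1..10 plus one moderate outlier (20) and one extreme value (100)
toy <- data.frame(value = c(1:10, 20, 100))

trimmed_outliers <- iqr_filter(toy, "value", k = 1.5)  # drops 20 and 100
trimmed_extremes <- iqr_filter(toy, "value", k = 3.0)  # drops only 100

nrow(trimmed_outliers)  # 10
nrow(trimmed_extremes)  # 11
```

Here Q1 = 3.75 and Q3 = 9.25, so the upper fence is 17.5 with a multiplier of 1.5 and 25.75 with a multiplier of 3.0, which is why the value 20 survives only the second filter.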
This function gets the probability density function (PDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).
Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.
multipdf_cont(var,
seq_length = 50,
distributions = "all"
)
This function returns a dataframe with seq_length rows containing the observed density of var and the probability density function values for each selected distribution.
multipdf_cont(iris$Petal.Length)
multipdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
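To give a sense of what such a table contains, here is a rough base R sketch that builds a comparable dataframe by hand for two distributions. It is an illustration only: the column names are made up, and the parameter estimates (sample mean and standard deviation, on the raw and log scales) are an assumption that may differ from how ztils fits each distribution.

```r
x <- iris$Sepal.Length
seq_length <- 50

# Evenly spaced grid over the observed range of the variable
grid <- seq(min(x), max(x), length.out = seq_length)

# Assumed estimators: sample mean/sd for the normal distribution,
# mean/sd of log(x) for the lognormal distribution
pdf_table <- data.frame(
  x = grid,
  pdf_normal = dnorm(grid, mean = mean(x), sd = sd(x)),
  pdf_lognormal = dlnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x)))
)

# Kernel density estimate of the observed data on the same grid
dens <- density(x, from = min(x), to = max(x), n = seq_length)
pdf_table$density_observed <- dens$y

head(pdf_table)
```

The `log(x)` call is where negative inputs would fail, which matches the note above about the lognormal and gamma distributions.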
This function extends multipdf_cont and gets the probability density functions (PDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, labeling the plot with var_name if specified and with var otherwise.
multipdf_plot(var,
seq_length = 50,
distributions = "all",
palette = "oslo",
var_name = NULL
)
A plot showing the PDF of the selected variable against the selected distributions over the selected sequence length.
multipdf_plot(iris$Sepal.Length)
multipdf_plot(iris$Sepal.Length,
seq_length = 100,
distributions = c("normal", "lognormal", "gamma"),
palette = "bilbao",
var_name = "Sepal Length (cm)"
)
This function gets the cumulative distribution function (CDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).
Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.
multicdf_cont(var,
seq_length = 50,
distributions = "all"
)
This function returns a dataframe with seq_length rows containing the observed cumulative density and the cumulative distribution function values of var for each selected distribution.
multicdf_cont(iris$Petal.Length)
multicdf_cont(iris$Sepal.Length,
100,
c("normal", "lognormal")
)
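As a rough illustration of the comparison involved, the base R sketch below evaluates the empirical CDF and two candidate CDFs on a shared grid. It does not use ztils; the column names and the parameter estimates (sample mean and standard deviation for the normal, 1 / mean for the exponential rate) are assumptions made for the example.

```r
x <- iris$Petal.Length
grid <- seq(min(x), max(x), length.out = 50)

# Step function giving the observed cumulative proportion at any value
empirical <- ecdf(x)

cdf_table <- data.frame(
  x = grid,
  cdf_observed = empirical(grid),
  cdf_normal = pnorm(grid, mean = mean(x), sd = sd(x)),
  cdf_exponential = pexp(grid, rate = 1 / mean(x))
)

# Largest vertical gap between the observed CDF and each candidate on the grid
sapply(cdf_table[c("cdf_normal", "cdf_exponential")],
       function(p) max(abs(p - cdf_table$cdf_observed)))
```

A smaller gap suggests a closer fit on this grid, which is the same idea the Kolmogorov-Smirnov statistic formalizes.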
This function extends multicdf_cont and gets the cumulative distribution functions (CDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, labeling the plot with var_name if specified and with var otherwise.
multicdf_plot(var,
seq_length = 50,
distributions = "all",
palette = "oslo",
var_name = NULL
)
A plot showing the CDF of the selected variable against the selected distributions over the selected sequence length.
multicdf_plot(iris$Sepal.Length)
multicdf_plot(iris$Sepal.Length,
seq_length = 100,
distributions = c("normal", "lognormal", "gamma"),
palette = "bilbao",
var_name = "Sepal Length (cm)"
)
This function gets the distance and p-value from a one-sample Kolmogorov-Smirnov (KS) test for selected distributions against a continuous input variable. Possible distributions include “normal”, “lognormal”, “gamma”, “exponential”, and “all”.
multiks_cont(var,
distributions = "all"
)
Note: If using “lognormal” or “gamma” distributions, the target variable must be non-negative.
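Returns a dataframe with the distance and p-value for each KS test performed. The distance (the KS statistic) is the largest absolute difference between the empirical CDF of the target variable and the CDF of the candidate distribution, so smaller values indicate a closer fit. A p-value above 0.05 indicates that the target variable’s distribution is not significantly different from the specified distribution.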
multiks_cont(iris$Sepal.Length)
multiks_cont(iris$Sepal.Length, c("normal", "lognormal"))
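For readers who want to see the underlying test, the sketch below runs `stats::ks.test` against two distributions and collects the results into a similar table. It is an approximation of the idea rather than the ztils code: the parameter estimates and column names are assumptions, and `suppressWarnings` is used only because Sepal.Length contains tied values, which `ks.test` warns about.

```r
x <- iris$Sepal.Length

# Assumed parameter estimates; ztils may estimate them differently
tests <- list(
  normal = suppressWarnings(
    ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
  ),
  lognormal = suppressWarnings(
    ks.test(x, "plnorm", meanlog = mean(log(x)), sdlog = sd(log(x)))
  )
)

# One row per test: KS distance (D) and p-value
ks_table <- data.frame(
  distribution = names(tests),
  distance = sapply(tests, function(t) unname(t$statistic)),
  p_value = sapply(tests, function(t) t$p.value),
  row.names = NULL
)
ks_table
```

Note that when the parameters are estimated from the same sample being tested, as here, the p-values reported by `ks.test` are only approximate.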
This function calculates the pseudo R^2 (an approximation of the proportion of variance explained by the model) for a generalized linear model (GLM). GLMs don’t have a true R^2 due to the intrinsic difference between a linear model and a generalized linear model, but we can still calculate an approximation of the R^2 as 1 - (deviance / null deviance).
glm_pseudor2(mod)
Returns the pseudo R^2 value of the model.
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
glm_pseudor2(gmod)
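The formula can be checked by hand from the fitted model object, since `glm` stores both deviances. The sketch below is a manual calculation rather than the ztils source; it also compares the result with the ordinary R^2 from `lm`, which should agree here because the example model uses the default gaussian family.

```r
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Pseudo R^2 as 1 - (residual deviance / null deviance)
pseudo_r2 <- 1 - gmod$deviance / gmod$null.deviance
pseudo_r2

# With a gaussian family the deviances are sums of squares,
# so the pseudo R^2 coincides with the R^2 of the equivalent lm
lmod <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
all.equal(pseudo_r2, summary(lmod)$r.squared)
```

For other families (for example binomial or poisson) the two quantities no longer coincide, which is where a pseudo R^2 is most useful.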
This function performs a principal component analysis (PCA) for the selected pcavars with the option to automatically scale the variables. It then graphs PC1 on the x axis and PC2 on the y axis using ggplot2, coloring the graph with a scico palette over the specified groups. This is similar to the biplot command from the stats package, but performs all the steps required in graphing a PCA for you.
pca_plot(group,
pcavars,
scaled = FALSE,
palette = "oslo"
)
A ggplot object showing PC1 on the x axis and PC2 on the y axis, colored by group, with vectors and labels showing the individual PCA variables.
pca_plot(iris$Species, iris[,c(1:4)])
pca_plot(iris$Species, iris[,c(1:4)], FALSE, "bilbao")
This function performs a principal component analysis (PCA) on the specified variables, pcavars, and attaches the resulting principal components to the specified dataframe, data, with optional variable scaling.
pca_data(data,
pcavars,
scaled = FALSE
)
Returns a dataframe with principal components as additional columns.
pca_data(iris, iris[,c(1:4)], FALSE)
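The same steps can be sketched in base R with `stats::prcomp`, which may help when deciding whether to scale. This is an assumed equivalent, not the ztils implementation; in particular, whether ztils uses `prcomp` internally is not stated here.

```r
# PCA on the four numeric columns, centered and scaled to unit variance
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Attach the component scores (PC1..PC4) to the original dataframe
iris_pca <- cbind(iris, pca$x)
names(iris_pca)

# Share of total variance captured by the first two components
summary(pca)$importance["Proportion of Variance", 1:2]

# Loadings: the direction of each original variable in component space
pca$rotation[, 1:2]
```

The scores in `pca$x` are what a PC1 versus PC2 scatter plot shows, and the loadings in `pca$rotation` correspond to the variable vectors drawn on a biplot.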
This function performs a prediction based on the supplied model, then graphs it using ggplot2. Options are available for predicting based on the confidence or prediction interval, as well as for applying corrections, such as exponential and logistic.
I would like to alter this function to reduce the number of required inputs, as all the information should be available from the model call, but that’s a work in progress.
predict_plot(mod,
data,
rvar,
pvar,
group = NULL,
length = 50,
interval = "confidence",
correction = "normal",
palette = "oslo"
)
Returns a plot with the observed (real) data plotted as points and the prediction plotted as lines, with a 95% confidence or prediction interval.
This function has a known issue with the colors on ungrouped predictions being kind of funky, as the function uses the predictor variable (x-axis) for the color, which works for the actual data (points), but doesn’t translate well to the predicted lines and ribbon.
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
predict_plot(mod1, iris, Sepal.Length, Petal.Length, Species)
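The difference between the two interval types can be seen without plotting by calling base R's `predict` directly. The sketch below is an illustration under assumptions: the grid spans the overall range of Petal.Length for every species, which may not match how predict_plot builds its prediction data.

```r
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Prediction grid: 50 evenly spaced Petal.Length values for each species
new_data <- expand.grid(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length),
                     length.out = 50),
  Species = levels(iris$Species)
)

# Each call returns a matrix with columns fit, lwr, and upr (95% by default)
conf <- predict(mod1, newdata = new_data, interval = "confidence")
pred <- predict(mod1, newdata = new_data, interval = "prediction")

pred_table <- cbind(new_data, conf)
head(pred_table)

# Prediction intervals are wider than confidence intervals at every point
all((pred[, "upr"] - pred[, "lwr"]) > (conf[, "upr"] - conf[, "lwr"]))
```

The confidence interval describes uncertainty in the fitted mean, while the prediction interval also accounts for the scatter of individual observations, which is why it is always the wider of the two.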
If you find any bugs, please report them at https://github.com/zachpeagler/ztils/issues.