
Diagnostic Plots for Fitting Distributions

Thomas Roh

December 17, 2017

The fitur package includes several tools for visually inspecting how well a distribution fits a dataset. To start, fictional empirical data are generated below. Typically, the data would come from a real-world source, such as the time it takes to serve a customer at a bank, the length of stay in an emergency department, or customer arrivals to a queue.

library(fitur)
library(ggplot2)
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)

Histogram

The histogram below shows the shape of the distribution. The y-axis is set to show probability density rather than counts.

dt <- data.frame(x)
nbins <- 30
g <- ggplot(dt, aes(x)) +
  geom_histogram(aes(y = after_stat(density)),
                bins = nbins, fill = NA, color = "black") +
  theme_bw() +
  theme(panel.grid = element_blank())
g
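
To make the density scaling concrete, here is a minimal base R sketch (not part of fitur) that recomputes the histogram without drawing it and checks that the bar areas sum to one. The 30 equal-width breaks over the data range are an assumption for illustration; ggplot2 may place its bin edges differently.

```r
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)

# Assumption: 30 equal-width bins spanning the observed range of x.
breaks <- seq(min(x), max(x), length.out = 31)
h <- hist(x, breaks = breaks, plot = FALSE)

# On the density scale, each bar's height times its width is the share of
# observations in that bin, so the areas add up to 1.
bin_width <- diff(h$breaks)
area <- sum(h$density * bin_width)
area
```

Because the areas sum to one, a fitted probability density function can be drawn on the same axes without any rescaling.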

Histogram vs Density Plot

Three distributions are chosen below to test against the dataset. Using the fit_univariate function, each distribution is fit to the data and returned as a fitted object. The first item in each fitted object is the probability density function. Each fit is overplotted onto the histogram to see which distribution fits best.

dists <- c('gamma', 'lnorm', 'weibull')
multipleFits <- lapply(dists, fit_univariate, x = x)
## $start.arg
## $start.arg$shape
## [1] 18.97398
## 
## $start.arg$rate
## [1] 20.68217
## 
## 
## $fix.arg
## NULL
## 
## $start.arg
## $start.arg$meanlog
## [1] -0.1162831
## 
## $start.arg$sdlog
## [1] 0.2560369
## 
## 
## $fix.arg
## NULL
## 
## $start.arg
## $start.arg$shape
## [1] 4.686591
## 
## $start.arg$scale
## [1] 1.005784
## 
## 
## $fix.arg
## NULL
plot_density(x, multipleFits, 30) + theme_bw() +
  theme(panel.grid = element_blank())
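
The comparison that the overlay makes by eye can also be sketched numerically in base R. This is an illustration under stated assumptions, not what plot_density does internally: the lognormal parameters come from the closed-form maximum likelihood estimates, the Weibull curve uses the generating parameters (shape = 5, scale = 1) as a stand-in for a fitted object, and the 30 equal-width bins are hypothetical.

```r
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)

# Assumption: 30 equal-width bins spanning the observed range of x.
breaks <- seq(min(x), max(x), length.out = 31)
h <- hist(x, breaks = breaks, plot = FALSE)

# Lognormal: closed-form maximum likelihood estimates on the log scale.
meanlog_hat <- mean(log(x))
sdlog_hat <- sqrt(mean((log(x) - meanlog_hat)^2))
dens_lnorm <- dlnorm(h$mids, meanlog = meanlog_hat, sdlog = sdlog_hat)

# Weibull: generating parameters used in place of fitted values (assumption).
dens_weibull <- dweibull(h$mids, shape = 5, scale = 1)

# Mean absolute gap between histogram bar heights and each density curve,
# evaluated at the bin midpoints.
err <- c(lnorm = mean(abs(h$density - dens_lnorm)),
         weibull = mean(abs(h$density - dens_weibull)))
err
```

A smaller gap means the curve tracks the histogram more closely, which is the same judgment the overlaid plot supports visually.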

Q-Q Plot

The next plot is the quantile-quantile (Q-Q) plot. The plot_qq function takes a numeric vector x of empirical data and sorts it. A range of probabilities is computed and then used to compute comparable quantiles using the q distribution function from each fitted object. A good fit closely follows the reference line y = x. Note: the Q-Q plot tends to be more sensitive around the tails of the distributions.

plot_qq(x, multipleFits) +
  theme_bw() +
  theme(panel.grid = element_blank())
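
The Q-Q construction described above can be sketched by hand in base R. The plotting positions (i - 0.5) / n are an assumption, since plot_qq may use a different probability grid, and the Weibull quantiles use the generating parameters rather than a fitted object.

```r
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)
n <- length(x)

# Sample quantiles are simply the sorted observations.
sample_q <- sort(x)

# Assumption: plotting positions (i - 0.5) / n, which stay strictly inside (0, 1).
probs <- (seq_len(n) - 0.5) / n

# Theoretical quantiles from each candidate's q function.
theo_weibull <- qweibull(probs, shape = 5, scale = 1)
theo_lnorm <- qlnorm(probs, meanlog = mean(log(x)), sdlog = sd(log(x)))

# Largest vertical distance from the reference line y = x for each candidate.
max_dev <- c(weibull = max(abs(sample_q - theo_weibull)),
             lnorm = max(abs(sample_q - theo_lnorm)))
max_dev
```

Plotting theo_weibull against sample_q gives the points of a Q-Q plot. The largest deviations typically appear at the extreme quantiles, which reflects the tail sensitivity noted above.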

P-P Plot

The Percentile-Percentile (P-P) plot rescales the input data to the interval (0, 1] and then calculates the theoretical percentiles to compare. The plot_pp function takes the same inputs as plot_qq, but it rescales x and then computes the percentiles using the p distribution function of each fitted object. A good fit follows the reference line y = x. Note: the P-P plot tends to be more sensitive in the middle of the distribution.

plot_pp(x, multipleFits) +
  theme_bw() +
  theme(panel.grid = element_blank())
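
A comparable base R sketch for the P-P construction follows. Rescaling by rank, i / n, is one way to map the data onto (0, 1] and is an assumption about what plot_pp does. As before, the Weibull curve uses the generating parameters in place of a fitted object.

```r
set.seed(438)
x <- rweibull(10000, shape = 5, scale = 1)
n <- length(x)
x_sorted <- sort(x)

# Assumption: empirical percentiles are the ranks rescaled to (0, 1].
emp_p <- seq_len(n) / n

# Theoretical percentiles from each candidate's p function.
theo_weibull <- pweibull(x_sorted, shape = 5, scale = 1)
theo_lnorm <- plnorm(x_sorted, meanlog = mean(log(x)), sdlog = sd(log(x)))

# Largest vertical distance from the reference line y = x for each candidate.
max_dev <- c(weibull = max(abs(emp_p - theo_weibull)),
             lnorm = max(abs(emp_p - theo_lnorm)))
max_dev
```

Both axes are bounded by 0 and 1, so the curves are pinned near the corners and any departure from the reference line shows up mostly in the middle of the distribution.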
