summ_*()
The concept of summary functions is to take one or more pdqr-functions and return a summary value (which need not be a number). The method argument is used to choose a function-specific algorithm of computation.
Note that some summary functions can accumulate pdqr approximation error (summ_moment(), for example). For better precision, increase the number of intervals of the piecewise-linear density using either the n argument for density() in new_*() or the n_grid argument in as_*().
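For example, here is a minimal sketch of the latter option (the n_grid value is only an illustration, not a recommendation):
# Default `n_grid` is 10001; a larger grid gives a finer piecewise-linear density
my_beta_precise <- as_d(dbeta, shape1 = 2, shape2 = 5, n_grid = 100001)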
We will use the following distributions throughout this vignette:
# Beta(2, 5) distribution
my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
# Normal distribution with mean 0.5 (and default `sd = 1`)
my_norm <- as_d(dnorm, mean = 0.5)
# Equal-weight mixture of `my_beta` and its version shifted by 1
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))
Although all of them are continuous, discrete distributions are also fully supported.
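For instance, a minimal sketch with a toy discrete distribution (not used elsewhere in this vignette):
# Discrete distribution on 1:4 with explicit probabilities
my_dis <- new_d(data.frame(x = 1:4, prob = c(0.1, 0.2, 0.3, 0.4)), type = "discrete")
# Exact weighted mean: 1*0.1 + 2*0.2 + 3*0.3 + 4*0.4 = 3
summ_mean(my_dis)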
# Usage of `summ_center()`
summ_center(my_beta, method = "mean")
#> [1] 0.2857136
summ_center(my_beta, method = "median")
#> [1] 0.2644498
summ_center(my_beta, method = "mode")
#> [1] 0.2000014
# Usage of wrappers
summ_mean(my_beta)
#> [1] 0.2857136
summ_median(my_beta)
#> [1] 0.2644498
summ_mode(my_beta)
#> [1] 0.2000014
# `summ_mode()` can compute local modes instead of default global
summ_mode(my_beta_mix, method = "local")
#> [1] 0.2000014 1.2000014
# Usage of `summ_spread()`
summ_spread(my_beta, method = "sd")
#> [1] 0.1597178
summ_spread(my_beta, method = "var")
#> [1] 0.02550977
summ_spread(my_beta, method = "iqr")
#> [1] 0.2283162
summ_spread(my_beta, method = "mad")
#> [1] 0.1122417
summ_spread(my_beta, method = "range")
#> [1] 0.955573
# Usage of wrappers
summ_sd(my_beta)
#> [1] 0.1597178
summ_var(my_beta)
#> [1] 0.02550977
summ_iqr(my_beta)
#> [1] 0.2283162
summ_mad(my_beta)
#> [1] 0.1122417
summ_range(my_beta)
#> [1] 0.955573
summ_moment() has extra arguments for controlling the nature of the moment (which can be combined):
summ_moment(my_beta, order = 3)
#> [1] 0.0476182
summ_moment(my_beta, order = 3, central = TRUE)
#> [1] 0.002429287
summ_moment(my_beta, order = 3, standard = TRUE)
#> [1] 11.68727
summ_moment(my_beta, order = 3, absolute = TRUE)
#> [1] 0.0476182
There are wrappers for the most common moments: skewness and kurtosis.
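A minimal sketch of these wrappers (outputs omitted, as exact values depend on approximation precision):
summ_skewness(my_beta)
summ_kurtosis(my_beta)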
summ_quantile(f, probs) is essentially a stricter version of as_q(f)(probs).
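A sketch of the equivalence (outputs omitted):
summ_quantile(my_beta, probs = c(0.25, 0.5, 0.75))
# Essentially the same values, with less strict input handling
as_q(my_beta)(c(0.25, 0.5, 0.75))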
summ_entropy() computes differential entropy (which can be negative) for “continuous” type pdqr-functions, and information entropy for “discrete” type:
summ_entropy(my_beta)
#> [1] -0.4845421
summ_entropy(new_d(1:10, type = "discrete"))
#> [1] 2.302585
summ_entropy2() computes an entropy-based summary of the relation between a pair of distributions. There are two methods: the default “relative” (for relative entropy, which is Kullback-Leibler divergence) and “cross” (for cross-entropy). It handles different supports by using a clip value (default exp(-20)) instead of 0 during log() computation. The order of inputs does matter: summ_entropy2() uses the support of the first pdqr-function as the integration/summation reference.
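A minimal usage sketch (outputs omitted; swapping the arguments changes the result):
# Kullback-Leibler divergence of `my_norm` from `my_beta`
summ_entropy2(my_beta, my_norm, method = "relative")
# Cross-entropy between the same pair
summ_entropy2(my_beta, my_norm, method = "cross")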
Distributions can be summarized with regions: unions of closed intervals. A region is represented as a data frame whose rows represent intervals, with two columns “left” and “right” holding the left and right interval edges respectively.
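Since a region is a plain data frame, it can also be constructed manually; a minimal sketch (my_region is purely illustrative):
# A region with two intervals: [0, 0.5] and [1, 1.5]
my_region <- data.frame(left = c(0, 1), right = c(0.5, 1.5))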
summ_interval() summarizes an input pdqr-function with a single interval based on the desired coverage level supplied in the level argument. It has three methods:
- “minwidth”: an interval with total probability of level that has minimum width.
- “percentile”: an interval between the 0.5*(1-level) and 1 - 0.5*(1-level) quantiles.
- “sigma”: an interval centered at the mean of the distribution, with left and right edges distant from the center by the standard deviation multiplied by level’s critical value (computed from the normal distribution). Corresponds to the classical confidence interval of a sample mean based on the assumption of normality.
summ_hdr() computes the highest density region (HDR) of a distribution: a set of intervals with the lowest total width among all sets with total probability not less than an input level. For a unimodal distribution it is essentially the same as summ_interval() with the “minwidth” method.
# Unimodal distribution
summ_hdr(my_beta, level = 0.9)
#> left right
#> 1 0.03013169 0.5253741
# Multimodal distribution
summ_hdr(my_beta_mix, level = 0.9)
#> left right
#> 1 0.03015315 0.5252785
#> 2 1.03015315 1.5252785
# Compare this to single interval of minimum width
summ_interval(my_beta_mix, level = 0.9, method = "minwidth")
#> left right
#> 1 0.05125316 1.448297
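For completeness, here is a sketch of the other two summ_interval() methods on the same input (outputs omitted):
# Middle 90% between the 0.05 and 0.95 quantiles
summ_interval(my_beta_mix, level = 0.9, method = "percentile")
# Mean plus/minus standard deviation times the normal critical value
summ_interval(my_beta_mix, level = 0.9, method = "sigma")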
There is a region_*() family of functions that helps to work with regions:
beta_mix_hdr <- summ_hdr(my_beta_mix, level = 0.9)
beta_mix_interval <- summ_interval(my_beta_mix, level = 0.9)
# Test if points are inside region
region_is_in(beta_mix_hdr, x = seq(0, 2, by = 0.5))
#> [1] FALSE TRUE FALSE TRUE FALSE
# Compute total probability of a region
region_prob(beta_mix_hdr, f = my_beta_mix)
#> [1] 0.899991
# Pdqr-function doesn't need to be the same as used for computing region
region_prob(beta_mix_hdr, f = my_norm)
#> [1] 0.336239
# Compute height of region: minimum value of d-function inside region
region_height(beta_mix_hdr, f = my_beta_mix)
#> [1] 0.400163
# Compute width of region: sum of interval widths
region_width(beta_mix_hdr)
#> [1] 0.9902507
# Compare widths with single interval
region_width(beta_mix_interval)
#> [1] 1.397043
# Draw region on existing plot
plot(my_beta_mix, main = "90% highest density region")
region_draw(beta_mix_hdr)
The summ_distance() function takes two pdqr-functions and returns a distance between the two distributions they represent. Many methods of computation are available. This might be useful for comparative statistical inference.
# Kolmogorov-Smirnov distance
summ_distance(my_beta, my_norm, method = "KS")
#> [1] 0.419766
# Total variation distance
summ_distance(my_beta, my_norm, method = "totvar")
#> [1] 0.730451
# Probability of one distribution being bigger than other, normalized to [0;1]
summ_distance(my_beta, my_norm, method = "compare")
#> [1] 0.1678761
# Wasserstein distance: "average path density point should travel while
# transforming from one into another"
summ_distance(my_beta, my_norm, method = "wass")
#> [1] 0.6952109
# Cramer distance: integral of squared difference of p-functions
summ_distance(my_beta, my_norm, method = "cramer")
#> [1] 0.1719884
# "Align" distance: path length for which one of distribution should be "moved"
# towards the other so that they become "aligned" (probability of one being
# greater than the other is 0.5)
summ_distance(my_beta, my_norm, method = "align")
#> [1] 0.2147014
# "Entropy" distance: `KL(f, g) + KL(g, f)`, where `KL()` is Kullback-Leibler
# divergence. Usually should be used for distributions with same support, but
# works even if they are different (with big numerical penalty).
summ_distance(my_beta, my_norm, method = "entropy")
#> [1] 13.05768
The summ_separation() function computes a threshold that optimally separates the distributions represented by a pair of input pdqr-functions. In other words, summ_separation() solves a binary classification problem with a one-dimensional linear classifier: values not more than some threshold are classified as one class, and values more than the threshold as another. The order of input functions doesn’t matter.
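A minimal usage sketch (output omitted):
# Separation threshold based on the Kolmogorov-Smirnov statistic
summ_separation(my_beta, my_norm, method = "KS")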
The summ_classmetric() and summ_classmetric_df() functions compute metric(s) of a classification setup similar to the one used in summ_separation(). Here the classifier threshold should be supplied, and the order of inputs matters. Classification is assumed to be done as follows: any x value not more than the threshold is classified as “negative”; if more, as “positive”. Classification metrics are computed based on two pdqr-functions: f, which represents the distribution of values that should be classified as “negative” (“true negative”), and g, the same for “positive” (“true positive”).
# Many threshold values can be supplied
thres_vec <- seq(0, 1, by = 0.2)
summ_classmetric(f = my_beta, g = my_norm, threshold = thres_vec, method = "F1")
#> [1] 0.5138193 0.5436321 0.6089061 0.6131005 0.5522756 0.4715757
# In `summ_classmetric_df()` many methods can be supplied
summ_classmetric_df(
f = my_beta, g = my_norm, threshold = thres_vec, method = c("GM", "F1", "MCC")
)
#> threshold GM F1 MCC
#> 1 0.0 0.0000000 0.5138193 -0.42709306
#> 2 0.2 0.4614729 0.5436321 -0.03892981
#> 3 0.4 0.6433485 0.6089061 0.31475763
#> 4 0.6 0.6643221 0.6131005 0.48370132
#> 5 0.8 0.6176386 0.5522756 0.48316011
#> 6 1.0 0.5554612 0.4715757 0.42709306
With summ_roc() and summ_rocauc() one can compute a data frame of ROC curve points and the ROC AUC value respectively. There is also a roc_plot() function for predefined plotting of the ROC curve.
my_roc <- summ_roc(my_beta, my_norm)
head(my_roc)
#> threshold fpr tpr
#> 1 5.253425 0 0.000000e+00
#> 2 5.243918 0 4.811646e-08
#> 3 5.234412 0 9.845779e-08
#> 4 5.224905 0 1.511167e-07
#> 5 5.215398 0 2.061949e-07
#> 6 5.205891 0 2.637984e-07
summ_rocauc(my_beta, my_norm)
#> [1] 0.583938
roc_plot(my_roc)
‘pdqr’ has functions that can order a set of distributions: summ_order(), summ_sort(), and summ_rank(), which are analogues of order(), sort(), and rank() respectively. They take a list of pdqr-functions as input, establish their ordering based on a specified method, and return the desired output.
There are two sets of methods:
- Method “compare” (the default). Distribution f is greater than g if and only if P(f >= g) > 0.5, or in ‘pdqr’ code summ_prob_true(f >= g) > 0.5. This method orders input based on this relation and the order() function. Note that this relation is not transitive, so a naive ordering could depend on the order of elements in the input list; to avoid this, the input is first preordered using method “mean” and then sorted based on the relation.
- Methods “mean”, “median”, and “mode” via summ_center(): ordering of distributions is defined as the ordering of the corresponding measures of the distributions’ centers.
# Here the only clear "correct" ordering is that `a <= b`.
f_list <- list(a = my_beta, b = my_beta + 1, c = my_norm)
# Returns an integer vector representing a permutation which rearranges f_list
# in desired order
summ_order(f_list, method = "compare")
#> [1] 1 3 2
# In this particular case of `f_list` all orderings agree with each other, but
# generally this is not the case: for any pair of methods there is a case
# when they disagree with each other
summ_order(f_list, method = "mean")
#> [1] 1 3 2
summ_order(f_list, method = "median")
#> [1] 1 3 2
summ_order(f_list, method = "mode")
#> [1] 1 3 2
# Use `decreasing = TRUE` to sort decreasingly
summ_order(f_list, method = "compare", decreasing = TRUE)
#> [1] 2 3 1
# Sort list
summ_sort(f_list)
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)
#>
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#>
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
summ_sort(f_list, decreasing = TRUE)
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
#>
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#>
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)
# Rank elements: 1 indicates "the smallest", `length(f_list)` - "the biggest"
summ_rank(f_list)
#> a b c
#> 1 3 2
The summ_prob_true() and summ_prob_false() functions should be used to extract probabilities from boolean pdqr-functions: outputs of comparison operators (like >=, ==, etc.):
summ_prob_true(my_beta >= my_norm)
#> [1] 0.416062
summ_prob_false(my_beta >= 2*my_norm)
#> [1] 0.6391
summ_pval() computes p-value(s) of observed statistic(s) based on the distribution. You can compute left, right, or two-sided p-values with methods “left”, “right”, and “both” respectively. By default, multiple input values are adjusted for multiple comparisons (using stats::p.adjust()):
# By default two-sided p-value is computed
summ_pval(my_beta, obs = 0.7)
#> [1] 0.02186803
summ_pval(my_beta, obs = 0.7, method = "left")
#> [1] 0.989066
summ_pval(my_beta, obs = 0.7, method = "right")
#> [1] 0.01093401
# Multiple values are adjusted with `p.adjust()` with "holm" method by default
obs_vec <- seq(0, 1, by = 0.1)
summ_pval(my_beta, obs = obs_vec)
#> [1] 0.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
#> [6] 1.0000000000 0.4915085377 0.1530761780 0.0255840348 0.0009720023
#> [11] 0.0000000000
# Use `adjust = "none"` to not adjust
summ_pval(my_beta, obs = obs_vec, adjust = "none")
#> [1] 0.0000000000 0.2285302047 0.6892806594 0.8403488674 0.4665584871
#> [6] 0.2187482323 0.0819180896 0.0218680254 0.0031980044 0.0001080003
#> [11] 0.0000000000