Package MKdescr includes a collection of functions that I found useful in my daily work. It contains several functions for descriptive statistical data analysis. Most of the functions were extracted from package MKmisc. Due to the growing number of dependencies required for MKmisc, I decided to split the package into smaller packages, each offering specific functionality.
We first load the package.
library(MKdescr)
I implemented function IQrange before the standard function IQR gained the type argument. Since 2010 (r53643, r53644), IQrange has been identical to function IQR.
x <- rnorm(100)
IQrange(x)
## [1] 1.441639
IQR(x)
## [1] 1.441639
It is also possible to compute a standardized version of the IQR, which gives a normal-consistent estimate of the standard deviation.
sIQR(x)
## [1] 1.068689
sd(x)
## [1] 1.018492
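The standardization can be reproduced in base R. The following is a minimal sketch, assuming the usual normal-consistency constant 2 * qnorm(0.75) (about 1.349); the helper name siqr_sketch is hypothetical and not part of MKdescr. Dividing the IQR printed above by this constant, 1.441639 / 1.349, gives about 1.0687, which agrees with the sIQR output.

```r
# Hypothetical helper (not part of MKdescr): IQR rescaled so that it
# estimates the standard deviation consistently under normality.
siqr_sketch <- function(x, na.rm = FALSE) {
  # For N(mu, sigma^2) the IQR equals 2 * qnorm(0.75) * sigma
  IQR(x, na.rm = na.rm) / (2 * qnorm(0.75))
}

# Standard normal quantiles on a fine grid, so the result is close to 1
z <- qnorm(ppoints(10001))
siqr_sketch(z)
```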
The mean absolute deviation under the assumption of symmetry is a robust alternative to the sample standard deviation.
meanAD(x)
## [1] 1.033232
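As a rough illustration of the idea, the sketch below computes the mean absolute deviation around the mean and rescales it by sqrt(pi/2), since E|X - mu| = sigma * sqrt(2/pi) under a normal distribution. Both the helper name mean_ad_sketch and the choice of scaling constant are assumptions for this example; they are not taken from the MKdescr sources.

```r
# Hypothetical helper (not part of MKdescr): mean absolute deviation
# around the mean, rescaled so that it targets sigma under normality.
mean_ad_sketch <- function(x, constant = sqrt(pi / 2)) {
  constant * mean(abs(x - mean(x)))
}

# The raw mean absolute deviation of this vector is 1.2
v <- c(-2, -1, 0, 1, 2)
mean_ad_sketch(v)
```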
Huber-type skipped mean and SD are robust alternatives to the arithmetic mean and sample standard deviation.
skippedMean(x)
## [1] -0.01598634
skippedSD(x)
## [1] 1.018492
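The principle behind skipped estimators can be sketched in base R: observations far from the median, measured in units of the MAD, are dropped before the classical estimators are applied. The helper names and the cutoff of 3 MADs are assumptions made for this example and need not match the defaults of skippedMean and skippedSD. That skippedSD(x) above coincides with sd(x) would be consistent with no observation having been skipped in that sample.

```r
# Hypothetical helpers (not part of MKdescr): drop observations that lie
# more than k standardized MADs from the median, then use mean and sd.
keep_central <- function(x, k = 3) {
  x[abs(x - median(x)) <= k * mad(x)]
}
skipped_mean_sketch <- function(x, k = 3) mean(keep_central(x, k))
skipped_sd_sketch   <- function(x, k = 3) sd(keep_central(x, k))

y <- c(9, 10, 10, 11, 12, 50)  # one gross outlier
mean(y)                        # pulled upwards by the outlier
skipped_mean_sketch(y)         # outlier removed before averaging
skipped_sd_sketch(y)
```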
Function fiveNS computes a so-called five number summary which, in contrast to function fivenum, uses the first and third quartile instead of the lower and upper hinge.
fiveNS(x)
## Minimum 25% Median 75% Maximum
## -2.7789168 -0.6774345 -0.1063944 0.7642046 2.4248299
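The difference between hinges and quartiles shows up most clearly for small samples. The following base R sketch compares fivenum with the default quantile type for the integers 1 to 6; it uses only standard functions, and the variable names are chosen for this example.

```r
x6 <- 1:6
h <- fivenum(x6)                  # hinges: 1.0 2.0 3.5 5.0 6.0
q <- quantile(x6, names = FALSE)  # quartiles (type 7): 1.00 2.25 3.50 4.75 6.00
rbind(hinges = h, quartiles = q)
```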
There are functions to compute the (classical) coefficient of variation as well as two robust variants. In case of the robust variants, the mean is replaced by the median and the SD is replaced by the (standardized) MAD and the (standardized) IQR, respectively.
## 5% outliers
out <- rbinom(100, prob = 0.05, size = 1)
sum(out)
## [1] 5
x <- (1-out)*rnorm(100, mean = 10, sd = 2) + out*25
CV(x)
## [1] 0.3580192
medCV(x)
## [1] 0.1827217
iqrCV(x)
## [1] 0.1774955
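The three variants can be written down directly from the description above. This is a minimal base R sketch with hypothetical helper names; it assumes that "standardized" refers to the normal-consistency constants (the default of mad and 2 * qnorm(0.75) for the IQR) and is not taken from the MKdescr sources.

```r
# Hypothetical helpers (not part of MKdescr), written from the description
cv_sketch     <- function(x) sd(x) / mean(x)
med_cv_sketch <- function(x) mad(x) / median(x)  # mad() is standardized by default
iqr_cv_sketch <- function(x) (IQR(x) / (2 * qnorm(0.75))) / median(x)

clean   <- c(8, 9, 10, 11, 12)
spoiled <- c(clean, 25)  # one outlier added

c(cv_sketch(clean), cv_sketch(spoiled))          # classical CV changes a lot
c(med_cv_sketch(clean), med_cv_sketch(spoiled))  # robust variant changes less
c(iqr_cv_sketch(clean), iqr_cv_sketch(spoiled))
```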
There are functions to compute the (classical) signal-to-noise ratio as well as two robust variants. In case of the robust variants, the mean is replaced by the median and the SD is replaced by the (standardized) MAD and the (standardized) IQR, respectively.
SNR(x)
## [1] 2.793146
medSNR(x)
## [1] 5.472804
iqrSNR(x)
## [1] 5.633947
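The printed values indicate that each signal-to-noise ratio is the reciprocal of the corresponding coefficient of variation. The short check below uses only the numbers copied from the outputs above; the vector names are chosen for this example.

```r
# Values copied from the outputs above
cv  <- c(classical = 0.3580192, med = 0.1827217, iqr = 0.1774955)
snr <- c(classical = 2.793146,  med = 5.472804,  iqr = 5.633947)

# Products equal 1 up to rounding of the printed digits
cv * snr
```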
In contrast to the standard function boxplot which uses the lower and upper hinge for defining the box and the whiskers, the function qboxplot uses the first and third quartile.
x <- rt(10, df = 3)
par(mfrow = c(1,2))
qboxplot(x, main = "1st and 3rd quartile")
boxplot(x, main = "Lower and upper hinge")
The difference between the two versions is often hardly visible.
The generalized logarithm may be useful as a variance-stabilizing transformation when negative values are also present.
curve(log, from = -3, to = 5)
## Warning in log(x): NaNs produced
curve(glog, from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log", "glog"))
As with function log, there are also the variants glog10 and glog2.
curve(log10(x), from = -3, to = 5)
## Warning in eval(expr, envir = ll, enclos = parent.frame()): NaNs produced
curve(glog10(x), from = -3, to = 5, add = TRUE, col = "orange")
legend("topleft", fill = c("black", "orange"), legend = c("log10", "glog10"))
There are also functions that compute the inverse of the generalized logarithm.
inv.glog(glog(10))
## [1] 10
inv.glog(glog(10, base = 3), base = 3)
## [1] 10
inv.glog10(glog10(10))
## [1] 10
inv.glog2(glog2(10))
## [1] 10
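To make the transformation and its inverse concrete, here is a minimal base R sketch. It assumes the common definition glog(x) = log((x + sqrt(x^2 + 4)) / 2), which equals asinh(x / 2) for the natural logarithm; the helper names are hypothetical, and the code is not taken from the MKdescr sources.

```r
# Hypothetical re-implementation, assuming
# glog(x) = log((x + sqrt(x^2 + 4)) / 2) to the given base
glog_sketch <- function(x, base = exp(1)) {
  log((x + sqrt(x^2 + 4)) / 2, base = base)
}
# Inverse under the same assumption: base^y - base^(-y)
inv_glog_sketch <- function(y, base = exp(1)) {
  base^y - base^(-y)
}

glog_sketch(c(-3, 0, 5))                              # finite for negative input, 0 at 0
inv_glog_sketch(glog_sketch(10))                      # round trip, up to floating point
inv_glog_sketch(glog_sketch(-3, base = 2), base = 2)
```

Because x + sqrt(x^2 + 4) is positive for every real x, the argument of the logarithm never becomes negative, which is why no NaNs arise for negative input under this definition.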
The thyroid function is usually investigated by determining the values of TSH, fT3 and fT4. The function thyroid can be used to visualize the measured values as relative values with respect to the provided reference ranges.
thyroid(TSH = 1.5, fT3 = 2.5, fT4 = 14, TSHref = c(0.2, 3.0),
        fT3ref = c(1.7, 4.2), fT4ref = c(7.6, 15.0))
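One simple way to express a measurement relative to its reference range is its position inside the range in percent. The sketch below illustrates this for the values used above; the helper rel_to_ref is hypothetical, and this particular scaling is an assumption for illustration, not a claim about what thyroid computes internally.

```r
# Hypothetical helper (not part of MKdescr): position of a value inside
# its reference range in percent; 0 = lower limit, 100 = upper limit.
rel_to_ref <- function(value, ref) {
  100 * (value - ref[1]) / (ref[2] - ref[1])
}

rel_to_ref(1.5, c(0.2, 3.0))   # TSH
rel_to_ref(2.5, c(1.7, 4.2))   # fT3
rel_to_ref(14,  c(7.6, 15.0))  # fT4
```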
We can use the generalized logarithm for transforming the axes in ggplot2 plots.
library(ggplot2)
data(mpg)
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p1
p1 + scale_x_log10()
p1 + scale_x_glog10()
p1 + scale_y_log10()
p1 + scale_y_glog10()
The negative logarithm is useful, for instance, for displaying p values, because the interesting (small) values then appear at the top. It is used, for example, in a so-called volcano plot.
x <- matrix(rnorm(1000, mean = 10), nrow = 10)
g1 <- rep("control", 10)
y1 <- matrix(rnorm(500, mean = 11.25), nrow = 10)
y2 <- matrix(rnorm(500, mean = 9.75), nrow = 10)
g2 <- rep("treatment", 10)
group <- factor(c(g1, g2))
Data <- rbind(x, cbind(y1, y2))
pvals <- apply(Data, 2, function(x, group) t.test(x ~ group)$p.value,
               group = group)
## compute log-fold change
logfc <- function(x, group){
  res <- tapply(x, group, mean)
  log2(res[1]/res[2])
}
lfcs <- apply(Data, 2, logfc, group = group)
ps <- data.frame(pvals = pvals, logfc = lfcs)
ggplot(ps, aes(x = logfc, y = pvals)) + geom_point() +
  geom_hline(yintercept = 0.05) + scale_y_neglog10() +
  geom_vline(xintercept = c(-0.1, 0.1)) + xlab("log-fold change") +
  ylab("-log10(p value)") + ggtitle("A Volcano Plot")
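The effect of the negative logarithm on the ordering of p values can be checked with plain arithmetic. The following base R lines use made-up p values chosen for this example and only illustrate the transformation, not the ggplot2 scale itself.

```r
p <- c(0.5, 0.05, 0.001)
neg_log10 <- -log10(p)
neg_log10  # increasing: the smaller the p value, the larger -log10(p)
```

With this transformation the significance threshold 0.05 corresponds to -log10(0.05), roughly 1.3, so points above that height in the plot have p values below 0.05.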
Often it’s better to have the data in a long format than in a wide format; e.g., when plotting with package ggplot2. The necessary transformation can be done with function melt.long.
library(ggplot2)
## some random data
test <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
test.long <- melt.long(test)
test.long
## value variable
## x1 -1.25600780 x
## x2 -1.28190519 x
## x3 -0.50302301 x
## x4 -1.81198050 x
## x5 1.33637959 x
## x6 -0.53002060 x
## x7 0.06905170 x
## x8 0.28367178 x
## x9 -1.40526691 x
## x10 -0.25161151 x
## y1 0.59266825 y
## y2 0.89224059 y
## y3 -1.82099184 y
## y4 -0.42290993 y
## y5 0.80487321 y
## y6 -0.52625631 y
## y7 -1.09019471 y
## y8 -0.70532212 y
## y9 1.68753285 y
## y10 -0.32552647 y
## z1 0.32833993 z
## z2 -0.02516865 z
## z3 -1.24717306 z
## z4 0.52687961 z
## z5 0.47205521 z
## z6 0.10035455 z
## z7 1.20788038 z
## z8 -0.42464503 z
## z9 -1.04361930 z
## z10 -1.71926206 z
ggplot(test.long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable))
## introducing an additional grouping variable
group <- factor(rep(c("a","b"), each = 5))
test.long.gr <- melt.long(test, select = 1:2, group = group)
test.long.gr
## group value variable
## x1 a -1.2560078 x
## x2 a -1.2819052 x
## x3 a -0.5030230 x
## x4 a -1.8119805 x
## x5 a 1.3363796 x
## x6 b -0.5300206 x
## x7 b 0.0690517 x
## x8 b 0.2836718 x
## x9 b -1.4052669 x
## x10 b -0.2516115 x
## y1 a 0.5926683 y
## y2 a 0.8922406 y
## y3 a -1.8209918 y
## y4 a -0.4229099 y
## y5 a 0.8048732 y
## y6 b -0.5262563 y
## y7 b -1.0901947 y
## y8 b -0.7053221 y
## y9 b 1.6875329 y
## y10 b -0.3255265 y
ggplot(test.long.gr, aes(x = variable, y = value, fill = group)) +
  geom_boxplot()
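For readers who want to see what the wide-to-long reshaping amounts to, the following base R sketch builds a data frame with the same two columns as in the first output above. It is a hypothetical equivalent written for illustration, not the implementation of melt.long, and the object names wide and long are chosen for this example.

```r
# Hypothetical base R equivalent (not the MKdescr implementation)
wide <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
long <- data.frame(
  value    = unlist(wide),  # stacks the columns; names become x1, x2, y1, ...
  variable = factor(rep(names(wide), each = nrow(wide)))
)
long
```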
sessionInfo()
## R version 3.6.1 Patched (2019-10-31 r77355)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19.2
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
## LAPACK: /home/kohlm/RTOP/Rbranch/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1 MKdescr_0.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 knitr_1.25 magrittr_1.5 tidyselect_0.2.5
## [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.0 rlang_0.4.1
## [9] stringr_1.4.0 dplyr_0.8.3 tools_3.6.1 grid_3.6.1
## [13] gtable_0.3.0 xfun_0.10 withr_2.1.2 htmltools_0.4.0
## [17] assertthat_0.2.1 yaml_2.2.0 lazyeval_0.2.2 digest_0.6.22
## [21] tibble_2.1.3 crayon_1.3.4 purrr_0.3.3 glue_1.3.1
## [25] evaluate_0.14 rmarkdown_1.16 labeling_0.3 stringi_1.4.3
## [29] compiler_3.6.1 pillar_1.4.2 scales_1.0.0 pkgconfig_2.0.3