extremeStat: quantile estimation

Berry Boessenkool, berry-b@gmx.de

2016-05-11

The R package extremeStat, available at github.com/brry, contains code to fit, plot and compare several (extreme value) distribution functions. It can also compute (truncated) distribution quantile estimates and draw a plot with return periods on a linear scale. (Vignette Rmd source)

Main focus of this document:
Quantile estimation via distribution fitting
Comparison of GPD implementations in several R packages

Note: in some disciplines, quantiles are called percentiles, but technically, percentiles are only one kind of quantile (as are deciles, quartiles, etc.).

Package installation

install.packages("extremeStat")
library(extremeStat)

To install the development version from github:

install.packages(c("devtools","evd","evir","extRemes","fExtremes",
                   "ismev","lmomco","pbapply","Renext"))
# repeat until all of them work (some may not install properly on first try)

devtools::install_github("brry/berryFunctions")
devtools::install_github("brry/extremeStat") 
library(extremeStat)

extremeStat has 28 dependencies, because of the GPD comparison across the packages.


Example dataset

Let’s use the dataset rain with 17k values. With very small values removed, as those might be considered uncertain records, this leaves us with 6k values.

data(rain, package="ismev")
rain <- rain[rain>2]
hist(rain, breaks=80, col=4, las=1)
# Visual inspection is easier on a logarithmic scale:
berryFunctions::logHist(rain, breaks=80, col=3, las=1)


Fitting distributions

The function distLfit fits 17 of the distribution types available in the R package lmomco. There are more, but some of them require considerable computation time and often cannot be fitted to this type of data anyway. Turn them on with speed=FALSE.

The parameters are estimated via linear moments (L-moments). These are analogous to the conventional statistical moments (mean, variance, skewness and kurtosis), but “robust [and] suitable for analysis of rare events of non-Normal data. […] L-moments are especially useful in the context of quantile functions” (Asquith, W. (2015): lmomco package).
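As a rough illustration of what L-moments are, the first three sample L-moments can be computed in base R from probability-weighted moments of the sorted sample. This is only a sketch of the textbook estimator; the helper name sample_lmoments is hypothetical and not part of lmomco or extremeStat, whose own routines should be preferred for real work.

```r
# Sketch: first three sample L-moments via unbiased probability-weighted
# moments (b0, b1, b2). 'sample_lmoments' is a hypothetical helper name,
# not a function from lmomco or extremeStat.
sample_lmoments <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  stopifnot(n >= 3)
  i <- seq_len(n)
  b0 <- mean(x)
  b1 <- sum((i - 1) / (n - 1) * x) / n
  b2 <- sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
  c(l1 = b0,                      # L-location (equals the mean)
    l2 = 2 * b1 - b0,             # L-scale
    l3 = 6 * b2 - 6 * b1 + b0)    # third L-moment (asymmetry)
}

lm_sym <- sample_lmoments(1:5)    # symmetric sample, so l3 is zero
lm_sym
```

Because each L-moment is a linear combination of the ordered values, a single extreme observation influences it less than it influences the conventional variance or skewness.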

distLfit ranks the distributions according to their goodness of fit (RMSE between ecdf and cdf).
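The idea of ranking by RMSE between ecdf and cdf can be sketched in base R. This is not the code used by distLfit: the two candidate distributions below are fitted by simple moment matching rather than L-moments, and the simulated sample is an assumption made only for illustration.

```r
# Sketch: rank candidate distributions by RMSE between the empirical
# and the fitted cumulative distribution function.
set.seed(42)
x <- rexp(500, rate = 0.2)        # assumed example data (skewed, positive)

# Candidates, fitted here by moment matching (NOT L-moments)
cdfs <- list(
  exp = function(q) pexp(q, rate = 1 / mean(x)),
  nor = function(q) pnorm(q, mean = mean(x), sd = sd(x))
)

emp  <- ecdf(x)                   # empirical cdf as a function
rmse <- sapply(cdfs, function(cdf) sqrt(mean((emp(x) - cdf(x))^2)))
sort(rmse)                        # smallest RMSE = best fit
```

For this skewed sample the exponential candidate should rank ahead of the normal one, which mirrors how the order of distributions in the output further down is to be read.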


Quantile estimation

To estimate the quantiles of (small) samples via a distribution function, you can use distLquantile, which internally calls distLfit, in the following manner:

dlq <- distLquantile(rain, probs=c(0.8,0.9,0.99,0.999), returnlist=TRUE, quiet=TRUE)

By default, the 5 best fitting distribution types are drawn and the quantiles for each distribution returned. If returnlist is set to TRUE, it will return an object that can be examined with

distLprint(dlq)
## ----------
## Dataset 'rain' with 6362 values. min/median/max: 2.3/6.6/86.6  nNA: 0
## truncate: 0 threshold: 2.3. dat_full with 6362 values: 2.3/6.6/86.6  nNA: 0
## dlf with 16 distributions. In descending order of fit quality:
## wei, wak, kap, gpa, pe3, exp, gno, ln3, gev, glo, gam, gum, ray, nor, rice, revgum
## gofProp: 1, RMSE min/median/max: 0.014/0.031/0.14  nNA: 0
## quant: 33 rows, 4 columns, 132 values, of which 8 NA.
##  5 distribution colors: #3300FFFF, #00D9FFFF, #00FF19FF, #F2FF00FF, #FF0000FF
## # More information on dlf objects in
## ?extremeStat

plotted with

distLplot(dlq, nbest=8, qlines=TRUE, qlinargs=list(lwd=2), 
          qheights=seq(0.04, 0.01, len=8), breaks=80)

and the resulting parametric quantiles can be obtained with

dlq$quant # distLquantile output if returnlist=FALSE (the default)
##                              80%      90%      99%     99.9%
## wei                     13.35648 18.81148 38.06546  58.44658
## wak                     13.40619 18.87738 37.81141  57.73253
## kap                     13.38974 18.82981 37.92247  58.37445
## gpa                     13.20620 18.54999 38.85626  63.76409
## pe3                     13.43704 18.90830 37.67313  56.83284
## exp                     13.64134 18.79921 35.93328  53.06736
## gno                     12.81427 18.01421 40.77776  74.40654
## ln3                     12.81427 18.01421 40.77776  74.40654
## gev                     12.49364 17.39559 42.07102  90.15945
## glo                     12.27871 16.92716 42.72106 102.56565
## gam                     13.94944 18.56683 33.03442  46.91116
## gum                     14.05929 18.08737 30.70033  43.08422
## ray                     14.58774 18.15385 27.16319  34.07630
## nor                     14.65654 17.55771 24.44775  29.48528
## rice                    13.03579 15.59222 22.05073  27.00652
## revgum                  14.75911 16.68154 20.40216  22.57858
## quantileMean            13.20000 19.10000 36.74778  60.57805
## weighted1               13.34156 18.24385 36.74277  61.21236
## weighted2               13.31851 18.28320 37.09951  62.03576
## weighted3               13.32627 18.69234 38.12765  60.23631
## weightedc                    NaN      NaN      NaN       NaN
## q_gpd_evir_pwm          13.13589 18.51086 39.35954  65.73245
## q_gpd_evir_ml           13.17887 18.50146 38.75922  63.66670
## q_gpd_evd               13.49848 18.84150 39.17160  64.15801
## q_gpd_extRemes_MLE      13.49802 18.84137 39.17554  64.17217
## q_gpd_extRemes_GMLE     13.48895 18.88301 39.69709  65.82174
## q_gpd_extRemes_Bayesian       NA       NA       NA        NA
## q_gpd_extRemes_Lmoments 13.45454 18.85351 39.79964  66.30399
## q_gpd_fExtremes_pwm     13.45497 18.85333 39.79276  66.28045
## q_gpd_fExtremes_mle     13.49613 18.83907 39.17433  64.17709
## q_gpd_ismev             13.49921 18.84241 39.17242  64.15725
## q_gpd_Renext_r          13.49848 18.84150 39.17160  64.15801
## q_gpd_Renext_f          14.79884 20.59891 37.67775  51.83807

distLgofPlot(dlq, ranks=FALSE, 
             legargs=list(cex=0.8, bg="transparent"), quiet=TRUE)


POT, GPD

The Generalized Pareto Distribution (‘GPD’, or ‘gpa’ in the package lmomco) is often used to obtain parametric quantile values because of the Pickands-Balkema-de Haan theorem. It states that the tails of many (empirical) distributions converge to the GPD if a Peak-Over-Threshold (POT) method is used, i.e. the distribution is fitted only to the largest values of a sample. The resulting quantiles can be called censored or truncated quantiles.

This package is based on the philosophy that, in order to compare parametric with empirical quantiles, the threshold must be at some percentage of the full sample. That way, the probabilities given to the quantile functions can be updated.
For example, if the censored Q0.99 is to be computed from the top 20 % of the full dataset, Q0.95 of the truncated sample must be used. The probability adjustment for censored quantiles with truncation proportion t happens with the equation

\[ p_2 = \frac{p-t}{1-t} \]

derived from

\[ \frac{1-p}{1-t} = \frac{1-p_2}{1-0} \]

as visualized along a probability line:

In distLquantile, you can set the threshold manually, or (better) as a truncate percentage reflecting the proportion of data discarded:

d <- distLquantile(rain, truncate=0.9, plot=TRUE, probs=0.999, quiet=TRUE, breaks=50)
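The probability adjustment can be checked numerically with base R. The following is a minimal sketch: adjust_prob is a hypothetical helper (not an extremeStat function), and the simulated exponential sample is an assumption standing in for real data.

```r
# Sketch of the probability adjustment p2 = (p - t) / (1 - t).
# 'adjust_prob' is a hypothetical helper, not part of extremeStat.
adjust_prob <- function(p, t) {
  stopifnot(all(p > t), t >= 0, t < 1)
  (p - t) / (1 - t)
}

p2 <- adjust_prob(p = 0.99, t = 0.8)    # 0.95, as in the Q0.99 example

# Check with simulated data (assumption: continuous sample without ties)
set.seed(1)
x   <- rexp(10000)
top <- x[x > quantile(x, 0.8)]          # keep only the top 20 %
q_full <- quantile(x,   probs = 0.99, names = FALSE)
q_top  <- quantile(top, probs = p2,   names = FALSE)
c(full = q_full, truncated = q_top)     # nearly identical values
```

The empirical Q0.99 of the full sample and the empirical Q0.95 of the truncated sample agree up to interpolation between neighbouring order statistics, which is why parametric quantiles from the truncated fit can be compared directly with empirical quantiles of the full sample.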


Truncation effect

To examine the effect of the truncation percentage, we can compute the quantiles for different cutoff percentages. This is quite time-consuming, so the code is not run when the vignette is built. The saved result is loaded instead.

tt <- seq(0,0.95, len=50) 
if(interactive()) lapply <- pbapply::pblapply # for progress bars
qq <- lapply(tt, function(t) distLquantile(rain, truncate=t, 
                                             probs=c(0.99,0.999), quiet=TRUE))     
save(tt,qq, file="qq.Rdata")   

We can visualize the truncation dependency with

load("qq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.8,0.5,0))
plot(tt,tt, type="n", xlab="truncation proportion", ylab="Quantile estimation",
     main="truncation effect for 6k values of rain", ylim=c(22,90), las=1)
dn <- c("wak","kap","wei","gpa","pe3","weighted2")
cols <- c(4,5,3,"orange",2,1) ; names(cols) <- dn
for(d in rownames(qq[[1]])) lines(tt, sapply(qq, "[", d, j=2), col=8)
for(d in dn)
  {
  lines(tt, sapply(qq, "[", d, j=1), col=cols[d], lwd=2)
  lines(tt, sapply(qq, "[", d, j=2), col=cols[d], lwd=2)
  }
abline(h=berryFunctions::quantileMean(rain, probs=c(0.99,0.999)), lty=3)
legend("topright", c(dn,"other"), col=c(cols,8), lty=1, lwd=c(rep(2,6),1), bg="white", cex=0.6)
text(0.9, 53, "Q99.9%") ; text(0.9, 34, "Q99%")
text(0.35, 62, "empirical quantile (full sample)", cex=0.7)

The 17 different distribution quantiles and 12 different GPD estimates seem to converge with increasing truncation percentage. However, at least 5 values must remain in the truncated sample to fit distributions via linear moments, so don’t truncate too much. I found 0.8 to be a good truncation proportion: if you fit to the top 20% of the data, you get good results, while needing ‘only’ approximately 25 values (5 / 0.2) in a sample to infer a quantile estimate.
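The trade-off between truncation and required sample size is simple arithmetic and can be tabulated. This sketch only restates the rule of thumb from the text (5 values must remain after truncation); the variable names are chosen for illustration.

```r
# Sketch: approximate sample size needed so that 5 values remain
# after discarding the lower 'truncate' proportion of the data.
min_remaining <- 5                          # requirement stated in the text
trunc_prop    <- c(0.5, 0.8, 0.9, 0.95)     # truncation proportions to compare
n_needed      <- round(min_remaining / (1 - trunc_prop), 1)
data.frame(truncate = trunc_prop, n_needed = n_needed)
```

With truncate=0.8 this gives the 25 values mentioned above, while truncate=0.95 would already require about 100 values.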


Sample size dependency

One motivation behind the development of this package is the finding that high empirical quantiles depend not only on the values of a sample (as they should), but also on the number of observations available. That is not surprising: given the distribution of a population, small samples less often include the high (and rare) values. The cool thing about parametric quantiles is that they don’t systematically underestimate the actual quantile in small samples. Here’s a quick demonstration.

set.seed(1)
ss <- c(30,50,70,100,200,300,400,500,1000)
rainsamplequantile <- function() sapply(ss, function(s) distLquantile(sample(rain,s), 
          probs=0.999, plot=FALSE, truncate=0.8, quiet=TRUE, sel="wak", gpd=FALSE, weight=FALSE))
sq <- pbapply::pbreplicate(n=100, rainsamplequantile())    
save(ss,sq, file="sq.Rdata")   

Load the resulting R objects:

load("sq.Rdata")
par(mar=c(3,2.8,2.2,0.4), mgp=c(1.7,0.5,0))
sqs <- function(prob,row) apply(sq, 1:2, quantile, na.rm=TRUE, probs=prob)[row,]
berryFunctions::ciBand(yu=sqs(0.6,1), yl=sqs(0.4,1), ym=sqs(0.5,1), x=ss, 
    ylim=c(25,75), xlim=c(30,900), xlab="sample size", ylab="estimated 99.9% quantile", 
    main="quantile estimations of small random samples", colm="blue")
berryFunctions::ciBand(yu=sqs(0.6,2), yl=sqs(0.4,2), ym=sqs(0.5,2), x=ss, add=TRUE)
abline(h=quantile(rain,0.999))
text(250, 50, "empirical", col="forestgreen")
text(400, 62, "Wakeby", col="blue")
text(0, 61, "'True' population value", adj=0)
text(600, 40, "median and central 20% of 100 simulations")
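The same effect can be reproduced without the rain data or any package. The sketch below draws samples from a standard exponential distribution, where the true 99.9% quantile is known, and compares the empirical quantile with a parametric one. Note the idealisation: the parametric estimate is fitted with the correct distribution family, which is an assumption that does not hold for real data.

```r
# Sketch: empirical vs. parametric 99.9% quantile in small samples,
# using a population with known quantile (standard exponential).
set.seed(1)
true_q <- qexp(0.999)                       # population value, about 6.9
sizes  <- c(30, 100, 1000)

one_run <- function(n) {
  x <- rexp(n)
  c(empirical  = quantile(x, probs = 0.999, names = FALSE),
    parametric = qexp(0.999, rate = 1 / mean(x)))  # exponential fit via the mean
}

# Median estimate over 200 simulated samples per sample size
med <- sapply(sizes, function(n) apply(replicate(200, one_run(n)), 1, median))
colnames(med) <- as.character(sizes)
round(med, 2)
```

The empirical row stays well below true_q for the small samples and approaches it only as the sample size grows, whereas the parametric row stays close to true_q at every size.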


Extreme value statistics, Return Periods

Once you have a quantile estimator, you can easily compute extremes (= return levels) for given return periods.
A value x in a time series has a certain expected frequency to occur or be exceeded: the exceedance probability Pe. With F(x) denoting the non-exceedance probability of x (the probability for which x is the quantile), the Return Period (RP) of x can be computed as follows:

\[ RP = \frac{1}{Pe} = \frac{1}{1-F(x)} \]
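The relation between return periods and probabilities can be made concrete with a short base R sketch. The Gumbel distribution and its parameters below are hypothetical choices for illustration only; they are not fitted to any dataset in this vignette.

```r
# Sketch: return periods <-> non-exceedance probabilities for annual maxima.
rp       <- c(2, 5, 10, 20, 50)      # return periods in years
p_nonexc <- 1 - 1 / rp               # 0.50 0.80 0.90 0.95 0.98

# Hypothetical Gumbel parameters, chosen only for illustration
loc <- 55; scale <- 17
qgumbel <- function(p) loc - scale * log(-log(p))        # quantile function
pgumbel <- function(q) exp(-exp(-(q - loc) / scale))     # distribution function

return_level <- qgumbel(p_nonexc)                # value expected once per RP years
rp_back      <- 1 / (1 - pgumbel(return_level))  # recovers the return periods
data.frame(RP = rp, p = p_nonexc, return_level = round(return_level, 1))
```

A return level is thus simply the quantile at probability 1 - 1/RP, so any quantile estimator can be turned into a return level estimator.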

Here is an example with annual block maxima of stream discharge in Austria:

data("annMax") # annual discharge maxima in the extremeStat package itself
dle <- distLextreme(annMax, log=TRUE, legargs=list(cex=0.6, bg="transparent"), nbest=17, quiet=TRUE)

dle$returnlev[1:20,]
##                  RP.2     RP.5    RP.10     RP.20     RP.50
## wak          62.06908 82.00224 93.37393 103.30175 114.81836
## kap          61.63990 82.43319 94.20990 103.80750 113.98584
## wei          61.84405 81.72957 93.39678 103.55521 115.46953
## pe3          61.86107 81.13112 92.97753 103.72004 116.87811
## ray          62.37416 81.92136 93.07332 102.63851 113.71311
## ln3          61.89473 80.85126 92.71419 103.69028 117.47486
## gno          61.89473 80.85126 92.71419 103.69028 117.47486
## gev          61.85979 80.83924 92.82171 103.89743 117.64984
## gum          61.24316 80.26879 92.86542 104.94841 120.58860
## gpa          61.36114 84.02187 95.30208 103.19519 110.11583
## gam          62.54834 81.39612 92.57994 102.53018 114.52039
## glo          62.16467 79.38862 91.10085 103.11624 120.24369
## rice         64.59196 82.14649 91.37297  99.01113 107.62436
## nor          64.78000 82.13652 91.20908  98.70136 107.13390
## exp          57.63946 78.96177 95.09148 111.22119 132.54351
## revgum       68.31684 82.45728 88.46912  92.88645  97.36604
## quantileMean 61.42222 82.14694 93.28444 105.64000 112.76000
## weighted1    62.14138 81.42424 92.91457 103.12462 115.41711
## weighted2    62.03534 81.43120 93.00784 103.29342 115.66890
## weighted3    61.93842 81.57344 93.21973 103.48630 115.66034

Explore the other possibilities of the package by reading the function help files.
A good place to start is the package help:

?extremeStat


Any feedback on this package (or this vignette) is very welcome via GitHub or berry-b@gmx.de!