title: "Introduction to 'eva' and its capabilities"
author: Brian Bader
date: "2015-12-29"
output: rmarkdown::html_vignette
The `eva` package, short for extreme value analysis, supports the analysis of extremes from beginning to end, with model fitting and a range of newly available diagnostic tests. Some highlights are:
- Implementation of the \(r\) largest order statistics (GEV\(_r\)) model: data generation, fitting, and return levels.
- Efficient handling of the near-zero shape parameter.
- Maximum product spacings (MPS) estimation for parameters in the block maxima (GEV\(_1\)) and generalized Pareto distributions.
- Sequential tests for the choice of \(r\) in the GEV\(_r\) model, as well as tests for the selection of the threshold in the peaks-over-threshold (POT) approach. For the bootstrap-based tests, the option to run in parallel is provided.
- P-value adjustments to control the false discovery rate (FDR) and family-wise error rate (FWER) in the sequential testing setting.
# load package
library(eva)
# A naive implementation of the GEV cumulative distribution function
pgev_naive <- function(q, loc = 0, scale = 1, shape = 1) {
  exp(-(1 + (shape * (q - loc)) / scale)^(-1 / shape))
}
# Evaluate the cdf at q = 1 for shape values from 1e-20 to 0.01 (log-scaled x axis):
# first the naive version, then the version in eva
curve(pgev_naive(1, 0, 1, x), 1e-20, .01, log = "x", n = 1025)
curve(eva:::pgev(1, 0, 1, x), 1e-20, .01, log = "x", n = 1025)
# Similarly for the GPD cumulative distribution function
pgpd_naive <- function(q, loc = 0, scale = 1, shape = 1) {
  1 - (1 + (shape * (q - loc)) / scale)^(-1 / shape)
}
curve(pgpd_naive(1, 0, 1, x), 1e-20, .01, log = "x", n = 1025)
curve(eva:::pgpd(1, 0, 1, x), 1e-20, .01, log = "x", n = 1025)
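The comparison above shows the naive formula breaking down for tiny shape values. The following is a minimal sketch of why that happens and of one way to stabilise it. It is an illustration only, not eva's actual implementation: the name `pgev_log1p` is hypothetical, and the support condition \(1 + \xi z > 0\) is not checked.

```r
# Hypothetical helper, not part of eva: GEV cdf written with log1p so that
# (1 + shape * z)^(-1 / shape) is evaluated as exp(-log1p(shape * z) / shape).
pgev_log1p <- function(q, loc = 0, scale = 1, shape = 0) {
  z <- (q - loc) / scale
  sz <- shape * z
  # log1p(sz) / shape tends to z as shape -> 0; use z when shape * z is exactly 0.
  # No check that 1 + shape * z > 0 is made here.
  w <- ifelse(sz == 0, z, log1p(sz) / shape)
  exp(-exp(-w))
}

# In double precision 1 + 1e-20 rounds to 1, so the naive formula collapses to exp(-1)
naive_value <- exp(-(1 + 1e-20 * 1)^(-1 / 1e-20))
# The log1p form stays close to the Gumbel limit exp(-exp(-1))
stable_value <- pgev_log1p(1, 0, 1, 1e-20)
gumbel_limit <- exp(-exp(-1))
```

Comparing `naive_value` and `stable_value` with `gumbel_limit` shows which of the two formulas respects the limit as the shape parameter approaches zero.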
The GEV\(_r\) distribution has the density function \[f_r (x_1, x_2, \ldots, x_r \mid \mu, \sigma, \xi) = \sigma^{-r}\exp\left\{-(1+\xi z_r)^{-\frac{1}{\xi}} - \left(\frac{1}{\xi}+1\right)\sum_{j=1}^{r}\log(1+\xi z_j)\right\}\] for some location parameter \(\mu\), scale parameter \(\sigma > 0\) and shape parameter \(\xi\), where \(x_1 > \cdots > x_r\), \(z_j = (x_j - \mu) / \sigma\), and \(1 + \xi z_j > 0\) for \(j=1, \ldots, r\). When \(r = 1\), this distribution is exactly the GEV distribution used for block maxima.
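As a quick check of the \(r = 1\) case, setting \(r = 1\) in the density leaves a single term in the sum, and the expression collapses to the familiar GEV density (a worked step added for illustration, using the same symbols as the definition): \[f_1(x_1 \mid \mu, \sigma, \xi) = \sigma^{-1}\exp\left\{-(1+\xi z_1)^{-\frac{1}{\xi}} - \left(\frac{1}{\xi}+1\right)\log(1+\xi z_1)\right\} = \frac{1}{\sigma}(1+\xi z_1)^{-\frac{1}{\xi}-1}\exp\left\{-(1+\xi z_1)^{-\frac{1}{\xi}}\right\}.\]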
This package includes data generation (rgevr), the density function (dgevr), fitting (gevrFit), and return levels (gevrRl) for this distribution. If one wants to choose \(r > 1\), goodness-of-fit must be tested. This can be done using the function gevrSeqTests. Take, for example, the dataset lowestoft, which includes the top ten sea levels per year at Lowestoft harbor from 1984 - 2014. Two tests are available to run in sequence: the entropy difference test and the score test.
data(lowestoft)
gevrSeqTests(lowestoft, method = "ed")
##    r  p.values ForwardStop StrongStop   statistic  est.loc est.scale  est.shape
## 1  2 0.6847284   1.1477293  0.9587917  0.40601933 3.431792 0.2346591 0.10049739
## 2  3 0.9254168   1.1469054  1.0682403 -0.09361276 3.434097 0.2397408 0.09172687
## 3  4 0.6035795   0.9399148  1.1358926  0.51925992 3.447928 0.2404563 0.06802070
## 4  5 0.7507191   0.9423539  1.2633692  0.31769135 3.452449 0.2376723 0.05451138
## 5  6 0.2752194   0.8529898  1.1712439 -1.09112153 3.455478 0.2396332 0.04709329
## 6  7 0.8446632   0.9857656  1.4035512  0.19593220 3.454680 0.2372572 0.04555449
## 7  8 0.2831512   0.6936342  1.2288715 -1.07326731 3.455901 0.2376215 0.03838020
## 8  9 0.4617934   0.8740062  1.2526265 -0.73589693 3.458135 0.2356543 0.02536685
## 9 10 0.6764817   1.1284995  1.6947577 -0.41726899 3.459470 0.2342272 0.01964612
The entropy difference test fails to reject for any of the tested values of \(r\) (2 through 10); the smallest unadjusted p-value is about 0.28.
A common quantity of interest in extreme value analysis is the \(m\)-year return level, which can be thought of as the level that the annual maximum is expected to exceed on average once every \(m\) years. For the Lowestoft data, the 250-year sea level return levels, with 95% confidence intervals, are plotted for \(r\) from 1 to 10. The advantage of using more top order statistics can be seen in the plots below: the width of the intervals decreases by over two-thirds from \(r=1\) to \(r=10\). Similar decreases can be seen in the estimated parameters.
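To make the return level concrete, here is a minimal sketch that computes an \(m\)-year return level directly from GEV parameters, as the \(1 - 1/m\) quantile of the block maximum distribution. The helper `gev_return_level` is hypothetical and not part of eva. It is assumed, not verified here, that gevrRl reports the same point estimate. The parameter values are the \(r = 10\) estimates from the sequential test output above.

```r
# Hypothetical helper, not part of eva: m-year return level from GEV parameters,
# i.e. the quantile with non-exceedance probability 1 - 1/m.
gev_return_level <- function(m, loc, scale, shape) {
  y <- -log(1 - 1 / m)
  if (shape == 0) {
    loc - scale * log(y)                      # Gumbel limit
  } else {
    loc - (scale / shape) * (1 - y^(-shape))  # general GEV quantile
  }
}

# Estimates for r = 10 copied from the gevrSeqTests output shown earlier
rl <- gev_return_level(250, loc = 3.459470, scale = 0.2342272, shape = 0.01964612)
```

Plugging `rl` back into the GEV distribution function should return \(1 - 1/250\), which is a convenient sanity check on the formula.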
# Make 250-year return level plots using GEV_r for r = 1 to 10 with the Lowestoft data
data(lowestoft)
# Rows 1-10 hold the delta method results, rows 11-20 the profile likelihood results
result <- matrix(0, 20, 4)
period <- 250
for (i in 1:10) {
  # Fit the GEV_r model to the i largest observations in each year
  z <- gevrFit(as.matrix(lowestoft[, 1:i]))
  y1 <- gevrRl(z, period, conf = 0.95, method = "delta")
  y2 <- gevrRl(z, period, conf = 0.95, method = "profile")
  result[i, 1] <- i
  result[i, 2] <- y1$Estimate
  result[i, 3:4] <- y1$CI
  result[(i + 10), 1] <- i
  result[(i + 10), 2] <- y2$Estimate
  result[(i + 10), 3:4] <- y2$CI
}
result <- cbind.data.frame(result, c(rep("Delta", 10), rep("Profile", 10)))
colnames(result) <- c("r", "Est", "Lower", "Upper", "Method")
library(ggplot2)
ggplot(result, aes(x = r, y = Est)) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.3) +
  facet_grid(Method ~ .) +
  geom_line() +
  geom_point(size = 3) +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  xlab("r") +
  ylab("250 Year Return Level") +
  theme(text = element_text(size = 15))
In addition, the profile likelihood confidence intervals are compared with the delta method intervals. The advantage of profile likelihood over the delta method is that it allows for asymmetric intervals, which is especially useful at high quantiles, that is, for long return periods. In the Lowestoft plots directly above, the asymmetry shows in the lower bound, which stays stable across values of \(r\), while the upper bound decreases.
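The difference between the two kinds of interval follows from their textbook forms, sketched here in notation that is not in the original: \(q_m\) is the \(m\)-year return level, \(\theta = (\mu, \sigma, \xi)\), \(\hat{V}\) is the estimated covariance matrix of \(\hat{\theta}\), \(\ell\) is the log-likelihood, and \(\ell_p\) is the profile log-likelihood for \(q_m\). These are the standard constructions. Whether eva computes them in exactly this way is an assumption. The delta method interval is symmetric about the estimate by construction, \[\hat{q}_m \pm \Phi^{-1}(1-\alpha/2)\sqrt{\nabla q_m^{\top}\,\hat{V}\,\nabla q_m},\] whereas the profile likelihood interval collects all values of \(q_m\) that the likelihood does not rule out, \[\left\{q_m : 2\left[\ell(\hat{\theta}) - \ell_p(q_m)\right] \le \chi^2_{1,\,1-\alpha}\right\},\] and so can extend further on one side of \(\hat{q}_m\) than on the other.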