MetricsWeighted

2023-03-24

library(MetricsWeighted)

Overview

{MetricsWeighted} provides weighted versions of different machine learning metrics, scoring functions, and performance measures, as well as tools to use them within a {dplyr} chain.

Installation

# From CRAN
install.packages("MetricsWeighted")

# Development version
devtools::install_github("mayer79/MetricsWeighted")

Usage

Metrics, Scoring Functions and Performance Measures

Currently, the following metrics, scoring functions and performance measures are available:

  • accuracy(), recall(), precision(), f1_score(), and classification_error(): Typical binary performance measures derived from the confusion matrix, see e.g. (https://en.wikipedia.org/wiki/Precision_and_recall). Require binary predictions. Except for classification error, high values are better.

  • AUC() and gini_coefficient(): Area under the receiver operating characteristic (ROC) curve and the closely related Gini coefficient. Actual values must be 0 or 1, while predictions can take any value (only their order is relevant). The higher, the better.

  • deviance_bernoulli() and logLoss(): Further metrics relevant for binary targets, namely the average unit deviance of the binary logistic regression model (0-1 response) and logLoss (half that deviance). As with all deviance measures, smaller values are better.

  • mse(), mae(), mape(), rmse(), and medae(): Typical regression metrics (mean-squared error, mean absolute error, mean absolute percentage error, root-mean-squared error, and median absolute error). The lower, the better.

  • deviance_tweedie(): (Unscaled) average unit Tweedie deviance with parameter tweedie_p; see (Jorgensen 1997) and (https://en.wikipedia.org/wiki/Tweedie_distribution) for the underlying formula.

  • deviance_normal(), deviance_gamma(), and deviance_poisson(): Special cases of Tweedie. deviance_normal() equals mean-squared error.

  • elementary_score_quantile() and elementary_score_expectile(): Consistent scoring functions for quantiles and expectiles, see (Ehm et al. 2016).

  • prop_within(): Proportion of predicted values within a given band around the true values.

They all take at least four arguments (a short sketch follows this list):

  1. actual: Actual observed values.

  2. predicted: Predicted values.

  3. w: Optional vector with case weights.

  4. ...: Further arguments.
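
To illustrate this common interface with some of the metrics not shown in the examples below, here is a minimal sketch (toy vectors picked for illustration only; output omitted):

# Confusion-matrix metrics need binary actual and predicted values
act_bin <- c(0, 0, 1, 1, 1)
pred_bin <- c(0, 1, 1, 1, 0)
accuracy(act_bin, pred_bin)                         # unweighted
accuracy(act_bin, pred_bin, w = c(1, 1, 1, 2, 2))   # weighted

# Regression-type metrics work on numeric vectors
act_reg <- c(1.2, 0.8, 1.5)
pred_reg <- c(1.0, 1.0, 1.3)
mape(act_reg, pred_reg)                             # mean absolute percentage error
prop_within(act_reg, pred_reg, tol = 0.25)          # share of predictions within +-0.25
elementary_score_expectile(act_reg, pred_reg, theta = 1)  # consistent scoring function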

Examples: Regression

# The data
y_num <- iris[["Sepal.Length"]]
fit_num <- lm(Sepal.Length ~ ., data = iris)
pred_num <- fit_num$fitted
weights <- seq_len(nrow(iris))

# Performance metrics
mae(y_num, pred_num)  # unweighted
#> [1] 0.2428628
mae(y_num, pred_num, w = rep(1, length(y_num)))  # same
#> [1] 0.2428628
mae(y_num, pred_num, w = weights)  # different
#> [1] 0.2561237
rmse(y_num, pred_num)
#> [1] 0.300627
medae(y_num, pred_num, w = weights) # median absolute error
#> [1] 0.2381186

# Normal deviance equals Tweedie deviance with parameter 0
deviance_normal(y_num, pred_num)
#> [1] 0.09037657
deviance_tweedie(y_num, pred_num, tweedie_p = 0)
#> [1] 0.09037657
deviance_tweedie(y_num, pred_num, tweedie_p = -0.001)
#> [1] 0.09053778

# Poisson deviance equals Tweedie deviance with parameter 1
deviance_poisson(y_num, pred_num)
#> [1] 0.01531595
deviance_tweedie(y_num, pred_num, tweedie_p = 1)
#> [1] 0.01531595
deviance_tweedie(y_num, pred_num, tweedie_p = 1.01)
#> [1] 0.01504756

# Gamma deviance equals Tweedie deviance with parameter 2
deviance_gamma(y_num, pred_num)
#> [1] 0.002633186
deviance_tweedie(y_num, pred_num, tweedie_p = 2)
#> [1] 0.002633186
deviance_tweedie(y_num, pred_num, tweedie_p = 1.99)
#> [1] 0.002679764
deviance_tweedie(y_num, pred_num, tweedie_p = 2.01)
#> [1] 0.00258742

Examples: Binary classification

# The data
y_cat <- iris[["Species"]] == "setosa"
fit_cat <- glm(y_cat ~ Sepal.Length, data = iris, family = binomial())
pred_cat <- predict(fit_cat, type = "response")

# Performance metrics
AUC(y_cat, pred_cat)  # unweighted
#> [1] 0.9586
AUC(y_cat, pred_cat, w = weights)  # weighted
#> [1] 0.9629734
logLoss(y_cat, pred_cat)  # Logloss
#> [1] 0.2394547
deviance_bernoulli(y_cat, pred_cat)  # LogLoss * 2
#> [1] 0.4789093
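
The Gini coefficient mentioned above can be computed in the same way; a minimal sketch (output omitted), relying on the usual identity Gini = 2 * AUC - 1:

gini_coefficient(y_cat, pred_cat)               # unweighted
gini_coefficient(y_cat, pred_cat, w = weights)  # weighted
2 * AUC(y_cat, pred_cat) - 1                    # should agree with the unweighted Gini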

Generalized R-squared

Furthermore, we provide a generalization of R-squared, defined as the proportion of deviance explained, i.e., one minus the ratio of residual deviance and intercept-only deviance, see e.g. (Cohen 2003). By default, it calculates the ordinary R-squared, i.e., the proportion of normal deviance (mean-squared error) explained. However, you can specify any other deviance function, e.g. deviance_tweedie() with parameter 1.5, or the deviance of the binary logistic regression (deviance_bernoulli()).

For out-of-sample calculations, the null deviance is ideally calculated from the average in the training data. This can be controlled by setting reference_mean to the (possibly weighted) average in the training data (a short sketch follows the examples below).

Examples

summary(fit_num)$r.squared
#> [1] 0.8673123

# same
r_squared(y_num, pred_num)
#> [1] 0.8673123
r_squared(y_num, pred_num, deviance_function = deviance_tweedie, tweedie_p = 0)
#> [1] 0.8673123
r_squared(y_num, pred_num, deviance_function = deviance_tweedie, tweedie_p = 1.5)
#> [1] 0.8675195

# weighted
r_squared(y_num, pred_num, w = weights)
#> [1] 0.8300011
r_squared(y_num, pred_num, w = weights, deviance_function = deviance_gamma) 
#> [1] 0.8300644
r_squared(
  y_num, pred_num, w = weights, deviance_function = deviance_tweedie, tweedie_p = 2
)
#> [1] 0.8300644

# with respect to a custom deviance function
myTweedie <- function(actual, predicted, w = NULL, ...) {
  deviance_tweedie(actual, predicted, w, tweedie_p = 1.5, ...)
}
r_squared(y_num, pred_num, deviance_function = myTweedie)
#> [1] 0.8675195
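
As mentioned above, for out-of-sample evaluation the null model can be fixed at the training average via reference_mean. A minimal sketch with an arbitrary random split (output omitted):

set.seed(1)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]
fit_tr <- lm(Sepal.Length ~ ., data = train)
pred_te <- predict(fit_tr, newdata = test)

# Null deviance based on the training mean instead of the test mean
r_squared(test$Sepal.Length, pred_te, reference_mean = mean(train$Sepal.Length))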

Tidyverse

To facilitate the use of these metrics in a {dplyr} chain, you can try out the function performance(): starting from a data set with actual and predicted values (and optional case weights), it calculates one or more metrics and returns the resulting values as a data.frame.

Examples

library(dplyr)

fit_num <- lm(Sepal.Length ~ ., data = iris)

# Regression with `Sepal.Length` as response
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  performance("Sepal.Length", "pred")

# Same
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  performance("Sepal.Length", "pred", metrics = rmse)

# Grouped by Species
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  group_by(Species) %>% 
  do(performance(., actual = "Sepal.Length", predicted = "pred"))

# Customized output
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  performance(
    "Sepal.Length", 
    "pred", 
    value = "performance",
    metrics = list(`root-mean-squared error` = rmse)
  )

# Multiple measures
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  performance(
    "Sepal.Length", 
    "pred", 
    metrics = list(rmse = rmse, mae = mae, `R-squared` = r_squared)
  )

# Grouped by Species
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  group_by(Species) %>% 
  do(
    performance(
      ., 
      "Sepal.Length", 
      "pred",
      metrics = list(rmse = rmse, mae = mae, `R-squared` = r_squared)
    )
  )

# Passing extra argument (Tweedie p)
iris %>% 
  mutate(pred = predict(fit_num, newdata = .)) %>% 
  performance(
    "Sepal.Length", 
    "pred",
    metrics = list(
      `normal deviance` = deviance_normal, `Tweedie with p = 0` = deviance_tweedie
    ),
    tweedie_p = 0
  )

Parametrized scoring functions

Some scoring functions depend on a further parameter p, e.g.

  • deviance_tweedie(): depends on tweedie_p.
  • elementary_score_expectile(), elementary_score_quantile(): depend on theta.
  • prop_within(): depends on tol (a sketch with varying tol closes this section).

It is often important to evaluate such functions over a range of parameter values. That is where the function multi_metric() shines.

Examples

ir <- iris
ir$pred <- predict(fit_num, newdata = ir)

# Create multiple Tweedie deviance functions
multi_Tweedie <- multi_metric(deviance_tweedie, tweedie_p = c(0, seq(1, 3, by = 0.2)))
perf <- performance(
  ir, 
  actual = "Sepal.Length", 
  predicted = "pred",
  metrics = multi_Tweedie, 
  key = "Tweedie p", 
  value = "deviance"
)
perf$`Tweedie p` <- as.numeric(as.character(perf$`Tweedie p`))
head(perf)
#>   Tweedie p    deviance
#> 1       0.0 0.090376567
#> 2       1.0 0.015315945
#> 3       1.2 0.010757362
#> 4       1.4 0.007559956
#> 5       1.6 0.005316008
#> 6       1.8 0.003740296

# Deviance vs p
plot(deviance ~ `Tweedie p`, data = perf, type = "s")


# Same for Pseudo-R-Squared regarding Tweedie deviance
multi_Tweedie_r2 <- multi_metric(
  r_squared, deviance_function = deviance_tweedie, tweedie_p = c(0, seq(1, 3, by = 0.2))
)
perf <- performance(
  ir, 
  actual = "Sepal.Length", 
  predicted = "pred", 
  metrics = multi_Tweedie_r2, 
  key = "Tweedie p", 
  value = "R-squared"
)
perf$`Tweedie p` <- as.numeric(as.character(perf$`Tweedie p`))

# Values vs. p
plot(`R-squared` ~ `Tweedie p`, data = perf, type = "s")
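
As noted in the list above, the same pattern applies to prop_within() with varying tolerance tol; a minimal sketch (tol values arbitrary, output omitted), assuming multi_metric() expands tol just like tweedie_p above:

# Proportion of predictions within +-tol of the actual values, for several tol
multi_within <- multi_metric(prop_within, tol = c(0.1, 0.25, 0.5, 1))
performance(
  ir,
  actual = "Sepal.Length",
  predicted = "pred",
  metrics = multi_within,
  key = "tol",
  value = "proportion"
)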

Murphy diagrams

The same logic as in the last example can be used to create so-called Murphy diagrams (Ehm et al. 2016). The function murphy_diagram() wraps the above calls and allows you to get elementary scores for one or multiple models across a range of theta values; see also the R package murphydiagram.

Example

y <- 1:10
two_models <- cbind(m1 = 1.1 * y, m2 = 1.2 * y)
murphy_diagram(y, two_models, theta = seq(0.9, 1.3, by = 0.01))

References

Cohen, J., P. Cohen, S. G. West, and L. S. Aiken. 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. New York: Routledge. https://doi.org/10.4324/9780203774441.
Ehm, Werner, Tilmann Gneiting, Alexander Jordan, and Fabian Krüger. 2016. “Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations and Forecast Rankings.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (3): 505–62. https://doi.org/10.1111/rssb.12154.
Jorgensen, B. 1997. The Theory of Dispersion Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.