rtForecastEval compares two real-time forecasters on
a common time grid. The schematic below matches the analysis order in
this vignette: prepare a long tibble, run the global delta
test (calc_Z → calc_eig →
calc_pval), and optionally summarize
pointwise loss with calc_L_s2 and
plot_pcb.
Typical workflow: global test (top) and pointwise loss plot (bottom).
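As a compact preview, here is the same pipeline in code. This is a sketch, not a runnable chunk: the signatures are copied from the calls made later in this vignette, `df` stands for the prepared long tibble, and `nsamp` / `ngame` (grid points and games) are defined in the setup below.
# Sketch of the full pipeline (arguments as used later in this vignette)
Z   <- calc_Z(df = df, pA = "phat_A", pB = "phat_B", Y = "Y",
              nsamp = nsamp, ngame = ngame)
eig <- calc_eig(df = df, n_eig = 10, ngame = ngame,
                nsamp = nsamp, grid = "grid", cent = FALSE)
pv  <- calc_pval(Z, eig = eig, quan = c(0.90, 0.95, 0.99), n_MC = 5000)
ls  <- calc_L_s2(df = df, pA = "phat_A", pB = "phat_B")
plot_pcb(df = ls)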
The paper (preprint arXiv:2010.00781) has two kinds of graphics: (1) methodological figures that can be reproduced with this package from simulated data, and (2) NBA figures built from scraped play-by-play data. The second kind is not self-contained in
rtForecastEval: it requires the scraped NBA
play-by-play inputs, the load_nba_data() /
pre_process() pipeline, and several bespoke plotting
scripts.
Reproduce those figures from the separate replication repository (RTPForNBA; historically bundled with the paper’s supplementary code), not from this package alone:
| Idea in the paper | Typical script (replication repo) | Depends on |
|---|---|---|
| Calibration surface (3D: time × forecast × centered residual) | plotting/surfacePlot.R | nba_FD.R (fits logit, builds my_df), plot3D, akima, lattice, binned CIs |
| Reliability-style plot at one time (e.g. mid-game) | plotting/calibrationPlot.R (see also end of surfacePlot.R) | Same prepared df_try object |
| Skill curves / \(\hat\Delta_n(t)\) for ESPN vs models (PgRS, LTW, etc.) | plotting/1_PgRS.R, 2_LTW.R, 3_CF.R, … | utility.R (calc_L_s2, L_smoothing), nba_FD.R |
| Simulation figures (oracle, score difference, etc.) | simulation/PF-sim/plot_sim_*.R | simulation/PF-sim/simulator.R and generated outputs |
Suggested order in the replication repo: prepare
data under data/, run nba_FD.R (or the season
scripts it sources), then source() the relevant file under
plotting/ with working directory set to that project
root.
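As a minimal sketch of that order (the paths assume the replication repo's layout described above; the exact plotting script depends on which figure you want):
# Hypothetical sourcing order in the replication repo
setwd("~/RTPForNBA")              # working directory = project root
source("nba_FD.R")                # fits the logit model, prepares my_df / df_try
source("plotting/surfacePlot.R")  # e.g. the 3D calibration surface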
The rest of this vignette focuses on (1).
library(rtForecastEval)
library(dplyr)
library(rlist) # for list.rbind()
library(ggplot2)
library(tibble)
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
nsamp <- 100 # number of in-game events (grid points)
ngame <- 200 # number of games (README example uses the same for comparable figures)
# Parameters for the eigenvalue computation and the simulated p-values
D <- 10 # number of eigenvalues to keep
N_MC <- 5000 # Monte Carlo draws for simulating the p-value
# Squared-error (Brier) loss
L <- function(x, y) {
  (x - y)^2
}
# Data generation ---------------------------------------------------------
df_equ <- df_gen(N = nsamp, Ngame = ngame) %>%
group_by(grid) %>%
mutate(
p_bar_12 = mean(phat_A - phat_B),
diff_non_cent = phat_A - phat_B,
diff_cent = phat_A - phat_B - p_bar_12
) %>% ungroup()
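df_gen() returns the long tibble the rest of the vignette assumes: one row per (game, grid) pair, with the binary outcome and both forecasts. A quick way to inspect it:
# One row per (game, grid): Y is the binary outcome, phat_A / phat_B the
# two forecasts; the centered differences were added in the chunk above
dplyr::glimpse(df_equ)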
# Apply our test (explicit construction, as in utility.R) ----------------
Z <- df_equ %>% group_by(grid) %>%
summarise(delta_n = mean(L(phat_A, Y) - L(phat_B, Y))) %>%
{
sum((.)$delta_n^2) / nsamp * ngame
}
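In the notation of the paper, with \(T =\) nsamp grid points and \(G =\) ngame games, this chunk computes

\[
Z = \frac{G}{T} \sum_{t} \hat\Delta_n(t)^2,
\qquad
\hat\Delta_n(t) = \frac{1}{G} \sum_{g=1}^{G} \Big[ L\big(\hat p_A^{(g)}(t), Y_g\big) - L\big(\hat p_B^{(g)}(t), Y_g\big) \Big].
\]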
# Kernel matrix of the (non-centered) forecast differences:
# K[i, j] = (1/ngame) * sum over games of diff(t_i) * diff(t_j)
temp <- df_equ %>% group_split(grid, .keep = FALSE)
eigV_hat <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_non_cent %*% temp[[j]]$diff_non_cent / ngame)
  })
}) %>% list.rbind() %>% {
  # Top-D eigenvalues only (no eigenvectors needed)
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    # Rescale matrix eigenvalues to the integral-operator scale
    (.) / nsamp
  }
# Same computation with the per-grid-centered differences
eigV_til <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_cent %*% temp[[j]]$diff_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.) / nsamp
  }
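Here \(\hat\lambda_k\) are the top \(D\) eigenvalues of the empirical kernel of the forecast differences \(d_g(t) = \hat p_A^{(g)}(t) - \hat p_B^{(g)}(t)\) (centered by their per-grid mean in the eigV_til version); dividing by nsamp converts eigenvalues of the \(T \times T\) matrix into approximations of the integral-operator eigenvalues. Under \(H_0\) the statistic is compared against the weighted chi-square mixture that the next chunk simulates:

\[
\hat K(s, t) = \frac{1}{G} \sum_{g=1}^{G} d_g(s)\, d_g(t),
\qquad
Z \;\overset{H_0}{\approx}\; \sum_{k=1}^{D} \hat\lambda_k\, \chi^2_{1,k},
\]

with independent chi-square variables \(\chi^2_{1,k}\), each with one degree of freedom.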
# Simulate the null distribution: a weighted sum of D independent
# chi-square(1) draws, weighted by the estimated eigenvalues
MC_hat <- sapply(1:N_MC, function(x) {
  crossprod(eigV_hat, rchisq(D, df = 1))
})
q_90_hat <- quantile(MC_hat, 0.90)
q_95_hat <- quantile(MC_hat, 0.95)
q_99_hat <- quantile(MC_hat, 0.99)
MC_til <- sapply(1:N_MC, function(x) {
  crossprod(eigV_til, rchisq(D, df = 1))
})
q_90_til <- quantile(MC_til, 0.90)
q_95_til <- quantile(MC_til, 0.95)
q_99_til <- quantile(MC_til, 0.99)
# Monte Carlo p-value from the non-centered null distribution (the
# centered version is nearly identical here, so it is reported for
# both rows of the summary below)
p_hat <- 1 - ecdf(MC_hat)(Z)
tibble(
type = c("non-center", "center"),
Z = rep(Z, 2),
"pval" = c(p_hat, p_hat),
"90%" = c(q_90_hat, q_90_til),
"95%" = c(q_95_hat, q_95_til),
"99%" = c(q_99_hat, q_99_til)
)
#> # A tibble: 2 × 6
#> type Z pval `90%` `95%` `99%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 non-center 0.0574 0.719 0.442 0.627 0.991
#> 2 center     0.0574 0.719 0.449 0.631  1.04
to_center <- FALSE
# The same test via the package API; up to Monte Carlo noise this
# reproduces the explicit construction above
ZZ <- calc_Z(df = df_equ, pA = "phat_A", pB = "phat_B", Y = "Y", nsamp = nsamp, ngame = ngame)
eigg <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = to_center
)
oh <- calc_pval(ZZ, eig = eigg, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)
temp <- calc_L_s2(df = df_equ, pA = "phat_A", pB = "phat_B")
plot_pcb(df = temp)
Pointwise mean loss difference (A vs B) with naive normal band — simulation setting. This is a skill curve, not a calibration diagram.
The previous figure tracks skill (which forecaster
incurs lower Brier loss over time). A complementary check is
marginal calibration: within a narrow time slice, do
predicted probabilities match observed event frequencies? Grey
vertical segments show a 95% central range for
\(\bar Y\) under \(H_0\) in each bin: \(n\bar Y \sim \mathrm{Binomial}(n, p)\) with
\(p = \overline{\hat p}\), using
exact qbinom bounds (avoiding the normal
approximation plus clipping, which can distort small-sample bands).
The paper’s NBA figures use richer calibration surfaces
(time × forecast × residual); here we only bin
forecasts at a single grid point (closest to mid-game, \(t \approx 0.5\)) and plot mean outcome vs
mean forecast — a standard reliability diagram.
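To make the band construction concrete, here is the exact qbinom interval for a single hypothetical bin (the numbers are illustrative, not from the simulation):
# Exact central 95% range for mean(Y) in a bin of n = 20 games whose
# mean forecast is p = 0.6: divide the Binomial quantiles by n
stats::qbinom(c(0.025, 0.975), size = 20, prob = 0.6) / 20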
# Grid point closest to mid-game (t ~ 0.5)
g_mid <- df_equ %>%
  distinct(grid) %>%
  slice_min(order_by = abs(grid - 0.5), n = 1) %>%
  pull(grid)
slice_t <- df_equ %>%
filter(grid == g_mid) %>%
transmute(
game,
Y,
phat_A,
phat_B
)
rel_long <- slice_t %>%
tidyr::pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") %>%
mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))
# Decile bins of the forecasts, separately for each forecaster
rel_binned <- rel_long %>%
  group_by(forecaster) %>%
  mutate(bin = ntile(phat, 10)) %>%
group_by(forecaster, bin) %>%
summarise(
mean_forecast = mean(phat),
mean_outcome = mean(Y),
n_games = dplyr::n(),
.groups = "drop"
) %>%
  mutate(
    # Clamp the bin's mean forecast into [0, 1], then take exact central
    # 95% Binomial quantiles for the band
    p_bin = pmin(pmax(mean_forecast, 0), 1),
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )
ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
geom_errorbar(
aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
width = 0.02,
color = "grey55",
alpha = 0.85,
linewidth = 0.35
) +
geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
facet_wrap(~forecaster, nrow = 1) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
labs(
title = "Binned reliability (calibration) at one grid",
subtitle = paste0(
"Grid = ", signif(g_mid, 3),
" (closest to 0.5); grey: exact 95% Binomial range for mean(Y) under H0"
),
x = "Mean forecast in bin",
y = "Mean outcome (Y) in bin",
size = "Games"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = ggplot2::element_blank(),
plot.title = ggplot2::element_text(face = "bold")
  )
Binned reliability diagram at a fixed grid (closest to 0.5): mean outcome vs mean forecast for A and B; grey bars are the exact 95% central Binomial range for mean(Y) under H0. Points near the diagonal indicate better marginal calibration at that time.
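A one-number summary of each panel (not part of rtForecastEval; sketched here from the binned data above) is the games-weighted mean absolute gap between forecast and outcome, smaller being better:
# Weighted mean absolute calibration gap per forecaster
rel_binned %>%
  group_by(forecaster) %>%
  summarise(
    mean_abs_gap = weighted.mean(abs(mean_outcome - mean_forecast), w = n_games)
  )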
The reliability diagram shows levels on the 45° plot; here we plot calibration errors \(\bar Y - \bar{\hat p}_k\) against the bin midpoint. The dashed line is the difference of these errors, \(\mathrm{cal\_err}_B - \mathrm{cal\_err}_A = \bar{\hat p}_A - \bar{\hat p}_B\).
# Bin games (same grid point as above) into deciles of the forecast
# midpoint (phat_A + phat_B) / 2; `cal_bins` is built from slice_t
cal_bins <- slice_t %>%
  mutate(
    mid = (phat_A + phat_B) / 2,
    bin = ntile(mid, 10)
  )
cal_compare <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),
    cal_err_B = mean(Y) - mean(phat_B),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin))
cal_long <- cal_compare %>%
tidyr::pivot_longer(
c(cal_err_A, cal_err_B),
names_to = "which",
values_to = "cal_err"
) %>%
mutate(
which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
)
ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
geom_line(linewidth = 0.85) +
geom_point(aes(size = n_games), alpha = 0.85) +
geom_line(
data = cal_compare,
aes(mid_hat, cal_err_B - cal_err_A),
inherit.aes = FALSE,
color = "grey35",
linetype = "dashed",
linewidth = 0.65
) +
labs(
title = "Calibration comparison (same midpoint bins)",
subtitle = "Solid: mean calibration error per forecaster; dashed: mean(phat_B) - mean(phat_A)",
x = "Mean (phat_A + phat_B) / 2 in bin",
y = "mean(Y) - mean(phat)",
color = "Curve",
size = "Games"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = ggplot2::element_blank(),
plot.title = ggplot2::element_text(face = "bold"),
legend.position = "bottom"
)
Mean calibration error per forecaster (solid) and difference in mean forecasts (dashed), binned by (phat_A + phat_B) / 2 at the same grid as the reliability figure.