Regression Diagnostics by Period using REPS

Introduction

The calculate_regression_diagnostics() function in REPS provides regression diagnostics by period. It is designed for panel or repeated cross-section data (e.g. property transactions over time) to evaluate the quality of period-specific log-linear regressions.

For each period, it:

Fits a log-linear regression model: log(price) ~ covariates
Computes diagnostics:
- Shapiro-Wilk p-value (normality)
- Adjusted R-squared (linearity)
- Durbin-Watson test (autocorrelation)
- Breusch-Pagan test (heteroscedasticity)

These diagnostics help assess model quality over time, identifying periods with issues like non-normality, low fit, heteroscedasticity, or autocorrelation.
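The per-period idea can be sketched in base R. This is an illustrative sketch on simulated data, not the internal implementation of calculate_regression_diagnostics(): the data frame `toy`, its coefficients, and the noise level are assumptions made for the example, and only two of the four diagnostics (Shapiro-Wilk and adjusted R-squared) are computed. The column names norm_pvalue and r_adjust are borrowed from the package output for readability.

```r
# Simulated repeated cross-section: 3 periods, 50 transactions each
# (illustrative data only, not the REPS example dataset)
set.seed(1)
n_per_period <- 50
toy <- data.frame(
  period = rep(c("2008Q1", "2008Q2", "2008Q3"), each = n_per_period),
  floor_area = runif(3 * n_per_period, min = 50, max = 200)
)
toy$price <- exp(8 + 1.0 * log(toy$floor_area) + rnorm(nrow(toy), sd = 0.1))

# Fit log(price) ~ log(floor_area) separately for each period
per_period <- lapply(split(toy, toy$period), function(d) {
  fit <- lm(log(price) ~ log(floor_area), data = d)
  data.frame(
    period = d$period[1],
    norm_pvalue = shapiro.test(residuals(fit))$p.value, # Shapiro-Wilk on residuals
    r_adjust = summary(fit)$adj.r.squared               # adjusted R-squared
  )
})

# Stack the one-row results into a single table, one row per period
toy_diagnostics <- do.call(rbind, per_period)
rownames(toy_diagnostics) <- NULL
print(toy_diagnostics)
```

Splitting the data first and fitting one model per subset keeps each period's residuals separate, so a problem in one quarter cannot be masked by well-behaved quarters around it.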

Required Data

Your dataset should include:

A period variable (e.g. quarterly/annual codes)
A dependent variable (typically price)
One or more numerical independent variables (e.g. floor area)
Optionally, categorical independent variables (e.g. neighbourhood codes)
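Before running the diagnostics it can help to confirm that the expected columns are present. The helper below, check_required_columns(), is hypothetical and not part of REPS; the small data frame `toy` is made up for the example.

```r
# Hypothetical helper (not part of REPS): stop early if a required column is missing
check_required_columns <- function(dataset, required) {
  missing_cols <- setdiff(required, names(dataset))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up two-row dataset with a period, a dependent and a numerical variable
toy <- data.frame(
  period = c("2008Q1", "2008Q2"),
  price = c(500000, 520000),
  floor_area = c(100, 110)
)

check_required_columns(toy, c("period", "price", "floor_area"))

# The dependent and numerical variables should be numeric
stopifnot(is.numeric(toy$price), is.numeric(toy$floor_area))
```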

# Example dataset (you should already have this loaded)
head(data_constraxion)
#>   period   price floor_area dist_trainstation neighbourhood_code
#> 1 2008Q1 1142226  127.41917       2.887992985                  E
#> 2 2008Q1  667664   88.70604       2.903955192                  D
#> 3 2008Q1  636207  107.26257       8.250659447                  B
#> 4 2008Q1  777841  112.65725       0.005760792                  E
#> 5 2008Q1  795527  108.08537       1.842145127                  E
#> 6 2008Q1  539206   97.87751       6.375981360                  D
#>   dummy_large_city
#> 1                0
#> 2                1
#> 3                1
#> 4                0
#> 5                0
#> 6                1

# Log-transform floor_area (see the vignette on calculating the price index for the rationale)
dataset <- data_constraxion
dataset$floor_area <- log(dataset$floor_area)

Using `calculate_regression_diagnostics()`

Example:

diagnostics <- calculate_regression_diagnostics(
  dataset = dataset,
  period_variable = "period",
  dependent_variable = "price",
  numerical_variables = c("floor_area", "dist_trainstation"),
  categorical_variables = c("dummy_large_city", "neighbourhood_code")
)

head(diagnostics)
#>   period norm_pvalue  r_adjust  bp_pvalue autoc_pvalue autoc_dw
#> 1 2008Q1   0.9586930 0.8633499 0.74178260 0.5842200307 2.038772
#> 2 2008Q2   0.8191076 0.8607036 0.81813032 0.9540503936 2.274047
#> 3 2008Q3   0.4560750 0.8825515 0.15220690 0.3246547621 1.924436
#> 4 2008Q4   0.9064669 0.9098143 0.97583499 0.7436197200 2.108734
#> 5 2009Q1   0.4036003 0.8624850 0.04268543 0.4948207614 2.003177
#> 6 2009Q2   0.4644423 0.9002921 0.32760619 0.0007476682 1.487031
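The output table can also be screened programmatically. The sketch below re-enters the six rows printed above as `diagnostics_head` (a name chosen for this example) and applies the cut-offs that the interpretation section of this vignette describes: 0.05 for the p-values, 0.6 for adjusted R-squared, and 1.75 to 2.25 for the Durbin-Watson statistic. The flag column names are assumptions, not REPS output.

```r
# The six rows shown by head(diagnostics), typed in by hand for a self-contained example
diagnostics_head <- data.frame(
  period = c("2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2"),
  norm_pvalue = c(0.9586930, 0.8191076, 0.4560750, 0.9064669, 0.4036003, 0.4644423),
  r_adjust = c(0.8633499, 0.8607036, 0.8825515, 0.9098143, 0.8624850, 0.9002921),
  bp_pvalue = c(0.74178260, 0.81813032, 0.15220690, 0.97583499, 0.04268543, 0.32760619),
  autoc_pvalue = c(0.5842200307, 0.9540503936, 0.3246547621, 0.7436197200,
                   0.4948207614, 0.0007476682),
  autoc_dw = c(2.038772, 2.274047, 1.924436, 2.108734, 2.003177, 1.487031)
)

# One logical flag per check (flag names are illustrative)
flags <- data.frame(
  period = diagnostics_head$period,
  non_normal = diagnostics_head$norm_pvalue < 0.05,
  low_fit = diagnostics_head$r_adjust < 0.6,
  heteroscedastic = diagnostics_head$bp_pvalue < 0.05,
  dw_out_of_range = diagnostics_head$autoc_dw < 1.75 | diagnostics_head$autoc_dw > 2.25,
  dw_significant = diagnostics_head$autoc_pvalue <= 0.05
)

# Keep only the periods where at least one flag is raised
flagged <- flags[rowSums(flags[, -1]) > 0, ]
print(flagged)
```

With these cut-offs, three of the six periods are flagged: 2008Q2 (Durbin-Watson statistic just above 2.25, although its p-value is not significant), 2009Q1 (Breusch-Pagan p-value below 0.05), and 2009Q2 (both Durbin-Watson checks).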

Visualizing Diagnostics

For convenient visualization:

plot_regression_diagnostics(diagnostics)

This generates a 3x2 grid (three rows, two columns) holding five plots:

Normality (p-value Shapiro-Wilk)
Linearity (Adjusted R-squared)
Autocorrelation (Durbin-Watson statistic)
Autocorrelation (p-value Durbin-Watson)
Heteroscedasticity (p-value Breusch-Pagan)

Interpreting the Output

The hedonic price index relies on a log-linear regression model, which assumes that certain statistical conditions hold. The diagnostics plot provides an overview of how well these assumptions are met across different periods.

Each subplot corresponds to a specific model assumption:

Row 1: Normality and Linearity

Shapiro-Wilk test (left plot)
- Shows p-values for the normality of residuals.
- A p-value below 0.05 (dashed red line) indicates a potential violation of the normality assumption.
Adjusted R-squared (right plot)
- Reflects the explanatory power of the regression model.
- Values below 0.6 (dashed red line) may indicate a weak linear relationship.

Row 2: Independence

Durbin-Watson statistic (left plot)
- Tests for first-order autocorrelation in the residuals.
- The statistic ranges from 0 to 4; a value around 2 indicates no autocorrelation.
- Values outside the 1.75–2.25 range (dashed lines) suggest potential autocorrelation.
Durbin-Watson p-value (right plot)
- Indicates whether autocorrelation is statistically significant.
- p > 0.05: no significant evidence of autocorrelation.
- p ≤ 0.05: residuals may not be independent.
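To see why 2 is the reference value, the Durbin-Watson statistic can be computed by hand as the sum of squared successive residual differences divided by the sum of squared residuals. The sketch below uses simulated data; the function name durbin_watson() and all variables are assumptions for this example and do not show how REPS obtains the statistic or its p-value.

```r
set.seed(42)
n <- 200
x <- rnorm(n)

# Independent errors: the statistic should be close to 2
y_indep <- 1 + 2 * x + rnorm(n)

# AR(1) errors with positive correlation 0.7: the statistic should fall well below 2
e_ar <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y_ar <- 1 + 2 * x + e_ar

# Hypothetical helper: Durbin-Watson statistic from a fitted lm object
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

dw_indep <- durbin_watson(lm(y_indep ~ x))
dw_ar <- durbin_watson(lm(y_ar ~ x))
print(c(independent = dw_indep, autocorrelated = dw_ar))
```

Because the statistic compares each residual with the one before it, the result depends on the order of the rows in the data.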

Row 3: Homoscedasticity

Breusch-Pagan p-value
- Tests whether residuals have constant variance.
- A p-value below 0.05 (dashed red line) suggests heteroscedasticity (non-constant variance).
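The idea behind the Breusch-Pagan test can be sketched in base R: regress the squared residuals on the model's covariates and check whether that auxiliary regression explains anything. The code below implements the studentized (Koenker) form, in which the test statistic is n times the auxiliary R-squared. Whether REPS uses this variant is not stated in this vignette, so treat the function breusch_pagan_pvalue() and the simulated data as assumptions.

```r
set.seed(7)
n <- 300
x <- runif(n, 1, 10)
y_const <- 1 + 0.5 * x + rnorm(n, sd = 1)        # constant error variance
y_hetero <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x) # error sd grows with x

# Hypothetical helper: studentized Breusch-Pagan p-value for a fitted lm object
breusch_pagan_pvalue <- function(fit) {
  e2 <- residuals(fit)^2
  X <- model.matrix(fit)                 # design matrix, including the intercept
  aux <- lm.fit(X, e2)                   # auxiliary regression of e^2 on the covariates
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  lm_stat <- length(e2) * r2             # LM statistic: n * R-squared
  df <- ncol(X) - 1                      # number of covariates, excluding the intercept
  pchisq(lm_stat, df = df, lower.tail = FALSE)
}

p_const <- breusch_pagan_pvalue(lm(y_const ~ x))
p_hetero <- breusch_pagan_pvalue(lm(y_hetero ~ x))
print(c(constant = p_const, heteroscedastic = p_hetero))
```

A small p-value means the squared residuals vary systematically with the covariates, which is the pattern the dashed 0.05 line in the plot is meant to catch.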

Summary

The calculate_regression_diagnostics() and plot_regression_diagnostics() functions in REPS enable:

Period-by-period regression checking
Easy comparison of assumptions over time
Detection of problematic periods

They support robust, high-quality hedonic price index modeling by systematically checking regression assumptions.
