Getting Started with olr: Optimal Linear Regression

📦 Introduction

The olr package fits a linear regression for every combination of the predictor variables you supply and returns the best one. The adjr2 argument controls the selection criterion: adjr2 = FALSE returns the model with the highest R-squared, and adjr2 = TRUE returns the model with the highest adjusted R-squared.
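To make the idea of "testing all combinations" concrete, here is a minimal base-R sketch of an exhaustive subset search. It is not olr's internal code: `best_subset` is a hypothetical helper written for illustration, and it runs on the built-in mtcars data rather than the crude oil dataset.

```r
# Illustrative sketch only: NOT the olr implementation.
# `best_subset` is a hypothetical helper; mtcars ships with base R.
best_subset <- function(data, response, predictors, adjr2 = TRUE) {
  best_fit <- NULL
  best_score <- -Inf
  # Loop over subset sizes 1..p, then over every subset of that size
  for (k in seq_along(predictors)) {
    for (vars in combn(predictors, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = response), data = data)
      s <- summary(fit)
      score <- if (adjr2) s$adj.r.squared else s$r.squared
      if (score > best_score) {
        best_score <- score
        best_fit <- fit
      }
    }
  }
  best_fit
}

predictors <- c("wt", "hp", "qsec", "drat")
n_models <- 2^length(predictors) - 1  # number of non-empty predictor subsets

fit_r2 <- best_subset(mtcars, "mpg", predictors, adjr2 = FALSE)
fit_adj <- best_subset(mtcars, "mpg", predictors, adjr2 = TRUE)
```

The number of candidate models grows as 2^p - 1, so 12 predictors already mean 4095 fits. Because R-squared never decreases when a predictor is added, the R-squared criterion ends up on the full model, while the adjusted criterion may settle on a smaller one.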

📊 Load Example Dataset

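# Load the package
library(olr)

# Load the bundled example data and drop the first column,
# which is not used as a model variable
crudeoildata <- read.csv(system.file("extdata", "crudeoildata.csv", package = "olr"))
dataset <- crudeoildata[, -1]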

# Define variables
responseName <- 'CrudeOil'
predictorNames <- c('RigCount', 'API', 'FieldProduction', 'RefinerNetInput',
                    'OperableCapacity', 'Imports', 'StocksExcludingSPR',
                    'NonCommercialLong', 'NonCommercialShort',
                    'CommercialLong', 'CommercialShort', 'OpenInterest')

🔎 Run OLR Models

# Select by R-squared. R-squared never decreases when predictors are added,
# so this criterion selects the full model.
model_r2 <- olr(dataset, responseName, predictorNames, adjr2 = FALSE)

## Returning model with max R-squared.
## 
## Call:
## lm(formula = CrudeOil ~ RigCount + API + FieldProduction + RefinerNetInput + 
##     OperableCapacity + Imports + StocksExcludingSPR + NonCommercialLong + 
##     NonCommercialShort + CommercialLong + CommercialShort + OpenInterest, 
##     data = dataset)
## 
## Coefficients:
##        (Intercept)           RigCount                API    FieldProduction 
##       0.0068578950      -0.3551354134       0.0004393875       0.2670366950 
##    RefinerNetInput   OperableCapacity            Imports StocksExcludingSPR 
##       0.3535677365       0.0030449534      -0.1034192549       0.7417144521 
##  NonCommercialLong NonCommercialShort     CommercialLong    CommercialShort 
##      -0.5643353759       0.0207113857      -1.3007001952       1.8508558043 
##       OpenInterest 
##      -0.0409690597

# Adjusted R-squared model
model_adjr2 <- olr(dataset, responseName, predictorNames, adjr2 = TRUE)

## Returning model with max adjusted R-squared.
## 
## Call:
## lm(formula = CrudeOil ~ RigCount + RefinerNetInput + Imports + 
##     StocksExcludingSPR + NonCommercialLong + CommercialLong + 
##     CommercialShort, data = dataset)
## 
## Coefficients:
##        (Intercept)           RigCount    RefinerNetInput            Imports 
##        0.008256759       -0.380836990        0.322995592       -0.102405212 
## StocksExcludingSPR  NonCommercialLong     CommercialLong    CommercialShort 
##        0.694028117       -0.528991035       -1.219766893        1.676484528

📈 Visual Comparison of Model Fits

# Actual and fitted values
actual <- dataset[[responseName]]
fitted_r2 <- model_r2$fitted.values
fitted_adjr2 <- model_adjr2$fitted.values

# Data frame for ggplot
plot_data <- data.frame(
  Index = seq_along(actual),
  Actual = actual,
  R2_Fitted = fitted_r2,
  AdjR2_Fitted = fitted_adjr2
)

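# Plot each fit against the actual values
# (linewidth requires ggplot2 >= 3.4.0; older versions use size)
library(ggplot2)

ggplot(plot_data, aes(x = Index)) +
  geom_line(aes(y = Actual), color = "black", linewidth = 1, linetype = "dashed") +
  geom_line(aes(y = R2_Fitted), color = "steelblue", linewidth = 1) +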
  labs(
    title = "Full Model (R-squared): Actual vs Fitted Values",
    subtitle = "Observation Index used in place of dates (parsed from original dataset)",
    x = "Observation Index",
    y = "CrudeOil % Change"
  ) +
  theme_minimal()

ggplot(plot_data, aes(x = Index)) +
  geom_line(aes(y = Actual), color = "black", linewidth = 1, linetype = "dashed") +
  geom_line(aes(y = AdjR2_Fitted), color = "limegreen", linewidth = 1.1) +
  labs(
    title = "Optimal Model (Adjusted R-squared): Actual vs Fitted Values",
    subtitle = "Observation Index used in place of dates (parsed from original dataset)",
    x = "Observation Index",
    y = "CrudeOil % Change"
  ) +
  theme_minimal() +
  # Green border marks this as the preferred model
  theme(plot.background = element_rect(color = "limegreen", linewidth = 2))

📊 Model Comparison Summary Table

Metric	adjr2 = FALSE (All 12 Predictors)	adjr2 = TRUE (Best Subset of 7 Predictors)
Adjusted R-squared	0.6145	0.6531 ✅ (higher is better)
Multiple R-squared	0.7018	0.699
Residual Std. Error	0.02388	0.02265 ✅ (lower is better)
F-statistic (p-value)	8.042 (1.88e-07)	15.26 (3.99e-10) ✅ (stronger model)
Model Complexity	12 predictors	7 predictors ✅ (simpler, more robust)
Significant coefficients	4	6 ✅ (more signal, less noise)
Drop in multiple R² vs. full model	n/a	~0.003 ❗ (negligible)
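The metrics in the table above can all be read from summary() of a fitted lm object. The sketch below shows one way to collect them. `model_metrics` is a hypothetical helper, not part of olr, and two choices in it are assumptions: a 0.05 significance threshold and excluding the intercept from the count of significant coefficients. It uses mtcars so it runs without the crude oil data.

```r
# Hypothetical helper: collects comparison metrics from any lm object.
# Assumptions: alpha = 0.05, and the intercept is excluded from the counts.
model_metrics <- function(fit, alpha = 0.05) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  coefs <- s$coefficients
  slopes <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
  data.frame(
    r_squared = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    residual_se = s$sigma,
    f_statistic = unname(f["value"]),
    # Upper-tail probability of the overall F test
    f_p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    n_predictors = nrow(slopes),
    n_significant = sum(slopes[, "Pr(>|t|)"] < alpha)
  )
}

full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- rbind(
  cbind(model = "full", model_metrics(full)),
  cbind(model = "small", model_metrics(small))
)
print(comparison)
```

Since model_r2 and model_adjr2 print as ordinary lm fits, the same helper should apply to them, for example model_metrics(model_r2).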

✅ Best Practice Tips

The olr() function automates model selection by testing every valid predictor combination.
Use adjr2 = TRUE to prioritize models that balance accuracy and parsimony.
A small drop in raw R² is acceptable when the adjusted R² is higher: the model uses fewer variables and is less likely to be fitting noise.
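The tips above rest on how adjusted R² penalizes extra predictors. For a model with an intercept, n observations, and p predictors, the standard formula is 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below checks that formula against what summary() reports. `adj_r2` is a hypothetical helper name, and the model uses mtcars purely for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
# (p excludes the intercept). `adj_r2` is a hypothetical helper name.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Recompute by hand and compare with the value reported by summary()
manual <- adj_r2(s$r.squared, n = nrow(mtcars), p = 3)
matches <- isTRUE(all.equal(manual, s$adj.r.squared))

# Same raw R-squared, different model sizes (illustrative numbers only):
# the larger model gets the lower adjusted value.
penalty_large <- adj_r2(0.70, n = 50, p = 12)
penalty_small <- adj_r2(0.70, n = 50, p = 7)
```

For a fixed R², the factor (n - 1)/(n - p - 1) grows with p, so a predictor only raises adjusted R² if it improves the fit by more than the penalty it adds.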

📌 Summary

The adjusted R² model outperformed the full model on:
- Adjusted R²
- F-statistic
- Residual error
- Model simplicity
- Number of significant coefficients

👉 Use adjusted R² (adjr2 = TRUE) in practice to reduce the risk of overfitting and keep the model easier to interpret.

Created by Mathew Fok • Author of the olr package

Contact: quiksilver67213@yahoo.com
