Interpretability (and, relatedly, communicability), flexibility, and scalability are desirable properties for statistical techniques. Many newer techniques are quite flexible and work well on large data sets but are “black boxes” when it comes to interpretation, making it hard to understand the conditions under which their predictions are accurate (or how long they will remain so). Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2013) and designed to minimize parametric assumptions while maintaining interpretive clarity. That interpretability and flexibility, however, come at the cost of scalability: most of the calculations require comparing each observation with every other observation, and therefore involve many computationally costly N x N operations. We introduce bigKRLS, an updated version of the original KRLS R package, with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly fit KRLS models to medium and large data sets (N > ~2,500).
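For orientation, the model in brief (following Hainmueller and Hazlett 2013): KRLS forms an N x N Gaussian kernel matrix over the (standardized) rows of X and solves a regularized least squares problem with a closed-form solution,

$$K_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{\sigma^2}\right), \qquad \hat{c} = \underset{c}{\arg\min}\ (y - Kc)^\top (y - Kc) + \lambda\, c^\top K c = (K + \lambda I)^{-1} y, \qquad \hat{y} = K\hat{c},$$

so both fitting and the subsequent pointwise derivatives work with N x N objects; that is the source of the scalability challenge.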
bigKRLS() is the workhorse of this package; it requires only two basic inputs: a vector of N observations on the dependent variable, y, and an N x P matrix X, where P is the number of independent variables (i.e., ncol(X)).1
mtcars[1:5,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Suppose we want to regress fuel efficiency on the other observables.
reg.out <- bigKRLS(y = as.matrix(mtcars$mpg),
                   X = as.matrix(mtcars[,-1]), noisy = FALSE)
## 2 cores will be used.
## ***************
## .Intermediate output will be displayed from all cores at once and so look a little wonky :)
##
##
## All done. You may wish to use summary() for more detail, predict() for out-of-sample forecasts, or shiny.bigKRLS() to interact with results. Type vignette("bigKRLS_basics") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.
Unlike classical regression, the algorithm does not directly estimate a single slope (the “average” marginal effect). Instead, KRLS first estimates the marginal effect dy/dxp at each observation (given that observation’s pairwise distances to all of the other observations). For example, we may want to know the effect the number of gears has on a particular vehicle’s fuel efficiency.
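Those pointwise estimates are returned in reg.out$derivatives, an N x P matrix with one row per observation and one column per predictor, in the column order of X. A minimal sketch (gear is the ninth column of mtcars[,-1]):

# Estimated pointwise effect of gear on the Toyota Corolla's fuel efficiency
reg.out$derivatives[grep("Corolla", rownames(mtcars)), 9]

Averaging each column of that matrix yields the Average Marginal Effects reported by summary():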
summary(reg.out)
##
##
## MODEL SUMMARY:
##
## R2: 0.9405
## R2AME**: 0.7609
##
## Average Marginal Effects:
##
## Estimate Std. Error t value Pr(>|t|)
## cyl -0.0733 0.0765 -0.9581 0.3484
## disp -0.0037 0.0022 -1.6485 0.1135
## hp -0.0145 0.0040 -3.5903 0.0016
## drat 1.2786 0.6303 2.0287 0.0548
## wt -1.4940 0.3115 -4.7968 0.0001
## qsec 0.3678 0.1741 2.1123 0.0462
## vs* 4.5256 2.2089 2.0488 0.0526
## am* 3.5671 2.1310 1.6739 0.1083
## gear 1.0847 0.2867 3.7828 0.0010
## carb -0.5951 0.1518 -3.9200 0.0007
##
##
## Percentiles of Marginal Effects:
##
## 5% 25% 50% 75% 95%
## cyl -0.3219 -0.1071 -0.0182 0.0065 0.0406
## disp -0.0108 -0.0080 -0.0040 -0.0011 0.0044
## hp -0.0372 -0.0258 -0.0138 -0.0009 0.0038
## drat -0.4892 0.6973 1.1559 2.1076 3.4894
## wt -3.5756 -1.9904 -1.2308 -0.7641 -0.2853
## qsec -0.1250 0.0504 0.2099 0.5900 1.2423
## vs* -1.3312 3.1018 4.5509 6.4373 9.3575
## am* -2.4799 1.2282 3.1377 6.6635 9.8087
## gear -0.4504 -0.1866 0.8420 1.9432 3.4149
## carb -1.1902 -1.0159 -0.5844 -0.2336 0.0107
##
## (*) Reported average and percentiles of dy/dx is for discrete change of the dummy variable from min to max (usually 0 to 1).
##
##
## (**) Pseudo-R^2 computed using only the Average Marginal Effects. If only a subset of marginal effects were estimated, Pseudo-R^2 calculated with that subset.
##
##
## You may also wish to use predict() for out-of-sample forecasts or shiny.bigKRLS() to interact with results. Type vignette("bigKRLS_basics") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.
The “Percentiles of Marginal Effects” can be interpreted as evidence about whether y is a monotonic function of xp and about the extent, if any, to which the effect of xp on y is homogeneous. In this toy data set, the number of cylinders is not a statistically significant predictor of fuel efficiency; perhaps unsurprisingly, the marginal effect of cylinders is negative for only about half of the cars. By contrast, horsepower has a more uniformly negative effect on fuel efficiency.
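These percentiles are simply column-wise quantiles of the pointwise derivatives, so they can be recomputed (or extended to other quantiles) directly; for example:

# Column-wise quantiles of the N x P matrix of pointwise marginal effects
apply(reg.out$derivatives, 2, quantile, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))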
Suppose you wanted to plot how similar a Toyota Corolla is to the other four-cylinder cars:
# kernel similarities between the Corolla and all four-cylinder cars
s <- reg.out$K[which(mtcars$cyl == 4), grep("Corolla", rownames(mtcars))]

barplot(s, main = "Similarity to a Toyota Corolla",
        ylab = "Kernel", sub = "Toy Data from mtcars", cex.names = .7,
        col = colorRampPalette(c("red", "blue"))(length(s))[rank(s)],
        # label each bar with the model name (the second word of the row name)
        names.arg = lapply(strsplit(rownames(mtcars), split = " "),
                           function(x) x[2])[which(mtcars$cyl == 4)])
Apparently my Corolla is more similar to a Civic than a Porsche 914, but more tests are needed… Note that on exceedingly large data sets, you may wish to grab the relevant subset first, standardize that X data, and then call bGaussKernel() on the smaller set.
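For example, something along these lines (a sketch; it assumes bGaussKernel() accepts the standardized matrix and a bandwidth, with sigma = ncol(X) as in the KRLS default):

# Kernel for just the four-cylinder cars rather than all N rows.
# Drop mpg (the outcome) and cyl, which is constant within this subset.
X4 <- scale(as.matrix(mtcars[mtcars$cyl == 4, -c(1, 2)]))
# sigma = ncol(X4) mirrors the KRLS default bandwidth (an assumption here)
K4 <- bGaussKernel(X4, sigma = ncol(X4))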
It appears that fuel efficiency decreases as horsepower increases, but the effect isn’t quite monotonic. To investigate further, we might ask whether the outcome is an additive function of horsepower by plotting horsepower’s pointwise marginal effects against horsepower itself:
# column 3 of the derivatives matrix corresponds to hp in X = mtcars[,-1]
scatter.smooth(mtcars$hp, reg.out$derivatives[,3],
               ylab = "HP's Effect", xlab = "Horsepower", pch = 19, bty = "n",
               main = "Horsepower's Marginal Effect on Fuel Efficiency",
               sub = "Toy Data from mtcars",
               col = colorRampPalette(c("blue", "red"))(nrow(mtcars))[rank(reg.out$coeffs^2)],
               ylim = c(-0.042, 0.015), xlim = c(50, 400))
The above graph suggests that, although lower horsepower generally helps explain which cars have better fuel efficiency, beyond a certain threshold that is no longer the case (or perhaps log horsepower is the more relevant predictor).
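If that were the working hypothesis, one quick check is to refit with horsepower logged; a sketch (the object names here are just for illustration):

# Refit with log(hp) in place of hp; the column keeps the name "hp"
Xlog <- as.matrix(mtcars[,-1])
Xlog[, "hp"] <- log(Xlog[, "hp"])
reg.log <- bigKRLS(y = as.matrix(mtcars$mpg), X = Xlog, noisy = FALSE)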
To interact with your results in a pop up window or your browser, simply call:
shiny.bigKRLS(reg.out) # not run
To remove the big square matrices so that you can easily put your results up on a server, use export:
shiny.bigKRLS(reg.out, export = TRUE) # not run
The output will describe the new, more compact object that has been created.
Let’s say we wanted to know what percentage of cars would have lower gas mileage if they had 200 horsepower.
Xnew <- mtcars[,-1]
Xnew$hp <- 200   # give every car 200 horsepower
forecast <- predict(reg.out, as.matrix(Xnew))
## *
mean(forecast$fit < mtcars$mpg)
## [1] 0.6875
If N > 2,500 or if you supplied big matrices, using save() and load() will crash your R session (the saved memory addresses will no longer be valid). Instead, do one of two things to save:
out <- bigKRLS(y, X, model_subfolder_name = "my_results") # not run
save.bigKRLS(out, "my_results") # not run
Either will save the model estimates to a new subfolder called “my_results” in your current working directory. To re-load,
load.bigKRLS("my_results") # not run
When N > 2,500, or when the user provides big matrices, big matrices will be returned. These objects are really just memory addresses, i.e., pointers to the underlying data:
Z <- big.matrix(nrow = 5, ncol = 5, init = 1)   # big.matrix() comes from bigmemory
Z
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x1020df460>
You do not necessarily need to work with the big square matrices in the output. But if you do, and you’re confident they fit in memory, just use the square brackets:
Z[]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 1 1 1 1 1
## [3,] 1 1 1 1 1
## [4,] 1 1 1 1 1
## [5,] 1 1 1 1 1
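Square-bracket subsetting works the same way, so you can pull a few rows or columns into an ordinary R matrix without materializing the whole object:

# Extract only the first two rows into base R
Z[1:2, ]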
X and y should only contain numeric data (no missing data, factors, or vectors of constants) and may be base R matrices or “big” matrices (from bigmemory).↩