Spreading your data across fewer dimensions makes it more understandable
Growing up, I remember my dad, an Air Force pilot, showing me around a plane’s cockpit at a pretty young age. I remember the floor-to-ceiling tapestry of buttons, switches, and meters and wondering how long it took someone to get used to all that information; wondering how long it took someone to learn how it all works. To the former, years of practice. To the latter? It wasn’t until years later when I was in college that he let me in on the secret. In fact, the internal mechanism of each of those instruments can be summed up in a bit of military jargon known as P.F.M.: Pure F****** Magic.
When performing linear regression for inference, P.F.M. is the enemy. An interpretable model is more valuable than a “black box” model, and often that comes with a (hopefully small) loss in accuracy. Often, this means putting ideas of feature interaction, polynomial regression, and other high-accuracy, low-bias procedures aside. Those models, while highly accurate, are hard to interpret and harder still to communicate.
Sticking to linear regression, and simple linear models with fewer features at that, lends a lot to a model’s interpretability. Towards that goal, below I discuss three options for reducing the number of features a linear model uses and when to use each. There are also links at the end of the article to two additional dimension-reduction methods.
For best subset selection, the idea is simple: try every single subset of features, from a single feature to all of them, and choose the subset that performs best by your metric (R², MSE, etc.). Just that simple. Try every possible combination (in the mathematical sense of the word) of features and pick the best one. If you have n features, then you have 2^n − 1 non-empty feature combos to try: 1,023 for n = 10, but over a million for n = 20. For that reason, this method isn’t feasible when you’re choosing from a very large number of features.
You might be thinking, “Wait a second. Couldn’t this method just wind up telling me to use every feature?” It could, but there’s a pretty good chance it won’t. There’s the well-known bias-variance trade-off, with the general idea that a model can be (1) inflexible (high bias, low variance) and easy to interpret, like linear regression; or (2), at the other extreme, highly flexible and non-linear (low bias, high variance) and hard to interpret. (This topic has been beaten to death; go read about it.)
There’s the interesting phenomenon, however, of a simpler, more rigid model not infrequently turning out better performance than a more sophisticated, flexible model (assuming you’re interested in its performance on test data, and when aren’t you?). So with a small enough number of features, give best subset selection a shot. The best performing model just might be a simple, interpretable one with few features.
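The exhaustive search described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `best_subset` helper, the synthetic data, and the choice of 5-fold cross-validated R² as the metric are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def best_subset(X, y, cv=5):
    """Hypothetical helper: score every non-empty feature subset by CV R^2."""
    n_features = X.shape[1]
    best_score, best_cols, n_tried = -np.inf, None, 0
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            score = cross_val_score(
                LinearRegression(), X[:, list(cols)], y, cv=cv, scoring="r2"
            ).mean()
            n_tried += 1
            if score > best_score:
                best_score, best_cols = score, cols
    return best_cols, best_score, n_tried


# Synthetic data (an assumption): only columns 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

best_cols, best_score, n_tried = best_subset(X, y)
print(best_cols, round(best_score, 3), n_tried)  # n_tried is 2**6 - 1 = 63
```

Scoring on held-out folds rather than on the training data is what keeps the search from automatically preferring the full feature set, since training R² never decreases when you add a feature.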
Stepwise selection comes in three flavors: forward selection, backward selection, and hybrid.
The general idea behind forward selection is to start with a single feature as a baseline and add in the one feature out of the remaining that best improves the model. You repeat this process until adding variables no longer improves your model. (You choose the threshold for improvement here, usually a low enough p-value.)
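As a rough sketch of forward selection, the loop below greedily adds one feature at a time. Note the assumptions: the `forward_select` helper is hypothetical, the data is synthetic, and the stopping rule is a minimum gain in cross-validated R² (`min_gain`) rather than the p-value threshold mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, min_gain=0.01, cv=5):
    """Hypothetical helper: greedily add the feature that most improves CV R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf  # so the first (baseline) feature is always added
    while remaining:
        # Score each remaining candidate when added to the current set.
        scores = {
            j: cross_val_score(
                LinearRegression(), X[:, selected + [j]], y, cv=cv, scoring="r2"
            ).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        # Stop once the best candidate no longer clears the threshold.
        if scores[j_best] - best_score < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score


# Synthetic data (an assumption): only columns 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

selected, score = forward_select(X, y)
print(selected, round(score, 3))
```

scikit-learn also ships a `SequentialFeatureSelector` class that covers the forward and backward directions, if you would rather not hand-roll the loop.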
Backward selection works the same way except in reverse. You start with all available features and remove one feature at a time until doing so no longer improves your model.
Hybrid selection is a combination of the first two. You start with a single feature, like in forward selection, but for each step, you both (1) add in the feature that most improves the model and then (2) check to see if removing any variables significantly improves your model.
Stepwise selection has an advantage over best subset selection in that it can be performed on a dataset with many more features, but it has also come under fire for being prone to error and overconfidence and for generally being unrigorous.
Stepwise selection is discussed here for its popularity, but its use is not recommended. When the number of features a dataset has is too large for best subset selection, the following two procedures, recursive feature elimination and the Lasso, are more strongly recommended.
Though it passes under a different name, recursive feature elimination (RFE) is essentially the same procedure as backward stepwise selection. RFE comes with all the same caveats that stepwise selection does, but its use has been made popular by the implementations available in scikit-learn (see the documentation for their RFE and RFECV classes, the latter being RFE with cross-validation).
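Here is a short sketch of both scikit-learn classes on synthetic data (the data and the choice of keeping two features are assumptions for illustration). One detail worth knowing: with a linear model, scikit-learn's RFE ranks features by the magnitude of their coefficients rather than by p-values, so the features should be on comparable scales.

```python
import numpy as np
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): only columns 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

# RFE: refit repeatedly, dropping the weakest feature each round,
# until the requested number of features is left.
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X, y)
print(rfe.support_)  # boolean mask of the kept features
print(rfe.ranking_)  # 1 marks a kept feature; higher means dropped earlier

# RFECV: same elimination, but cross-validation picks how many to keep.
rfecv = RFECV(LinearRegression(), step=1, cv=5, scoring="r2").fit(X, y)
print(rfecv.n_features_, rfecv.support_)
```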
The lasso is a regularization method for linear and logistic models that will naturally remove some of the features from a model’s fit, unless its hyperparameter, λ, is set to 0; increasing the value of λ increases the number of features that are removed until eventually all features have been removed.
The loss function the lasso minimizes is just the residual sum of squares (RSS) plus an additional penalty on the feature weights: RSS + λ Σ|β_j|, where the sum runs over the weights β_j of all features. The greater the value you choose for λ, the greater the penalty on feature weights and the more of them get removed. This is usually the challenge with using the lasso: finding the sweet spot for λ.
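To see the effect of λ on the number of surviving features, here is a small sketch using scikit-learn's `Lasso`, where the hyperparameter is called `alpha` and the RSS term is scaled by 1/(2n), so its values are not directly comparable to a textbook λ. The synthetic data and the particular grid of values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): only columns 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

# The penalty treats all weights alike, so put features on a common scale.
X_std = StandardScaler().fit_transform(X)

kept = {}
for alpha in [0.01, 0.1, 1.0, 5.0]:
    model = Lasso(alpha=alpha).fit(X_std, y)
    # A feature is "removed" when its weight is shrunk exactly to zero.
    kept[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {kept[alpha]} features kept")
```

In practice the sweet spot is usually found by cross-validation; scikit-learn's `LassoCV` does that search for you.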
The lasso has a huge advantage over best subset selection: even for a large number of features, the lasso is fast. Really fast. In one study by Hastie, Tibshirani, and Tibshirani, what took best subset selection 144 minutes took the lasso less than a second to complete.
Two variants on the lasso that need mentioning are the relaxed lasso and the adaptive lasso. The first, the relaxed lasso, is the Grand Prix winner in that same paper by Hastie et al., with both best performance out of all methodologies, including best subset selection, and extremely fast computation times. In its simplest form, the relaxed lasso is just the lasso performed twice. Read more about it here.
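The "lasso performed twice" idea can be sketched as follows. This is only one reading of it, under stated assumptions: the `relaxed_lasso` helper is hypothetical, the first pass selects features, and the second pass refits on just those features with a weaker penalty `alpha * phi` (with `phi` between 0 and 1). The formulation used in the paper mentioned above may differ in its details.

```python
import numpy as np
from sklearn.linear_model import Lasso


def relaxed_lasso(X, y, alpha=0.5, phi=0.1):
    """Hypothetical two-stage sketch: select with one lasso, refit with a weaker one."""
    # Stage 1: an ordinary lasso decides which features survive.
    stage1 = Lasso(alpha=alpha).fit(X, y)
    active = np.flatnonzero(stage1.coef_)
    coef = np.zeros(X.shape[1])
    if active.size == 0:
        return coef, active
    # Stage 2: refit on the survivors only, with the penalty relaxed by phi,
    # so the kept weights are shrunk less than in stage 1.
    stage2 = Lasso(alpha=alpha * phi).fit(X[:, active], y)
    coef[active] = stage2.coef_
    return coef, active


# Synthetic data (an assumption): true weights are 3 and -2 on columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

coef, active = relaxed_lasso(X, y)
print(active, np.round(coef, 2))
```

The point of the second pass is that a penalty strong enough to remove the noise features also shrinks the useful weights; relaxing it afterwards gives some of that magnitude back.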
The second honorable mention, the adaptive lasso, is the only lasso method mentioned here that has the oracle property, being consistent in both variable selection and feature weight estimation (read: reproducible). Read more about it here.
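For a feel of how the adaptive lasso works, here is a minimal sketch. It assumes the common formulation in which each weight gets its own penalty factor 1/|β_j|^γ computed from an initial ordinary least squares fit, implemented with the usual rescaling trick. The `adaptive_lasso` helper, the synthetic data, and the values of `alpha` and `gamma` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression


def adaptive_lasso(X, y, alpha=0.1, gamma=1.0):
    """Hypothetical sketch: lasso with per-feature penalties from an OLS fit."""
    # Initial estimate: features with small OLS weights get large penalties.
    ols = LinearRegression().fit(X, y)
    weights = 1.0 / np.abs(ols.coef_) ** gamma
    # Dividing column j by its penalty factor turns the weighted penalty
    # into a plain lasso penalty on the rescaled problem.
    X_scaled = X / weights
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    # Map the weights back to the original feature scale.
    return lasso.coef_ / weights


# Synthetic data (an assumption): true weights are 3 and -2 on columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

coef = adaptive_lasso(X, y)
print(np.round(coef, 2))
```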
If you have only a few features to consider (~10ish at most), then there’s no reason not to just brute-force your data with best subset selection. If, however, you want an informative model, and not necessarily a model with the best raw predictive power OR if you have a large number of features to consider, then go for one of the Lasso variants. Stepwise selection (and RFE), while something of an industry standard, is facing some controversy in the theory-minded community: proceed with caution.
(All bolding is my own.)
An Introduction to Statistical Learning: with Applications in R, G. James, D. Witten, T. Hastie, and R. Tibshirani, 2017. New York: Springer.
Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson, 2019.
“Best Subset, Forward Stepwise, or Lasso?”, Trevor Hastie, Robert Tibshirani, Ryan J. Tibshirani.
“The Relaxed Lasso: A Better Way to Fit Linear and Logistic Models”, Jehan Gonsal, 2018.
And as promised at the beginning of this article, links to resources on two more methods for reducing the number of features, one on principal component analysis (PCA) and one on partial least squares (PLS) regression. (PCA is unsupervised; PLS also uses the response variable, so it is better described as a supervised method.)