8. Regression II: linear regression#
8.1. Overview#
Up to this point, we have solved all of our predictive problems—both classification
and regression—using K-nearest neighbors (K-NN)-based approaches. In the context of regression,
there is another commonly used method known as linear regression. This chapter provides an introduction
to the basic concept of linear regression, shows how to use scikit-learn
to perform linear regression in Python,
and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual,
on the case where there is a single predictor and single response variable of interest; but the chapter
concludes with an example using multivariable linear regression when there is more than one
predictor.
8.2. Chapter learning objectives#
By the end of the chapter, readers will be able to do the following:
Use Python to fit simple and multivariable linear regression models on training data.
Evaluate the linear regression model on test data.
Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set.
Describe how linear regression is affected by outliers and multicollinearity.
8.3. Simple linear regression#
At the end of the previous chapter, we noted some limitations of K-NN regression. While the method is simple and easy to understand, K-NN regression does not predict well beyond the range of the predictors in the training data, and the method gets significantly slower as the training data set grows. Fortunately, there is an alternative to K-NN regression—linear regression—that addresses both of these limitations. Linear regression is also very commonly used in practice because it provides an interpretable mathematical equation that describes the relationship between the predictor and response variables. In this first part of the chapter, we will focus on simple linear regression, which involves only one predictor variable and one response variable; later on, we will consider multivariable linear regression, which involves multiple predictor variables. Like K-NN regression, simple linear regression involves predicting a numerical response variable (like race time, house price, or height); but how it makes those predictions for a new observation is quite different from K-NN regression. Instead of looking at the K nearest neighbors and averaging over their values for a prediction, in simple linear regression, we create a straight line of best fit through the training data and then “look up” the prediction using the line.
Note
Although we did not cover it in earlier chapters, there is another popular method for classification called logistic regression (it is used for classification even though the name, somewhat confusingly, has the word “regression” in it). In logistic regression—similar to linear regression—you “fit” the model to the training data and then “look up” the prediction for each new observation. Logistic regression and K-NN classification have an advantage/disadvantage comparison similar to that of linear regression and K-NN regression. It is useful to have a good understanding of linear regression before learning about logistic regression. After reading this chapter, see the “Additional Resources” section at the end of the classification chapters to learn more about logistic regression.
Let’s return to the Sacramento housing data from Chapter 7 to learn how to apply linear regression and compare it to K-NN regression. For now, we will consider a smaller version of the housing data to help make our visualizations clear. Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict its sale price? In particular, recall that we have come across a new 2,000 square-foot house we are interested in purchasing with an advertised list price of $350,000. Should we offer the list price, or is that over/undervalued? To answer this question using simple linear regression, we use the data we have to draw the straight line of best fit through our existing data points. The small subset of data as well as the line of best fit are shown in Fig. 8.1.
The equation for the straight line is:
\(\text{house sale price} = \beta_0 + \beta_1 \cdot (\text{house size}),\)
where
\(\beta_0\) is the vertical intercept of the line (the price when house size is 0)
\(\beta_1\) is the slope of the line (how quickly the price increases as you increase house size)
Therefore using the data to find the line of best fit is equivalent to finding coefficients \(\beta_0\) and \(\beta_1\) that parametrize (correspond to) the line of best fit. Now of course, in this particular problem, the idea of a 0 square-foot house is a bit silly; but you can think of \(\beta_0\) here as the “base price,” and \(\beta_1\) as the increase in price for each square foot of space. Let’s push this thought even further: what would happen in the equation for the line if you tried to evaluate the price of a house with size 6 million square feet? Or what about negative 2,000 square feet? As it turns out, nothing in the formula breaks; linear regression will happily make predictions for crazy predictor values if you ask it to. But even though you can make these wild predictions, you shouldn’t. You should only make predictions roughly within the range of your original data, and perhaps a bit beyond it only if it makes sense. For example, the data in Fig. 8.1 only reaches around 600 square feet on the low end, but it would probably be reasonable to use the linear regression model to make a prediction at 500 square feet, say.
Back to the example! Once we have the coefficients \(\beta_0\) and \(\beta_1\), we can use the equation above to evaluate the predicted sale price given the value we have for the predictor variable—here 2,000 square feet. Fig. 8.2 demonstrates this process.
By using simple linear regression on this small data set to predict the sale price for a 2,000 square-foot house, we get a predicted value of $276,027. But wait a minute…how exactly does simple linear regression choose the line of best fit? Many different lines could be drawn through the data points. Some plausible examples are shown in Fig. 8.3.
Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average squared vertical distance between itself and each of the observed data points in the training data (equivalent to minimizing the RMSE). Fig. 8.4 illustrates these vertical distances as lines. Finally, to assess the predictive accuracy of a simple linear regression model, we use RMSPE—the same measure of predictive performance we used with K-NN regression.
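To make "minimizing the average squared vertical distance" concrete, here is a minimal sketch using a small made-up data set (the sizes, prices, and the alternative line below are purely hypothetical, not the Sacramento data). The least-squares line attains a smaller average squared error than any other line we might draw through the same points.
import numpy as np

# hypothetical toy data: house sizes (square feet) and sale prices (USD)
size = np.array([800, 1200, 1600, 2000, 2400])
price = np.array([150000, 200000, 230000, 280000, 320000])

def average_squared_error(intercept, slope):
    """Average squared vertical distance between a line and the data points."""
    predicted = intercept + slope * size
    return np.mean((price - predicted) ** 2)

# the least-squares line (computed here with np.polyfit) has the smallest
# possible average squared error; any other line does worse
slope_best, intercept_best = np.polyfit(size, price, deg=1)
print(average_squared_error(intercept_best, slope_best))  # line of best fit
print(average_squared_error(60000, 100))                  # an arbitrary alternative line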
8.4. Linear regression in Python#
We can perform simple linear regression in Python using scikit-learn
in a
very similar manner to how we performed K-NN regression.
To do this, instead of creating a KNeighborsRegressor
model object,
we use a LinearRegression
model object;
and as usual, we first have to import it from sklearn.
Another difference is that we do not need to choose \(K\) in the
context of linear regression, and so we do not need to perform cross-validation.
Below we illustrate how we can use the usual scikit-learn
workflow to predict house sale
price given house size. We use a simple linear regression approach on the full
Sacramento real estate data set.
As usual, we start by loading packages, setting the seed, loading data, and putting some test data away in a lock box that we can come back to after we choose our final model. Let’s take care of that now.
import numpy as np
import altair as alt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import set_config
# Output dataframes instead of arrays
set_config(transform_output="pandas")
np.random.seed(1)
sacramento = pd.read_csv("data/sacramento.csv")
sacramento_train, sacramento_test = train_test_split(
    sacramento, train_size=0.75
)
Now that we have our training data, we will create
and fit the linear regression model object.
We will also extract the slope of the line
via the coef_[0]
property, as well as the
intercept of the line via the intercept_
property.
# fit the linear regression model
lm = LinearRegression()
lm.fit(
    sacramento_train[["sqft"]],  # A single-column data frame
    sacramento_train["price"]  # A series
)
# make a dataframe containing slope and intercept coefficients
pd.DataFrame({"slope": [lm.coef_[0]], "intercept": [lm.intercept_]})
|   | slope | intercept |
|---|---|---|
| 0 | 137.285652 | 15642.309105 |
Note
An additional difference that you will notice here is that we do not standardize (i.e., scale and center) our predictors. In K-nearest neighbors models, recall that the model fit changes depending on whether we standardize first or not. In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!). So you can standardize if you want—it won’t hurt anything—but if you leave the predictors in their original form, the best fit coefficients are usually easier to interpret afterward.
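If you want to check this claim for yourself, here is a quick sketch that relies on the lm model and train/test split created above: it refits the same model with the predictor standardized first and confirms that the predictions match, even though the coefficients differ.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the same linear regression, but with the predictor standardized first
lm_scaled = make_pipeline(StandardScaler(), LinearRegression())
lm_scaled.fit(sacramento_train[["sqft"]], sacramento_train["price"])

# the slope coefficient changes...
print(lm_scaled.named_steps["linearregression"].coef_[0], lm.coef_[0])

# ...but the predictions do not (up to tiny numerical differences)
print(np.allclose(
    lm_scaled.predict(sacramento_test[["sqft"]]),
    lm.predict(sacramento_test[["sqft"]])
))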
Our coefficients are (intercept) \(\beta_0=\) 15642 and (slope) \(\beta_1=\) 137. This means that the equation of the line of best fit is
\(\text{house sale price} =\) 15642 \(+\) 137 \(\cdot (\text{house size}).\)
In other words, the model predicts that houses start at $15,642 for 0 square feet, and that every extra square foot increases the cost of the house by $137. Finally, we predict on the test data set to assess how well our model does.
# make predictions
sacramento_test["predicted"] = lm.predict(sacramento_test[["sqft"]])
# calculate RMSPE
RMSPE = mean_squared_error(
    y_true=sacramento_test["price"],
    y_pred=sacramento_test["predicted"]
)**(1/2)
RMSPE
85376.59691629931
Our final model’s test error as assessed by RMSPE is $85,377. Remember that this is in units of the response variable, and here that is US Dollars (USD). Does this mean our model is “good” at predicting house sale price based off of the predictor of home size? Again, answering this is tricky and requires knowledge of how you intend to use the prediction.
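We can also circle back to our original question about the 2,000 square-foot house with an advertised list price of $350,000. A quick sketch is shown below; the exact value depends on the train/test split above, but with the coefficients shown earlier it works out to roughly $290,000.
# predicted sale price for a 2,000 square-foot house, using the fitted line
lm.predict(pd.DataFrame({"sqft": [2000]}))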
To visualize the simple linear regression model, we can plot the predicted house sale price across all possible house sizes we might encounter. Since our model is linear, we only need to compute the predicted price of the minimum and maximum house size, and then connect them with a straight line. We superimpose this prediction line on a scatter plot of the original housing price data, so that we can qualitatively assess if the model seems to fit the data well. Fig. 8.5 displays the result.
sqft_prediction_grid = sacramento[["sqft"]].agg(["min", "max"])
sqft_prediction_grid["predicted"] = lm.predict(sqft_prediction_grid)
all_points = alt.Chart(sacramento).mark_circle().encode(
    x=alt.X("sqft")
        .scale(zero=False)
        .title("House size (square feet)"),
    y=alt.Y("price")
        .axis(format="$,.0f")
        .scale(zero=False)
        .title("Price (USD)")
)

sacr_preds_plot = all_points + alt.Chart(sqft_prediction_grid).mark_line(
    color="#ff7f0e"
).encode(
    x="sqft",
    y="predicted"
)
sacr_preds_plot
8.5. Comparing simple linear and K-NN regression#
Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let’s look at the visualization of the simple linear regression model predictions for the Sacramento real estate data (predicting price from house size) and the “best” K-NN regression model obtained from the same problem, shown in Fig. 8.6.
What differences do we observe in Fig. 8.6? One obvious difference is the shape of the orange lines. In simple linear regression we are restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the response variable we predict given a unit increase in the predictor variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line.
There can, however, also be a disadvantage to using a simple linear regression model in some cases, particularly when the relationship between the response variable and the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In these cases the prediction model from a simple linear regression will underfit, meaning that model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future books that may do even better at predicting with such data.
How do these two models compare on the Sacramento house prices data set? In Fig. 8.6, we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model.
Finally, note that the K-NN regression model becomes “flat” at the left and right boundaries of the data, while the linear model predicts a constant slope. Predicting outside the range of the observed data is known as extrapolation; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a negative price for a small house (if the intercept \(\beta_0\) was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of K-NN likely does not match reality.
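To see this extrapolation behavior concretely, the sketch below refits a K-NN regression model on the same training split (the choice of K=25 here is arbitrary rather than the tuned value from the previous chapter) and asks both models for a prediction far beyond the largest house in the data.
from sklearn.neighbors import KNeighborsRegressor

# refit a K-NN regression model on the same training data (K chosen arbitrarily)
knn = KNeighborsRegressor(n_neighbors=25)
knn.fit(sacramento_train[["sqft"]], sacramento_train["price"])

# a house size far outside the observed range
huge_house = pd.DataFrame({"sqft": [20000]})
print(lm.predict(huge_house))   # the linear model keeps extending the straight line
print(knn.predict(huge_house))  # K-NN returns the average price of the 25 largest houses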
8.6. Multivariable linear regression#
As in K-NN classification and K-NN regression, we can move beyond the simple case of only one predictor to the case with multiple predictors, known as multivariable linear regression. To do this, we follow a very similar approach to what we did for K-NN regression: we just specify the training data by adding more predictors. But recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. Note once again that we have the same concerns regarding multiple predictors as in the settings of multivariable K-NN regression and classification: having more predictors is not always better. But because the same predictor selection algorithm from Chapter 6 extends to the setting of linear regression, it will not be covered again in this chapter.
We will demonstrate multivariable linear regression using the Sacramento real estate
data with both house size
(measured in square feet) as well as number of bedrooms as our predictors, and
continue to use house sale price as our response variable.
The scikit-learn
framework makes this easy to do: we just need to set
both the sqft
and beds
variables as predictors, and then use the fit
method as usual.
mlm = LinearRegression()
mlm.fit(
    sacramento_train[["sqft", "beds"]],
    sacramento_train["price"]
)
LinearRegression()
Finally, we make predictions on the test data set to assess the quality of our model.
sacramento_test["predicted"] = mlm.predict(sacramento_test[["sqft","beds"]])
lm_mult_test_RMSPE = mean_squared_error(
    y_true=sacramento_test["price"],
    y_pred=sacramento_test["predicted"]
)**(1/2)
lm_mult_test_RMSPE
82331.04630202598
Our model’s test error as assessed by RMSPE is $82,331. In the case of two predictors, the predictions made by our linear regression model form a plane of best fit, as shown in Fig. 8.7.
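The plane in Fig. 8.7 is defined by two slopes (one per predictor) and an intercept, which we can extract from the fitted mlm object just as we did in the simple case. A quick sketch; the column names below are just illustrative labels.
# extract the two slopes and the intercept that define the plane of best fit
pd.DataFrame({
    "slope_sqft": [mlm.coef_[0]],
    "slope_beds": [mlm.coef_[1]],
    "intercept": [mlm.intercept_]
})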