5. Classification I: training & predicting#

5.1. Overview#

In previous chapters, we focused solely on descriptive and exploratory data analysis questions. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy.

5.2. Chapter learning objectives#

By the end of the chapter, readers will be able to do the following:

  • Recognize situations where a classifier would be appropriate for making predictions.

  • Describe what a training data set is and how it is used in classification.

  • Interpret the output of a classifier.

  • Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.

  • Explain the K-nearest neighbors classification algorithm.

  • Perform K-nearest neighbors classification in Python using scikit-learn.

  • Use methods from scikit-learn to center, scale, balance, and impute data as a preprocessing step.

  • Combine preprocessing and model training into a Pipeline using make_pipeline.

5.3. The classification problem#

In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” based on the email’s text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other variables (sometimes called features).

Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set; this name comes from the fact that we use these data to train, or teach, our classifier. Once taught, we can use the classifier to make predictions on new data for which we do not know the class.

There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will focus on the widely used K-nearest neighbors algorithm [Cover and Hart, 1967, Fix and Hodges, 1951]. In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many variations on the basic classification problem. For example, we focus on the setting of binary classification where only two classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold).

5.4. Exploring a data set#

In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [Street et al., 1993]. Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). Diagnosis for each image was conducted by physicians.

As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this question is important because traditional, non-data-driven methods for tumor diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumors are not normally dangerous; the cells stay in the same place, and the tumor stops growing before it gets very large. By contrast, in malignant tumors, the cells invade the surrounding tissue and spread into nearby organs, where they can cause serious damage [Stanford Health Care, 2021]. Thus, it is important to quickly and accurately diagnose the tumor type to guide patient treatment.

5.4.1. Loading the cancer data#

Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the pandas and altair packages needed for our analysis.

import pandas as pd
import altair as alt

In this case, the file containing the breast cancer data set is a .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents:

cancer = pd.read_csv("data/wdbc.csv")
cancer
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry Fractal_Dimension
0 842302 M 1.096100 -2.071512 1.268817 0.983510 1.567087 3.280628 2.650542 2.530249 2.215566 2.253764
1 842517 M 1.828212 -0.353322 1.684473 1.907030 -0.826235 -0.486643 -0.023825 0.547662 0.001391 -0.867889
2 84300903 M 1.578499 0.455786 1.565126 1.557513 0.941382 1.052000 1.362280 2.035440 0.938859 -0.397658
3 84348301 M -0.768233 0.253509 -0.592166 -0.763792 3.280667 3.399917 1.914213 1.450431 2.864862 4.906602
4 84358402 M 1.748758 -1.150804 1.775011 1.824624 0.280125 0.538866 1.369806 1.427237 -0.009552 -0.561956
... ... ... ... ... ... ... ... ... ... ... ... ...
564 926424 M 2.109139 0.720838 2.058974 2.341795 1.040926 0.218868 1.945573 2.318924 -0.312314 -0.930209
565 926682 M 1.703356 2.083301 1.614511 1.722326 0.102368 -0.017817 0.692434 1.262558 -0.217473 -1.057681
566 926954 M 0.701667 2.043775 0.672084 0.577445 -0.839745 -0.038646 0.046547 0.105684 -0.808406 -0.894800
567 927241 M 1.836725 2.334403 1.980781 1.733693 1.524426 3.269267 3.294046 2.656528 2.135315 1.042778
568 92751 B -1.806811 1.220718 -1.812793 -1.346604 -3.109349 -1.149741 -1.113893 -1.260710 -0.819349 -0.560539

569 rows × 12 columns

5.4.2. Describing the variables in the cancer data set#

Breast tumors can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been standardized (centered and scaled); we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is:

  1. ID: identification number

  2. Class: the diagnosis (M = malignant or B = benign)

  3. Radius: the mean of distances from center to points on the perimeter

  4. Texture: the standard deviation of gray-scale values

  5. Perimeter: the length of the surrounding contour

  6. Area: the area inside the contour

  7. Smoothness: the local variation in radius lengths

  8. Compactness: the ratio of squared perimeter and area

  9. Concavity: severity of concave portions of the contour

  10. Concave Points: the number of concave portions of the contour

  11. Symmetry: how similar the nucleus is when mirrored

  12. Fractal Dimension: a measurement of how “rough” the perimeter is
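Since we will return to what standardization means later in the chapter, a quick numeric sketch may help here. This is a minimal illustration with a made-up column of values, not the preprocessing actually applied to this data set:

```python
import pandas as pd

# Hypothetical raw measurements (not from the cancer data set)
raw = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# Standardize: subtract the mean, then divide by the standard deviation
standardized = (raw - raw.mean()) / raw.std()

# The standardized values have mean 0 and standard deviation 1
print(standardized)
```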

Below we use the info method to preview the data frame. This method can make it easier to inspect the data when we have a lot of columns: it prints only the column names down the page (instead of across), as well as their data types and the number of non-missing entries.

cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 569 non-null    int64  
 1   Class              569 non-null    object 
 2   Radius             569 non-null    float64
 3   Texture            569 non-null    float64
 4   Perimeter          569 non-null    float64
 5   Area               569 non-null    float64
 6   Smoothness         569 non-null    float64
 7   Compactness        569 non-null    float64
 8   Concavity          569 non-null    float64
 9   Concave_Points     569 non-null    float64
 10  Symmetry           569 non-null    float64
 11  Fractal_Dimension  569 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.5+ KB

From the summary of the data above, we can see that Class is of type object. We can use the unique method on the Class column to see all unique values present in that column. We see that there are two diagnoses: benign, represented by "B", and malignant, represented by "M".

cancer["Class"].unique()
array(['M', 'B'], dtype=object)

We will improve the readability of our analysis by renaming "M" to "Malignant" and "B" to "Benign" using the replace method. The replace method takes one argument: a dictionary that maps previous values to desired new values. We will verify the result using the unique method.

cancer["Class"] = cancer["Class"].replace({
    "M": "Malignant",
    "B": "Benign"
})
cancer["Class"].unique()

array(['Malignant', 'Benign'], dtype=object)

5.4.3. Exploring the cancer data#

Before we start doing any modeling, let’s explore our data set. Below we use the groupby and size methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with groupby, size counts the number of observations for each value of the Class variable. Then we calculate the percentage in each group by dividing by the total number of observations and multiplying by 100. The total number of observations equals the number of rows in the data frame, which we can access via the shape attribute of the data frame (shape[0] is the number of rows and shape[1] is the number of columns). We have 357 (63%) benign and 212 (37%) malignant tumor observations.

100 * cancer.groupby("Class").size() / cancer.shape[0]
Benign       62.741652
Malignant    37.258348
dtype: float64

The pandas package also has a more convenient specialized value_counts method for counting the number of occurrences of each value in a column. If we pass no arguments to the method, it outputs a series containing the number of occurrences of each value. If we instead pass the argument normalize=True, it outputs the fraction of occurrences of each value.

cancer["Class"].value_counts()
Benign       357
Malignant    212
Name: count, dtype: int64

cancer["Class"].value_counts(normalize=True)
Benign       0.627417
Malignant    0.372583
Name: proportion, dtype: float64

Next, let’s draw a colored scatter plot to visualize the relationship between the perimeter and concavity variables. Recall that the default palette in altair is colorblind-friendly, so we can stick with that here.

perim_concav = alt.Chart(cancer).mark_circle().encode(
    x=alt.X("Perimeter").title("Perimeter (standardized)"),
    y=alt.Y("Concavity").title("Concavity (standardized)"),
    color=alt.Color("Class").title("Diagnosis")
)
perim_concav

Fig. 5.1 Scatter plot of concavity versus perimeter colored by diagnosis label.#

In Fig. 5.1, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. In other words, benign observations tend to have lower concavity and perimeter values, and malignant ones tend to have larger values. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumor class), and that its standardized perimeter and concavity values are both 1. Based on the scatter plot, how might you classify that new observation? The point would lie in the middle of the orange cloud of malignant points, so we could probably classify it as malignant. Based on our visualization, it seems like it may be possible to make accurate predictions of the Class variable (i.e., a diagnosis) for tumor images with unknown diagnoses.

5.5. Classification with K-nearest neighbors#

In order to actually make predictions for new observations in practice, we will need a classification algorithm. In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign or malignant), the K-nearest neighbors classifier generally finds the \(K\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. \(K\) is a number that we must choose in advance; for now, we will assume that someone has chosen \(K\) for us. We will cover how to choose \(K\) ourselves in the next chapter.

To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a new observation, with standardized perimeter of 2.0 and standardized concavity of 4.0, whose diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in Fig. 5.2.

Fig. 5.2 Scatter plot of concavity versus perimeter with new observation represented as a red diamond.#

Fig. 5.3 shows that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis.

Fig. 5.3 Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label.#

Suppose we have another new observation with standardized perimeter 0.2 and concavity of 3.3. Looking at the scatter plot in Fig. 5.4, how would you classify this red, diamond observation? The nearest neighbor to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points.

Fig. 5.4 Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label.#

To improve the prediction we can consider several neighboring points, say \(K = 3\), that are closest to the new observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. As shown in Fig. 5.5, we see that the diagnoses of 2 of the 3 nearest neighbors to our new observation are malignant. Therefore we take majority vote and classify our new red, diamond observation as malignant.

Fig. 5.5 Scatter plot of concavity versus perimeter with three nearest neighbors.#

Here we chose the \(K=3\) nearest observations, but there is nothing special about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \(K\) in the next chapter.
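The majority vote described above can be sketched in a few lines of plain Python. This is a toy illustration with made-up neighbor labels, not the scikit-learn classifier we will use shortly:

```python
from collections import Counter

# Hypothetical diagnoses of the K = 3 nearest neighbors to a new observation
neighbor_labels = ["Malignant", "Malignant", "Benign"]

# The prediction is the most common label among the neighbors
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # Malignant
```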

5.5.1. Distance between points#

We decide which points are the \(K\) “nearest” to our new observation using the straight-line distance (we will often just refer to this as distance). Suppose we have two observations \(a\) and \(b\), each having two predictor variables, \(x\) and \(y\). Denote \(a_x\) and \(a_y\) to be the values of variables \(x\) and \(y\) for observation \(a\); \(b_x\) and \(b_y\) have similar definitions for observation \(b\). Then the straight-line distance between observation \(a\) and \(b\) on the x-y plane can be computed using the following formula:

\[\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\]
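As a quick check of this formula, here is the distance from a new observation at (0, 3.5) to a training point at roughly (0.24, 2.65); these coordinates are taken from one of the training observations that appears later in this section:

```python
# Straight-line (Euclidean) distance on the perimeter-concavity plane
distance = ((0 - 0.241202) ** 2 + (3.5 - 2.653051) ** 2) ** 0.5
print(round(distance, 2))  # 0.88
```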

To find the \(K\) nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the \(K\) observations corresponding to the \(K\) smallest distance values. For example, suppose we want to use \(K=5\) neighbors to classify a new observation with perimeter 0.0 and concavity 3.5, shown as a red diamond in Fig. 5.6. Let’s calculate the distances between our new point and each of the observations in the training set to find the \(K=5\) neighbors that are nearest to our new point. As you will see in the code below, we compute the straight-line distance using the formula above: we square the differences between the two observations’ perimeter and concavity coordinates, add the squared differences, and then take the square root. In order to find the \(K=5\) nearest neighbors, we will use the nsmallest function from pandas.

Fig. 5.6 Scatter plot of concavity versus perimeter with new observation represented as a red diamond.#

new_obs_Perimeter = 0
new_obs_Concavity = 3.5
cancer["dist_from_new"] = (
      (cancer["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer["Concavity"] - new_obs_Concavity) ** 2
) ** (1 / 2)
cancer.nsmallest(5, "dist_from_new")[[
    "Perimeter",
    "Concavity",
    "Class",
    "dist_from_new"
]]
Perimeter Concavity Class dist_from_new
112 0.241202 2.653051 Benign 0.880626
258 0.750277 2.870061 Malignant 0.979663
351 0.622700 2.541410 Malignant 1.143088
430 0.416930 2.314364 Malignant 1.256806
152 -1.160091 4.039155 Benign 1.279258

In Table 5.1 we show in mathematical detail how we computed the dist_from_new variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data.

Table 5.1 Evaluating the distances from the new observation to each of its 5 nearest neighbors#

| Perimeter | Concavity | Distance | Class |
|---|---|---|---|
| 0.24 | 2.65 | \(\sqrt{(0 - 0.24)^2 + (3.5 - 2.65)^2} = 0.88\) | Benign |
| 0.75 | 2.87 | \(\sqrt{(0 - 0.75)^2 + (3.5 - 2.87)^2} = 0.98\) | Malignant |
| 0.62 | 2.54 | \(\sqrt{(0 - 0.62)^2 + (3.5 - 2.54)^2} = 1.14\) | Malignant |
| 0.42 | 2.31 | \(\sqrt{(0 - 0.42)^2 + (3.5 - 2.31)^2} = 1.26\) | Malignant |
| -1.16 | 4.04 | \(\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} = 1.28\) | Benign |
The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in Fig. 5.7.

Fig. 5.7 Scatter plot of concavity versus perimeter with 5 nearest neighbors circled.#

5.5.2. More than two explanatory variables#

Although the above description is directed toward two predictor variables, exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have \(m\) predictor variables for two observations \(a\) and \(b\), i.e., \(a = (a_{1}, a_{2}, \dots, a_{m})\) and \(b = (b_{1}, b_{2}, \dots, b_{m})\).

The distance formula becomes

\[\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.\]

This formula still corresponds to a straight-line distance, just in a space with more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables, and then took the square root. Now we will do the same, except for our three variables. We calculate the distance as follows

\[\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.\]
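This arithmetic can be verified directly:

```python
# Distance between the two three-variable observations described above
distance = ((0 - 0.417) ** 2 + (3.5 - 2.31) ** 2 + (1 - 0.837) ** 2) ** 0.5
print(round(distance, 2))  # 1.27
```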

Let’s calculate the distances between our new observation and each of the observations in the training set to find the \(K=5\) neighbors when we have these three predictors.

new_obs_Perimeter = 0
new_obs_Concavity = 3.5
new_obs_Symmetry = 1
cancer["dist_from_new"] = (
      (cancer["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer["Concavity"] - new_obs_Concavity) ** 2
    + (cancer["Symmetry"] - new_obs_Symmetry) ** 2
) ** (1 / 2)
cancer.nsmallest(5, "dist_from_new")[[
    "Perimeter",
    "Concavity",
    "Symmetry",
    "Class",
    "dist_from_new"
]]
Perimeter Concavity Symmetry Class dist_from_new
430 0.416930 2.314364 0.836722 Malignant 1.267368
400 1.334664 2.886368 1.099359 Malignant 1.472326
562 0.470430 2.084810 1.154075 Malignant 1.499268
68 -1.365450 2.812359 1.092064 Benign 1.531594
351 0.622700 2.541410 2.055065 Malignant 1.555575

Based on \(K=5\) nearest neighbors with these three predictors, we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors belong to the malignant class. Fig. 5.8 shows what the data look like when we visualize them as a 3-dimensional scatter plot with lines from the new observation to its five nearest neighbors.
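The full procedure we just walked through (compute distances, take the \(K\) smallest, and take a majority vote) can be collected into one small function. The sketch below uses a tiny made-up training set and a hypothetical helper name (`knn_predict`); it illustrates the algorithm by hand, not the scikit-learn workflow we will adopt for real analyses:

```python
import pandas as pd

def knn_predict(train, predictors, class_col, new_obs, k=5):
    """Classify new_obs by majority vote among its k nearest training points."""
    # Euclidean distance from new_obs to every training observation
    dist = sum((train[p] - new_obs[p]) ** 2 for p in predictors) ** 0.5
    # Majority class among the k smallest distances
    neighbors = train.loc[dist.nsmallest(k).index, class_col]
    return neighbors.mode().iloc[0]

# Hypothetical toy training set (values loosely inspired by the example above)
train = pd.DataFrame({
    "Perimeter": [0.4, 1.3, 0.5, -1.4, 0.6],
    "Concavity": [2.3, 2.9, 2.1, 2.8, 2.5],
    "Symmetry":  [0.8, 1.1, 1.2, 1.1, 2.1],
    "Class": ["Malignant", "Malignant", "Malignant", "Benign", "Malignant"],
})
new_obs = {"Perimeter": 0, "Concavity": 3.5, "Symmetry": 1}
prediction = knn_predict(train, ["Perimeter", "Concavity", "Symmetry"],
                         "Class", new_obs, k=5)
print(prediction)  # Malignant
```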