# 5. Classification I: training & predicting

## 5.1. Overview

In previous chapters, we focused solely on descriptive and exploratory
data analysis questions.
This chapter and the next together serve as our first
foray into answering *predictive* questions about data. In particular, we will
focus on *classification*, i.e., using one or more
variables to predict the value of a categorical variable of interest. This chapter
will cover the basics of classification, how to preprocess data to make it
suitable for use in a classifier, and how to use our observed data to make
predictions. The next chapter will focus on how to evaluate how accurate the
predictions from our classifier are, as well as how to improve our classifier
(where possible) to maximize its accuracy.

## 5.2. Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

- Recognize situations where a classifier would be appropriate for making predictions.
- Describe what a training data set is and how it is used in classification.
- Interpret the output of a classifier.
- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
- Explain the K-nearest neighbors classification algorithm.
- Perform K-nearest neighbors classification in Python using `scikit-learn`.
- Use methods from `scikit-learn` to center, scale, balance, and impute data as a preprocessing step.
- Combine preprocessing and model training into a `Pipeline` using `make_pipeline`.

## 5.3. The classification problem

In many situations, we want to make predictions based on the current situation
as well as past experiences. For instance, a doctor may want to diagnose a
patient as either diseased or healthy based on their symptoms and the doctor’s
past experience with patients; an email provider might want to tag a given
email as “spam” or “not spam” based on the email’s text and past email text data;
or a credit card company may want to predict whether a purchase is fraudulent based
on the current purchase item, amount, and location as well as past purchases.
These tasks are all examples of **classification**, i.e., predicting a
categorical class (sometimes called a *label*) for an observation given its
other variables (sometimes called *features*).

Generally, a classifier assigns an observation without a known class (e.g., a new patient)
to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations
for which we do know the class (e.g., previous patients with known diseases and
symptoms). These observations with known classes that we use as a basis for
prediction are called a **training set**; this name comes from the fact that
we use these data to train, or teach, our classifier. Once taught, we can use
the classifier to make predictions on new data for which we do not know the class.

There are many possible methods that we could use to predict
a categorical class/label for an observation. In this book, we will
focus on the widely used **K-nearest neighbors** algorithm [Cover and Hart, 1967, Fix and Hodges, 1951].
In your future studies, you might encounter decision trees, support vector machines (SVMs),
logistic regression, neural networks, and more; see the additional resources
section at the end of the next chapter for where to begin learning more about
these other methods. It is also worth mentioning that there are many
variations on the basic classification problem. For example,
we focus on the setting of **binary classification** where only two
classes are involved (e.g., a diagnosis of either healthy or diseased), but you may
also run into multiclass classification problems with more than two
categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold).

## 5.4. Exploring a data set

In this chapter and the next, we will study a data set of
digitized breast cancer image features,
created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [Street *et al.*, 1993].
Each row in the data set represents an
image of a tumor sample, including the diagnosis (benign or malignant) and
several other measurements (nucleus texture, perimeter, area, and more).
Diagnosis for each image was conducted by physicians.

As with all data analyses, we first need to formulate a precise question that
we want to answer. Here, the question is *predictive*: can
we use the tumor
image measurements available to us to predict whether a future tumor image
(with unknown diagnosis) shows a benign or malignant tumor? Answering this
question is important because traditional, non-data-driven methods for tumor
diagnosis are quite subjective and dependent upon how skilled and experienced
the diagnosing physician is. Furthermore, benign tumors are not normally
dangerous; the cells stay in the same place, and the tumor stops growing before
it gets very large. By contrast, in malignant tumors, the cells invade the
surrounding tissue and spread into nearby organs, where they can cause serious
damage [Stanford Health Care, 2021].
Thus, it is important to quickly and accurately diagnose the tumor type to
guide patient treatment.

### 5.4.1. Loading the cancer data

Our first step is to load, wrangle, and explore the data using visualizations
in order to better understand the data we are working with. We start by
loading the `pandas` and `altair` packages needed for our analysis.

```
import pandas as pd
import altair as alt
```

In this case, the file containing the breast cancer data set is a `.csv`
file with headers. We’ll use the `read_csv` function with no additional
arguments, and then inspect its contents:

```
cancer = pd.read_csv("data/wdbc.csv")
cancer
```

|     | ID       | Class | Radius    | Texture   | Perimeter | Area      | Smoothness | Compactness | Concavity | Concave_Points | Symmetry  | Fractal_Dimension |
|-----|----------|-------|-----------|-----------|-----------|-----------|------------|-------------|-----------|----------------|-----------|-------------------|
| 0   | 842302   | M     | 1.096100  | -2.071512 | 1.268817  | 0.983510  | 1.567087   | 3.280628    | 2.650542  | 2.530249       | 2.215566  | 2.253764          |
| 1   | 842517   | M     | 1.828212  | -0.353322 | 1.684473  | 1.907030  | -0.826235  | -0.486643   | -0.023825 | 0.547662       | 0.001391  | -0.867889         |
| 2   | 84300903 | M     | 1.578499  | 0.455786  | 1.565126  | 1.557513  | 0.941382   | 1.052000    | 1.362280  | 2.035440       | 0.938859  | -0.397658         |
| 3   | 84348301 | M     | -0.768233 | 0.253509  | -0.592166 | -0.763792 | 3.280667   | 3.399917    | 1.914213  | 1.450431       | 2.864862  | 4.906602          |
| 4   | 84358402 | M     | 1.748758  | -1.150804 | 1.775011  | 1.824624  | 0.280125   | 0.538866    | 1.369806  | 1.427237       | -0.009552 | -0.561956         |
| ... | ...      | ...   | ...       | ...       | ...       | ...       | ...        | ...         | ...       | ...            | ...       | ...               |
| 564 | 926424   | M     | 2.109139  | 0.720838  | 2.058974  | 2.341795  | 1.040926   | 0.218868    | 1.945573  | 2.318924       | -0.312314 | -0.930209         |
| 565 | 926682   | M     | 1.703356  | 2.083301  | 1.614511  | 1.722326  | 0.102368   | -0.017817   | 0.692434  | 1.262558       | -0.217473 | -1.057681         |
| 566 | 926954   | M     | 0.701667  | 2.043775  | 0.672084  | 0.577445  | -0.839745  | -0.038646   | 0.046547  | 0.105684       | -0.808406 | -0.894800         |
| 567 | 927241   | M     | 1.836725  | 2.334403  | 1.980781  | 1.733693  | 1.524426   | 3.269267    | 3.294046  | 2.656528       | 2.135315  | 1.042778          |
| 568 | 92751    | B     | -1.806811 | 1.220718  | -1.812793 | -1.346604 | -3.109349  | -1.149741   | -1.113893 | -1.260710      | -0.819349 | -0.560539         |

569 rows × 12 columns

### 5.4.2. Describing the variables in the cancer data set

Breast tumors can be diagnosed by performing a *biopsy*, a process where
tissue is removed from the body and examined for the presence of disease.
Traditionally these procedures were quite invasive; modern methods such as fine
needle aspiration, used to collect the present data set, extract only a small
amount of tissue and are less invasive. Based on a digital image of each breast
tissue sample collected for this data set, ten different variables were measured
for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
for each variable across the nuclei was recorded. As part of the
data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
means and why we do it later in this chapter. Each image additionally was given
a unique ID and a diagnosis by a physician. Therefore, the
total set of variables per image in this data set is:

- ID: identification number
- Class: the diagnosis (M = malignant or B = benign)
- Radius: the mean of distances from center to points on the perimeter
- Texture: the standard deviation of gray-scale values
- Perimeter: the length of the surrounding contour
- Area: the area inside the contour
- Smoothness: the local variation in radius lengths
- Compactness: the ratio of squared perimeter and area
- Concavity: severity of concave portions of the contour
- Concave Points: the number of concave portions of the contour
- Symmetry: how similar the nucleus is when mirrored
- Fractal Dimension: a measurement of how “rough” the perimeter is

Below we use the `info` method to preview the data frame. This method can
make it easier to inspect the data when we have a lot of columns:
it prints only the column names down the page (instead of across),
as well as their data types and the number of non-missing entries.

```
cancer.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   ID                 569 non-null    int64
 1   Class              569 non-null    object
 2   Radius             569 non-null    float64
 3   Texture            569 non-null    float64
 4   Perimeter          569 non-null    float64
 5   Area               569 non-null    float64
 6   Smoothness         569 non-null    float64
 7   Compactness        569 non-null    float64
 8   Concavity          569 non-null    float64
 9   Concave_Points     569 non-null    float64
 10  Symmetry           569 non-null    float64
 11  Fractal_Dimension  569 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.5+ KB
```

From the summary of the data above, we can see that `Class` is of type `object`.
We can use the `unique` method on the `Class` column to see all unique values
present in that column. We see that there are two diagnoses:
benign, represented by `"B"`, and malignant, represented by `"M"`.

```
cancer["Class"].unique()
```

```
array(['M', 'B'], dtype=object)
```

We will improve the readability of our analysis
by renaming `"M"` to `"Malignant"` and `"B"` to `"Benign"`
using the `replace` method. The `replace` method takes one argument: a dictionary that maps
previous values to desired new values.
We will verify the result using the `unique` method.

```
cancer["Class"] = cancer["Class"].replace({
    "M": "Malignant",
    "B": "Benign"
})
cancer["Class"].unique()
```

```
array(['Malignant', 'Benign'], dtype=object)
```

### 5.4.3. Exploring the cancer data

Before we start doing any modeling, let’s explore our data set. Below we use
the `groupby` and `size` methods to find the number and percentage
of benign and malignant tumor observations in our data set. When paired with
`groupby`, `size` counts the number of observations for each value of the `Class`
variable. Then we calculate the percentage in each group by dividing by the total
number of observations and multiplying by 100.
The total number of observations equals the number of rows in the data frame,
which we can access via the `shape` attribute of the data frame
(`shape[0]` is the number of rows and `shape[1]` is the number of columns).
We have 357 (63%) benign and 212 (37%) malignant tumor observations.

```
100 * cancer.groupby("Class").size() / cancer.shape[0]
```

```
Class
Benign 62.741652
Malignant 37.258348
dtype: float64
```

The `pandas` package also has a more convenient specialized `value_counts` method for
counting the number of occurrences of each value in a column. If we pass no arguments
to the method, it outputs a series containing the number of occurrences
of each value. If we instead pass the argument `normalize=True`, it outputs the fraction
of occurrences of each value.

```
cancer["Class"].value_counts()
```

```
Class
Benign 357
Malignant 212
Name: count, dtype: int64
```

```
cancer["Class"].value_counts(normalize=True)
```

```
Class
Benign 0.627417
Malignant 0.372583
Name: proportion, dtype: float64
```

Next, let’s draw a colored scatter plot to visualize the relationship between the
perimeter and concavity variables. Recall that the default palette in `altair`
is colorblind-friendly, so we can stick with that here.

```
perim_concav = alt.Chart(cancer).mark_circle().encode(
    x=alt.X("Perimeter").title("Perimeter (standardized)"),
    y=alt.Y("Concavity").title("Concavity (standardized)"),
    color=alt.Color("Class").title("Diagnosis")
)
perim_concav
```

In Fig. 5.1, we can see that malignant observations typically fall in
the upper right-hand corner of the plot area. By contrast, benign
observations typically fall in the lower left-hand corner of the plot. In other words,
benign observations tend to have lower concavity and perimeter values, and malignant
ones tend to have larger values. Suppose we
obtain a new observation not in the current data set that has all the variables
measured *except* the label (i.e., an image without the physician’s diagnosis
for the tumor class). We could compute the standardized perimeter and concavity values,
resulting in values of, say, 1 and 1. Could we use this information to classify
that observation as benign or malignant? Based on the scatter plot, how might
you classify that new observation? If the standardized concavity and perimeter
values are 1 and 1 respectively, the point would lie in the middle of the
orange cloud of malignant points and thus we could probably classify it as
malignant. Based on our visualization, it seems like
it may be possible to make accurate predictions of the `Class` variable
(i.e., a diagnosis) for tumor images with unknown diagnoses.

## 5.5. Classification with K-nearest neighbors

In order to actually make predictions for new observations in practice, we will need a classification algorithm. In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign or malignant), the K-nearest neighbors classifier generally finds the \(K\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. \(K\) is a number that we must choose in advance; for now, we will assume that someone has chosen \(K\) for us. We will cover how to choose \(K\) ourselves in the next chapter.

To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a new observation, with standardized perimeter of 2.0 and standardized concavity of 4.0, whose diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in Fig. 5.2.

Fig. 5.3 shows that the nearest point to this new observation is
**malignant** and located at the coordinates (2.1,
3.6). The idea here is that if a point is close to another
in the scatter plot, then the perimeter and concavity values are similar,
and so we may expect that they would have the same diagnosis.

Suppose we have another new observation with standardized perimeter
0.2 and concavity of 3.3. Looking at the
scatter plot in Fig. 5.4, how would you classify this red,
diamond observation? The nearest neighbor to this new point is a
**benign** observation at (0.2, 2.7).
Does this seem like the right prediction to make for this observation? Probably
not, if you consider the other nearby points.

To improve the prediction we can consider several
neighboring points, say \(K = 3\), that are closest to the new observation
to predict its diagnosis class. Among those 3 closest points, we use the
*majority class* as our prediction for the new observation. As shown in Fig. 5.5, we
see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
are malignant. Therefore we take majority vote and classify our new red, diamond
observation as malignant.

Here we chose the \(K=3\) nearest observations, but there is nothing special about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \(K\) in the next chapter.
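To make the procedure concrete, here is a minimal sketch of K-nearest neighbors classification in plain Python. It is illustrative only: the `knn_predict` helper and the toy points are made up for this example (they are not the cancer data), and the straight-line distance it relies on, via Python’s built-in `math.dist`, is defined formally in the next section.

```python
import math
from collections import Counter

def knn_predict(new_point, train_points, train_labels, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Straight-line distance from the new point to every training point
    distances = [math.dist(new_point, p) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy points loosely mimicking the standardized perimeter/concavity plot
train_points = [(2.1, 3.6), (1.5, 3.0), (0.2, 2.7), (-1.0, -0.5), (-0.8, 0.1)]
train_labels = ["Malignant", "Malignant", "Benign", "Benign", "Benign"]

knn_predict((2.0, 4.0), train_points, train_labels, k=3)  # 'Malignant'
```

Here two of the three nearest toy points are malignant, so the majority vote returns `'Malignant'`.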

### 5.5.1. Distance between points

We decide which points are the \(K\) “nearest” to our new observation using the
*straight-line distance* (we will often just refer to this as *distance*).
Suppose we have two observations \(a\) and \(b\), each having two predictor
variables, \(x\) and \(y\). Denote \(a_x\) and \(a_y\) to be the values of variables
\(x\) and \(y\) for observation \(a\); \(b_x\) and \(b_y\) have similar definitions for
observation \(b\). Then the straight-line distance between observation \(a\) and
\(b\) on the x-y plane can be computed using the following formula:

\[\mathrm{Distance} = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}\]
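As a quick sanity check of the straight-line distance calculation, we can compute it directly in Python for two hand-picked points (the values below are illustrative, not a required analysis step):

```python
# Straight-line distance between a = (a_x, a_y) and b = (b_x, b_y)
a = (0.0, 3.5)
b = (0.24, 2.65)
distance = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
round(distance, 2)  # 0.88
```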
To find the \(K\) nearest neighbors to our new observation, we compute the distance
from that new observation to each observation in our training data, and select the \(K\) observations corresponding to the
\(K\) *smallest* distance values. For example, suppose we want to use \(K=5\) neighbors to classify a new
observation with perimeter 0.0 and
concavity 3.5, shown as a red diamond in Fig. 5.6. Let’s calculate the distances
between our new point and each of the observations in the training set to find
the \(K=5\) neighbors that are nearest to our new point.
As you will see in the code below, we compute the straight-line
distance using the formula above: we square the differences between the two observations’ perimeter
and concavity coordinates, add the squared differences, and then take the square root.
In order to find the \(K=5\) nearest neighbors, we will use the `nsmallest` method
from `pandas`.

```
new_obs_Perimeter = 0
new_obs_Concavity = 3.5
cancer["dist_from_new"] = (
    (cancer["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer["Concavity"] - new_obs_Concavity) ** 2
) ** (1 / 2)
cancer.nsmallest(5, "dist_from_new")[[
    "Perimeter",
    "Concavity",
    "Class",
    "dist_from_new"
]]
```

|     | Perimeter | Concavity | Class     | dist_from_new |
|-----|-----------|-----------|-----------|---------------|
| 112 | 0.241202  | 2.653051  | Benign    | 0.880626      |
| 258 | 0.750277  | 2.870061  | Malignant | 0.979663      |
| 351 | 0.622700  | 2.541410  | Malignant | 1.143088      |
| 430 | 0.416930  | 2.314364  | Malignant | 1.256806      |
| 152 | -1.160091 | 4.039155  | Benign    | 1.279258      |

In Table 5.1 we show in mathematical detail how
we computed the `dist_from_new` variable (the
distance to the new observation) for each of the 5 nearest neighbors in the
training data.

| Perimeter | Concavity | Distance                                   | Class     |
|-----------|-----------|--------------------------------------------|-----------|
| 0.24      | 2.65      | \(\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88\)    | Benign    |
| 0.75      | 2.87      | \(\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98\)    | Malignant |
| 0.62      | 2.54      | \(\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14\)    | Malignant |
| 0.42      | 2.31      | \(\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26\)    | Malignant |
| -1.16     | 4.04      | \(\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28\) | Benign    |

The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in Fig. 5.7.
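For comparison, the same five-neighbor majority vote can be reproduced with `scikit-learn`’s `KNeighborsClassifier`, which this book uses for K-nearest neighbors classification later on. This is only a sketch: to keep it self-contained, it trains on just the five neighbors from Table 5.1 rather than the full `cancer` data frame.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: the five nearest neighbors from Table 5.1
neighbors = pd.DataFrame({
    "Perimeter": [0.24, 0.75, 0.62, 0.42, -1.16],
    "Concavity": [2.65, 2.87, 2.54, 2.31, 4.04],
    "Class": ["Benign", "Malignant", "Malignant", "Malignant", "Benign"],
})

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(neighbors[["Perimeter", "Concavity"]], neighbors["Class"])

# Predict the class of the new observation (perimeter 0, concavity 3.5);
# 3 of these 5 neighbors are malignant, so the majority vote is malignant
new_obs = pd.DataFrame({"Perimeter": [0.0], "Concavity": [3.5]})
knn.predict(new_obs)
```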

### 5.5.2. More than two explanatory variables

Although the above description is directed toward two predictor variables, exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have \(m\) predictor variables for two observations \(a\) and \(b\), i.e., \(a = (a_{1}, a_{2}, \dots, a_{m})\) and \(b = (b_{1}, b_{2}, \dots, b_{m})\).

The distance formula becomes

\[\mathrm{Distance} = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_m - b_m)^2}\]

This formula still corresponds to a straight-line distance, just in a space with more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables, and then took the square root. Now we will do the same, except for our three variables. We calculate the distance as follows:

\[\mathrm{Distance} = \sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27\]

Let’s calculate the distances between our new observation and each of the observations in the training set to find the \(K=5\) neighbors when we have these three predictors.

```
new_obs_Perimeter = 0
new_obs_Concavity = 3.5
new_obs_Symmetry = 1
cancer["dist_from_new"] = (
    (cancer["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer["Concavity"] - new_obs_Concavity) ** 2
    + (cancer["Symmetry"] - new_obs_Symmetry) ** 2
) ** (1 / 2)
cancer.nsmallest(5, "dist_from_new")[[
    "Perimeter",
    "Concavity",
    "Symmetry",
    "Class",
    "dist_from_new"
]]
```

|     | Perimeter | Concavity | Symmetry | Class     | dist_from_new |
|-----|-----------|-----------|----------|-----------|---------------|
| 430 | 0.416930  | 2.314364  | 0.836722 | Malignant | 1.267368      |
| 400 | 1.334664  | 2.886368  | 1.099359 | Malignant | 1.472326      |
| 562 | 0.470430  | 2.084810  | 1.154075 | Malignant | 1.499268      |
| 68  | -1.365450 | 2.812359  | 1.092064 | Benign    | 1.531594      |
| 351 | 0.622700  | 2.541410  | 2.055065 | Malignant | 1.555575      |

Based on \(K=5\) nearest neighbors with these three predictors, we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are of the malignant class. Fig. 5.8 shows what the data look like when we visualize them as a 3-dimensional scatter plot with lines from the new observation to its five nearest neighbors.