K-Nearest Neighbors

An Introduction to Statistical Learning with Applications in Python by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Author

Sungkyun Cho

Published

November 25, 2024

Load packages
# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.2f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

K-Nearest Neighbors in Classification

Earlier, for classification problems, we defined the Bayes classifier:

\(f(x)\): assign \(x\) to the class \(C_k\) for which \(P(Y = C_k|X=x)\) is largest; \(\underset{k}{\mathrm{argmax}}~ P(Y = C_k|X=x)\)

Simply put, for an observation \(x\), assign it to the class it is most likely to belong to
(the class with the highest probability under the conditional distribution \(P(Y|X=x)\)).

The simplest and most intuitive way to estimate \(P(Y|X)\) for this purpose is K-Nearest Neighbors (KNN),
a representative non-parametric method.

  • For a given (test) observation \(x\), find the K nearest (training) points; call this set \(N_k\)
  • identify the classes these K points belong to,
  • and use the proportion of the K points falling in each class as the estimated probability

For class \(C_j\),
\(\displaystyle P(Y = C_j|X=x) = \frac{1}{K} \sum_{i \in N_k} I(y_i = C_j)\)
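
As a concrete illustration, here is a minimal NumPy sketch of this estimate (an assumed example, not from the text; the data values are made up): for a query point, take the K nearest training points and use the class proportions among them as the estimated \(P(Y = C_j|X=x)\).

def knn_class_probs(X_train, y_train, x, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances from x to every training point
    neighbors = np.argsort(dists)[:K]              # indices of the K nearest training points
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return dict(zip(labels, counts / K))           # class -> estimated probability

X_tr = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.8]])
y_tr = np.array(['blue', 'blue', 'orange', 'orange', 'blue'])
knn_class_probs(X_tr, y_tr, x=np.array([1.4, 1.5]), K=3)
# all three nearest neighbors are 'blue', so the estimated probability of 'blue' is 1.0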



  • orange background: points of the plane assigned to the orange class,
  • blue background: points of the plane assigned to the blue class

In the following example (simulated data), where the probability of belonging to each class is specified and 100 observations are drawn per class,
we can find the set of \(X\) values at which \(P(Y=\text{orange}|X) = P(Y=\text{blue}|X) = \frac{1}{2}\), i.e. where the Bayes classifier is indifferent between the two classes.
This set is called the Bayes decision boundary (purple dashed line).

KNN can then be used to estimate this Bayes decision boundary.

  • As K increases, flexibility decreases and the decision boundary becomes close to linear
  • K = 100 corresponds to a classifier with low variance but high bias
  • The test error rates for K = 1 and K = 100 are 0.1695 and 0.1925, respectively

Since the value of K governs the bias-variance trade-off, the optimal K should be chosen with a method such as cross-validation so that the model does not overfit; a small illustrative sketch follows the source below.

Source: pp. 38-39, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.
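
To make this concrete, the following sketch (the simulated data and plotting choices are my assumptions, not taken from the book) fits KNN classifiers with K = 1 and K = 100 on a two-class sample and shades the resulting decision regions; the K = 1 boundary is very wiggly, while the K = 100 boundary is nearly linear.

from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 100  # 100 observations per class, as in the simulated example above
X_sim = np.vstack([rng.normal([-1, -1], 1.2, size=(n, 2)),   # "blue" class
                   rng.normal([1, 1], 1.2, size=(n, 2))])    # "orange" class
y_sim = np.repeat([0, 1], n)

# Color a fine grid over the predictor space by the predicted class
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
grid_pts = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, K in zip(axes, [1, 100]):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_sim, y_sim)
    zz = knn.predict(grid_pts).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.2, cmap="coolwarm")   # decision regions
    ax.scatter(*X_sim.T, c=y_sim, cmap="coolwarm", s=10)  # training points
    ax.set_title(f"KNN, K = {K}")
plt.show()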

K-Nearest Neighbors Regression

The KNN estimation approach extends naturally to regression problems:

  • For a given (test) observation \(x\), find the K nearest (training) points; denote this set by \(N\)
  • compute the average of the responses of these K points to obtain the prediction
    • that is, use the average of the \(y_i\) with \(x_i \in N\) as an estimate of \(f(x) = E(Y|X=x)\)

Expressed as a formula,

\(\displaystyle \hat f(x) = \frac{1}{K} \sum_{x_i \in N} y_i\)
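
A minimal sketch of this estimator in plain NumPy (an assumed example, not from the text; the data values are made up):

def knn_regress(X_train, y_train, x, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from the query point x
    neighbors = np.argsort(dists)[:K]             # indices of the K nearest training points
    return y_train[neighbors].mean()              # prediction = average response of the neighbors

X_tr = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_tr = np.array([1.1, 1.9, 3.2, 9.8, 11.1])
knn_regress(X_tr, y_tr, x=np.array([2.5]), K=3)   # averages y for x = 1, 2, 3 -> about 2.07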

Source: p. 112, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

Comparison with linear regression

When X and Y have a linear relationship (N = 50):


Source: p. 113, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

When the relationship between X and Y departs from linearity (N = 50):

Source: p. 114, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

  • The more nonlinear the true relationship between X and Y, the better KNN can do relative to linear regression
  • However, as the number of predictors grows, the number of training observations close to a test observation \(x_0\) shrinks exponentially,
  • so the K observations nearest to \(x_0\) can be very far away, and the prediction, their average, loses accuracy; this is the curse of dimensionality (see the small simulation after the source below)

Source: p. 25, The Elements of Statistical Learning (2e) by Hastie, T., Tibshirani, R., & Friedman, J.
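
A small simulation (my own illustration, not from the book) makes the point: with N = 50 points drawn uniformly in the unit cube, the distance from a central test point to its nearest neighbor grows quickly as the number of predictors p increases.

rng = np.random.default_rng(0)
N = 50
for p in [1, 2, 5, 10, 20, 50]:
    X_unif = rng.uniform(size=(N, p))                     # N training points in the unit cube
    x0 = np.full(p, 0.5)                                  # test point at the center of the cube
    nearest = np.linalg.norm(X_unif - x0, axis=1).min()   # distance to the nearest neighbor
    print(f"p = {p:2d}: nearest-neighbor distance = {nearest:.2f}")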

This is especially true when many of the predictors have little relationship with Y.
The figure below shows how the performance of KNN and linear regression changes as the number of variables only weakly related to Y increases (N = 50);
KNN's performance degrades quickly.

Source: p. 115, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

  • In general, parametric methods outperform non-parametric ones when the number of observations is small relative to the number of predictors.
  • Because KNN is not readily interpretable, linear regression may still be preferred even when the number of predictors is small.

Python Implementation

from ISLP import load_data
from ISLP.models import ModelSpec

import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error, r2_score

Regression

Major League Baseball Data from the 1986 and 1987 seasons

from ISLP import load_data

Hitters = load_data('Hitters') # from ISLP
Hitters = Hitters.dropna()
Hitters.head(3)
   AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  CHmRun  CRuns  \
1    315    81      7    24   38     39     14    3449    835      69    321   
2    479   130     18    66   72     76      3    1624    457      63    224   
3    496   141     20    65   78     37     11    5628   1575     225    828   

   CRBI  CWalks League Division  PutOuts  Assists  Errors  Salary NewLeague  
1   414     375      N        W      632       43      10  475.00         N  
2   266     263      A        W      880       82      14  480.00         A  
3   838     354      N        E      200       11       3  500.00         N  
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
X = Hitters.drop('Salary', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = Hitters['Salary']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, 
test_size=0.5, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

KNN regression with K = 3

from sklearn.neighbors import KNeighborsRegressor
K = 3
knn = KNeighborsRegressor(n_neighbors=K)
knn_pred = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, knn_pred):.2f}\n"
    f"R2: {r2_score(y_test, knn_pred):.2f}"
)
RMSE: 315.50
R2: 0.42
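As a side-by-side reference for the KNN-vs-linear-regression discussion above, one could also fit ordinary least squares on the same split (a sketch I have added; it is not part of the original code):

ols = skl.LinearRegression()
ols_pred = ols.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, ols_pred):.2f}\n"
    f"R2: {r2_score(y_test, ols_pred):.2f}"
)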

Find the optimal K through a grid search over a range of K values,
estimating the test error with 10-fold cross-validation.

kfold = skm.KFold(n_splits=10, shuffle=True, random_state=1)
knn = KNeighborsRegressor()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="neg_mean_squared_error")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='neg_mean_squared_error')
grid.best_params_
{'n_neighbors': np.int64(8)}

Visualize the cross-validated error for each K

plt.figure(figsize=(6, 4.5), dpi=70)
plt.errorbar(
    1/K,
    -grid.cv_results_["mean_test_score"],
    yerr=grid.cv_results_["std_test_score"] / np.sqrt(10),
)
plt.axvline(1/grid.best_params_['n_neighbors'], color=".5", linestyle=":")
plt.xlabel("1/K")
plt.ylabel("Cross-validated MSE")
plt.show()

Test error with K = 8

K = grid.best_params_['n_neighbors']
knn = KNeighborsRegressor(n_neighbors=K)
knn_pred = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, knn_pred):.2f}\n"
    f"R2: {r2_score(y_test, knn_pred):.2f}"
)
RMSE: 318.17
R2: 0.41
Plot validation curves

With validation_curve(), the training and validation scores can be plotted easily.

K = np.arange(1, 51)  # K was overwritten with the single best value above; restore the full grid
train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='neg_mean_squared_error')  # default scoring: r2

plt.plot(K, -train_scores.mean(1), label='training score')
plt.plot(K, -val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('K')
plt.ylabel('Cross-validated MSE')
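
Note that Pipeline was imported above but not used. As a sketch (my addition, on the same split), scaling and KNN can be bundled into a Pipeline so that standardization is re-fit inside each cross-validation fold rather than once on the whole training set:

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsRegressor())])
pipe_grid = skm.GridSearchCV(pipe, {'knn__n_neighbors': np.arange(1, 51)},
                             cv=kfold, scoring='neg_mean_squared_error')
pipe_grid.fit(X_train, y_train)   # unscaled X_train; the pipeline rescales within each fold
pipe_grid.best_params_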

Classification

Palmer Archipelago penguin data

We use the Palmer Penguins data examined earlier.
The task is to classify the species (species) from bill length (bill_length_mm) and body mass (body_mass_g).

penguins = sns.load_dataset('penguins')
penguins.head(3)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen           39.10          18.70             181.00   
1  Adelie  Torgersen           39.50          17.40             186.00   
2  Adelie  Torgersen           40.30          18.00             195.00   

   body_mass_g     sex  
0      3750.00    Male  
1      3800.00  Female  
2      3250.00  Female  
penguins = sns.load_dataset('penguins')
colors = sns.color_palette()  # 'colors' is not defined elsewhere in this excerpt; the default palette is assumed

sns.relplot(
    data=penguins,
    x="bill_length_mm", y="body_mass_g", hue="species",
    height=4, aspect=1.2, palette=colors[:3]
)

sns.displot(
    data=penguins,
    x="bill_length_mm", y="body_mass_g", hue="species",
    kind="kde", height=4, aspect=1.2, palette=colors[:3]
)
plt.show()

penguins = sns.load_dataset('penguins')
penguins = penguins[['species', 'bill_length_mm', 'body_mass_g']].dropna()

X = penguins[['bill_length_mm', 'body_mass_g']]
y = penguins['species']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.25, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train_scaled, y_train)
knn3_pred = knn3.predict(X_test_scaled)

print(
    f"Accuracy: {knn3.score(X_test_scaled, y_test):.2f}\n"
)
Accuracy: 0.94
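As a brief aside (my addition), predict_proba returns the neighbor-vote proportions, i.e. the estimated \(P(Y = C_j|X=x)\) from the formula earlier, shown here for the first few test penguins:

pd.DataFrame(knn3.predict_proba(X_test_scaled[:5]), columns=knn3.classes_)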
from ISLP import confusion_table
confusion_table(knn3_pred, y_test)
Truth      Adelie  Chinstrap  Gentoo
Predicted                           
Adelie         35          0       0
Chinstrap       0         11       0
Gentoo          3          2      35
kfold = skm.KFold(n_splits=10, shuffle=True, random_state=1)
knn = KNeighborsClassifier()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="accuracy")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='accuracy')
grid.best_params_
{'n_neighbors': np.int64(15)}
grid.best_estimator_.fit(X_train_scaled, y_train)
knn_pred = grid.best_estimator_.predict(X_test_scaled)

confusion_table(knn_pred, y_test)
Truth      Adelie  Chinstrap  Gentoo
Predicted                           
Adelie         36          1       0
Chinstrap       0         10       0
Gentoo          2          2      35
K = np.arange(1, 51)

train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='accuracy')

plt.plot(1/K, train_scores.mean(1), label='training score')
plt.plot(1/K, val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('1/K')
plt.ylabel('Cross-validated accuracy')
plt.show()

from sklearn.metrics import classification_report
print(classification_report(y_test, knn_pred))
              precision    recall  f1-score   support

      Adelie       0.97      0.95      0.96        38
   Chinstrap       1.00      0.77      0.87        13
      Gentoo       0.90      1.00      0.95        35

    accuracy                           0.94        86
   macro avg       0.96      0.91      0.93        86
weighted avg       0.95      0.94      0.94        86

Wine recognition dataset

from sklearn import datasets
wine = datasets.load_wine(as_frame=True)
print(wine.DESCR)
wine_df = wine['frame']
wine_df['target'] = wine_df['target'].map({0: 'C0', 1: 'C1', 2: 'C3'})
wine_df.head(3)
   alcohol  malic_acid  ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71 2.43              15.60     127.00           2.80   
1    13.20        1.78 2.14              11.20     100.00           2.65   
2    13.16        2.36 2.67              18.60     101.00           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity  hue  \
0        3.06                  0.28             2.29             5.64 1.04   
1        2.76                  0.26             1.28             4.38 1.05   
2        3.24                  0.30             2.81             5.68 1.03   

   od280/od315_of_diluted_wines  proline target  
0                          3.92  1065.00     C0  
1                          3.40  1050.00     C0  
2                          3.17  1185.00     C0  
sns.relplot(
    data=wine_df,
    x="color_intensity", y="alcohol", hue="target",
    height=4, aspect=1.2, palette=colors[:3]
)

sns.displot(
    data=wine_df,
    x="color_intensity", y="alcohol", hue="target",
    kind="kde", height=4, aspect=1.2, palette=colors[:3]
)
plt.show()

X = wine_df[['color_intensity', 'alcohol']]
y = wine_df['target']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.25, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train_scaled, y_train)
knn3_pred = knn3.predict(X_test_scaled)

print(
    f"Accuracy: {knn3.score(X_test_scaled, y_test):.2f}\n"
)
Accuracy: 0.87
from ISLP import confusion_table
confusion_table(knn3_pred, y_test)
Truth      C0  C1  C3
Predicted            
C0         17   1   4
C1          0  16   0
C3          1   0   6
kfold = skm.KFold(n_splits=10, shuffle=True, random_state=0)
knn = KNeighborsClassifier()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="accuracy")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=0, shuffle=True),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='accuracy')
grid.best_params_
{'n_neighbors': np.int64(11)}
grid.best_estimator_.fit(X_train_scaled, y_train)
knn_pred = grid.best_estimator_.predict(X_test_scaled)

confusion_table(knn_pred, y_test)
Truth      C0  C1  C3
Predicted            
C0         18   2   4
C1          0  15   0
C3          0   0   6
K = np.arange(1, 51)

train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='accuracy')

plt.plot(1/K, train_scores.mean(1), label='training score')
plt.plot(1/K, val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('1/K')
plt.ylabel('Cross-validated accuracy')
plt.show()

from sklearn.metrics import classification_report
print(classification_report(y_test, knn_pred))
              precision    recall  f1-score   support

          C0       0.75      1.00      0.86        18
          C1       1.00      0.88      0.94        17
          C3       1.00      0.60      0.75        10

    accuracy                           0.87        45
   macro avg       0.92      0.83      0.85        45
weighted avg       0.90      0.87      0.86        45