K-Nearest Neighbors

An Introduction to Statistical Learning with Applications in Python by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Author

Sungkyun Cho

Published

November 25, 2024

Load packages
# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.2f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

K-Nearest Neighbors in Classification

Earlier, for classification problems, we defined the Bayes classifier:

\(f(x)\): assign \(x\) to the class \(C_k\) for which \(P(Y = C_k|X=x)\) is largest; \(\underset{k}{\mathrm{argmax}}~ P(Y = C_k|X=x)\)

Simply put, for an observation \(x\), assign it to the class it is most likely to belong to
(the class with the highest probability under the conditional distribution \(P(Y|X=x)\)).

The simplest and most intuitive way to estimate \(P(Y|X)\) for this purpose is K-Nearest Neighbors (KNN),
a representative non-parametric method.

  • For a given (test) observation \(x\), find the K nearest (training) points; call this set \(N_k\)
  • identify the classes these K points belong to,
  • and use the proportion of the K points falling in each class as the estimated probability

For class \(C_j\),
\(\displaystyle P(Y = C_j|X=x) = \frac{1}{K} \sum_{i \in N_k} I(y_i = C_j)\)
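
As a concrete illustration, here is a minimal NumPy sketch of this estimate (an assumed example, not from the text; the data values are made up): for a query point, take the K nearest training points and use the class proportions among them as the estimated \(P(Y = C_j|X=x)\).

def knn_class_probs(X_train, y_train, x, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances from x to every training point
    neighbors = np.argsort(dists)[:K]              # indices of the K nearest training points
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return dict(zip(labels, counts / K))           # class -> estimated probability

X_tr = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.8]])
y_tr = np.array(['blue', 'blue', 'orange', 'orange', 'blue'])
knn_class_probs(X_tr, y_tr, x=np.array([1.4, 1.5]), K=3)
# all three nearest neighbors are 'blue', so the estimated probability of 'blue' is 1.0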



  • orange background: points of the plane assigned to the orange class,
  • blue background: points of the plane assigned to the blue class

In the following example (simulated data), where the probability of belonging to each class is specified and 100 observations are drawn per class,
we can find the set of \(X\) values at which \(P(Y=\text{orange}|X) = P(Y=\text{blue}|X) = \frac{1}{2}\), i.e. where the Bayes classifier is indifferent between the two classes.
This set is called the Bayes decision boundary (purple dashed line).

KNN can then be used to estimate this Bayes decision boundary.

  • As K increases, flexibility decreases and the decision boundary becomes close to linear
  • K = 100 corresponds to a classifier with low variance but high bias
  • The test error rates for K = 1 and K = 100 are 0.1695 and 0.1925, respectively

Since the value of K governs the bias-variance trade-off, the optimal K should be chosen with a method such as cross-validation so that the model does not overfit; a small illustrative sketch follows the source below.

Source: pp. 38-39, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.
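
To make this concrete, the following sketch (the simulated data and plotting choices are my assumptions, not taken from the book) fits KNN classifiers with K = 1 and K = 100 on a two-class sample and shades the resulting decision regions; the K = 1 boundary is very wiggly, while the K = 100 boundary is nearly linear.

from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 100  # 100 observations per class, as in the simulated example above
X_sim = np.vstack([rng.normal([-1, -1], 1.2, size=(n, 2)),   # "blue" class
                   rng.normal([1, 1], 1.2, size=(n, 2))])    # "orange" class
y_sim = np.repeat([0, 1], n)

# Color a fine grid over the predictor space by the predicted class
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
grid_pts = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, K in zip(axes, [1, 100]):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_sim, y_sim)
    zz = knn.predict(grid_pts).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.2, cmap="coolwarm")   # decision regions
    ax.scatter(*X_sim.T, c=y_sim, cmap="coolwarm", s=10)  # training points
    ax.set_title(f"KNN, K = {K}")
plt.show()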

K-Nearest Neighbors Regression

The KNN estimation approach extends naturally to regression problems:

  • For a given (test) observation \(x\), find the K nearest (training) points; denote this set by \(N\)
  • compute the average of the responses of these K points to obtain the prediction
    • that is, use the average of the \(y_i\) with \(x_i \in N\) as an estimate of \(f(x) = E(Y|X=x)\)

Expressed as a formula,

\(\displaystyle \hat f(x) = \frac{1}{K} \sum_{x_i \in N} y_i\)
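
A minimal sketch of this estimator in plain NumPy (an assumed example, not from the text; the data values are made up):

def knn_regress(X_train, y_train, x, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distances from the query point x
    neighbors = np.argsort(dists)[:K]             # indices of the K nearest training points
    return y_train[neighbors].mean()              # prediction = average response of the neighbors

X_tr = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_tr = np.array([1.1, 1.9, 3.2, 9.8, 11.1])
knn_regress(X_tr, y_tr, x=np.array([2.5]), K=3)   # averages y for x = 1, 2, 3 -> about 2.07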

Source: p. 112, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

Comparison with linear regression

When X and Y have a linear relationship (N = 50):


Source: p. 113, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

When the relationship between X and Y departs from linearity (N = 50):

Source: p. 114, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

  • The more nonlinear the true relationship between X and Y, the better KNN can do relative to linear regression
  • However, as the number of predictors grows, the number of training observations close to a test observation \(x_0\) shrinks exponentially,
  • so the K observations nearest to \(x_0\) can be very far away, and the prediction, their average, loses accuracy; this is the curse of dimensionality (see the small simulation after the source below)

Source: p. 25, The Elements of Statistical Learning (2e) by Hastie, T., Tibshirani, R., & Friedman, J.
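
A small simulation (my own illustration, not from the book) makes the point: with N = 50 points drawn uniformly in the unit cube, the distance from a central test point to its nearest neighbor grows quickly as the number of predictors p increases.

rng = np.random.default_rng(0)
N = 50
for p in [1, 2, 5, 10, 20, 50]:
    X_unif = rng.uniform(size=(N, p))                     # N training points in the unit cube
    x0 = np.full(p, 0.5)                                  # test point at the center of the cube
    nearest = np.linalg.norm(X_unif - x0, axis=1).min()   # distance to the nearest neighbor
    print(f"p = {p:2d}: nearest-neighbor distance = {nearest:.2f}")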

This is especially true when many of the predictors have little relationship with Y.
The figure below shows how the performance of KNN and linear regression changes as the number of variables only weakly related to Y increases (N = 50);
KNN's performance degrades quickly.

Source: p. 115, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

  • In general, parametric methods outperform non-parametric ones when the number of observations is small relative to the number of predictors.
  • Because KNN is not readily interpretable, linear regression may still be preferred even when the number of predictors is small.

Python Implementation

from ISLP import load_data
from ISLP.models import ModelSpec

import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error, r2_score

Regression

Major League Baseball Data from the 1986 and 1987 seasons

from ISLP import load_data

Hitters = load_data('Hitters') # from ISLP
Hitters = Hitters.dropna()
Hitters.head(3)
   AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  CHmRun  CRuns  \
1    315    81      7    24   38     39     14    3449    835      69    321   
2    479   130     18    66   72     76      3    1624    457      63    224   
3    496   141     20    65   78     37     11    5628   1575     225    828   

   CRBI  CWalks League Division  PutOuts  Assists  Errors  Salary NewLeague  
1   414     375      N        W      632       43      10  475.00         N  
2   266     263      A        W      880       82      14  480.00         A  
3   838     354      N        E      200       11       3  500.00         N  
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
X = Hitters.drop('Salary', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = Hitters['Salary']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, 
test_size=0.5, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

KNN regression with K = 3

from sklearn.neighbors import KNeighborsRegressor
K = 3
knn = KNeighborsRegressor(n_neighbors=K)
knn_pred = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, knn_pred):.2f}\n"
    f"R2: {r2_score(y_test, knn_pred):.2f}"
)
RMSE: 315.50
R2: 0.42
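As a side-by-side reference for the KNN-vs-linear-regression discussion above, one could also fit ordinary least squares on the same split (a sketch I have added; it is not part of the original code):

ols = skl.LinearRegression()
ols_pred = ols.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, ols_pred):.2f}\n"
    f"R2: {r2_score(y_test, ols_pred):.2f}"
)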

Find the optimal K through a grid search over a range of K values,
estimating the test error with 10-fold cross-validation.

kfold = skm.KFold(n_splits=10, shuffle=True, random_state=1)
knn = KNeighborsRegressor()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="neg_mean_squared_error")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='neg_mean_squared_error')
grid.best_params_
{'n_neighbors': np.int64(8)}

Visualize the cross-validated error for each K

plt.figure(figsize=(6, 4.5), dpi=70)
plt.errorbar(
    1/K,
    -grid.cv_results_["mean_test_score"],
    yerr=grid.cv_results_["std_test_score"] / np.sqrt(10),
)
plt.axvline(1/grid.best_params_['n_neighbors'], color=".5", linestyle=":")
plt.xlabel("1/K")
plt.ylabel("Cross-validated MSE")
plt.show()

Test error with K = 8

K = grid.best_params_['n_neighbors']
knn = KNeighborsRegressor(n_neighbors=K)
knn_pred = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)

print(
    f"RMSE: {root_mean_squared_error(y_test, knn_pred):.2f}\n"
    f"R2: {r2_score(y_test, knn_pred):.2f}"
)
RMSE: 318.17
R2: 0.41
Plot validation curves

With validation_curve(), the training and validation scores can be plotted easily.

K = np.arange(1, 51)  # K was overwritten with the single best value above; restore the full grid
train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='neg_mean_squared_error')  # default scoring: r2

plt.plot(K, -train_scores.mean(1), label='training score')
plt.plot(K, -val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('K')
plt.ylabel('Cross-validated MSE')
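
Note that Pipeline was imported above but not used. As a sketch (my addition, on the same split), scaling and KNN can be bundled into a Pipeline so that standardization is re-fit inside each cross-validation fold rather than once on the whole training set:

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsRegressor())])
pipe_grid = skm.GridSearchCV(pipe, {'knn__n_neighbors': np.arange(1, 51)},
                             cv=kfold, scoring='neg_mean_squared_error')
pipe_grid.fit(X_train, y_train)   # unscaled X_train; the pipeline rescales within each fold
pipe_grid.best_params_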

Classification

Palmer Archipelago penguin data

We use the Palmer Penguins data examined earlier.
The task is to classify the species (species) from bill length (bill_length_mm) and body mass (body_mass_g).

penguins = sns.load_dataset('penguins')
penguins.head(3)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen           39.10          18.70             181.00   
1  Adelie  Torgersen           39.50          17.40             186.00   
2  Adelie  Torgersen           40.30          18.00             195.00   

   body_mass_g     sex  
0      3750.00    Male  
1      3800.00  Female  
2      3250.00  Female  
penguins = sns.load_dataset('penguins')
colors = sns.color_palette()  # 'colors' is not defined elsewhere in this excerpt; the default palette is assumed

sns.relplot(
    data=penguins,
    x="bill_length_mm", y="body_mass_g", hue="species",
    height=4, aspect=1.2, palette=colors[:3]
)

sns.displot(
    data=penguins,
    x="bill_length_mm", y="body_mass_g", hue="species",
    kind="kde", height=4, aspect=1.2, palette=colors[:3]
)
plt.show()

penguins = sns.load_dataset('penguins')
penguins = penguins[['species', 'bill_length_mm', 'body_mass_g']].dropna()

X = penguins[['bill_length_mm', 'body_mass_g']]
y = penguins['species']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.25, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train_scaled, y_train)
knn3_pred = knn3.predict(X_test_scaled)

print(
    f"Accuracy: {knn3.score(X_test_scaled, y_test):.2f}\n"
)
Accuracy: 0.94
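As a brief aside (my addition), predict_proba returns the neighbor-vote proportions, i.e. the estimated \(P(Y = C_j|X=x)\) from the formula earlier, shown here for the first few test penguins:

pd.DataFrame(knn3.predict_proba(X_test_scaled[:5]), columns=knn3.classes_)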
from ISLP import confusion_table
confusion_table(knn3_pred, y_test)
Truth      Adelie  Chinstrap  Gentoo
Predicted                           
Adelie         35          0       0
Chinstrap       0         11       0
Gentoo          3          2      35
kfold = skm.KFold(n_splits=10, shuffle=True, random_state=1)
knn = KNeighborsClassifier()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="accuracy")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='accuracy')
grid.best_params_
{'n_neighbors': np.int64(15)}
grid.best_estimator_.fit(X_train_scaled, y_train)
knn_pred = grid.best_estimator_.predict(X_test_scaled)

confusion_table(knn_pred, y_test)
Truth      Adelie  Chinstrap  Gentoo
Predicted                           
Adelie         36          1       0
Chinstrap       0         10       0
Gentoo          2          2      35
K = np.arange(1, 51)

train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='accuracy')

plt.plot(1/K, train_scores.mean(1), label='training score')
plt.plot(1/K, val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('1/K')
plt.ylabel('Cross-validated accuracy')
plt.show()

from sklearn.metrics import classification_report
print(classification_report(y_test, knn_pred))
              precision    recall  f1-score   support

      Adelie       0.97      0.95      0.96        38
   Chinstrap       1.00      0.77      0.87        13
      Gentoo       0.90      1.00      0.95        35

    accuracy                           0.94        86
   macro avg       0.96      0.91      0.93        86
weighted avg       0.95      0.94      0.94        86

Wine recognition dataset

from sklearn import datasets
wine = datasets.load_wine(as_frame=True)
print(wine.DESCR)
wine_df = wine['frame']
wine_df['target'] = wine_df['target'].map({0: 'C0', 1: 'C1', 2: 'C3'})
wine_df.head(3)
   alcohol  malic_acid  ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71 2.43              15.60     127.00           2.80   
1    13.20        1.78 2.14              11.20     100.00           2.65   
2    13.16        2.36 2.67              18.60     101.00           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity  hue  \
0        3.06                  0.28             2.29             5.64 1.04   
1        2.76                  0.26             1.28             4.38 1.05   
2        3.24                  0.30             2.81             5.68 1.03   

   od280/od315_of_diluted_wines  proline target  
0                          3.92  1065.00     C0  
1                          3.40  1050.00     C0  
2                          3.17  1185.00     C0  
sns.relplot(
    data=wine_df,
    x="color_intensity", y="alcohol", hue="target",
    height=4, aspect=1.2, palette=colors[:3]
)

sns.displot(
    data=wine_df,
    x="color_intensity", y="alcohol", hue="target",
    kind="kde", height=4, aspect=1.2, palette=colors[:3]
)
plt.show()

X = wine_df[['color_intensity', 'alcohol']]
y = wine_df['target']

X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.25, random_state=1)

# standardize so that distances are computed on a common scale across all predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train_scaled, y_train)
knn3_pred = knn3.predict(X_test_scaled)

print(
    f"Accuracy: {knn3.score(X_test_scaled, y_test):.2f}\n"
)
Accuracy: 0.87
from ISLP import confusion_table
confusion_table(knn3_pred, y_test)
Truth      C0  C1  C3
Predicted            
C0         17   1   4
C1          0  16   0
C3          1   0   6
kfold = skm.KFold(n_splits=10, shuffle=True, random_state=0)
knn = KNeighborsClassifier()

K = np.arange(1, 51)
param_grid = {"n_neighbors": K}

grid = skm.GridSearchCV(knn, param_grid, cv=kfold, scoring="accuracy")
grid.fit(X_train_scaled, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=0, shuffle=True),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
             scoring='accuracy')
grid.best_params_
{'n_neighbors': np.int64(11)}
grid.best_estimator_.fit(X_train_scaled, y_train)
knn_pred = grid.best_estimator_.predict(X_test_scaled)

confusion_table(knn_pred, y_test)
Truth      C0  C1  C3
Predicted            
C0         18   2   4
C1          0  15   0
C3          0   0   6
K = np.arange(1, 51)

train_scores, val_scores = skm.validation_curve(knn, X_train_scaled, y_train, param_name='n_neighbors', param_range=K, cv=kfold, scoring='accuracy')

plt.plot(1/K, train_scores.mean(1), label='training score')
plt.plot(1/K, val_scores.mean(1), label='validation score')
plt.legend(frameon=False)
plt.xlabel('1/K')
plt.ylabel('Cross-validated accuracy')
plt.show()

from sklearn.metrics import classification_report
print(classification_report(y_test, knn_pred))
              precision    recall  f1-score   support

          C0       0.75      1.00      0.86        18
          C1       1.00      0.88      0.94        17
          C3       1.00      0.60      0.75        10

    accuracy                           0.87        45
   macro avg       0.92      0.83      0.85        45
weighted avg       0.90      0.87      0.86        45