Tree-based Models

Mixed

Author

Sungkyun Cho

Published

December 17, 2024

Load packages

# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.5f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 5, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

sckit-learn packages

import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS

from sklearn.tree import (DecisionTreeClassifier as DTC,
                          DecisionTreeRegressor as DTR,
                          plot_tree,
                          export_text)
from sklearn.metrics import (accuracy_score,
                             log_loss)
from sklearn.ensemble import \
     (RandomForestRegressor as RF,
      GradientBoostingRegressor as GBR)
from ISLP.bart import BART

Decision Trees

앞서 “Two Cultures”의 저자인 Leo Breiman이 1984년에 제안한 CART(Classification and Regression Trees) 알고리즘을 기반으로 한 Decision Trees (의사 결정 트리)

‘off-the-shelf’ 알고리즘으로 최적
- 분류와 회귀 문제에 모두 적용 가능함.
- 데이터 전처리가 필요 없음: 모든 변수 타입을 다룰 수 있고, 스케일/(monotonic)변환에 변화가 없음
- 관련없는 변수가 많아도 큰 영향을 받지 않음; variable selection이 이루어짐
- turning parameter가 적음
- 결측값이 있어도 처리 가능
- 상대적으로 빠름
- 어느 정도 해석 가능함
트리 하나로는 예측력이 떨어져 앙상블 기법을 사용함
- Bagging, Random Forest, Gradient Boosted Trees 등
- 해석 가능성은 떨어짐

Regression

Baseball Data
Major League Baseball Data from the 1986 and 1987 seasons.

from ISLP import load_data

Hitters = load_data('Hitters')
Hitters = Hitters.dropna()  # 결측치 제거
Hitters.head(3)

   AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  CHits  CHmRun  CRuns  \
1    315    81      7    24   38     39     14    3449    835      69    321   
2    479   130     18    66   72     76      3    1624    457      63    224   
3    496   141     20    65   78     37     11    5628   1575     225    828   

   CRBI  CWalks League Division  PutOuts  Assists  Errors    Salary NewLeague  
1   414     375      N        W      632       43      10 475.00000         N  
2   266     263      A        W      880       82      14 480.00000         A  
3   838     354      N        E      200       11       3 500.00000         N

code

(
    so.Plot(Hitters, x='Years', y='Hits', color='Salary')
    .add(so.Dot())
    .scale(color=so.Continuous('Blues', norm=(0, 2000), trans="sqrt"))
    .layout(size=(5, 4))
)

from sklearn.tree import DecisionTreeRegressor as DTR, plot_tree, export_text

X = Hitters[["Years", "Hits"]]
y = np.log(Hitters.Salary)  # 종모양의 분포로 변환

# split the data
X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.5, random_state=2)

reg_tree = DTR(max_depth=2, random_state=0)        
reg_tree.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=2, random_state=0)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

code

def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, s=30, cmap="Blues",
               clim=(y.min(), y.max()), zorder=3)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    Z = np.round(Z, 2)

    # Create a color plot with the results
    contours = ax.contour(xx, yy, Z, colors="#fc4f30", linewidths=.5, zorder=1)
    # plt.clabel(contours, inline=True, fontsize=8)
    ax.set(xlim=xlim, ylim=ylim)

fig, ax = plt.subplots(figsize=(5, 6.5))
visualize_classifier(reg_tree, X_train, y_train)
plt.show()

fig, ax = plt.subplots(figsize=(6, 6))
plot_tree(reg_tree, feature_names=reg_tree.feature_names_in_.tolist(), ax=ax)
plt.show()

print(export_text(reg_tree, feature_names=reg_tree.feature_names_in_.tolist(), show_weights=True))

|--- Years <= 4.50
|   |--- Years <= 3.50
|   |   |--- value: [4.73]
|   |--- Years >  3.50
|   |   |--- value: [5.55]
|--- Years >  4.50
|   |--- Hits <= 117.50
|   |   |--- value: [5.97]
|   |--- Hits >  117.50
|   |   |--- value: [6.78]

용어들

Terminal/leaf node (터미널/리프 노드): 트리의 맨 끝에 있는 노드; predictor space에서 나뉘어진 각 영역을 나타냄
Internal node (내부 노드): 트리의 중간에 있는 노드; predictor space에서 나뉘어진 지점
Branch (가지): 노드와 노드를 연결하는 선

Interactive Plots

Interactive plot for decision boundary

from ipywidgets import interact, fixed

def interactive_boundary(depth=1, ax=None):
    reg_tree = DTR(max_depth=depth, random_state=0)
    
    fig, ax = plt.subplots(figsize=(6, 6))
    visualize_classifier(reg_tree, X_train, y_train, ax=ax)

interact(interactive_boundary, depth=(1, 10), ax=fixed(None))
plt.show()

분할 알고리즘

각 분할에 포함된 X값에 대해서 Y값들의 평균값을 그 예측값으로 할당
잔차의 제곱합(RSS)을 최소화하는 분할을 찾음
계산적으로 불가능하므로 대안적인 방법을 사용

Top-down, greedy approach (recursive binary splitting)

모든 예측변수들과 각 예측변수의 모든 가능한 분할점에 대해 RSS가 최소가 되는 예측변수(\(X_j\))와 분할점(\(s\))을 찾음
즉, \(\displaystyle RSS = \sum_{i: x_i \in R_1(j, s)}(y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)}(y_i - \hat{y}_{R_2})^2\)을 최소화하는 \(j\)와 \(s\)를 찾음
- \(\hat{y}_{R_1}\): \(R_1\)에 있는 관측치들의 \(y\)값들의 평균
- \(\hat{y}_{R_2}\): \(R_2\)에 있는 관측치들의 \(y\)값들의 평균
이를 통해 영역을 나눔: \(R_1(j, s) = \{X|X_j < s\}\), \(R_2(j, s) = \{X|X_j \geq s\}\)
각 영역에서 반복적으로 이러한 분할을 수행
특정 기준점에 도달할 때까지 반복;
- 예를 들어, 모든 영역에서 5개 이상의 관측치가 남아있지 않을 때까지
- 또는 트리의 깊이가 특정 수준에 도달할 때까지
- 또는 터미널 노드의 수가 특정 수준에 도달할 때까지

Source: p. 335, An Introduction to Statistical Learning by James, G., Witten, D., Hastie, T., & Tibshirani, R.

Pruning (가지치기)

Cost-complexity pruning
트리의 깊이를 높게 하면, 훈련셋에 대한 예측력은 높아지지만, 테스트셋에 대한 예측력은 낮아질 수 있음
- 즉, 과적합(overfitting)이 발생할 수 있음
과적합을 조절하기 위해 매우 깊은 트리(\(T_0\))를 우선 만들고, 이후에 가지치기(pruning)를 수행
각 \(\alpha\)에 대해 다음을 최소화하는 subtree \(T \subset T_0\)가 존재함.
\(\displaystyle \sum_{m=1}^{|T|} \sum_{i: x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha|T|\)
\(\alpha\)가 커질수록, 트리의 크기에 대한 페널티가 커짐
Cross-validation을 통해 최적의 \(\alpha\)를 찾음

cols = ["AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years", "PutOuts", "Assists", "Errors"]

X = Hitters[cols]
y = np.log(Hitters["Salary"])  # 종모양의 분포로 변환

# split the data
X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=.5, random_state=1)

reg_tree = DTR(min_samples_leaf=5, random_state=0)        
reg_tree.fit(X_train, y_train)

DecisionTreeRegressor(min_samples_leaf=5, random_state=0)

fig, ax = plt.subplots(figsize=(15, 6))
plot_tree(reg_tree,
          feature_names=reg_tree.feature_names_in_.tolist(),
          ax=ax, fontsize=7);

Cost-complexity pruning

ccp_path = reg_tree.cost_complexity_pruning_path(X_train, y_train)
kfold = skm.KFold(5, shuffle=True, random_state=0)
grid = skm.GridSearchCV(reg_tree,
                        {'ccp_alpha': ccp_path['ccp_alphas']},
                        cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=DecisionTreeRegressor(min_samples_leaf=5,
                                             random_state=0),
             param_grid={'ccp_alpha': array([0.     , 0.00046, 0.0009 , 0.0011 , 0.0014 , 0.00277, 0.00294,
       0.00363, 0.00505, 0.00574, 0.01015, 0.01114, 0.01298, 0.01535,
       0.01566, 0.02022, 0.0211 , 0.02737, 0.04831, 0.10957, 0.32311])},
             scoring='neg_mean_squared_error')

grid.best_params_

{'ccp_alpha': 0.0153509442666242}

# MSE
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, grid.best_estimator_.predict(X_test))

0.4640166709041431

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(grid.best_estimator_, precision=1,
          feature_names=grid.best_estimator_.feature_names_in_.tolist(),
          ax=ax);

code

from sklearn.metrics import mean_squared_error

K = 5
results = []

for i, alpha in enumerate(ccp_path['ccp_alphas']):
    reg_tree = DTR(ccp_alpha=alpha, min_samples_leaf=5, random_state=0)

    # training, test error
    reg_tree.fit(X_train, y_train)
    n_leaves = reg_tree.get_n_leaves()
    mse_train = mean_squared_error(y_train, reg_tree.predict(X_train))
    mse_test = mean_squared_error(y_test, reg_tree.predict(X_test))

    # cross-validation error
    kfold = skm.KFold(K, shuffle=True, random_state=0)
    cv_results = skm.cross_validate(reg_tree, X_train, y_train, cv=kfold, return_estimator=True, scoring='neg_mean_squared_error')

    results.append({"alpha": alpha, "i": i, "n_leaves": n_leaves, "mse_train": mse_train, "mse_test": mse_test, "cv": -cv_results["test_score"].mean()})

results = pd.DataFrame(results)
results = results.query("n_leaves < 11")
results

code

# plot the results
fig, ax = plt.subplots(figsize=(8, 5))
sns.lineplot(data=results, x="n_leaves", y="mse_train", label="Training", marker="o", c="#fc4f30", ax=ax)
sns.lineplot(data=results, x="n_leaves", y="mse_test", label="Test", marker="o", c="#7d2be2", ax=ax)
sns.lineplot(data=results, x="n_leaves", y="cv", label="Cross-Validation", marker="o", c="#00A08A", ax=ax)

ax.set(xlabel="Tree Size", ylabel="Mean Squared Error")
sns.despine()
ax.get_legend().set_frame_on(False)

reg_tree_x = DTR(ccp_alpha=0.012, random_state=0)
reg_tree_x.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.012, random_state=0)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(reg_tree_x, feature_names=reg_tree_x.feature_names_in_.tolist(), ax=ax);

Classification

회귀문제와 다른 점은

예측값을 각 영역에서 가장 많이 나타나는 클래스에 할당.
동시에, 각 클래스에 속한 비율(class proportion)을 얻을 수 있음.

분할 알고리즘

회귀에서 RSS를 최소화하도록 분할한 것과 비슷하게 분류에서는 다음을 기준으로 분할
- classification error rate: 예측 오류률가 낮아지도록 분할
- Gini index, cross-entropy: 노드의 불순도(impurity)가 낮아지도록 분할. 즉, 한 클래스에 주로 속하도록 분할
  - 동일한 클래스로 예측되는 노드라 할지라도 노드의 순도를 높이면 (해당) 클래스에 속할 확률을 높여 예측에 대한 확신을 높일 수 있음.
새로운 노드들로부터 (가중치를 고려해) 평균적인 오류률/불순도가 줄어들 때 분할을 하며, 그 오류률/불순도가 최소가 되도록 분할

Classification error rate

\(E = 1 - \max_k(\hat{p}_{mk})\), \(\hat{p}_{mk}\): 영역 \(R_m\)에서 클래스 \(C_k\)에 속한 관측치의 비율
- 동그라미: 0.7, 세모: 0.2, 네모: 0.1
- Misclassication error rate = 1 - 0.7 = 0.3
예측의 정확도를 높이기 위해 선호됨
반면, 트리의 분할에 대해 덜 민감해 큰 트리를 만들 수 없음.

Gini index

\(G = \sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k=1}^K \hat{p}_{mk}^2\), \(\hat{p}_{mk}\): 영역 \(R_m\)에서 클래스 \(C_k\)에 속한 관측치의 비율
- 각 클래스에 대해서 그 클래스에 속하면 1 아니면 0으로 코딩했을 때,
- 해석: a measure of total variance across the K classes
- \(G = 0.7*(1-0.7) + 0.2*(1-0.2) + 0.1*(1-0.1) = 0.46\)
노드의 불순도(impurity)를 측정; 여러 클래스로 나뉘어져 분포하는 정도
\(G\)가 작을수록, 한 개의 클래스에 주로 속해있음을 의미

k1 = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
k2 = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
k3 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# variance = p(1-p)
print(k1.var(), k2.var(), k3.var())
#> 0.21, 0.16, 0.09

Cross-entropy/deviance

\(D = -\sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk})\)
Gini index와 유사하며, 노드의 불순도를 측정

Gini index와 cross-entropy가 노드 순수도에 더 민감하게 반응하며 더 큰 트리를 만들 수 있음.

예를 들어, 다음 두 가지 분할에 대해 오류율은 동일한 반면,
Gini index는 차이를 보이고, 오른쪽 분할이 더 순수한 분할로 간주됨.

왼쪽

분할 전: 오류율, Gini index = 0.5
분할 후:
- 오류율 = \(\frac{4}{8} \cdot \frac{1}{4} + \frac{4}{8} \cdot \frac{1}{4}=0.25\)
- Gini index = \(\frac{4}{8} \cdot \left(\frac{3}{4}\cdot \frac{1}{4} + \frac{1}{4}\cdot \frac{3}{4} \right) + \frac{4}{8} \cdot \left(\frac{1}{4}\cdot \frac{3}{4} + \frac{3}{4}\cdot \frac{1}{4} \right)=0.375\)

오른쪽

분할 후:
- 오류율 = \(\frac{6}{8} \cdot \frac{2}{6} + \frac{2}{8} \cdot \frac{0}{2}=0.25\)
- Gini index = \(\frac{6}{8} \cdot \left(\frac{2}{6}\cdot \frac{4}{6} + \frac{4}{6}\cdot \frac{2}{6} \right) + \frac{2}{8} \cdot \left(\frac{2}{2}\cdot \frac{0}{2} + \frac{0}{2}\cdot \frac{2}{2} \right)=0.333...\)

세 지표들의 비교

penguins = sns.load_dataset("penguins")

from sklearn.tree import DecisionTreeClassifier as DTC

X = penguins[["bill_length_mm", "bill_depth_mm"]]
y = penguins["species"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})

clf_tree_2 = DTC(max_depth=2, criterion="gini", random_state=0)
clf_tree_10 = DTC(max_depth=10, criterion="gini", random_state=0)

code

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(colors[:3])

def visualize_classifier2(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, s=50, edgecolors="w", linewidths=.5, cmap=cmap, clim=(y.min(), y.max()), zorder=3)
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.15,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, zorder=1)

    ax.set(xlim=xlim, ylim=ylim)

fig, ax = plt.subplots(figsize=(5.5, 5))
visualize_classifier2(clf_tree_2, X, y, cmap=custom_cmap)
plt.show()

fig, ax = plt.subplots(figsize=(5.5, 5))
visualize_classifier2(clf_tree_10, X, y, cmap=custom_cmap)
plt.show()

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(clf_tree_2, feature_names=clf_tree_2.feature_names_in_.tolist(), ax=ax);

confusion_table(clf_tree_2.predict(X), y)

Truth        0   1    2
Predicted              
0          139   1    0
1           12  67    9
2            1   0  115

의사 결정 트리의 장단점

장점

사람들에게 설명하기 매우 쉽우며, 사실 선형 회귀보다 설명하기가 더 쉬음.
의사 결정 트리가 다른 접근법보다 인간의 의사 결정을 더 잘 반영한다고 보기도 함.
다이어그램으로 표시할 수 있으며, 비전문가도 쉽게 해석할 수 있음.(특히 크기가 작은 경우).
더미 변수를 만들 필요 없이 자연스럽게 범주형 변수를 처리할 수 있음. (단, scikit-learn에서는 아직…)

단점

일반적으로 다른 회귀 및 분류 접근 방식보다 예측 정확도가 낮음.
Non-robust: 데이터의 작은 변화에 민감하게 트리에 큰 변화가 생길 수 있음.

Sales of Child Car Seats 예제

Carseats 설명

carseats = load_data('Carseats')
carseats.head(3)

     Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0  9.50000        138      73           11         276    120       Bad   42   
1 11.22000        111      48           16         260     83      Good   65   
2 10.06000        113      35           10         269     80    Medium   59   

   Education Urban   US  
0         17   Yes  Yes  
1         10   Yes  Yes  
2         12   Yes  Yes


# 카테고리 변수를 더미 변수로 변환
X = pd.get_dummies(carseats.drop(columns="Sales"), drop_first=True)

# 분류 문제로 변환
y = np.where(carseats["Sales"] > 8, "high", "low")

# fit the model
clf_tree = DTC(criterion='entropy', max_depth=3, random_state=0)
clf_tree.fit(X, y)

DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(clf_tree, feature_names=clf_tree.feature_names_in_.tolist(), ax=ax);

print(export_text(clf_tree, feature_names=clf_tree.feature_names_in_.tolist(), show_weights=True))

|--- ShelveLoc_Good <= 0.50
|   |--- Price <= 92.50
|   |   |--- Income <= 57.00
|   |   |   |--- weights: [3.00, 7.00] class: low
|   |   |--- Income >  57.00
|   |   |   |--- weights: [29.00, 7.00] class: high
|   |--- Price >  92.50
|   |   |--- Advertising <= 13.50
|   |   |   |--- weights: [41.00, 183.00] class: low
|   |   |--- Advertising >  13.50
|   |   |   |--- weights: [25.00, 20.00] class: high
|--- ShelveLoc_Good >  0.50
|   |--- Price <= 135.00
|   |   |--- US_Yes <= 0.50
|   |   |   |--- weights: [11.00, 6.00] class: high
|   |   |--- US_Yes >  0.50
|   |   |   |--- weights: [49.00, 2.00] class: high
|   |--- Price >  135.00
|   |   |--- Income <= 46.00
|   |   |   |--- weights: [0.00, 6.00] class: low
|   |   |--- Income >  46.00
|   |   |   |--- weights: [6.00, 5.00] class: high

Training error에 대해 살펴보면,

confusion_table(clf_tree.predict(X), y)

Truth      high  low
Predicted           
high        120   40
low          44  196

from sklearn.metrics import accuracy_score, log_loss

accuracy_score(y, clf_tree.predict(X))

0.79

좀 더 적절하게 cross-validation을 이용해 test error를 구하면,

kfold = skm.KFold(n_splits=5, shuffle=True, random_state=0)
results = skm.cross_validate(clf_tree, X, y, scoring='accuracy', cv=kfold)
results['test_score']

array([0.7125, 0.7625, 0.7625, 0.6125, 0.7125])

Tree-pruning에 의해 더 나은 트리를 얻을 수 있는지 확인
우선 큰 트리를 만들기 위해, max_depth를 설정하지 않고 트리를 만든 후, 적절한 pruning을 얻기 위해 cross-validation을 이용함.

# train-test split
X_train, X_test, y_train, y_test = skm.train_test_split(X, y, test_size=0.5, random_state=0)

clf_tree = DTC(criterion='entropy', random_state=0)
clf_tree.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(clf_tree, feature_names=clf_tree.feature_names_in_.tolist(), ax=ax);

clf_tree.tree_.n_leaves

accuracy_score(y_test, clf_tree.predict(X_test))

0.705

Cost-complexity pruning

ccp_path = clf_tree.cost_complexity_pruning_path(X_train, y_train)
kfold = skm.KFold(10, random_state=1, shuffle=True)

grid = skm.GridSearchCV(clf_tree,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X_train, y_train)
grid.best_score_

0.6849999999999999

fig, ax = plt.subplots(figsize=(8, 5))

best_ = grid.best_estimator_
plot_tree(best_, feature_names=best_.feature_names_in_.tolist(), fontsize=6, ax=ax);

best_.tree_.n_leaves

accuracy_score(y_test, best_.predict(X_test))

0.68

confusion_table(best_.predict(X_test), y_test)

Truth      high  low
Predicted           
high         38   20
low          44   98

pruning을 통해 accuracy에서 작은 손해를 보았으나; 0.705에서 0.68로 감소
35개의 터미널 노드 가진 full tree에서 10개의 터미널 노드로 줄였음.

Decision tree에서 확률의 의미

리프 노드에서 각 클래스에 속하는 관측치들의 비율

best_.predict_proba(X_test)[:3]
# array([[0.30303, 0.69697],
#        [0.775  , 0.225  ],
#        [0.775  , 0.225  ]])

X_test[:3]
#     CompPrice  Income  Advertising  Population  Price  Age  Education  
# 132       125      87            9         232    136   72        10   
# 309       131     111           13          33     80   68        18   
# 341        98     120            0         268     93   72        10   

#      ShelveLoc_Good  ShelveLoc_Medium  Urban_Yes  US_Yes  
# 132            True             False       True    True  
# 309           False             False       True    True  
# 341           False              True      False   False

Ensemble Methods

여러 개의 단순한 모형들을 결합하여 더 나은 예측을 하는 방법

각각의 예측력은 떨어지는 소위 weak learner들이나
각각은 전체의 일부에 대해 고유한 정보를 품고 있다고도 볼 수 있음.
각 모형은 여러 형태를 가질 수 있음

여기서는 결정 트리에 적용시켜 앙상블 방법을 살펴봄

Bagging

Bootstrap aggregation

부트스트랩으로 추출되는 여러 데이터셋으로 모형을 학습
특히, decision tree에서 흔히 사용되는데, 이는 decision tree가 variance가 매우 높은 모형이기 때문

여러 샘플들로부터 얻은 통계치들을 평균내면 그 분포의 분산이 줄어드는 현상이 있는데, 실제로 여러 샘플을 얻을 수 없기 때문에 부트스트랩으로 가상의 여러 샘플을 얻어 이를 이용.

the sampling distribution of the mean

평균이 \(\mu\), 분산이 \(\sigma^2\) 인 모집단으로부터 추출된 표본 사이즈가 \(n\)인 표본들에 대해서

각 표본의 평균들 \(\bar{X}\) 의 분포를 평균의 표본 분포, the sampling distribution of the mean이라고 하고, 이 분포는 the central limit theorem에 의해

평균: \(\displaystyle E(\bar{X})=\frac{m_1+m_2+m_3+\cdots+m_w}{w}\Bigg|_{w\rightarrow\infty}\rightarrow ~~~\mu\)
분산: \(\displaystyle V(\bar{X})\Bigg|_{w\rightarrow\infty}\rightarrow ~~~\frac{\sigma^2}{n},\) 표준 편차 \(\displaystyle\frac{\sigma}{\sqrt{n}}\) 를 standard error of estimate (SE)라고 함.

B개의 부트스트랩 샘플을 이용해 B개의 모형 \(\hat{f}^{b}\)를 만들고, 이들의 평균을 이용해 예측

\(\displaystyle \hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{b}(x)\)

decision tree에서 특히 효과적임
각 트리들은 매우 깊게 만들어 variance를 높게 만든 후 (bias는 낮아짐)
이들의 평균을 이용해 variance를 줄임
수백, 수천개의 트리를 결합해 매우 높은 정확도를 얻을 수 있음
트리를 키워 variance가 충분히 낮아지도록; 트리의 수가 중요한 요소는 아님. 많은 트리가 overfitting을 만들지 않음
한편, 여러 트리들의 평균값이므로 앞서 단일 트리를 통해 다이어그램을 그리거나 해석하기 어려움.

Source: p. 344, An Introduction to Statistical Learning with Applications in Python

Out-of-bag error

부트스트랩 샘플들은 평균적으로 원래 데이터의 2/3의 관측치만를 포함함
학습에 사용되지 않은 나머지 약 1/3의 관측치들을 out-of-bag(OOB) 관측치라고 함
모형에 대한 test error를 cross-validation 대신에 학습에 사용되지 않은 OOB 관측치를 이용해 효과적으로 추정할 수 있음
즉, 각 관측치 x에 대해 x가 포함되지 않은 부트스트랩 샘플들로 학습된 트리들로 예측한 값들(약 1/3개의 트리)의 평균을 이용해 test error를 추정할 수 있음

분류의 문제의 경우

각 트리들이 예측한 클래스들 중 가장 많이 나온 클래스를 예측값으로 사용: majority/hard voting
각 트리들이 예측한 클래스들의 확률을 평균내어 가장 높은 확률을 가진 클래스를 예측값으로 사용: soft voting

예측변수의 상대적 중요도

예측변수에 대한 해석은 어려운 대신, 예측변수의 중요도에 대한 정보를 얻을 수 있음
회귀문제: 주어진 예측변수로 분할될 때, RSS가 줄어든 총량을 모든 트리에 대해 평균
분류문제: 주어진 예측변수로 분할될 때, 불순도(gini index)가 줄어든 총량을 모든 트리에 대해 평균

Variable importance

모든 상황에 적용할 수 있는 하나의 방법은 없으며, 연구 내용과 측정 변수들의 특성에 따라 적절한 방법을 선택.

선형모형에서는

표준화 회귀계수 (standardized regression coefficient)
Semi-partial correlation coefficient
Dominance analysis

from sklearn.ensemble import RandomForestClassifier as RFC

bag_carseats = RFC(max_features=X_train.shape[1], n_estimators=500, random_state=0)
bag_carseats.fit(X_train, y_train)

RandomForestClassifier(max_features=11, random_state=0)

accuracy_score(y_test, bag_carseats.predict(X_test))

0.77

feature_imp = pd.DataFrame(
    {"importance": bag_carseats.feature_importances_}, index=X.columns
).sort_values("importance", ascending=False)
feature_imp

                  importance
Price                0.31588
CompPrice            0.13542
Age                  0.11689
...                      ...
ShelveLoc_Medium     0.01439
Urban_Yes            0.00840
US_Yes               0.00600

[11 rows x 1 columns]

code

(
    so.Plot(feature_imp.reset_index(names="predictor"), x='importance', y="predictor")
    .add(so.Bar(alpha=1))
    .label(x="Relative Importance", y="Predictor")
)

Random Forest

Baggging의 일종: 부트스트랩 샘플을 이용해 여러 트리를 생성
각 트리를 만들 때, 랜덤하게 선택된 변수들로만 분할을 수행; 대략 \(\sqrt{p}\)개의 변수를 선택
총 예측변수보다 훨씬 적은 수의 변수를 선택함으로써, 트리 간의 상관관계를 줄임; decorrelating
- 서로 관련이 없는 것들의 평균이 상대적으로 더 variance를 줄이는데 효과적
- 상관이 높은 예측변수들이 많다면 특히, 적은 변수를 선택하는 것이 효과적
중요도가 낮은 변수들도 트리에서 데이터를 설명할 수 있는 기회를 얻게 되는데, 이는 데이터 내의 다양한 정보를 캐낼 수 있는 것으로도 볼 수 있음
항상 모든 변수를 선택한다면, Bagging과 같음

from sklearn.ensemble import RandomForestClassifier as RFC

rf_carseats = RFC(max_features='sqrt', n_estimators=500, random_state=0)
rf_carseats.fit(X_train, y_train)

RandomForestClassifier(n_estimators=500, random_state=0)

accuracy_score(y_test, rf_carseats.predict(X_test))

0.805

선택되는 변수의 수와 트리의 개수에 따른 예측력 변화

Boosting

Bagging과 마찬가지로 weak learner들을 결합하는 방법이기는 하나
Boosting은 처음 세운 모형을 점진적으로 향상시키면서 새 모형들을 만들어내는데, 이전에 잡아내지 못한 시그널을 보완하면서 모형을 업데이트하는 방식
Bagging과는 본질적으로 큰 차이가 있음
다양한 알고리즘들이 개발되어 왔음; AdaBoost, Gradient Boosting, XGBoost 등

여기서는 decision tree에 적용해 살펴봄

Source: p. 349, An Introduction to Statistical Learning with Applications in Python

Turning parmaters

트리 수 B: Bagging과는 다르게 B가 너무 크면 과적합이 발생할 있어 교차 검증을 통해 B를 선택
Shrinkage parameter λ: Boosting이 학습하는 속도를 제어.
- 일반적으로 0.01이나 0.001
- λ가 매우 작으면 좋은 성능을 얻기 위해 매우 큰 값의 B가 필요
각 트리의 분할 수 d: 부스트된 앙상블의 복잡도를 제어
- 종종 d = 1을 선택; 단일 분할로 구성된 stump: 한 변수만을 포함
- d는 각 단계에서 선택되는 변수의 개수를 제어하게 되는데, 이는 여러 변수들이 골고루 데이터의 시그널을 잡아낼 수 있는 기회를 제공하게 됨.

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier as GBC

boost_carseats = GBC(n_estimators=5000, learning_rate=0.01, max_depth=2, random_state=0)
boost_carseats.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=0.01, max_depth=2, n_estimators=5000,
                           random_state=0)

accuracy_score(y_test, boost_carseats.predict(X_test))

0.855

선택되는 변수의 수와 트리의 개수에 따른 예측력 변화

boost_carseats = GBC(n_estimators=5000, learning_rate=0.01, max_depth=1, random_state=0)
boost_carseats.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=0.01, max_depth=1, n_estimators=5000,
                           random_state=0)

accuracy_score(y_test, boost_carseats.predict(X_test))

0.89

Feature importance

code

feature_imp2 = pd.DataFrame(
    {"importance": boot_carseats.feature_importances_}, index=X.columns
).sort_values("importance", ascending=False)

(
    so.Plot(feature_imp2.reset_index(names="predictor"), x='importance', y="predictor")
    .add(so.Bar(alpha=1))
    .label(x="Relative Importance", y="Predictor")
)