Subsetting

Mixed

Author

Sungkyun Cho

Published

March 17, 2024

Load Packages

# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.2f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

There are several ways to subset a DataFrame, i.e., to select part of it:

Bracket [ ]: df[["col1", "col2"]]
Dot-notation . : df.col1
iloc: df.iloc[0:3, 0:2]
loc: df.loc[0:3, ["col1", "col2"]]
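As a quick orientation, the four approaches listed here can be compared side by side on a small hand-made DataFrame. This is only a sketch: the data and the column names `col1`, `col2`, `col3` are made up for illustration and are not part of the flights dataset.

```python
import pandas as pd

# Hypothetical toy data (not the flights dataset)
df = pd.DataFrame(
    {
        "col1": [10, 20, 30, 40],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.5, 2.5, 3.5, 4.5],
    }
)

by_bracket = df[["col1", "col2"]]       # list of labels: DataFrame with two columns
by_dot = df.col1                        # dot notation: a single column as a Series
by_iloc = df.iloc[0:3, 0:2]             # positions 0-2 (end excluded): 3 rows
by_loc = df.loc[0:3, ["col1", "col2"]]  # labels 0-3 (end included): 4 rows

print(by_iloc.shape, by_loc.shape)      # (3, 2) (4, 2)
```

Note that the same slice `0:3` yields three rows with `iloc` but four rows with `loc`, which is the main difference discussed in the rest of this section.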

Import data

Data: On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013

# import the dataset
flights_data = sm.datasets.get_rdataset("flights", "nycflights13")
flights = flights_data.data.iloc[:, :-4]

# print description
print(flights_data.__doc__)

flights.head(3)

   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
0  2013      1    1    517.00             515       2.00    830.00   
1  2013      1    1    533.00             529       4.00    850.00   
2  2013      1    1    542.00             540       2.00    923.00   

   sched_arr_time  arr_delay carrier  flight tailnum origin dest  air_time  
0             819      11.00      UA    1545  N14228    EWR  IAH    227.00  
1             830      20.00      UA    1714  N24211    LGA  IAH    227.00  
2             850      33.00      AA    1141  N619AA    JFK  MIA    160.00

Bracket [ ]

When the brackets contain labels, columns are selected

A single string: returns a Series
A list containing a single string: returns a DataFrame
A list of strings: returns a DataFrame

flights['dest']  # return as a Series

0         IAH
1         IAH
         ... 
336774    CLE
336775    RDU
Name: dest, Length: 336776, dtype: object

flights[['dest']]  # return as a DataFrame

       dest
0       IAH
1       IAH
...     ...
336774  CLE
336775  RDU

[336776 rows x 1 columns]

flights[['origin', 'dest']]

       origin dest
0         EWR  IAH
1         LGA  IAH
...       ...  ...
336774    LGA  CLE
336775    LGA  RDU

[336776 rows x 2 columns]

When the brackets contain a numeric slice, rows are selected by position

Only slicing is allowed
The first index is included, the last index is excluded
Selecting specific rows with a list such as [1, 5, 8] is not allowed

flights[2:5]

   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
2  2013      1    1    542.00             540       2.00    923.00   
3  2013      1    1    544.00             545      -1.00   1004.00   
4  2013      1    1    554.00             600      -6.00    812.00   

   sched_arr_time  arr_delay carrier  flight tailnum origin dest  air_time  
2             850      33.00      AA    1141  N619AA    JFK  MIA    160.00  
3            1022     -18.00      B6     725  N804JB    JFK  BQN    183.00  
4             837     -25.00      DL     461  N668DN    LGA  ATL    116.00

Even when the index is numeric and out of order, as below, the slice is applied by row position

   origin dest  arr_delay
42    LGA  DFW      48.00
2     JFK  MIA      33.00
25    EWR  ORD      32.00
14    LGA  DFW      31.00
33    EWR  MSP      29.00

df_outoforder[2:4]

   origin dest  arr_delay
25    EWR  ORD      32.00
14    LGA  DFW      31.00
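The code that created `df_outoforder` is not shown, so the following sketch rebuilds an equivalent frame by hand from the printed values (an assumption, not the original construction). It also shows why a list of integers is rejected inside plain brackets: a list is interpreted as column labels.

```python
import pandas as pd

# Hand-built stand-in for df_outoforder, using the values printed in this section
df_outoforder = pd.DataFrame(
    {
        "origin": ["LGA", "JFK", "EWR", "LGA", "EWR"],
        "dest": ["DFW", "MIA", "ORD", "DFW", "MSP"],
        "arr_delay": [48.0, 33.0, 32.0, 31.0, 29.0],
    },
    index=[42, 2, 25, 14, 33],
)

# An integer slice in brackets selects rows by position, ignoring the index labels
rows = df_outoforder[2:4]
print(rows.index.tolist())  # [25, 14]

# A list in brackets is looked up among the column labels, so this fails
try:
    df_outoforder[[0, 1]]
    list_ok = True
except KeyError:
    list_ok = False
print(list_ok)  # False
```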

Chaining with brackets

flights[['origin', 'dest']][2:5]
# same result with the order swapped: flights[2:5][['origin', 'dest']]

  origin dest
2    JFK  MIA
3    JFK  BQN
4    LGA  ATL

Dot notation .

Convenient, but it should be used with care

Note

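Column names containing a space or a . cannot be used
Column names identical to a method name cannot be used: e.g., if a column is named count, df.count refers to the DataFrame method
A new column cannot be created by assignment: e.g., df.new_var = 1 does not work; use df["new_var"] = 1 instead
If a variable is defined as vars_names = ["origin", "dest"],
- df[vars_names] selects the "origin" and "dest" columns
- df.vars_names refers to a column literally named vars_names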

flights.dest  # same as flights["dest"]

0         IAH
1         IAH
         ... 
336774    CLE
336775    RDU
Name: dest, Length: 336776, dtype: object
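The caveats about dot notation can be checked on a toy DataFrame. This is a minimal sketch with made-up data; the column named `count` is chosen deliberately to clash with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical data with a column that shares its name with a DataFrame method
df = pd.DataFrame({"count": [3, 1, 2], "dest": ["IAH", "MIA", "BQN"]})

print(callable(df.count))       # True: df.count is the method, not the column
print(df["count"].tolist())     # [3, 1, 2]: brackets always reach the column

df.new_var = 1                  # sets a plain Python attribute, not a column
created_by_dot = "new_var" in df.columns
print(created_by_dot)           # False

df["new_var"] = 1               # brackets do create the column
created_by_bracket = "new_var" in df.columns
print(created_by_bracket)       # True

vars_names = ["count", "dest"]
selected = df[vars_names]       # uses the contents of the variable
print(selected.columns.tolist())  # ['count', 'dest']
```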

.loc & .iloc

Short for location and integer location, respectively
df.(i)loc[row_indexer, column_indexer]

.loc: label-based indexing

A numeric index is also treated as labels
When slicing, both the first and the last index are inclusive

flights.loc[2:5, ['origin', 'dest']]  # 2:5 refers to index labels, not positions

  origin dest
2    JFK  MIA
3    JFK  BQN
4    LGA  ATL
5    EWR  ORD

When the index consists of string labels, as below, there is no risk of confusion

       origin dest
red       JFK  MIA
blue      JFK  BQN
green     LGA  ATL
yellow    EWR  ORD

df_labels.loc["blue":"green", :]

      origin dest
blue     JFK  BQN
green    LGA  ATL

With a numeric index, however, confusion can arise
As in the earlier example, loc behaves differently from positional slicing when the index is out of order

   origin dest  arr_delay
42    LGA  DFW      48.00
2     JFK  MIA      33.00
25    EWR  ORD      32.00
14    LGA  DFW      31.00
33    EWR  MSP      29.00

df_outoforder.loc[2:14, :]  # labels, not positions

   origin dest  arr_delay
2     JFK  MIA      33.00
25    EWR  ORD      32.00
14    LGA  DFW      31.00

df_outoforder.loc[[25, 33], :]  # select specific index labels instead of a slice

   origin dest  arr_delay
25    EWR  ORD      32.00
33    EWR  MSP      29.00
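To make the label-versus-position distinction concrete, the sketch below applies `loc` and `iloc` to the same out-of-order integer index. The frame `df_demo` is a hand-built stand-in (a hypothetical name) using the index values shown here.

```python
import pandas as pd

# Hand-built frame with an out-of-order integer index
df_demo = pd.DataFrame(
    {"dest": ["DFW", "MIA", "ORD", "DFW", "MSP"]},
    index=[42, 2, 25, 14, 33],
)

by_label = df_demo.loc[2:14]      # from label 2 through label 14, both included
by_position = df_demo.iloc[2:4]   # rows at positions 2 and 3 (end excluded)

print(by_label.index.tolist())     # [2, 25, 14]
print(by_position.index.tolist())  # [25, 14]
```

Writing `loc` or `iloc` explicitly makes the intent visible to the reader, which a bare `df_demo[2:4]` does not.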

flights.loc[2:5, 'dest']  # returns as a Series

2    MIA
3    BQN
4    ATL
5    ORD
Name: dest, dtype: object

flights.loc[2:5, ['dest']]  # return as a DataFrame

  dest
2  MIA
3  BQN
4  ATL
5  ORD

Tip

Omitting an indexer

flights.loc[2:5, :]  # ':' means all
flights.loc[2:5]
flights.loc[2:5, ]  # flights.loc[ , ['dest', 'origin']] is a syntax error

# select a single row
flights.loc[2, :]  # returns as a Series, column names as its index

year         2013
month           1
            ...  
dest          MIA
air_time   160.00
Name: 2, Length: 15, dtype: object

# select a single row
flights.loc[[2], :]  # returns as a DataFrame

   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
2  2013      1    1    542.00             540       2.00    923.00   

   sched_arr_time  arr_delay carrier  flight tailnum origin dest  air_time  
2             850      33.00      AA    1141  N619AA    JFK  MIA    160.00

.iloc: position-based indexing

Slicing works as usual in Python: the first index is inclusive, the last index is exclusive

flights.iloc[2:5, 12:14]  # 2:5 refers to row positions; the last index is excluded

  origin dest
2    JFK  MIA
3    JFK  BQN
4    LGA  ATL

flights.iloc[2:5, 12]  # return as a Series

2    JFK
3    JFK
4    LGA
Name: origin, dtype: object

flights.iloc[2:5, :]
# the following also work
# flights.iloc[2:5]
# flights.iloc[2:5, ]

# flights.iloc[, 2:5] is a syntax error

   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
2  2013      1    1    542.00             540       2.00    923.00   
3  2013      1    1    544.00             545      -1.00   1004.00   
4  2013      1    1    554.00             600      -6.00    812.00   

   sched_arr_time  arr_delay carrier  flight tailnum origin dest  air_time  
2             850      33.00      AA    1141  N619AA    JFK  MIA    160.00  
3            1022     -18.00      B6     725  N804JB    JFK  BQN    183.00  
4             837     -25.00      DL     461  N668DN    LGA  ATL    116.00

flights.iloc[2:5, [12]]  # return as a DataFrame

  origin
2    JFK
3    JFK
4    LGA

flights.iloc[[2, 5, 7], 12:14]  # select rows at specific positions

  origin dest
2    JFK  MIA
5    EWR  ORD
7    LGA  IAD

Note

To extract a single scalar value, the following faster accessors can be used
.at[i, j], .iat[i, j]
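A minimal sketch of the two scalar accessors, using a toy frame built from rows printed earlier: `.at` takes labels (like `loc`) and `.iat` takes integer positions (like `iloc`).

```python
import pandas as pd

# Toy frame; the index labels 2, 3, 4 differ from the positions 0, 1, 2
df = pd.DataFrame(
    {"origin": ["JFK", "JFK", "LGA"], "dest": ["MIA", "BQN", "ATL"]},
    index=[2, 3, 4],
)

by_label = df.at[3, "dest"]   # row label 3, column label "dest"
by_position = df.iat[1, 1]    # second row, second column

print(by_label, by_position)  # BQN BQN
```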

Indexing a Series

It works the same way as for a DataFrame

When the index is numeric

42    DFW
2     MIA
25    ORD
14    DFW
33    MSP
Name: dest, dtype: object

s.loc[25:14]

25    ORD
14    DFW
Name: dest, dtype: object

s.iloc[2:4]

25    ORD
14    DFW
Name: dest, dtype: object

s[:3]

42    DFW
2     MIA
25    ORD
Name: dest, dtype: object

Note

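The following case is confusing

s[3] # the row at position 3? or the label 3?

#> KeyError: with an integer index, s[3] is looked up as the label 3, which does not exist here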
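The ambiguity can be avoided by always stating the intent with `loc` or `iloc`. The sketch below rebuilds the Series `s` by hand from the values printed in this section (its original construction is not shown, so this is an assumption).

```python
import pandas as pd

# Hand-built stand-in for the Series s with an out-of-order integer index
s = pd.Series(
    ["DFW", "MIA", "ORD", "DFW", "MSP"],
    index=[42, 2, 25, 14, 33],
    name="dest",
)

# With an integer index, s[3] is a label lookup, and there is no label 3
try:
    s[3]
    plain_ok = True
except KeyError:
    plain_ok = False
print(plain_ok)        # False

at_position_3 = s.iloc[3]  # explicit position
at_label_2 = s.loc[2]      # explicit label
print(at_position_3, at_label_2)  # DFW MIA
```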

When the index consists of string labels, subsetting is convenient, as follows

red       MIA
blue      BQN
green     ATL
yellow    ORD
Name: dest, dtype: object

s["red":"green"]

red      MIA
blue     BQN
green    ATL
Name: dest, dtype: object

s[["red", "green"]]

red      MIA
green    ATL
Name: dest, dtype: object

Boolean indexing

Use brackets [ ] or loc
iloc does not accept a boolean Series (it only takes a plain boolean array or list)
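The difference between the three indexers can be sketched on a toy frame. The data below is made up; the exact exception type raised by `iloc` for a boolean Series depends on the index type, so the example catches both types that pandas is known to raise in this situation.

```python
import pandas as pd

# Toy data with a non-default index
df = pd.DataFrame({"dep_delay": [-3.0, 9.0, -1.0]}, index=[8, 70, 82])
mask = df["dep_delay"] < 0          # boolean Series sharing df's index

via_bracket = df[mask]
via_loc = df.loc[mask]

# iloc rejects a boolean Series because it carries an index
try:
    df.iloc[mask]
    iloc_series_ok = True
except (NotImplementedError, ValueError):
    iloc_series_ok = False

# Stripping the index (a NumPy array) makes the mask purely positional
via_iloc = df.iloc[mask.to_numpy()]

print(via_bracket.index.tolist())   # [8, 82]
print(iloc_series_ok)               # False
```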

Bracket [ ]

np.random.seed(123)
flights_6 = flights[:100][["dep_delay", "arr_delay", "origin", "dest"]].sample(6)
flights_6

    dep_delay  arr_delay origin dest
8       -3.00      -8.00    JFK  MCO
70       9.00      20.00    LGA  ORD
..        ...        ...    ...  ...
63      -2.00       2.00    JFK  LAX
0        2.00      11.00    EWR  IAH

[6 rows x 4 columns]

flights_6[flights_6["dep_delay"] < 0]

    dep_delay  arr_delay origin dest
8       -3.00      -8.00    JFK  MCO
82      -1.00     -26.00    JFK  SFO
63      -2.00       2.00    JFK  LAX

idx = flights_6["dep_delay"] < 0
idx  # a Series of bool dtype

8      True
70    False
      ...  
63     True
0     False
Name: dep_delay, Length: 6, dtype: bool

# Select a column with the boolean indexing
flights_6[idx]["dest"]

8     MCO
82    SFO
63    LAX
Name: dest, dtype: object

Note

Strictly speaking, boolean indexing aligns the mask with the index of the DataFrame/Series
This rarely matters in practice, but note the following results

# Reset index
idx_reset = idx.reset_index(drop=True)
# 0     True
# 1    False
# 2     True
# 3    False
# 4     True
# 5    False
# Name: dep_delay, dtype: bool

flights_6[idx_reset]["dest"]
#> IndexingError: Unalignable boolean Series provided as indexer 
#> (index of the boolean Series and of the indexed object do not match)

# Boolean indexing with a NumPy array, which has no index, works without problems
flights_6[idx_reset.to_numpy()]["dest"]
# 8     MCO
# 82    SFO
# 63    LAX
# Name: dest, dtype: object

bool_idx = flights_6[["dep_delay", "arr_delay"]] > 0
bool_idx

    dep_delay  arr_delay
8       False      False
70       True       True
..        ...        ...
63      False       True
0        True       True

[6 rows x 2 columns]

idx_any = bool_idx.any(axis=1)
idx_any

8     False
70     True
      ...  
63     True
0      True
Length: 6, dtype: bool

bool_idx.all(axis=1)

8     False
70     True
      ...  
63    False
0      True
Length: 6, dtype: bool
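A row-wise `any` or `all` turns a boolean DataFrame into a boolean Series that can then be used as a row mask. The sketch below uses three of the rows printed in this section as toy data.

```python
import pandas as pd

# Three rows taken from the printed sample (index 8, 70, 63)
delays = pd.DataFrame(
    {"dep_delay": [-3.0, 9.0, -2.0], "arr_delay": [-8.0, 20.0, 2.0]},
    index=[8, 70, 63],
)

positive = delays > 0                 # boolean DataFrame
late_any = positive.any(axis=1)       # True if at least one delay is positive
late_all = positive.all(axis=1)       # True only if both delays are positive

print(late_any.tolist())              # [False, True, True]
print(late_all.tolist())              # [False, True, False]

late_either = delays[late_any]        # use the Series as a row mask
print(late_either.index.tolist())     # [70, 63]
```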

Using `np.where()`

np.where(boolean condition, value if True, value if False)
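A minimal sketch of this three-argument form on a plain NumPy array with made-up delay values. Nesting a second `np.where` in the "False" branch is one way to get three categories; the labels are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical departure delays in minutes
dep_delay = np.array([-3.0, 9.0, 0.0, 2.0])

# Two-way labelling: value where the condition holds, value where it does not
status = np.where(dep_delay > 0, "delayed", "on-time")
print(status.tolist())   # ['on-time', 'delayed', 'on-time', 'delayed']

# Three-way labelling by nesting a second np.where in the False branch
status3 = np.where(dep_delay > 0, "late", np.where(dep_delay < 0, "early", "on-time"))
print(status3.tolist())  # ['early', 'late', 'on-time', 'late']
```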

flights_6["delayed"] = np.where(idx, "on-time", "delayed")  # idx is dep_delay < 0, i.e., an early departure
flights_6

    dep_delay  arr_delay origin dest  delayed
8       -3.00      -8.00    JFK  MCO  on-time
70       9.00      20.00    LGA  ORD  delayed
..        ...        ...    ...  ...      ...
63      -2.00       2.00    JFK  LAX  on-time
0        2.00      11.00    EWR  IAH  delayed

[6 rows x 5 columns]

np.where(flights_6["dest"].str.startswith("S"), "S", "T")  # str method: whether the value starts with "S"

array(['T', 'T', 'S', 'S', 'T', 'T'], dtype='<U1')

flights_6["dest_S"] = np.where(flights_6["dest"].str.startswith("S"), "S", "T")
flights_6

    dep_delay  arr_delay origin dest  delayed dest_S
8       -3.00      -8.00    JFK  MCO  on-time      T
70       9.00      20.00    LGA  ORD  delayed      T
..        ...        ...    ...  ...      ...    ...
63      -2.00       2.00    JFK  LAX  on-time      T
0        2.00      11.00    EWR  IAH  delayed      T

[6 rows x 6 columns]

loc

flights_6.loc[idx, "dest"]  # same as flights_6[idx]["dest"]

8     MCO
82    SFO
63    LAX
Name: dest, dtype: object

To select only the columns whose names contain "time":

Series and Index objects provide str methods
str.contains(), str.startswith(), str.endswith()

For details, see 7.4 String Manipulation/String Functions in pandas in Python for Data Analysis by Wes McKinney

cols = flights.columns.str.contains("time")  # str method: whether the name contains "time"
cols

array([False, False, False,  True,  True, False,  True,  True, False,
       False, False, False, False, False,  True])

# Boolean indexing along the columns
flights.loc[:, cols]

        dep_time  sched_dep_time  arr_time  sched_arr_time  air_time
0         517.00             515    830.00             819    227.00
1         533.00             529    850.00             830    227.00
...          ...             ...       ...             ...       ...
336774       NaN            1159       NaN            1344       NaN
336775       NaN             840       NaN            1020       NaN

[336776 rows x 5 columns]
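The other two string methods mentioned here work the same way on column names. The sketch below uses a one-row toy frame with a few of the flights column names, so it runs without downloading the dataset.

```python
import pandas as pd

# One-row toy frame reusing some of the flights column names
df = pd.DataFrame(
    {
        "dep_time": [517.0],
        "sched_dep_time": [515],
        "dep_delay": [2.0],
        "air_time": [227.0],
        "origin": ["EWR"],
    }
)

has_time = df.columns.str.contains("time")      # substring match
starts_dep = df.columns.str.startswith("dep")   # prefix match
ends_delay = df.columns.str.endswith("delay")   # suffix match

time_cols = df.loc[:, has_time].columns.tolist()
dep_cols = df.loc[:, starts_dep].columns.tolist()
delay_cols = df.loc[:, ends_delay].columns.tolist()

print(time_cols)   # ['dep_time', 'sched_dep_time', 'air_time']
print(dep_cols)    # ['dep_time', 'dep_delay']
print(delay_cols)  # ['dep_delay']
```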

Warning

Assigning values through chained indexing triggers a copy vs. view warning

flights[flights["arr_delay"] < 0]["arr_delay"] = 0

/var/folders/mp/vcywncl97ml2q4c_5k2r573m0000gn/T/ipykernel_96692/3780864177.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

As the warning suggests, use .loc for the assignment

flights.loc[flights["arr_delay"] < 0, "arr_delay"] = 0
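The practical consequence can be seen on a toy frame with made-up values: the chained form assigns into a temporary copy, so the original is left unchanged, while the single `.loc` call modifies it. The warning is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

# Hypothetical arrival delays
df = pd.DataFrame({"arr_delay": [11.0, -18.0, -25.0]})

# Chained indexing: df[mask] is a new object, so the assignment does not reach df
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["arr_delay"] < 0]["arr_delay"] = 0
after_chained = df["arr_delay"].tolist()
print(after_chained)   # [11.0, -18.0, -25.0]

# A single .loc call selects rows and column in one step and assigns in place
df.loc[df["arr_delay"] < 0, "arr_delay"] = 0
after_loc = df["arr_delay"].tolist()
print(after_loc)       # [11.0, 0.0, 0.0]
```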

Summary

For brackets [ ]
- use column labels to select columns simply: df[["var1", "var2"]]
- use a numeric slice to select rows simply: df[:10]
For dot notation
- avoid names that clash with pandas methods, and
- avoid using it on the left-hand side of an assignment
Where possible, prefer the explicit loc or iloc
- df.loc[:, ["var1", "var2"]] is the same as df[["var1", "var2"]]
- df.iloc[:10, :] is the same as df[:10]
- With loc, the index is treated as labels even when it is numeric
- Unlike iloc, loc includes both the first and the last index when slicing (:)
For boolean indexing
- Bracket [ ]: df[bool_idx]
- loc: df.loc[bool_idx, :]
- iloc: not available with a boolean Series
For assignment,
- avoid chained indexing such as df[:5]["dest"], and
- use loc or iloc instead:
  - df.loc[:4, "dest"]: assuming the index is sorted starting from 0; note that the slice end differs by one from the positional version
  - df.iloc[:5, 13]: 13 is the column position of "dest"
Selecting a single column or a single row returns a Series: df["var1"] or df.loc[2, :]

Note

For NumPy indexing, see the textbook:
Ch.4/Basic Indexing and Slicing in Python for Data Analysis by Wes McKinney
