hr_data_binary_classification¶

IBM HR data¶

https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

Mounting Google Drive¶

In [ ]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

1. Problem definition and goal setting¶

2. Data collection and preprocessing¶

In [ ]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

Loading the data¶

In [ ]:

df = pd.read_csv("/content/drive/MyDrive/hr_data.csv", encoding="utf-8", index_col=0)
df

Out[ ]:

	birthday	entry_year	department	marital_status	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition
0	1980-7-20	2013	sales	single	high	very high	8.33	9431500	8923739	8.0	yes
1	1972-11-8	2011	rnd	married	very high	medium	6.93	5170672	4617495	NaN	no
2	1984-5-7	2014	rnd	single	high	high	9.00	9898200	9176045	6.0	yes
3	1988-10-19	2013	rnd	married	high	high	8.33	5673500	5362476	1.0	no
4	1994-7-11	2015	rnd	married	high	medium	7.20	3484080	3284389	9.0	no
...	...	...	...	...	...	...	...	...	...	...	...
1465	1985-2-13	2004	rnd	married	high	very high	7.50	3488175	3214315	4.0	no
1466	1982-5-21	2012	rnd	married	high	low	8.33	4442500	4113806	4.0	no
1467	1994-2-5	2015	rnd	married	very high	medium	8.33	8715500	7908802	1.0	no
1468	1972-4-17	2004	sales	married	high	medium	8.67	6804200	6333023	2.0	no
1469	1987-3-17	2015	rnd	married	high	high	9.00	8639190	8120302	2.0	no

1470 rows × 11 columns

In [ ]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1470 entries, 0 to 1469
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   birthday              1470 non-null   object 
 1   entry_year            1470 non-null   int64  
 2   department            1470 non-null   object 
 3   marital_status        1143 non-null   object 
 4   performance_rating    1470 non-null   object 
 5   job_satisfaction      1470 non-null   object 
 6   working_hours         1470 non-null   float64
 7   salary                1470 non-null   int64  
 8   last_year_salary      1470 non-null   int64  
 9   num_companies_worked  1209 non-null   float64
 10  attrition             1470 non-null   object 
dtypes: float64(2), int64(3), object(6)
memory usage: 137.8+ KB

In [ ]:

msno.matrix(df);

Handling missing values¶

numerical
categorical

numerical

In [ ]:

df["num_companies_worked"].median()

Out[ ]:

2.0

In [ ]:

df["num_companies_worked"].plot(kind="box");

In [ ]:

df["num_companies_worked"] = df["num_companies_worked"].fillna(df["num_companies_worked"].mean())
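The cells above inspect the median (2.0) but fill the gaps with the mean (about 2.72, visible in row 1 of the next table). Either choice can be defensible; the sketch below uses a made-up toy column (not the notebook's data) to show how the two fill values diverge when a column is right-skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one large value and two missing entries
s = pd.Series([1.0, 1.0, 2.0, 2.0, 9.0, np.nan, np.nan],
              name="num_companies_worked")

# Mean imputation, as in the notebook cell above
mean_filled = s.fillna(s.mean())

# Median imputation, an alternative that is less sensitive to outliers
median_filled = s.fillna(s.median())

print("mean fill value:  ", mean_filled.iloc[-1])
print("median fill value:", median_filled.iloc[-1])
```

With a count-like column, the median also keeps the imputed values on the same integer scale as the observed ones.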

In [ ]:

df

Out[ ]:

	birthday	entry_year	department	marital_status	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition
0	1980-7-20	2013	sales	single	high	very high	8.33	9431500	8923739	8.000000	yes
1	1972-11-8	2011	rnd	married	very high	medium	6.93	5170672	4617495	2.723739	no
2	1984-5-7	2014	rnd	single	high	high	9.00	9898200	9176045	6.000000	yes
3	1988-10-19	2013	rnd	married	high	high	8.33	5673500	5362476	1.000000	no
4	1994-7-11	2015	rnd	married	high	medium	7.20	3484080	3284389	9.000000	no
...	...	...	...	...	...	...	...	...	...	...	...
1465	1985-2-13	2004	rnd	married	high	very high	7.50	3488175	3214315	4.000000	no
1466	1982-5-21	2012	rnd	married	high	low	8.33	4442500	4113806	4.000000	no
1467	1994-2-5	2015	rnd	married	very high	medium	8.33	8715500	7908802	1.000000	no
1468	1972-4-17	2004	sales	married	high	medium	8.67	6804200	6333023	2.000000	no
1469	1987-3-17	2015	rnd	married	high	high	9.00	8639190	8120302	2.000000	no

1470 rows × 11 columns

In [ ]:

df["num_companies_worked"].plot(kind="box");

categorical

In [ ]:

df["marital_status"].value_counts()

Out[ ]:

married    673
single     470
Name: marital_status, dtype: int64

In [ ]:

df["marital_status"] = df["marital_status"].fillna("etc")

In [ ]:

df["marital_status"].value_counts()

Out[ ]:

married    673
single     470
etc        327
Name: marital_status, dtype: int64

Preprocessing¶

numerical
- birthday $\rightarrow$ age
- entry_year $\rightarrow$ years_at_company
- salary&last_year_salary $\rightarrow$ salary_increasing_rate
categorical ordinal $\rightarrow$ values
- performance_rating
- job_satisfaction
categorical nominal $\rightarrow$ one-hot encoding
- department
- marital_status
- attrition

In [ ]:

df

Out[ ]:

	birthday	entry_year	department	marital_status	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition
0	1980-7-20	2013	sales	single	high	very high	8.33	9431500	8923739	8.000000	yes
1	1972-11-8	2011	rnd	married	very high	medium	6.93	5170672	4617495	2.723739	no
2	1984-5-7	2014	rnd	single	high	high	9.00	9898200	9176045	6.000000	yes
3	1988-10-19	2013	rnd	married	high	high	8.33	5673500	5362476	1.000000	no
4	1994-7-11	2015	rnd	married	high	medium	7.20	3484080	3284389	9.000000	no
...	...	...	...	...	...	...	...	...	...	...	...
1465	1985-2-13	2004	rnd	married	high	very high	7.50	3488175	3214315	4.000000	no
1466	1982-5-21	2012	rnd	married	high	low	8.33	4442500	4113806	4.000000	no
1467	1994-2-5	2015	rnd	married	very high	medium	8.33	8715500	7908802	1.000000	no
1468	1972-4-17	2004	sales	married	high	medium	8.67	6804200	6333023	2.000000	no
1469	1987-3-17	2015	rnd	married	high	high	9.00	8639190	8120302	2.000000	no

1470 rows × 11 columns

In [ ]:

df["salary"].plot(kind='kde');

In [ ]:

df['birthday'] = pd.to_datetime(df['birthday'], format='%Y-%m-%d')
df["birth_year"] = df["birthday"].dt.year
df["age"] = 2021-df["birth_year"]+1

In [ ]:

df["years_at_company"] = 2021-df["entry_year"]+1

$$ \text{salary} = \text{last_year_salary}(1 + \frac{\alpha}{100}) $$

In [ ]:

df["salary_increasing_rate"] = df["salary"]/df["last_year_salary"]*100-100
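Solving the relation above for $\alpha$ gives the expression used in the code cell:

$$ \alpha = \left(\frac{\text{salary}}{\text{last_year_salary}} - 1\right) \times 100 $$

As a quick check with row 0 of the table (salary 9431500, last_year_salary 8923739):

$$ \alpha = \left(\frac{9431500}{8923739} - 1\right) \times 100 \approx 5.69 $$

This agrees with the salary_increasing_rate of 5.690003 shown for that row later in the notebook. Employees who joined in 2021 have identical salary and last_year_salary values, so their rate is 0.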

In [ ]:

df[df["entry_year"] == 2021]

Out[ ]:

	birthday	entry_year	department	marital_status	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition	birth_year	age	years_at_company
23	2000-07-21	2021	rnd	single	high	very high	7.50	8492175	8492175	1.000000	no	2000	22	1
127	2002-03-30	2021	sales	single	high	high	6.93	3990896	3990896	1.000000	yes	2002	20	1
296	2003-02-03	2021	rnd	single	high	high	7.80	5154084	5154084	1.000000	yes	2003	19	1
301	2003-03-01	2021	sales	single	high	high	7.20	5903064	5903064	1.000000	no	2003	19	1
457	2003-05-30	2021	sales	single	high	medium	7.50	6086700	6086700	1.000000	yes	2003	19	1
615	1994-06-01	2021	rnd	married	high	very high	6.67	4083200	4083200	1.000000	no	1994	28	1
727	2003-12-18	2021	rnd	single	high	very high	6.67	5790600	5790600	1.000000	no	2003	19	1
828	2003-12-19	2021	rnd	single	high	high	7.50	7185150	7185150	1.000000	yes	2003	19	1
972	2003-02-04	2021	rnd	single	high	very high	8.67	10140260	10140260	2.723739	no	2003	19	1
1153	2003-05-05	2021	sales	single	high	very high	8.33	6961750	6961750	2.723739	yes	2003	19	1
1311	2003-09-18	2021	rnd	single	high	high	6.93	2923024	2923024	1.000000	no	2003	19	1

categorical ordinal $\rightarrow$ values
- performance_rating
- job_satisfaction

In [ ]:

df["performance_rating"].value_counts()

Out[ ]:

high         1244
very high     226
Name: performance_rating, dtype: int64

In [ ]:

df["job_satisfaction"].value_counts()

Out[ ]:

very high    459
high         442
low          289
medium       280
Name: job_satisfaction, dtype: int64

In [ ]:

level = {"low":0, 
        "medium":1,
        "high":2,
        "very high":3}

In [ ]:

df["performance_rating"] = df["performance_rating"].replace(level)
df["job_satisfaction"] = df["job_satisfaction"].replace(level)
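The ordinal encoding above relies on every label being a key of `level`. As a minimal sketch on a made-up toy column, `Series.map` is one alternative to `replace` that turns unexpected labels into NaN, which makes them easy to spot; the label "unknown" here is hypothetical and does not occur in the notebook's data.

```python
import pandas as pd

# Same mapping as the notebook's `level` dictionary
level = {"low": 0, "medium": 1, "high": 2, "very high": 3}

# Hypothetical toy column containing one label that is not in the mapping
toy = pd.Series(["high", "low", "very high", "medium", "unknown"])

# map() returns NaN for labels missing from the dictionary,
# whereas replace() would leave the original string in place
encoded = toy.map(level)

# Collect the labels that could not be encoded
unmapped = toy[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("unmapped labels:", unmapped)
```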

categorical nominal $\rightarrow$ one-hot encoding
- department
- marital_status
- attrition

In [ ]:

df["department"].value_counts()

Out[ ]:

rnd      961
sales    446
hr        63
Name: department, dtype: int64

In [ ]:

df["marital_status"].value_counts()

Out[ ]:

married    673
single     470
etc        327
Name: marital_status, dtype: int64

In [ ]:

df["attrition"].value_counts()

Out[ ]:

no     1233
yes     237
Name: attrition, dtype: int64

In [ ]:

# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to booleans)
df = pd.get_dummies(df, columns=["department", "marital_status"], dtype=int)
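To illustrate what the one-hot step produces, here is a small sketch on a made-up frame with only a `department` column. Passing `dtype=int` is a choice made for this example so that the output is 0/1 integers like the table shown in the notebook; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with the three department labels from the notebook
toy = pd.DataFrame({"department": ["sales", "rnd", "hr", "rnd"]})

# One column per category, named <column>_<category>
onehot = pd.get_dummies(toy, columns=["department"], dtype=int)

print(onehot)
```

Each row contains exactly one 1 across the generated columns, which is why tree models can use them directly without implying any order between departments.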

In [ ]:

df

Out[ ]:

	birthday	entry_year	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition	birth_year	age	years_at_company	salary_increasing_rate	department_hr	department_rnd	department_sales	marital_status_etc	marital_status_married	marital_status_single
0	1980-07-20	2013	2	3	8.33	9431500	8923739	8.000000	yes	1980	42	9	5.690003	0	0	1	0	0	1
1	1972-11-08	2011	3	1	6.93	5170672	4617495	2.723739	no	1972	50	11	11.980024	0	1	0	0	1	0
2	1984-05-07	2014	2	2	9.00	9898200	9176045	6.000000	yes	1984	38	8	7.870003	0	1	0	0	0	1
3	1988-10-19	2013	2	2	8.33	5673500	5362476	1.000000	no	1988	34	9	5.800007	0	1	0	0	1	0
4	1994-07-11	2015	2	1	7.20	3484080	3284389	9.000000	no	1994	28	7	6.080005	0	1	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1465	1985-02-13	2004	2	3	7.50	3488175	3214315	4.000000	no	1985	37	18	8.520011	0	1	0	0	1	0
1466	1982-05-21	2012	2	0	8.33	4442500	4113806	4.000000	no	1982	40	10	7.990022	0	1	0	0	1	0
1467	1994-02-05	2015	3	1	8.33	8715500	7908802	1.000000	no	1994	28	7	10.200002	0	1	0	0	1	0
1468	1972-04-17	2004	2	1	8.67	6804200	6333023	2.000000	no	1972	50	18	7.440001	0	0	1	0	1	0
1469	1987-03-17	2015	2	2	9.00	8639190	8120302	2.000000	no	1987	35	7	6.390009	0	1	0	0	1	0

1470 rows × 19 columns

In [ ]:

# Encode the target as 1 for "yes" and 0 for "no"
df["attrition"] = (df["attrition"] == "yes").astype(int)

In [ ]:

df

Out[ ]:

	birthday	entry_year	performance_rating	job_satisfaction	working_hours	salary	last_year_salary	num_companies_worked	attrition	birth_year	age	years_at_company	salary_increasing_rate	department_hr	department_rnd	department_sales	marital_status_etc	marital_status_married	marital_status_single
0	1980-07-20	2013	2	3	8.33	9431500	8923739	8.000000	1	1980	42	9	5.690003	0	0	1	0	0	1
1	1972-11-08	2011	3	1	6.93	5170672	4617495	2.723739	0	1972	50	11	11.980024	0	1	0	0	1	0
2	1984-05-07	2014	2	2	9.00	9898200	9176045	6.000000	1	1984	38	8	7.870003	0	1	0	0	0	1
3	1988-10-19	2013	2	2	8.33	5673500	5362476	1.000000	0	1988	34	9	5.800007	0	1	0	0	1	0
4	1994-07-11	2015	2	1	7.20	3484080	3284389	9.000000	0	1994	28	7	6.080005	0	1	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1465	1985-02-13	2004	2	3	7.50	3488175	3214315	4.000000	0	1985	37	18	8.520011	0	1	0	0	1	0
1466	1982-05-21	2012	2	0	8.33	4442500	4113806	4.000000	0	1982	40	10	7.990022	0	1	0	0	1	0
1467	1994-02-05	2015	3	1	8.33	8715500	7908802	1.000000	0	1994	28	7	10.200002	0	1	0	0	1	0
1468	1972-04-17	2004	2	1	8.67	6804200	6333023	2.000000	0	1972	50	18	7.440001	0	0	1	0	1	0
1469	1987-03-17	2015	2	2	9.00	8639190	8120302	2.000000	0	1987	35	7	6.390009	0	1	0	0	1	0

1470 rows × 19 columns

3. Applying the model¶

In [ ]:

df.columns

Out[ ]:

Index(['birthday', 'entry_year', 'performance_rating', 'job_satisfaction',
       'working_hours', 'salary', 'last_year_salary', 'num_companies_worked',
       'attrition', 'birth_year', 'age', 'years_at_company',
       'salary_increasing_rate', 'department_hr', 'department_rnd',
       'department_sales', 'marital_status_etc', 'marital_status_married',
       'marital_status_single'],
      dtype='object')

In [ ]:

col = ['job_satisfaction', 'working_hours', 'num_companies_worked', 'age', 'years_at_company',
            'salary_increasing_rate', 'department_hr', 'department_rnd',
            'department_sales', 'marital_status_etc', 'marital_status_married', 'marital_status_single']

In [ ]:

x_data = df[col].values
y_data = df["attrition"].values

In [ ]:

x_train, x_test, y_train, y_test = train_test_split(x_data, 
                                                    y_data, 
                                                    random_state = 42, 
                                                    stratify=y_data)
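The `stratify` argument keeps the class ratio roughly equal in the training and test sets. The sketch below reproduces only the label counts from the notebook (1233 "no", 237 "yes") with a placeholder feature, so the arrays `X` and `y` are stand-ins rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the notebook's attrition column
y = np.array([0] * 1233 + [1] * 237)

# Placeholder feature matrix (row index only); the real features are not needed here
X = np.arange(len(y)).reshape(-1, 1)

# Default test_size is 0.25, as in the notebook's call
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

print("train size:", len(y_tr), "positive share:", y_tr.mean())
print("test size: ", len(y_te), "positive share:", y_te.mean())
```

Without `stratify`, the number of positive cases in a test set of this size could vary noticeably from one random seed to another.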

In [ ]:

rfc = RandomForestClassifier(random_state=42)
rfc.fit(x_train, y_train)

Out[ ]:

RandomForestClassifier(random_state=42)

In [ ]:

pred = rfc.predict(x_test)

In [ ]:

rfc.score(x_test, y_test)

Out[ ]:

0.8152173913043478

In [ ]:

df["attrition"].value_counts()

Out[ ]:

0    1233
1     237
Name: attrition, dtype: int64

In [ ]:

1-y_train.sum()/len(y_train)

Out[ ]:

0.8384754990925589
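The two numbers above are worth reading together: a trivial model that always predicts "no" would already be right for about 84% of employees. The following sketch makes that comparison explicit using only figures printed in this notebook; the variable names are chosen for illustration.

```python
# Class counts from df["attrition"].value_counts() above
n_stay, n_leave = 1233, 237

# Accuracy of a trivial model that always predicts the majority class ("no")
baseline_accuracy = n_stay / (n_stay + n_leave)

# Test accuracy reported by rfc.score(x_test, y_test) above
model_accuracy = 0.8152173913043478

print(f"baseline: {baseline_accuracy:.4f}")
print(f"model:    {model_accuracy:.4f}")
print("model beats baseline:", model_accuracy > baseline_accuracy)
```

Since the random forest does not beat this baseline, accuracy alone says little here, which motivates looking at precision and recall for the "yes" class.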

4. Model evaluation¶

In [ ]:

matrix = metrics.confusion_matrix(y_test, pred)
matrix

Out[ ]:

array([[292,  17],
       [ 51,   8]])

In [ ]:

plt.figure(figsize=(5,5))
ax = sns.heatmap(matrix, 
                    cmap = "coolwarm",
                    linecolor="white",
                    linewidth=1,
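                    annot=True,
                    fmt="d")  # show raw counts instead of scientific notation

# Workaround for heatmap rows being clipped (seen with matplotlib 3.1.1);
# it can be dropped on versions without that issue.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom+0.5, top-0.5)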
    
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

In [ ]:

accuracy = (matrix[0][0]+matrix[1][1])/matrix.sum()
accuracy

Out[ ]:

0.8152173913043478

In [ ]:

precision = matrix[1][1]/(matrix[0][1]+matrix[1][1])
precision

Out[ ]:

0.32

In [ ]:

recall = matrix[1][1]/(matrix[1][0]+matrix[1][1])
recall

Out[ ]:

0.13559322033898305

In [ ]:

print(metrics.classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.85      0.94      0.90       309
           1       0.32      0.14      0.19        59

    accuracy                           0.82       368
   macro avg       0.59      0.54      0.54       368
weighted avg       0.77      0.82      0.78       368
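The class-1 row of the report can be recomputed by hand from the confusion matrix printed earlier. This sketch uses only those four counts and plain Python; the names `tn`, `fp`, `fn`, `tp` are the usual abbreviations and are not variables from the notebook.

```python
# Confusion matrix from the notebook: rows = true label, columns = predicted label
tn, fp = 292, 17   # true "no":  predicted "no", predicted "yes"
fn, tp = 51, 8     # true "yes": predicted "no", predicted "yes"

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of predicted leavers, how many really left
recall = tp / (tp + fn)               # of actual leavers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The model finds only 8 of the 59 employees who actually left, so its high accuracy comes almost entirely from the majority class.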

In [ ]:

rfc.feature_importances_

Out[ ]:

array([0.08408061, 0.11591178, 0.11350636, 0.17991101, 0.18669547,
       0.21037191, 0.00885571, 0.01830537, 0.0182078 , 0.01610867,
       0.01819513, 0.02985018])

In [ ]:

plt.bar(col, rfc.feature_importances_)
plt.xticks(rotation=90);
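The raw array is hard to read without the feature names next to it. As a small sketch, the values printed above can be paired with the `col` list and sorted; the numbers are copied from the notebook output, and `ranked` is a name introduced for this example.

```python
import pandas as pd

# Feature names in the order used for training (the notebook's `col` list)
col = ['job_satisfaction', 'working_hours', 'num_companies_worked', 'age',
       'years_at_company', 'salary_increasing_rate', 'department_hr',
       'department_rnd', 'department_sales', 'marital_status_etc',
       'marital_status_married', 'marital_status_single']

# Values copied from the rfc.feature_importances_ output above
importances = [0.08408061, 0.11591178, 0.11350636, 0.17991101, 0.18669547,
               0.21037191, 0.00885571, 0.01830537, 0.0182078, 0.01610867,
               0.01819513, 0.02985018]

# Pair names with importances and sort from most to least important
ranked = pd.Series(importances, index=col).sort_values(ascending=False)

print(ranked.head(3))
```

In this run the three derived numerical features (salary_increasing_rate, years_at_company, age) account for more than half of the total importance, while the one-hot columns contribute little.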

5. sampling¶

In [ ]:

df["attrition"].value_counts()

Out[ ]:

0    1233
1     237
Name: attrition, dtype: int64

In [ ]:

df_sampling = pd.concat([df[df["attrition"] == 0].sample(n=237, random_state=42), df[df["attrition"] == 1]])
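The cell above undersamples before splitting, so the test set is balanced as well (60 vs 59 in the report below) and its scores are not directly comparable to section 4. One possible variant, sketched here under the assumption that the goal is to evaluate on the original class ratio, is to split first and undersample only the training part. The frame `toy` and its `feature` column are hypothetical stand-ins for the preprocessed data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed frame: one random feature,
# same class counts as the notebook (1233 "no", 237 "yes")
rng = np.random.default_rng(42)
toy = pd.DataFrame({"feature": rng.normal(size=1470),
                    "attrition": [0] * 1233 + [1] * 237})

# 1) Split first, so the test set keeps the original class ratio
train, test = train_test_split(toy, random_state=42, stratify=toy["attrition"])

# 2) Undersample the majority class in the training split only
n_minority = int((train["attrition"] == 1).sum())
train_balanced = pd.concat([
    train[train["attrition"] == 0].sample(n=n_minority, random_state=42),
    train[train["attrition"] == 1],
])

print("balanced train:", train_balanced["attrition"].value_counts().to_dict())
print("untouched test:", test["attrition"].value_counts().to_dict())
```

A model fitted on `train_balanced` can then be scored on `test`, which still reflects how rare attrition is in the full data.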

In [ ]:

x_data_sampling = df_sampling[col].values
y_data_sampling = df_sampling["attrition"].values

In [ ]:

x_train_sampling, x_test_sampling, y_train_sampling, y_test_sampling = train_test_split(x_data_sampling, 
                                                    y_data_sampling, 
                                                    random_state = 42, 
                                                    stratify=y_data_sampling)

In [ ]:

rfc_sampling = RandomForestClassifier(random_state=42)
rfc_sampling.fit(x_train_sampling, y_train_sampling)

Out[ ]:

RandomForestClassifier(random_state=42)

In [ ]:

rfc_sampling.score(x_test_sampling, y_test_sampling)

Out[ ]:

0.6302521008403361

In [ ]:

pred_sampling = rfc_sampling.predict(x_test_sampling)

In [ ]:

plt.figure(figsize=(5,5))
ax = sns.heatmap(metrics.confusion_matrix(y_test_sampling, pred_sampling), 
                    cmap = "coolwarm",
                    linecolor="white",
                    linewidth=1,
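                    annot=True,
                    fmt="d")  # show raw counts instead of scientific notation

# Workaround for heatmap rows being clipped (seen with matplotlib 3.1.1);
# it can be dropped on versions without that issue.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom+0.5, top-0.5)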
    
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()

In [ ]:

print(metrics.classification_report(y_test_sampling, pred_sampling))

              precision    recall  f1-score   support

           0       0.62      0.68      0.65        60
           1       0.64      0.58      0.61        59

    accuracy                           0.63       119
   macro avg       0.63      0.63      0.63       119
weighted avg       0.63      0.63      0.63       119

In [ ]:

plt.bar(col, rfc_sampling.feature_importances_)
plt.xticks(rotation=90);
