In [14]:
# read in data and mapping
import pandas as pd
import numpy as np

file = pd.read_csv('car.data', header = None)
X = file.iloc[0:, 0:6]
mapping_X={'low':1, 'med':2, 'high':3, 'vhigh':4, 'small':1, 'big':3, 'more':6, '5more':5, '2':2, '3':3, '4':4}
X[0]=X[0].map(mapping_X)
X[1]=X[1].map(mapping_X)
X[2]=X[2].map(mapping_X)
X[3]=X[3].map(mapping_X)
X[4]=X[4].map(mapping_X)
X[5]=X[5].map(mapping_X)

y = file.iloc[0:, 6]
mapping_y={'vgood':1, 'good':1, 'acc':0, 'unacc':0}
y=y.map(mapping_y)

# split dataset into training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

The labels in X (the features) are converted to integers via a mapping. Since every feature is an ordinal label, the numbers are assigned directly according to the size/quality ordering, and the magnitudes roughly follow the numbers that already appear in the data. y (the class label) is also ordinal: good and vgood are mapped to 1, the other two to 0. The dataset has only about 1,700 rows, which is not very large, so test_size=0.3 is used (a larger test_size only makes sense when the dataset is large).
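
A quick sanity check (not part of the original run) is to confirm that every label was covered by the mapping; any value missing from mapping_X or mapping_y would silently become NaN after .map():

# Illustrative check: .map() turns unmapped labels into NaN, so these assertions
# pass only if every label in the data was covered by the dictionaries above.
assert not X.isnull().values.any(), 'a feature label is missing from mapping_X'
assert not y.isnull().values.any(), 'a class label is missing from mapping_y'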

In [15]:
# training decision tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

acc=[]
dep=[]
for i in range(3,12):
    tree = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=1)
    tree.fit(X_train, y_train)
    score=tree.score(X_test, y_test)
    acc.append(score)
    dep.append(i)
    print('Max Depth:', i, ', Accuracy: %.4f'%score)
    
plt.plot(dep, acc, marker='o')
plt.ylim([0.85,1.05])
plt.ylabel('Accuracy')
plt.xlabel('Max Depth')
plt.grid()
plt.tight_layout()
plt.show()
Max Depth: 3 , Accuracy: 0.9075
Max Depth: 4 , Accuracy: 0.9595
Max Depth: 5 , Accuracy: 0.9557
Max Depth: 6 , Accuracy: 0.9634
Max Depth: 7 , Accuracy: 0.9711
Max Depth: 8 , Accuracy: 0.9807
Max Depth: 9 , Accuracy: 0.9827
Max Depth: 10 , Accuracy: 0.9865
Max Depth: 11 , Accuracy: 0.9865

Because the features are discrete (integer) values, my first intuition was that a decision tree could split the data very naturally, branching on the level (value) of each feature. Varying max_depth, the accuracy stays above 0.98 from depth 8 onward and levels off at about 0.986 from depth 10. If efficiency is the main concern, max_depth=8 is the better choice; for the highest accuracy, max_depth=10.
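
Since max_depth is tuned against the same test set that reports the final score, the chosen depth may look slightly optimistic. A minimal sketch (not an original cell) of picking the depth by 5-fold cross-validation on the training set instead, then scoring the held-out test set once:

from sklearn.model_selection import cross_val_score

# Sketch: choose max_depth by cross-validation on the training set only.
best_depth, best_cv = None, 0.0
for d in range(3, 12):
    cand = DecisionTreeClassifier(criterion='gini', max_depth=d, random_state=1)
    cv = cross_val_score(cand, X_train, y_train, cv=5).mean()
    if cv > best_cv:
        best_depth, best_cv = d, cv

final_tree = DecisionTreeClassifier(criterion='gini', max_depth=best_depth, random_state=1)
final_tree.fit(X_train, y_train)
print('best max_depth by CV:', best_depth, ', test accuracy: %.4f' % final_tree.score(X_test, y_test))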

In [4]:
# mapping for problem 2
y = file.iloc[0:, 6]
mapping_y2={'vgood':3, 'good':2, 'acc':1, 'unacc':0}
y=y.map(mapping_y2)

# split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

y is re-mapped for this problem; since it is still an ordinal label, the classes are assigned values from large to small, best to worst.
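
The car evaluation data is known to be heavily skewed toward the unacc class, which is why stratify=y matters for a fair split; a small sketch (not an original cell) to inspect the class proportions:

# Illustrative: compare the overall class proportions with the training split;
# a stratified split should keep them nearly identical.
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))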

In [8]:
# training decision tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

acc=[]
dep=[]
for i in range(3,15):
    tree = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=1)
    tree.fit(X_train, y_train)
    score=tree.score(X_test, y_test)
    acc.append(score)
    dep.append(i)
    print('Max Depth:', i, ', Accuracy: %.4f'%score)
    
plt.plot(dep, acc, marker='o')
plt.ylim([0.75,1.05])
plt.ylabel('Accuracy')
plt.xlabel('Max Depth')
plt.grid()
plt.tight_layout()
plt.show()
Max Depth: 3 , Accuracy: 0.7919
Max Depth: 4 , Accuracy: 0.8613
Max Depth: 5 , Accuracy: 0.8613
Max Depth: 6 , Accuracy: 0.9345
Max Depth: 7 , Accuracy: 0.9287
Max Depth: 8 , Accuracy: 0.9653
Max Depth: 9 , Accuracy: 0.9615
Max Depth: 10 , Accuracy: 0.9711
Max Depth: 11 , Accuracy: 0.9788
Max Depth: 12 , Accuracy: 0.9750
Max Depth: 13 , Accuracy: 0.9750
Max Depth: 14 , Accuracy: 0.9750
In [12]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

est=[]
acc=[]
for i in range(23,33):
    forest = RandomForestClassifier(criterion='gini', n_estimators=i, random_state=1)
    forest.fit(X_train, y_train)
    score=forest.score(X_test,y_test)
    acc.append(score)
    est.append(i)
    print('n_estimators=',i,', Accuracy: %.4f'%score)
    
plt.plot(est, acc, marker='o')
plt.ylim([0.95,1])
plt.ylabel('Accuracy')
plt.xlabel('Estimator')
plt.grid()
plt.tight_layout()
plt.show()
n_estimators= 23 , Accuracy: 0.9769
n_estimators= 24 , Accuracy: 0.9788
n_estimators= 25 , Accuracy: 0.9788
n_estimators= 26 , Accuracy: 0.9827
n_estimators= 27 , Accuracy: 0.9827
n_estimators= 28 , Accuracy: 0.9827
n_estimators= 29 , Accuracy: 0.9807
n_estimators= 30 , Accuracy: 0.9846
n_estimators= 31 , Accuracy: 0.9827
n_estimators= 32 , Accuracy: 0.9846

Using a decision tree on problem 2, the accuracy is only around 0.97 (peaking at 0.9788 with max_depth=11), so a random forest is used instead: many different trees are trained on different feature combinations to push the accuracy higher. The result is an accuracy of roughly 0.98, with a maximum of 0.9846 at n_estimators=30.
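
The manual scan over n_estimators can also be phrased as a cross-validated grid search on the training set. A sketch under the same data, where the parameter grid is an assumption chosen for illustration rather than the grid actually used:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sketch: pick n_estimators by 5-fold cross-validation instead of test-set accuracy.
param_grid = {'n_estimators': [10, 20, 30, 50, 100]}   # assumed values for illustration
gs = GridSearchCV(RandomForestClassifier(criterion='gini', random_state=1), param_grid, cv=5)
gs.fit(X_train, y_train)
print('best params:', gs.best_params_)
print('test accuracy: %.4f' % gs.score(X_test, y_test))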

In [9]:
from sklearn.ensemble import RandomForestClassifier
feat_labels=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
forest = RandomForestClassifier(n_estimators=30, random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" %(f+1, 15, feat_labels[indices[f]], importances[indices[f]]))
 1) safety          0.311520
 2) persons         0.232809
 3) maint           0.158141
 4) buying          0.154873
 5) lug_boot        0.072699
 6) doors           0.069958
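
The same importances can also be shown as a bar chart with the matplotlib setup already imported above (an illustrative extra, not an original cell):

# Illustrative: bar chart of the importances printed above, sorted from high to low.
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), [feat_labels[i] for i in indices], rotation=45)
plt.ylabel('Feature importance')
plt.tight_layout()
plt.show()
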
In [20]:
# exclude feature doors (the least important feature)
X_ext = X.iloc[0:, [0, 1, 3, 4, 5]]
X_ext_train, X_ext_test, y_train, y_test = train_test_split(X_ext, y, test_size=0.3, random_state=1, stratify=y)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

est=[]
acc=[]
for i in range(25,36):
    forest = RandomForestClassifier(criterion='gini', n_estimators=i, random_state=1)
    forest.fit(X_ext_train, y_train)
    score=forest.score(X_ext_test,y_test)
    acc.append(score)
    est.append(i)
    print('estimator=',i,', Accuracy: %.4f'%score)
    
plt.plot(est, acc, marker='o')
plt.ylim([0.9,1])
plt.ylabel('Accuracy')
plt.xlabel('Estimator')
plt.grid()
plt.tight_layout()
plt.show()
estimator= 25 , Accuracy: 0.9480
estimator= 26 , Accuracy: 0.9480
estimator= 27 , Accuracy: 0.9518
estimator= 28 , Accuracy: 0.9518
estimator= 29 , Accuracy: 0.9518
estimator= 30 , Accuracy: 0.9518
estimator= 31 , Accuracy: 0.9480
estimator= 32 , Accuracy: 0.9480
estimator= 33 , Accuracy: 0.9518
estimator= 34 , Accuracy: 0.9518
estimator= 35 , Accuracy: 0.9518

Printing the importance of each feature shows that lug_boot and doors contribute the least. Removing both of them drops the score to around 0.88, while removing only doors (the least important, as in the cell above) still gives about 0.95. Since overfitting is not really an issue in this case, dropping features only lowers the accuracy; unless efficiency is the main concern, it is better to keep all six features as in the beginning.
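
Slicing columns by hand works, but the same importance-based selection can be expressed with scikit-learn's SelectFromModel; a minimal sketch, where the 0.10 importance threshold is an assumption chosen for illustration:

from sklearn.feature_selection import SelectFromModel

# Sketch: keep only features whose importance exceeds an assumed threshold of 0.10,
# using a forest fitted on all six features, then refit on the reduced feature set.
base = RandomForestClassifier(criterion='gini', n_estimators=30, random_state=1)
base.fit(X_train, y_train)
sfm = SelectFromModel(base, threshold=0.10, prefit=True)
X_train_sel, X_test_sel = sfm.transform(X_train), sfm.transform(X_test)
print('features kept:', [feat_labels[i] for i in np.where(sfm.get_support())[0]])

sel_forest = RandomForestClassifier(criterion='gini', n_estimators=30, random_state=1)
sel_forest.fit(X_train_sel, y_train)
print('test accuracy on selected features: %.4f' % sel_forest.score(X_test_sel, y_test))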

Summary: This dataset evaluates cars based on attributes that take a fixed set of values (e.g., seating 2, 3, 4... people) or an ordered set of categories (e.g., small, medium, large), unlike the iris dataset, where lengths and widths can be any real number. Converting the labels to numbers is therefore very simple (for example, 5more among 2, 3, 4, 5more is the largest category, so it is assigned 5 following the rule above), and I figured that such discrete values should be well suited as splitting criteria for a decision tree, which is why that algorithm was chosen. I initially set max_depth=6, the number of features, and then increased it gradually until the accuracy stopped improving, reaching about 0.987 at max_depth=10.

Problem 2 has four classes, which are again mapped to values from large to small according to quality. The same decision tree approach gives lower accuracy than in problem 1, so a random forest is used instead, growing many trees on different feature combinations to improve accuracy, and it does help. At first I had little intuition for choosing n_estimators, so I printed the accuracy in steps of roughly ten and found that it stops improving after about 30, occasionally dipping and rising again. I also tried values from 100 to over 200 (even though with only 6 features the number of possible feature subsets is limited), but the accuracy was slightly below that at 30, so 30 was chosen as the best value, giving an accuracy of about 0.985.

Overfitting was not really a problem in this training: the test set accuracy was not far below the training set accuracy. After printing each feature's importance with the built-in attribute, I tried removing one or two features and refitting the random forest with the same parameters, but the accuracy dropped in both cases, so unless efficiency is the main concern there is no need to remove those two features.