For this problem I use three approaches, perceptron, Logistic Regression, and KNN, and compare which of the three gives the better accuracy.

In [2]:
################################Perceptron################################
import pandas as pd
#preprocessing
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['one','two','three','four','five','six','seven']
one_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
two_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
three_mapping = {'2':1,'3':2,'4':3,'5more':4}
four_mapping = {'2':1,'4':2,'more':3}
five_mapping = {'small':1,'med':2,'big':3}
six_mapping = {'low':1,'med':2,'high':3}
seven_mapping = {'unacc':1, 'acc':2, 'good':3, 'vgood':4}
car_data['one'] = car_data['one'].map(one_mapping)
car_data['two'] = car_data['two'].map(two_mapping)
car_data['three'] = car_data['three'].map(three_mapping)
car_data['four'] = car_data['four'].map(four_mapping)
car_data['five'] = car_data['five'].map(five_mapping)
car_data['six'] = car_data['six'].map(six_mapping)
car_data['seven'] = car_data['seven'].map(seven_mapping)
car_value = car_data.values
X = car_value[:,0:6] 
Y = car_value[:,6] 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test =  train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)

from sklearn.linear_model import Perceptron
ppn = Perceptron(eta0=0.1, random_state=0)  # learning rate eta0=0.1; penalty and the other parameters stay at their defaults
ppn.fit(x_train_std,y_train)
y_pred = ppn.predict(x_test_std)
print('Misclassified samples: %d' %(y_test != y_pred).sum())

from sklearn.metrics import accuracy_score
print('Accuracy: %f' %accuracy_score(y_test, y_pred))
#print('Accuracy: %f' %ppn.score(x_test_std, y_test))  # an alternative way to compute the accuracy
Misclassified samples: 137
Accuracy: 0.736031

For the first method I use a perceptron. I first convert the dataset into a numeric matrix with the mappings above, storing the first six feature columns in X and the seventh column, the label, in Y. Since the task asks for four classes, I map unacc to 1, acc to 2, good to 3, and vgood to 4. I then split the data with test_size=0.3 and stratify=Y, so that the class proportions in both the training and test sets match the original dataset. Next comes preprocessing: sc.fit(x_train) computes the mean and standard deviation of x_train, and the standardization that follows produces x_train_std and x_test_std. Finally I bring in the Perceptron class, set the learning rate eta0 to 0.1 (leaving the other parameters, such as the penalty and the iteration count, at their defaults), and fix random_state=0 so that the shuffling of the training data is reproducible. Predicting then gives y_pred, and comparing y_test with y_pred yields an accuracy of 0.736031. Since the samples have to be separated into 4 different labels, the number of misclassified samples is relatively high.
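Since there are four labels, the overall accuracy hides which classes the perceptron actually confuses. A minimal sketch of a per-class breakdown (an addition, not part of the original run), assuming y_test and y_pred from the cell above are still in scope:

from sklearn.metrics import confusion_matrix
# rows are the true labels 1-4 (unacc..vgood), columns the predicted labels
print(confusion_matrix(y_test, y_pred))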

In [2]:
################################Logistic Regression################################
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
#preprocessing
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['one','two','three','four','five','six','seven']
one_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
two_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
three_mapping = {'2':1,'3':2,'4':3,'5more':4}
four_mapping = {'2':1,'4':2,'more':3}
five_mapping = {'small':1,'med':2,'big':3}
six_mapping = {'low':1,'med':2,'high':3}
seven_mapping = {'unacc':1, 'acc':2 ,'good':3, 'vgood':4}
car_data['one'] = car_data['one'].map(one_mapping)
car_data['two'] = car_data['two'].map(two_mapping)
car_data['three'] = car_data['three'].map(three_mapping)
car_data['four'] = car_data['four'].map(four_mapping)
car_data['five'] = car_data['five'].map(five_mapping)
car_data['six'] = car_data['six'].map(six_mapping)
car_data['seven'] = car_data['seven'].map(seven_mapping)
car_data.tail()
car_value = car_data.values
X = car_value[:,0:6] 
Y = car_value[:,6] 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test =  train_test_split(X, Y, test_size=0.3, random_state=1, stratify=Y)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logistic_regression= LogisticRegression()
logistic_regression.fit(x_train_std, y_train)
y_pred=logistic_regression.predict(x_test_std)

print('Misclassified samples: %d' %(y_test != y_pred).sum())
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
Misclassified samples: 110
Accuracy:  0.7880539499036608

For the second method I use Logistic Regression. The mapping and preprocessing steps are the same as above. Fitting the model on x_train_std and y_train and then predicting on x_test_std gives an accuracy of 0.7880539499036608, with 110 misclassified samples. This is more accurate than the perceptron, which is a foreseeable result: the perceptron only settles on a good boundary when the classes are (close to) linearly separable, while logistic regression degrades more gracefully when they are not.
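sklearn's LogisticRegression copes with the four labels by fitting one binary (one-vs-rest) classifier per class. A minimal sketch that makes this visible (an addition, not part of the original run), assuming logistic_regression and x_test_std from the cell above:

# one coefficient vector per class: shape (4 classes, 6 features)
print(logistic_regression.coef_.shape)
# per-class probabilities for the first test sample; predict() picks the argmax
print(logistic_regression.predict_proba(x_test_std[:1]))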

In [3]:
################################KNN################################
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score

# Sequential Backward Selection


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):

        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score


# Test the SBS
#preprocessing
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['one','two','three','four','five','six','seven']
one_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
two_mapping = {'vhigh':1,'high':2,'med':3,'low':4}
three_mapping = {'2':1,'3':2,'4':3,'5more':4}
four_mapping = {'2':1,'4':2,'more':3}
five_mapping = {'small':1,'med':2,'big':3}
six_mapping = {'low':1,'med':2,'high':3}
seven_mapping = {'unacc':1, 'acc':2, 'good':3, 'vgood':4}
car_data['one'] = car_data['one'].map(one_mapping)
car_data['two'] = car_data['two'].map(two_mapping)
car_data['three'] = car_data['three'].map(three_mapping)
car_data['four'] = car_data['four'].map(four_mapping)
car_data['five'] = car_data['five'].map(five_mapping)
car_data['six'] = car_data['six'].map(six_mapping)
car_data['seven'] = car_data['seven'].map(seven_mapping)
car_data.tail()
car_value = car_data.values
X = car_value[:,0:6] 
y = car_value[:,6] 

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling - computes mean and standard deviation
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)  # backward selection from 6 features down to 1; scores kept in sbs.scores_
knn.fit(X_train_std, y_train)  # the final classifier is trained on all six features
y_pred = knn.predict(X_test_std)


print('Misclassified samples: %d' %(y_test != y_pred).sum())
print('Test accuracy:', knn.score(X_test_std, y_test))
Misclassified samples: 20
Test accuracy: 0.9614643545279383

For this part I use the KNN algorithm. The mapping and preprocessing steps are the same as above (the only difference is that this split uses random_state=0 without stratification). Fitting on X_train_std and y_train and predicting on X_test_std gives an accuracy of 0.9614643545279383, with only 20 misclassified samples, clearly better than the two algorithms above. Logistic Regression is a linear model that handles the four classes by combining binary (one-vs-rest) classifiers, whereas KNN is a natively multiclass, non-linear method, which is why KNN reaches the higher accuracy here.
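The SBS object fitted above already stores the validation accuracy for every feature-subset size, even though the final classifier uses all six features. A minimal sketch (an addition, not part of the original output) that plots those scores, reusing the matplotlib import from the cell above:

k_feat = [len(k) for k in sbs.subsets_]  # subset sizes from 6 down to 1
plt.plot(k_feat, sbs.scores_, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Validation accuracy')
plt.grid()
plt.show()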