此題我用三個方法去做,分別是perceptron、Logistic Regression、KNN、random forest,分別用四個方法去分析哪個的accurancy較為準確。

In [17]:
################################Perceptron################################
import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:1784] 
Y_train = fashion_train[:,0] 
X_test = fashion_test[:,1:1784] 
Y_test = fashion_test[:,0] 

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(fashion_train)
X_train_std = sc.transform(fashion_train)
X_test_std = sc.transform(fashion_test)

from sklearn.linear_model import Perceptron
ppn = Perceptron(None, eta0=0.1, random_state=0)
ppn.fit(X_train_std,Y_train)
Y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' %(Y_test != Y_pred).sum())

from sklearn.metrics import accuracy_score
print('Accuracy: %f' %accuracy_score(Y_test, Y_pred))
Misclassified samples: 952
Accuracy: 0.904800

第一種方式我選擇用perceptron去做,首先pandas讀取兩個csv檔案,因為題目已經給了train跟test的數據,所以不需要去設定test_size跟stratify=Y去分哪些數據是拿來train。再來就是做preprocessing的部分,sc.fit(x_train)這行是先算出x_train裡面的平均值和標準差,之後再經過標準化過後得到x_train_std以及y_test_std。之後再引入perceptron函數,把學習速率,在n_iter選代數的部分設none,eta0學習速率設0.1,random_state的部分則是在每次選代數時都要打亂一次,最後最後即可以預測出y_pre,最後比較y_test和y_pred即可得到此題的accuracy=0.904800,依結果來論算是不錯的正確率。

In [18]:
################################Logistic Regression################################
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:1784] 
Y_train = fashion_train[:,0] 
X_test = fashion_test[:,1:1784] 
Y_test = fashion_test[:,0] 

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(fashion_train)
X_train_std = sc.transform(fashion_train)
X_test_std = sc.transform(fashion_test)

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train_std, Y_train)
Y_pred=logistic_regression.predict(X_test_std)

print('Misclassified samples: %d' %(Y_test != Y_pred).sum())
print('Accuracy: ',metrics.accuracy_score(Y_test, Y_pred))
Misclassified samples: 533
Accuracy:  0.9467

第二種方式我選擇用Logistic Regression的方式去做,前面的讀取數據和preprocessing的方式都和上面相同,最後再將x_train_std和y_train做fit,然後再將x_test_std去做predict即可得到accurancy=0.9467,且分錯的數量是533筆,和perceptron比較起來較為準確,這也是能預見的結果,畢竟數據太多且若不是呈現線性分布,perceptron的準確率就會比較低。

In [2]:
#####KNN#####
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np



import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:1784] 
Y_train = fashion_train[:,0] 
X_test = fashion_test[:,1:1784] 
Y_test = fashion_test[:,0] 

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

    
knn = KNeighborsClassifier(n_neighbors=19,p=2,metric='minkowski')
knn.fit(X_train_std, Y_train)
y_pred = knn.predict(X_test_std)


print('Misclassified samples: %d' %(Y_test != y_pred).sum())
print('Test accuracy:', knn.score(X_test_std, Y_test))
Misclassified samples: 1524
Test accuracy: 0.8476

此題我用knn演算法去做,前面的讀取數據和preprocessing的方式都和上面相同,最後再將x_train_std和y_train做fit,然後再將x_test_std去做predict即可得到accurancy=0.8476,且分錯的數量是1524筆,略劣於上面兩種演算法,而且用knn跑的時間大約為40分鐘,蠻耗時跟cpu的。可是這也是可以預見的結果,因為Logistic Regression是用二元分類(Binary classification)法,而knn則是用多元分類(Multiclass classification),故knn花的時間是Logistic Regression得好幾倍。

In [7]:
################################Random Forest################################
import numpy as np
import pandas as pd
from sklearn import  ensemble
from sklearn import  metrics

# 載入資料
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:1784] 
Y_train = fashion_train[:,0] 
X_test = fashion_test[:,1:1784] 
Y_test = fashion_test[:,0] 


# 建立 random forest 模型
forest = ensemble.RandomForestClassifier(n_estimators = 100)
forest_fit = forest.fit(X_train, Y_train)

# 預測
y_pred = forest.predict(X_test)

# 績效
print('Misclassified samples: %d' %(Y_test != y_pred).sum())
accuracy = metrics.accuracy_score(Y_test, y_pred)
print('Test accuracy:', accuracy)
Misclassified samples: 1161
Test accuracy: 0.8839

此題我用random forest演算法去做,前面的讀取數據和preprocessing的方式都和上面相同,最後再將x_train和y_train做fit,然後再將x_test去做predict即可得到accurancy=0.8839,且分錯的數量是1161筆。因為random forest的算法是將它們全部平均起來取最平均解,就正確率來說算是還可以,而且也不會那麼花時間,是個不錯的演算法選擇。

結論就是Logistic Regression的正確率最高,其次是Perceptron,第三為random forest,最後為KNN。耗時的部分,Perceptron和random forest跑的時間最短,因為perceptron是用一元分類,相對於Logistic Regression(二元分類)和KNN(多元分類),省非常多時間。經過這次的作業,下次在決定演算法時,我覺得要先看data的數量,如果很多的話,像這次有有1700多個features,可能就比較不適合用KNN,因為會花太多時間而且準確性也不一定會很高。但是主要還是要看data的性質,不同data真的要用不同的演算法去做才會知道比較適合哪一種!