For this problem I used four methods, namely Perceptron, Logistic Regression, KNN, and random forest, and compared their accuracy to see which performs best.
################################Perceptron################################
import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:]   # the 784 pixel columns
Y_train = fashion_train[:,0]    # the label column
X_test = fashion_test[:,1:]
Y_test = fashion_test[:,0]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # compute mean and std from the training features only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
from sklearn.linear_model import Perceptron
ppn = Perceptron(eta0=0.1, random_state=0)
ppn.fit(X_train_std,Y_train)
Y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' %(Y_test != Y_pred).sum())
from sklearn.metrics import accuracy_score
print('Accuracy: %f' %accuracy_score(Y_test, Y_pred))
For the first method I chose the perceptron. First, pandas reads the two CSV files; since the assignment already provides separate training and test data, there is no need to set test_size and stratify=Y to split off a test portion. Next comes preprocessing: sc.fit(X_train) computes the mean and standard deviation of X_train, and standardizing with those statistics produces X_train_std and X_test_std. Then the Perceptron class is imported, with the learning rate eta0 set to 0.1 and random_state fixed so that the shuffling before each epoch is reproducible. Finally the model predicts Y_pred, and comparing Y_test against Y_pred gives accuracy = 0.904800, a fairly good result.
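The write-up notes that train_test_split with test_size and stratify is unnecessary here because separate train and test files are supplied. For completeness, this is a minimal sketch of how the split would look if only a single labelled dataset were available (assumption: sklearn's small load_digits dataset stands in for the Fashion-MNIST CSVs, which are not bundled here):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# load_digits stands in for the Fashion-MNIST CSVs: also a 10-class
# image dataset, just much smaller (1797 samples, 64 features).
X, y = load_digits(return_X_y=True)

# Stratified split keeps the class proportions equal in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

sc = StandardScaler()
sc.fit(X_train)                    # statistics from the training half only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(eta0=0.1, random_state=0)
ppn.fit(X_train_std, y_train)
acc = accuracy_score(y_test, ppn.predict(X_test_std))
```

Fitting the scaler on the training half only, then transforming both halves, keeps any information about the test set out of the preprocessing.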
################################Logistic Regression################################
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:]   # the 784 pixel columns
Y_train = fashion_train[:,0]    # the label column
X_test = fashion_test[:,1:]
Y_test = fashion_test[:,0]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # compute mean and std from the training features only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train_std, Y_train)
Y_pred=logistic_regression.predict(X_test_std)
print('Misclassified samples: %d' %(Y_test != Y_pred).sum())
print('Accuracy: ',metrics.accuracy_score(Y_test, Y_pred))
For the second method I chose Logistic Regression. The data loading and preprocessing are the same as above; fitting on X_train_std and Y_train and then predicting on X_test_std gives accuracy = 0.9467, with 533 misclassified samples. This is more accurate than the perceptron, which is an expected result: with this much data, if the classes are not linearly separable, the perceptron's accuracy will be comparatively low.
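Although logistic regression is often introduced as a binary classifier, sklearn's LogisticRegression handles all ten clothing classes directly (internally via a one-vs-rest or multinomial scheme), producing one probability per class. A minimal sketch, again using load_digits as a stand-in for Fashion-MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# load_digits stands in for Fashion-MNIST: also 10 classes of small images.
X, y = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_std, y)

# One probability per class for each sample; each row sums to 1,
# and predict() returns the class with the highest probability.
proba = clf.predict_proba(X_std[:5])
```

So no extra work is needed to go from two classes to ten; the multiclass handling is built in.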
################################KNN################################
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import pandas as pd
#preprocessing
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:]   # the 784 pixel columns
Y_train = fashion_train[:,0]    # the label column
X_test = fashion_test[:,1:]
Y_test = fashion_test[:,0]
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=19,p=2,metric='minkowski')
knn.fit(X_train_std, Y_train)
y_pred = knn.predict(X_test_std)
print('Misclassified samples: %d' %(Y_test != y_pred).sum())
print('Test accuracy:', knn.score(X_test_std, Y_test))
For this part I used the KNN algorithm. The data loading and preprocessing are the same as above; fitting on X_train_std and Y_train and then predicting on X_test_std gives accuracy = 0.8476, with 1524 misclassified samples, slightly worse than the two algorithms above. The KNN run also took about 40 minutes, which is heavy on both time and CPU. This is an expected result as well: KNN has essentially no training phase, so at prediction time every test point must be compared against all 60,000 training samples in 784 dimensions, which is why it takes many times longer than Logistic Regression.
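The cost can be seen in how KNN works: fit() mostly just stores the training set, and every prediction is a nearest-neighbor search over it. A minimal sketch, assuming load_digits as a stand-in for Fashion-MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)    # stand-in for Fashion-MNIST
X_std = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=19, p=2, metric='minkowski')
knn.fit(X_std, y)                      # "fit" mostly just stores the data

# Prediction searches the stored training set: here are the distances to
# and indices of the 19 nearest training points for one query sample.
dist, idx = knn.kneighbors(X_std[:1])
```

With 60,000 stored samples in 784 dimensions, that per-query search is exactly where the 40 minutes go.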
################################Random Forest################################
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
# load the data
fashion_train = pd.read_csv('fashion-mnist_train.csv')
fashion_test = pd.read_csv('fashion-mnist_test.csv')
fashion_train = fashion_train.values
fashion_test = fashion_test.values
X_train = fashion_train[:,1:]   # the 784 pixel columns
Y_train = fashion_train[:,0]    # the label column
X_test = fashion_test[:,1:]
Y_test = fashion_test[:,0]
# build the random forest model
forest = ensemble.RandomForestClassifier(n_estimators=100)
forest.fit(X_train, Y_train)
# predict
y_pred = forest.predict(X_test)
# evaluate
print('Misclassified samples: %d' %(Y_test != y_pred).sum())
accuracy = metrics.accuracy_score(Y_test, y_pred)
print('Test accuracy:', accuracy)
For this part I used the random forest algorithm. The data loading is the same as above (tree-based models do not need standardization, so the raw pixel values are used directly); fitting on X_train and Y_train and then predicting on X_test gives accuracy = 0.8839, with 1161 misclassified samples. Since a random forest combines many decision trees and averages their votes, the accuracy is acceptable, and it does not take nearly as long to run, making it a good algorithmic choice.
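The "averaging the votes" behavior can be inspected directly on the fitted model: the ensemble keeps all of its individual trees, and predict_proba averages their per-class votes. A minimal sketch, assuming load_digits as a stand-in for Fashion-MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)    # stand-in for Fashion-MNIST
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The fitted ensemble holds its 100 individual decision trees.
n_trees = len(forest.estimators_)

# predict_proba averages the trees' per-class votes into one
# probability distribution per sample.
proba = forest.predict_proba(X[:1])
```

Because each tree trains on a bootstrap sample and a random feature subset, the averaged prediction is more stable than any single tree.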
In conclusion, Logistic Regression has the highest accuracy, followed by the Perceptron, then random forest, and finally KNN. As for runtime, the Perceptron and the random forest are the fastest: the perceptron is a simple linear model and the forest's trees train quickly, whereas KNN must compare every test point against the entire training set, so it costs far more time than the others. From this assignment I learned that when choosing an algorithm one should first look at the size of the data: with 784 features per sample, as here, KNN may be a poor fit, because it takes too much time and its accuracy is not necessarily higher. Ultimately, though, it depends on the nature of the data; you really have to try different algorithms on different datasets to know which fits best!
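The four-way comparison above can also be run end-to-end in a single loop. A minimal sketch, assuming load_digits as a stand-in for the Fashion-MNIST files (the accuracy ranking on a different dataset need not match the results reported above):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)    # stand-in for Fashion-MNIST
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# One shared standardization, fitted on the training half only.
sc = StandardScaler().fit(X_tr)
X_tr_std, X_te_std = sc.transform(X_tr), sc.transform(X_te)

models = {
    'Perceptron': Perceptron(eta0=0.1, random_state=0),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=19),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr_std, y_tr)
    scores[name] = model.score(X_te_std, y_te)
```

Keeping the split and preprocessing identical across models makes the accuracy numbers directly comparable.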