First, preprocess the CSV file: remove punctuation and lowercase all English text. The title_abstract_processed column shows the result after processing.

In [2]:
import pandas as pd
papers = pd.read_csv('Google_AI_published_research.csv')

# Load the regular-expression library
import re
# Remove punctuation
papers['title_abstract_processed'] = papers['title_abstract'].map(
    lambda x: re.sub(r'[,\.!?:-]', '', x))
# Convert the text to lowercase
papers['title_abstract_processed'] = papers['title_abstract_processed'].map(
    lambda x: x.lower())
# Preview the first ten rows
papers.head(10)
Out[2]:
title_abstract title_abstract_processed
0 Evaluating similarity measures: a large-scale ... evaluating similarity measures a largescale st...
1 Web Search for a Planet: The Google Cluster Ar... web search for a planet the google cluster arc...
2 The Price of Performance: An Economic Case for... the price of performance an economic case for ...
3 The Google File System. We have designed and ... the google file system we have designed and i...
4 Interpreting the Data: Parallel Analysis with ... interpreting the data parallel analysis with s...
5 Query-Free News Search. Many daily activities... queryfree news search many daily activities p...
6 Searching the Web by Voice. Spoken queries ar... searching the web by voice spoken queries are...
7 Who Links to Whom: Mining Linkage between Web ... who links to whom mining linkage between web s...
8 PowerPoint: Shot with its own bullets. Imagin... powerpoint shot with its own bullets imagine ...
9 The Chubby lock service for loosely-coupled di... the chubby lock service for looselycoupled dis...
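Note that the character class above only strips the handful of punctuation marks it lists. If the abstracts contain other symbols (semicolons, quotes, parentheses), a broader pattern such as [^\w\s] removes them all; a minimal sketch of that variant, reusing the papers dataframe from the cell above:

In [ ]:
import re
# Alternative: drop every non-word, non-space character, then lowercase
papers['title_abstract_processed'] = papers['title_abstract'].map(
    lambda x: re.sub(r'[^\w\s]', '', x).lower())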

Next, convert the text into numeric features with TfidfVectorizer, which also removes common English stop words (e.g., "I", "and", "or").

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF features, dropping common English stop words
vectorizer = TfidfVectorizer(stop_words='english')
text = vectorizer.fit_transform(papers.title_abstract_processed)
X = text.toarray()
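To sanity-check the vectorization, it helps to look at which terms carry the most weight overall. A quick sketch, assuming scikit-learn >= 1.0 where get_feature_names_out is available:

In [ ]:
import numpy as np

# Terms corresponding to the columns of X
terms = vectorizer.get_feature_names_out()
# Mean TF-IDF weight of each term across all documents
mean_w = np.asarray(text.mean(axis=0)).ravel()
# Print the ten heaviest terms
for idx in np.argsort(mean_w)[::-1][:10]:
    print(terms[idx], round(float(mean_w[idx]), 4))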
Next, render the data as a dendrogram to visualize the merge history of the hierarchical clustering and help choose the number of clusters.
In [8]:
from matplotlib import pyplot as plt
import scipy.cluster.hierarchy as sch

# Ward linkage merges the two clusters that least increase total within-cluster variance
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.show()
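The dendrogram only suggests where to cut the tree; to actually extract flat cluster labels at a chosen k, scipy's fcluster can cut the same ward linkage. A minimal sketch (k = 3 is a hypothetical value read off the plot, not a result from the original analysis):

In [ ]:
from scipy.cluster.hierarchy import fcluster

Z = sch.linkage(X, method='ward')
k = 3  # hypothetical cut; choose from the dendrogram above
labels = fcluster(Z, t=k, criterion='maxclust')
# Cluster sizes at this cut
print(pd.Series(labels).value_counts())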

Now compute the silhouette score. The silhouette score ranges from -1 to +1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have high values, the clustering configuration is appropriate; if many points have low or negative values, the configuration may have too many or too few clusters.

In [7]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
# Try 2 through 11 clusters with complete linkage on euclidean distances (the default metric)
for k in range(2, 12):
    ac = AgglomerativeClustering(n_clusters=k, linkage='complete')
    y_hc = ac.fit_predict(X)
    print(k, 'Clusters score = ', silhouette_score(X, y_hc, metric='euclidean'))
2 Clusters score =  0.00028083678922144646
3 Clusters score =  0.0003630574174508796
4 Clusters score =  -0.0005473138235332339
5 Clusters score =  -0.0005728901666307463
6 Clusters score =  -0.0006574279576907401
7 Clusters score =  -0.0004112794550459987
8 Clusters score =  -0.00022898600365118946
9 Clusters score =  -0.00045843681470635994
10 Clusters score =  -0.00021998659891019678
11 Clusters score =  1.1146551220764525e-06
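All of these scores sit very close to zero, which suggests the complete-linkage clusters in this high-dimensional TF-IDF space are only weakly separated. A hedged sketch of a natural next step, keeping whichever k scored best and writing its labels back onto the dataframe (the cluster column name is my own, not from the original notebook):

In [ ]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Re-run the search and record each score
scores = {}
for k in range(2, 12):
    pred = AgglomerativeClustering(n_clusters=k, linkage='complete').fit_predict(X)
    scores[k] = silhouette_score(X, pred, metric='euclidean')

best_k = max(scores, key=scores.get)  # k = 3 for the scores printed above
papers['cluster'] = AgglomerativeClustering(n_clusters=best_k,
                                            linkage='complete').fit_predict(X)
print('best k =', best_k)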