First, preprocess the CSV file: remove punctuation and lowercase all English text. The title_abstract_processed column shows the result after processing.

In [2]:
import pandas as pd
papers = pd.read_csv('Google_AI_published_research.csv')

# Load the regular-expression library
import re
# Remove punctuation
papers['title_abstract_processed'] = papers['title_abstract'].map(
    lambda x: re.sub(r'[,\.!?:-]', '', x))
# Convert the text to lowercase
papers['title_abstract_processed'] = papers['title_abstract_processed'].map(
    lambda x: x.lower())
# Preview the first ten rows
papers.head(10)
Out[2]:
title_abstract title_abstract_processed
0 Evaluating similarity measures: a large-scale ... evaluating similarity measures a largescale st...
1 Web Search for a Planet: The Google Cluster Ar... web search for a planet the google cluster arc...
2 The Price of Performance: An Economic Case for... the price of performance an economic case for ...
3 The Google File System. We have designed and ... the google file system we have designed and i...
4 Interpreting the Data: Parallel Analysis with ... interpreting the data parallel analysis with s...
5 Query-Free News Search. Many daily activities... queryfree news search many daily activities p...
6 Searching the Web by Voice. Spoken queries ar... searching the web by voice spoken queries are...
7 Who Links to Whom: Mining Linkage between Web ... who links to whom mining linkage between web s...
8 PowerPoint: Shot with its own bullets. Imagin... powerpoint shot with its own bullets imagine ...
9 The Chubby lock service for loosely-coupled di... the chubby lock service for looselycoupled dis...
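Note that the character class above only strips the handful of punctuation marks it lists. If the abstracts contain other symbols (semicolons, quotes, parentheses), a broader pattern such as [^\w\s] removes them all; a minimal sketch of that variant, reusing the papers dataframe from the cell above:

In [ ]:
import re
# Alternative: drop every non-word, non-space character, then lowercase
papers['title_abstract_processed'] = papers['title_abstract'].map(
    lambda x: re.sub(r'[^\w\s]', '', x).lower())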

Next, convert the text into numeric features with TfidfVectorizer, which also removes common English stop words (e.g., "I", "and", "or").

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF features, dropping common English stop words
vectorizer = TfidfVectorizer(stop_words='english')
text = vectorizer.fit_transform(papers.title_abstract_processed)
X = text.toarray()
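To sanity-check the vectorization, it helps to look at which terms carry the most weight overall. A quick sketch, assuming scikit-learn >= 1.0 where get_feature_names_out is available:

In [ ]:
import numpy as np

# Terms corresponding to the columns of X
terms = vectorizer.get_feature_names_out()
# Mean TF-IDF weight of each term across all documents
mean_w = np.asarray(text.mean(axis=0)).ravel()
# Print the ten heaviest terms
for idx in np.argsort(mean_w)[::-1][:10]:
    print(terms[idx], round(float(mean_w[idx]), 4))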
Next, render the data as a dendrogram to visualize the merge history of the hierarchical clustering and help choose the number of clusters.
In [8]:
from matplotlib import pyplot as plt
import scipy.cluster.hierarchy as sch

# Ward linkage merges the two clusters that least increase total within-cluster variance
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.show()
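The dendrogram only suggests where to cut the tree; to actually extract flat cluster labels at a chosen k, scipy's fcluster can cut the same ward linkage. A minimal sketch (k = 3 is a hypothetical value read off the plot, not a result from the original analysis):

In [ ]:
from scipy.cluster.hierarchy import fcluster

Z = sch.linkage(X, method='ward')
k = 3  # hypothetical cut; choose from the dendrogram above
labels = fcluster(Z, t=k, criterion='maxclust')
# Cluster sizes at this cut
print(pd.Series(labels).value_counts())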

Now compute the silhouette score. The silhouette score ranges from -1 to +1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have high values, the clustering configuration is appropriate; if many points have low or negative values, the configuration may have too many or too few clusters.

In [7]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
# Try 2 through 11 clusters with complete linkage on euclidean distances (the default metric)
for k in range(2, 12):
    ac = AgglomerativeClustering(n_clusters=k, linkage='complete')
    y_hc = ac.fit_predict(X)
    print(k, 'Clusters score = ', silhouette_score(X, y_hc, metric='euclidean'))
2 Clusters score =  0.00028083678922144646
3 Clusters score =  0.0003630574174508796
4 Clusters score =  -0.0005473138235332339
5 Clusters score =  -0.0005728901666307463
6 Clusters score =  -0.0006574279576907401
7 Clusters score =  -0.0004112794550459987
8 Clusters score =  -0.00022898600365118946
9 Clusters score =  -0.00045843681470635994
10 Clusters score =  -0.00021998659891019678
11 Clusters score =  1.1146551220764525e-06
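All of these scores sit very close to zero, which suggests the complete-linkage clusters in this high-dimensional TF-IDF space are only weakly separated. A hedged sketch of a natural next step, keeping whichever k scored best and writing its labels back onto the dataframe (the cluster column name is my own, not from the original notebook):

In [ ]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Re-run the search and record each score
scores = {}
for k in range(2, 12):
    pred = AgglomerativeClustering(n_clusters=k, linkage='complete').fit_predict(X)
    scores[k] = silhouette_score(X, pred, metric='euclidean')

best_k = max(scores, key=scores.get)  # k = 3 for the scores printed above
papers['cluster'] = AgglomerativeClustering(n_clusters=best_k,
                                            linkage='complete').fit_predict(X)
print('best k =', best_k)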