First, preprocess the CSV file: remove punctuation and convert all English text to lowercase. The title_abstract_processed column below shows the text after processing.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

papers = pd.read_csv('Google_AI_published_research.csv')

# Load the regular-expression library
import re
# Remove punctuation
papers['title_abstract_processed'] = papers['title_abstract'].map(lambda x: re.sub(r'[,\.!?:-]', '', x))
# Convert the text to lowercase
papers['title_abstract_processed'] = papers['title_abstract_processed'].map(lambda x: x.lower())
# Show the first few processed rows
papers['title_abstract_processed'].head()

papers.head(10)
Out[2]:
title_abstract title_abstract_processed
0 Evaluating similarity measures: a large-scale ... evaluating similarity measures a largescale st...
1 Web Search for a Planet: The Google Cluster Ar... web search for a planet the google cluster arc...
2 The Price of Performance: An Economic Case for... the price of performance an economic case for ...
3 The Google File System. We have designed and ... the google file system we have designed and i...
4 Interpreting the Data: Parallel Analysis with ... interpreting the data parallel analysis with s...
5 Query-Free News Search. Many daily activities... queryfree news search many daily activities p...
6 Searching the Web by Voice. Spoken queries ar... searching the web by voice spoken queries are...
7 Who Links to Whom: Mining Linkage between Web ... who links to whom mining linkage between web s...
8 PowerPoint: Shot with its own bullets. Imagin... powerpoint shot with its own bullets imagine ...
9 The Chubby lock service for loosely-coupled di... the chubby lock service for looselycoupled dis...

Next, convert the text into numeric features with TfidfVectorizer, which also drops common English stop words (e.g. I, and, or). Then use the elbow method to help find an appropriate number of clusters for the dataset; the resulting plot shows that the SSE keeps dropping as k grows and is lowest at clusters = 20 within the tested range.

In [3]:
tfidf = TfidfVectorizer(
    min_df = 5,           # ignore terms that appear in fewer than 5 documents
    max_df = 0.95,        # ignore terms that appear in more than 95% of documents
    max_features = 8000,
    stop_words = 'english'
)
tfidf.fit(papers.title_abstract_processed)
text = tfidf.transform(papers.title_abstract_processed)

def find_optimal_clusters(data, max_k):
    """Fit MiniBatchKMeans for k = 2, 4, ..., max_k and plot the SSE (inertia) for each k."""
    iters = range(2, max_k+1, 2)
    
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=1024, batch_size=2048, random_state=20).fit(data).inertia_)
        print('Fit {} clusters'.format(k))
        
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')
    
find_optimal_clusters(text, 20)
Fit 2 clusters
Fit 4 clusters
Fit 6 clusters
Fit 8 clusters
Fit 10 clusters
Fit 12 clusters
Fit 14 clusters
Fit 16 clusters
Fit 18 clusters
Fit 20 clusters
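
The plot above only shows the curve; as an illustrative aside, one rough way to locate the elbow numerically is to find the largest second difference of the SSE values. This is a minimal sketch under that heuristic, assuming an sse list like the one built inside find_optimal_clusters; the function name and variables here are hypothetical, not part of the notebook:

import numpy as np

def elbow_by_second_difference(ks, sse):
    # Discrete second difference as a crude curvature estimate;
    # it is defined for ks[1:-1], and a larger value means a sharper bend.
    second_diff = np.diff(sse, n=2)
    return ks[int(np.argmax(second_diff)) + 1]

# Hypothetical usage with the values from the sweep above:
# ks = list(range(2, 21, 2))
# best_k = elbow_by_second_difference(ks, sse)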

The elbow plot shows that the SSE keeps decreasing and is lowest at 20 cluster centers. However, splitting into 20 clusters would make the plots below too cluttered to analyze, so I use only 6 cluster centers, which keeps the colors easy to tell apart. Below I visualize the clusters with two methods, PCA and t-SNE. When reducing dimensionality with PCA, we keep the most important features and simply discard the less important ones. t-SNE instead uses a more elaborate formulation to relate the high-dimensional and low-dimensional spaces: it approximates pairwise similarities in the high-dimensional data with a Gaussian probability density, approximates similarities in the low-dimensional embedding with a Student's t-distribution, measures the mismatch between the two with the KL divergence, and finally minimizes it with gradient descent (or stochastic gradient descent).
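
For reference, the quantities described above can be written out explicitly (this is the standard t-SNE formulation, not code from this notebook):

% Gaussian similarity in the high-dimensional space
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Student-t similarity in the low-dimensional embedding
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% KL divergence minimized by (stochastic) gradient descent
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}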

In [5]:
def plot_tsne_pca(data, labels):
    max_label = max(labels)
    # Sample 3000 documents (with replacement, so this also works if the corpus is smaller)
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=True)
    
    # PCA straight down to 2-D; t-SNE on a 50-D PCA projection, since t-SNE scales poorly with dimension
    pca = PCA(n_components=2).fit_transform(data[max_items,:].toarray())
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items,:].toarray()))
    
    # Plot a random subset of 300 points, colored by cluster label
    idx = np.random.choice(range(pca.shape[0]), size=300, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]
    
    f, ax = plt.subplots(1, 2, figsize=(14, 6))
    
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')
    
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')
    
clusters = MiniBatchKMeans(n_clusters=6, init_size=1024, batch_size=2048, random_state=20).fit_predict(text)
plot_tsne_pca(text, clusters)
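
As a side check on the claim that PCA keeps only the most important directions, one could inspect how much variance each projection retains. A minimal sketch reusing the text matrix from above (the 2- and 50-component choices mirror the plotting cell; pca2 and pca50 are names introduced here):

# Fit PCA on the dense TF-IDF matrix and report retained variance
pca2 = PCA(n_components=2).fit(text.toarray())
print('Variance explained by 2 components: {:.1%}'.format(pca2.explained_variance_ratio_.sum()))

pca50 = PCA(n_components=50).fit(text.toarray())
print('Variance explained by 50 components: {:.1%}'.format(pca50.explained_variance_ratio_.sum()))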

Finally, use the k-means result to pull out the top ten keywords for each of the 6 cluster centers.

In [6]:
def get_top_keywords(data, clusters, labels, n_terms):
    # Average the TF-IDF weight of each term within each cluster
    df = pd.DataFrame(data.toarray()).groupby(clusters).mean()
    
    for i,r in df.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([labels[t] for t in np.argsort(r)[-n_terms:]]))
            
get_top_keywords(text, clusters, tfidf.get_feature_names_out(), 10)
Cluster 0
services,applications,google,quantum,performance,algorithm,network,systems,distributed,data

Cluster 1
data,user,users,software,queries,query,code,google,web,search

Cluster 2
recognition,deep,network,language,training,networks,speech,model,models,neural

Cluster 3
using,methods,algorithm,training,model,images,data,method,image,learning

Cluster 4
object,spatiotemporal,quality,action,frames,motion,content,youtube,videos,video

Cluster 5
information,security,design,mobile,privacy,data,user,users,research,online
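
To connect these keyword lists back to concrete papers, one could attach the cluster labels to the dataframe and sample a few titles per cluster. A minimal sketch reusing papers and clusters from above; the cluster column name is introduced here for illustration:

# Attach each paper's cluster label, then peek at a few papers per cluster
papers['cluster'] = clusters
for c in sorted(papers['cluster'].unique()):
    print('\nCluster {}:'.format(c))
    print(papers.loc[papers['cluster'] == c, 'title_abstract'].head(3).to_string(index=False))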