First, preprocess the CSV file: remove punctuation and convert all English text to lowercase. The title_abstract_processed column below shows the text after processing.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

papers = pd.read_csv('Google_AI_published_research.csv')

# Load the regular-expression library
import re
# Remove punctuation
papers['title_abstract_processed'] = papers['title_abstract'].map(lambda x: re.sub(r'[,\.!?:-]', '', x))
# Convert the text to lowercase
papers['title_abstract_processed'] = papers['title_abstract_processed'].map(lambda x: x.lower())
# Show the first few processed rows
papers['title_abstract_processed'].head()

papers.head(10)
Out[2]:
title_abstract title_abstract_processed
0 Evaluating similarity measures: a large-scale ... evaluating similarity measures a largescale st...
1 Web Search for a Planet: The Google Cluster Ar... web search for a planet the google cluster arc...
2 The Price of Performance: An Economic Case for... the price of performance an economic case for ...
3 The Google File System. We have designed and ... the google file system we have designed and i...
4 Interpreting the Data: Parallel Analysis with ... interpreting the data parallel analysis with s...
5 Query-Free News Search. Many daily activities... queryfree news search many daily activities p...
6 Searching the Web by Voice. Spoken queries ar... searching the web by voice spoken queries are...
7 Who Links to Whom: Mining Linkage between Web ... who links to whom mining linkage between web s...
8 PowerPoint: Shot with its own bullets. Imagin... powerpoint shot with its own bullets imagine ...
9 The Chubby lock service for loosely-coupled di... the chubby lock service for looselycoupled dis...

Next, convert the text into numeric features with TfidfVectorizer, which also drops common English stop words (e.g. I, and, or). Then use the elbow method to help find an appropriate number of clusters for the dataset; the resulting plot shows that the SSE keeps dropping as k grows and is lowest at clusters = 20 within the tested range.

In [3]:
tfidf = TfidfVectorizer(
    min_df = 5,           # ignore terms that appear in fewer than 5 documents
    max_df = 0.95,        # ignore terms that appear in more than 95% of documents
    max_features = 8000,
    stop_words = 'english'
)
tfidf.fit(papers.title_abstract_processed)
text = tfidf.transform(papers.title_abstract_processed)

def find_optimal_clusters(data, max_k):
    """Fit MiniBatchKMeans for k = 2, 4, ..., max_k and plot the SSE (inertia) for each k."""
    iters = range(2, max_k+1, 2)
    
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=1024, batch_size=2048, random_state=20).fit(data).inertia_)
        print('Fit {} clusters'.format(k))
        
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')
    
find_optimal_clusters(text, 20)
Fit 2 clusters
Fit 4 clusters
Fit 6 clusters
Fit 8 clusters
Fit 10 clusters
Fit 12 clusters
Fit 14 clusters
Fit 16 clusters
Fit 18 clusters
Fit 20 clusters
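
The plot above only shows the curve; as an illustrative aside, one rough way to locate the elbow numerically is to find the largest second difference of the SSE values. This is a minimal sketch under that heuristic, assuming an sse list like the one built inside find_optimal_clusters; the function name and variables here are hypothetical, not part of the notebook:

import numpy as np

def elbow_by_second_difference(ks, sse):
    # Discrete second difference as a crude curvature estimate;
    # it is defined for ks[1:-1], and a larger value means a sharper bend.
    second_diff = np.diff(sse, n=2)
    return ks[int(np.argmax(second_diff)) + 1]

# Hypothetical usage with the values from the sweep above:
# ks = list(range(2, 21, 2))
# best_k = elbow_by_second_difference(ks, sse)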

The elbow plot shows that the SSE keeps decreasing and is lowest at 20 cluster centers. However, splitting into 20 clusters would make the plots below too cluttered to analyze, so I use only 6 cluster centers, which keeps the colors easy to tell apart. Below I visualize the clusters with two methods, PCA and t-SNE. When reducing dimensionality with PCA, we keep the most important features and simply discard the less important ones. t-SNE instead uses a more elaborate formulation to relate the high-dimensional and low-dimensional spaces: it approximates pairwise similarities in the high-dimensional data with a Gaussian probability density, approximates similarities in the low-dimensional embedding with a Student's t-distribution, measures the mismatch between the two with the KL divergence, and finally minimizes it with gradient descent (or stochastic gradient descent).
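
For reference, the quantities described above can be written out explicitly (this is the standard t-SNE formulation, not code from this notebook):

% Gaussian similarity in the high-dimensional space
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Student-t similarity in the low-dimensional embedding
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% KL divergence minimized by (stochastic) gradient descent
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}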

In [5]:
def plot_tsne_pca(data, labels):
    max_label = max(labels)
    # Sample 3000 documents (with replacement, so this also works if the corpus is smaller)
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=True)
    
    # PCA straight down to 2-D; t-SNE on a 50-D PCA projection, since t-SNE scales poorly with dimension
    pca = PCA(n_components=2).fit_transform(data[max_items,:].toarray())
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items,:].toarray()))
    
    # Plot a random subset of 300 points, colored by cluster label
    idx = np.random.choice(range(pca.shape[0]), size=300, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i/max_label) for i in label_subset[idx]]
    
    f, ax = plt.subplots(1, 2, figsize=(14, 6))
    
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')
    
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')
    
clusters = MiniBatchKMeans(n_clusters=6, init_size=1024, batch_size=2048, random_state=20).fit_predict(text)
plot_tsne_pca(text, clusters)
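
As a side check on the claim that PCA keeps only the most important directions, one could inspect how much variance each projection retains. A minimal sketch reusing the text matrix from above (the 2- and 50-component choices mirror the plotting cell; pca2 and pca50 are names introduced here):

# Fit PCA on the dense TF-IDF matrix and report retained variance
pca2 = PCA(n_components=2).fit(text.toarray())
print('Variance explained by 2 components: {:.1%}'.format(pca2.explained_variance_ratio_.sum()))

pca50 = PCA(n_components=50).fit(text.toarray())
print('Variance explained by 50 components: {:.1%}'.format(pca50.explained_variance_ratio_.sum()))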

Finally, use the k-means result to pull out the top ten keywords for each of the 6 cluster centers.

In [6]:
def get_top_keywords(data, clusters, labels, n_terms):
    # Average the TF-IDF weight of each term within each cluster
    df = pd.DataFrame(data.toarray()).groupby(clusters).mean()
    
    for i,r in df.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([labels[t] for t in np.argsort(r)[-n_terms:]]))
            
get_top_keywords(text, clusters, tfidf.get_feature_names_out(), 10)
Cluster 0
services,applications,google,quantum,performance,algorithm,network,systems,distributed,data

Cluster 1
data,user,users,software,queries,query,code,google,web,search

Cluster 2
recognition,deep,network,language,training,networks,speech,model,models,neural

Cluster 3
using,methods,algorithm,training,model,images,data,method,image,learning

Cluster 4
object,spatiotemporal,quality,action,frames,motion,content,youtube,videos,video

Cluster 5
information,security,design,mobile,privacy,data,user,users,research,online
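
To connect these keyword lists back to concrete papers, one could attach the cluster labels to the dataframe and sample a few titles per cluster. A minimal sketch reusing papers and clusters from above; the cluster column name is introduced here for illustration:

# Attach each paper's cluster label, then peek at a few papers per cluster
papers['cluster'] = clusters
for c in sorted(papers['cluster'].unique()):
    print('\nCluster {}:'.format(c))
    print(papers.loc[papers['cluster'] == c, 'title_abstract'].head(3).to_string(index=False))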