First, preprocess the CSV file: strip out punctuation and convert all the text to lowercase. The title_abstract_processed column below shows the result of this cleaning.

In [2]:
import pandas as pd
papers = pd.read_csv('Google_AI_published_research.csv')

# Load the regular-expression library
import re
# Remove punctuation
papers['title_abstract_processed'] = papers['title_abstract'].map(lambda x: re.sub(r'[,\.!?:-]', '', x))
# Convert the text to lowercase
papers['title_abstract_processed'] = papers['title_abstract_processed'].map(lambda x: x.lower())
# Preview the first few processed abstracts
papers['title_abstract_processed'].head()

papers.head(10)
Out[2]:
title_abstract title_abstract_processed
0 Evaluating similarity measures: a large-scale ... evaluating similarity measures a largescale st...
1 Web Search for a Planet: The Google Cluster Ar... web search for a planet the google cluster arc...
2 The Price of Performance: An Economic Case for... the price of performance an economic case for ...
3 The Google File System. We have designed and ... the google file system we have designed and i...
4 Interpreting the Data: Parallel Analysis with ... interpreting the data parallel analysis with s...
5 Query-Free News Search. Many daily activities... queryfree news search many daily activities p...
6 Searching the Web by Voice. Spoken queries ar... searching the web by voice spoken queries are...
7 Who Links to Whom: Mining Linkage between Web ... who links to whom mining linkage between web s...
8 PowerPoint: Shot with its own bullets. Imagin... powerpoint shot with its own bullets imagine ...
9 The Chubby lock service for loosely-coupled di... the chubby lock service for looselycoupled dis...

Next, the text is converted into numeric features with TfidfVectorizer, which also drops common English stop words (e.g. I, and, or). Since the elbow method was already applied in the earlier K-means analysis, the number of clusters here is set directly to 6, the same value used for K-means.
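The elbow analysis itself is not repeated in this notebook; for reference, here is a minimal sketch of what it looks like. It assumes the papers DataFrame from the cell above, rebuilds the same TF-IDF matrix as the next cell, and uses a k range of 2 to 10 chosen purely for illustration:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# Build the same TF-IDF matrix used below
vectorizer = TfidfVectorizer(stop_words='english')
text = vectorizer.fit_transform(papers.title_abstract_processed)

# Fit k-means for a range of k and record the inertia (within-cluster
# sum of squared distances); the bend ("elbow") in the curve suggests
# a reasonable cluster count.
inertias = []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    inertias.append(km.fit(text).inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.title('Elbow method')
plt.show()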

In [17]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the processed text into a TF-IDF matrix, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
text = vectorizer.fit_transform(papers.title_abstract_processed)

# Cluster the documents with k-means++ initialization
true_k = 6
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(text)
Out[17]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=6, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

Next, two dimensionality-reduction techniques are used to visualize the clusters: PCA and t-SNE. With PCA we keep only the most informative components and simply discard the less important ones. t-SNE models the relationship between the high- and low-dimensional spaces with a more elaborate formulation: pairwise similarities in the high-dimensional data are approximated with a Gaussian probability density, while those in the low-dimensional embedding are approximated with a Student's t-distribution; the mismatch between the two is measured with the KL divergence, which is then minimized by gradient descent (or stochastic gradient descent).
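In formula form, this is the standard t-SNE objective, stated here for reference: the low-dimensional points $y_i$ are found by minimizing the KL divergence between the high-dimensional (Gaussian-based) similarities $p_{ij}$ and the low-dimensional Student-t similarities $q_{ij}$,

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.
$$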

In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_tsne_pca(data, labels):
    max_label = max(labels)
    # Sample 3000 documents (with replacement, since the corpus may be smaller)
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=True)

    # PCA directly down to 2 components; for t-SNE, first reduce to 50
    # components with PCA to speed up the embedding
    pca = PCA(n_components=2).fit_transform(data[max_items, :].toarray())
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items, :].toarray()))

    # Plot a random subset of 300 points, colored by cluster label
    idx = np.random.choice(range(pca.shape[0]), size=300, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i / max_label) for i in label_subset[idx]]

    f, ax = plt.subplots(1, 2, figsize=(14, 6))

    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')

    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')

clusters = model.fit_predict(text)
plot_tsne_pca(text, clusters)

Finally, we list the top ten topic terms for each of the six cluster centers obtained with k-means++.

In [30]:
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.2
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print('   %s' % terms[ind])
Cluster 0:
   web
   users
   user
   research
   google
   search
   information
   data
   mobile
   social
Cluster 1:
   learning
   algorithm
   video
   problem
   image
   model
   algorithms
   method
   images
   approach
Cluster 2:
   data
   software
   distributed
   systems
   code
   google
   analysis
   applications
   large
   performance
Cluster 3:
   speech
   language
   models
   recognition
   acoustic
   model
   training
   word
   data
   modeling
Cluster 4:
   advertising
   ad
   ads
   online
   advertisers
   auction
   auctions
   revenue
   advertiser
   attribution
Cluster 5:
   neural
   networks
   deep
   model
   network
   models
   training
   learning
   recurrent
   image
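As a usage note, the fitted vectorizer and model can also assign an unseen abstract to one of these six clusters; a minimal sketch, with a made-up sentence used purely for illustration:

# Hypothetical new abstract, for illustration only
new_doc = ["deep neural networks for large vocabulary speech recognition"]
new_vec = vectorizer.transform(new_doc)  # reuse the fitted TF-IDF vocabulary
print(model.predict(new_vec))            # index of the nearest cluster center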