import pandas as pd
df = pd.read_csv('Google_AI_published_research.csv', encoding='utf-8')
df.head()
import re
# Strip HTML tags and character entities from every title/abstract.
# A vectorized .apply avoids the chained-assignment pitfall of writing to
# df['title_abstract'][i] inside a loop.
html_entities = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
df['title_abstract'] = df['title_abstract'].apply(lambda text: html_entities.sub('', text))
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
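# The NLTK stopword list ships as a separate corpus; the download below is a
# one-time step (skip it if the corpus is already installed locally).
import nltk
nltk.download('stopwords', quiet=True)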
stop = stopwords.words('english')
def cleaner_tokenizer(text, stopword=stop):
    """Lower-case, strip non-word characters and digits, split, and drop stopwords."""
    text = re.sub(r'[\W]+', ' ', text.lower())  # raw string avoids the invalid-escape warning
    text = re.sub(r'[0-9]+', '', text)
    text = text.split()
    text = [w for w in text if w not in stopword]
    return text
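# Quick sanity check of the tokenizer on a made-up sentence: it lower-cases,
# strips punctuation and digits, and drops English stopwords.
print(cleaner_tokenizer('Scaling Transformers to 100 languages!'))
# expected: ['scaling', 'transformers', 'languages']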
X = pd.DataFrame(df['title_abstract'].apply(cleaner_tokenizer))
X.head()
import nltk
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    """Map the first letter of a word's Penn Treebank tag to a WordNet POS constant."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
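# get_wordnet_pos maps the first letter of the Penn Treebank tag from nltk.pos_tag
# to the WordNet constants the lemmatizer expects, defaulting to NOUN.
# The tagger and WordNet corpora are separate downloads (one-time step, skip if present):
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)   # required by the lemmatizer on newer NLTK releases
# e.g. get_wordnet_pos('running') usually returns wordnet.VERB ('running' is tagged VBG)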
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
X_lem = []
for row in X['title_abstract']:
    X_lem.append(' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in row]))
X_lem = pd.DataFrame(X_lem, columns=['Text_Lemmatized'])
X_lem.head()
porter = PorterStemmer()
X_stem = []
for row in X['title_abstract']:
    X_stem.append(' '.join([porter.stem(word) for word in row]))
X_stem = pd.DataFrame(X_stem, columns=['Text_Stemmed'])
X_stem.head()
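# Side-by-side look at the two normalizations on the first document (illustrative
# only; the exact strings depend on the CSV contents).
print('Lemmatized:', X_lem['Text_Lemmatized'][0][:120])
print('Stemmed:   ', X_stem['Text_Stemmed'][0][:120])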
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
count = CountVectorizer(stop_words='english',
max_features=10000)
lda = LatentDirichletAllocation(n_components=10, random_state=6, learning_method='batch')
X_lda = count.fit_transform(df['title_abstract'].values)
X_topics = lda.fit_transform(X_lda)
n_top_words = 5
top_words = []
feature_names = count.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (idx + 1))
    top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]])
    print(" ".join(top_words[idx]))
text_idx = []
for i in range(10):
    topic = X_topics[:, i].argsort()
    text_idx.append(topic[-1:-4:-1])
    print('Topic #%d:' % (i + 1), '(text #%d)' % topic[-1])
    print(df['title_abstract'][topic[-1]])    # truncate with [:300] for long abstracts
count = CountVectorizer(stop_words='english',
max_df=0.1,
max_features=10000)
lda = LatentDirichletAllocation(n_components=10, random_state=6, learning_method='batch')
X_lda = count.fit_transform(df['title_abstract'].values)
X_topics = lda.fit_transform(X_lda)
n_top_words = 5
top_words = []
feature_names = count.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (idx + 1))
    top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]])
    print(" ".join(top_words[idx]))
text_idx = []
for i in range(10):
    topic = X_topics[:, i].argsort()
    text_idx.append(topic[-1:-4:-1])
    print('Topic #%d:' % (i + 1), '(text #%d)' % topic[-1])
    print(df['title_abstract'][topic[-1]])    # truncate with [:300] for long abstracts
count = CountVectorizer(stop_words='english',
max_df=0.1,
max_features=10000)
lda = LatentDirichletAllocation(n_components=10, random_state=6, learning_method='batch')
X_lda = count.fit_transform(X_lem['Text_Lemmatized'])
X_topics = lda.fit_transform(X_lda)
n_top_words = 5
top_words_lem = []
feature_names = count.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (idx + 1))
    top_words_lem.append([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]])
    print(" ".join(top_words_lem[idx]))
text_idx_lem = []
for i in range(10):
    topic = X_topics[:, i].argsort()
    text_idx_lem.append(topic[:-4:-1])
    print('Topic #%d:' % (i + 1), '(text #%d)' % topic[-1])
    print(df['title_abstract'][topic[-1]])    # truncate with [:300] for long abstracts
# Output format -- Topic: main topic of the text (related top words)
count = CountVectorizer(stop_words='english',
max_df=0.1,
max_features=10000)
lda = LatentDirichletAllocation(n_components=10, random_state=6, learning_method='batch')
X_lda = count.fit_transform(X_stem['Text_Stemmed'])
X_topics = lda.fit_transform(X_lda)
n_top_words = 5
top_words_stem = []
feature_names = count.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (idx + 1))
    top_words_stem.append([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]])
    print(" ".join(top_words_stem[idx]))
text_idx_stem = []
for i in range(10):
    topic = X_topics[:, i].argsort()
    text_idx_stem.append(topic[-1:-4:-1])
    print('Topic #%d:' % (i + 1), '(text #%d)' % topic[-1])
    print(df['title_abstract'][topic[-1]])    # truncate with [:300] for long abstracts
match = []
print('Raw Text\n')
for n, idx in enumerate(text_idx):
    m = 0
    for i in idx:
        for w in X['title_abstract'][i]:
            if w in top_words[n]:
                m += 1
    print('  Topic %2d (%s): %d matches' % ((n+1), ', '.join(top_words[n]), m))
    match.append(m)
print('  Total matches:', sum(match))
match = []
print('\nLemmatized Text\n')
for n, idx in enumerate(text_idx_lem):
    m = 0
    for i in idx:
        for w in X_lem['Text_Lemmatized'][i].split():
            if w in top_words_lem[n]:
                m += 1
    print('  Topic %2d (%s): %d matches' % ((n+1), ', '.join(top_words_lem[n]), m))
    match.append(m)
print('  Total matches:', sum(match))
match = []
print('\nStemmed Text\n')
for n, idx in enumerate(text_idx_stem):
    m = 0
    for i in idx:
        for w in X_stem['Text_Stemmed'][i].split():
            if w in top_words_stem[n]:
                m += 1
    print('  Topic %2d (%s): %d matches' % ((n+1), ', '.join(top_words_stem[n]), m))
    match.append(m)
print('  Total matches:', sum(match))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf1 = TfidfVectorizer(max_features=10000, max_df=0.1)
X_lem_vect = tfidf1.fit_transform(X_lem['Text_Lemmatized'])
tfidf2 = TfidfVectorizer(max_features=10000, max_df=0.1)
X_stem_vect = tfidf2.fit_transform(X_stem['Text_Stemmed'])
X_lem_vect = X_lem_vect.toarray()
X_stem_vect = X_stem_vect.toarray()
print('X Lemmatized:')
print(X_lem_vect)
print('X Stemmed:')
print(X_stem_vect)
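# Shape check: rows are documents, columns are the (at most) 10,000 terms kept by
# each vectorizer; the two vocabularies differ because they were fit on different text.
print('Lemmatized matrix:', X_lem_vect.shape)
print('Stemmed matrix:   ', X_stem_vect.shape)
print('Sample stemmed terms:', tfidf2.get_feature_names_out()[:10])  # get_feature_names() on scikit-learn < 1.0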
from sklearn.cluster import KMeans
import pyprind
pbar = pyprind.ProgBar(18)
distortions = []
for i in range(3, 21):
    km = KMeans(n_clusters=i,
                n_init=10,
                max_iter=500,
                tol=1e-04,
                random_state=0)   # n_jobs was removed from KMeans in scikit-learn 0.25
    km.fit(X_stem_vect.T)         # transpose: cluster the stemmed terms, not the documents
    distortions.append(km.inertia_)
    pbar.update()
print('# of Clusters | Distortion')
print('--------------|-----------')
for i, d in enumerate(distortions):
    print('      %-2d      | %.2f' % (i+3, d))
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(range(3,21), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.xlim(3,20)
plt.grid()
plt.tight_layout()
plt.show()
from matplotlib import cm
from sklearn.metrics import silhouette_samples
import numpy as np
km = KMeans(n_clusters=8, n_init=10, max_iter=1000, tol=1e-04, random_state=0)
y_km = km.fit_predict(X_stem_vect.T)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X_stem_vect.T, y_km, metric='euclidean')
feature_stem = tfidf2.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for n in range(8):
    print('Cluster #%2d | ' % (n+1), end='')
    val = []
    index = []
    for i, c in enumerate(y_km):
        if c == n:
            val.append(silhouette_vals[i])
            index.append(i)
    sort = np.argsort(-np.array(val))
    print('Avg Silhouette: %6.3f | Top Words: ' % np.average(val), end='')
    for j in range(5):
        if j < len(index):
            print(feature_stem[index[sort[j]]], end=' ')
    print('')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color='red')
plt.yticks(yticks, cluster_labels +1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()
import skfuzzy as fuzz
fpcs = []
print('# of Clusters | FPC ')
print('--------------|-----------')
for ncenters in range(3, 11):
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        X_stem_vect, ncenters, 2, error=0.0001, maxiter=1000, init=None)
    fpcs.append(fpc)
    print('      %2d      |   %.3f' % (ncenters, fpc))
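# The fuzzy partition coefficient lies between 0 and 1, with values near 1 indicating
# crisper assignments, so the strongest candidate is simply the maximum (convenience
# check, not part of the original notebook).
print('Best # of centers by FPC: %d' % (np.argmax(fpcs) + 3))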
fig2, ax2 = plt.subplots()
ax2.plot(np.r_[3:11], fpcs)
ax2.set_xlabel("Number of centers")
ax2.set_ylabel("Fuzzy partition coefficient")
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
X_stem_vect, 3, 2, error=0.005, maxiter=1000, init=None)
cluster_membership = np.argmax(u, axis=0)
print(u, u.shape)
print(cluster_membership, len(cluster_membership))
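# Note: skfuzzy's cmeans treats columns as samples, so with X_stem_vect laid out as
# documents x terms the "samples" being clustered here are the stemmed terms; u has
# one row per cluster and one column per term, and argmax over axis 0 gives each
# term's hardest cluster assignment.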
for i, c in enumerate(cluster_membership):
    if c == 1:
        print(feature_stem[i], end=' ')
print(X_stem_vect.T.shape)
print(u.shape)
print(u.T)
from sklearn.metrics import silhouette_samples
silhouette_vals = silhouette_samples(X_stem_vect.T, (u.T).argmax(axis = 1), metric='euclidean')
feature_stem = tfidf2.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for n in range(3):
    print('Cluster #%2d | ' % (n+1), end='')
    val = []
    index = []
    for i, c in enumerate(cluster_membership):
        if c == n:
            val.append(silhouette_vals[i])
            index.append(i)
    sort = np.argsort(-np.array(val))
    print('Top Words: ', end='')
    for j in range(5):
        if j < len(index):
            print(feature_stem[index[sort[j]]], end=' ')
    print('')
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
%matplotlib inline
import matplotlib.pyplot as plt
import pyprind
pbar = pyprind.ProgBar(8)
sil_val = []
for n in range(3, 11):
    ac = AgglomerativeClustering(n_clusters=n,
                                 metric='euclidean',   # 'affinity=' on scikit-learn < 1.2
                                 linkage='complete')
    y_ac = ac.fit_predict(X_stem_vect.T)
    sil_val.append(silhouette_score(X_stem_vect.T, y_ac, metric='euclidean'))
    pbar.update()
plt.plot(range(3,11), sil_val, marker='s')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Value')
plt.xlim(3,10)
plt.grid()
plt.tight_layout()
plt.show()
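# Convenience read-off of the silhouette curve above: the peak picks the candidate
# cluster count (higher average silhouette = better separated clusters).
print('Best n_clusters by silhouette: %d' % (np.argmax(sil_val) + 3))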
ac = AgglomerativeClustering(n_clusters=4,
                             metric='euclidean',   # 'affinity=' on scikit-learn < 1.2
                             linkage='complete')
labels = ac.fit_predict(X_stem_vect.T)
print('Cluster labels %s' % labels[0:10])
from matplotlib import cm
from sklearn.metrics import silhouette_samples
cluster_labels = np.unique(labels)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X_stem_vect.T, labels, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[labels == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color='red')
plt.yticks(yticks, cluster_labels +1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()
silhouette_vals = silhouette_samples(X_stem_vect.T, labels, metric='euclidean')
feature_stem = tfidf2.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for n in range(4):
    print('Cluster #%2d | ' % (n+1), end='')
    val = []
    index = []
    for i, c in enumerate(labels):
        if c == n:
            val.append(silhouette_vals[i])
            index.append(i)
    sort = np.argsort(-np.array(val))
    print('Top Words: ', end='')
    for j in range(5):
        if j < len(index):
            print(feature_stem[index[sort[j]]], end=' ')
    print('')
import pyprind
pbar = pyprind.ProgBar(18)
distortions = []
for i in range(3, 21):
    km = KMeans(n_clusters=i,
                n_init=10,
                max_iter=500,
                tol=1e-04,
                random_state=0)   # n_jobs was removed from KMeans in scikit-learn 0.25
    km.fit(X_stem_vect)           # no transpose this time: cluster the documents themselves
    distortions.append(km.inertia_)
    pbar.update()
print('# of Clusters | Distortion')
print('--------------|-----------')
for i, d in enumerate(distortions):
    print('      %-2d      | %.2f' % (i+3, d))
plt.plot(range(3,21), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.xlim(3,20)
plt.grid()
plt.tight_layout()
plt.show()
km = KMeans(n_clusters=5, n_init=10, max_iter=1000, tol=1e-04, random_state=0)
y_km = km.fit_predict(X_stem_vect)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X_stem_vect, y_km, metric='euclidean')
for n in range(5):
    print('Cluster #%2d | ' % (n+1), end='')
    val = []
    index = []
    for i, c in enumerate(y_km):
        if c == n:
            val.append(silhouette_vals[i])
            index.append(i)
    sort = np.argsort(-np.array(val))
    print('Avg Silhouette: %6.3f | Topics: ' % np.average(val), end='')
    for j in range(3):
        print(df['title_abstract'][index[sort[j]]].split('.')[0], end='. ')
    print('')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color='red')
plt.yticks(yticks, cluster_labels +1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()
import pyprind
pbar = pyprind.ProgBar(8)
sil_val = []
for n in range(3, 11):
    ac = AgglomerativeClustering(n_clusters=n,
                                 metric='euclidean',   # 'affinity=' on scikit-learn < 1.2
                                 linkage='complete')
    y_ac = ac.fit_predict(X_stem_vect)
    sil_val.append(silhouette_score(X_stem_vect, y_ac, metric='euclidean'))
    pbar.update()
plt.plot(range(3,11), sil_val, marker='s')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Value')
plt.xlim(3,10)
plt.grid()
plt.tight_layout()
plt.show()
ac = AgglomerativeClustering(n_clusters=4,
                             metric='euclidean',   # 'affinity=' on scikit-learn < 1.2
                             linkage='complete')
labels = ac.fit_predict(X_stem_vect)
print('Cluster labels %s' % labels[0:10])
from matplotlib import cm
from sklearn.metrics import silhouette_samples
cluster_labels = np.unique(labels)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X_stem_vect, labels, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[labels == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color='red')
plt.yticks(yticks, cluster_labels +1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()
silhouette_vals = silhouette_samples(X_stem_vect, labels, metric='euclidean')
for n in range(4):
    print('Cluster #%2d | ' % (n+1), end='')
    val = []
    index = []
    for i, c in enumerate(labels):
        if c == n:
            val.append(silhouette_vals[i])
            index.append(i)
    sort = np.argsort(-np.array(val))
    print('Article: ', end='')
    for j in range(3):
        print(df['title_abstract'][index[sort[j]]].split('.')[0], end='. ')
    print('')