Topic Modeling of Short Texts: A Pseudo-Document View with Word Embedding Enhancement
ABSTRACT :
Recent years have witnessed the unprecedented growth of online social media, making short texts the prevalent format of information on the Internet. Owing to the sparsity of such data, however, short-text topic modeling remains a critical and widely studied challenge in both academia and industry. Much research has been devoted to building probabilistic topic models for short texts, among which self-aggregation methods have recently emerged to provide informative cross-text word co-occurrences. Models along this line, however, are still in their infancy: they tend to overfit and incur high computational costs. In this paper, we propose a novel model called the Pseudo-document-based Topic Model (PTM), which introduces the concept of a pseudo-document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo-documents rather than of individual short texts, PTM achieves excellent performance in both accuracy and efficiency. A word-embedding-enhanced PTM (WE-PTM) is also proposed to leverage pre-trained word embeddings, which further alleviates data sparsity. Extensive experiments against self-aggregation and word-embedding-based baselines on four real-world datasets, including two collections of online social media short texts, demonstrate the high quality of the topics learned by our models. Robustness to limited training samples and the interpretable semantics of the learned topics are also investigated.
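The following is a minimal sketch, not the authors' released code, of the core PTM idea described in the abstract: each short text is implicitly assigned to one of P latent pseudo-documents, and topics are inferred from pseudo-document-level counts by collapsed Gibbs sampling. The parameter names (P, K, alpha, beta, lam) and default values are illustrative assumptions.

import numpy as np
from scipy.special import gammaln

def ptm_gibbs(docs, V, P=100, K=20, alpha=0.1, beta=0.01, lam=0.1, iters=200, seed=0):
    """docs: list of token-id lists; V: vocabulary size."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    l = rng.integers(0, P, size=D)                       # pseudo-document of each short text
    z = [rng.integers(0, K, size=len(d)) for d in docs]  # topic of each token
    n_pk = np.zeros((P, K)); n_kw = np.zeros((K, V))
    n_k = np.zeros(K);       n_p = np.zeros(P)           # short texts per pseudo-document
    for d, doc in enumerate(docs):
        n_p[l[d]] += 1
        for i, w in enumerate(doc):
            n_pk[l[d], z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            m_dk = np.bincount(z[d], minlength=K)        # topic counts inside this short text
            # 1) resample the pseudo-document of short text d (theta collapsed out)
            p = l[d]
            n_p[p] -= 1; n_pk[p] -= m_dk
            sum_pk = n_pk.sum(axis=1)
            log_w = (np.log(n_p + lam)
                     + (gammaln(n_pk + alpha + m_dk) - gammaln(n_pk + alpha)).sum(axis=1)
                     + gammaln(sum_pk + K * alpha)
                     - gammaln(sum_pk + len(doc) + K * alpha))
            w_ = np.exp(log_w - log_w.max())
            p = rng.choice(P, p=w_ / w_.sum())
            l[d] = p; n_p[p] += 1; n_pk[p] += m_dk
            # 2) resample the topic of each token, LDA-style within pseudo-document p
            for i, w in enumerate(doc):
                k = z[d][i]
                n_pk[p, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                pk = (n_pk[p] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=pk / pk.sum())
                z[d][i] = k
                n_pk[p, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)  # topic-word distributions
    return phi, l, z

Because topics are estimated only from the P regular-sized pseudo-documents, the sampler sees much richer word co-occurrence statistics than it would from the individual short texts.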
EXISTING SYSTEM :
• It is also natural to exploit external lexical knowledge to guide topic inference over short texts. Existing works along this line largely rely on either external thesauri (e.g., WordNet) or lexical knowledge derived from documents in a specific domain.
• Compared with existing approaches that incorporate word embeddings into topic models, the generalized Pólya urn (GPU) model significantly reduces computational cost (a minimal sketch follows this list).
• Several existing works exploit the GPU model together with external thesauri or domain-specific knowledge to improve topic inference in standard LDA.
• The experimental results show that GPU-DMM outperforms existing state-of-the-art alternatives in terms of effectiveness and efficiency.
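Below is a minimal sketch, assumed rather than taken from the GPU-DMM reference implementation, of the generalized Pólya urn idea mentioned above: when a word is assigned to a topic, words that are semantically related to it under pre-trained embeddings are promoted in that topic as well. The names embeddings, mu, and eps are illustrative.

import numpy as np

def build_promotion_table(embeddings, mu=0.3, eps=0.5):
    """embeddings: (V, dim) array of pre-trained word vectors.
    Returns, for each word, the related words it promotes and their weights.
    The dense V x V similarity matrix is acceptable only for modest vocabularies."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T                              # cosine similarity between all word pairs
    table = {}
    for w in range(sim.shape[0]):
        related = np.where(sim[w] > eps)[0]
        weights = np.where(related == w, 1.0, mu)    # the word itself counts fully
        table[w] = (related, weights)
    return table

def gpu_update(n_kw, n_k, k, w, table, sign=+1):
    """Add (sign=+1) or remove (sign=-1) word w from topic k, also promoting
    its embedding neighbours by the fractional weight mu."""
    related, weights = table[w]
    n_kw[k, related] += sign * weights
    n_k[k] += sign * weights.sum()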
DISADVANTAGE :
• The problem is that auxiliary information is not always available, or is simply too costly to deploy.
• These methods bring in little additional word co-occurrence information and therefore still suffer from the data sparsity problem.
• In the second phase, the inference procedure has to estimate a probability distribution over pseudo-documents for each short text independently, so the number of parameters grows linearly with the corpus size, which may lead to serious overfitting when training samples are scarce (see the illustrative comparison after this list).
• PTM and SPTM both reveal topics from P pseudo-documents; adjusting P is the key to easing the data sparsity problem faced by traditional topic models such as LDA.
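The parameter-growth issue above can be made concrete with an illustrative (assumed) count: a two-phase self-aggregation scheme needs a distribution over pseudo-documents for every short text, whereas PTM only keeps one topic distribution per pseudo-document plus a single pseudo-document label per text.

# Illustrative figures only; D, P, K are assumptions, not the paper's settings.
D, P, K = 100_000, 1_000, 50        # short texts, pseudo-documents, topics
two_phase_params = D * P            # per-text distributions over pseudo-documents
ptm_params = P * K + D              # pseudo-document topic mixtures + one label per text
print(two_phase_params, ptm_params) # 100,000,000 vs. 150,000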
PROPOSED SYSTEM :
• Several ingenious strategies have been proposed to deal with the data sparsity problem in short texts. One strategy is to aggregate a subset of short texts into a longer pseudo-document.
• On two real-world short text collections in two languages, we evaluate the proposed GPU-DMM against several state-of-the-art alternatives for short texts.
• We conduct extensive experiments to evaluate the proposed GPU-DMM against the state-of-the-art alternatives.
• DMM is based on the assumption made in the mixture-of-unigrams model proposed by Nigam et al., i.e., each document is sampled from a single latent topic (a minimal sampling sketch follows this list).
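The following is a minimal sketch, under the mixture-of-unigrams assumption named above, of DMM-style Gibbs sampling: every short text is generated by exactly one latent topic, so the sampler only resamples a single topic label per document. Hyperparameter names (alpha, beta) are illustrative, and repeated-word corrections in the conditional are omitted for brevity.

import numpy as np

def dmm_gibbs(docs, V, K=20, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = rng.integers(0, K, size=D)                  # one topic per short text
    m_k = np.zeros(K)                               # documents per topic
    n_kw = np.zeros((K, V)); n_k = np.zeros(K)      # word counts per topic
    for d, doc in enumerate(docs):
        m_k[z[d]] += 1
        for w in doc:
            n_kw[z[d], w] += 1; n_k[z[d]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]
            m_k[k] -= 1
            for w in doc:
                n_kw[k, w] -= 1; n_k[k] -= 1
            # p(z_d = k | rest) ~ (m_k + alpha) * prod_w (n_kw + beta) / (n_k + V*beta)
            log_p = np.log(m_k + alpha)
            for w in doc:
                log_p += np.log(n_kw[:, w] + beta) - np.log(n_k + V * beta)
            probs = np.exp(log_p - log_p.max()); probs /= probs.sum()
            k = rng.choice(K, p=probs)
            z[d] = k
            m_k[k] += 1
            for w in doc:
                n_kw[k, w] += 1; n_k[k] += 1
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    return z, phi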
ADVANTAGE :
• The superior performance of our methods compared to LDA is consistent with our understanding that learning topics from regular-sized pseudo-documents helps guarantee topic quality.
• We conduct this experiment to compare how the latent semantic representations of documents learned by different topic models enhance classification performance when training data is scarce (a minimal setup is sketched after this list).
• By modeling the topic distributions of latent pseudo-documents rather than short texts, PTM is expected to achieve excellent performance in both accuracy and efficiency.
• This demonstrates the outstanding performance of our methods against baselines in learning semantic representations of short texts.
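Below is a minimal sketch of the classification experiment referenced above, under an assumed setup rather than the paper's exact protocol: each short text is represented by its inferred topic proportions, and a linear classifier is trained on only a small labelled fraction of the corpus.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def classify_with_topic_features(doc_topic, labels, train_fraction=0.05, seed=0):
    """doc_topic: (D, K) matrix of per-document topic proportions from any topic model."""
    X_train, X_test, y_train, y_test = train_test_split(
        doc_topic, labels, train_size=train_fraction, random_state=seed, stratify=labels)
    clf = LinearSVC().fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))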