FSPRM: A Feature Subsequence Based Probability Representation Model for Chinese Word Embedding

      

ABSTRACT :

Chinese word embedding models capture Chinese semantics from the characters that make up Chinese words and from the internal features of those characters, such as radical, component, stroke, structure and pinyin. However, some of these features overlap, and most methods do not consider how the features relate to one another. In addition, these models represent words as point vectors, which cannot adequately capture the different semantic aspects of a Chinese word. In this paper, we propose a Feature Subsequence based Probability Representation Model (FSPRM) for learning Chinese word embeddings. We first integrate the morphological and phonetic features (stroke, structure and pinyin) of Chinese characters and learn their relevance by designing a feature subsequence that captures relatively comprehensive semantics of Chinese words. We then propose a feature probability distribution, built on the three internal features and a probability representation, that captures the different aspects of a word's meaning; its mean is estimated as the sum of the feature subsequences. Because Chinese words with similar features may have similar semantics, we map Chinese words to feature probability distributions and design a similarity-based objective that predicts the contextual words of a target word in order to learn their semantics. Extensive experiments on word analogy, word similarity, text classification and named entity recognition tasks demonstrate that the proposed method outperforms most state-of-the-art approaches.
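The idea of estimating the mean as a sum over feature subsequences can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the stroke, structure and pinyin features are encoded or which subsequence lengths are used, so the token format, the n-gram range (1 to 3), the vector size and the random initialisation below are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Subseq = Tuple[str, ...]


def feature_subsequences(features: List[str], min_n: int = 1, max_n: int = 3) -> List[Subseq]:
    """Return every contiguous run of min_n..max_n feature tokens."""
    subseqs: List[Subseq] = []
    for n in range(min_n, max_n + 1):
        for start in range(len(features) - n + 1):
            subseqs.append(tuple(features[start:start + n]))
    return subseqs


def distribution_mean(features: List[str], table: Dict[Subseq, List[float]],
                      dim: int = 4, seed: int = 0) -> List[float]:
    """Estimate the mean of a word's distribution as the sum of its subsequence vectors."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for subseq in feature_subsequences(features):
        if subseq not in table:
            # Assumed initialisation: a small random vector for each unseen subsequence.
            table[subseq] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        mean = [m + v for m, v in zip(mean, table[subseq])]
    return mean


# Made-up feature tokens for one word: stroke ids, a structure tag, a pinyin syllable.
word_features = ["stroke:1", "stroke:3", "struct:left-right", "pinyin:hao"]
table: Dict[Subseq, List[float]] = {}
mean = distribution_mean(word_features, table)
print(len(feature_subsequences(word_features)), len(mean))
```

Words that share many feature tokens share many subsequence vectors, so their means end up close together, which is the intuition behind the statement that words with similar features may have similar semantics. The variance of the distribution and the similarity-based objective are not specified in the abstract and are left out of the sketch.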

EXISTING SYSTEM :

• To better deal with these scenarios, we also train two embeddings to represent character-independent segmentation actions, ALL-s and ALL-c, and use them to average with or substitute embeddings of infrequent or unknown characters, which are either insufficiently trained or nonexistent.
• However, in general, is more likely to start a new word instead of joining the existing one as in this example. Given such conflicting evidence, models can rarely find optimal feature weights, if they exist at all.
• Nevertheless, many feature weights in such models are inevitably poorly estimated because the number of parameters is so large with respect to the limited amount of training data.
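The fallback to ALL-s and ALL-c for rare or unseen characters might look like the sketch below. The key format (`character-action`), the frequency threshold `min_count` and the toy vectors are assumptions made for illustration; the source only states that the generic embeddings are averaged with, or substituted for, the character-specific ones.

```python
from typing import Dict, List


def lookup(char: str, action: str, table: Dict[str, List[float]],
           counts: Dict[str, int], min_count: int = 5) -> List[float]:
    """Return an embedding for (char, action), backing off to ALL-<action>."""
    fallback = table["ALL-" + action]
    key = char + "-" + action
    if key not in table:
        # Unknown character: substitute the character-independent embedding.
        return list(fallback)
    if counts.get(key, 0) < min_count:
        # Infrequent character: average its own vector with the generic one.
        return [(own + gen) / 2 for own, gen in zip(table[key], fallback)]
    return list(table[key])


# Toy table with 2-dimensional vectors; "s" and "c" name the two segmentation actions.
table = {
    "ALL-s": [1.0, 1.0],
    "ALL-c": [0.0, 0.0],
    "的-s": [3.0, 5.0],
    "龘-s": [2.0, 4.0],
}
counts = {"的-s": 100, "龘-s": 1}
print(lookup("的", "s", table, counts))
print(lookup("龘", "s", table, counts))
print(lookup("犇", "c", table, counts))
```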

DISADVANTAGE :

• It is widely accepted that Chinese Word Segmentation (CWS) can be resolved as a character-based tagging problem (Xue et al., 2003).
• To reduce the problem of error propagation and improve the low-level tasks by incorporating the knowledge from the high-level tasks, many successful joint methods have been proposed to simultaneously solve related tasks, which can be categorized into three types.
• We think this is caused by the pipeline model, which is more sensitive to word segmentation errors and suffers more from the OOV problem, as depicted.
• Thus, joint CWS and POS tagging can be regarded as a sequence labeling problem.
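Treating segmentation as character-based tagging can be illustrated with a small round trip between words and per-character tags. The B/M/E/S tag set (begin, middle, end, single) used here is a common convention and an assumption of this sketch; the text above does not name a tag set.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Label each character by its position inside its word."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation: a word closes at every E or S tag."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters that were never closed by E or S.
        words.append(current)
    return words


segmented = ["我", "喜欢", "自然语言"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

Joint CWS and POS tagging fits the same scheme if each position tag is extended with a part-of-speech label, so that a single sequence labeler predicts both.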

PROPOSED SYSTEM :

• To the best of our knowledge, this is the first graph-based method to integrate CWS and dependency parsing both in the training phase and the decoding phase. The proposed model is very concise and easily implemented.
• Nevertheless, by using our proposed model, we can exploit BERT to implement CWS and dependency parsing jointly.
• They also proposed an effective POS tag pruning method that could greatly improve the decoding efficiency.
• When decoding, we first use the proposed model to predict the character-level labeled dependency tree, and then recover the word segmentation and word-level dependency tree based on the predicted character-level arc labels.
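The recovery step in the last point can be sketched as below. The encoding is an assumption made for illustration: every non-final character of a word attaches to the next character with a special intra-word label (called `app` here), and the final character of each word carries the word-level arc. The label names, the 0-based head indices and the use of -1 for the root are hypothetical choices, not details confirmed by the text.

```python
from typing import List, Tuple


def recover_word_tree(chars: List[str], heads: List[int], labels: List[str],
                      intra: str = "app") -> List[Tuple[str, int, str]]:
    """Turn a character-level labeled tree into (word, head_word_index, label) triples."""
    words: List[Tuple[str, int, str]] = []  # head is still a character index here
    char_to_word: List[int] = []
    current = ""
    for i, ch in enumerate(chars):
        current += ch
        char_to_word.append(len(words))
        is_last = i == len(chars) - 1
        inside_word = labels[i] == intra and heads[i] == i + 1
        if is_last or not inside_word:
            # This character ends a word and carries the word-level arc.
            words.append((current, heads[i], labels[i]))
            current = ""
    # Map each head from a character index to the index of the word containing it.
    return [(text, char_to_word[head] if head >= 0 else -1, label)
            for text, head, label in words]


chars = list("我喜欢猫")
heads = [2, 2, -1, 2]            # made-up character-level heads, -1 marks the root
labels = ["nsubj", "app", "root", "dobj"]
for word in recover_word_tree(chars, heads, labels):
    print(word)
```

Because segmentation and syntax are read off the same predicted tree, no separate segmenter runs before the parser, which is how a joint model avoids passing segmentation errors down a pipeline.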

ADVANTAGE :

• In the neural-CRF model, the word (or character) embeddings are treated as input features, and the performance of downstream applications depends heavily on the quality of the word (or character) representation.
• The linguistic features of English have been studied and used in the word embedding learning procedure.
• Linear chain conditional random field (CRF) (Lafferty et al., 2001) is a widely used algorithm for Chinese word segmentation.
• In view of polysemy, we divide characters into different clusters according to their most frequently-used meanings.
• In this part, we apply character embedding as features for Chinese word segmentation using neural CRF. We conduct experiments on the widely-used Penn Chinese Treebank 5 (CTB5) and CTB7.
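Decoding with a linear-chain CRF can be sketched with the Viterbi algorithm. This covers decoding only, not training. The emission scores below are hand-written placeholders standing in for what a network over character embeddings would produce, and the B/M/E/S tag set and the hard penalty for invalid transitions are assumptions of the sketch rather than details from the text.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    best = [{t: emissions[0][t] for t in tags}]
    backpointers: List[Dict[str, str]] = []
    for emission in emissions[1:]:
        scores: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emission[cur]
            pointer[cur] = prev
        best.append(scores)
        backpointers.append(pointer)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


TAGS = ["B", "M", "E", "S"]
VALID = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
# Invalid tag bigrams get a large penalty instead of a learned score.
TRANSITIONS = {(p, c): (0.0 if (p, c) in VALID else -1e9) for p in TAGS for c in TAGS}

emissions = [
    {"B": 0.1, "M": 0.0, "E": 0.0, "S": 2.0},
    {"B": 2.0, "M": 0.0, "E": 0.1, "S": 1.0},
    {"B": 0.0, "M": 0.0, "E": 1.0, "S": 1.5},
]
print(viterbi(emissions, TRANSITIONS, TAGS))
```

Taken position by position, the last character would be tagged S, but S cannot follow B, so the decoder chooses E instead. This is the benefit of scoring whole sequences rather than individual characters.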
