Rumors dataset filtering suspended account

      

ABSTRACT :

A large number of Twitter accounts are suspended. Over a five-year period, about 14% of accounts were terminated for reasons not specified explicitly by the service provider. We collected about 120,000 suspended users, along with their tweets and social relations. This thesis studies these suspended users and compares them with normal users in terms of their tweets. We train classifiers to automatically predict whether a user will be suspended. Three different kinds of features are used. In our first approach, we experimented with Naive Bayes methods, including Bernoulli (BNB) and multinomial (MNB) variants, combined with various feature selection mechanisms (mutual information, chi-square, and pointwise mutual information), and achieved F1 = 78%. To reduce the high dimensionality, our second approach uses word2vec and doc2vec to represent each user with a short, fixed-length vector; an SVM with an RBF kernel achieved F1 = 73%, and random forest worked best on this approach with F1 = 74%.
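The three feature selection scores named above can all be computed from a 2x2 table of counts for one term and one class. The following is a minimal sketch using the standard textbook formulas, not the thesis's own implementation; the function names and the count layout (n11 = suspended users with the term, n10 = normal users with the term, n01 = suspended users without it, n00 = normal users without it) are assumptions made for illustration.

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # a row or column is empty, so the term carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def pmi(n11, n10, n01, n00):
    """Pointwise mutual information between 'term present' and 'in class'."""
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")  # the term never occurs in the class
    return math.log2(n11 * n / ((n11 + n10) * (n11 + n01)))

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information summed over all four cells of the table."""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, in class
        (n10, n11 + n10, n10 + n00),  # term present, not in class
        (n01, n01 + n00, n11 + n01),  # term absent, in class
        (n00, n01 + n00, n10 + n00),  # term absent, not in class
    ]
    mi = 0.0
    for count, term_total, class_total in cells:
        if count > 0:  # empty cells contribute nothing to the sum
            mi += (count / n) * math.log2(n * count / (term_total * class_total))
    return mi
```

A term that is spread evenly across both classes scores zero on all three measures, while a term concentrated among suspended users scores higher; ranking terms by any one score and keeping the top k is the usual way such scores are turned into a feature set.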

EXISTING SYSTEM :

• At the heart of the Twitter spam craft are thousands of fraudulent accounts created for the explicit purpose of soliciting products. 77% of these accounts are banned within a day of their first post, and 89% acquire fewer than 10 followers at the height of their existence.
• A second form of monetization we identify is the use of syndicated ads from existing ad networks.
• The creation times of the campaigns’ accounts overlap with their activation and suspension times within the single day of the campaign’s existence.
• Existing blacklists are too slow at identifying threats appearing on social networks, and are often inaccurate with respect to both false positives and false negatives.

DISADVANTAGE :

• Our work tried to use the full text of tweets and a large dataset of suspended users to avoid these problems.
• When converting word vectors into a tweet vector, we may find that a word is missing from the vocabulary of the pre-trained model.
• When using an n-gram language model, the main problem we face is the huge size of the feature set, which makes training and testing very time-consuming.
• Feature selection methods are more likely to select features from suspended users than from non-suspended users; to address this, we used token frequency instead of the number of users.
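The out-of-vocabulary problem in the second point can be illustrated with a small sketch. The text does not say how missing words were handled, so the approach below (skip unknown tokens and average the rest) is only one common workaround, shown as an assumption; `tweet_vector`, `embeddings`, and `dim` are hypothetical names, and the embedding table is a plain dictionary standing in for a pre-trained model.

```python
def tweet_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # out-of-vocabulary word: no pre-trained vector available
        for i, value in enumerate(vec):
            total[i] += value
        found += 1
    if found == 0:
        return total  # every token was unknown, so fall back to the zero vector
    return [value / found for value in total]

# Toy two-dimensional embedding table (made-up values for illustration only).
toy_embeddings = {"free": [1.0, 0.0], "follow": [0.0, 1.0]}
vector = tweet_vector(["free", "follow", "xyzzy"], toy_embeddings, dim=2)
```

Here "xyzzy" has no entry, so the result is the mean of the two known vectors. Skipping unknown words keeps the vector length fixed, at the cost of ignoring whatever those words would have contributed.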

PROPOSED SYSTEM :

• This infrastructure includes automatically generated accounts created for the explicit purpose of soliciting spam; the emergence of spam-as-a-service programs that connect Twitter account controllers to marketers selling products; and finally the techniques required to maintain large-scale spam campaigns despite Twitter’s counter-efforts.
• While 11% of spam accounts attempt to befriend users, either for the purpose of acquiring followers or for obtaining the privilege to direct messages, it is clear that legitimate Twitter users rarely respond in kind.
• In contrast, we find a majority of suspended accounts in our dataset were fraudulent and created for the explicit purpose of spamming.

ADVANTAGE :

• The authors claimed that using RIPPER in two steps achieved the best performance among the combinations of classifiers.
• In this approach, all terms are potential features, although we still need to select among them for efficiency and performance reasons.
• All of these processes were implemented in C++ so that we can manually control memory and achieve better performance.
• To evaluate the performance of each feature selection method, we varied the number of selected features used to filter the dataset and then ran 10-fold cross-validation using a Multinomial Naive Bayes classifier.
• Machine learning approaches have already been widely used to detect spam email.
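The evaluation step described above (k-fold cross-validation with a Multinomial Naive Bayes classifier) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the C++ implementation the text mentions: the function names, the add-one smoothing, the label strings, the interleaved fold assignment, and the pooling of counts across folds before computing F1 are all choices made here for brevity, and the feature selection step is left out.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha smoothing on token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c, n_docs in class_docs.items():
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        model["prior"][c] = math.log(n_docs / len(docs))
        model["loglik"][c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for c, prior in model["prior"].items():
        loglik = model["loglik"][c]
        scores[c] = prior + sum(loglik[t] for t in doc if t in model["vocab"])
    return max(scores, key=scores.get)

def k_fold_f1(docs, labels, k=10, positive="suspended"):
    """Run k-fold cross-validation and return F1 pooled over all folds."""
    tp = fp = fn = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))  # every k-th item forms a fold
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_mnb(train_docs, train_labels)
        for i in test_idx:
            pred = predict_mnb(model, docs[i])
            if pred == positive and labels[i] == positive:
                tp += 1
            elif pred == positive:
                fp += 1
            elif labels[i] == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

To compare feature selection methods in the way the text describes, each document would first be reduced to the tokens kept by a given method at a given feature-set size, and `k_fold_f1` would then be called once per method and size.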
