WELFake Word Embedding Over Linguistic Features for Fake News Detection
ABSTRACT :
Social media is a popular medium for the dissemination of real-time news all over the world. Easy and quick information proliferation is one of the reasons for its popularity. An extensive number of users of different age groups, genders, and societal beliefs are engaged on social media websites. Despite these favorable aspects, a significant disadvantage comes in the form of fake news, as people usually read and share information without caring about its genuineness. Therefore, it is imperative to research methods for the authentication of news. To address this issue, this article proposes a two-phase benchmark model named WELFake, based on word embedding (WE) over linguistic features, for fake news detection using machine learning classification. The first phase preprocesses the data set and validates the veracity of news content by using linguistic features. The second phase merges the linguistic feature sets with WE and applies voting classification. To validate this approach, the article also carefully designs a novel WELFake data set of approximately 72,000 articles, which incorporates several existing data sets to generate an unbiased classification output. Experimental results show that the WELFake model categorizes news as real or fake with 96.73% accuracy, improving overall accuracy by 1.31% compared to bidirectional encoder representations from transformers (BERT) and by 4.25% compared to convolutional neural network (CNN) models. Our frequency-based model, which focuses on analyzing writing patterns, outperforms prediction-based related works implemented using the Word2vec WE method by up to 1.73%.
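The two-phase idea described above can be illustrated with a minimal sketch: phase 1 extracts simple linguistic features from the text, and phase 2 concatenates them with frequency-based word vectors and applies voting classification. The toy articles, the particular linguistic cues, and the classifier choices below are illustrative assumptions, not the exact WELFake configuration.

```python
# Sketch of a two-phase voting pipeline in the spirit of WELFake.
# Phase 1: linguistic features; Phase 2: frequency-based text vectors
# merged with them, then hard-voting classification.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

texts = [
    "Official report confirms the economic figures released today.",
    "SHOCKING!!! You will not believe this miracle cure doctors hide!!!",
    "The city council approved the new transit budget on Tuesday.",
    "Click here!!! Celebrity secret exposed, share before it is deleted!!!",
]
labels = np.array([0, 1, 0, 1])  # 0 = real, 1 = fake (toy labels)

def linguistic_features(doc):
    """Toy linguistic cues: exclamation count and uppercase-word ratio."""
    words = doc.split()
    upper = sum(w.isupper() and len(w) > 1 for w in words)
    return [doc.count("!"), upper / max(len(words), 1)]

# Merge frequency-based word vectors with the linguistic feature set.
tfidf = TfidfVectorizer().fit(texts)
X = np.hstack([tfidf.transform(texts).toarray(),
               np.array([linguistic_features(t) for t in texts])])

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard",
).fit(X, labels)

pred = voter.predict(X)
print(pred.tolist())
```

In the full model the voting ensemble would combine more classifiers and the feature set would cover many more linguistic cues; this sketch only shows how the two phases feed a single merged feature matrix.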
EXISTING SYSTEM :
• In this paper, we performed a series of experiments in which bidirectional recurrent neural network classification models were trained on interpretable features derived from multidisciplinary, integrated approaches to language.
• We apply our approach to two benchmark data sets and demonstrate that it is promising, as it achieves results similar to the best-performing black-box models reported in the literature.
• In a second step, we report on ablation experiments geared toward assessing the relative importance of the human-interpretable features in distinguishing fake news from real news.
DISADVANTAGE :
• The decision tree is a supervised learning algorithm that classifies data for both categorical and continuous dependent variables. This classifier uses a tree structure to solve a problem by splitting the complete data set into homogeneous subsets.
• In this tree structure, the internal nodes, branches, and leaf nodes represent the data set attributes, the decision rules, and the outcomes, respectively.
• The support vector machine is a supervised learning algorithm that works for both classification and regression problems.
• The algorithm finds the best separating line between the sets and predicts the correct set for new data values.
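The two classifiers described in the bullets above can be sketched on synthetic data: a decision tree that splits the data via rule-based internal nodes, and a linear SVM that finds the best separating line between the two classes. The synthetic clusters and parameter choices here are assumptions for illustration only.

```python
# Sketch: decision tree vs. linear SVM on two well-separated toy clusters.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_real = rng.normal(loc=0.0, scale=0.5, size=(20, 2))  # toy "real" class
X_fake = rng.normal(loc=2.0, scale=0.5, size=(20, 2))  # toy "fake" class
X = np.vstack([X_real, X_fake])
y = np.array([0] * 20 + [1] * 20)

# Decision tree: internal nodes encode decision rules over the features.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Linear SVM: finds the best separating line between the two sets.
svm = LinearSVC().fit(X, y)

print(tree.score(X, y), svm.score(X, y))
```

On such clearly separated clusters both models fit the data easily; the point is the contrast between rule-based splits and a single separating hyperplane.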
PROPOSED SYSTEM :
• We provide a concise overview of recent approaches to fake news detection that employ machine learning and deep learning techniques, focusing in particular on the language-based approaches most pertinent to the purposes of this paper.
• Fake news detection is most often formulated as a binary classification task. However, categorizing all news into two classes (fake vs. real) is not the only conceivable formulation, since there are cases where the news is partially real and partially fake.
• A common practice is to add more classes distinguishing between several degrees of truthfulness and thus formulating fake news detection as a multi-class classification task. As will become evident later in this paper, we apply our approach to both scenarios.
ADVANTAGE :
• We evaluated the performance of each ML model on different training and testing data distributions, as explained, and found that a 70%–30% data distribution gives better accuracy for all six ML methods.
• BERT is a pretrained model that works well with labeled data, while its performance is compromised on a generalized data set where the testing data are independent of the training data.
• We use different training and testing data sets to analyze the generalization performance of the WELFake model. For this purpose, we followed an adversarial approach that splits the WELFake data set into its four constituent subsets (i.e., BuzzFeed, Reuters, McIntire, and Kaggle).
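The 70%–30% evaluation mentioned in the first bullet can be sketched as follows. The split ratio mirrors the text; the synthetic data set and the particular classifiers are illustrative placeholders for the six ML methods compared in the paper.

```python
# Sketch of a 70%-30% train/test evaluation on a toy data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=42)

# 70% training, 30% testing, as in the evaluation described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42)

scores = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("tree", DecisionTreeClassifier(random_state=42))]:
    scores[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```

In practice one would repeat this for every model and for several split ratios before concluding, as the paper does, that 70%–30% works best across the board.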