SENTIMENT ANALYSIS USING A CODE-MIXED DATASET OF TAMIL-ENGLISH

Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. The dataset is annotated for sentiment analysis and offensive language identification and contains more than 60,000 YouTube comments: around 44,000 in Tamil-English, around 7,000 in Kannada-English, and around 20,000 in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. Because it comprises user-generated content from a multilingual country, the dataset contains all types of code-mixing phenomena. We also present baseline experiments that establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub and Zenodo.

Existing System

• Word2Vec exists in two forms, the continuous Bag-of-Words model (cBoW) and the Skip-Gram model.
• Work on Named Entity Recognition in Tweets has shown that existing named entity extraction systems perform poorly on such text.
• Other work takes a low-resource perspective, exploring data augmentation strategies and using existing multilingual pretrained models to cope with data scarcity in low-resource scenarios.
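The difference between the two forms mentioned above can be illustrated by how each builds its training examples from a token window. The following is a minimal sketch, not part of the original system: the function names, the window size, and the toy Tamil-English comment are assumptions made for illustration only.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs: Skip-Gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, center) pairs: cBoW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, center))
    return pairs


if __name__ == "__main__":
    # Hypothetical code-mixed comment, invented for this example.
    comment = ["padam", "romba", "super", "movie"]
    print(skipgram_pairs(comment, window=1))
    print(cbow_pairs(comment, window=1))
```

Skip-Gram yields one example per (center, context) pair, while cBoW yields one example per position with the whole context grouped together. This sketch covers only example construction and does not train any embeddings.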

Disadvantages

• To support researchers working on these problems, shared tasks on aggression identification in social media and on offensive language identification have been conducted, providing the necessary datasets.
• Difficulties include comparisons of a movie with movies from the same or other industries, and opinions on different aspects of a movie expressed in the same sentence.
• On the downside, these methods are prone to misclassification when there are many classes and comparatively few training examples.
• We expect this resource to enable researchers to address new and exciting problems in code-mixed research.
• The lack of annotated code-mixed data for a low-resourced language like Tamil adds further difficulty to this problem.

Proposed System

We propose a novel multilingual sentiment analysis framework. In the proposed framework, no manually labelled corpus is needed and all extracted information is domain-dependent. The contributions of this study can be summarized as follows:
• We propose a statistical method for opinion lexicon extraction based on a few seed words, which can be transferred to almost any language and does not need to refer to dictionaries of synonyms and antonyms.
• On the basis of the extracted opinion lexicon, we propose a key sentence extraction method for capturing the overall opinion of a review, which addresses the problem of conflicting sentiments.
• We propose a Self-Supervised Learning (SSL) method for sentiment classification, which combines unsupervised and supervised techniques by making use of the extracted opinion lexicon and key sentences.
• Extensive experiments on multilingual datasets in different domains demonstrate the effectiveness of the proposed methods.
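The pipeline described above (seed words, then an opinion lexicon, then key sentences, then labels without manual annotation) can be sketched roughly as follows. The text does not specify the statistical method, so this example substitutes a simple smoothed co-occurrence score in the style of PMI-based semantic orientation. All function names, the `min_count` threshold, and the scoring rule are assumptions made for illustration, not the authors' actual method.

```python
import math
from collections import Counter
from typing import Dict, List, Set

Sentence = List[str]


def extract_lexicon(sentences: List[Sentence], pos_seeds: Set[str],
                    neg_seeds: Set[str], min_count: int = 2) -> Dict[str, float]:
    """Score words by how much more often they co-occur with positive than negative seeds."""
    word_df: Counter = Counter()   # number of sentences containing each word
    co_pos: Counter = Counter()    # ... that also contain a positive seed
    co_neg: Counter = Counter()    # ... that also contain a negative seed
    n_pos = n_neg = 0
    for sent in sentences:
        words = set(sent)
        has_pos = bool(words & pos_seeds)
        has_neg = bool(words & neg_seeds)
        n_pos += has_pos
        n_neg += has_neg
        for w in words:
            word_df[w] += 1
            co_pos[w] += has_pos
            co_neg[w] += has_neg
    # Seeds get fixed polarity; this is an assumed convention.
    lexicon = {w: 1.0 for w in pos_seeds}
    lexicon.update({w: -1.0 for w in neg_seeds})
    for w, df in word_df.items():
        if df < min_count or w in lexicon:
            continue
        # Difference of add-one-smoothed log co-occurrence rates.
        lexicon[w] = (math.log((co_pos[w] + 1) / (n_pos + 1))
                      - math.log((co_neg[w] + 1) / (n_neg + 1)))
    return lexicon


def sentence_score(sent: Sentence, lexicon: Dict[str, float]) -> float:
    """Sum the lexicon scores of the words in one sentence."""
    return sum(lexicon.get(w, 0.0) for w in sent)


def key_sentence(review: List[Sentence], lexicon: Dict[str, float]) -> Sentence:
    """Pick the sentence carrying the strongest opinion (largest absolute score)."""
    return max(review, key=lambda s: abs(sentence_score(s, lexicon)))


def pseudo_label(review: List[Sentence], lexicon: Dict[str, float]) -> str:
    """Label a review from its key sentence, without any manual annotation."""
    score = sentence_score(key_sentence(review, lexicon), lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"
```

In a self-supervised setup of this kind, the pseudo-labels produced by `pseudo_label` would serve as training data for a supervised classifier. That second stage is omitted here to keep the sketch dependency-free.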

Advantages

• A cost-efficient framework for multilingual sentiment analysis.
• Improved performance of sentiment classification.
• The concept of key sentences is reasonable, and key sentences are useful for dealing with conflicting sentiments.
• The key sentence extraction method is effective and language-independent.
