
Enriching datasets for sentiment analysis in tweets with instance selection
Author(s) - Eliseu Guimarães, Daniela Vianna, Aline Paes, Alexandre Plastino
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/kdmile.2021.17463
Subject(s) - computer science, leverage (statistics), popularity, sentiment analysis, classifier (uml), artificial intelligence, machine learning, field (mathematics), set (abstract data type), data mining, selection (genetic algorithm), information retrieval, psychology, social psychology, mathematics, pure mathematics, programming language
Sentiment analysis in tweets is a research field of great importance, mainly due to the popularity of Twitter. However, collecting and annotating tweets is expensive and time-consuming, so some domains have only a limited set of labeled data. A promising strategy to handle this issue is to leverage labeled domains rich in data, selecting instances from them to enrich target datasets. This paper proposes different strategies for selecting instances from a set of labeled source datasets in order to improve the performance of classifiers trained only with the target dataset. The strategies differ in the similarity metric used and in the number of instances selected. The results show that the size of the training set plays an essential role in the predictive capacity of the classifier. Furthermore, the results point out the importance of taking diversity criteria into account when selecting the instances.
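The general idea of similarity-based instance selection with a diversity criterion can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the similarity metrics or the diversity criteria used. The sketch assumes a whitespace-token bag-of-words representation, cosine similarity to the centroid of the target dataset, and a greedy redundancy penalty. The names `bow`, `cosine`, `select_instances`, and the `diversity` parameter are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag of words over lowercased whitespace tokens (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_instances(source, target, k, diversity=0.0):
    """Greedily pick up to k (text, label) pairs from `source`.

    Each candidate is scored as
        (1 - diversity) * similarity to the target centroid
        - diversity * highest similarity to an already selected instance.
    With diversity=0.0 this reduces to plain top-k by similarity.
    """
    centroid = Counter()
    for text, _ in target:
        centroid.update(bow(text))

    candidates = [(text, label, bow(text)) for text, label in source]
    selected = []
    while candidates and len(selected) < k:
        def score(c):
            relevance = cosine(c[2], centroid)
            redundancy = max((cosine(c[2], s[2]) for s in selected), default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(candidates, key=score)
        candidates.remove(best)
        selected.append(best)
    return [(text, label) for text, label, _ in selected]


# Toy data, invented purely for illustration.
target = [("great phone love it", "pos"), ("terrible phone battery", "neg")]
source = [
    ("love this great phone", "pos"),
    ("love this great phone", "pos"),  # exact duplicate
    ("battery is terrible", "neg"),
    ("nice weather today", "pos"),
]

plain = select_instances(source, target, k=2)
diverse = select_instances(source, target, k=2, diversity=0.5)
enriched = target + diverse  # training set for the target-domain classifier
```

On this toy data, similarity alone selects the duplicated tweet twice, while the diversity penalty replaces the second copy with the negative example. The parameter `k` corresponds to the number of selected instances, which the abstract identifies as a factor that strongly affects classifier performance.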