Optimizing unbalanced text classification tasks by integrating critical data mining and restricted rewriting techniques
Author(s) -
Zhou Jiale,
Li Hong,
Wang Chiyu,
Li Xinrong,
Shi Jiawen,
Pang Zhicheng
Publication year - 2020
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5952
Subject(s) - rewriting, computer science, oversampling, similarity (geometry), artificial intelligence, key (lock), natural language processing, task (project management), information retrieval, data mining, programming language, image (mathematics), computer network, computer security, bandwidth (computing), management, economics
Summary Oversampling is widely used to improve classification on unbalanced data. However, unlike structured data, text is built from words or characters, so instances oversampled in a numerical feature space can lose the word similarity they had in semantic space. One way to address this problem is to use text rewriting to generate artificial samples directly. Unfortunately, existing rewriting techniques often destroy the grammatical structure and logic of the original text. In this article, we refine and constrain several existing text rewriting methods and propose an effective algorithm that mines feature words from the different classes of text to support rewriting. In addition, by calculating the similarity between texts, we divide the data of each class into key data and non‐key data and design a different rewriting process for each. Experimental results on four unbalanced text classification tasks show that our method outperforms previous text rewriting methods, improving the model's classification accuracy by 1.7% to 2.9% and its AUC by 0.012 to 0.058. An ablation study further explores how individual variables and methods affect the results.
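The split into key and non‐key data described in the summary can be illustrated with a minimal sketch. The summary does not state which similarity measure, threshold, or decision rule the authors use, so everything below is an assumption made for illustration: the bag-of-words cosine similarity, the `threshold` value, the rule that a text with high mean similarity to the rest of its class counts as key, and the function names themselves.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercased whitespace tokens (an illustrative simplification)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_key_data(texts, threshold=0.3):
    """Split the texts of one class into (key, non_key).

    Assumption: a text is treated as "key" when its mean similarity to the
    other texts of the same class is at least `threshold`. Both the rule and
    the default value are hypothetical, not taken from the article.
    """
    vectors = [bag_of_words(t) for t in texts]
    key, non_key = [], []
    for i, text in enumerate(texts):
        sims = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        (key if mean_sim >= threshold else non_key).append(text)
    return key, non_key
```

Calling `split_key_data` on the texts of a single minority class returns two lists, which could then be passed to two different rewriting routines. In practice the token counts would likely be replaced by a stronger text representation, and the threshold would need tuning per dataset.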