Optimizing unbalanced text classification tasks by integrating critical data mining and restricted rewriting techniques



Author(s) - Zhou Jiale, Li Hong, Wang Chiyu, Li Xinrong, Shi Jiawen, Pang Zhicheng

Publication year - 2020

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.5952

Subject(s) - rewriting, computer science, oversampling, similarity (geometry), artificial intelligence, key (lock), natural language processing, task (project management), information retrieval, data mining, programming language, image (mathematics), computer network, computer security, bandwidth (computing), management, economics

Summary Oversampling is widely used to improve classification on unbalanced data. However, unlike structured data, the basic unit of text is the word or character, so instances oversampled in numerical feature space can lose word similarity in semantic space. Text rewriting addresses this problem by generating artificial samples directly as text. Unfortunately, existing rewriting techniques usually destroy the grammatical structure and logic of the original text. In this article, we improve and restrict several existing text rewriting methods, and propose an effective algorithm that mines feature words from the various types of text to support rewriting. We also compute the similarity between texts to divide each type of data into key data and non‐key data, and design a different rewriting process for each. Experimental results on four unbalanced text classification tasks show that our method outperforms previous text rewriting methods: it improves the classification accuracy of the model by 1.7% to 2.9% and increases the AUC by 0.012 to 0.058. Ablation experiments further explore the effects of individual variables and methods on the results.
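The summary names two steps without giving their details: mining feature words and using text similarity to separate key from non-key data. The sketch below is one possible reading, not the authors' algorithm. It assumes bag-of-words cosine similarity, a smoothed frequency-ratio score for feature words, and that "key" data are the texts of a class that lie close to texts of other classes. The function names, the scoring formula, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_feature_words(texts, labels, target, top_k=3):
    """Rank words by how much more frequent they are in `target` texts.

    Assumed scoring (not from the paper): smoothed ratio of the word's
    relative frequency inside the class to its frequency outside it.
    """
    inside, outside = Counter(), Counter()
    for text, label in zip(texts, labels):
        (inside if label == target else outside).update(tokenize(text))
    n_in = sum(inside.values()) or 1
    n_out = sum(outside.values())
    score = {
        w: (c / n_in) / ((outside[w] + 1) / (n_out + 1))
        for w, c in inside.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(score, key=lambda w: (-score[w], w))[:top_k]


def split_key_data(texts, labels, target, threshold=0.3):
    """Split `target` texts into (key, non_key) lists.

    Assumption: a text is "key" when its highest similarity to any text
    of another class reaches `threshold`, i.e. it sits near other classes.
    """
    vecs = [Counter(tokenize(t)) for t in texts]
    others = [v for v, lab in zip(vecs, labels) if lab != target]
    key, non_key = [], []
    for text, vec, label in zip(texts, vecs, labels):
        if label != target:
            continue
        nearest = max((cosine(vec, o) for o in others), default=0.0)
        (key if nearest >= threshold else non_key).append(text)
    return key, non_key


# Toy data: "complaint" is the minority class.
texts = [
    "refund my order now",
    "the delivery was late again",
    "great product fast delivery",
    "love the product great price",
    "the product was great value",
]
labels = ["complaint", "complaint", "praise", "praise", "praise"]

feature_words = mine_feature_words(texts, labels, "complaint", top_k=6)
key, non_key = split_key_data(texts, labels, "complaint")
```

On this toy data, "the delivery was late again" shares words with the praise texts and lands in the key group, while "refund my order now" shares none and lands in the non-key group. A rewriting stage could then treat the two groups differently, for example by changing fewer words in key texts, but that policy is also an assumption here.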
