Open Access
Text Generation with Content and Structure-Based Preprocessing in Imbalanced Data of Product Review
Author(s) -
Ana Alimatus Zaqiyah,
Diana Purwitasari,
Chastine Fatichah
Publication year - 2021
Publication title -
International Journal of Intelligent Engineering and Systems
Language(s) - English
Resource type - Journals
eISSN - 2185-310X
pISSN - 1882-708X
DOI - 10.22266/ijies2021.0228.48
Subject(s) - computer science , notice , statement (logic) , preprocessor , context (archaeology) , product (mathematics) , class (philosophy) , information retrieval , sentiment analysis , artificial intelligence , problem statement , natural language processing , linguistics , paleontology , philosophy , geometry , mathematics , management science , political science , law , economics , biology
Spam detection typically classifies product reviews as spam or non-spam. Spam reviews may contain fake reviews as well as non-review statements, which describe things unrelated to the product. Most publicly available spam reviews are labelled as fake reviews, while texts labelled non-spam because they are not fake reviews may still contain non-review statements. It is crucial to detect those non-review statements because they can mislead consumers. Non-review statements are rare, and finding them in large collections of long texts often requires manual labelling, which is time-consuming. Because of this rarity, the data are imbalanced: non-spam is the majority class, and spam consisting of non-review statements is the minority class. Augmenting the spam class with fake reviews is ineffective because fake reviews share content with non-spam texts, such as opinion words about product features. Thus, generating non-review statements is the preferable way to add spam texts. Text generation raises two issues: commonly used neural network-based methods require large amounts of training data, and existing pre-trained models produce texts whose context differs from that of non-review statements. The augmented texts should have similar content and context, with the context represented by the structure of the non-review statement. Therefore, we propose a text generation model with content- and structure-based preprocessing to produce non-review statements, which is expected to mitigate the data imbalance and improve spam detection in product reviews. Structure-based preprocessing identifies the feature structures of non-opinion words from part-of-speech tags. Those features represent the context of spam reviews in unlabeled texts. Then, content-based preprocessing uses selected topic modeling results of non-review statements from fake reviews.
Our experiments yielded an improvement of approximately 0.04 in the BLEU (Bilingual Evaluation Understudy) score, the metric used to evaluate the correspondence between the generated texts and the training texts. This value indicates that the generated texts are not identical to the training texts of non-review statements. However, adding those generated texts to the original spam texts gave better spam detection results, with an increase of more than 40% in the average recall score.
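To illustrate what the BLEU score measures when a generated text is compared with a training text, the following is a minimal sketch of a single-reference BLEU with clipped n-gram precision and a brevity penalty. It is not the authors' evaluation code: the function names `ngrams` and `bleu`, the choice of `max_n=2`, the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Single-reference BLEU with uniform weights (illustrative sketch).

    Returns 0.0 when any n-gram order has no overlap; no smoothing is applied.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example texts, tokenized on whitespace.
trained = "the box arrived late".split()
generated = "the box arrived today".split()
score = bleu(generated, trained)
```

A score of 1.0 means the generated text reproduces the reference n-grams exactly, while lower scores indicate that the generated text diverges from it, which is consistent with the abstract's remark that the generated texts are not identical to the training texts.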
