Text Generation with Content and Structure-Based Preprocessing in Imbalanced Data of Product Review

Ana Alimatus Zaqiyah, Diana Purwitasari, Chastine Fatichah

Open Access

Publication year - 2021

Publication title - International Journal of Intelligent Engineering and Systems

Language(s) - English

Resource type - Journals

eISSN - 2185-310X

pISSN - 1882-708X

DOI - 10.22266/ijies2021.0228.48

Subject(s) - computer science, notice, statement (logic), preprocessor, context (archaeology), product (mathematics), class (philosophy), information retrieval, sentiment analysis, artificial intelligence, problem statement, natural language processing, linguistics, paleontology, philosophy, geometry, mathematics, management science, political science, law, economics, biology

Spam detection typically categorizes product reviews as spam or non-spam. Spam reviews may contain fake reviews as well as non-review statements, which describe things unrelated to the product. Most publicly available spam reviews are labelled as fake reviews, while non-spam texts that are not fake reviews can still contain non-review statements. It is crucial to detect those non-review statements because they can mislead consumers. Non-review statements are rarely found, and because they often occur in large, long texts, they usually have to be labelled manually, which is time-consuming. This rarity creates an imbalance between non-spam as the majority class and spam consisting of non-review statements as the minority class. Augmenting the spam class with fake reviews is ineffective because fake reviews share content with non-spam texts, such as opinion words about product features. Generating non-review statements is therefore preferable for adding spam texts. Text generation has its own issues: commonly used neural network-based methods require large amounts of training data, and existing pre-trained models produce texts whose context differs from that of non-review statements. The augmented texts should have similar content and context, the latter represented by the structure of the non-review statement. We therefore propose a text generation model with content and structure-based preprocessing to produce non-review statements, which is expected to overcome the data imbalance and give better spam detection results in product reviews. Structure-based preprocessing identifies the feature structures of non-opinion words from part-of-speech tags. Those features represent the context of spam reviews in unlabeled texts. Content-based preprocessing then takes selected topic modeling results of non-review statements from fake reviews. Our experiments yielded an improvement of approximately 0.04 in the BLEU (Bilingual Evaluation Understudy) score, which evaluates the correspondence between generated and training texts. This value indicates that the generated texts are not identical to the training texts of non-review statements. However, combining those additional texts with the original spam texts gave better spam detection results, with the average recall score increasing by more than 40%.
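The abstract reports BLEU scores but does not say how they were computed. As a rough illustration of what the metric measures, the following is a minimal sketch of standard sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), assuming whitespace-tokenized text, uniform n-gram weights, and no smoothing. The function names `ngrams` and `bleu` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (illustrative only)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no overlap at this n-gram order
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example texts, not data from the paper.
generated = "the cat sat on the mat".split()
training = "the cat sat on a mat".split()
score = bleu(generated, [training], max_n=2)
```

Under this definition a score of 1.0 means the generated text reproduces a training text exactly, so lower scores are consistent with the abstract's remark that the generated texts are not identical to the training texts.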
