Open Access
An Effective Text Classification Model Based on Ensemble Strategy
Author(s) - Hong Zhu, Wei Jin, Yang Gao
Publication year - 2019
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1229/1/012058
Subject(s) - word2vec , computer science , artificial intelligence , feature (linguistics) , feature vector , support vector machine , representation (politics) , pattern recognition (psychology) , tf–idf , feature learning , convolutional neural network , bag of words model , classifier (uml) , machine learning , philosophy , linguistics , physics , embedding , quantum mechanics , politics , term (time) , political science , law
Automatic text classification is a classic problem in natural language processing. Research on it focuses mainly on two tasks: representing the features of text documents and designing efficient machine learning models. Although various approaches have been proposed for both, neither is fully solved. In this paper, we propose a novel method called LAC_DNN that performs text classification by combining diverse feature representation approaches and classifiers. More specifically, LAC_DNN first introduces a novel feature representation approach called LATW, which integrates the feature information extracted by four models: the LSI model, the TF-IDF weighted vector space model (TF-IDF_VSM), TF-IDF weighted word2vec (TF-IDF_word2vec), and average word2vec (Avg_word2vec). Second, it trains several classifiers on the features encoded by LATW: support vector machine, k-nearest neighbor, logistic regression, and convolutional neural networks. Finally, LAC_DNN integrates these classifiers into an ensemble predictor that leverages the complementary information of the feature representation methods and classifiers to predict the topic of text documents. LAC_DNN achieves superior performance, with accuracies of 97.44% and 97.43% on the Fudan and Netease news text datasets, respectively. Extensive experiments show that LAC_DNN is effective and useful for text classification.
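The two-stage idea in the abstract (fuse several feature representations, then combine several classifiers) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. It makes several assumptions: only the TF-IDF and LSI views are fused (the word2vec views are omitted to keep the example self-contained), the CNN is left out, the ensemble uses simple majority voting because the abstract does not state how the classifiers are combined, and the corpus, labels, and hyperparameters are hypothetical toy values.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; NOT the Fudan or Netease news data.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the goalkeeper saved a penalty in the match",
    "the central bank raised interest rates",
    "stock markets fell after the earnings report",
    "investors bought shares in the bank",
    "the company reported higher quarterly profit",
]
labels = ["sports"] * 4 + ["finance"] * 4

# Feature fusion (assumed here to be plain concatenation):
# TF-IDF vector space features side by side with LSI topic features.
# LSI is approximated as truncated SVD applied to the TF-IDF matrix.
features = FeatureUnion([
    ("tfidf_vsm", TfidfVectorizer()),
    ("lsi", make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=0),
    )),
])

# Ensemble of three of the classifier families named in the abstract.
# Majority ("hard") voting is an assumption, not the paper's stated rule.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)

model = Pipeline([("features", features), ("ensemble", ensemble)])
model.fit(docs, labels)

# Shape of the fused representation: vocabulary size plus 2 LSI dimensions.
fused = model.named_steps["features"].transform(docs)
print("fused feature matrix shape:", fused.shape)

new_docs = [
    "the striker scored in the football match",
    "the bank raised rates and shares fell",
]
pred = model.predict(new_docs)
print(list(pred))
```

Each classifier sees the same fused feature matrix, so the ensemble can only disagree through the classifiers' different decision rules. Training separate classifiers per feature view would be another way to read "complementary information", and the abstract alone does not settle which is meant.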