Improvement of automatic Chinese text classification by combining multiple features
Author(s) - Luo Xi, Ohyama Wataru, Wakabayashi Tetsushi, Kimura Fumitaka
Publication year - 2015
Publication title - IEEJ Transactions on Electrical and Electronic Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.254
H-Index - 30
eISSN - 1931-4981
pISSN - 1931-4973
DOI - 10.1002/tee.22049
Subject(s) - mcnemar's test, word (group theory), computer science, n gram, dimension (graph theory), character (mathematics), transformation (genetics), artificial intelligence, feature (linguistics), natural language processing, gram, pattern recognition (psychology), word length, speech recognition, statistics, mathematics, language model, linguistics, biochemistry, chemistry, geometry, philosophy, genetics, biology, pure mathematics, bacteria, gene
In this paper, we present an effective way of combining character-based (N-gram) and word-based approaches for Chinese text classification. Uni-gram and bi-gram features form the baseline model and are then combined with word features of length greater than or equal to 3. We also introduce a weight coefficient that gives higher weight to the word features. We further employ a serial approach based on feature transformation and dimension reduction techniques. The results of McNemar's test indicate that the proposed method significantly improves performance. © 2014 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
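The abstract does not give implementation details, so the following Python sketch is only one plausible reading of the feature combination and the significance test. It assumes that the uni-gram and bi-gram features are character N-grams, that word segmentation is done elsewhere (the `words` argument stands in for the output of a hypothetical segmenter), that the weight coefficient simply scales word-feature counts, and that McNemar's test is applied in its common continuity-corrected chi-square form. The names `char_ngrams`, `combined_features`, `word_weight`, and `mcnemar` are illustrative and do not come from the paper. The feature transformation and dimension reduction steps are not shown.

```python
from collections import Counter
from math import erfc, sqrt


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, words, word_weight=2.0):
    """Build one sparse feature vector from characters and words.

    text        -- raw (unsegmented) document string
    words       -- tokens from a word segmenter (assumed, not specified here)
    word_weight -- hypothetical coefficient that up-weights word features
    """
    feats = Counter()
    # Baseline: character uni-grams and bi-grams, counted with weight 1.
    for n in (1, 2):
        for gram in char_ngrams(text, n):
            feats["c:" + gram] += 1.0
    # Added features: words of length >= 3, scaled by the weight coefficient.
    for word in words:
        if len(word) >= 3:
            feats["w:" + word] += word_weight
    return feats


def mcnemar(b, c):
    """Continuity-corrected McNemar test on the discordant counts.

    b -- documents the baseline classifies correctly but the new model misses
    c -- documents the new model classifies correctly but the baseline misses
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    feats = combined_features("机器学习", ["机器学习"], word_weight=2.0)
    print(dict(feats))
    # The counts 5 and 25 are made-up illustration values, not reported results.
    print(mcnemar(5, 25))
```

Prefixing keys with `c:` and `w:` keeps a word such as a two-character term from colliding with the identical character bi-gram, so the weight coefficient affects only the word features.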