Ranking and selecting terms for text categorization via SVM discriminant boundary
Author(s) - Kuo Tien-Fang, Yajima Yasutoshi
Publication year - 2010
Publication title - International Journal of Intelligent Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.291
H-Index - 87
eISSN - 1098-111X
pISSN - 0884-8173
DOI - 10.1002/int.20392
Subject(s) - support vector machine , computer science , ranking (information retrieval) , artificial intelligence , categorization , term (time) , feature (linguistics) , pattern recognition (psychology) , linear discriminant analysis , search engine indexing , word (group theory) , document classification , text categorization , ranking svm , boundary (topology) , margin (machine learning) , decision boundary , machine learning , data mining , mathematics , mathematical analysis , linguistics , philosophy , physics , geometry , quantum mechanics
The problem of natural language document categorization consists of classifying documents into predetermined categories based on their contents. Each distinct term, or word, in the documents is a feature for representing a document. In general, the number of terms may be extremely large, and dozens of redundant terms may be included, which may degrade classification performance. In this paper, a support vector machine (SVM)‐based feature ranking and selection method for text categorization is proposed. The contribution of each term to classification is calculated from the nonlinear discriminant boundary generated by the SVM. The results of experiments on several real‐world data sets show that the proposed method extracts a smaller number of important terms and achieves higher classification performance than existing feature selection methods based on latent semantic indexing and χ2 statistic values. © 2009 Wiley Periodicals, Inc.
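The abstract does not spell out the ranking criterion, so the following is only a minimal, simplified sketch of the general idea of SVM-based term ranking: train an SVM on a term-weighted document matrix and score each term by the magnitude of its weight in the decision function, keeping the top-ranked terms. It assumes a linear boundary (scikit-learn's LinearSVC) and a toy corpus; the paper itself derives term contributions from the nonlinear discriminant boundary, which this sketch does not reproduce.

```python
# Sketch only: rank terms by |w_j| from a linear SVM and keep the top-k.
# The paper's method uses the nonlinear SVM boundary; this is a simplified analogue.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (hypothetical data, two categories).
docs = [
    "stock market trading prices rise",       # finance
    "shares fall on the stock exchange",      # finance
    "team wins the football match",           # sports
    "player scores in the final game",        # sports
]
labels = [0, 0, 1, 1]

# Represent each document by weighted term features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Train a linear SVM; its weight vector defines the discriminant boundary.
svm = LinearSVC(C=1.0)
svm.fit(X, labels)

# Score each term by the absolute weight it receives in the decision function:
# larger |w_j| means the term moves documents further across the boundary.
scores = np.abs(svm.coef_).ravel()
terms = np.array(vectorizer.get_feature_names_out())

top_k = 5
ranking = np.argsort(scores)[::-1][:top_k]
print(list(zip(terms[ranking], scores[ranking].round(3))))
```

In a full pipeline, one would retrain the classifier on only the top-ranked terms and compare accuracy against selection baselines such as χ2 scoring, which is the kind of comparison the abstract reports.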