An empirical evaluation of text classification and feature selection methods
Author(s) - Muazzam Ahmed Siddiqui
Publication year - 2016
Publication title - Artificial Intelligence Research
Language(s) - English
Resource type - Journals
eISSN - 1927-6982
pISSN - 1927-6974
DOI - 10.5430/air.v5n2p70
Subject(s) - artificial intelligence , feature selection , weighting , support vector machine , computer science , pattern recognition (psychology) , machine learning , skew , text categorization , classifier (uml) , linear classifier , categorization , feature (linguistics) , benchmark (surveying) , philosophy , geography , telecommunications , medicine , linguistics , geodesy , radiology
An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented. More than 500 models were trained and tested using different combinations of corpora, term weighting schemes, numbers of features, feature selection methods and classifiers. The performance measures used were the micro-averaged F measure and classifier training time. The experiments used five benchmark corpora, three term weighting schemes, three feature selection methods and four classifiers. Results indicated only a slight performance improvement when using all the features compared with using only the 20% of features selected by Information Gain and Chi Square. More importantly, this performance improvement was not statistically significant. The Support Vector Machine with a linear kernel performed best on text categorization tasks, producing the highest F measures and low training times even in the presence of high class skew. We found a statistically significant difference between the performance of the Support Vector Machine and that of the other classifiers on text categorization problems.
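As an illustration of the micro-averaged F measure named in the abstract, the following is a minimal sketch, not the author's implementation. It assumes each document's labels are given as a set, pools true positives, false positives and false negatives over all documents and classes, and then computes a single F1 value. The function name `micro_f1` and the sample labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over documents.

    y_true and y_pred are sequences of label sets, one set per document
    (an assumption made for this sketch; single-label data can be passed
    as one-element sets).
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels predicted and correct
        fp += len(pred_labels - true_labels)   # labels predicted but wrong
        fn += len(true_labels - pred_labels)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four documents.
y_true = [{"earn"}, {"earn"}, {"acq"}, {"grain", "wheat"}]
y_pred = [{"earn"}, {"acq"}, {"acq"}, {"grain"}]
print(round(micro_f1(y_true, y_pred), 4))
```

Because the counts are pooled before the score is computed, frequent classes contribute more to the micro-averaged value than rare ones, which is worth keeping in mind when reading results on corpora with high class skew.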