Evaluation of Feature Selection Approaches for Urdu Text Categorization

Open Access

Author(s) - Tehseen Zia, Qaiser Abbas, Muhammad Pervez Akhtar

Publication year - 2015

Publication title - International Journal of Intelligent Systems and Applications

Language(s) - English

Resource type - Journals

eISSN - 2074-9058

pISSN - 2074-904X

DOI - 10.5815/ijisa.2015.06.03

Subject(s) - feature selection, C4.5 algorithm, computer science, support vector machine, artificial intelligence, feature (linguistics), pattern recognition (psychology), information gain ratio, decision tree, selection (genetic algorithm), k nearest neighbors algorithm, naive Bayes classifier, feature vector, machine learning, data mining, linguistics, philosophy

Efficient feature selection is an important phase in designing an effective text categorization system. Various feature selection methods have been proposed, each selecting a different feature set. It is therefore often essential to evaluate which method is more effective for a given task and what feature set size is an effective model selection choice. The aim of this paper is to answer these questions for the design of an Urdu text categorization system. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (i.e., J48). The study was conducted over two test collections: the EMILLE collection and a naive collection. We observed that three feature selection methods, i.e., information gain, Chi statistics, and symmetrical uncertainty, performed uniformly in most if not all cases. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others for naive Bayes and J48, information gain showed top performance for KNN and for SVM with polynomial and radial basis kernels. Overall, linear SVM combined with information gain, Chi statistics, or symmetrical uncertainty turned out to be the first choice among all combinations of classifiers and feature selection methods on the moderate-sized naive collection. On the other hand, naive Bayes with any of the feature selection methods showed its advantage on the small EMILLE corpus.
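
As a rough illustration of how three of the feature selection criteria named in the abstract score a term, the sketch below computes information gain, gain ratio, and symmetrical uncertainty from binary term presence and class labels using their standard entropy-based definitions. It is not the authors' implementation: the function names, the `rank_terms` helper, and the four-document toy corpus (with English placeholder tokens) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(labels, feature):
    """H(labels | feature) for two equal-length discrete sequences."""
    total = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(labels, feature):
    """IG = H(class) - H(class | feature)."""
    return entropy(labels) - conditional_entropy(labels, feature)


def gain_ratio(labels, feature):
    """IG normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(labels, feature) / split_info if split_info > 0 else 0.0


def symmetrical_uncertainty(labels, feature):
    """SU = 2 * IG / (H(class) + H(feature)), which lies in [0, 1]."""
    denom = entropy(labels) + entropy(feature)
    return 2.0 * information_gain(labels, feature) / denom if denom > 0 else 0.0


def rank_terms(docs, labels, score, k):
    """Hypothetical helper: top-k terms by `score`, ties broken alphabetically."""
    vocab = sorted({term for doc in docs for term in doc})
    scored = [(score(labels, [int(term in doc) for doc in docs]), term)
              for term in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Toy corpus (an assumption, not data from the paper): each document is a set of tokens.
docs = [
    {"match", "team", "score"},
    {"team", "win", "coach"},
    {"market", "bank", "rate"},
    {"bank", "budget", "rate"},
]
labels = ["sports", "sports", "economy", "economy"]

top_terms = rank_terms(docs, labels, information_gain, k=3)
```

Passing `gain_ratio` or `symmetrical_uncertainty` as the `score` argument ranks the same vocabulary under the other two criteria, and the `k` parameter plays the role of the feature set size that the abstract treats as a model selection choice.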
