Automated Arabic text classification with P‐Stemmer, machine learning, and a tailored news article taxonomy
Author(s) -
Kanan Tarek,
Fox Edward A.
Publication year - 2016
Publication title -
Journal of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.903
H-Index - 145
eISSN - 2330-1643
pISSN - 2330-1635
DOI - 10.1002/asi.23609
Subject(s) - computer science , taxonomy (biology) , arabic , artificial intelligence , software , rank (graph theory) , natural language processing , world wide web , information retrieval , linguistics , mathematics , philosophy , botany , biology , combinatorics , programming language
Arabic news articles in electronic collections are difficult to study. Browsing by category is rarely supported. Although helpful machine‐learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)‐funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles, which should be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that is suitable for the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic‐speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer called P‐Stemmer) and automatic classification methods (the best being binary Support Vector Machines classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10‐fold cross‐validation and the Wilcoxon signed‐rank test, we showed that our approach to stemming and classification is superior to state‐of‐the‐art techniques.
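The abstract does not spell out the rules of P‐Stemmer, so the following is only a minimal sketch of what Arabic light stemming generally looks like, not the authors' algorithm. It assumes a hypothetical prefix list (the definite article and common conjunction and preposition combinations that light stemmers typically remove) and an assumed minimum stem length; the names `PREFIXES`, `MIN_STEM_LEN`, `light_stem`, and `stem_tokens` are illustrative only.

```python
# Illustrative Arabic light stemmer. This is NOT the authors' P-Stemmer:
# the prefix list and length guard below are assumptions for demonstration.

# Longer prefixes come first so that "وال" is tried before "ال" or "و".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")

# Assumed guard: never strip a prefix if fewer than this many letters remain.
MIN_STEM_LEN = 3


def light_stem(word: str) -> str:
    """Strip at most one leading prefix, leaving at least MIN_STEM_LEN letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def stem_tokens(text: str) -> list:
    """Split on whitespace and light-stem every token."""
    return [light_stem(token) for token in text.split()]
```

For example, `light_stem("الكتاب")` returns `"كتاب"`, while a short word such as `"ولد"` is left unchanged because stripping the leading `"و"` would leave fewer than three letters. In a pipeline like the one the abstract describes, stemmed tokens of this kind would typically be turned into feature vectors before one binary classifier per taxonomy category is trained.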