Open Access
Topic-based Classification through Unigram Unmasking
Author(s) - Yaakov HaCohen-Kerner, Avi Rosenfeld, Asaf Sabag, Maor Tzidkani
Publication year - 2018
Publication title - Procedia Computer Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.334
H-Index - 76
ISSN - 1877-0509
DOI - 10.1016/j.procs.2018.07.210
Subject(s) - computer science, artificial intelligence, word (group theory), task (project management), natural language processing, search engine indexing, information retrieval, feature (linguistics), information extraction, machine learning, support vector machine, philosophy, linguistics, management, economics
Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction, a concept we call unigram unmasking. Previous text classification approaches have typically focused on a “bag-of-words” vector. We posit that some of the most frequent unigrams, which have the greatest weight within these vectors, are not only unnecessary for classification but can at times even hurt models’ accuracy. We present an approach in which a percentage of the most common unigrams is intentionally removed, thus “unmasking” the added value of less frequent unigrams. We present results from a topic-based classification task (hundreds of free online textbooks belonging to five domains: Career and Study Advice, Economics and Finance, IT Programming, Natural Sciences, Statistics and Mathematics) and show that unmasking was helpful across several machine learning models, with some models even benefiting from removing nearly 50% of the most frequent unigrams from the bag-of-words vectors.
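
The removal step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how text is tokenized, how frequency is measured, or how ties are broken, so those choices are assumptions here, and the names `tokenize`, `unmasked_vocabulary`, and `bag_of_words` are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: lowercase alphabetic tokens; the abstract does not
    # specify a tokenization scheme.
    return re.findall(r"[a-z']+", text.lower())


def unmasked_vocabulary(documents, drop_fraction):
    """Rank unigrams by corpus frequency and drop the top `drop_fraction`."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    # Sort by descending count, then alphabetically so ties are deterministic
    # (the tie-breaking rule is an assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


def bag_of_words(document, vocabulary):
    """Count vector of `document` over the remaining (unmasked) vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[w] for w in vocabulary]


# Toy corpus (made up for illustration, not data from the paper).
docs = [
    "the market and the bank set the rate",
    "the loop and the array hold the value",
]
vocab = unmasked_vocabulary(docs, 0.25)  # drop the top 25% of unigrams
vectors = [bag_of_words(d, vocab) for d in docs]
```

In this toy corpus the two most frequent unigrams are shared by both documents, so dropping them leaves only the words that distinguish one document from the other. The resulting vectors could then be passed to any standard classifier; the drop fraction would be tuned per model, as the abstract reports that the best value varied across models.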
