z-logo
open-access-imgOpen Access
Fast logistic regression for text categorization with variable-length n-grams
Author(s) -
Georgiana Ifrim,
Gökhan Bakır,
Gerhard Weikum
Publication year - 2008
Publication title -
citeseer x (the pennsylvania state university)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/1401890.1401936
Subject(s) - lexical analysis , computer science , artificial intelligence , bigram , logistic regression , feature vector , bag of words model , variable (mathematics) , pattern recognition (psychology) , machine learning , natural language processing , trigram , mathematics , mathematical analysis
A common representation used in text categorization is the bag of words model (aka. unigram model). Learning with this particular representation involves typically some preprocessing, e.g. stopwords-removal, stemming. This results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach where learning involves automatic tokenization. This allows us to weaken the a-priori required knowledge about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all ngrams. We show that this can be done very efficiently using a branch and bound approach which chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization such as cyclic coordinate descent logistic regression and support vector machines

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom