The textcat Package for n-Gram Based Text Categorization in R

Open Access

Author(s) - Kurt Hornik, Patrick Mair, J. Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer

Publication year - 2013

Publication title - Journal of Statistical Software

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 7.636

H-Index - 145

ISSN - 1548-7660

DOI - 10.18637/jss.v052.i06

Subject(s) - computer science, identification (biology), categorization, variety (cybernetics), artificial intelligence, selection (genetic algorithm), natural language processing, algorithm, programming language, biology, botany

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
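To make the idea of categorization by character n-gram frequencies concrete, the following is a minimal Python sketch of a rank-based profile comparison in the spirit of Cavnar and Trenkle (1994). It is an illustration only and is not the textcat package's code or API. The function names (`ngram_profile`, `out_of_place`, `categorize`), the single underscore used for word padding, the profile size of 400, the fixed penalty for missing n-grams, and the two toy training sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, size=400):
    """Rank the character n-grams (n = 1..max_n) of `text` by frequency.

    Assumption: each word is padded with one underscore on either side.
    """
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:size]


def out_of_place(doc_profile, cat_profile, missing_penalty=400):
    """Sum of rank differences between a document and a category profile.

    N-grams absent from the category profile get a fixed penalty (assumed value).
    """
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    return sum(
        abs(rank - cat_rank[gram]) if gram in cat_rank else missing_penalty
        for rank, gram in enumerate(doc_profile)
    )


def categorize(text, profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda cat: out_of_place(doc_profile, profiles[cat]))


# Toy training data (hypothetical); real profiles are built from much larger corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps "
               "in the house while the children are playing in the garden",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze "
              "schläft im haus während die kinder im garten spielen",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

print(categorize("the children are in the house", profiles))
print(categorize("die kinder spielen im garten", profiles))
```

With profiles this small the sketch is only reliable on text that closely resembles the training sentences. The rank-based distance is used here because it depends only on the ordering of n-grams, not on raw counts, so documents and category profiles of very different lengths can be compared directly.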
