Document Representations for Classification of Short Web-Page Descriptions
Author(s) - Miloš Radovanović, Mirjana Ivanović
Publication year - 2006
Publication title - Lecture Notes in Computer Science
Language(s) - English
Resource type - Book series
SCImago Journal Rank - 0.249
H-Index - 400
eISSN - 1611-3349
pISSN - 0302-9743
DOI - 10.1007/11823728_52
Subject(s) - computer science , information retrieval , artificial intelligence , naive bayes classifier , document classification , support vector machine , classifier (uml) , normalization (sociology) , web page , pattern recognition (psychology) , natural language processing , machine learning , world wide web , sociology , anthropology
Motivated by applying Text Categorization to sorting Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers -- Naive Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts represent short Web-page descriptions from the dmoz Open Directory Web-page ontology. Different transformations of input data: stemming, normalization, logtf and idf, together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics -- accuracy, precision, recall, F$_1$ and F$_2$. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.
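To make the transformations named in the abstract concrete, the following is a minimal Python sketch of how logtf, idf and normalization might be applied to bag-of-words vectors. It is not the authors' implementation. The specific formulas (logtf as log(1 + tf), idf as log(N / df), normalization to unit Euclidean length) are common textbook choices assumed here for illustration, and all function names are hypothetical. Stemming and dimensionality reduction are omitted.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, raw term-frequency vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors


def logtf(vec):
    """Dampen raw term frequencies; log(1 + tf) is an assumed formula."""
    return [math.log(1.0 + x) for x in vec]


def idf_weights(vectors):
    """One weight per term: log(N / df), an assumed formula.

    df is at least 1 because the vocabulary is built from the documents.
    """
    n = len(vectors)
    weights = []
    for j in range(len(vectors[0])):
        df = sum(1 for v in vectors if v[j] > 0)
        weights.append(math.log(n / df))
    return weights


def normalize(vec):
    """Scale a vector to unit Euclidean length; all-zero vectors are left unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def represent(docs, use_logtf=False, use_idf=False, use_norm=False):
    """Build a document representation with any subset of the transformations."""
    vocab, vectors = bag_of_words(docs)
    if use_logtf:
        vectors = [logtf(v) for v in vectors]
    if use_idf:
        w = idf_weights(vectors)
        vectors = [[x * wj for x, wj in zip(v, w)] for v in vectors]
    if use_norm:
        vectors = [normalize(v) for v in vectors]
    return vocab, vectors


if __name__ == "__main__":
    docs = ["free web search engine", "web directory of web pages"]
    vocab, vectors = represent(docs, use_logtf=True, use_norm=True)
    print(vocab)
    print(vectors)
```

Exposing each transformation as an independent flag mirrors the study's focus on the effect of individual transformations and their combinations, rather than on a single fixed weighting scheme.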