Open Access
Training a Genre Classifier for Automatic Classification of Web Pages
Author(s) - Vedrana Vidulin, Mitja Luštrek, Matjaž Gams
Publication year - 2007
Publication title - CIT. Journal of Computing and Information Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.169
H-Index - 27
eISSN - 1846-3908
pISSN - 1330-1136
DOI - 10.2498/cit.1001137
Subject(s) - c4.5 algorithm , computer science , classifier (uml) , training set , artificial intelligence , set (abstract data type) , precision and recall , recall , web page , machine learning , information retrieval , natural language processing , pattern recognition (psychology) , support vector machine , world wide web , linguistics , philosophy , naive bayes classifier , programming language
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
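The four measures compared in the abstract (precision, recall, F-measure, accuracy) can be illustrated with a minimal sketch. It assumes each genre is scored as a separate yes/no decision per page; the function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper or its data set.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F-measure and accuracy for one genre.

    y_true and y_pred are equal-length sequences of 0/1 labels, where 1 means
    "the page belongs to this genre". This is an illustrative assumption about
    the evaluation setup, not the paper's exact protocol.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Toy example: 8 pages, 3 of which truly belong to the genre.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

A classifier can gain precision while losing some recall, as the abstract reports for bagging, and the F-measure (the harmonic mean of the two) may then stay roughly unchanged.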
