Automatic subject heading assignment for online government publications using a semi‐supervised machine learning approach
Author(s) - Hu Xiao, Jackson Larry S., Deng Sai, Zhang Jing
Publication year - 2005
Publication title - Proceedings of the American Society for Information Science and Technology
Language(s) - English
Resource type - Journals
eISSN - 1550-8390
pISSN - 0044-7870
DOI - 10.1002/meet.14504201139
Subject(s) - computer science , classifier (uml) , categorization , machine learning , artificial intelligence , subject (documents) , maximization , text categorization , training set , information retrieval , world wide web , economics , microeconomics
As online publishing continues to expand dramatically, state libraries urgently need effective tools to organize and archive the large number of government documents published online. Automatic text categorization techniques can classify documents approximately, given a sufficient number of labeled training examples. However, obtaining training labels is expensive because it requires substantial manual labor. We present a real-world online government information preservation project (PEP) in the State of Illinois and a semi-supervised machine learning approach, an Expectation-Maximization (EM) algorithm-based text classifier, which is applied to automatically assign subject headings to documents harvested in the PEP project. The EM classifier makes use of easily obtained unlabeled documents and thus reduces the demand for labeled training examples. This paper describes both the context and the procedure of this application. Experimental results are reported, and alternative approaches are discussed.
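To make the idea of an EM-based semi-supervised text classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multinomial naive Bayes base model with Laplace smoothing, which the abstract does not specify. The function names, the `alpha` and `n_iter` parameters, and the toy subject headings ("Finance", "Education") are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def train_nb(docs, weights, classes, vocab, alpha=1.0):
    """M-step: fit priors and word likelihoods from soft class weights."""
    prior_counts = {c: alpha for c in classes}
    word_counts = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            prior_counts[c] += w[c]
            for t in tokens:
                word_counts[c][t] += w[c]
    total = sum(prior_counts.values())
    log_prior = {c: math.log(prior_counts[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((word_counts[c][t] + alpha) / denom)
                       for t in vocab}
    return log_prior, log_like

def posterior(tokens, model, classes):
    """E-step for one document: P(class | tokens) under the current model."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in tokens
                                    if t in log_like[c])
              for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_classifier(labeled, unlabeled, classes, n_iter=10):
    """Train on labeled docs, then alternate E- and M-steps with unlabeled docs."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    # Labeled documents keep fixed, one-hot class weights throughout.
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, lab_w, classes, vocab)
    for _ in range(n_iter):
        unl_w = [posterior(tokens, model, classes) for tokens in unlabeled]
        model = train_nb(lab_docs + list(unlabeled), lab_w + unl_w,
                         classes, vocab)
    return model

def predict(tokens, model, classes):
    """Return the most probable class for a tokenized document."""
    p = posterior(tokens, model, classes)
    return max(p, key=p.get)

# Toy data (hypothetical): two labeled documents and four unlabeled ones.
classes = ["Finance", "Education"]
labeled = [(["tax", "budget"], "Finance"), (["school", "teacher"], "Education")]
unlabeled = [["tax", "revenue"], ["budget", "revenue"],
             ["school", "student"], ["teacher", "student"]]
model = em_classifier(labeled, unlabeled, classes)
print(predict(["revenue"], model, classes))
```

In this toy setup, the word "revenue" never appears in a labeled document. The classifier can associate it with a heading only because it co-occurs with labeled vocabulary in the unlabeled documents, which illustrates how unlabeled data can reduce the demand for labeled examples.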
