
A Probabilistic Method for Hierarchical Multisubject Classification of Documents based on Multilingual Subject Term Vocabularies
Author(s) - Nikolaos Makris, Stamatina K. Koutsileou, Nikolaos Mitrou
Publication year - 2025
Publication title - IEEE Open Journal of the Computer Society
Language(s) - English
Resource type - Magazines
eISSN - 2644-1268
DOI - 10.1109/ojcs.2025.3592254
Subject(s) - computing and processing
Hierarchical Multilabel Classification (HMC) is a challenging task in information retrieval, especially for scientific textbooks, where the objective is to assign multiple labels that adhere to a hierarchical taxonomy. This research presents a new language-neutral methodology for HMC that represents documents as normalised weighted distributions of well-defined subjects across hierarchical levels, based on a hierarchical subject term vocabulary. In contrast to typical methods that depend on machine learning models, the proposed approach uses Bayesian formulas, thereby obviating the need for resource-intensive training processes at various hierarchical levels. The method integrates refined pre-processing techniques, such as natural language processing (NLP) and filtering of non-distinctive terms, to enhance classification accuracy. It employs Bayesian inference along with real-time and cached computations across all hierarchical levels, yielding an effective, time-efficient and interpretable classification method that remains scalable to large datasets. Experimental results demonstrate that the algorithm classifies scientific textbooks across hierarchical subject tiers with significant precision and recall and retrieves semantically related scientific textbooks, thereby verifying its efficacy in tasks requiring hierarchical subject classification. This study offers a streamlined, interpretable alternative to model-dependent HMC approaches that is particularly appropriate for real-world applications in educational and scientific fields. Furthermore, in the context of the present study, two public web user interfaces were published: the first is built on Skosmos and illustrates the hierarchical structure of the subject term vocabulary, while the second employs the HMC method to present, in real time, the classification of English and Greek textual data across subjects.
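The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea: scoring a document against a hierarchical subject term vocabulary with Bayes' rule and rolling the result up into a normalised distribution per level. Everything in it is an assumption made for illustration: the toy `VOCAB`, the `classify` function, the `max_subjects_per_term` cut-off used to filter non-distinctive terms, the uniform subject prior, and the uniform term likelihood within each subject. It is not the authors' algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary: term -> leaf subjects, each given as a
# (level-1, level-2) path in the hierarchy. Not taken from the paper.
VOCAB = {
    "algorithm": [("Engineering", "Computer Science")],
    "compiler": [("Engineering", "Computer Science")],
    "circuit": [("Engineering", "Electrical Engineering")],
    "signal": [("Engineering", "Electrical Engineering"),
               ("Engineering", "Computer Science")],
    "enzyme": [("Life Sciences", "Biochemistry")],
    "method": [("Engineering", "Computer Science"),
               ("Engineering", "Electrical Engineering"),
               ("Life Sciences", "Biochemistry")],
}


def classify(tokens, vocab=VOCAB, max_subjects_per_term=2):
    """Return {level: {subject_path: weight}}, weights summing to 1 per level."""
    # Assumed likelihood: P(term | subject) = 1 / (number of terms of that subject).
    subject_sizes = Counter(s for subjects in vocab.values() for s in subjects)
    term_counts = Counter(t.lower() for t in tokens)

    leaf = defaultdict(float)
    for term, count in term_counts.items():
        subjects = vocab.get(term, [])
        # Skip unknown terms and non-distinctive terms (shared by too many subjects).
        if not subjects or len(subjects) > max_subjects_per_term:
            continue
        likelihood = {s: 1.0 / subject_sizes[s] for s in subjects}
        evidence = sum(likelihood.values())
        # Bayes' rule with a uniform prior: P(s | term) = P(term | s) / sum over s'.
        for s, lik in likelihood.items():
            leaf[s] += count * lik / evidence

    total = sum(leaf.values())
    if total == 0:
        return {}

    # Roll leaf weights up the hierarchy: a parent's weight is the sum of its children.
    depth = max(len(s) for s in leaf)
    levels = {}
    for level in range(1, depth + 1):
        agg = defaultdict(float)
        for s, w in leaf.items():
            agg[s[:level]] += w / total
        levels[level] = dict(agg)
    return levels


result = classify(["algorithm", "compiler", "circuit", "method"])
```

In this toy run, `method` is dropped as non-distinctive because it maps to three subjects, and the remaining terms yield one distribution per hierarchical level. Because nothing is trained, the per-term posteriors depend only on the vocabulary and could be cached, which is one plausible reading of the cached computations mentioned above.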