Open Access
Hierarchical Multi-class and Multi-label Text Classification for Crime Report: A Traditional Machine Learning Approach
Author(s) - Andre R. Vieira, Glaucio De S. Santos, Wilson S. Melo, Luiz F. Rust
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3638984
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Large amounts of digital data are produced daily as society uses services provided by government and private companies. Digital transformation contributes to the increasing amount of structured and unstructured data stored in digital media. Organizations build centralized data repositories to store and provide information to business areas, supporting Business Intelligence solutions. Some databases store vast amounts of unstructured data, which must be systematized and classified to meet the data owner’s needs. In criminal incident report systems, each recorded incident must be classified as a specific crime, with hundreds or thousands of categories presented to the responsible officer. This work explores a clustering approach that groups categories into a hierarchical tree of classes, enabling the use of Machine Learning (ML) models such as XGBoost for automated classification of criminal incident report narratives. As a case study, the Civil Police of the State of Rio de Janeiro (SEPOL/RJ) has a database with over 6.5 million records, which grows daily with reports from Judicial Police Units (JPU) across the state. Each new report requires manual classification. A hierarchical tree of classes was developed to segment the problem, allowing multiple XGBoost models to perform the automated classification. The proposed hierarchical model with 80 classes achieved an accuracy of 0.463, outperforming the baseline flat model, which reached 0.419, along with a 25.48% reduction in training time. The weighted average F1-score obtained by the hierarchical model was 0.48188, while the baseline model reached 0.44061. The improvement was statistically validated through a Wilcoxon signed-rank test, which yielded a p-value of 0.000010.
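The routing idea behind a hierarchical tree of classes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the group names, class labels, keyword rules, and the `keyword_classifier` and `HierarchicalClassifier` helpers are all hypothetical, and the toy keyword matchers merely stand in for the XGBoost models the abstract mentions. A root model first picks a group of categories, then a smaller model for that group picks the final class.

```python
from typing import Callable, Dict, Tuple

# A classifier is any callable that maps a report narrative to a label.
Classifier = Callable[[str], str]


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Toy stand-in for a trained model (hypothetical helper).

    The first keyword found in the text decides the label;
    if no keyword matches, the default label is returned.
    """
    def predict(text: str) -> str:
        lowered = text.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default
    return predict


class HierarchicalClassifier:
    """Two-level tree: the root picks a group, the group's model picks the class."""

    def __init__(self, root: Classifier, children: Dict[str, Classifier]):
        self.root = root
        self.children = children

    def predict(self, text: str) -> Tuple[str, str]:
        group = self.root(text)            # level 1: coarse group of categories
        leaf = self.children[group](text)  # level 2: class within that group
        return group, leaf


# Hypothetical groups and classes, chosen only for illustration.
root = keyword_classifier(
    {"stolen": "property", "broke into": "property", "threat": "person"},
    default="other",
)
children = {
    "property": keyword_classifier(
        {"car": "vehicle_theft", "broke into": "burglary"}, default="theft"
    ),
    "person": keyword_classifier({"threat": "threat"}, default="assault"),
    "other": keyword_classifier({}, default="unclassified"),
}

model = HierarchicalClassifier(root, children)
print(model.predict("My car was stolen last night"))
```

Each child model only has to separate the classes inside its own group, so every individual problem has far fewer categories than the flat formulation. A known trade-off of this design is that a routing error at the root cannot be corrected further down the tree.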
