Open Access
Extended Pre-Processing Pipeline For Text Classification: On the Role of Meta-Features, Sparsification and Selective Sampling
Author(s) - Washington Cunha
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/sbbd_estendido.2021.18180
Subject(s) - computer science, pipeline (software), pipeline transport, sampling (signal processing), representation (politics), word (group theory), data mining, information retrieval, artificial intelligence, machine learning, engineering, linguistics, philosophy, filter (signal processing), environmental engineering, politics, law, political science, computer vision, programming language
Pipelines for Text Classification are sequences of tasks performed to classify documents. The pre-processing phase of these pipelines covers the different ways of manipulating documents before the learning phase. This Master's thesis introduces three new steps into the traditional pre-processing phase: 1) Meta-Features Generation; 2) Sparsification; and 3) Selective Sampling. Our experimental results, based on more than 5,600 measurements, show that our proposal can achieve significant gains in effectiveness when compared to the traditional TF-IDF representation (up to 52%) and word embeddings (up to 46%), at a much lower cost (9.7x faster). The thesis also includes a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with the introduction of these new steps into the pipeline, as well as a comprehensive comparative experimental evaluation of many alternatives. This thesis falls under the topics of (i) Document Management and Classification, (ii) Information Retrieval Models and Techniques, and (iii) Text Database of the SBBD Call for Papers.
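
The abstract names the three added steps but does not say how each is implemented. The sketch below is only an illustration of where such steps could sit relative to a TF-IDF baseline, not the thesis's method. Every concrete choice in it is an assumption: meta-features are taken to be cosine similarities to per-class centroids, sparsification is a fixed threshold, selective sampling is a random per-class cap, and all function names and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Baseline TF-IDF: one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroids(vectors, labels):
    """Mean TF-IDF vector per class."""
    sums, counts = {}, Counter(labels)
    for vec, y in zip(vectors, labels):
        acc = sums.setdefault(y, Counter())
        for t, w in vec.items():
            acc[t] += w
    return {y: {t: w / counts[y] for t, w in acc.items()} for y, acc in sums.items()}

def meta_features(vec, cents):
    """Step 1 (assumed form): similarity of a document to each class centroid."""
    return {y: cosine(vec, c) for y, c in sorted(cents.items())}

def sparsify(features, threshold=0.1):
    """Step 2 (assumed form): drop feature values below a fixed threshold."""
    return {k: v for k, v in features.items() if v >= threshold}

def selective_sample(items, labels, per_class, seed=0):
    """Step 3 (assumed form): keep at most `per_class` random items per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    keep = []
    for y in sorted(by_class):
        idx = by_class[y]
        keep.extend(sorted(rng.sample(idx, min(per_class, len(idx)))))
    return [items[i] for i in keep], [labels[i] for i in keep]

# Toy data, invented for illustration only.
docs = [d.split() for d in [
    "cheap pills buy now", "buy cheap watches now",
    "meeting agenda for monday", "monday project meeting notes",
]]
labels = ["spam", "spam", "ham", "ham"]

vecs = tfidf(docs)
# Note: centroids here include each document itself; a real experiment
# would compute them on held-out folds to avoid leaking labels.
cents = centroids(vecs, labels)
dense = [meta_features(v, cents) for v in vecs]
sparse = [sparsify(m) for m in dense]
train_x, train_y = selective_sample(sparse, labels, per_class=1)
```

In this sketch each document is reduced from a vocabulary-sized TF-IDF vector to at most one value per class, and the training set is cut to one document per class before any classifier would be trained. That is the kind of cost reduction the abstract alludes to, though the actual techniques and thresholds used in the thesis may differ.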
