Textual, Non-Textual, and Hybrid Feature Engineering for SMS Spam Classification

Aditi R. Verma; Shriya Sadana

Open Access

Author(s) - Aditi R. Verma, Shriya Sadana

Publication year - 2025

Publication title - IEEE Access

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.587

H-Index - 127

eISSN - 2169-3536

DOI - 10.1109/access.2025.3620751

Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation

Contemporary spam filters increasingly rely on resource-intensive deep learning models. This study instead evaluates the performance, robustness, and deployability of lightweight learning models. It provides a head-to-head evaluation of probabilistic (Naïve Bayes) and margin-based (Support Vector Machine, SVM) classifiers on three feature spaces (textual, non-textual, and hybrid) derived from the 5,574-message UCI Short Message Service (SMS) Spam Collection. Our primary finding is that a hybrid model, which fuses a bag-of-words (BoW) representation with 22 handcrafted metadata features, achieves the highest accuracy, with the SVM peaking at 98.3%. To assess resilience, the model was tested against adversarial attacks. The hybrid SVM model exhibited strong robustness when faced with altered data and maintained 72.41% accuracy against challenging semantic attacks. It also generalized across datasets, achieving 74.38% accuracy when trained on the original UCI data and tested on a modern, diverse dataset of SMS, Telegram, and email messages. Deployment analysis confirmed the efficiency of the framework: the fastest model processed ~200 requests/s at less than 10 ms latency and ~1.5% average CPU load on a standard CPU. The results establish three key principles for next-generation spam filters: (i) lexical information remains the dominant signal; (ii) lightweight metadata provides measurable incremental value when paired with text; and (iii) margin-based classifiers exploit multimodal fusion most effectively. Taken together, these findings indicate that a lightweight hybrid feature-engineering approach provides a robust, generalizable, and resource-efficient solution for real-time spam mitigation, and a practical alternative to computationally expensive deep learning architectures.
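The hybrid feature space described in the abstract can be pictured as a bag-of-words vector concatenated with a short vector of message metadata. The following is a minimal sketch of that idea using only the Python standard library. It is not the authors' pipeline: the abstract does not list the 22 handcrafted features, so the four metadata features below (length, digit count, uppercase ratio, URL flag), the tokenizer, and all function names are illustrative assumptions.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)


def tokenize(text):
    """Lowercase the message and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(messages):
    """Map every token seen in the given messages to a column index."""
    vocab = {}
    for message in messages:
        for token in tokenize(message):
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(text, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0.0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = float(count)
    return vector


def metadata_vector(text):
    """Four hypothetical metadata features (assumed, not the paper's 22)."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return [
        float(len(text)),                        # message length in characters
        float(sum(c.isdigit() for c in text)),   # number of digit characters
        upper_ratio,                             # share of letters in uppercase
        1.0 if URL_RE.search(text) else 0.0,     # contains a URL-like prefix
    ]


def hybrid_vector(text, vocab):
    """Fuse the two feature spaces by simple concatenation."""
    return bow_vector(text, vocab) + metadata_vector(text)


# Toy messages invented for illustration; they are not from the UCI collection.
messages = [
    "WIN a FREE prize now! Visit www.example.com or call 08001234567",
    "Are we still meeting for lunch today?",
]
vocab = build_vocabulary(messages)
features = [hybrid_vector(m, vocab) for m in messages]
```

Each resulting row has one column per vocabulary token followed by the four metadata columns, and could then be passed to a linear classifier such as an SVM. In practice the metadata columns would likely need scaling so that raw values like message length do not dominate the word counts; whether and how the paper does this is not stated in the abstract.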
