Textual, Non-Textual, and Hybrid Feature Engineering for SMS Spam Classification
Author(s) - Aditi R. Verma, Shriya Sadana
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3620751
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Contemporary spam filters increasingly rely on resource-intensive deep learning models. This study evaluates the performance, robustness, and deployability of lightweight learning models. It provides a head-to-head evaluation of probabilistic (Naïve Bayes) and margin-based (Support Vector Machine, SVM) classifiers on three feature spaces derived from the 5574-message UCI Short Message Service (SMS) spam collection. Our primary finding is that a hybrid model, which fuses a bag-of-words (BoW) representation with 22 handcrafted metadata features, achieves the highest accuracy, with the SVM peaking at 98.3%. To assess resilience, the model was tested against adversarial attacks. The hybrid SVM model exhibited strong robustness when faced with altered data and maintained 72.41% accuracy against challenging semantic attacks. Furthermore, the hybrid SVM model demonstrated strong cross-dataset generalization, achieving 74.38% accuracy when trained on the original UCI data and tested on a modern, diverse dataset of SMS, Telegram, and email messages. Deployment analysis confirmed the efficiency of the framework: the fastest model processed ~200 requests/s at less than 10 ms latency and ~1.5% average CPU load on a standard CPU. The results establish three key principles for next-generation spam filters: (i) lexical information remains the dominant signal; (ii) lightweight metadata provides measurable incremental value when paired with text; and (iii) margin-based classifiers exploit multimodal fusion most effectively. Taken together, these findings indicate that a lightweight hybrid feature-engineering approach provides a robust, generalizable, and resource-efficient solution for real-time spam mitigation, and thus a compelling and practical alternative to computationally expensive deep learning architectures.
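The hybrid feature space described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the 22 handcrafted features, so the four metadata features below (character count, digit count, uppercase count, URL flag) are hypothetical stand-ins, and the function names, tokenizer, and toy messages are assumptions made for illustration. The sketch only shows the fusion step, in which BoW counts and metadata values are concatenated into a single vector that a classifier such as an SVM or Naïve Bayes could then consume.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(messages):
    """Map every token seen in the training messages to a column index."""
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: i for i, tok in enumerate(vocab)}


def bow_vector(text, vocab):
    """Bag-of-words counts over a fixed vocabulary; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def metadata_vector(text):
    """Four illustrative metadata features (hypothetical, not the paper's 22)."""
    n_chars = len(text)
    n_digits = sum(ch.isdigit() for ch in text)
    n_upper = sum(ch.isupper() for ch in text)
    has_url = 1 if re.search(r"https?://|www\.", text, re.IGNORECASE) else 0
    return [n_chars, n_digits, n_upper, has_url]


def hybrid_vector(text, vocab):
    """Fuse the two views by concatenating BoW counts with metadata values."""
    return bow_vector(text, vocab) + metadata_vector(text)


# Toy training messages, invented for this example.
train = [
    "WIN a FREE prize now, call 0800123456",
    "are we still meeting for lunch today",
]
vocab = build_vocabulary(train)
message = "FREE prize! Visit www.example.com"
features = hybrid_vector(message, vocab)
```

In practice the metadata columns would usually be scaled before training a margin-based classifier, since raw character counts are much larger than individual word counts; whether and how the paper does this is not stated in the abstract.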