
Privacy-Aware Detection for Large Language Models Using a Hybrid BiLSTM-HMM Approach
Author(s) -
Maryam Abbasalizadeh,
Sashank Narain
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/ACCESS.2025.3587988
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Large Language Models (LLMs) have transformed natural language processing, enabling applications such as conversational agents and machine translation. However, their deployment introduces significant privacy concerns, including the memorization and unintended disclosure of sensitive data. Existing privacy-preserving techniques, such as Differential Privacy and federated learning, struggle to balance data protection, model utility, and computational efficiency. To address these limitations, we propose a lightweight privacy-disclosure detection system that combines Bidirectional Long Short-Term Memory (BiLSTM) networks with Hidden Markov Models (HMMs) in a novel modeling pipeline. Our approach employs the Predefined and Sensitive Labeling (PSL) technique, a generative labeling approach that extracts meaningful patterns from data. These patterns are then used to train a BiLSTM model capable of proactively identifying sensitive information in real-time user interactions with LLMs. Because BiLSTMs do not provide probability estimates for the detected private data, we design an HMM that estimates the probability of its occurrence. Using the Forward algorithm, our system quantifies privacy risks, enabling users to revise inputs prior to submission and thereby enhancing data privacy. Trained on synthetic data using the PSL technique, the model achieves approximately 99.94% precision, recall, and F1-score, and detects previously unseen sensitive information in synthetic datasets with ≈99.98% accuracy across 55,000 sentences. In addition, the model, trained on patterns derived from synthetic data, achieves ≈99.99% accuracy when evaluated on a real-world dataset across varying sentence structures, demonstrating strong generalizability in detecting sensitive information regardless of the data source. Importantly, the model provides real-time predictions with an average execution time of 35.46 milliseconds, satisfying the speed requirements for practical deployment. It also trains 45.5 times faster than a state-of-the-art framework, offering high computational efficiency without compromising accuracy.
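
The abstract describes the pipeline at a high level: a BiLSTM tagger flags potentially sensitive tokens, and an HMM scored with the Forward algorithm turns the detections into a probability of private-data occurrence. The Python sketch below illustrates these two stages only as a minimal example; it is not the authors' implementation, and the class names, the two-tag scheme, and all parameter values are illustrative assumptions (PyTorch and NumPy are used for compactness).

# Minimal sketch (not the authors' released code): a BiLSTM token tagger that
# flags potentially sensitive tokens, followed by an HMM Forward pass that
# turns the predicted tag sequence into a privacy-risk probability.
# Class names, the 2-tag scheme, and all parameters below are illustrative
# assumptions, not values from the paper.
import numpy as np
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Tags each token as non-sensitive (0) or sensitive (1)."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> per-token tag logits
        emb = self.embed(token_ids)
        hidden, _ = self.bilstm(emb)
        return self.classifier(hidden)

def hmm_forward_log_prob(observations, log_pi, log_A, log_B):
    """Forward algorithm: log P(observation sequence) under an HMM.

    observations: observed symbol indices (here, the predicted tags)
    log_pi: (S,)   log initial-state probabilities
    log_A:  (S, S) log transition probabilities
    log_B:  (S, O) log emission probabilities
    """
    alpha = log_pi + log_B[:, observations[0]]
    for obs in observations[1:]:
        # log-sum-exp over previous states, then add the emission term
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, obs]
    return np.logaddexp.reduce(alpha)

# Usage sketch: tag a (random stand-in) sentence, then score the tag sequence
# under a small hand-set "disclosure" HMM. In the paper's pipeline the HMM
# would instead be fit on PSL-labeled synthetic data.
model = BiLSTMTagger(vocab_size=10_000)
token_ids = torch.randint(0, 10_000, (1, 12))
tags = model(token_ids).argmax(dim=-1).squeeze(0).tolist()

log_pi = np.log([0.9, 0.1])                   # start in benign vs. disclosure state
log_A = np.log([[0.8, 0.2], [0.3, 0.7]])      # state transitions
log_B = np.log([[0.95, 0.05], [0.10, 0.90]])  # emit tag 0/1 from each state
print("privacy-risk log-probability:",
      hmm_forward_log_prob(tags, log_pi, log_A, log_B))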