In an article published in the journal Sensors, researchers introduced a novel hybrid model that combines an Artificial Neural Network (ANN) for new word weighting and a Hidden Markov Model (HMM) for efficient and accurate short-text filtering.
Background
In the digital era, short text messages generated through platforms like short message services (SMS), microblogs, instant messaging apps, and commercial websites have become ubiquitous means of communication. These short messages offer a cost-effective and convenient way to reach a mass audience. However, they have also become a fertile ground for spammers who exploit their low cost and wide reach to disseminate malicious or unwanted content. Short text messages pose unique challenges for effective classification due to their brevity, sparsity, informality, and rapid generation.
Traditional text classification methods like naive Bayes and support vector machines struggle with short texts because such texts lack sufficient semantic content and exhibit word sparsity. Moreover, the need for real-time, high-throughput spam filtering adds to the complexity. Short texts often feature informal language, characterized by abbreviations, misspellings, and creative adaptations of words, making it difficult to identify spam accurately.
To address these challenges, the study proposes a hybrid model. The innovation lies in the new word weighting approach, which calculates the weight of a new word based on its neighboring words and the spam and not-spam probabilities predicted by the ANN. This approach aims to enhance accuracy without compromising processing speed, offering a balanced solution for effective real-time filtering of short text messages.
The Hybrid Model for Short Text Filtering
The authors explored the intricate processes of short text pre-processing, feature extraction, and the innovative hybrid model designed for efficient short text filtering.
- Short Text Pre-Processing: The process begins with several pre-processing steps. "Case folding" standardizes text by converting all capital letters to lowercase for consistent analysis. "Tokenization" divides the raw text into individual words for separate analysis. "Stemming" and "lemmatization" reduce words to their base forms by removing affixes and simplifying their representations. "Stop-word removal" eliminates common words that carry little meaning, with adjustments made for specific message categories. For instance, symbols like '$' that are linked to financial content are retained, as are emojis and special character strings, because they appear in both spam and ham messages.
- Feature Extraction: To account for varying word importance in different categories, a feature extraction algorithm calculates word weights based on their probability of occurrence. Each word's weight reflects the difference between its likelihood of appearing in ham messages and in spam messages, with negative weights for spam-indicative words and positive weights for ham-typical words.
- ANN for New Word Weighting: Short text vectors paired with ham or spam labels are input to the ANN. The ANN is trained using these pairs, and new words whose occurrences exceed a predetermined threshold are weighted based on the disparity between the ANN-generated ham and spam probabilities, which reflects the new word's significance.
- HMM for Short Text Filtering: At the core of the model is the HMM tailored for short text filtering. HMMs are trained using sequences of word weights and ham/spam labels. Short text representation for the HMM comprises sequences of word weights, each linked to ham or spam states. Transition and emission probability matrices are calculated during training based on sequences of word weights from ham and spam texts.
- The Proposed Hybrid Model: The hybrid model integrates the ANN and HMM for new word weighting and efficient spam filtering, using an asynchronous training process. The HMM performs the filtering and sends unidentified text strings to the ANN once their occurrences surpass a threshold. This approach combines prompt HMM operation with periodic ANN retraining, so classification accuracy improves as the set of known words grows and their weights are updated.
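The pre-processing, word-weighting, and new-word steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the stop-word list, the tokenizer pattern, all function names, the occurrence threshold, and the rule that a new word's weight is the ANN's ham probability minus its spam probability are assumptions, chosen only to match the sign convention described (negative for spam-leaning words, positive for ham-leaning words). Stemming, lemmatization, and the neighboring-word term are omitted for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "is", "at", "of", "and"}

def preprocess(text):
    """Case-fold, tokenize, and drop stop words; keep symbols such as '$'."""
    tokens = re.findall(r"[a-z0-9]+|[$£€!]", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def word_weights(messages, labels):
    """Weight = P(word | ham) - P(word | spam); spam-leaning words come out negative."""
    counts = {"ham": Counter(), "spam": Counter()}
    for text, label in zip(messages, labels):
        counts[label].update(preprocess(text))
    totals = {k: sum(c.values()) or 1 for k, c in counts.items()}
    vocab = set(counts["ham"]) | set(counts["spam"])
    return {
        w: counts["ham"][w] / totals["ham"] - counts["spam"][w] / totals["spam"]
        for w in vocab
    }

def to_weight_sequence(text, weights, default=0.0):
    """Represent a message as the sequence of its word weights (input to the HMM)."""
    return [weights.get(t, default) for t in preprocess(text)]

def pending_new_words(token_stream, weights, threshold=3):
    """Unknown words seen at least `threshold` times (assumed trigger for ANN weighting)."""
    unseen = Counter(t for t in token_stream if t not in weights)
    return sorted(w for w, n in unseen.items() if n >= threshold)

def new_word_weight(p_ham, p_spam):
    """Hypothetical rule: weight a new word by the gap between ANN-predicted probabilities."""
    return p_ham - p_spam

if __name__ == "__main__":
    msgs = ["Win cash now", "Free cash prize", "See you at lunch", "Lunch at noon"]
    labs = ["spam", "spam", "ham", "ham"]
    weights = word_weights(msgs, labs)
    print(to_weight_sequence("Cash prize at lunch", weights))
```

In this toy corpus, "cash" appears only in spam and receives a negative weight, while "lunch" appears only in ham and receives a positive one; words outside the vocabulary fall back to a neutral weight until they are seen often enough to be weighted.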
Experiments and Results
The authors conducted experiments to evaluate the hybrid model for short text classification. The model was implemented in Python 3.7, with pomegranate for HMM modeling and scikit-learn for ANN modeling, and run on a computer with an Intel Core i7-7820 CPU and 16 GB of memory. The first experiment used the UCI SMS Spam Collection dataset, which contains 5574 SMS messages (747 spam, 4827 ham). The dataset was divided into training and testing sets, and performance was measured with precision, recall, F1-measure, accuracy, and AUC.
For the HMM, 8127 words were extracted from the training set, and training yielded fitted transition matrices and Gaussian distribution parameters for the spam and ham states. The ANN underwent 57 training iterations with a final loss of 0.03807, successfully identifying new words and expanding its vocabulary.
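Because only 747 of the 5574 messages are spam, accuracy alone can be misleading, which is one reason precision, recall, and F1 are reported alongside it. The following sketch computes these metrics from predicted labels and applies them to a trivial baseline that labels every message as ham. The function name and the baseline are illustrative assumptions, not part of the study; AUC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Precision, recall, F1, and accuracy, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Class counts taken from the dataset description; the all-ham baseline is hypothetical.
y_true = ["spam"] * 747 + ["ham"] * 4827
y_pred = ["ham"] * len(y_true)
baseline = classification_metrics(y_true, y_pred)

if __name__ == "__main__":
    print(baseline)
```

The all-ham baseline reaches roughly 86.6% accuracy while catching no spam at all (recall of 0), so a useful filter has to be judged on the spam-class metrics as well.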
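To make the roles of the transition matrix and the per-state Gaussian parameters concrete, here is a small two-state HMM decoder written with the standard library only. All numeric parameters, the names, and the majority-vote decision rule are assumptions for illustration; the study fitted its parameters with pomegranate, and its exact decision rule is not described here.

```python
import math

STATES = ("ham", "spam")
# Illustrative parameters (assumptions, not the fitted values from the study).
START = {"ham": 0.5, "spam": 0.5}
TRANS = {"ham": {"ham": 0.9, "spam": 0.1}, "spam": {"ham": 0.1, "spam": 0.9}}
EMIT = {"ham": (0.2, 0.2), "spam": (-0.2, 0.2)}  # (mean, std) of word weights per state

def log_gauss(x, mean, std):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(obs):
    """Most likely ham/spam state for each word weight in `obs` (log-space Viterbi)."""
    if not obs:
        return []
    score = {s: math.log(START[s]) + log_gauss(obs[0], *EMIT[s]) for s in STATES}
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + log_gauss(x, *EMIT[s])
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def classify(obs):
    """Hypothetical decision rule: label the message by the majority decoded state."""
    path = viterbi(obs)
    return "spam" if path.count("spam") > len(path) / 2 else "ham"

if __name__ == "__main__":
    print(classify([-0.3, -0.25, -0.4]), classify([0.3, 0.2, 0.35]))
```

Decoding costs time linear in the number of words and involves only a handful of arithmetic operations per word, which is consistent with the emphasis on high-throughput filtering.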
Comparisons with other models like Naïve Bayes, Support Vector Machines, Decision Trees, Linear Discriminant Analysis, Long Short-Term Memory Networks (LSTM), and Convolutional Neural Networks (CNN) demonstrated the hybrid model's superior performance in classifying ham messages.
The study extended to other datasets, where the hybrid model showed competitive accuracy and performed especially well on the Chinese SMS dataset, highlighting its multi-language filtering capabilities. Additionally, comparisons of training time and throughput favored the hybrid model over the deep learning models, indicating that it can meet production environment requirements.
The authors emphasized the hybrid model's success in capturing short-term dependencies in sequential data, making it effective for short-text classification. It addressed the challenge of informal language in short texts through the HMM and ANN combination, improving accuracy and throughput. However, the authors acknowledged that HMMs are better suited for short-term dependencies and may not perform optimally on longer texts or tasks involving longer-term dependencies.
Conclusions
The paper introduced a hybrid model, combining an ANN and an HMM, designed for filtering spam short texts. The model addresses the challenge of handling new words in spam texts through a new word weighting approach that enhances classification accuracy. Experimental tests on benchmark datasets, including the UCI SMS dataset, showed superior performance compared to other machine learning algorithms, with the model even outperforming deep learning models like CNN and LSTM. The hybrid model balances accuracy and speed, making it well-suited for practical applications in short text filtering. Future research will explore further hybrid methodologies involving HMM and deep learning models.
Journal reference:
- Xia, T., Chen, X., Wang, J., & Qiu, F. (2023). A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts. Sensors, 23(21), 8975. https://doi.org/10.3390/s23218975, https://www.mdpi.com/1424-8220/23/21/8975