In an article recently published in the journal Scientific Reports, researchers proposed a hybrid feature selection and ensemble-based machine learning (ML) approach for reliable and effective detection of botnets.
Background
A botnet is a network of compromised computers controlled by a single attacker, known as the bot master. Botnets are created by infecting several computers through phishing attacks or malware infections. Once the infected computers become part of the botnet, they can initiate attacks on other networks and computers. Intruders can exploit a botnet to launch distributed denial of service (DDoS) attacks, abuse online services, send phishing emails, and harm governments, businesses, and individuals by extracting private data.
Thus, botnet detection is crucial to ensure the integrity and security of computer networks and systems. However, detecting botnets with existing, conventional detection systems is becoming increasingly challenging because botnet strategies continue to advance and evolve, which necessitates a more proactive and dynamic approach.
Although ML-based approaches can analyze network traffic patterns to detect botnets, a single ML algorithm cannot effectively detect all botnet types. Additionally, using multiple classifiers in botnet detection models has several limitations, including higher false positive rates (FPR) and lower detection rates. Imbalanced datasets also make it harder to detect botnets with high accuracy.
Moreover, several existing datasets employed in botnet detection contain mutually informative and correlated features, which makes feature selection difficult. This calls for effective, novel feature selection approaches that can precisely identify and leverage the most useful features to improve botnet detection accuracy.
The proposed approach
In this study, researchers proposed a novel hybrid feature selection and ensemble-based ML approach for botnet detection to increase the efficiency of detecting evolving and new botnets with a higher true positive rate (TPR).
Researchers used the N-BaIoT, Bot-IoT, CTU-13, ISCX, CCC, and CICIDS datasets to evaluate the proposed ensemble ML models. The synthetic minority over-sampling technique (SMOTE) was applied to mitigate dataset imbalance by generating synthetic data points for the minority class.
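The core idea behind SMOTE can be illustrated with a minimal sketch: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name smote_oversample, the parameter values, and the toy flow records below are assumptions made for illustration; the article does not describe the researchers' exact configuration, and in practice a maintained implementation would normally be used instead.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy, made-up flow records: (packets per second, mean packet size in bytes)
benign = [(10.0, 500.0), (12.0, 480.0), (11.0, 510.0),
          (9.0, 495.0), (13.0, 505.0), (10.5, 490.0)]
botnet = [(90.0, 60.0), (95.0, 64.0), (100.0, 58.0)]

# Generate enough synthetic botnet points to match the benign class size
new_points = smote_oversample(botnet, n_new=len(benign) - len(botnet))
balanced_botnet = botnet + new_points
print(len(benign), len(balanced_botnet))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing records.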
Three feature selection techniques, namely categorical analysis (CA), mutual information (MI), and principal component analysis (PCA), were used to select the most relevant features for botnet detection and improve the detection capabilities of the ensemble learners.
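To make the MI-based step concrete, the following sketch ranks discrete features by their mutual information with the class label, I(X;Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))). It covers only the MI part, not CA or PCA, and the feature names and values are invented for illustration. How the researchers estimated MI or combined the three techniques is not described in the article, so this is one plausible reading rather than their method.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy, made-up data: label 1 = botnet flow, 0 = benign flow
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "uses_irc_port": [1, 1, 1, 1, 0, 0, 0, 0],     # perfectly aligned with the label
    "is_weekend": [1, 0, 1, 0, 1, 0, 1, 0],        # independent of the label
    "high_packet_rate": [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
}

scores = {name: mutual_information(col, labels) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
top_two = ranking[:2]  # keep the two most informative features

for name in ranking:
    print(f"{name:18s} {scores[name]:.4f} bits")
```

A feature that is independent of the label scores zero and would be dropped first, while a feature that fully determines the label scores the label's entropy (1 bit for balanced binary labels).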
Five ensemble ML techniques, namely the extra-trees, bagging, random forest, extreme gradient boosting, and stacking ensemble techniques, were evaluated and compared in this study.
The experiments were run on a machine with an 11th Gen Intel(R) Core(TM) i7-11700 processor and 16 GB of RAM. Researchers performed the analyses in Python within the Jupyter Notebook interface, using the Scikit-learn library to implement the ML models.
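A comparison of this kind can be sketched with Scikit-learn as shown below. Several parts are assumptions rather than details from the study: the data is synthetic (not one of the botnet datasets), all hyperparameters are arbitrary, the stacking base learners and meta-learner are chosen only for illustration, and Scikit-learn's GradientBoostingClassifier stands in for extreme gradient boosting, which is usually provided by a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; NOT one of the botnet datasets named in the article.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Assumption: stand-in for extreme gradient boosting.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Assumption: base learners and meta-learner picked only for illustration.
    "stacking": StackingClassifier(
        estimators=[
            ("et", ExtraTreesClassifier(n_estimators=25, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

# Fit each ensemble on the training split and score it on the held-out split
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {acc:.3f}")
```

Scores on this synthetic data say nothing about the results reported in the study; the sketch only shows how the same train/test split can be reused to compare several ensembles fairly.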
Several assessment metrics, including accuracy, precision, recall, F1-score, Cohen’s kappa, area under the receiver operating characteristic (ROC) curve (AUC), and balanced accuracy (BACC), were employed to assess the effectiveness of the proposed botnet detection approach.
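Most of these metrics follow directly from the four confusion-matrix counts, as the sketch below shows. The counts are made up for illustration and are not results from the study, the function name binary_metrics is hypothetical, and the code assumes no denominator is zero. AUC is omitted because it requires ranked prediction scores rather than hard counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive common detection metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also the true positive rate (TPR)
    fpr = fp / (fp + tn)           # false positive rate
    specificity = tn / (tn + fp)   # true negative rate, equal to 1 - FPR
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    # Cohen's kappa: observed agreement corrected for agreement expected by chance
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }


# Made-up counts: 100 botnet flows and 100 benign flows
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name:18s} {value:.4f}")
```

With these counts the accuracy and balanced accuracy are both 0.925 because the two classes are the same size; on an imbalanced test set the two values diverge, which is why balanced accuracy is reported separately.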
Significance of the study
Applying SMOTE produced balanced datasets. The model with the extra-trees ensemble approach outperformed all other models in the comparative analysis, achieving 99.99% accuracy, precision, recall, and F1-score, with an FPR of 0.00% and a TPR of 99%, in botnet classification across varied datasets.
Specifically, the 0.00% FPR achieved by the extra-trees model demonstrated its high accuracy in distinguishing botnet instances from regular instances. The model with the extreme gradient boosting ensemble technique displayed the second-best performance in these metrics.
The extra-trees model displayed the highest BACC score of 0.9999, which indicated its ability to accurately identify botnets and regular instances even when the data is imbalanced. Additionally, the error rate of 0.0000 attained by the extra-trees model showed that its predictions were almost always correct, indicating its reliability for botnet detection tasks.
The model also achieved a training accuracy of 1.0000 and a testing accuracy of 0.9999, which indicated that the extra-trees approach can fit the training data and generalize effectively to unseen data, making the model an effective solution for botnet detection in practical scenarios.
Moreover, the extra-trees model showed a Cohen’s kappa of 0.9999, an AUC of 1.0000, and an observed accuracy of 0.9999, indicating exceptional agreement between the model's predictions and the actual classifications and its ability to detect botnets precisely.
To summarize, the study's findings demonstrated that the proposed hybrid feature selection and ensemble-based ML approach, specifically the model with the extra trees ensemble technique, can be used to identify botnets/botnet attacks reliably and effectively, making it a suitable option for cybersecurity applications.