In a recent submission to the arXiv* preprint server, researchers introduced a novel approach to bolstering cybersecurity through early detection of network intrusions and cyber threats. Rather than relying solely on traditional methods that analyze network traffic on a per-packet or per-connection basis, the proposed methodology pre-processes the network traffic to extract new metrics based on graph theory.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
By considering the network as a whole entity rather than focusing solely on individual packets or connections, this approach aims to overcome certain limitations of classical techniques. Through experimentation on publicly available datasets, the researchers have demonstrated that their approach addresses these limitations and effectively enhances the capability to detect cyber threats. This innovative fusion of graph theory, machine learning (ML), and intrusion detection holds promise for advancing cybersecurity measures.
Related Work
Previous research has extensively explored network intrusion detection methods, mainly through ML techniques that analyze network traffic. However, traditional approaches focus on specific network protocol features such as internet protocol (IP) addresses, transmission control protocol (TCP)/user datagram protocol (UDP) ports, and packet-related attributes. These methods suffer from several limitations, including a loose correlation with the presence of malware, analysis restricted to individual terminals, susceptibility to evasion techniques, and difficulty analyzing encrypted payloads.
Graph-Based Feature Extraction
The process of extracting graph-based features from network traffic datasets is detailed here. These datasets consist of labeled connections between network terminals, with each connection denoted by source and destination IP addresses, a timestamp, and a label indicating whether it's benign or malicious. Additionally, each connection is associated with a set of classic features derived from packet-level analysis.
Researchers constructed a graph representation where each edge corresponds to a connection between terminals. Graph population involves iteratively updating the graph with new edges, and functions for both unweighted and weighted cases are defined. Maintaining granularity involves dividing the dataset into blocks, with each connection associated with metrics computed on a progressively populated graph.
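The sketch below illustrates one possible way to implement the edge-update functions for the unweighted and weighted cases, assuming Python with the networkx library; the library choice, function names, and sample IP addresses are illustrative assumptions rather than the authors' implementation.

```python
import networkx as nx

def add_edge_unweighted(graph, src, dst):
    # Unweighted case: an edge only records that the two terminals communicated.
    graph.add_edge(src, dst)

def add_edge_weighted(graph, src, dst):
    # Weighted case: repeated connections between the same terminals increase the weight.
    if graph.has_edge(src, dst):
        graph[src][dst]["weight"] += 1
    else:
        graph.add_edge(src, dst, weight=1)

G = nx.Graph()
sample_connections = [
    ("192.168.0.2", "10.0.0.5"),   # hypothetical terminals
    ("192.168.0.2", "10.0.0.5"),
    ("192.168.0.7", "10.0.0.5"),
]
for src, dst in sample_connections:
    add_edge_weighted(G, src, dst)
print(G["192.168.0.2"]["10.0.0.5"]["weight"])  # -> 2
```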
After computing graph-based features for each connection, researchers generate a new dataset that excludes classic features. The parameters for graph generation, such as edge weighting and block size, allow for flexibility in dataset creation to capture different aspects of network behavior. This methodology enables learning network behavior over progressive steps rather than just a final snapshot.
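Building on the previous sketch, the following illustrative code associates each connection with metrics computed on a progressively populated graph, recomputing them per block; the block size and the specific metrics shown (node degree and clustering coefficient) are assumptions made for illustration, not the paper's exact feature set.

```python
import networkx as nx

def build_graph_feature_dataset(connections, block_size=100, weighted=False):
    """connections: iterable of (src_ip, dst_ip, timestamp, label) tuples."""
    G = nx.Graph()
    clustering = {}
    rows = []
    for i, (src, dst, ts, label) in enumerate(connections):
        if weighted and G.has_edge(src, dst):
            G[src][dst]["weight"] += 1           # repeated connection: bump weight
        else:
            G.add_edge(src, dst, weight=1)
        # Recompute clustering coefficients once per block to keep the cost manageable.
        if i % block_size == 0:
            clustering = nx.clustering(G, weight="weight" if weighted else None)
        rows.append({
            "degree_src": G.degree(src),
            "degree_dst": G.degree(dst),
            "cc_src": clustering.get(src, 0.0),
            "cc_dst": clustering.get(dst, 0.0),
            "label": label,                       # classic packet-level features are dropped
        })
    return rows
```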
ML-Based Detection Process
ML-based detection in this study focuses on a binary classification problem using the graph-based features generated by the approach, employing a support vector machine with a radial basis function kernel (SVM-RBF). By using the RBF kernel trick, researchers transform the data into a space where it becomes linearly separable. To conduct experiments, researchers utilize the Canadian Institute for Cybersecurity Intrusion Detection Systems 2017 (CIC-IDS2017) dataset, which comprises network traffic collected over five consecutive days, characterized by 80 traffic-related features extracted using the CICFlowMeter software. The dataset labels each connection as representing normal behavior or belonging to a specific attack category.
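A minimal sketch of the classification step, using scikit-learn's SVC with an RBF kernel, is shown below; the feature matrix and labels are synthetic placeholders rather than CIC-IDS2017 data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two graph-based features per connection
y = (np.linalg.norm(X, axis=1) > 1.2).astype(int)   # toy non-linear decision boundary

clf = SVC(kernel="rbf", gamma="scale", C=1.0)       # RBF kernel handles the non-linear case
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```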
The process starts with generating a graph-based dataset, as outlined in the previous section, and splitting it into training and testing sets. Researchers designate the first two days' traffic as the training set, observing that the initial day contains only normal behavior, and the last three days' traffic as the test set.
Specifically, for the training set, researchers include all malicious connections from the second day, which belong to the file transfer protocol (FTP)-Patator and secure shell (SSH)-Patator attacks, and perform an undersampling step on the benign class to balance the classes.
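The following hedged sketch shows how such a split and benign-class undersampling could look in pandas, assuming a pre-processed table with hypothetical "day" and "label" columns.

```python
import pandas as pd

def split_and_balance(df, seed=42):
    # Training set: first two days; test set: last three days (column names are assumptions).
    train = df[df["day"].isin([1, 2])]
    test = df[df["day"].isin([3, 4, 5])]

    malicious = train[train["label"] != "benign"]
    # Undersample the (majority) benign class to match the number of malicious connections.
    benign = train[train["label"] == "benign"].sample(n=len(malicious), random_state=seed)

    balanced_train = pd.concat([malicious, benign]).sample(frac=1, random_state=seed)
    return balanced_train, test
```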
After generating and splitting the graph-based dataset, researchers execute several subsequent steps to prepare it for analysis. These steps include feature scaling, feature selection using forward feature selection (FFS) with an SVM-RBF estimator, hyperparameter tuning of γ and C using a 5-fold cross-validated grid search, assessing model robustness through 10-fold cross-validation, and finally, training the model on the entire training set and evaluating its performance on the test set. Researchers repeated this process for different parameter sets to comprehensively analyze the model's performance under various conditions.
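The scikit-learn pipeline below mirrors the described procedure (scaling, forward feature selection, grid search over γ and C, and cross-validated robustness checks); it is an illustrative reconstruction, not the researchers' released code.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scale", StandardScaler()),                               # feature scaling
    ("select", SequentialFeatureSelector(                      # forward feature selection (FFS)
        SVC(kernel="rbf"), n_features_to_select=2, direction="forward")),
    ("svm", SVC(kernel="rbf")),
])

# 5-fold cross-validated grid search over gamma and C (grid values are illustrative).
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)

# Usage sketch, assuming X_train, y_train, X_test, y_test are already prepared:
# search.fit(X_train, y_train)                                 # tune gamma and C
# scores = cross_val_score(search.best_estimator_,             # 10-fold robustness check
#                          X_train, y_train, cv=10, scoring="f1")
# final_model = search.best_estimator_.fit(X_train, y_train)   # train on full training set
# y_pred = final_model.predict(X_test)                         # evaluate on the test set
```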
Experimental Analysis
The methods outlined earlier were validated through numerical simulations, with the source code made publicly available to ensure reproducibility. The results of these experiments reveal several noteworthy observations: feature selection consistently yields two significant features, predominantly the clustering coefficient (CC), achieving near-perfect separation of the training data instances.
Hyperparameter tuning minimizes hyperplane complexity while not significantly improving the F1 score due to already well-separated training instances. Model robustness analysis demonstrates consistently high average F1 scores and low standard deviations, indicating stable performance across different parameter configurations.
Analyzing performance on the test dataset, researchers found that certain combinations of parameters yielded superior results. Notably, configurations utilizing unweighted graphs outperformed weighted counterparts, achieving comparable results for smaller values of σ. However, for σ = N, models operating on weighted graphs failed to distinguish normal from malicious behavior, indicating overfitting.
Consequently, researchers discarded the weighted case and selected the remaining configurations as optimal, delivering excellent detection performance for most attacks while maintaining minimal false positive rates (FPRs). A comparison with state-of-the-art intrusion detection systems (IDSs) showcased the superiority of the approach. Despite requiring fewer features and a smaller training set, the method achieved better results across all evaluation metrics, highlighting its effectiveness in achieving superior detection with fewer training samples and underscoring the reliability and efficiency of the approach.
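For reference, the F1 score and FPR mentioned above can be computed from a confusion matrix as in the small sketch below, which uses placeholder labels and predictions rather than the paper's results.

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 1, 1, 1, 0, 1]   # placeholder test labels (0 = benign, 1 = malicious)
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                 # false positive rate
print("F1:", f1_score(y_true, y_pred), "FPR:", fpr)
```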
Conclusion
To sum up, the method leverages metrics derived from graph theory to detect cyber threats in network traffic. By using a graph-based approach to model interactions between network nodes, rather than analyzing the content of the exchanged information, the researchers addressed limitations of previous methods, such as challenges in analyzing encrypted packets and detecting elusive malicious behaviors. The approach was validated through experiments on the CIC-IDS2017 dataset, demonstrating superior performance compared to previous methods that analyze connections in isolation. These promising results indicate the potential for further research, including larger-scale experiments across diverse scenarios.