Graph-Based Machine Learning for Advanced Cyber Threat Detection

In a recent submission to the arXiv* server, researchers have introduced a novel approach to bolstering cybersecurity through early detection of network intrusions and cyber threats. Rather than relying solely on traditional methods that analyze network traffic on a per-packet or per-connection basis, their proposed methodology involves pre-processing the network traffic to extract new metrics based on graph theory.

Study: Graph-Based Machine Learning for Advanced Cyber Threat Detection. Image credit: Pungu x/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

By considering the network as a whole entity rather than focusing solely on individual packets or connections, this approach aims to overcome certain limitations of classical techniques. Through experimentation on publicly available datasets, the researchers have demonstrated that their approach addresses these limitations and effectively enhances the capability to detect cyber threats. This innovative fusion of graph theory, machine learning (ML), and intrusion detection holds promise for advancing cybersecurity measures.

Related Work

Previous research has extensively explored network intrusion detection methods, mainly through ML techniques analyzing network traffic. However, traditional approaches focus on specific network protocol features such as internet protocol (IP) addresses, transmission control protocol (TCP)/user datagram protocol (UDP) ports, and packet-related attributes. These methods have several limitations, including a loose correlation with malware presence, a focus on analyzing individual terminals, susceptibility to evasion techniques, and difficulty analyzing encrypted payloads.

Graph-Based Feature Extraction

The process of extracting graph-based features from network traffic datasets is detailed here. These datasets consist of labeled connections between network terminals, with each connection denoted by source and destination IP addresses, a timestamp, and a label indicating whether it's benign or malicious. Additionally, each connection is associated with a set of classic features derived from packet-level analysis.

Researchers constructed a graph representation where each edge corresponds to a connection between terminals. Graph population involves iteratively updating the graph with new edges, and functions for both unweighted and weighted cases are defined. Maintaining granularity involves dividing the dataset into blocks, with each connection associated with metrics computed on a progressively populated graph.
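The block-wise graph population described above can be illustrated with a minimal sketch for the unweighted case. It is not the authors' code: the function names, the choice to compute metrics after each block has been added, and the use of the local clustering coefficient as the only metric are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adj, node):
    """Local clustering coefficient of `node` in an undirected, unweighted graph."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count the edges that exist between pairs of neighbours.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def graph_features(connections, block_size):
    """Return one (src, dst, cc_src, cc_dst) row per connection.

    The graph is populated block by block (assumed order: chronological).
    Each connection is described by metrics computed on the graph as it
    stands once its own block has been added (an assumption of this sketch).
    """
    adj = defaultdict(set)
    rows = []
    for start in range(0, len(connections), block_size):
        block = connections[start:start + block_size]
        for src, dst in block:
            if src != dst:  # ignore self-loops
                adj[src].add(dst)
                adj[dst].add(src)
        for src, dst in block:
            rows.append((src, dst,
                         local_clustering(adj, src),
                         local_clustering(adj, dst)))
    return rows


# Hypothetical toy traffic: (source terminal, destination terminal) pairs.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
features = graph_features(toy, block_size=3)
```

A smaller `block_size` gives finer granularity at the cost of recomputing metrics more often; a block as large as the whole dataset would describe every connection with the final snapshot only.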

After computing graph-based features for each connection, researchers generate a new dataset that excludes classic features. The parameters for graph generation, such as edge weighting and block size, allow for flexibility in dataset creation to capture different aspects of network behavior. This methodology enables learning network behavior over progressive steps rather than just a final snapshot.

ML-Based Detection Process

ML-based detection in this study is framed as a binary classification problem over the graph-based features generated by the approach, using a support vector machine with a non-linear radial basis function kernel (SVM-RBF). Through the kernel trick, the RBF kernel implicitly maps the data into a higher-dimensional space in which the classes can become linearly separable. To conduct experiments, researchers utilize the Canadian Institute for Cybersecurity Intrusion Detection Systems 2017 (CIC-IDS2017) dataset, which comprises network traffic collected over five consecutive days, characterized by 80 traffic-related features extracted using the CICFlowMeter software. The dataset labels each connection as representing normal behavior or belonging to a specific attack category.
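For readers unfamiliar with the RBF kernel, the following sketch shows the standard kernel formula, K(x, y) = exp(-γ‖x − y‖²), and the usual SVM decision function built on it. The function names and the toy support vectors are hypothetical; in practice a library would learn the coefficients during training.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def decision_value(support_vectors, dual_coefs, bias, x, gamma):
    """SVM decision function: sum_i coef_i * K(sv_i, x) + bias.

    `dual_coefs` holds alpha_i * y_i for each support vector; the sign of
    the returned value gives the predicted class.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical, hand-picked model with one support vector per class.
svs = [[0.0, 0.0], [1.0, 1.0]]
coefs = [-1.0, 1.0]
near_positive = decision_value(svs, coefs, 0.0, [0.9, 1.0], gamma=0.5)
near_negative = decision_value(svs, coefs, 0.0, [0.1, 0.0], gamma=0.5)
```

A larger γ makes each support vector's influence more local, which is one reason γ is tuned jointly with the regularization parameter C.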

The process starts with generating a graph-based dataset, as outlined in the previous section, and splitting it into training and testing sets. Researchers designate the first two days' traffic as the training set and the last three days' traffic as the test set, noting that the first day contains only normal behavior.

Specifically, for the training set, researchers adopt an approach where all malicious connections from the second day, belonging to file transfer protocol (FTP)-patator and secure shell (SSH)-patator, are included, and an undersampling step is performed for the benign class to balance the classes.
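The day-based split and the benign undersampling step might look roughly like the sketch below. The record fields (`day`, `label`), the `"BENIGN"` label string, and the 1:1 target ratio are assumptions for illustration; the article only states that the benign class is undersampled to balance the classes.

```python
import random


def split_by_day(rows, train_days):
    """Chronological split: rows from `train_days` train, the rest test."""
    train = [r for r in rows if r["day"] in train_days]
    test = [r for r in rows if r["day"] not in train_days]
    return train, test


def undersample_benign(rows, seed=0):
    """Keep every malicious row and an equally sized random benign sample."""
    malicious = [r for r in rows if r["label"] != "BENIGN"]
    benign = [r for r in rows if r["label"] == "BENIGN"]
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept = rng.sample(benign, min(len(benign), len(malicious)))
    return malicious + kept


# Hypothetical toy records: day 1 is all benign, day 2 mixes in attacks.
rows = (
    [{"day": 1, "label": "BENIGN"} for _ in range(10)]
    + [{"day": 2, "label": "BENIGN"} for _ in range(6)]
    + [{"day": 2, "label": "FTP-Patator"} for _ in range(2)]
    + [{"day": 2, "label": "SSH-Patator"} for _ in range(2)]
    + [{"day": 3, "label": "BENIGN"} for _ in range(5)]
)
train, test = split_by_day(rows, train_days={1, 2})
balanced_train = undersample_benign(train)
```

Only the training set is rebalanced here; the test set keeps its natural class distribution so that reported metrics reflect realistic traffic.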

After generating and splitting the graph-based dataset, researchers execute several subsequent steps to prepare it for analysis. These steps include feature scaling, feature selection using forward feature selection (FFS) with an SVM-RBF estimator, hyperparameter tuning of γ and C using a 5-fold cross-validated grid search, assessing model robustness through 10-fold cross-validation, and finally, training the model on the entire training set and evaluating its performance on the test set. Researchers repeated this process for different parameter sets to comprehensively analyze the model's performance under various conditions.
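Of these steps, forward feature selection is perhaps the least familiar. The sketch below shows the greedy idea with a pluggable scoring function. In the study the score would come from the SVM-RBF estimator; here `score_fn`, the feature names, and the toy gains are hypothetical stand-ins so the example stays self-contained.

```python
def forward_feature_selection(feature_names, score_fn, max_features):
    """Greedy FFS: repeatedly add the feature that most improves the score.

    `score_fn(features)` returns a quality score (higher is better) for a
    candidate feature subset. Selection stops when no remaining feature
    improves the score or when `max_features` is reached.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(feature_names)
    while remaining and len(selected) < max_features:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical scorer: each feature contributes a fixed gain to the score.
toy_gains = {"cc": 0.9, "degree": 0.05, "betweenness": -0.1}


def toy_score(features):
    return sum(toy_gains[f] for f in features)


chosen, score = forward_feature_selection(
    ["degree", "cc", "betweenness"], toy_score, max_features=3
)
```

With a real estimator, `score_fn` would typically return a cross-validated F1 score for a model trained on the candidate subset, which makes each round considerably more expensive than in this toy.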

Experimental Analysis

The methods outlined earlier were validated through numerical simulations, ensuring reproducibility by making the source code publicly available. The results of these experiments reveal several noteworthy observations: Feature selection consistently yields two significant features, predominantly the clustering coefficient (CC), achieving near-perfect separation of training data instances.

Hyperparameter tuning minimizes hyperplane complexity while not significantly improving the F1 score due to already well-separated training instances. Model robustness analysis demonstrates consistently high average F1 scores and low standard deviations, indicating stable performance across different parameter configurations.

Analyzing performance on the test dataset, researchers found that certain combinations of parameters yielded superior results. Notably, configurations using unweighted graphs outperformed their weighted counterparts, although the two achieved comparable results for smaller values of σ. However, for σ = N, models operating on weighted graphs failed to distinguish normal from malicious behavior, indicating overfitting.

Consequently, researchers discarded the weighted case and selected the optimal configurations, which delivered excellent detection performance for most attacks while maintaining minimal false positive rates (FPRs). A comparison with state-of-the-art intrusion detection systems (IDS) showcased the superiority of the approach. Despite requiring fewer features and a smaller training set, this method achieved better results across all evaluation metrics, underscoring its reliability and efficiency.
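As a reminder of how the two headline metrics are defined, the sketch below computes the F1 score and the FPR from binary labels using the standard formulas. The function name and the toy labels are illustrative only and are not results from the study.

```python
def detection_metrics(y_true, y_pred):
    """Compute F1 and FPR for binary labels (1 = malicious, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # FPR: share of benign connections wrongly flagged as malicious.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical toy predictions: 3 malicious and 5 benign connections.
metrics = detection_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

A low FPR matters in intrusion detection because benign traffic usually dominates, so even a small rate can translate into many false alarms.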

Conclusion

To sum up, the method leveraged metrics derived from graph theory to detect cyber threats from network traffic. By modeling interactions between network nodes as a graph, rather than analyzing the content of exchanged information, the approach addressed limitations of previous methods, such as challenges in analyzing encrypted packets and detecting elusive malicious behaviors. The approach was validated through experiments with the CIC-IDS2017 dataset, demonstrating superior performance compared to previous standalone connection analysis methods. These promising results indicate potential for further research, including larger-scale experiments across diverse scenarios.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, February 15). Graph-Based Machine Learning for Advanced Cyber Threat Detection. AZoAi. Retrieved on January 15, 2025 from https://www.azoai.com/news/20240215/Graph-Based-Machine-Learning-for-Advanced-Cyber-Threat-Detection.aspx.

  • MLA

    Chandrasekar, Silpaja. "Graph-Based Machine Learning for Advanced Cyber Threat Detection". AZoAi. 15 January 2025. <https://www.azoai.com/news/20240215/Graph-Based-Machine-Learning-for-Advanced-Cyber-Threat-Detection.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "Graph-Based Machine Learning for Advanced Cyber Threat Detection". AZoAi. https://www.azoai.com/news/20240215/Graph-Based-Machine-Learning-for-Advanced-Cyber-Threat-Detection.aspx. (accessed January 15, 2025).

  • Harvard

    Chandrasekar, Silpaja. 2024. Graph-Based Machine Learning for Advanced Cyber Threat Detection. AZoAi, viewed 15 January 2025, https://www.azoai.com/news/20240215/Graph-Based-Machine-Learning-for-Advanced-Cyber-Threat-Detection.aspx.
