Cyber threat intelligence (CTI) is a branch of cyber security (CS) concerned with the contextual information surrounding cyber-attacks. This involves understanding the past, present, and future tactics, techniques, and procedures (TTPs) of diverse threat actors. Organizations use CTI to help their security teams protect networks from cyber-attacks by integrating threat data feeds into their systems and networks. This article discusses the importance and applications of artificial intelligence (AI) in CTI.
CTI Basics
The complexity and frequency of cyber threats are constantly growing as cybercriminals successfully bypass organizations' security controls using customized TTPs and intrusion kill chains. Developing and implementing robust CTI is a feasible approach to mitigating security breaches. For instance, nation-states already use CTI as an efficient means of devising preventive CS measures in advance.
CTI is a proactive security measure involving the real-time gathering, collation, and analysis of information on potential attacks to prevent data breaches and their consequences. Its primary objective is to provide thorough information on the security threats posing significant risk to an organization's infrastructure, while guiding security teams on preventive actions.
Importance of AI in CTI
AI methods, specifically machine learning (ML), can significantly improve CS measures as cyber-attacks continue to grow in volume and sophistication. For instance, AI/ML-powered CS applications perform network anomaly detection more effectively than conventional methods.
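To make the anomaly-detection idea concrete, the sketch below flags hosts whose connection counts deviate sharply from the baseline using a simple z-score test. The hostnames, counts, and threshold are invented for illustration; production detectors use far richer features and models.

```python
from statistics import mean, stdev

def zscore_anomalies(counts, threshold=3.0):
    """Flag values whose z-score exceeds the threshold (illustrative only)."""
    mu, sigma = mean(counts.values()), stdev(counts.values())
    return {host: round((n - mu) / sigma, 2)
            for host, n in counts.items()
            if sigma and abs(n - mu) / sigma > threshold}

# Hypothetical per-host connection counts over one hour.
counts = {"10.0.0.1": 120, "10.0.0.2": 130, "10.0.0.3": 125,
          "10.0.0.4": 118, "10.0.0.5": 9500}
print(zscore_anomalies(counts, threshold=1.5))  # flags only 10.0.0.5
```

A single statistical feature like this mostly illustrates the baseline that ML methods improve upon; learned models can weigh many such signals at once.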
Several AI/ML applications are employed in CS solutions, including hacking incident forecasting, CS ratings, secure user authentication, botnet detection, credit scoring and next-best offers, fraud detection, network intrusion detection and prevention, and spam filter applications.
In the context of CTI, AI/ML methods let organizations automate data acquisition and processing, integrate with their current security solutions, ingest unstructured data from diverse sources, and link information from various places by adding context on indicators of compromise and the modi operandi of malicious actors.
This is especially important in the big data context, as the massive processing scales necessitate comprehensive automation. The processing must include fusing data points from different sources, such as technical, dark web, deep web, and open web sources, to devise a more effective strategy.
This approach helps convert these massive amounts of data into actionable CTI. Additionally, by separating and assembling concepts, AI/ML techniques can structure the data into categories of entities based on their names, properties, events, and relationships to each other.
This facilitates robust searches on categories, automating data sorting and eliminating manual effort. AI/ML techniques can effectively structure text in multiple languages through natural language processing (NLP). For instance, text from vast numbers of unstructured documents across several languages can be analyzed and categorized into language-independent groups and events.
Moreover, ML methods can be developed to categorize text as code, data logs, or prose, and to resolve ambiguities between entities sharing the same name using contextual clues in the surrounding text. By applying statistical methods and ML, events and entities can be sorted even further by significance, for example by assigning risk scores to malicious entities.
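The sketch below illustrates the kind of surface features such a text-type classifier relies on, using hand-written rules in place of learned weights. The regular expressions and example strings are invented for illustration; a trained model would learn these signals from labeled data rather than hard-code them.

```python
import re

def categorize(text):
    """Crude heuristic stand-in for an ML text-type classifier (illustrative)."""
    # Signals a real model might learn: timestamps/IPs/log levels suggest logs,
    # braces and keywords suggest code, sentence punctuation suggests prose.
    log_score = len(re.findall(
        r"\d{4}-\d{2}-\d{2}|\d+\.\d+\.\d+\.\d+|\bERROR\b|\bINFO\b", text))
    code_score = len(re.findall(r"[{};]|def |import |return ", text))
    prose_score = len(re.findall(r"[a-z]+[.!?](\s|$)", text))
    scores = {"log": log_score, "code": code_score, "prose": prose_score}
    return max(scores, key=scores.get)

print(categorize("2024-05-01 INFO connection from 10.0.0.5"))             # log
print(categorize("def parse(line): return line.split(';')"))             # code
print(categorize("The attackers moved laterally. Data was stolen."))     # prose
```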
Risk scores are calculated by an ML classifier trained on a previously analyzed dataset. The classifier delivers both the score and the context behind it, such as the judgment that multiple independent sources have verified a particular IP address as malicious.
Risk classification automation saves a significant amount of time by effectively sorting through false positives and determining which risks must be prioritized. ML can also predict events and entity properties, generating predictive models from previously mined and categorized data pools that are more accurate than those developed manually.
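A minimal sketch of the scoring-with-context idea: combine per-source verdicts on an indicator into a single risk score, and return the contributing sources as the explanation. The source names and weights are entirely hypothetical; a trained classifier would learn them from labeled examples.

```python
# Hypothetical source weights; a trained model would learn these from data.
SOURCE_WEIGHTS = {"sandbox_detonation": 0.5, "dark_web_mention": 0.3,
                  "open_web_report": 0.15, "spam_trap": 0.05}

def risk_score(evidence):
    """Combine per-source verdicts (0..1 confidence) into a 0-100 score,
    plus the context explaining it. Illustrative, not a real product API."""
    score = 100 * sum(SOURCE_WEIGHTS[s] * conf for s, conf in evidence.items())
    # Context: sources ordered by their contribution to the score.
    context = sorted(evidence, key=lambda s: SOURCE_WEIGHTS[s] * evidence[s],
                     reverse=True)
    return round(score), context

score, context = risk_score({"sandbox_detonation": 0.9, "dark_web_mention": 0.6})
print(score, context)  # 63 ['sandbox_detonation', 'dark_web_mention']
```

Returning the ranked evidence alongside the number is what lets an analyst quickly dismiss false positives instead of re-investigating every alert.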
ML techniques can also serve as active sensors feeding data into a common threat intelligence network shared by the entire user base. However, the implementation of AI/ML methods across the various CTI levels is at very different stages of maturity. For instance, work on operational intelligence is still at the research and experimentation stage and requires substantial resources.
ML Applications in CTI
Applications of ML techniques to threat intelligence, especially attribution, are currently being developed, tested, and refined. Attribution is expected to remain a major problem due to its political and convoluted nature. However, ML can automate several parts of the analysis process, increasing the scalability of attribution and threat intelligence efforts by reducing the analyst workload.
Microsoft Defender Advanced Threat Protection: The Microsoft Defender Advanced Threat Protection Research Team has developed an NLP system that extracts TTPs from publicly available documents, identifies categories, and labels relationships between those identified categories. Specifically, the system is trained on documentation of known threats, receives unstructured text as input, and identifies attack techniques, threat actors, malware families, and relationships to create attacker timelines and graphs.
This ML model was employed to identify techniques common to the Emotet malware family and identified threat actor groups, enabling organizations to implement defensive choke points against these techniques, stopping both commodity malware and high-profile targeted attacks. The platform is thus a good example of ML being applied to provide actionable threat intelligence for preventing cyber-attacks.
APTinder: APTinder is an ML model under development by FireEye with the objective of assisting in automating the daunting manual process of intelligence analysis and threat actor grouping. FireEye possesses the large existing dataset required for the model's development.
The primary objectives of the model are to build a single interpretable similarity metric between groups, assess past analytical decisions, and identify new potential matches. Every potentially unique feature or topic has its own model, which allows fine-tuning each model and adjusting topic weights for the final grouping. Similar to the approach followed by the Microsoft Defender Advanced Threat Protection Research Team, the data required for FireEye's project is gathered from a vast body of reports.
The inverse document frequency (IDF) technique scores the uniqueness of terms from the vectorized reports. Cosine similarity is then employed to measure the similarity between different groups, each represented by the vectors from the earlier step. Cosine similarity is the cosine of the angle between two vectors, which captures how closely the vectors point in the same direction. This process is repeated for every topic or category.
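The IDF-plus-cosine-similarity pipeline described above can be sketched in a few lines. The "reports" here are toy bags of observed techniques for three invented groups; FireEye's actual features, weighting, and data are far richer.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: terms appearing in fewer docs score higher."""
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(doc, vocab, docs):
    """Vectorize one document: term frequency weighted by IDF."""
    tf = Counter(doc)
    return [tf[t] * idf(t, docs) for t in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy "reports" per group: bags of observed techniques (invented for illustration).
groups = {
    "GroupA": ["spearphishing", "powershell", "credential_dumping"],
    "GroupB": ["spearphishing", "powershell", "lateral_movement"],
    "GroupC": ["supply_chain", "firmware_implant", "lateral_movement"],
}
docs = list(groups.values())
vocab = sorted({t for d in docs for t in d})
vecs = {g: tfidf_vector(d, vocab, docs) for g, d in groups.items()}
print(round(cosine(vecs["GroupA"], vecs["GroupB"]), 3))  # 0.378
print(round(cosine(vecs["GroupA"], vecs["GroupC"]), 3))  # 0.0
```

Note how IDF already does useful work here: the shared but common techniques (spearphishing, powershell) contribute less to the similarity than a rare shared technique would.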
Although the categories are currently weighted with a straight average, an objective weighting system based on existing data still needs to be built for the overall model. Robust attack attribution can only be achieved by automating the analysis of massive quantities of data in a scalable manner.
Recent Developments
A paper published in Information Systems Security proposed a model to generate actionable threat intelligence by implementing a supervised ML approach employing the Naïve Bayes classifier. The objective of the ML-based model was to extract the potential threat intelligence from structured data sources and predict the threat.
Although several algorithms, such as convolutional neural networks and recurrent neural networks, are effective for text analysis and NLP, the researchers here adopted the Naïve Bayes classifier to extract high-level threat intelligence from the dataset through text classification, reserving 70% of the data for training and 30% for testing.
The Naïve Bayes model assumes that every feature is conditionally independent of the others, so each feature contributes to the prediction on its own, without correlations. The text vector served as the data feed for training this model, which was followed by testing to assess the model's performance. Eventually, true events, such as malware or threats, were predicted for unknown data.
The researchers evaluated several performance metrics of the model, such as F1-score, precision, and accuracy, against the training and test datasets. The model achieved 98.2% accuracy on the training dataset and 96.6% on the test dataset.
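A minimal hand-rolled version of the approach shows the mechanics: a multinomial Naïve Bayes text classifier with Laplace smoothing, trained on a tiny invented set of labeled log snippets. The training sentences, labels, and test queries below are made up for illustration; the paper worked with much larger structured data sources and a 70/30 train/test split.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with Laplace smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.word_counts[c][w] + 1) / total)
                for w in doc.lower().split())
        return max(self.classes, key=log_prob)

# Invented training data; labels mark each snippet as a threat or benign event.
docs = ["trojan dropper detected on host", "ransomware encrypted files",
        "routine backup completed", "scheduled patch applied successfully",
        "malware beacon to command server", "system update finished"]
labels = ["threat", "threat", "benign", "benign", "threat", "benign"]

nb = NaiveBayes()
nb.fit(docs, labels)
print(nb.predict("ransomware beacon detected"))    # threat
print(nb.predict("backup finished successfully"))  # benign
```

The per-word independence assumption is exactly what the paper's model relies on: each term votes for a class on its own, which keeps training and prediction cheap even at CTI data volumes.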
To summarize, AI/ML techniques are playing an effective role in automating and improving CTI. However, the challenges of using ML, especially bias and discrimination, explainability, and adversarial attacks, must be addressed to fully exploit the advantages of these techniques.
References and Further Reading
Montasari, R., Carroll, F., Macdonald, S., Jahankhani, H., Hosseinian-Far, A., Daneshkhah, A. (2021). Application of artificial intelligence and machine learning in producing actionable cyber threat intelligence. Digital Forensic Investigation of Internet of Things (IoT) Devices, 47-64. https://doi.org/10.1007/978-3-030-60425-7_3
Barker, C. (2020). Applications of Machine Learning to Threat Intelligence, Intrusion Detection and Malware. https://digitalcommons.liberty.edu/honors/985
Dutta, A., Kant, S. (2020). An overview of cyber threat intelligence platform and role of artificial intelligence and machine learning. Information Systems Security, 81-86. https://doi.org/10.1007/978-3-030-65610-2_5