Detecting Malicious Traffic with Machine Learning

Edgecast
Aug 29, 2019

By Paul Rigor, Ph.D., Research Scientist, and Harkeerat Bedi, Ph.D., Research Scientist

The Future of Threat Management

Machine learning offers a unique opportunity to automate security research and stay ahead of application security threats. In this article, we’ll share our research into applying neural networks to production traffic data, as well as some ideas for future applications. By applying predictive analysis techniques to incoming requests to our platform, we’re developing new ways to detect, analyze, and mitigate malicious traffic.

Incorporating the output of the machine learning algorithms into our WAF product allows us to create a more robust threat detection system, enabling more accurate detection of malicious traffic toward our platform. Coupled with the flexibility of a Dual WAF platform, we can quickly incorporate these insights directly into a production configuration for a customer’s live traffic to help mitigate threats. Machine learning helps us develop agile threat detection and mitigation systems that can grow to meet emerging threats.

Threat Management — Helping Customers Mitigate Threats and Save Money

Recognizing the difference between legitimate and malicious traffic is a continuous balancing act. Actual users expect to receive their requested data in real time, while malicious traffic must be mitigated just as quickly. This situation becomes complicated as traffic grows. Our customers need systems that make the right decisions — quickly, consistently, and accurately.

The reputation database helps by monitoring incoming requests and filtering out clients with a history of bad actions. Reputation databases take a significant amount of data to build and time to refine. Because of that, they tend to lag behind evolving and emerging threats, especially for automated traffic, also referred to as bot traffic. Machine learning offers an opportunity to scale these capabilities without relying on costly manual analysis of massive amounts of data.

Machine Learning in a Network Environment

Machine learning applies algorithms to find patterns in existing data and then uses these derived patterns to classify new data. The steps in this process are:

1. Clustering data points within similar behavioral groups

Unsupervised learning generates clusters of requests based on different user behavior. Every request has its own signature, made up of properties such as inter-request time, the total number of unique endpoints, IP location, and even prior metadata such as reputational ranking. The key idea is that data points in each cluster exhibit similar behavior.

2. Training a classifier to recognize and assign new requests to these groups

Neural networks map the properties of each class and their respective class labels. These neural networks are trained against marked traffic data to identify requests within the desired parameters consistently. They are then tested with new, unmarked traffic data.

3. Verifying the results

The neural network predictions are validated against data provided by third parties, such as OWASP and Trustwave, and against the Common Vulnerabilities and Exposures (CVE) list. Independent customer validation is also a viable basis for comparison.

Within a network environment, this can lead to better user traffic analysis and more accurate reputation classifications. Active machine learning systems can reincorporate these insights for even greater improvements.

Security automation relies on a constant evaluation of all incoming traffic. In this environment, machine learning algorithms are capable of analyzing large quantities of inputs and deriving actionable intelligence (such as a reputation database of bad clients and IPs) based on real-time traffic data. This has a wide range of network applications.

Applying Machine Learning to Improve Security

Innovations need to be properly applied to maximize their benefits. Machine learning performs best in environments with:

  • High Data Density
  • Clear, Actionable Outcomes
  • Quantifiable Variation

Fortunately, there are many ideal environments for these optimizations on the Internet. The content delivery network (CDN), in particular, has access to large amounts of Internet traffic data that’s ideal for applying predictive analysis.

Our Experiments Using Machine Learning to Detect Malicious Traffic

Verizon Media’s research team uses a semi-supervised approach, combining unsupervised clustering with a smaller set of expert-labeled traffic data, to analyze large volumes of network traffic effectively.

By training an algorithm against historical production traffic data, the mathematical model gets a head start in creating a data classification. Subsequent data can be classified with existing indexes, and new clusters can be derived that more efficiently group the data. As models for differentiating traffic emerge, they can be tested against new logs and compared against existing security rulesets for accuracy. Improvements derived from the model can be integrated into the ruleset, and the process can be repeated.

Our preliminary experiments in applying machine learning to production traffic logs have returned promising results. We took real, historical traffic data from an active timeframe and worked with top security professionals to identify bad requests, including those that might have made it past current filtering methods. We then compared their findings with the results from our machine learning algorithm and found that the algorithm outperformed manual traffic data analysis in differentiating good traffic from malicious traffic.

When provided with the raw traffic data (and no post-analysis coaching), our model had a 100% discovery rate for high-confidence predictions. Rather than matching IPs against an existing blacklist, it was able to analyze requests based on their request attributes (e.g., user agents, query parameters, and cookie values) and return classifications in a production environment that perfectly correlated with our expert-derived conclusions. Even in low-confidence scenarios, our model was able to return results that exceeded current techniques for identifying a malicious client and building a reputational database.

How Machine Learning Sorts and Analyzes Traffic Data: Clustering, Training, Verifying, and Refining

Machine learning algorithms can analyze traffic data based on a variety of request behaviors. While most current security rulesets are based on signatures, the signatures of bot attacks are constantly changing. As a result, security researchers often have to continuously review their traffic data and manually generate custom security rules to mitigate this automated traffic.

Our research focuses on identifying malicious traffic before it becomes a problem. In the example outlined below, we’re dealing with production logs from our WAF, collected over three weeks. Using an internal analytics tool, we’re able to interact with the raw log data in real time. In the graphics below, we’ve simplified this to only two axes (Inter-request time and unique endpoints), but in actual experiments, we evaluated incoming data on more than 300 different properties.
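To make the two plotted properties concrete, here is a minimal sketch of how they might be derived from raw logs. The record layout (client IP, Unix timestamp, endpoint) and the function name `client_features` are assumptions for illustration, not the format of the actual WAF logs or the internal analytics tool.

```python
from collections import defaultdict
from statistics import mean

def client_features(requests):
    """Aggregate raw request records into per-client behavioral features.

    `requests` is an iterable of (client_ip, unix_timestamp, endpoint)
    tuples. This layout is hypothetical, chosen only for the example.
    """
    by_client = defaultdict(list)
    for ip, ts, endpoint in requests:
        by_client[ip].append((ts, endpoint))

    features = {}
    for ip, hits in by_client.items():
        hits.sort()  # tuples sort by timestamp first
        times = [ts for ts, _ in hits]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[ip] = {
            # Mean gap in seconds between consecutive requests;
            # None when the client sent only one request.
            "inter_request_time": mean(gaps) if gaps else None,
            # Number of distinct endpoints (URLs) the client touched.
            "unique_endpoints": len({ep for _, ep in hits}),
        }
    return features

sample = [
    ("198.51.100.7", 0.0, "/login"),
    ("198.51.100.7", 0.5, "/admin"),
    ("198.51.100.7", 1.0, "/wp-login.php"),
    ("203.0.113.20", 0.0, "/home"),
    ("203.0.113.20", 30.0, "/home"),
]
features = client_features(sample)
```

In this toy sample, the first client hits three different endpoints half a second apart, while the second revisits one page after thirty seconds. A real pipeline would compute many more properties per client in the same way.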

Step 1 — Clustering

Unlabeled traffic data is grouped according to common metadata and behavior patterns. Clusters are formed when groups of requests exhibit similarities across different variables.

Figure 1. We generate various groups with different behavior such as Inter-request time and the total number of unique endpoints (or URLs). Data points in each group exhibit similar behavior.
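The clustering step can be illustrated with a small example. The article does not say which clustering algorithm was used, so the sketch below assumes plain k-means as a stand-in, applied to hypothetical (inter-request time, unique endpoints) pairs. In practice the features would usually be normalized first so that no single property dominates the distance.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Deterministic start: take initial centroids spread across the input order.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical clients: (mean inter-request time in seconds, unique endpoints).
points = [(0.2, 40), (0.3, 55), (0.1, 48), (25.0, 3), (30.0, 5), (28.0, 4)]
assignments, centroids = kmeans(points, k=2)
```

The fast, wide-ranging clients end up in one group and the slow, narrow ones in the other. Nothing in this step says which group is malicious; that label comes from the classifier training that follows.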

Step 2 — Training a Classifier

The clustered data is then analyzed against labeled traffic data. This data consists of historical determinations of malicious requests (blacklisted IPs), as prepared by security experts. The machine learning system develops a classifier that can assign clusters of requests to relevant categories; in this case, a binary classifier to detect malicious requests. Through hyperparameter tuning, we were able to optimize cross-validation metrics such as accuracy.
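As a rough illustration of training a binary classifier, the sketch below fits a single sigmoid unit (logistic regression) with gradient descent. This is a deliberately simplified stand-in for the neural networks described in this article, not the model used in the experiments. The feature values, labels, and hyperparameters (`epochs`, `lr`) are all made up for the example, and the features are assumed to be pre-scaled to the range 0 to 1.

```python
import math

def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the malicious class."""
    z = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=1000, lr=0.5):
    """Fit a single sigmoid unit with batch gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the averaged gradient of the log loss.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical scaled features: (inter-request time, unique endpoints).
samples = [(0.01, 0.90), (0.02, 0.80), (0.05, 0.95),   # labeled malicious
           (0.80, 0.10), (0.90, 0.05), (0.70, 0.20)]   # labeled legitimate
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(samples, labels)
bot_like = predict_proba(weights, bias, (0.03, 0.85))
human_like = predict_proba(weights, bias, (0.85, 0.10))
```

The returned probability doubles as a confidence score, so predictions near 0 or 1 can be handled differently from those near 0.5.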

Step 3 — Verification

We test the classifier on new WAF logs to detect known behavior and discover new client IPs exhibiting malicious behavior. This step also focuses on identifying false positives and making sure legitimate traffic isn’t blocked. Results can also be confirmed by comparing against current customer traffic data and external reputation databases; in this experiment, we used Cisco’s Talos Intelligence Group.

Figure 2. When we test a model, we apply the classifier to a new set of data. We want to answer the following questions: Did our model discover new clients that exhibit known bad behavior? Does our model detect requests by bots or valid users (humans)?
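One way to frame these verification questions in code is to compare the set of clients the classifier flags against reference lists. The sketch below is an assumption-laden illustration: the function name `verify`, the reference sets, and the IP addresses are hypothetical, and no particular reputation service API is implied.

```python
def verify(predicted_malicious, known_malicious, known_legitimate):
    """Compare classifier output with reference lists of client IPs."""
    predicted = set(predicted_malicious)
    bad = set(known_malicious)
    good = set(known_legitimate)

    confirmed = predicted & bad          # flagged and independently known bad
    false_positives = predicted & good   # flagged but known legitimate
    missed = bad - predicted             # known bad but not flagged
    # Flagged clients that no reference list covers: candidates for expert review.
    new_candidates = predicted - bad - good

    labeled = len(confirmed) + len(false_positives)
    return {
        "confirmed": sorted(confirmed),
        "false_positives": sorted(false_positives),
        "missed": sorted(missed),
        "new_candidates": sorted(new_candidates),
        # Precision is computed only over flagged clients with a known label.
        "precision": len(confirmed) / labeled if labeled else 0.0,
        "recall": len(confirmed) / len(bad) if bad else 0.0,
    }

report = verify(
    predicted_malicious=["198.51.100.7", "198.51.100.8", "203.0.113.20", "192.0.2.44"],
    known_malicious=["198.51.100.7", "198.51.100.8", "198.51.100.9"],
    known_legitimate=["203.0.113.20"],
)
```

Here `new_candidates` corresponds to newly discovered clients, while `false_positives` tracks legitimate traffic that would have been blocked.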

Step 4 — Revise and Update

This machine learning cycle enables the system to adapt and refine itself over time. This leads to progressive improvements in effective blacklisting, as well as responsive mitigation for emergent threats.

How Machine Learning Can Improve Security Response

A principal benefit of machine learning is its ability to derive immediate inferences from large blocks of data. Legitimate web traffic can potentially consist of requests from millions of concurrent users. Each of these requests may be inspected individually, but when aggregated, they appear random and chaotic.

Malicious bots or automated traffic take advantage of this chaos by altering their request signature to mimic real user request signatures. For human observers, finding these bots would be like picking out a single oddly behaving pixel on a screen full of static. Machine learning algorithms, on the other hand, are fully capable of quickly classifying and categorizing millions of requests, and easily help identify non-conforming behaviors.

For network applications, this means understanding the subtle differences between legitimate and malicious traffic and developing an action plan to deal with this traffic appropriately. The algorithms must quickly recognize the type and frequency of incoming attacks and take mitigating action against these requests without producing false positives for actual user requests.

Traditional network security takes a certain amount of uncertainty as a given and has to deal with a certain percentage of false negatives rather than risking a false positive that could cause service interruptions. However, machine learning is well suited to operate in this grey zone and can make quick decisions without manual input. A powerful, multitenant engine like waflz can incorporate these insights to develop more accurate security rules.

Machine Learning to Enhance Security Research

Where machine learning attempts to respond to incoming data autonomously, classical programming relies on adjustments and new code written by human operators. Each approach offers unique advantages within different environments.

Machine learning will never replace an experienced security team. Instead, it provides companies with highly advanced, complementary tools to accommodate traffic growth while retaining organizational flexibility.

Future Steps — WAF Integration and Automation

While our current research has focused on applying machine learning algorithms to our historical traffic and WAF data to build a better reputation database, we’re exploring new ways to use machine learning within our WAF product in a live environment.

By introducing a machine learning layer within our WAF, we see the potential to simultaneously update and refine WAF rulesets directly in response to incoming traffic. Not only will this allow customers the opportunity to adjust and monitor their client requests, but the adaptive traffic insights allow a real-time response to changes in malicious traffic data.

With a testing framework operating on production data, our Dual WAF capability provides an optimal platform for comparing these machine-learning-derived rulesets against a current production WAF ruleset. Through waflz, customers can A/B test rulesets with actual traffic and quickly refine their security configurations. Dual WAF with machine learning puts networks ahead of derivative forms of malicious traffic, filtering out harmful requests before they become a problem.

Conclusion

Machine learning empowers threat researchers with powerful tools to automate filtering, identify bad actors, and create stronger WAF rulesets. Our experiments have demonstrated that machine learning models can be trained to identify malicious traffic autonomously and with high success rates. This capability can be integrated within WAF in the form of a more robust reputation database. By automatically detecting attacks based on behavior patterns, we can mitigate emergent threats while minimizing disruptive false positives.

Effective network security requires an integrated approach to threat management. Malicious traffic increases the cost of IT operations by exploiting vulnerabilities and causing data breaches and service interruptions. At Verizon Media, we’re taking a comprehensive view of Internet security to help our customers stay ahead of these threats. Our research into machine learning provides an important foundation for the future of our security products.

We’ll be discussing our security solution at IBC2019, Amsterdam, from September 13–17. To learn more or schedule a meeting, click here.

Edgecast

Formerly Verizon Media Platform, Edgecast enables companies to deliver high performance, secure digital experiences at scale worldwide. https://edgecast.com/