IPvest: Clustering the IP traffic of network entities hidden behind a single IP address using machine learning

Abstract

IP networks serve a variety of connected network entities (NEs) such as personal computers, servers, mobile devices, virtual machines, and hosted containers. The growth in the number of NEs, together with technical considerations, has led to a reality where a single IP address is used by multiple NEs. A typical example is a home router using Network Address Translation (NAT). In organizations and cloud environments, a single IP can be used by multiple virtual machines or containers running on a single device. Discovering the number of NEs served by an IP address and clustering their traffic correctly is of value in many use cases, including security, lawful interception, and asset management. In this paper, we introduce IPvest, a system that incorporates unsupervised and supervised learning algorithms based on various features for counting and clustering the network traffic of NEs masqueraded by a single IP. The features are based on the characteristics of operating systems (OSs), NAT behavior, and users' habits. Our model is evaluated on real-world datasets including Windows, Linux-based, Android, and iOS-based devices, containers, virtual machines, and load balancers. We show that IPvest can count the number of NEs and cluster their traffic with high precision, even for containers running on a single device and servers behind a load balancer.

Existing System

• Each group might have a different IP size distribution. However, entities within the same group are expected to share a similar distribution.
• Since the majority of fraudulent clicks are already filtered out by existing detection systems, we use the aggregate distribution of legitimate IP sizes within each group as an estimate of the true (unknown) IP size distribution for that group.
• Next, we use a set of statistical methods to accurately characterize the deviation between the observed and expected distributions.
• Different attacks result in different deviations in the IP size distribution.
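The comparison between an observed and an expected IP size distribution can be illustrated with a minimal sketch. The text above does not name the statistical methods used, so the total variation distance below is only one possible choice, picked here as an assumption. The function names and the sample data are hypothetical, and an IP size is assumed to be a positive integer recorded once per click.

```python
from collections import Counter


def size_distribution(ip_sizes):
    """Turn a list of IP sizes into a normalized histogram {size: probability}."""
    counts = Counter(ip_sizes)
    total = sum(counts.values())
    return {size: n / total for size, n in counts.items()}


def total_variation(observed, expected):
    """Total variation distance between two discrete distributions, in [0, 1]."""
    support = set(observed) | set(expected)
    return 0.5 * sum(
        abs(observed.get(s, 0.0) - expected.get(s, 0.0)) for s in support
    )


# Hypothetical data: IP sizes of legitimate clicks aggregated over a group,
# and IP sizes of the clicks seen by one entity in that group.
group_legit_sizes = [1, 1, 2, 1, 3, 2, 1, 1, 2, 4]
entity_sizes = [1, 50, 50, 50, 2, 50]

expected = size_distribution(group_legit_sizes)  # estimate of the true distribution
observed = size_distribution(entity_sizes)
deviation = total_variation(observed, expected)
print(f"deviation = {deviation:.3f}")
```

A deviation close to 0 means the entity matches its group, while a value close to 1 means almost no overlap. Any cutoff for flagging an entity would have to be chosen separately and is not given in the text.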

Disadvantages

• The task of device fingerprinting based on various parameters has been studied in several works. This topic is closely related to the DeNAT and device counting challenges discussed in this study. If fingerprinting can be applied, it might provide the answer to the number of devices in the network. Generally, the challenge of device fingerprinting is different from clustering traffic, but it can be used to identify devices after clustering is implemented.
• In this paper, we introduce the challenge of identifying services sharing the same container.
• We did not find any studies dealing with containers and traffic clustering; to the best of our knowledge, this is the first study addressing this issue.

Proposed System

A different line of research has proposed a data analysis approach to discriminate legitimate from fraudulent clicks. One such work focuses on the problem of finding colluding publishers: the proposed system analyzes the IP addresses generating the click traffic for each publisher and identifies groups of publishers that receive their clicks from roughly the same IPs. Another addresses the scenario of a single publisher generating fraudulent traffic from several IPs. A single publisher that receives only a few clicks can evade the proposed system, but at the expense of throttling its own attacks. A set of publishers with a few clicks each can potentially collude to generate a large aggregate number of fraudulent clicks.
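Identifying publishers that receive their clicks from roughly the same IPs can be sketched as a pairwise set-similarity check. The text does not say how similarity is measured, so the Jaccard index and the 0.5 threshold below are assumptions made for illustration, and the publisher names and addresses are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_publisher_pairs(clicks_by_publisher, threshold=0.5):
    """Return (publisher, publisher, score) for pairs whose click-source IP
    sets overlap at least as much as `threshold`."""
    ip_sets = {pub: set(ips) for pub, ips in clicks_by_publisher.items()}
    pairs = []
    for p, q in combinations(sorted(ip_sets), 2):
        score = jaccard(ip_sets[p], ip_sets[q])
        if score >= threshold:
            pairs.append((p, q, score))
    return pairs


# Hypothetical click logs: publisher -> source IPs of its clicks.
clicks = {
    "pub_a": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "pub_b": ["10.0.0.1", "10.0.0.2", "10.0.0.4"],
    "pub_c": ["192.168.1.9"],
}
pairs = similar_publisher_pairs(clicks)
print(pairs)
```

Comparing every pair is quadratic in the number of publishers, so this form only suits small inputs. It also reflects the limitation noted above: a publisher with very few clicks contributes a tiny IP set and can easily fall below the threshold.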

Advantages

• There are several advantages to using IPvest as a whole system rather than using only this rule.
• First, in many cases the particular network configuration is not known in advance, and only after activating IPvest do we obtain knowledge of the network's configuration.
• Second, in other scenarios the rule is more complex (as in the previous experiment) and could not be applied without adding more features to the clustering.
• Third, the network configuration often changes (e.g., by adding another device of a different type), and after such a change a simple rule may fail without providing any indication.
