Security in P2P Networks:
A study of the Gnutella protocol and its weaknesses

Imran Qureshi

Dept. of Computer Science

Montclair State University

CMPT 495-01 - Data Security

http://csam.montclair.edu/~qureshii/p2p.html

Abstract

P2P networks have become one of the most prominent topics in computing today. The ability to share music, movies, documents or any other kind of file over the Internet has attracted a large number of users and made P2P widely known. Tens of thousands of people are on these networks at any given time, sharing millions of files. As history tells us, attackers will see this as a lucrative target and will almost certainly look for flaws or holes in the system in order to attack the file-sharing peers. It is one thing for flaws to exist and another for them to be easy to find, and it is striking how insecure these systems are.

For this reason, we take an in-depth look at one of these P2P networks, namely Gnutella, and identify its weaknesses. Specifically, we look at the history of its creation, the Gnutella topology (how information is transferred throughout the network) and the difference between centralized and decentralized networks, the protocol on which it runs, the security holes or weaknesses of the system, and possible solutions to these problems.

For: Prof. Stefan Robila

Date: December 13, 2004

Index of Topics

1  Introduction

1.1  History of Gnutella

1.2  Motivation

1.3  Contribution

2  In-Depth Gnutella

2.1  Gnutella Topology – Decentralized

2.2  Protocol Specifications

2.2  a Descriptor Headers

2.2 b Descriptors

2.3 Gnutella routing requirements

3 Communication in Gnutella

3.1 Finding Servents

3.2 Connecting to Servents

3.3 Searching and Downloading resources

3.4 Firewalled servents

4 Security risks of Gnutella

4.1 Pong attack

4.2 Viruses or Trojans through the Push descriptor

4.3 Denial of Service attack (DOS)

4.4 IP Harvesting

4.4 a IP Host-Cache Server

4.4 b Crawlers

4.5 Man in the Middle attack

4.6 Browsing files

5  Solution

5.1  Unique Network Identifiers (UNI)

5.2  Validation

5.3  Reducing Traffic

6  Conclusion

7  References

1) Introduction

So what is P2P? P2P stands for “peer-to-peer” and refers to a group of users, or peers, sharing files directly with each other. The only requirement for connecting directly to another user is that both are on the same network. Many P2P systems provide these services, among them Napster, Kazaa and Gnutella, and which one to use is largely a matter of user preference. For the purposes of our discussion, we will look at Gnutella in depth. But first, how did Gnutella come about?

1.1) History

Gnutella was developed in early 2000 by Nullsoft, a subsidiary of AOL. Its two primary creators were Justin Frankel and Tom Pepper. On March 14, 2000, Nullsoft released Gnutella and posted it on their website as a free download. When AOL learned of the release, it immediately forced Nullsoft to take Gnutella down on the grounds that it promoted piracy. Gnutella was consequently removed after being online for only one day, but in that short time many people had already downloaded the software. Some of them began reverse-engineering the protocol, which is why many different applications now provide P2P services with the Gnutella protocol as their basis. Figure 1.1 shows some of the most prominent ones on different OS platforms.

Figure 1.1

1.2) Motivation

The main motivation for this research came from surveying other research papers. Reading through them, it was striking that most focus on making the Gnutella network more efficient in terms of communication speed, while only a few deal with the security of the protocol, even though it affects the privacy of the thousands of people who use it. Given the popularity of these systems, it is no longer only regular users who go on these networks; businesses have also shown great interest and use them to share files or communicate with other businesses.

1.3) Contribution

Our contribution in this paper deals strictly with the security aspects of the Gnutella protocol. When the protocol was created, it did not take into account the possibility that peers might misbehave. As a result, the protocol is extremely insecure, and private information is easily available to malicious users, making the network a breeding ground for attackers.

2) In-depth Gnutella

2.1) Gnutella Topology

There are two main topologies that allow people to share files over a P2P network: centralized and decentralized.

The “centralized” file sharing model, used by Napster, contains a central server. All queries and file search or download requests have to go through this server, which holds a directory of all the files shared over the network and of the users currently online. This setup makes the server a bottleneck that can slow the network down, and it also creates a single point of failure: if the server goes down, the entire network fails.

Napster’s Centralized Server

Source: Vlajic, N. “Peer-to-peer networks”

Figure 2.1a

The “decentralized” file sharing model, as used by Gnutella, has no central server. Communication between two nodes or peers takes place directly. Each node either grants other peers permission to download its resources or asks other peers for access to their resources. Each node on a Gnutella network is called a “servent”, meaning that every peer is both a SERVer and a cliENT.

This type of setup eliminates the “single point of failure” vulnerability of the network. Since users communicate directly, communication is also fast.

Gnutella’s De-centralized Topology

Source: Vlajic, N. “Peer-to-peer networks”

Figure 2.1b

2.2) Gnutella Protocol

The Gnutella protocol defines how communication is established between peers on the network. Its central concept is the “descriptor”. Currently, five descriptors are used in communication: ping, pong, query, queryhit and push. Each descriptor is preceded by a descriptor header.

So let’s look at how these descriptors are structured in the network:

When a peer sends a message over the network, that message looks as follows:

Field                 Byte offset
Descriptor Header     0-22
Descriptor Payload    23 onward (variable length; 0 …. max)

The descriptor header identifies which type of descriptor (ping, pong, ...) is carried in this message. It is also used to limit the number of peers to which the message is broadcast (we will look at this further on). Some other things to note:

-  All the structures are in little-endian byte format (the least significant byte is stored first), unless stated otherwise

-  All IP addresses are in IPv4 format. For example, the address 208.17.50.4 is stored as:

Byte 1    Byte 2    Byte 3    Byte 4
0xD0      0x11      0x32      0x04

Now let’s look at the first part of the message:

2.2 a) Descriptor Header

- Byte Structure

Field                 Byte offset
Descriptor ID         0-15
Payload Descriptor    16
TTL                   17
Hops                  18
Payload Length        19-22

Descriptor ID: Unique identifier for the descriptor on the network (16-byte string)

Payload Descriptor: Depending upon the descriptor being sent, this value could be:

0x00 for a ping

0x01 for a pong

0x40 for a push

0x80 for a query

0x81 for a queryhit

TTL (Time-to-live or Horizon): This is the protocol’s main mechanism for controlling the amount of traffic on the network and preventing flooding and poor performance. Each servent that receives a descriptor header looks at its TTL value to determine whether the descriptor should be forwarded to the next peer. If TTL is not 0, the servent decrements TTL by one and forwards the descriptor. If TTL is 0, the descriptor must no longer be forwarded and is dropped.

Hops: The total number of times the descriptor has been forwarded. This is similar to the concept of hops in IP routing, where the value refers to the number of routers a packet passes through before reaching its destination. In Gnutella, it refers to the number of servents that have already forwarded the descriptor.

The general formula is: TTL (initial) = TTL (current) + Hops (current)

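Payload Length: Length in bytes of the payload that immediately follows this header. Since descriptors follow one another without gaps, it is used to find the beginning of the next descriptor header.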
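The header layout and the TTL rule described above can be sketched in a few lines of Python. This is only an illustration under the byte offsets given in this section; the function names (pack_header, unpack_header, forward) are hypothetical and not part of any Gnutella client.

```python
import struct

# Header layout as described above: 16-byte descriptor ID, 1-byte payload
# descriptor, 1-byte TTL, 1-byte hops, 4-byte little-endian payload length.
HEADER = struct.Struct("<16sBBBI")  # 23 bytes in total

PAYLOAD_TYPES = {0x00: "ping", 0x01: "pong", 0x40: "push",
                 0x80: "query", 0x81: "queryhit"}


def pack_header(descriptor_id, payload_type, ttl, hops, payload_length):
    """Serialize the five header fields into 23 bytes."""
    return HEADER.pack(descriptor_id, payload_type, ttl, hops, payload_length)


def unpack_header(data):
    """Parse the first 23 bytes of a message into a dictionary."""
    descriptor_id, payload_type, ttl, hops, length = HEADER.unpack(data[:HEADER.size])
    return {
        "id": descriptor_id,
        "type": PAYLOAD_TYPES.get(payload_type, "unknown"),
        "ttl": ttl,
        "hops": hops,
        "payload_length": length,
    }


def forward(header):
    """Apply the TTL rule: drop at TTL 0, otherwise decrement TTL and count a hop."""
    if header["ttl"] == 0:
        return None  # must not be forwarded any further
    return dict(header, ttl=header["ttl"] - 1, hops=header["hops"] + 1)
```

Because every forwarding step moves one unit from TTL to Hops, the sum TTL + Hops stays equal to the initial TTL, which is the formula given above.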

2.2 b) The Descriptors

Immediately after the descriptor header, the descriptors follow. Each descriptor has a different function and a different byte structure. We will look at each one of them individually.

a)  Ping

A ping descriptor is primarily used by a servent to probe the network for other servents. Pings have no associated payload, so their payload length is 0 and there is no byte structure to describe. The descriptor header identifies a ping by a value of 0x00 in the payload descriptor field and a value of 0x00000000 in the payload length field. A servent that wishes to reply to a ping responds with a “pong” descriptor.

Pong

Field                   Byte offset
Port                    0-1
IP Address              2-5
No. of files shared     6-9
No. of Kb shared        10-13

Pong is sent as a reply to a ping descriptor. It identifies the replying servent’s IP address and the port on which it accepts traffic.

Port: The port on which the responding host can accept incoming connections

IP Address: IP address of the responding host (big-endian format)

Number of files shared: Total number of files the responding host is sharing on the network (usually those found in its “shared folder”)

Number of Kb shared: Total number of kilobytes the responding host (with the given IP address and port) is sharing.
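As a rough illustration of the pong layout, the following Python sketch builds and parses the 14-byte payload. The names build_pong and parse_pong are hypothetical, and the values used in the usage note are examples only.

```python
import ipaddress
import struct

# Pong payload: 2-byte port (little-endian), 4-byte IPv4 address
# (big-endian, so it is kept as raw bytes), 4-byte file count and
# 4-byte kilobyte count (both little-endian).
PONG = struct.Struct("<H4sII")  # 14 bytes in total


def build_pong(port, ip, n_files, n_kb):
    """Serialize a pong payload from a port, dotted IPv4 string and counters."""
    return PONG.pack(port, ipaddress.IPv4Address(ip).packed, n_files, n_kb)


def parse_pong(payload):
    """Parse a 14-byte pong payload into a dictionary."""
    port, ip_raw, n_files, n_kb = PONG.unpack(payload[:PONG.size])
    return {
        "port": port,
        "ip": str(ipaddress.IPv4Address(ip_raw)),
        "files": n_files,
        "kb": n_kb,
    }
```

For example, parse_pong(build_pong(6346, "208.17.50.4", 10, 2048)) returns the same four values that were passed in, and bytes 2 to 5 of the payload are 0xD0 0x11 0x32 0x04.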

b)  Query

Field               Byte offset
Minimum Speed       0-1
Search Criteria     2 ….

Used to search the network for files that meet certain criteria. A servent that has matching files replies with a “queryhit” descriptor.

Minimum Speed: The minimum speed (in kb/s) of the servents that should respond to this query. A query with a minimum speed of m kb/s should only be answered with a queryhit by a servent whose speed is at least m.

Search Criteria: A search string terminated by a null (0x00). The maximum length is bounded by the payload_length field of the descriptor header.

eg: “nameofthesong.mp3”
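A query payload can be sketched as follows. The helper names are hypothetical, and the matching rule in matches (a case-insensitive substring test) is purely an assumption for illustration; the protocol description above does not define how a servent interprets the search criteria.

```python
import struct


def build_query(min_speed_kbps, criteria):
    """2-byte little-endian minimum speed followed by a null-terminated string."""
    return struct.pack("<H", min_speed_kbps) + criteria.encode("ascii") + b"\x00"


def parse_query(payload):
    """Return (minimum speed, search criteria) from a query payload."""
    (min_speed,) = struct.unpack_from("<H", payload, 0)
    criteria = payload[2:].split(b"\x00", 1)[0].decode("ascii")
    return min_speed, criteria


def matches(criteria, shared_files):
    """Assumed matching rule: case-insensitive substring of the file name."""
    needle = criteria.lower()
    return [name for name in shared_files if needle in name.lower()]
```

A servent receiving the payload would call parse_query, compare the minimum speed against its own, and answer with a queryhit only if matches returns at least one file.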

c)  Queryhit

No. of Hits / Port / IP Address / Speed / Result Set / Servent ID

Sent in reply to a query. A servent sends this reply only if it has files that meet the specified criteria.

No. of Hits: Total number of hits or matches for the query in the result set

Port: The port on which the responding host can accept incoming connections

IP Address: IP Address of the responding host (big-endian format)

Speed: Speed of the responding host

Result Set: The set of responses to the corresponding query. It contains No. of Hits entries, one for each file in the responding host’s shared folder that met the search criteria. Each entry has the following structure:

Field         Byte offset
File Index    0-3
File Size     4-7
File Name     8 …

File Index: Identifier of the file matching the query (assigned by the responding host)

File Size: Size of the file in bytes.

File Name: Name of the file (double null terminated, 0x0000)

Servent Identifier: Unique 16-byte string identifying the responding servent on the network.
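The following Python sketch shows how a queryhit payload might be parsed. The widths of the leading fields (1-byte hit count, 2-byte port, 4-byte IP address, 4-byte speed) are not stated in the field list above and are assumed here, as is the placement of the servent identifier in the last 16 bytes; parse_queryhit is a hypothetical name.

```python
import ipaddress
import struct

# Assumed widths for the leading fields: hits (1), port (2), IP (4), speed (4).
QUERYHIT_PREFIX = struct.Struct("<BH4sI")  # 11 bytes


def parse_queryhit(payload):
    """Parse a queryhit payload into its leading fields, results and servent ID."""
    n_hits, port, ip_raw, speed = QUERYHIT_PREFIX.unpack_from(payload, 0)
    offset = QUERYHIT_PREFIX.size
    results = []
    for _ in range(n_hits):
        # Each result: 4-byte file index, 4-byte file size, then the file
        # name terminated by a double null (0x0000).
        index, size = struct.unpack_from("<II", payload, offset)
        offset += 8
        end = payload.index(b"\x00\x00", offset)
        results.append((index, size, payload[offset:end].decode("ascii")))
        offset = end + 2
    return {
        "port": port,
        "ip": str(ipaddress.IPv4Address(ip_raw)),
        "speed": speed,
        "results": results,
        "servent_id": payload[-16:],  # assumed to be the final 16 bytes
    }
```

The search for the double null starts only after the fixed 8-byte index and size fields, so zero bytes inside those numeric fields are not mistaken for the end of a file name.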

d)  Push

Field                 Byte offset
Servent Identifier    0-15
File Index            16-19
IP Address            20-23
Port                  24-25

A servent cannot open a connection to a firewalled servent in order to download a file from it. A push descriptor is therefore used to ask the firewalled servent to initiate the transfer itself.

Servent Identifier: The unique 16-byte string identifier of the targeted (firewalled) servent, which is being asked to push the file with index File Index.

File Index: Index of the file to be pushed from the targeted servent’s shared folder.

IP Address: IP address (big-endian format) of the servent to which the file will be pushed

Port: The port on that servent to which the file should be pushed.
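The push layout can be illustrated with a short Python sketch. The names build_push and parse_push are hypothetical, and the example values in the usage note are arbitrary.

```python
import ipaddress
import struct

# Push payload: 16-byte servent identifier, 4-byte file index
# (little-endian), 4-byte IPv4 address (big-endian, kept as raw bytes)
# and 2-byte port (little-endian).
PUSH = struct.Struct("<16sI4sH")  # 26 bytes in total


def build_push(servent_id, file_index, ip, port):
    """Serialize a push request aimed at the servent with the given identifier."""
    return PUSH.pack(servent_id, file_index, ipaddress.IPv4Address(ip).packed, port)


def parse_push(payload):
    """Parse a 26-byte push payload into a dictionary."""
    servent_id, file_index, ip_raw, port = PUSH.unpack(payload[:PUSH.size])
    return {
        "servent_id": servent_id,
        "file_index": file_index,
        "ip": str(ipaddress.IPv4Address(ip_raw)),
        "port": port,
    }
```

Calling build_push(b"\xaa" * 16, 3, "208.17.50.4", 6346) yields 26 bytes whose offsets match the table above. Note that nothing in the payload proves that the IP address and port actually belong to the sender.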

2.3) Gnutella Routing Requirements

Other than the basic descriptors and their byte structures, the Gnutella protocol also sets standards for how these descriptors should be routed. These rules are as follows:

1)  Reply descriptors are sent back only along the path on which the request arrived. For example, a pong is sent along the same path that carried the ping. The same holds for the other reply descriptors. If a servent receives a pong but has not seen a ping with the same descriptor ID, it should drop the pong.

2)  Ping and query descriptors received by a servent should be forwarded to all of its neighbors, except the one from which they were received.

3)  The TTL value should be decremented, and the Hops value incremented, by each servent before it forwards a descriptor.

4)  To save network bandwidth and limit traffic, a servent should not forward a descriptor with the same payload descriptor and descriptor ID as one it has received before.
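The routing rules above can be sketched as a small Python class. This is a minimal illustration, not a client implementation: the class name Router, its method names, and the use of plain strings to stand for neighbor connections are all assumptions.

```python
class Router:
    """Tracks which requests were seen and where each one came from."""

    def __init__(self):
        # (payload descriptor, descriptor ID) -> neighbor the request came from
        self.seen = {}

    def on_request(self, payload_type, descriptor_id, ttl, hops,
                   from_neighbor, neighbors):
        """Return (neighbor, ttl, hops) tuples for each copy to forward."""
        key = (payload_type, descriptor_id)
        if key in self.seen:
            return []  # rule 4: do not forward a duplicate
        self.seen[key] = from_neighbor  # remembered for routing the reply
        if ttl == 0:
            return []  # expired, not forwarded further
        # rules 2 and 3: forward to every other neighbor with TTL - 1, Hops + 1
        return [(n, ttl - 1, hops + 1) for n in neighbors if n != from_neighbor]

    def reply_route(self, request_type, descriptor_id):
        """Rule 1: neighbor to send a reply back to, or None to drop it."""
        return self.seen.get((request_type, descriptor_id))
```

A pong with descriptor ID x is routed with reply_route(0x00, x); if no ping with that ID was ever recorded, the method returns None and the pong is dropped.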

3) Communication in Gnutella

This section describes how peers on the network find each other and use the descriptors to transfer files. It also describes how a new user lets others know of its existence and connects directly to peers. We discuss the following topics: finding servents, connecting to servents, downloading files and firewalled servents.

3.1) Finding servents

Initially, when a servent enters the network it has no knowledge of the topology and does not know the IP addresses and ports of its neighboring servents, so it cannot take part in communication within the network. For this reason, most client vendors have set up host-cache servers, which supply new clients with the IP addresses of currently connected servents. Another technique, widely used although it is outside the protocol, is to keep a local cache of IP addresses from previous connections. With such a cache there is no need to contact the host-cache servers, since the client already has the addresses of the peers it connected to in the past (figure 3.1 shows this concept).