Web Cache
Yee Vang
Computer Science - CIS
University of Wisconsin-Platteville
Abstract
As internet content grows and the number of users increases, internet traffic also increases. This increase in traffic congests the network and adds to the workload of the origin servers. As a result, user access latency increases; that is where web caching comes into play. Web caches are intended to reduce network bandwidth usage; they achieve this by reducing the traffic between the clients and the origin server, which in turn reduces access latency. This paper will explore what web caching is, the pros and cons of web caching, the types of web caching, web caching architecture, cache relevancy, and cache placement and replacement policies.
Introduction
In today’s internet, web browsing is easily one of the top generators of internet traffic. With many users browsing many websites, and many of them requesting the same information from the origin server, internet traffic increases, which causes network congestion and an increase in user access latency. One way to improve the delivery of websites from the origin server to the users, and to reduce user access latency, is through the use of web caches. So what is a web cache? A cache literally means a place of storage; a web cache is therefore a place to store websites or web objects. We will explore the basic idea of caching and then delve into the advantages and disadvantages of web caching. Then we will examine the different types of web caches and web cache architecture, and end with cache relevancy and cache placement and replacement policies.
Caching and Web Cache
Caching is a technique that can reduce latency and network congestion through the use of a cache. To illustrate the basic idea of caching, let us start with an example. Consider a simple movie rental store. The store has only one copy of every movie, all the movies are kept in a storage room in the next building, and the store is manned by a single worker. When a customer comes in and asks for a movie, the worker has to go to the storage room, find the movie, return to the front desk, and give the movie to the customer. Later in the day the customer returns the movie, and the worker takes it back to the storage room. A few hours later another customer comes in and asks for the same movie, which prompts the worker to go back to the storage room, find the movie, and return to the front desk with it. As you can see, each time the movie is requested the worker has to go to the storage room to get it, which is inefficient.
Now let us examine the same movie rental store, but this time with a cache system. We will place a movie rack behind the front desk that can hold up to ten movies at a time. The movie rack represents the cache, and the act of using the movie rack represents caching. We start with an empty movie rack. When a customer comes in and asks for a movie, the worker first checks the movie rack to see if the movie is there. Once it is determined that the movie is not there, the worker goes to the storage room, retrieves the movie, and gives it to the customer. Later in the day the customer returns the movie, and this time the worker puts it on the movie rack. An hour later another customer comes in asking for the same movie. The worker checks the movie rack and, sure enough, the movie is there, so the worker simply hands it to the customer. In this second example the worker only had to go to the storage room once for the same movie.
The purpose of web caching is to reduce the traffic between the users and the origin server. Web caching achieves this by storing recently downloaded webpages, web documents, or web objects locally (either on your computer or on a web server that your ISP uses). Each subsequent request for the same webpages, web documents, or web objects is served locally or from the ISP’s server, which eliminates the need to send the request to the origin server every time something that has already been cached is requested. The previous example illustrates web caching: the customers are the users, the movie is the web page, the worker is your ISP, the storage room is the web server that holds the web page the user is requesting, and the movie rack is the cache [5].
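The lookup flow in the movie rack analogy can be sketched in a few lines of Python. This is only an illustrative sketch: the names `SimpleCache` and `fetch_from_origin` are hypothetical, and the rule for what to do when the ten-slot rack is full (here, discard the oldest entry) is an assumption, since the example above does not say which movie leaves the rack.

```python
from collections import OrderedDict


class SimpleCache:
    """A tiny cache that plays the role of the movie rack."""

    def __init__(self, fetch_from_origin, capacity=10):
        self.fetch_from_origin = fetch_from_origin  # hypothetical trip to the storage room
        self.capacity = capacity                    # the rack holds up to ten movies
        self.store = OrderedDict()                  # remembers insertion order

    def get(self, key):
        if key in self.store:
            # Cache hit: hand over the local copy, no trip to the origin.
            return self.store[key]
        # Cache miss: fetch from the origin, then keep a copy.
        value = self.fetch_from_origin(key)
        if len(self.store) >= self.capacity:
            # Assumed policy: make room by dropping the oldest entry.
            self.store.popitem(last=False)
        self.store[key] = value
        return value


origin_trips = []  # records every trip to the "storage room"


def fetch_from_origin(title):
    origin_trips.append(title)
    return f"contents of {title}"


cache = SimpleCache(fetch_from_origin)
first = cache.get("movie-a")   # miss: goes to the origin
second = cache.get("movie-a")  # hit: served from the cache
```

After both requests, `origin_trips` contains a single entry, which mirrors the worker making only one trip to the storage room.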
Cache Hit, Cache Miss and Cache Hit Rate
In the example above, the first time the worker checked the movie rack the requested movie was not there; this is called a cache miss. The second time the rack was checked (after the movie had been returned and placed on the rack) the movie was present; this is called a cache hit. The cache hit rate is the percentage of requests that score a cache hit, that is, the share of requests served from the cache rather than from the origin server [3].
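As a rough illustration, the hit rate for a sequence of requests can be computed by counting hits and dividing by the total number of requests. The sketch below assumes an unbounded cache whose entries never expire, and the function name `hit_rate` and the sample request list are made up for this example.

```python
def hit_rate(requests):
    """Return the fraction of requests that would be cache hits."""
    cached = set()
    hits = 0
    for url in requests:
        if url in cached:
            hits += 1        # already cached: cache hit
        else:
            cached.add(url)  # first request: cache miss, then cache it
    return hits / len(requests) if requests else 0.0


requests = ["/a", "/b", "/a", "/a", "/c"]
rate = hit_rate(requests)  # 2 hits out of 5 requests
```

Here the first requests for `/a`, `/b`, and `/c` are misses and the two repeated requests for `/a` are hits, so the hit rate is 2/5, or 40%.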
Pros – Advantages of Caching
Web caches take workload off the origin server by storing requested web objects in a cache, whether that cache sits at the browser, proxy, or server level. An institution or organization that places a proxy cache between itself and the internet can reduce its internet bandwidth usage. This in turn can decrease user access latency even for web objects that are not cached, because the link to the internet is less congested. As an added effect of using a proxy server, the institution or organization can control the kind of information its users can access; this is especially useful for companies that do not want their employees to waste bandwidth browsing non-work-related websites. Web caching also reduces the workload on the origin servers, because requests that get a cache hit are answered directly from the cache. Only requests that score a cache miss, or that hit an outdated web object, are sent to the origin server. Lastly, as an unintended advantage, users can still view content that scores a cache hit even if the origin server has gone down.
Cons – Disadvantages of Caching
The foremost issue with caching is that not all web documents/objects or websites are cacheable. Non-cacheable objects are usually dynamic in nature, i.e. websites that generate dynamic data. Sites that require an active connection are also not usually cacheable because of privacy issues. Lastly, websites that use Hypertext Transfer Protocol Secure (HTTPS) are generally not cacheable by intermediate caches, since the traffic is encrypted between the client and the origin server [3]. The second problem with web caching is that users could be scoring cache hits on stale web objects/websites. Stale objects are web objects that are not up to date; in essence a user could be viewing yesterday’s cnn.com instead of today’s cnn.com. In proxy caching there is also a limit to how many users the proxy can serve before latency increases to an undesirable level. A proxy server should therefore be at least nearly as fast as a direct connection to the origin server, otherwise it defeats its purpose. Lastly, some origin servers might disable caching of their web documents/sites because caching reduces the number of hits recorded on their servers [1].
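One common way to limit the stale-object problem is to give every cached copy a maximum age and refetch it once that age is exceeded. The sketch below is a minimal illustration under that assumption; the class name `TtlCache`, the fixed 60-second lifetime, and the injected `clock` function are all hypothetical, and real HTTP caches derive freshness from response headers rather than from one fixed value.

```python
class TtlCache:
    """Cache whose entries are treated as stale after max_age_seconds."""

    def __init__(self, fetch_from_origin, max_age_seconds, clock):
        self.fetch_from_origin = fetch_from_origin  # hypothetical origin lookup
        self.max_age_seconds = max_age_seconds      # assumed fixed lifetime
        self.clock = clock                          # returns the current time in seconds
        self.entries = {}                           # url -> (body, time stored)

    def get(self, url):
        now = self.clock()
        entry = self.entries.get(url)
        if entry is not None:
            body, stored_at = entry
            if now - stored_at < self.max_age_seconds:
                return body  # fresh cache hit
        # Miss or stale copy: go back to the origin and store the new copy.
        body = self.fetch_from_origin(url)
        self.entries[url] = (body, now)
        return body


current_time = [0.0]  # fake clock so the example is deterministic
origin_requests = []


def fetch_from_origin(url):
    origin_requests.append(url)
    return f"{url} version {len(origin_requests)}"


cache = TtlCache(fetch_from_origin, max_age_seconds=60,
                 clock=lambda: current_time[0])
at_start = cache.get("/news")   # miss: fetched from the origin
current_time[0] = 30.0
still_fresh = cache.get("/news")  # copy is 30 s old: served from the cache
current_time[0] = 90.0
refreshed = cache.get("/news")  # copy is 90 s old: refetched from the origin
```

The trade-off is visible in the choice of lifetime: a short lifetime keeps content fresher but sends more requests to the origin server, while a long lifetime does the opposite.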
Types of Web Cache
There are many different approaches to caching. In this section we will examine browser caching, proxy caching, and reverse proxy caching. Each approach has its own advantages and disadvantages, but what they all have in common is the use of cache placement and replacement algorithms.
Browser Cache
A browser cache stores the cache at the client level, meaning that the cache is actually stored on the user’s computer. This form of caching uses the temporary internet files folder to store web objects and websites for later use (note: the temporary internet files folder applies to Windows systems). Browser caching stores a copy of a requested item on the local computer’s hard drive. Because the cache is stored locally, when the user makes the same request again it is fulfilled almost instantly, which dramatically improves the user experience.
An advantage of browser caching is that the cache is stored locally on the user’s computer, and because of the user’s web usage pattern the same user will most likely score cache hits. On the other hand, since the cache is stored on the local machine, it is only available to the user(s) of that machine. The main disadvantage of browser caching is therefore that it only serves one machine; if another machine sends the same request for the first time, that request will have to go to the origin server to get the web object/website.
Proxy Cache
A proxy cache is similar to a browser cache in that it also stores requested websites/web objects. The difference is that a proxy cache stores them on a proxy server. A proxy server is a server that sits between the origin server and the client. In most cases the proxy server will serve more than one client, and proxy servers often act as gateways to the internet for the clients they serve. In a proxy caching system, when a client makes a request, the request is directed to the proxy server, which checks whether the same request has been made before. If a cache hit occurs, the client is given the web object/website that is cached on the proxy server. If a cache miss occurs, the proxy server makes the request (on behalf of the user) to the origin server. After the origin server sends the requested object to the proxy server, the proxy server stores a copy of the object and then sends a copy to the client [1].
The main advantage of proxy caching is that it serves more than one client, so if another user makes the same request, the object will already be cached on the proxy server. Proxy servers are popular with ISPs and large companies because, more often than not, they help reduce internet bandwidth usage. One disadvantage of proxy caching is that a proxy server trying to serve too many clients can become overloaded, which increases latency. The same can happen when many clients send requests simultaneously. Lastly, proxy caching has the side effect of allowing the proxy server to control what its clients can and cannot access. This can be considered a good thing if the proxy server is operated by a large company that only allows its users to browse work-related web content, which can eliminate some web traffic.
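The difference between per-machine browser caches and a shared proxy cache can be illustrated by counting how many requests reach the origin server in each setup. The following is a sketch only; the `Cache` class, the `make_origin` helper, and the path `/logo.png` are hypothetical names chosen for the example.

```python
class Cache:
    """Minimal cache that asks an upstream source on a miss."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable used when the object is not cached
        self.store = {}

    def get(self, url):
        if url not in self.store:                 # cache miss
            self.store[url] = self.upstream(url)  # fetch and keep a copy
        return self.store[url]                    # served from the cache


def make_origin():
    """Return a fake origin server and the log of requests it received."""
    log = []

    def origin(url):
        log.append(url)
        return f"body of {url}"

    return origin, log


# Browser caching: each machine has its own cache.
origin, browser_log = make_origin()
browser_a = Cache(origin)
browser_b = Cache(origin)
browser_a.get("/logo.png")
browser_b.get("/logo.png")  # first request from this machine: goes to origin

# Proxy caching: both clients go through one shared cache.
origin, proxy_log = make_origin()
proxy = Cache(origin)
proxy.get("/logo.png")  # request from client A: miss, fetched from origin
proxy.get("/logo.png")  # request from client B: hit on the proxy
```

With separate browser caches the origin server receives two requests, while with the shared proxy it receives only one.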
Reverse Proxy Cache
Notice how in proxy caching the proxy server serves the clients. Reverse proxy caching is the exact opposite: it serves the origin server. Reverse proxy caching uses a proxy server that is located directly in front of the origin server(s). When a user makes a request, the request is therefore intercepted by the reverse proxy server instead of being received by the origin server. If the request can be satisfied from the cache of the reverse proxy server, the cached object is returned to the client. If it cannot, the reverse proxy server forwards the request to the origin server. When the response comes back, the reverse proxy server keeps a copy of the requested object/content and sends another copy to the client [1][9].
The main advantage of reverse proxy caching for an origin server is that it takes some of the load off that server. This is accomplished in several ways. An object can be requested from the origin server once, cached on the reverse proxy, and then served to many clients without contacting the origin again. Caching static files such as logos, JavaScript files, and CSS files eliminates the need to contact the origin server every time they are requested. It also frees the origin server to spend more of its capacity on processing dynamic web objects/content [1][9].
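The split between cached static files and forwarded dynamic requests can be sketched as follows. This is a simplified illustration: the class name `ReverseProxy`, the suffix list, and the request paths are assumptions made for the example, and deciding cacheability by file suffix is a stand-in for whatever rule a real reverse proxy is configured with.

```python
# Assumed rule for this sketch: only these file types are treated as static.
STATIC_SUFFIXES = (".png", ".js", ".css")


class ReverseProxy:
    """Sits in front of the origin and caches static files only."""

    def __init__(self, origin):
        self.origin = origin  # callable standing in for the origin server
        self.store = {}

    def handle(self, path):
        if not path.endswith(STATIC_SUFFIXES):
            # Dynamic content: always forwarded to the origin server.
            return self.origin(path)
        if path not in self.store:
            # Static file, first request: fetch once and keep a copy.
            self.store[path] = self.origin(path)
        return self.store[path]


origin_log = []


def origin(path):
    origin_log.append(path)
    return f"response for {path}"


proxy = ReverseProxy(origin)
for _ in range(3):
    proxy.handle("/site.css")  # only the first request reaches the origin
for _ in range(2):
    proxy.handle("/cart")      # every request reaches the origin
```

Out of five client requests, the origin server sees three: one for the stylesheet and two for the dynamic page.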
Summary of Types of Cache
Each type of cache has its own way of reducing user access latency and network congestion. In one way or another, each type of cache also takes some load off the origin server, and how much depends on the type of cache system. Reverse proxy and proxy caches take a greater amount of load off the origin server, but the proxy server itself can be overloaded while serving many clients. A browser cache takes the least load off the origin server, but on a cache hit it has the fastest response because the cache is on the user’s own machine.
Web Caching Architecture
There are two main web caching architectures: hierarchical and distributed. Each has its own advantages and disadvantages, and each uses the network topology shown in Figure 1 differently. Figure 1 illustrates how the internet can be modeled as a hierarchy of ISPs [3]. In Figure 1 there is one national ISP at the top, followed by many regional networks and even more institutional networks as you go down the hierarchy.
Hierarchical Caching Architecture
In hierarchical caching there is more than one level of cache between the users and the origin servers. Hierarchical caching often uses more than one type of cache, and the caches are related to each other as parent and child caches. For example, in a simple three-level hierarchical caching system, using Figure 1 as a visual, the first level of cache is at the institutional level, the second level is at the regional level, and the third level is at the national level [3]. In this example the cache at the institutional level is the child cache of the regional level cache. Likewise, the cache at the regional level is a child cache of the cache at the national level. This in turn makes the national cache the parent of the regional cache, and the regional cache the parent of the institutional cache. In this example the level one cache could very well be a proxy cache, while the level three cache could be serving as a reverse proxy cache [1].
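The parent and child relationship can be sketched as a chain of caches in which each level asks its parent on a miss. This is a minimal illustration that assumes every level stores a copy of whatever passes through it; the class name `CacheLevel`, the two institutions, and the path `/page` are hypothetical.

```python
class CacheLevel:
    """One level in the hierarchy; asks its parent on a cache miss."""

    def __init__(self, name, parent_fetch):
        self.name = name
        self.parent_fetch = parent_fetch  # parent cache's get, or the origin
        self.store = {}
        self.hits = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1                 # served at this level
            return self.store[url]
        body = self.parent_fetch(url)      # miss: go one level up
        self.store[url] = body             # assumed: every level keeps a copy
        return body


origin_log = []


def origin(url):
    origin_log.append(url)
    return f"body of {url}"


# Level three (national) is the parent of level two (regional), which is
# the parent of the level one (institutional) caches.
national = CacheLevel("national", origin)
regional = CacheLevel("regional", national.get)
institution_a = CacheLevel("institution-a", regional.get)
institution_b = CacheLevel("institution-b", regional.get)

institution_a.get("/page")  # misses at every level, reaches the origin
institution_b.get("/page")  # misses locally, hits at the regional parent
```

The second institution's request never reaches the national cache or the origin server, because its regional parent already holds a copy.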