/ Technical papers | Web caches

Web caches

What is a web cache?

In their simplest form, web caches store temporary copies of web objects. They are designed primarily to improve the accessibility and availability of this type of data to end users.

Caching is not an alternative to increased connectivity, but instead optimises the usage of available bandwidth. After the initial access/download, schools can access a single locally stored copy of the content rather than repeatedly requesting the same content from the origin server. Content delivery works on the principle of delivering content to the local network before it is required, rather than the ‘on-demand’ approach of normal caching.

This technical paper focuses on the practical issues surrounding caches; it will look at hardware and software solutions and advanced features that provide shared services to users over a network.

How will a cache benefit my institution?

Caching minimises the number of times an identical web object is transferred from its host server by retaining copies of requested objects in a database or repository. Requests for previously cached objects result in the cached copy of the object being returned to the user from the local repository rather than from the host server. This results in little or no extra network traffic over the external link and increases the speed of delivery.

Caches are limited by the amount of disk space – when a cache is full, older objects are removed and replaced with newer content. Some systems may implement 'persistency' measures, however, to preserve certain types of content at the discretion of the administrator.
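The behaviour described above can be sketched in a few lines of Python. This is an illustration only, not how any particular product works: the class name, the `fake_origin` function and the choice to evict the least recently used object first are all assumptions made for the example.

```python
from collections import OrderedDict


class SimpleWebCache:
    """Toy size-limited cache keyed by URL (illustrative only)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.store = OrderedDict()  # URL -> body, least recently used first
        self.origin_fetches = 0     # how often the host server was contacted

    def get(self, url, fetch_from_origin):
        """Return the object for url, contacting the origin only on a miss."""
        if url in self.store:
            self.store.move_to_end(url)  # mark as recently used
            return self.store[url]
        body = fetch_from_origin(url)
        self.origin_fetches += 1
        self._add(url, body)
        return body

    def _add(self, url, body):
        if len(body) > self.capacity_bytes:
            return  # too large to keep at all
        # Remove the oldest objects until the new one fits.
        while self.used_bytes + len(body) > self.capacity_bytes:
            _, evicted = self.store.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.store[url] = body
        self.used_bytes += len(body)


def fake_origin(url):
    """Stand-in for a real request to the host server (hypothetical)."""
    return ("content of " + url).encode()


cache = SimpleWebCache(capacity_bytes=100)
first = cache.get("http://example.org/a", fake_origin)   # miss: fetched from origin
second = cache.get("http://example.org/a", fake_origin)  # hit: served locally
```

The second request never reaches `fake_origin`, which is the saving on the external link that the paragraph above describes.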

Example:
A school has a 10Mbps local area network (LAN) and a 128Kbps ISDN connection to the Internet, so the local network is about 80 times faster than the Internet connection. Consider a class situation where a suite of computers is trying to download a large graphic, perhaps 256KB in size. Even with the whole connection to itself, each computer would need 16 seconds to download the graphic across the 128Kbps link (128Kbps = 16KBps); with the whole suite sharing the link, each download takes correspondingly longer.
If a cache is implemented on the local network, the cache computer will download a single copy of the graphic at a maximum speed of 128Kbps, and then pass this on to each computer over the high-speed LAN connection at 10Mbps. Across a 10Mbps connection (10Mbps = 1,250KBps), the transfer would take approximately 0.2 seconds. In practice, transfer rates will be lower than these theoretical figures because of network overhead.
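The arithmetic in this example can be checked with a short Python sketch. It uses ideal line rates with no overhead, takes 10Mbps as 10,000Kbps, and assumes a suite of 20 computers (a figure chosen for illustration, not taken from the example) whose downloads happen one after another.

```python
def transfer_seconds(size_kilobytes, link_kilobits_per_second):
    """Ideal transfer time, ignoring protocol overhead and contention."""
    kilobytes_per_second = link_kilobits_per_second / 8  # 8 bits per byte
    return size_kilobytes / kilobytes_per_second


GRAPHIC_KB = 256
ISDN_KBPS = 128
LAN_KBPS = 10_000
SUITE_SIZE = 20  # assumed number of computers in the suite

isdn_seconds = transfer_seconds(GRAPHIC_KB, ISDN_KBPS)  # 16 seconds
lan_seconds = transfer_seconds(GRAPHIC_KB, LAN_KBPS)    # about 0.2 seconds

# Without a cache, every copy crosses the ISDN link.
without_cache = SUITE_SIZE * isdn_seconds

# With a cache, one copy crosses the ISDN link and the rest come over the LAN.
with_cache = isdn_seconds + SUITE_SIZE * lan_seconds
```

Under these assumptions the whole suite is served in roughly 20 seconds with a cache, against more than five minutes without one.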

How does a web cache work?

The flowchart below illustrates what happens when a user requests a web page.

The thicker lines represent the normally higher-speed local connections between the client and the cache, while the thinner lines represent the slower connection speeds over the Internet.

© Becta 2004. Valid at September 2004.

Review at December 2004


Where are web caches used?

Caches may be installed in different locations on networks for a variety of reasons:

  • Local caches are the most common type; they sit on the edge of the LAN just before the Internet connection. All outbound web requests are directed through them in an effort to fulfil web requests locally before passing traffic over the Internet connection.
  • ISP caches are used on the networks of most Internet Service Providers (ISPs). They provide customers with improved performance and conserve bandwidth on their own external connections to the Internet.
  • Reverse caches are used to reduce the workload of content providers’ web servers. The cache is positioned between the web server and its Internet connection, so that when a remote user requests a web page, the request must first pass through the cache before reaching the web server. If the cache has a stored copy of the requested item, it delivers it directly rather than passing the request through to the web server.

This document concentrates on local caches, although most of the information applies to all caches.

The diagram below shows the different positions that caches can occupy on networks. As a request for information passes from the LAN to the content provider it passes through several caches, each trying to fulfil the request from its own repository. Sometimes a request will never reach the content provider’s host web server, instead being fulfilled by a cache somewhere en route.
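The path of a request through several caches can be sketched as follows. This is a simplified model in which each cache is a Python dictionary; the function name, the cache names and the example URLs are hypothetical.

```python
def resolve(url, caches, origin):
    """Try each cache in turn, nearest first, then fall back to the origin.

    caches is a list of (name, store) pairs; origin maps URLs to content.
    Returns the content and the name of whichever system answered.
    """
    for position, (name, store) in enumerate(caches):
        if url in store:
            body, answered_by = store[url], name
            break
    else:
        position = len(caches)
        body, answered_by = origin[url], "origin server"
    # Every cache the request passed through keeps a copy for next time.
    for _, store in caches[:position]:
        store[url] = body
    return body, answered_by


local, isp, reverse = {}, {"http://example.org/news": b"news"}, {}
route = [("local cache", local), ("ISP cache", isp), ("reverse cache", reverse)]
origin = {"http://example.org/page": b"page", "http://example.org/news": b"news"}

first = resolve("http://example.org/page", route, origin)   # answered by the origin
second = resolve("http://example.org/page", route, origin)  # answered locally
news = resolve("http://example.org/news", route, origin)    # answered by the ISP cache
```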

What are the advantages and disadvantages of caching?

Advantages:

  • Fast performance on cached content – if content is already in the cache it is returned more quickly, even for multiple users wanting to access the same content.
  • Improved user perception and productivity – quicker delivery of content means less waiting time and increased user satisfaction with the performance of the system.
  • Less bandwidth used – if content is cached locally on the LAN, web requests do not consume Internet connection bandwidth.
  • User monitoring and logging – if a cache manages all web requests (behaving in some ways like a proxy), a centralised log can be kept of all user access. Care must be taken that any information held is in accordance with appropriate privacy regulations and the institution's policy.
  • Caching benefits both the single end user and the content providers – ISPs and other users of the same infrastructure all benefit greatly from the reduction in bandwidth usage.

Disadvantages:

  • Slower performance – if an object is not cached an extra layer is added to the process, which adds time.
  • Subscription sites may become confused – some subscription services use IP addresses for authentication, and requests that pass through a cache appear to come from the cache’s address rather than the user’s. The advent of dynamic client bypass technology, which passes the user’s original IP address to the host server, coupled with an increase in the use of other methods of authentication by content providers, means this is becoming less common, however.
  • Additional hardware or expertise may be required – any new system will potentially require extra hardware and software resources, with ongoing support needed after installation.
  • Dynamically generated content cannot be cached – the results of CGI scripts or certain types of database content are increasingly common on the World Wide Web, but cannot be cached.

What is the difference between transparent and non-transparent caches?

Caches can differ according to their so-called transparency, which will affect the degree of configuration required for a network during device installation.

Transparent caches – do not require any settings to be changed on individual client machines. Instead, the network router or switch is configured to forward all requests automatically through to the cache. This has the advantage of allowing a cache to be easily introduced and removed without reconfiguring the client computers. However, it can generate confusing error messages if a page is not found and make finding the location of any problems difficult.

Non-transparent caches – require the settings on each client computer to be changed to point at the appropriate cache. In this case, error messages will normally show clearly if a problem is with the cache itself. However, should a change of cache server be required, perhaps for maintenance reasons, the clients may have to be reconfigured with the new cache’s information.
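For a non-transparent cache, "pointing a client at the cache" usually means setting the client's HTTP proxy. Browsers do this through their connection settings; the Python sketch below shows the same idea with the standard library's `urllib.request`. The host name and port are hypothetical and would be replaced with the details of the actual cache.

```python
import urllib.request

# Hypothetical address of the school's cache.
CACHE_URL = "http://cache.school.example:3128"

# Send all plain HTTP requests made with this opener via the cache.
proxy_handler = urllib.request.ProxyHandler({"http": CACHE_URL})
opener = urllib.request.build_opener(proxy_handler)

# A request such as opener.open("http://www.example.org/") would now be
# sent to the cache, which fetches the page from the host server if needed.
```

If the cache server changes, `CACHE_URL` has to be updated on every client, which is the reconfiguration burden noted above.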

How well will a cache work in a classroom situation?

Caches can enhance the ways in which the Internet is used in the classroom. Teachers can pre-load a cache with particular web sites in advance of a lesson, either by simply visiting the required sites with a computer that uses the cache or by having the content pre-positioned into the cache by a management system.

For example: if a school buys content from a commercial provider they might opt to pre-load or copy it to their local cache in advance of using it in lessons. When teachers and pupils wanted to use this content they would then be accessing it from the LAN rather than from the Internet.

This would give fast, high-quality access without delays and without large numbers of students having to share an Internet connection. The worst-case scenario in a cached environment is that the first user to request a page will experience a slightly longer delay than normal.

Some solutions, however, do not have the capability to cache multimedia content such as real time streaming media files. It is possible for more advanced cache solutions to reduce bandwidth by only requesting one stream from the host web server and then splitting that stream to many computers on the LAN. Multimedia content stored in static files – where the whole file must be downloaded before it can be played – will in most cases be cached as normal.

How do I install a cache?

Installing a web cache on a LAN is relatively straightforward. An additional computer system or dedicated appliance is connected to the LAN, and the clients or router are configured, if required, to access this system. The cache itself is installed through a software program executed either on its own dedicated hardware or as one of many programs running on a shared server.

Microsoft's Internet Security and Acceleration (ISA) Server is based on Windows 2000 and can run on its own or on a Windows 2000 server with other software. Similar arrangements are possible with Linux-based systems running software such as Squid. The main alternative would be a discrete hardware-based solution optimised for this specific role. Examples of such a solution include Volera or a Cisco unit.

The appropriateness of each solution depends on a number of factors, including the number of simultaneous users, available bandwidth and available resources. Functionality can also vary – developments in this area are concerned with moving away from just caching static HTML pages towards accelerating the whole web experience.

An institution should be able to function adequately with a single server of reasonable specification. For Linux, a Pentium II with sufficient memory is enough; versions of Squid are also available for NT, in which case the server specification should be increased according to the supplier’s instructions. Specialist solutions including multiple servers need only be considered for LEA-wide services and larger.

The National JANET Web Cache Service has an article on sizing servers using the Squid proxy on Linux. This service runs approximately 40 servers for the HE & FE community, and requests regularly exceed one million a day.

Example:
In a single small or medium-sized institution, a basic web cache system could be easily implemented on a Pentium II processor with 64MB-256MB of memory and 2GB-20GB of hard disk storage using Linux and the Squid caching software. The minimum requirements for the Microsoft Internet Security and Acceleration Server are a Pentium III processor with 256MB of RAM.

For larger or more complex installations, it is sensible to consult a network system specialist. It is possible to connect caches together to improve the efficiency of the service and provide multiple layers. To do this, the onward ISP should be consulted.

What are the costs and cost savings of a cache?

It is difficult to put a precise cost on the benefits of a cache service, as its success will depend on the nature of the users. For example, in a sixth-form environment, where many students are looking at different pages, the benefits are less obvious than in a school environment, where groups or classes access the same material simultaneously. In the latter example, it could be said that cache solutions multiply bandwidth and, it follows, provide a kind of cost saving as they are providing users with a service equivalent to a higher bandwidth connection.
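The idea that a cache "multiplies bandwidth" can be put into rough numbers. The sketch below assumes that a known fraction of the requested bytes is served from the cache (called the byte hit ratio here, a term not used elsewhere in this paper) and that the rest crosses the external link; real traffic is less tidy than this.

```python
def effective_bandwidth_kbps(link_kbps, byte_hit_ratio):
    """Bandwidth the users appear to have, given the share served locally.

    byte_hit_ratio is the fraction of requested bytes served from the cache,
    from 0 (nothing cached) up to, but not including, 1.
    """
    if not 0 <= byte_hit_ratio < 1:
        raise ValueError("byte_hit_ratio must be at least 0 and less than 1")
    # Only the missed share of the traffic has to cross the external link.
    return link_kbps / (1 - byte_hit_ratio)


no_sharing = effective_bandwidth_kbps(128, 0.0)     # 128.0: no benefit
class_lesson = effective_bandwidth_kbps(128, 0.75)  # 512.0: like a link four times faster
```

A class working through the same material would sit towards the high end of the ratio; students browsing independently would sit towards the low end.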

The costs of implementing a solution are equally variable and will include purchase of hardware and software, installation and maintenance. The cheapest cache solution in terms of capital cost is likely to be a Linux-based solution, which uses free software and can run on reasonably low-specification hardware. The total cost of ownership should be considered with any system – although most server software is reasonably reliable and will run for long periods without any attention, costs of maintenance and support may be a factor.

Some vendors provide systems offering caching facilities on a rental basis where, instead of purchasing hardware, an annual fee is payable to cover installation and ongoing support for the service.

Implementing a cache with good management and reporting facilities can identify usage patterns, cache effectiveness and bandwidth consumption. If the administrator uses these reports effectively they will show how much of the Internet connection’s bandwidth is being used and whether the current connection is meeting demand for that site.
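As an illustration of the kind of figures such reports contain, the sketch below summarises a handful of log records. The record format, a status of "HIT" or "MISS" with a size in bytes, is invented for the example; real products each have their own log formats.

```python
def summarise(records):
    """Summarise cache effectiveness from (status, size_in_bytes) records."""
    if not records:
        return {"request_hit_ratio": 0.0, "byte_hit_ratio": 0.0, "external_bytes": 0}
    total_bytes = sum(size for _, size in records)
    hit_sizes = [size for status, size in records if status == "HIT"]
    return {
        # Share of requests answered from the cache.
        "request_hit_ratio": len(hit_sizes) / len(records),
        # Share of the data volume answered from the cache.
        "byte_hit_ratio": sum(hit_sizes) / total_bytes if total_bytes else 0.0,
        # Data that still had to cross the Internet connection.
        "external_bytes": total_bytes - sum(hit_sizes),
    }


sample_log = [("HIT", 2000), ("MISS", 6000), ("HIT", 2000), ("MISS", 10000)]
report = summarise(sample_log)
```

Here half of the requests were hits but only a fifth of the bytes, which shows why both ratios are worth reporting when judging whether the connection is meeting demand.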

What other functions do caches have?

Caches have progressed from being merely software applications that control a store of information to being managed appliances designed specifically for content delivery. Some of the advanced functions are described below:

  • Stream splitting is when a stream of data from a host server is divided at the cache for transmission to multiple LAN computers. If five users request a 1Mbps stream of video from, say, a BBC web site without a stream splitting-enabled cache, this would take up 5Mbps of the Internet connection bandwidth. With stream splitting this would only take up 1Mbps of the Internet connection bandwidth.
  • Content filtering functions can be integrated with cache software so that a cache can block access to certain web sites depending on their content. Filtering normally adds to the cost of the cache, but it also reduces the bandwidth consumed by not allowing access to inappropriate sites.
  • WCCP (Web Cache Communication Protocol) is a protocol that transparently routes all web traffic to the local cache before it leaves the LAN. It also provides extra features such as load balancing and multicasting, as well as certain security functions.
  • Scalability of caches and linking caches together can improve performance greatly. If one cache knows what another cache has in its repository it can redirect requests to that cache as and when required.
  • Overload bypass is a feature that allows the cache to pass traffic that it is too busy to deal with to web servers rather than have those requests for information held in a queue at the cache.
  • As caches increase in intelligence and complexity they offer increased report and management functionality. Logs and reports can be produced for each user, for each web site visited, the time of visit etc. These reports can be used to assist in the efficient running of a cached environment and optimise available bandwidth.
  • Pre-positioning is the downloading of content to the cache appliance before the user requests it. This becomes more crucial when accessing video and rich media clips. The burden is taken off the network during peak usage hours if this download can be scheduled to occur out of hours.
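The saving from stream splitting in the list above can be worked through in Python. The sketch assumes every stream runs at the same rate and that viewers of a stream are watching it at the same time; the function and stream names are hypothetical.

```python
def external_stream_load_mbps(viewers_per_stream, stream_mbps, splitting):
    """Internet bandwidth used by live streams, with or without splitting.

    viewers_per_stream maps each stream name to its number of LAN viewers.
    """
    if splitting:
        # One copy of each watched stream crosses the Internet connection.
        copies = sum(1 for viewers in viewers_per_stream.values() if viewers > 0)
    else:
        # Every viewer pulls a separate copy from the host server.
        copies = sum(viewers_per_stream.values())
    return copies * stream_mbps


viewers = {"bbc-video": 5}
without_splitting = external_stream_load_mbps(viewers, 1, splitting=False)  # 5
with_splitting = external_stream_load_mbps(viewers, 1, splitting=True)      # 1
```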

How does a cache differ from a proxy?

A cache server is not the same as a proxy server. Cache servers have a proxy function with regard to requests for certain content from the World Wide Web. When a client passes all their requests for web objects via a cache, this cache is effectively acting as a proxy server. Caching is a common function of proxy servers.

Proxy servers perform a number of other functions, too, mainly centred on security and administrative control. Broadly speaking, a proxy server sits between a number of clients and the Internet. Any requests made to the Internet from a LAN computer are forwarded to the proxy server which will then make the requests itself.

The key differences between a proxy and caches are:

  • A proxy server will handle more requests than just those for web content.
  • A proxy server does not by default cache any data that passes through it.

There are certain security benefits based on the fact that proxy servers hide other computers on the network from the Internet, so that individual machines cannot be directly targeted for attack. The requirement for 'public' IP addresses is also removed, so that any number of computers can share one public address that is configured to the proxy rather than each computer needing a unique IP address. This has implications for video conferencing and other point-to-point applications, which might require some additional resource or configuration.

What standards are there for caches?

CERN is a standard for application-aware proxy services over HTTP-based client/server communications. A CERN server is slow and not suitable for heavy traffic.

ICP is the Internet Cache Protocol, which exchanges data between caches about the existence of stored information.
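The idea behind this kind of exchange can be sketched in Python: before going to the host server, a cache asks its neighbours whether they hold the object. This is an in-memory illustration only. It does not implement the ICP message format, and the class and method names are hypothetical.

```python
class LinkedCache:
    """Toy cache that can ask sibling caches whether they hold an object."""

    def __init__(self, name, objects=None):
        self.name = name
        self.objects = dict(objects or {})
        self.siblings = []

    def has(self, url):
        """Answer a sibling's query: True for a hit, False for a miss."""
        return url in self.objects

    def lookup(self, url):
        """Return (content, answering cache name), or (None, None) on a miss."""
        if url in self.objects:
            return self.objects[url], self.name
        for sibling in self.siblings:
            if sibling.has(url):
                body = sibling.objects[url]
                self.objects[url] = body  # keep a local copy for next time
                return body, sibling.name
        return None, None  # the caller would now go to the host server


school_a = LinkedCache("school A")
school_b = LinkedCache("school B", {"http://example.org/map": b"map data"})
school_a.siblings.append(school_b)

from_sibling = school_a.lookup("http://example.org/map")
from_local = school_a.lookup("http://example.org/map")
nowhere = school_a.lookup("http://example.org/other")
```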

WCCP is a Cisco router control protocol that transparently routes TCP port 80 packets to cache appliances and incorporates value-added features such as load balancing, security features and multicasting.

What about storing multiple copies of content in a cache?