Cache Memory (Key Words: Cache Hit, Cache Miss, Hit Rate, Latency)

Introduction to Computer Administration

Week-2

Advanced Concepts related to Computer Parts and Types

1. System Clock

2. MIPS / TFLOPS

3. Cache

4. DMA (Direct Memory Access)

5. Pipeline

6. SMP (symmetric multiprocessing)

7. Single-Point-Of-Failure

System Clock (Clock Rate, MHz, GHz)

The clock rate is the frequency, in cycles per second (measured in hertz), of the clock in any synchronous circuit. A crystal oscillator typically provides a fixed sinusoidal frequency reference, and electronic circuitry translates that reference into a corresponding square wave for digital electronics applications; the clock rate is the frequency of that square wave. In this context the word "speed" (which implies physical movement) should not be confused with frequency or its corresponding clock rate. Thus, the term "clock speed" is a misnomer.

CPU manufacturers typically charge premium prices for CPUs that operate at higher clock rates. For a given CPU, the clock rates are determined at the end of the manufacturing process through actual testing of each CPU. CPUs that are tested as complying with a given set of standards may be labeled with a higher clock rate, e.g., 1.50 GHz, while those that fail the standards of the higher clock rate yet pass the standards of a lesser clock rate may be labeled with the lesser clock rate, e.g., 1.33 GHz, and sold at a relatively lower price.

Limits to clock rate

The clock rate of a CPU is normally determined by the frequency of an oscillator crystal. The first commercial PC, the Altair 8800 (by MITS), used an Intel 8080 CPU with a clock rate of 2 MHz (2 million cycles/second). The original IBM PC (c. 1981) had a clock rate of 4.77 MHz (4,772,727 cycles/second). In 1995, Intel's Pentium chip ran at 100 MHz (100 million cycles/second), and in 2002, an Intel Pentium 4 model was introduced as the first CPU with a clock rate of 3 GHz (three billion cycles/second, corresponding to about 3.3 × 10^-10 seconds per cycle).
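The time per cycle is simply the reciprocal of the clock rate. A minimal Python sketch (the function name cycle_time_seconds is an illustrative assumption, not part of the original notes) computes it for the machines mentioned above:

```python
def cycle_time_seconds(clock_rate_hz: float) -> float:
    """Return the duration of one clock cycle: period = 1 / frequency."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz

# Clock rates taken from the paragraph above (in hertz).
examples = {
    "Altair 8800 (Intel 8080)": 2e6,
    "IBM PC": 4_772_727,
    "Pentium (1995)": 100e6,
    "Pentium 4 (2002)": 3e9,
}

for name, hz in examples.items():
    print(f"{name}: {cycle_time_seconds(hz):.3e} s per cycle")
```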

With any particular CPU, replacing the crystal with another crystal that oscillates at half the frequency ("underclocking") will generally make the CPU run at half the performance. It will also make the CPU produce roughly half as much waste heat.

The clock rate of a computer is only useful for providing comparisons between computer chips in the same processor family. An IBM PC with an Intel 80486 CPU running at 50 MHz will be about twice as fast as one with the same CPU, memory and display running at 25 MHz. The same will not be true for a MIPS R4000 running at the same clock rate, because the two are different processors with different functionality. Furthermore, there are many other factors to consider when comparing the performance of entire computers, such as the clock rate of the computer's front-side bus (FSB), the clock rate of the RAM, the width in bits of the CPU's bus, and the amount of Level 1, Level 2 and Level 3 cache. In many cases a computer's performance also depends on factors outside the CPU, such as the speed of access to storage devices like hard drives.

Clock rates should not be used when comparing different computers or different processor families. Rather, some software benchmark should be used. Clock rates can be very misleading since the amount of work different computer chips can do in one cycle varies. For example, RISC CPUs tend to have simpler instructions than CISC CPUs (but higher clock rates), and superscalar processors can execute more than one instruction per cycle on average, yet it is not uncommon for them to accomplish "less" in a given clock cycle. In addition, subscalar designs and the use of parallelism also affect the performance of a computer regardless of clock rate.
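One way to see why clock rate alone misleads is to treat throughput as the clock rate multiplied by the average number of instructions completed per cycle (IPC). The Python sketch below compares two hypothetical CPUs; the IPC values are invented for illustration and do not describe real products.

```python
def instructions_per_second(clock_rate_hz: float, ipc: float) -> float:
    """Approximate throughput as clock rate times average instructions per cycle."""
    return clock_rate_hz * ipc

# Hypothetical CPUs (assumed values, not measurements).
cpu_a = instructions_per_second(3.0e9, 1.0)   # higher clock, 1 instruction per cycle
cpu_b = instructions_per_second(2.0e9, 2.5)   # lower clock, superscalar

print(f"CPU A: {cpu_a / 1e6:,.0f} MIPS")
print(f"CPU B: {cpu_b / 1e6:,.0f} MIPS")
print("B outperforms A despite the lower clock:", cpu_b > cpu_a)
```

Even this model is a simplification: it counts instructions, not useful work, which is why the text recommends software benchmarks for cross-family comparisons.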

Clock rates:

1 Hz (hertz) = 1 cycle per second

1 kHz = 10^3 cycles per second

1 MHz = 10^6 cycles per second

1 GHz = 10^9 cycles per second

1 THz = 10^12 cycles per second

MIPS

Processor / Instructions per second / Instructions per second per MHz / Year / Source
Intel 486DX / 54 MIPS at 66 MHz / 0.818 MIPS/MHz / 1992
DEC Alpha 21064 EV4 / 300 MIPS at 150 MHz / 2 MIPS/MHz / 1992 / [4]
Motorola 68060 / 88 MIPS at 66 MHz / 1.33 MIPS/MHz / 1994
Intel Pentium Pro / 541 MIPS at 200 MHz / 2.705 MIPS/MHz / 1996 / [5]
ARM 7500FE / 35.9 MIPS at 40 MHz / 0.897 MIPS/MHz / 1996
PowerPC G3 / 525 MIPS at 233 MHz / 2.253 MIPS/MHz / 1997
Zilog eZ80 / 80 MIPS at 50 MHz / 1.6 MIPS/MHz / 1999 / [6]
Intel Pentium III / 1,354 MIPS at 500 MHz / 2.708 MIPS/MHz / 1999
Freescale MPC8272 / 760 MIPS at 400 MHz / 1.9 MIPS/MHz / 2000 / [7] Integrated Communications Processors
AMD Athlon / 3,561 MIPS at 1.2 GHz / 2.967 MIPS/MHz / 2000
AMD Athlon XP 2400+ / 5,935 MIPS at 2.0 GHz / 2.967 MIPS/MHz / 2002
Pentium 4 Extreme Edition / 9,726 MIPS at 3.2 GHz / 3.039 MIPS/MHz / 2003
ARM Cortex A8 / 2,000 MIPS at 1.0 GHz / 2.0 MIPS/MHz / 2005 / [8]
AMD Athlon FX-57 / 12,000 MIPS at 2.8 GHz / 4.285 MIPS/MHz / 2005
AMD Athlon 64 3800+ X2 (Dual Core) / 14,564 MIPS at 2.0 GHz / 7.282 MIPS/MHz / 2005 / [9]
Xbox360 IBM "Xenon" Triple Core / 19,200 MIPS at 3.2 GHz / 2.0 MIPS/MHz / 2005
PS3 Cell BE (PPE only) / 10,240 MIPS at 3.2 GHz / 3.2 MIPS/MHz / 2006
AMD Athlon FX-60 (Dual Core) / 18,938 MIPS at 2.6 GHz / 7.283 MIPS/MHz / 2006 / [9]
Intel Core 2 Extreme X6800 / 27,079 MIPS at 2.93 GHz / 9.242 MIPS/MHz / 2006 / [9]
Intel Core 2 Extreme QX6700 / 49,161 MIPS at 2.66 GHz / 18.481 MIPS/MHz / 2006 / [10]
P.A. Semi PA6T-1682M / 8,800 MIPS at 2.0 GHz / 4.4 MIPS/MHz / 2007 / [11]
Intel Core 2 Extreme QX9770 / 59,455 MIPS at 3.2 GHz / 18.580 MIPS/MHz / 2008 / [12]
Intel Core i7 Extreme 965EE / 76,383 MIPS at 3.2 GHz / 23.860 MIPS/MHz / 2008 / [13]
AMD Phenom II X4 940 Black Edition / 42,820 MIPS at 3.0 GHz / 14.273 MIPS/MHz / 2009 / [14]
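
The per-MHz column is the MIPS figure divided by the clock rate in MHz. A small Python sketch (the helper name mips_per_mhz is an assumption made for this example) recomputes it for a few rows of the table:

```python
def mips_per_mhz(mips: float, clock_mhz: float) -> float:
    """Instructions per second normalized by clock rate (MIPS divided by MHz)."""
    return mips / clock_mhz

# (processor, MIPS, clock in MHz) copied from rows of the table above.
rows = [
    ("Intel 486DX", 54, 66),
    ("Intel Pentium Pro", 541, 200),
    ("Intel Pentium III", 1_354, 500),
    ("Intel Core 2 Extreme X6800", 27_079, 2_930),
]

for name, mips, mhz in rows:
    print(f"{name}: {mips_per_mhz(mips, mhz):.3f} MIPS/MHz")
```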

TFLOPS (10^12 FLoating point Operations Per Second)

In computing, FLOPS (or flops or flop/s) is an acronym meaning FLoating point Operations Per Second. FLOPS is a measure of a computer's performance, especially in fields of scientific calculation that make heavy use of floating point arithmetic. It is similar to the older, simpler measure of instructions per second.

Computer Performance
Name / FLOPS
yottaFLOPS / 10^24
zettaFLOPS / 10^21
exaFLOPS / 10^18
petaFLOPS / 10^15
teraFLOPS / 10^12
gigaFLOPS / 10^9
megaFLOPS / 10^6
kiloFLOPS / 10^3
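
As a quick illustration of these prefixes, the following Python sketch converts a raw FLOPS figure into the largest fitting unit. The function name format_flops and the sample value are assumptions made for the example.

```python
# SI prefixes from the table above, largest first.
PREFIXES = [
    ("yotta", 10**24), ("zetta", 10**21), ("exa", 10**18), ("peta", 10**15),
    ("tera", 10**12), ("giga", 10**9), ("mega", 10**6), ("kilo", 10**3),
]

def format_flops(flops: float) -> str:
    """Express a FLOPS value using the largest prefix that does not exceed it."""
    for name, factor in PREFIXES:
        if flops >= factor:
            return f"{flops / factor:.2f} {name}FLOPS"
    return f"{flops:.2f} FLOPS"

# Hypothetical machine sustaining 2.5 * 10^12 floating point operations per second.
print(format_flops(2.5e12))
```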

Cache Memory

(Key words: Cache Hit, Cache Miss, Hit Rate, Latency, Cache Types)

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
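
The effect of the hit rate on average latency can be made concrete with a weighted average: hits cost the cache latency and misses cost the main memory latency. The Python sketch below is a simplified model (it ignores the extra lookup time spent on a miss), and the latencies of 1 ns and 100 ns are assumed values for illustration only.

```python
def average_latency(hit_rate: float, cache_ns: float, memory_ns: float) -> float:
    """Weighted average of cache and main memory latency for a given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

# Assumed latencies: 1 ns for the cache, 100 ns for main memory.
for rate in (0.50, 0.90, 0.99):
    print(f"hit rate {rate:.0%}: {average_latency(rate, 1.0, 100.0):.2f} ns")
```

Under these assumptions, raising the hit rate pulls the average toward the cache latency, which is the point the paragraph makes.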

When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.

The diagram above shows two memories: main memory and the cache. Each location in each memory holds a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The cache line is usually larger than the typical access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag that contains the main memory index of the datum that has been cached. In a CPU's data cache these entries are called cache lines or cache blocks.

Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.

When the processor needs to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a measure of the effectiveness of the cache.

In the case of a cache miss, most caches allocate a new entry, which comprises the tag just missed and a copy of the data from memory. The reference can then be applied to the new entry just as in the case of a hit. Misses are comparatively slow because they require the data to be transferred from main memory. This transfer incurs a delay since main memory is much slower than cache memory, and also incurs the overhead for recording the new data in the cache before it is delivered to the processor.
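
The hit and miss behavior described above can be sketched with a toy direct-mapped cache in Python. This is a simplified model, not how any particular CPU is built: the class name, the 4-line cache, the 16-byte line size, and the address trace are all assumptions chosen for illustration.

```python
class DirectMappedCache:
    """Toy cache in which each memory block maps to exactly one cache line."""

    def __init__(self, num_lines: int = 4, line_size: int = 16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # one tag per line; None means empty
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Look up a byte address; return True on a hit, False on a miss."""
        block = address // self.line_size   # memory block containing the address
        index = block % self.num_lines      # cache line the block maps to
        tag = block // self.num_lines       # distinguishes blocks sharing a line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: allocate the line for the new block, replacing any old entry.
        self.tags[index] = tag
        self.misses += 1
        return False

    def hit_rate(self) -> float:
        """Proportion of accesses so far that were hits."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = DirectMappedCache()
trace = [0, 4, 16, 20, 64, 0, 16, 32]   # hypothetical byte addresses
for addr in trace:
    print(f"address {addr:3d}: {'hit' if cache.access(addr) else 'miss'}")
print(f"hit rate: {cache.hit_rate():.3f}")
```

In this trace, addresses 0 and 64 map to the same cache line, so they evict each other; nearby addresses such as 0 and 4 share a line and hit after the first access.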

DMA (Direct Memory Access)

Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit. Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-on-chips, where each processing element is equipped with a local memory (often called scratchpad memory) and DMA is used for transferring data between the local memory and the main memory. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed concurrently.

Without DMA, using programmed input/output (PIO) mode for communication with peripheral devices, or load/store instructions in the case of multicore chips, the CPU is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU initiates the transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical. A related application area is stream processing in its various forms, where data processing and transfer must run in parallel in order to achieve sufficient throughput.
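
The sequence of initiating a transfer, continuing with other work, and being notified on completion can be mimicked in software. The Python sketch below is only an analogy: a background thread stands in for the DMA controller, and a callback stands in for the completion interrupt. All names (dma_transfer, on_complete) are hypothetical; real DMA is configured through hardware registers or operating system drivers, not through code like this.

```python
import threading

def dma_transfer(source: bytes, dest: bytearray, on_complete) -> threading.Thread:
    """Copy source into dest on a background thread, then signal completion."""
    def worker():
        dest[:len(source)] = source   # the stand-in "controller" moves the data
        on_complete()                 # stands in for the completion interrupt
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

done = threading.Event()
device_buffer = bytes(range(256)) * 4        # pretend data from a device
ram = bytearray(len(device_buffer))

transfer = dma_transfer(device_buffer, ram, done.set)

# The "CPU" is free to do unrelated work while the copy proceeds.
unrelated_work = sum(i * i for i in range(1000))

done.wait()        # block until the stand-in "interrupt" has fired
transfer.join()
print("transfer complete:", bytes(ram) == device_buffer)
```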

A typical use of DMA is copying a block of memory between system RAM and a buffer on the device. Such an operation does not stall the processor, which as a result can be scheduled to perform other tasks.

Pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).

The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that every step is carrying data at the same time (like water), and each step is connected to the next (like the links of a pipe).
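
The throughput gain can be estimated with a simple timing model. Assuming every stage takes one time unit and instructions never stall (idealized assumptions not stated in the text), n instructions on a k-stage pipeline finish in k + n - 1 units instead of k * n. The function names below are illustrative.

```python
def unpipelined_time(num_instructions: int, num_stages: int, stage_time: float = 1.0) -> float:
    """Each instruction passes through all stages before the next one starts."""
    return num_instructions * num_stages * stage_time

def pipelined_time(num_instructions: int, num_stages: int, stage_time: float = 1.0) -> float:
    """The first instruction fills the pipeline; each later one completes one stage time after it."""
    if num_instructions == 0:
        return 0.0
    return (num_stages + num_instructions - 1) * stage_time

n, k = 100, 4   # 100 instructions on a 4-stage pipeline (assumed values)
slow = unpipelined_time(n, k)
fast = pipelined_time(n, k)
print(f"unpipelined: {slow:.0f} units, pipelined: {fast:.0f} units, "
      f"speedup: {slow / fast:.2f}x")
```

For long instruction streams the speedup in this model approaches the number of stages, though real pipelines fall short of that because stage times differ and instructions sometimes depend on each other.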

Generic pipeline

Generic 4-stage pipeline; the colored boxes represent instructions independent of each other

To the right is a generic pipeline with four stages: