ECE 463/521: Spring 2006

Project 1: Cache Design

Due: Friday, February 24, 2006, 11:00 PM

Project rules

1. All students are encouraged to work in teams of two, using pair programming: two people sit at the same workstation and write the code collaboratively. To find a partner, post a message on the message board at

2. ECE 521 students will have additional parts to do. If a 463 student pairs with a 521 student, they will have to meet the 521 requirements.

3. You may not work with the same partner on more than one project this semester.

4. You must register your partnership by posting on the Pair-Programming Partners message board, under the topic “Project 1.”

5. Sharing of code between teams will be considered cheating, and will be penalized in accordance with the Academic Integrity policy.

6. It is acceptable for you to compare your results with other groups to help debug your program. It is not acceptable to collaborate on the final experiments.

7. You must do all your work in C, C++, or Java. C++ and Java are encouraged because they enable straightforward code-reuse and division of labor.

8. Homework will be submitted over the Wolfware Submit system and run in the Eos/Unity environment.

Project description

This project will study several parameters in cache design, such as line size, associativity, and subblocking. It is based on the Cachegrind simulator that runs inside of the Valgrind instrumentation environment. Cachegrind can be used directly to study line size and associativity, but it must be modified to support subblocking. Running Cachegrind involves simulating the execution of x86 code in an environment where factors such as memory references and cache hits can be counted in software. ECE 521 students will also simulate the translation-lookaside buffer (TLB) for the system. This will require additional modifications to Cachegrind.

Running Cachegrind

We recommend that you run Cachegrind before attempting to modify it. Cachegrind runs only on Linux systems. You have access to two Linux systems, the ECE cluster called Grendel, and the NCSU Virtual Computing Lab. Here are instructions on how to log into each of them.

You can connect to grendel.ece.ncsu.edu using an SSH client. You can download an easy-to-use SSH implementation called PuTTY from

To use VCL, you need to make a reservation at

You will need to enter your unity account ID and password to make a VCL reservation, and in order to use the Valgrind software, you will need to select ECE463/521 from the drop-down menu. When a VCL connection is available, press “connect”. You will be given an IP for a remote computer, and will use this to connect via an SSH client.

When using PuTTY to connect, launch putty.exe. In the popup window, enter “grendel.ece.ncsu.edu”, or the remote computer IP address (from the VCL reservation) under the title “Host Name (or IP Address)”. Then press “Open”.

Enter your unity ID and password when the shell screen asks you to. Once you have logged in, it is an easy matter to run Cachegrind. You invoke Valgrind, specifying Cachegrind as the tool, and passing the program to be run as a parameter. For example, this command runs Cachegrind on the command “ls -l”. In the same way, you could run it on any other program by substituting the name of the program and its parameters for “ls -l”.

valgrind --tool=cachegrind ls -l

Complete information on running Cachegrind can be found at

In the course of your experiments, you may run Cachegrind on whatever programs you choose, but be aware that it runs for a long time (20 to 100 times as long as the uninstrumented program) and produces a lot of output. The format of the output is described in Section 5.2 of the Valgrind manual. We also provide sample output, as explained in Section [X.x], so that you can test your modified simulator to make sure it is functioning correctly.

Modifying Cachegrind

In order to run, modify, and configure your own version of Valgrind, you need to install it. (The installation takes around 70 MB of disk space; for this reason, you may want to use your course workspace in /afs/eos.ncsu.edu/lockers/workspace/ece/ece521/001/loginid.)

1-Create a directory called Valgrind

mkdir Valgrind

cd Valgrind

To check which version of Valgrind is currently installed on your system:

valgrind --version

It should report version 2.2.0 if you are using the default installation on Grendel.

2-Download valgrind-3.1.0.tar.bz2 to the Valgrind directory

3-Decompress the file using the following command:

tar xvjf valgrind-3.1.0.tar.bz2

This command will create a directory named “valgrind-3.1.0”.

cd valgrind-3.1.0

4-Use the following commands to configure and install Valgrind (note that [Valgrind directory path] is the absolute path of your Valgrind directory):

./configure --prefix=[Valgrind directory path]/valgrind310
make
make install

5-Then add Valgrind’s binary directory to your PATH environment variable:

setenv PATH [Valgrind directory path]/valgrind310/bin:${PATH}
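
The setenv command above uses csh/tcsh syntax. If your login shell is bash, the equivalent command is:

export PATH=[Valgrind directory path]/valgrind310/bin:${PATH}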

6-To make sure you have completed the setup successfully, check the version number again:

valgrind --version

If everything goes smoothly, it should output version 3.1.0.

Experiment 1: What cache sizes, line sizes, and associativities give the lowest AAT?

Run Cachegrind on a variety of programs. A good test is to run it on gcc, while compiling some program. However, be aware that gcc is just a front-end for processes that represent the various phases of the compiler, which do most of the work. It is more interesting to run it on the cc1 phase. To do this, you need to save temporary files, then run cc1 by executing Valgrind on the cc1 command that appears in the verbose output from gcc. For example, here’s what you do on Grendel to run Cachegrind on compiling hello.c:

gcc -v hello.c -save-temps

valgrind --tool=cachegrind /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/cc1 -fpreprocessed hello.i -quiet -dumpbase hello.c -version -o hello.s

Test varying configurations of cache size, line size, and associativity for all three caches in an effort to come up with the best average access time (see below). To keep the simulations realistic, note that the line size in the L2 cache should be a multiple of the line size in each of the other two caches. That is, the L2 cache may have the same line size as the I1 and D1 caches, but if it does not, an integral number of lines in the I1 and/or D1 caches should fit into a single line in the L2 cache (e.g., a line in the L2 cache could be twice the size of a line in the I1 and D1 caches).

The metric we will use to compare configurations is average access time (AAT). The AAT is dependent on the access time of the caches, and on the miss rates and the time to process each miss. As the cache gets larger and more associative, the access time tends to rise. This offsets, to some degree, the lower miss ratios achieved by larger and more associative caches. Use CACTI, described below, to calculate the access time, and use this number to calculate the AAT as described in the section called “AAT Calculation.”

Your goal, again, is to come up with the configuration offering the best AAT. But be sensitive to cost: If the best configuration requires an extremely large cache (several MB), give some well-performing alternatives that require reasonable amounts of hardware.

Experiment 2: Subblocking

In this experiment, you will modify Cachegrind to offer an alternative using write-validate and subblocking instead of fetch-on-write (which is provided by the standard Cachegrind).

To keep things simple, assume that a subblock is 4 bytes, the same as the width of a write to memory. Then each write fills a single subblock. Instead of having a single presence bit for each line in the cache, there is now a presence bit for each subblock. On a read-miss to a subblock, a fetch will be initiated to the next lower level of the hierarchy, and the counter of misses for this cache will be incremented by 1. The presence bit will then be turned on, reflecting arrival of the subblock in the cache. Note that a cache line is allocated to this block only if the tag is not present in the corresponding cache set. Otherwise, it must be the case that the line has already been allocated because of a miss to another subblock within the block, so no cache-line replacement is necessary at this time.

Because we are simulating write-validate, a write-miss does not cause a fetch from the next lower level of memory. If the line is not already allocated (i.e., if no tag in the set matches the tag field of the referenced address), a new line is allocated, which does cause a replacement and a write-back, assuming that all lines in the set are occupied. When computing AAT, these write-misses should be counted as taking just as long as a cache hit.

Simplifying assumption: We will assume that all writes are aligned on a subblock boundary. This means that no write access will write adjacent portions of two subblocks. To get the address of the subblock that is written, you may simply right-shift the write address by two bits.

For read-misses, we have a decision to make: Does a read-miss cause only a single subblock to be fetched, or does it cause (pre)fetching of the entire cache block? To implement prefetching accurately is rather complicated, and we will not attempt it. Rather, depending on the value of a command-line parameter, we will assume that only a single subblock is fetched for each miss (pf=no), or that a miss causes fetching of the entire block (pf=yes). The time to fetch the entire block is longer than the time to fetch just a subblock (see the “AAT calculation” section).
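
To make the bookkeeping concrete, here is a minimal sketch in C of per-subblock presence bits with write-validate and both prefetch policies. The structure and names are ours, not Cachegrind’s, and set lookup, LRU replacement, and write-backs are omitted; it assumes 4-byte subblocks and lines of at most 256 bytes (64 presence bits).

#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCK_BYTES 4                /* one subblock = one 4-byte write */

typedef struct {
    uint32_t tag;
    bool     valid;                     /* line allocated in its set       */
    uint64_t present;                   /* one presence bit per subblock   */
} CacheLine;

/* Index of the subblock touched within its line. */
static int subblock_index(uint32_t addr, int line_size)
{
    return (int)((addr & (uint32_t)(line_size - 1)) / SUBBLOCK_BYTES);
}

/* Read: returns true on a hit.  On a miss, one miss is counted, the
 * line is (re)allocated only if its tag is absent from the set, and
 * either one subblock (pf == 0) or the whole block (pf == 1) is
 * marked present. */
static bool read_access(CacheLine *line, uint32_t tag, uint32_t addr,
                        int line_size, int pf, long *misses)
{
    int sb = subblock_index(addr, line_size);
    if (line->valid && line->tag == tag && ((line->present >> sb) & 1))
        return true;                    /* subblock present: hit */
    (*misses)++;
    if (!line->valid || line->tag != tag) {
        line->valid   = true;           /* allocate; may replace a line */
        line->tag     = tag;
        line->present = 0;
    }
    line->present |= pf ? ~0ULL : (1ULL << sb);
    return false;
}

/* Write (write-validate): fills its subblock without fetching from the
 * next level; counted as taking no longer than a hit in the AAT. */
static void write_access(CacheLine *line, uint32_t tag, uint32_t addr,
                         int line_size)
{
    int sb = subblock_index(addr, line_size);
    if (!line->valid || line->tag != tag) {
        line->valid   = true;           /* allocate; may replace a line */
        line->tag     = tag;
        line->present = 0;
    }
    line->present |= 1ULL << sb;
}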

For your experiment, you should compare subblocking and write-validate against no subblocking and fetch-on-write (the default). Try both varieties of subblocking (pf=no and pf=yes). What is the best cache configuration when subblocking is in use? Does subblocking help, and is the best cache configuration different from what it was in Experiment 1?

Experiment 3: TLB simulation (ECE 521 only)

ECE 521 students will also simulate the TLB of the system. The page size of the system should be a parameter that, for a particular run, is set to either 4 KB or 8 KB. You may assume that physical memory is 256 MB (how large does this make the page-frame number?).

TLB description.

o ENTRIES: number of TLB entries

o T_ASSOC: the associativity of the TLB (T_ASSOC = 1 means a direct-mapped TLB)

o LRU replacement policy

Each TLB entry consists of a valid bit, a page number, and a page-frame number. A real architecture would also maintain a write-protect bit for each page, but that is not necessary in our simulation.

Page-table description. Since this is a paged-memory system, it will have a page table. Each page-table entry will consist of a valid bit and a page-frame number. (Again, real page-table entries have additional fields, but they are not required for this simulation.)
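
As a minimal sketch of these two structures in C (the field names are ours, and the widths assume 32-bit virtual addresses and 256 MB of physical memory):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layouts; a real architecture would carry more fields
 * (e.g., a write-protect bit), which this simulation omits. */
typedef struct {
    bool     valid;
    uint32_t page_number;               /* virtual page number        */
    uint32_t frame_number;              /* physical page-frame number */
} TlbEntry;

typedef struct {
    bool     valid;
    uint32_t frame_number;
} PageTableEntry;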

Address translation.

Assume that the addresses generated by the program are virtual addresses. They must be translated to physical addresses before the cache is accessed. Translation proceeds as follows.

  • The address is separated into a page number and a displacement.
  • The page number is looked up in the TLB. If it is present, it is translated to a page-frame number, which is concatenated with the displacement to form the physical address. Then the physical address is looked up in the cache.
  • If the page number is not present in the TLB, an entry must be made for it. In a real architecture, this would require looking it up in the page table. However, to simplify matters, we will simply assign the page-frame number by applying the following function to the page number (sketched in code after this list):
    o Discard the 8 most significant bits of the address, as well as the bits that constitute the displacement.
    o Then take the ones-complement of the remaining bits. This becomes the page-frame number.
  • An entry is then made for this page in the TLB. The entry consists of a valid bit (set to “true”), the page number, and the page-frame number. This entry is placed in the proper place in the TLB, and may replace a pre-existing TLB entry.
  • The cost of the TLB miss is 2 memory reads (2 MEM_DELAY).
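
Here is a minimal sketch of this translation in C. The helper names are ours; offset_bits is the number of displacement bits (12 for 4 KB pages, 13 for 8 KB pages), and 32-bit virtual addresses are assumed.

#include <stdint.h>

static uint32_t page_number(uint32_t vaddr, int offset_bits)
{
    return vaddr >> offset_bits;
}

/* The project's stand-in for a page-table walk: drop the 8 most
 * significant bits of the address and the displacement bits, then
 * take the ones-complement of the remaining (middle) bits. */
static uint32_t assigned_frame(uint32_t vaddr, int offset_bits)
{
    int      keep = 32 - 8 - offset_bits;          /* middle bits kept */
    uint32_t mask = ((uint32_t)1 << keep) - 1;
    uint32_t mid  = (vaddr >> offset_bits) & mask;
    return ~mid & mask;                            /* ones-complement  */
}

/* Physical address: frame number concatenated with the displacement. */
static uint32_t physical_address(uint32_t frame, uint32_t vaddr,
                                 int offset_bits)
{
    uint32_t disp = vaddr & (((uint32_t)1 << offset_bits) - 1);
    return (frame << offset_bits) | disp;
}
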
Data analysis

Use the data collected from your experiments to analyze the relationships among miss rate, AAT (average access time), cache size, block size, number of TLB entries, and the other cache parameters (ideally with tables and graphs). Identify the two data-cache configurations with the best AAT for each program you test.

Overall analysis

From an architectural perspective, analyze and compare the characteristics of the three models (no subblocking, subblocking with pf=no, and subblocking with pf=yes) in light of your experimental results.

CACTI tool

The CACTI tool is used to calculate the access time of a cache based on its configuration (size, associativity, etc.).

Installing CACTI

Log into your Unity account on a Unix (Solaris, etc.) machine. CACTI does not run on Linux.

Using CACTI

We’ve compiled cacti. Just type its name to run it. The executable is in /afs/eos/courses/ece/ece521/common/www/homework/projects/1/cacti

(or

Using our function (simplest)

The simplest way to use CACTI is to call the function that we have provided. Simply copy the function cacti_get_AT from the file callcacti.c in the same directory, and paste it into your code. It has three parameters: cache size in bytes, block size in bytes, and associativity. (For a TLB, use 8 bytes as the block size.) When calling this function, use an associativity of 1 to denote a direct-mapped cache and an associativity of -1 to denote a fully associative cache.

For example, the function might be called like this:

cacti_get_AT(16384, 128, 4)

You are not required to use the cacti_get_AT function. You may call CACTI directly, as follows.

Calling CACTI directly (more flexibility)

If you prefer to call CACTI from your own code, here’s what you need to know:

CACTI Syntax

cacti <csize> <bsize> <assoc> <tech>

csize - size of cache in bytes (e.g., 16384)

bsize - block size of cache in bytes (e.g., 32)

assoc - associativity of cache (e.g., 2, 4, DM, or FA)

For direct mapped caches use 1 or DM

For set-associative caches use a number n for an n-way associative cache

For fully associative caches, use FA

tech - technology size in microns (we use 0.18um)

Example:

eos% cacti 16384 32 2 0.18um

will return the following timing and power analysis for a 16 KB, 2-way set-associative cache with a 32-byte block size at a 0.18 µm feature (technology) size:

...

Technology Size: 0.18um

Vdd: 1.7V

Access Time (ns): 1.20222        <-- this is what we need

Cycle Time (wave pipelined) (ns): 0.400739

Power (nJ): 1.49542

Wire scale from mux drivers to data output: 0.10

...

CACTI doesn’t allow the number of sets to be too small. To simulate a fully associative cache, you have to use the “FA” option.

Example:

eos% cacti 16384 32 FA 0.18um

will then return

...

Access Time (ns): 2.89345

...

Limits of Parameters

No matter whether you use our function or your own code to call CACTI, there are some limits on the parameters:

Parameter              Minimum   Maximum
Cache Size (bytes)     2^6       2^30
Block Size* (bytes)    2^3       2^17
Associativity          1         2^5, or 2^25 (using option FA)
Number of Sets         2^3       2^27

* Block size must be a power of 2.

AAT calculation

The access times of the caches and TLB can be determined using the CACTI tool. Below, T_L1, T_TLB, and T_V refer to the access times (in ns) of the L1 cache, the TLB, and the victim cache, respectively.

For the L1 and L2 caches

AAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
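
For example, with purely illustrative numbers: if the L1 hit time is 1 ns, the L1 miss rate 5%, the L2 hit time 5 ns, the L2 miss rate 20%, and the L2 miss penalty 60 ns, then AAT = 1 + 0.05 × (5 + 0.2 × 60) = 1.85 ns.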

To fetch the first 8 bytes of a block from the L2 cache takes one L2 access time (as given by CACTI). The remaining bytes are transferred at a rate of 8 bytes/cycle. (If the access is to only four bytes, it still takes the full L2 access time.)

It takes 150 cycles to fetch the first 8 bytes of a block from main memory, and the remaining bytes are transferred at a rate of 8 bytes/cycle. (If the access is only to four bytes, that also takes 150 cycles.) Assume that the bytes are loaded into the L1 and L2 caches at the same time.
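
A sketch of this miss-penalty arithmetic in C (the helper names are ours; clock_ns is our assumed cycle length for converting between the cycles quoted here and the nanoseconds reported by CACTI):

/* Cycles to fetch block_bytes from main memory: 150 cycles for the
 * first 8 bytes, then 8 bytes per cycle.  A 4-byte access also costs
 * the full 150 cycles. */
static double mem_fetch_cycles(int block_bytes)
{
    return block_bytes <= 8 ? 150.0
                            : 150.0 + (block_bytes - 8) / 8.0;
}

/* Nanoseconds to fetch block_bytes from L2: one L2 access time (from
 * CACTI) for the first 8 bytes, then 8 bytes per cycle for the rest. */
static double l2_fetch_ns(int block_bytes, double t_l2_ns, double clock_ns)
{
    return block_bytes <= 8 ? t_l2_ns
                            : t_l2_ns + ((block_bytes - 8) / 8.0) * clock_ns;
}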

For L1 and L2 caches + TLB

Assume that TLB lookup is done in parallel with cache access, so there is no impact on access time, except in case of a TLB miss. Take the TLB miss rate and multiply it by the time to service a TLB miss (which will be the time to fetch 8 bytes from main memory). Then proceed as before with the AAT calculation.
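
In formula form, using the notation above: AAT_with_TLB = AAT + Miss rate_TLB × T_TLB-miss, where T_TLB-miss is the time to fetch 8 bytes from main memory.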

Program Interface Requirements

To ensure that your code can be run and graded easily:

  1. Your modified Cachegrind must be able to run in the Linux environment on the Eos/Unity system.
  2. The program must be executable from the command line.
  3. Every simulator must provide a clear interface for setting cache parameters.

What to hand in

  1. Makefile
  2. Source code
  3. Project report (Microsoft Word .doc file recommended)

* No executable files (including cacti) are needed; just submit your source code and makefile. Use zip to compress all your source code and your report into a single file named project1.zip. Inform us if your file is larger than 1 MB.

Grading

0% You do not hand in (submit electronically) anything by the due date.

+10% Your simulator can read trace files from the command line and has the proper interface for parameter settings.

+50% Your simulator produces the correct output.

+40% You have done a good analysis of the experiment.
