Spring 2009 CDA5155 Homework 2

Spring 2009 CDA5155 Homework 2

Date assigned: Feb 12th, 2009

Due date: On-campus Student: Feb 24th, 2009 (11:59 pm)

EDGE students: Mar 3rd, 2009 (11:59 pm)

Primary TA: Gang Liu ()

All homework must be submitted through E-Learning. No late submission will be accepted. You are required to submit homework in Microsoft word or Adobe pdf format and please make sure that your submitted file can be opened on CISE windows machines.

Your submitted homework answer MUST be typewritten using word processing software. Scanned handwritten submissions will NOT be accepted. Necessary calculation procedure or explanation MUST be provided in your answer.

All homework must be done individually.

Total Points: 100 pts

NAME: ______

UFID: ______

Exercise 1 (20pts)

Identify all possible hazards in the following instructions and show clearly which instructions are involved in what type of hazard (data-WAW, data-WAR, data-RAW and control) and why? Consider a simple 5-stage MIPS pipeline as shown in Figure A.18 for this exercise.

Inst 1: LD R1 700(R0)

Inst 2: LD R2 800(R1)

Inst 3: LD R3 900(R2)

Inst 4: ADDI R3,R2, #3

Inst 5: ADDI R4,R1, #3

Inst 6: MUL R2, R2, R3

Inst 7: ADD R5,R3,R2

Inst 8: ADD R6,R6,R5

Inst 9: ADDI R1,R6,#-1

Inst 10: BGTZ R1, #-32

Inst 11: SD R6 1000(R0)

Inst 12: BREAK

Exercise 2 (20pts)

Following is a simple pipeline; with a bypass from the output of the MEM stage (i.e., the

MEM/WB latch) to the EX stage. (Note: There is no bypass from the EX stage, i.e., the

EX/MEM latch, back to the EX stage.) Assume that the register file is written in the first

half and read in the second half of the clock cycle. Hence, a register value can be written

and read in the same cycle.

(a) What’s the purpose of bypassing?

(b) For the following instruction sequence, show the pipeline stages in the table. Part of

the stages for the first instruction is shown.

ADD R1, R2, R3 // add1

ADD R4, R1, R5 // add2

LD R5, 0(R4) // lw1

LD R6, 0(R5) // lw2

SUB R7, R5, R6 // sub1

SW R7, 0(R4) // sw1

Cycle / IF / ID / EXE / MEM / WB
1 / add1
2 / add2 / add1
3
4
5
6
7
8
9
10
11
12
13
14
15

(c)Suggest technique(s) to eliminate the bubbles in part (b). Which bubbles would your

proposed technique(s) eliminate?

Exercise 3 (30pts)

The execution cycles for various instructions are given in the following table. All units

except DIV.D are pipelined. Loads and stores require the Integer ALU for address

generations and the load/store unit for accessing memory.

Instruction / Cycle / Number of unit
Integer ALU / 1 / 2
Load/Store / 3 / 2
ADD.D/SUB.D / 5 / 2
DIV.D / 9 / 1

Simulate the execution of two loop iterations in the provide table. The execution is

carried out on a two-issue dynamic-scheduling pipeline microarchitecture. A single CDB

broadcasts/writebacks results after executions. Note that stores do not access memory

until commit.

Iter / Instruction / Issue / Execute / Memory / Write-CDB
1 / L.D F0, 0(R1)
1 / DIV.D F4, F2, F0
1 / S.D F4, 0(R1)
1 / LW R5, 0(R4)
1 / ADD R5, R5, R2
1 / ST R5, 0(R4)
1 / SUBI R4, R4, #44
1 / ADDI R1, R1, #8
1 / BNE R1, R2, LOOP
2 / L.D F0, 0(R1)
2 / DIV.D F4, F2, F0
2 / S.D F4, 0(R1)
2 / LW R5, 0(R4)
2 / ADD R5, R5, R2
2 / ST R5, 0(R4)
2 / SUBI R4, R4, #44
2 / ADDI R1, R1, #8
2 / BNE R1, R2, LOOP

Exercise 4 (30pts)

(a) (10pts) Assume that a (3, 2) correlated predictor is designed, which uses the recent three branch outcome to determine one of possible predictors. Within each predictor, there are a number entries indexed by the address of the predicted branch. Each entry is a 2-bit counter to predict the outcome of the branch. Assume the total prediction table size is 32Kbits. Calculate the number of index bits from the predicted branch address that is needed to index each predictor table.

(b) (20pts) In a sophisticated tournament branch prediction scheme, a tournament predictor, a global correlated predictor, and a two-level local predictor are designed.

As shown in the figure, the tournament predictor consists of a number of t-bit predictors indexed by i1 bits from the branch address to select the prediction outcome from the global or the local predictor. The global correlated predictor has a number of m-bit predictor indexed by the recent branch outcome history. Finally, the local predictor has 2-level prediction tables. The first-level table is indexed by i2 bits from the branch address. Each entry in the 1st-level table consist h2 nearby branch outcomes of the respective branch. The 2nd-level table consists of a number of n-bit predictors indexed by the selected content of the 1st-level table.

(b.1) Derive the formula to calculate the total bits required to build the tournament predictor using all the given parameters.

(b.2) Describe qualitatively the major space saving in the global correlated predictor comparing with the original correlated branch predictor.