Spring 2009 CDA5155 Homework 2
Date assigned: Feb 12th, 2009
Due date: On-campus Student: Feb 24th, 2009 (11:59 pm)
EDGE students: Mar 3rd, 2009 (11:59 pm)
Primary TA: Gang Liu ()
All homework must be submitted through E-Learning. No late submission will be accepted. You are required to submit homework in Microsoft word or Adobe pdf format and please make sure that your submitted file can be opened on CISE windows machines.
Your submitted homework answer MUST be typewritten using word processing software. Scanned handwritten submissions will NOT be accepted. Necessary calculation procedure or explanation MUST be provided in your answer.
All homework must be done individually.
Total Points: 100 pts
NAME: ______
UFID: ______
Exercise 1 (20pts)
Identify all possible hazards in the following instructions and show clearly which instructions are involved in what type of hazard (data-WAW, data-WAR, data-RAW and control) and why? Consider a simple 5-stage MIPS pipeline as shown in Figure A.18 for this exercise.
Inst 1: LD R1 700(R0)
Inst 2: LD R2 800(R1)
Inst 3: LD R3 900(R2)
Inst 4: ADDI R3,R2, #3
Inst 5: ADDI R4,R1, #3
Inst 6: MUL R2, R2, R3
Inst 7: ADD R5,R3,R2
Inst 8: ADD R6,R6,R5
Inst 9: ADDI R1,R6,#-1
Inst 10: BGTZ R1, #-32
Inst 11: SD R6 1000(R0)
Inst 12: BREAK
Exercise 2 (20pts)
Following is a simple pipeline; with a bypass from the output of the MEM stage (i.e., the
MEM/WB latch) to the EX stage. (Note: There is no bypass from the EX stage, i.e., the
EX/MEM latch, back to the EX stage.) Assume that the register file is written in the first
half and read in the second half of the clock cycle. Hence, a register value can be written
and read in the same cycle.
(a) What’s the purpose of bypassing?
(b) For the following instruction sequence, show the pipeline stages in the table. Part of
the stages for the first instruction is shown.
ADD R1, R2, R3 // add1
ADD R4, R1, R5 // add2
LD R5, 0(R4) // lw1
LD R6, 0(R5) // lw2
SUB R7, R5, R6 // sub1
SW R7, 0(R4) // sw1
Cycle / IF / ID / EXE / MEM / WB1 / add1
2 / add2 / add1
3
4
5
6
7
8
9
10
11
12
13
14
15
(c)Suggest technique(s) to eliminate the bubbles in part (b). Which bubbles would your
proposed technique(s) eliminate?
Exercise 3 (30pts)
The execution cycles for various instructions are given in the following table. All units
except DIV.D are pipelined. Loads and stores require the Integer ALU for address
generations and the load/store unit for accessing memory.
Instruction / Cycle / Number of unitInteger ALU / 1 / 2
Load/Store / 3 / 2
ADD.D/SUB.D / 5 / 2
DIV.D / 9 / 1
Simulate the execution of two loop iterations in the provide table. The execution is
carried out on a two-issue dynamic-scheduling pipeline microarchitecture. A single CDB
broadcasts/writebacks results after executions. Note that stores do not access memory
until commit.
Iter / Instruction / Issue / Execute / Memory / Write-CDB1 / L.D F0, 0(R1)
1 / DIV.D F4, F2, F0
1 / S.D F4, 0(R1)
1 / LW R5, 0(R4)
1 / ADD R5, R5, R2
1 / ST R5, 0(R4)
1 / SUBI R4, R4, #44
1 / ADDI R1, R1, #8
1 / BNE R1, R2, LOOP
2 / L.D F0, 0(R1)
2 / DIV.D F4, F2, F0
2 / S.D F4, 0(R1)
2 / LW R5, 0(R4)
2 / ADD R5, R5, R2
2 / ST R5, 0(R4)
2 / SUBI R4, R4, #44
2 / ADDI R1, R1, #8
2 / BNE R1, R2, LOOP
Exercise 4 (30pts)
(a) (10pts) Assume that a (3, 2) correlated predictor is designed, which uses the recent three branch outcome to determine one of possible predictors. Within each predictor, there are a number entries indexed by the address of the predicted branch. Each entry is a 2-bit counter to predict the outcome of the branch. Assume the total prediction table size is 32Kbits. Calculate the number of index bits from the predicted branch address that is needed to index each predictor table.
(b) (20pts) In a sophisticated tournament branch prediction scheme, a tournament predictor, a global correlated predictor, and a two-level local predictor are designed.
As shown in the figure, the tournament predictor consists of a number of t-bit predictors indexed by i1 bits from the branch address to select the prediction outcome from the global or the local predictor. The global correlated predictor has a number of m-bit predictor indexed by the recent branch outcome history. Finally, the local predictor has 2-level prediction tables. The first-level table is indexed by i2 bits from the branch address. Each entry in the 1st-level table consist h2 nearby branch outcomes of the respective branch. The 2nd-level table consists of a number of n-bit predictors indexed by the selected content of the 1st-level table.
(b.1) Derive the formula to calculate the total bits required to build the tournament predictor using all the given parameters.
(b.2) Describe qualitatively the major space saving in the global correlated predictor comparing with the original correlated branch predictor.