Consider the Following Program Segment Computing the Dot Product of Two Floating-Point

Consider the Following Program Segment Computing the Dot Product of Two Floating-Point

Comp326 Practice Exercises2

Quiz #3

Consider the following program segment computing the dot product of two floating-point vectors whose addresses are stored in registers R1 and R2.

loop: LD F0, 0(R1) ; load A[i]

LD F4, 0(R2) ; load B[i]

MULTD F0, F0, F4 ; compute A[i] * B[i]

ADDD F2, F0, F2 ; accumulate sum in F2

SUBI R1, R1, #8 ; address of A[i-1]

SUBI R2, R2, #8 ; address of B[i-1]

BNEZ R1, loop ; loop if necessary

SD 0(R3), F2 ; store the dot product

Assume that register F2 is initialized to 0 and register R3 contains the address of the memory location at which the dot product is to be stored. Also assume that the initial value of register R1 is 400.

The following table shows the clock cycles for the instructions in the above program.

(a)[3 marks] Suppose this program is run on a processor using Tomasulo’s dynamic scheduling with the following characteristics: the processor has one FP multiply unit, one FP add unit, and one integer unit; the FP units take four Execute cycles and the integer unit takes one Execute cycle; and the branch decision and branch target address are computed in the EX stage of a branch instruction.

Assuming that branches are handled by flushing the pipe, construct an instruction execution profile of the above program on the given processor. From your instruction profile, determine the number of clock cycles required to execute the given program. Clearly state any assumption(s) that you may be making.

Instruction / Issues
at clock-cycle / Executes
at clock-cycle / Writes result
At clock-cycle
Loop: LD F0,0(R1)
LD F4,0(R2)
MULTD F0,F0,F4
ADDD F2,F0,F2
SUBI R1,R1,#8
SUBI R2,R2,#8
BNEZ R1, loop
SD 0(R3),F2

(b)[2 marks] Suppose a hardware branch target buffer (BTB) with the following characteristics is added to the processor in (a).

Pipeline Stage / Function
Issue / Check if the instruction address is in BTB.
If present, fetch branch prediction and branch target instruction.
Execute / Compute branch decision and branch target address.
If the prediction is correct, issue the predicted target instruction.
Write Result / If the prediction is incorrect or if the instruction is not in BTB, update BTB and fetch and issue the correct target instruction.

Determine the branch penalties for the following situations for this processor.

Instruction in BTB / Prediction / Actual branch / Penalty cycles
Yes / Taken / Taken
Yes / Taken / Not taken
No / Taken

(c) [3 marks] How much faster does the given program run on the processor (with the BTB) in (b) compared to the processor (without the BTB) in (a)?

(d) [2 marks] Suppose the branch target buffer (BTB) is modified so that it stores the branch prediction and target instruction (instead of the address of the target instruction) for every branch instruction. Briefly discuss the major advantage of this BTB compared to the one described in (b).