Student Name:______

CISC662 - COMPUTER SYSTEMS: ARCHITECTURE - Fall 2008

HOMEWORK 2

DUE DATE: October 20, 11:59PM (hard deadline)

Part A - Illustrate the impact of forwarding and code order on the performance of simple loops with pipeline diagrams.

Use the following code segment:

LOOP: LWR1, 0(R2)

ADDIR1, R1, #1

SWR1, 0(R2)

ADDIR2, R2, #4

SUBR4, R3, R2

BNZR4, LOOP

Assume that the initial value of R3 is R2 + 396. Throughout this exercise use the RISC integer pipeline and assume all memory accesses are cache hits.

Question 1: Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware but assuming a register read and a register write in the same clock cycle “forwards” through the register file. Use a pipeline-timing chart as in the book. Assume that bnz computes the next PC in MEM stage. Also assume that the branch is handled by flashing the pipeline. If all memory references hit in the cache, how many cycles does this loop take to execute.

Instruction / Clock Cycles
1 2 3 ……

How many times is the loop executed? How many clock cycles does the loop executes in a total?

Question 2: Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Use a pipeline-timing chart as in the book. Assume that the branch is handled by predicting it as not taken. Also assume that bnz computes the next PC in MEM stage. If all memory references hit in the cache, how many cycles does this loop take to execute?

Instruction / Clock Cycles
1 2 3 ……

How many times is the loop executed? How many clock cycles does the loop executes in a total?

Question 3: Assuming the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware, schedule the instructions in the loop including the branch delay slot. Also assume that bnz computes the next PC in ID stage. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline-timing diagram and compute the number of cycles needed to execute the entire loop.

Write here your reordered code:

Pipeline-timing diagram:

Instruction / Clock Cycles
1 2 3 ……

Write here the number of cycles needed to execute the entire loop:

Evaluation Criteria: For each question, questions 1-3, you can get up to 10 points so assigned:

  • You get 10 points if the answer is correct, well written, and, if required, optimized.
  • You get 8 points if the answer has some minor issues or is correct but very poorly presented; overall the answer shows your understanding of the problem.
  • You get 4 points if you show some effort to address the problem but the answer cannot be considered correct; your solution shows some major issues with your understanding of the problem and you are encourage to review the theory.
  • You get 0 points if the answer is incorrect and there is no evidence of your effort in understanding the problem, or you copied the solution, or an answer is missing.

Part 2 - This exercise provide you an opportunity to compare the performance of several different MIPS implementations.

These three basic pipeline structures are considered:

  1. Classic five-stage MIPS with Fetch, Decode, Execution, Memory, Write Black (IF/ID/EX/MEM/WB) pipeline stages.
  2. MIPS implementing Scoreboarding with Issue, Read Operands, Execute, Write Results (IS/RO/EX/WB) pipeline stages.
  3. MIPS implementing Tomasulo’s Algorithm with Issue, Execute, Write Results (IS/EX/WR) pipeline stages.

Each of the sub-exercises explores a variation on one of these general pipelines.

For each of the sub-exercises, consider the following code:

FOO:LD F2, 0(R1) ; load X(i)

MULTD F4, F2, F0; multiply a*X(i)

LD F6, 0(R2); load Y(i)

ADDD F6, F4, F6; add a*X9i) + Y(i)

SD 0(R2), F6; store Y(i)

ADDI R1, R1, #8; increment X index

ADDI R2, R2, #8; increment Y index

SGTI R3, R1, DONE; test if done

BEQZ R3, FOO; loop if not done

….

The code implements the vector operation Y=a * X + Y for a vector of length 100.

Question 1 – For this problem use the standard single-issue MIPS pipeline (classic five-stages described in point 1). Table 1 shows the latency of operations.

Instruction producing result / Instruction using result / Latency in clock cycles
FP ALU op / Another FP ALU op / 3
FP ALU op / Store double / 2
Load double / FP ALU op / 1
Load double / Store double / 0

Table 1: Latencies of FP operations

Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) on the first iteration of the loop. How many clock cycles does each loop iteration take?

To answer this question, assume that all of the functional units are fully pipelined and fully bypassed (forwarding). Contention for MEM or WB pipeline stages is resolved by allowing the instruction that issued earliest to complete. Use the table provided below.

Instructions / First execution cycle / Stall cycles / Comments

Write here your conclusion on the number of cycles that each loop takes and the number of stalls per cycle:

Question 2 - Unroll the code to make four copies of the body and schedule it for the standard single-issue MIPS pipeline (classic five-stages described in point 1) and a fully pipelined FPU with the FP latencies of Table 1. When unwinding, optimize the code: significant reordering of the code may be necessary to maximize performance. Please note that you will get full score ONLY if the code is correct AND with the best optimization. How many clock cycles does each loop iteration take?

Write here your code:

Write here your conclusion on the number of cycles that are needed per iteration. Do you still have stalls per iteration?

Question 3 – Using the MIPS code above, show the state of the scoreboard tables (instruction status table, functional unit status table, and register result status table) when the SGTI instruction reaches write result. Assume that issue and read operands each takes one cycle. Assume that there is one integer functional unit that takes only a single execution cycle (the latency to use is 0 cycles, including loads and stores). Assume the basic structure of a MIPS processor with scoreboarding shown in Figure A.50 with the latency of Figure 2.2 (both in the book). The branch should not be included in the scoreboarding.

Assume these other conditions that will simplify your solution (MANDATORY conditions):

  1. An instruction waiting for a result from another functional unit can pass through read operands at the same time the result is being written.
  2. An instruction in WR completing will allow a currently active instruction that is waiting on the same functional unit to issue in the same cycle as the first instruction completes WR.
  3. Because there is only one integer unit, many integer operations will stall at issue due to the integer unit being busy until WR stage.

Table 2 summarizes the values to use for the number of cycles FP operations spend in the execution stages of the scoreboarding pipeline. Integer operations only require one cycle in EX.

FP Multiply / FP Add
3 / 3

Table 2: Number of cycles FP Multiplies and Additions spend in execution for a scoreboarding MIPS

Instruction status table:

Instruction / Unit / Cycles in / Comments
IS / RO / EX / WR

Functional unit status table:

Functional
Unit / Function Unit Status
Busy / Op / Fi / Fj / Fk / Qj / Qk / Rj / Rk

Register status table:

FP Register Status
Function Unit

Note: please extend/change the tables with e.g., more or less columns and rows as needed.

Write here your comments, if any:

Question 4 – Use the MIPS code above and a fully pipelined. Assume Tomasulo’s algorithm for the hardware with one integer unit taking one execution cycle (a latency of 0 cycles to use) for all integer operations. Show the state of the instruction status, reservation stations, and register-status tables when the SGTI writes its results on the CDB. Do not include the branch. Assume the basic structure of a MIPS processor using Tomasulo’s algorithm shown in Figure 2.9 in the book.

Other assumptions that will simplify your solution:

  1. There are enough FP reservation stations to allow an FP operation of any type to be issued each cycle in the absence of stalls. Note that providing lots of reservation stations and fully pipelining a unit are essentially equivalent.
  2. There are at least three integer reservation stations do that the machine can issue an integer operation each cycle in the absence of stalls.
  3. The EX stage of loads and stores does both the effective address computation and the memory access.

As for the previous exercise, you need to know the number of cycles an instruction spends in execution. The values for the number of cycles in execution have to be consistent with the latencies described in Figure 2.2 in the book. For the Tomasulo’s algorithm, you have to examine the time between the first EX stage of the producing and consuming instructions, as it is at this point that the results are read. It turn out that the amount of time spent in execution is one more than the latencies given in Figure 2.2 in the book. Table 3 summarizes the number you have to use fore the number of cycles FP operations spend in the execution stages of the Tomasulo pipeline. Integer operations only require one cycle in EX as specified in the exercise.

FP Multiply / FP Add
4 / 4

Table 3: Number of cycles FP Multiplies and Additions spend in execution for a Tomasulo MIPS

Instruction status table:

Instruction / Unit &
Res. St. / Cycles in / Comments
IS / EX / WR

Reservation stations table:

Function
Unit / Reservation Station Status
Busy / Op / Vj / Vk / Qj / Qk

Register-status table:

Field / Register Status

Note: please extend/change the tables with e.g., more or less columns and rows as needed.

Write here your comments, if any:

Question 5 – Consider the same assumptions listed in Question 3 but assume 6 cycles for FP Mult and 4 cycles for FP Add.

Show the state of the scoreboard tables (instruction status table, functional unit status table, and register result status table) when the branch issues for the second time. Assume the branch was correctly predicted as taken and took one cycle. How many cycles does each loop iteration take? You may ignore any register port/bus conflicts. Also note that you have only one integer functional unit also used for store and load. Assume the basic structure of a MIPS processor with scoreboarding shown in Figure A.50 in the book.

Instruction status table:

Instruction / Unit / Cycles in / Comments
IS / RO / EX / WR

Functional unit status table:

Functional
Unit / Function Unit Status
Busy / Op / Fi / Fj / Fk / Qj / Qk / Rj / Rk

Register status table:

FP Register Status
Function Unit

Note: please extend/change the tables with e.g., more or less columns and rows as needed.

How many cycles does each loop iteration take?

Question 6 – Consider the same assumptions listed in point Question 4 but assume 7 cycles for FP Mult and 5 cycles for FP Add.

Show the state of the instruction status, reservation stations, and register-status tables the branch is executed for the second time. Assume the branch was correctly predicted as taken. How many cycles does each loop iteration take? Assume the basic structure of a MIPS processor using Tomasulo’s algorithm shown in Figure 2.9 in the book.

Instruction status table:

Instruction / Unit &
Res. St. / Cycles in / Comments
IS / EX / WR

Reservation stations table:

Function
Unit / Reservation Station Status
Busy / Op / Vj / Vk / Qj / Qk

Register-status table:

Field / Register Status

Note: please extend/change the tables with e.g., more or less columns and rows as needed.

Write here your comments, if any:

Evaluation Criteria: For each question, questions 1-6, you can get up to 10 points so assigned:

  • You get 10 points if the answer is correct, well written, and, if required, optimized.
  • You get 8 points if the answer has some minor issues or is correct but very poorly presented; overall the answer shows your understanding of the problem.
  • You get 4 points if you show some effort to address the problem but the answer cannot be considered correct; your solution shows some major issues with your understanding of the problem and you are encourage to review the theory.
  • You get 0 points if the answer is incorrect and there is no evidence of your effort in understanding the problem, or you copied the solution, or an answer is missing.

Part 3 – This exercise provide you with an opportunity to compare the performance of the superscalar architecture with unrolled code to the scalar architectures evaluated in Part 2.

Consider the dual-issue version of the Tomasulo pipeline without speculation. In other words, consider a superscalar architecture that can issue any two independent operations in a clock cycle (including two integer operations). Assume one fully pipelined copy of each FP functional unit (i.e., FP adder and FP multiplier) and two integer functional units with latency equal to 0. Loads and stores are executed by the integer unit that allows them to have a latency of 0.

Unwind the code already given in H3 and also reported below in Figure 1 to make four copies of the body and schedule it assuming the FP execution cycles is given in Table 1:

FP Multiply / FP Add
4 / 3

Table 1: Number of cycles FP Multiplies and Additions spend in execution

IMPORTANT: When unwinding, you should optimize and reorder the code as discussed in Section 2.2 of the book so that you enhance the processor’s ability to exploit ILP and eliminate all the stalls.

Question 1 - How many clock cycles will each iteration on the code take? What is the speedup per iteration versus the Tomasulo pipeline seen in Part 2? Report the previous result before to compare it. Ignore the stall cycle due to the branch.

FOO:LD F2, 0(R1) ; load X(i)

MULTD F4, F2, F0; multiply a*X(i)

LD F6, 0(R2); load Y(i)

ADDD F6, F4, F6; add a*X9i) + Y(i)

SD 0(R2), F6; store Y(i)

ADDI R1, R1, #8; increment X index

ADDI R2, R2, #8; increment Y index

SGTI R3, R1, DONE; test if done

BEQZ R3, FOO; loop if not done

….

Figure 1: This code implements the vector operation Y=a * X + Y for a vector of length 100.

Notes:

  • The challenge in this exercise is to reorder the unwind loop so that you eliminate all the possible stalls. Only in this way, you get the highest speedup.
  • By assuming full bypassing, an instruction can be allowed to be issued during the cycle before its operands become available.
  • Assume an in-order issue of instructions.
  • Ignore the stall cycle of the branch.
  • When reordering the unwind loop, separate instructions to honor their latencies.
  • Reorder carefully to preserve the behavior of the program.

Instruction / Issues / Executes

Write here how many clock cycles each iteration on the code takes:

Write here what the speedup per iteration versus the Tomasulo pipeline seen in Part 2 is:

Evaluation Criteria: For this question you can get up to 10 points so assigned:

  • You get 10 points if the answer is correct, well written, and, if required, optimized.
  • You get 8 points if the answer has some minor issues or is correct but very poorly presented; overall the answer shows your understanding of the problem.
  • You get 4 points if you show some effort to address the problem but the answer cannot be considered correct; your solution shows some major issues with your understanding of the problem and you are encourage to review the theory.
  • You get 0 points if the answer is incorrect and there is no evidence of your effort in understanding the problem, or you copied the solution, or an answer is missing.

1