Pipelining
We’ve already covered the basics of pipelining in Lecture 1.
We saw that cars could be built on an assembly line, and that instructions could be executed in much the same way.
[H&P §A.1] In the ideal situation, this could give a speedup equal to the number of pipeline stages:
Time per instruction (pipelined) = Time per instruction (unpipelined) / Number of pipe stages
However, this assumes “perfectly balanced” stages—each stage requires exactly the same amount of time.
This is rarely the case, and anyway, pipelining does involve some extra overhead.
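For instance (illustrative numbers, not from the lecture): suppose the five stages of a 1000-ps instruction take 200, 150, 200, 250, and 200 ps. The pipeline clock must accommodate the slowest stage, so the speedup falls short of 5:

# Unbalanced stages: the pipeline clock is set by the slowest stage.
# Stage times below are made up for illustration.
stage_ps = [200, 150, 200, 250, 200]
unpipelined = sum(stage_ps)        # 1000 ps per instruction
pipelined_ct = max(stage_ps)       # 250 ps clock, ignoring latch overhead
print(unpipelined / pipelined_ct)  # 4.0, not the ideal 5.0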
Three aspects of RISC architectures make them easy to pipeline:
•All operations on data apply to data in registers.
•Only load and store operations move data between memory and registers.
•All instructions are the same size, and there are few instruction formats.
An unpipelined RISC
For our examples, we’ll work with a simplified RISC instruction set. In an unpipelined implementation, instructions take at most 5 clock cycles. One cycle is devoted to each of—
•Instruction fetch (IF).
Fetch the current instruction (the one pointed to by PC).
IR ← Mem[PC]
Update the PC by adding 4 (the length of every instruction).
NPC ← PC + 4
•Instruction decode/register fetch (ID).
Decode the instruction.
Read the source registers from the register file.
A ← Regs[IR6..10]; B ← Regs[IR11..15]
Sign-extend the offset (displacement) field of the instruction.
Imm ← sign-extend(IR16..31)
Check for a possible branch (by reading values from the source registers).
Cond ← (A rel B)
Compute the branch-target address by adding the sign-extended immediate to the incremented PC (NPC).
ALU_Output ← NPC + Imm
If the branch is taken, store the branch-target address into the PC.
If (Cond) PC ← ALU_Output, else PC ← NPC
What feature of the ISA makes it possible to read the registers in this stage?
•Execute/compute effective address (EX).
The ALU operates on the operands, performing one of three types of functions, depending on the opcode
Memory reference: ALU adds the base register (A) and the displacement (Imm) to form the effective address.
ALU_Output ← A + Imm
Register-register instruction: ALU performs operation on the values read from the register file.
ALU_Output ← A op B
Register-immediate instruction: ALU performs operation on the first register value (A) and the immediate (Imm).
ALU_Output ← A op Imm
In a load-store architecture, execution can be done at the same time as effective-address computation, because no instruction needs to compute a data address and operate on that data at the same time.
•Memory access (MEM).
Load_Mem_Data ← Mem[ALU_Output] /* Load */
Mem[ALU_Output] ← B /* Store */
•Write-back (WB). If the instruction is a register-register ALU operation, a register-immediate ALU operation, or a load, the result is written into the register file at the address specified by the destination operand.
Reg-Reg ALU Operation: Regs[IR16..20] ← ALU_Output
Reg-Immediate ALU Operation: Regs[IR11..15] ← ALU_Output
Load instruction: Regs[IR11..15] ← Load_Mem_Data
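To make the register-transfer notation above concrete, here is a minimal Python sketch of the five phases executing one instruction. The Instr record, the dict-based memories, and the choice of “+” for op and “==” for rel are simplifications invented for the sketch, not part of the lecture’s ISA:

from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str     # "load", "store", "alu_rr", "alu_imm", or "branch"
    rs1: int = 0    # first source register  (IR bits 6..10 in the notes)
    rs2: int = 0    # second source register (IR bits 11..15)
    rd: int = 0     # destination register
    imm: int = 0    # offset field, already sign-extended

regs = [0] * 32     # register file
mem = {}            # data memory (sparse, for simplicity)
imem = {}           # instruction memory: address -> Instr
pc = 0

def step():
    """Execute one instruction, phase by phase, as in the notes."""
    global pc
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = imem[pc]
    npc = pc + 4
    # ID: A <- Regs[rs1]; B <- Regs[rs2]; Imm <- sign-extend(offset)
    a, b = regs[ir.rs1], regs[ir.rs2]
    imm = ir.imm
    if ir.opcode == "branch":               # Cond <- (A rel B); rel is == here
        pc = npc + imm if a == b else npc   # resolved in ID: 2 cycles total
        return
    pc = npc
    # EX: effective address or ALU result, depending on the opcode
    if ir.opcode in ("load", "store"):
        alu_out = a + imm               # ALU_Output <- A + Imm
    elif ir.opcode == "alu_rr":
        alu_out = a + b                 # ALU_Output <- A op B
    else:                               # "alu_imm"
        alu_out = a + imm               # ALU_Output <- A op Imm
    # MEM: memory access (loads and stores only)
    if ir.opcode == "load":
        load_mem_data = mem.get(alu_out, 0)  # Load_Mem_Data <- Mem[ALU_Output]
    elif ir.opcode == "store":
        mem[alu_out] = b                # Mem[ALU_Output] <- B; no WB: 4 cycles
        return
    # WB: write the result into the register file (5 cycles total)
    regs[ir.rd] = load_mem_data if ir.opcode == "load" else alu_out

imem[0] = Instr("alu_imm", rs1=0, rd=1, imm=7)   # r1 <- r0 + 7
step()
print(regs[1])   # 7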
In this implementation, some instructions require 2 cycles, some require 4, and some require 5.
•2 cycles: branch instructions (they finish after ID).
•4 cycles: store instructions (they skip WB).
•5 cycles: all other instructions (ALU operations and loads).
Assuming the instruction frequencies from the integer benchmarks mentioned in the last lecture, what’s the CPI of this architecture?
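We don’t have that frequency table here, but the calculation itself is just a weighted average, as in this sketch (the frequencies are borrowed from the example later in this section, standing in for the benchmark data):

# CPI = sum over instruction classes of (frequency x cycles).
# 20% branches / 10% stores / 70% other are the figures from the
# example below, used as a stand-in for last lecture's benchmark data.
classes = {"branch": (0.20, 2), "store": (0.10, 4), "other": (0.70, 5)}
cpi = sum(freq * cycles for freq, cycles in classes.values())
print(cpi)   # 0.2*2 + 0.1*4 + 0.7*5 = 4.3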
Pipelining our RISC
It’s easy to pipeline this architecture—just make each clock cycle into a pipe stage.
Clock #        1     2     3     4     5     6     7     8     9
Instruction i  IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       IF    ID    EX    MEM   WB
Instr. i+4                             IF    ID    EX    MEM   WB
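The pattern is mechanical: instruction k simply enters IF one cycle after instruction k−1. A few lines of Python (purely illustrative) generate the chart for any number of instructions:

# Instruction k enters IF in cycle k+1; each stage takes one cycle.
stages = ["IF", "ID", "EX", "MEM", "WB"]
for k in range(5):
    row = ["     "] * k + [f"{s:<5}" for s in stages]
    print(f"Instr. i+{k}:  " + "".join(row))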
Here is a diagram of our instruction pipeline.
In this pipeline, the major functional units are used in different cycles, so overlapping the execution of instructions introduces few conflicts.
•Separating the instruction and data caches eliminates a conflict that would arise in the IF and MEM stages.
Of course, we have to access these caches faster than we would in an unpipelined processor.
•The register file is used in two stages: it is read in ID and written in WB.
Thus, we need to perform two reads and one write each clock cycle.
To handle reads and writes to the same register, we write in the first half of the clock cycle and read in the second half.
•Something is incomplete about our diagram of the IF stage. What?
We’ve omitted one thing from the diagram above: We need a place to save values between pipeline stages. Otherwise, the different instructions in the pipeline would interfere with each other.
So we insert latches, or pipeline registers, between stages. Of course, we’d need latches even in an unpipelined multicycle implementation.
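One way to picture the pipeline registers: each of the four latches holds a private snapshot of one in-flight instruction’s intermediate values, and at every clock edge each stage’s outputs move into the latch ahead of it. A sketch (the field names are illustrative, not the lecture’s notation):

# The four pipeline registers of the five-stage pipeline. Each holds
# everything one instruction carries between a pair of adjacent stages.
latches = {
    "IF/ID":  {"ir": None, "npc": None},
    "ID/EX":  {"ir": None, "npc": None, "a": None, "b": None, "imm": None},
    "EX/MEM": {"ir": None, "alu_out": None, "b": None},
    "MEM/WB": {"ir": None, "alu_out": None, "load_mem_data": None},
}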
What is our pipeline speedup, then …?
Of course, we have to allow for latch-delay time.
We also need to allow for clock skew—the maximum delay between when the clock arrives at any two registers.
Let’s define T_o’head = T_latch + T_skew. If an unpipelined instruction takes time T and we divide it into n stages,
Speedup = T / (T/n + T_o’head)
        = n / (1 + n · T_o’head / T)
        = n  (ideal case, where T_o’head = 0)
Example: Consider the unpipelined processor described above. Assume—
•Clock cycle is 1 ns.
•Branch instructions, 20% of the total, take 2 cycles.
•Store instructions, 10% of the total, take 4 cycles.
•All other instructions take 5 cycles.
•Clock skew and latch delay add 0.2 ns to the cycle time.
What is the speedup from pipelining?
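One way to work it out: unpipelined, the average instruction time is
(0.2 × 2 + 0.1 × 4 + 0.7 × 5) × 1 ns = 4.3 ns.
Pipelined, each instruction effectively takes one cycle of 1 + 0.2 = 1.2 ns, so
Speedup = 4.3 / 1.2 ≈ 3.6,
well short of the ideal factor of 5.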
How can pipelining help?
Pipelining can improve performance in two ways:
•If we keep CT constant, by improving CPI: starting from a multicycle implementation like the one above, overlapping instructions brings the CPI down toward 1.
•If we keep CPI constant, by improving CT: starting from a single-cycle implementation, the clock only has to be long enough for one stage rather than a whole instruction.
•Usually we improve both CT and CPI.
Pipeline hazards
A hazard prevents an instruction from executing during its designated clock cycle, reducing the performance of the pipeline. Hazards arise from the characteristics of the program and from limitations of the hardware.
There are three kinds of hazards.
•Structural hazards—Not enough hardware resources exist for all combinations of instructions.
•Data hazards—Dependences between instructions prevent their overlapped execution.
•Control hazards—Branches change the PC, which results in stalls while branch targets are fetched.
Structural hazards
Consider a pipeline with a unified instruction-data cache.
Clock #        1     2     3     4     5     6     7     8     9     10
Load instr.    IF    ID    EX    MEM   WB
Instr. i+1           IF    ID    EX    MEM   WB
Instr. i+2                 IF    ID    EX    MEM   WB
Instr. i+3                       stall IF    ID    EX    MEM   WB
Instr. i+4                                   IF    ID    EX    MEM   WB
Instr. i+5                                         IF    ID    EX    MEM
Instr. i+6                                               IF    ID    EX
Instruction i+3 has to stall, because the load instruction “steals” an instruction-fetch cycle.
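The cost of this hazard is easy to estimate: each data-memory reference steals one fetch cycle, so it adds one stall cycle. A sketch, assuming (hypothetically) that 30% of instructions reference data memory:

# Effective CPI with a unified cache, one stall per data-memory reference.
# The 30% memory-reference frequency is a made-up stand-in.
mem_ref_freq = 0.30
cpi = 1.0 + mem_ref_freq * 1   # ideal CPI of 1 plus the stall cycles
print(cpi)                     # 1.3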
In this pipeline, what kind of instructions (what “opcodes”) cause structural hazards?
© 2002 Edward F. Gehringer, ECE 463/521 Lecture Notes, Fall 2002
Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU
Figures from CAQA used with permission of Morgan Kaufmann Publishers. © 2003 Elsevier Science (USA)