The Moron

CS 152 Final Project

Professor Kubiatowicz

Superscalar, Branch Prediction

John Gibson (cs152-jgibson)

John Truong (cs152-jstruong)

Albert Wang (cs152-albrtaco)

Timothy Wong (cs152-timwong)


I.  Abstract

The goal of this project is to construct a working superscalar processor with branch prediction. The memory module from Lab 6 needed to be reworked to function properly. In this phase we decided to emphasize robustness and functionality (i.e., a working processor) rather than speed, so the memory was scaled back to a direct-mapped, write-through cache.

Although the superscalar architecture itself was straightforward, the primary complication lay in increasing the number of ports in the register file and cache to support two pipelines. We introduced “striping” in the cache to handle this situation with relatively few stalls.

II.  Division of Labor

Datapath Enhancement – This part involved updating the datapath to include two pipelines, adding additional forwarding logic, and updating the memory and register modules to support dual input/output where necessary.

Initial revision: Albert, John G.

Cache, Striping, Dual Issue – This part involved writing a direct-mapped, write-through cache with bursting, striping instructions within the cache, and adapting the cache to read two instructions at once.

Initial revision: John G., Tim

Testing: John G.

Branch Predictor – This part involved writing and integrating the branch prediction unit.

Initial revision: John T.

Testing: John T., Tim

Distributor – This part consisted of a VHDL component that distributes the instructions between the two pipelines based on dependencies and other constraints.

Initial revision: Tim

Testing: Albert

Forwarding/Hazards – This involved updating the forwarding and hazard units to support the two pipelines.

Initial revision: Tim

Testing: Tim, Albert

Integration – Integration primarily involved updating the top-level modules to support the new modules introduced by the superscalar design.

Integration: Everybody

Overall Testing – Each element was tested as it was implemented, followed by thorough testing of the datapath after the integration of each component, as well as verification that everything worked correctly on the board.

Testing: Everybody

III.  Detailed Strategy

Sections:

0: Superscalar

1: Stall Arbiter

2: Instruction Issue

3: Memory Subsystem

4: Instruction Distribution

5: Forwarding

6: Hazards

7: Branch Prediction

Section 0: Superscalar

superscalar.sch

Because our 5-stage pipelined processor was already working reasonably well, extending it to a superscalar architecture was relatively straightforward. The two pipelines are referred to as the EVEN and ODD pipelines. The control signals, however, distinguish between the two pipelines as Pipeline 1 (EVEN) and Pipeline 2 (ODD). This is slightly confusing, but we were able to keep the names straight among ourselves, and we decided that going back to rename everything would be tedious and could introduce annoying bugs if we were not careful.

Each pipeline maintains its own copies of the instructions, PCs, and control signals it processes. The goal is to isolate each pipeline as much as possible in order to simplify debugging and minimize complexity.

With this project, we had the opportunity to apply many of the lessons we learned from Lab 6’s non-functioning cache. Most notably, we kept the “Keep It Simple, Stupid” motto in mind throughout the design process. Because we wanted to reduce the complexity of our design, we decided to limit the functionality of the pipelines. For instance, all branch and jump instructions must be processed in the EVEN pipeline, whereas all memory instructions must be processed in the ODD pipeline. We also kept an invariant that the earlier instruction must always be in the EVEN pipeline. The rationale behind this decision was that we wanted to keep the pipelines “synched” so that the forwarding, hazard, and prediction mechanisms would be easier to design and test. Although this invariant inevitably increases our CPI, our goal was to have a working processor first and then include additional “features.” With this in mind, we tried to design our processor so that it would be easy to integrate optimizations later.

Restricting the pipelines reduced the number of corner cases we had to worry about. The “branch pipeline” was intentionally set as the EVEN pipeline (the earlier one) so that branch delay slots could be handled more cleanly. Since branch and jump instructions are always sent to the EVEN pipeline with their delay slots in the ODD pipeline, our distributor does not have to keep state to remember that a delay-slot instruction still has to be fetched.

Restricting the pipelines also reduced the complexity of forwarding between the pipelines, because data never needs to be forwarded to the memory stage of the EVEN pipeline, nor to the decode stage of the ODD pipeline.

Section 1: Stall Arbiter

stallarbiter.v

When multiple components request a stall or a bubble, the stall arbiter decides which stall has precedence. Until the final project, stalls had been handled in an ad hoc manner with several simple logic gates and latch signals. While it was easy to use the ad hoc system for Lab 5 (only the hazard unit could stall, so no arbitration was necessary), we began to see stalling issues in Lab 6 when we created two additional components (the data and instruction caches) that needed to stall the processor. With a few more gates, however, we were able to retain our old stalling system. Unfortunately, this system became inadequate during the development of Lab 7, when we created three new stalling signals that needed to be handled. The first is bubble, which is asserted when the instruction in the decode stage of the ODD pipeline depends on the instruction in the decode stage of the EVEN pipeline. The second and third signals are jumpflush and branchflush, which are asserted when a jump is detected or a bad guess is made by the branch predictor. This proved to be far too many signals to handle with simple logic gates, so a new module was created to arbitrate among them. In decreasing order of priority, the signals are:

1.  Data Cache Stall – Freezes the entire pipeline.

2.  Hazard Stall – Freezes the fetch and decode stages, inserts bubbles into the execute stage.

3.  Instruction Cache Stall – Freezes the fetch and decode stages, inserts bubbles into the execute stage.

4.  Bubble – Inserts a bubble into the execute stage of the ODD pipeline. It also fills the decode stage of the EVEN pipeline with the decode-stage instruction from the ODD pipeline. Finally, it fills the decode stage of the ODD pipeline with the fetched EVEN instruction, unless that instruction is a jump or a branch.

5.  Jump Flush / Branch Flush (only one of these should ever be asserted at a time, so they have equal priority) – The flush signals reset the instructions entering the decode stage.

The stall arbiter selects the stall signal with the highest priority and propagates it to the rest of the processor. The actual reset and write-enable signals for the instruction and PC registers are determined by a set of OR and NOR gates fed by the appropriate stall signals. This probably should have been done in another behavioral Verilog module, but because the signals could be decided with a single level of gates, we decided it was easiest to just place the gates on the pipeline.
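
In behavioral Verilog, the arbitration reduces to a single chain of if/else statements. The sketch below is only illustrative; the port names are our own and may not match the actual interface of stallarbiter.v.

    // Minimal sketch of the stall arbiter's priority selection.
    // Port names are illustrative and may not match stallarbiter.v.
    module stallarbiter_sketch (
        input  wire dcache_stall,   // 1: freezes the entire pipeline
        input  wire hazard_stall,   // 2: freezes fetch/decode, bubbles execute
        input  wire icache_stall,   // 3: freezes fetch/decode, bubbles execute
        input  wire bubble,         // 4: bubbles the ODD execute stage
        input  wire jumpflush,      // 5: resets instructions entering decode
        input  wire branchflush,    //    (never asserted with jumpflush)
        output reg  grant_dcache,
        output reg  grant_hazard,
        output reg  grant_icache,
        output reg  grant_bubble,
        output reg  grant_flush
    );
        always @(*) begin
            // Default: grant nothing.
            grant_dcache = 1'b0; grant_hazard = 1'b0; grant_icache = 1'b0;
            grant_bubble = 1'b0; grant_flush  = 1'b0;
            // Grant only the highest-priority request.
            if (dcache_stall)                  grant_dcache = 1'b1;
            else if (hazard_stall)             grant_hazard = 1'b1;
            else if (icache_stall)             grant_icache = 1'b1;
            else if (bubble)                   grant_bubble = 1'b1;
            else if (jumpflush || branchflush) grant_flush  = 1'b1;
        end
    endmodule

The granted signal is what drives the OR/NOR gates on the pipeline that produce the actual reset and write-enable signals.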

Section 2: Instruction Issue

A superscalar processor requires two new instructions every cycle, so we had to modify our 32-bit-wide write-through cache to create a wider cache that could provide 64 bits of data to the pipeline. Fortunately, the instruction cache never has to accept stores from the pipeline, which made developing the new instruction cache substantially easier.

We widened the instruction cache to fetch two instructions in one cycle. Our first design for the wider cache was a BRAM with a single 64-bit port. We used Coregen to create a BRAM with 64-bit entries, and we halved the number of entries to keep the size of the instruction cache constant. This had the side effect of making our loads from RAM four cycles faster, because a cache line could now be filled in just four writes to the cache instead of eight. The downside to this design was that we could only load even/odd pairs of words, not odd/even pairs. Unfortunately, the distributor is designed to fetch words in odd/even pairs as well as even/odd pairs (because of a jump/branch, or a stall caused by a memory instruction in the EVEN pipeline). Without modification, we would have incurred a one-cycle penalty every time we wanted an odd/even pair. We considered several solutions to this problem.

First, Prof. Kubi suggested that we build a stream buffer that would always be a few cycles ahead; we could then select both odd/even and even/odd pairs as long as the buffer was full. To keep the buffer full, we would have had to keep the fetch stage a few cycles ahead of the rest of the processor. This would have made branches and jumps very costly, although our branch prediction unit would have offset that penalty. Managing a stream buffer sounded complicated, and because we wanted to keep our design as simple as possible, we decided not to use this approach.

An alternative would have been to use a dual-ported BRAM with two 32-bit ports, which would have allowed us to select any two words simultaneously. However, given Prof. Kubi’s disdain for dual-ported BRAMs and the fact that dual-porting here would rob us of the opportunity to use dual-porting to enhance the loading speed of DRAM requests, we decided not to go this route either.

The final option was to stripe the instructions across two separate BRAMs, with one containing odd words and the other containing even words. This way we could always select both an odd and an even word in a single cycle, regardless of the words’ positions in the cache. This modification was straightforward, and we were able to make the change without touching the cache controller.
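
To make the striped lookup concrete, the sketch below shows one way the BRAM indices could be computed; the module and signal names are our illustration, not the actual cache code.

    // Sketch of striped instruction-fetch indexing. The EVEN BRAM holds
    // words 0, 2, 4, ... and the ODD BRAM holds words 1, 3, 5, ...
    // Names and widths are illustrative.
    module stripe_index #(parameter AW = 10) (
        input  wire [AW:0]   word_addr,   // word address of the earlier instruction
        output wire [AW-1:0] even_index,  // entry to read from the EVEN BRAM
        output wire [AW-1:0] odd_index,   // entry to read from the ODD BRAM
        output wire          swap         // 1: the earlier instruction is the odd word
    );
        assign swap      = word_addr[0];
        assign odd_index = word_addr[AW:1];
        // For an odd/even pair, the even word we need is word_addr + 1,
        // which lives one entry further along in the EVEN BRAM.
        assign even_index = word_addr[AW:1] + swap;
    endmodule

When swap is high, the pair is returned with the ODD BRAM’s word first.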

Unfortunately, we encountered another significant problem: loading words across cache lines. If the processor requested word 7 of line x and word 0 of line x + 1, we would be unable to fulfill the request in a single cycle for a number of reasons. The first is that we could still only look up a tag for a single line in one cycle. To fix this problem we would have had to either dual-port the tag file or keep two copies of it so that we could query different entries simultaneously. A more serious issue would arise in the event of a double cache miss. Handling a double cache miss would have greatly altered our cache controller, and given our earlier trouble with the cache controller we felt it was safer not to attempt this solution. Instead, we decided to build a separate controller that handles a load across cache lines over multiple cycles. This controller acts as a supercontroller for the instruction cache: it detects a load across cache lines and then freezes the rest of the processor while it hijacks the cache and simulates two separate load requests. This let us resolve the issue quickly without substantially altering our cache controller. We did pay a performance penalty, however: a load across cache lines always takes at least two cycles. We decided that this was an acceptable tradeoff because functionality is more important than performance. To reduce this penalty we could increase the length of our cache lines (making requests across cache lines less frequent), but this improvement would require changes to the DRAM controller.
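
The sketch below illustrates the supercontroller’s behavior, assuming eight-word cache lines (so a pair starting at word 7 spans lines x and x + 1); the states and signal names are our own, not the actual code.

    // Sketch of the line-crossing supercontroller, assuming 8-word lines.
    // State and signal names are illustrative only.
    module cross_line_sketch (
        input  wire       clk,
        input  wire       rst,
        input  wire       req,          // pipeline requests an instruction pair
        input  wire [2:0] word_offset,  // offset of the earlier word in its line
        input  wire       lookup_done,  // underlying cache finished one lookup
        output wire       freeze        // freeze the processor while we hijack
    );
        localparam IDLE = 2'd0, FIRST = 2'd1, SECOND = 2'd2;
        reg [1:0] state;
        // Word 7 of line x paired with word 0 of line x + 1.
        wire crosses = req && (word_offset == 3'd7);
        assign freeze = (state != IDLE) || crosses;
        always @(posedge clk) begin
            if (rst) state <= IDLE;
            else case (state)
                IDLE:    if (crosses)     state <= FIRST;  // lookup for line x
                FIRST:   if (lookup_done) state <= SECOND; // lookup for line x + 1
                SECOND:  if (lookup_done) state <= IDLE;   // release the processor
                default:                  state <= IDLE;
            endcase
        end
    endmodule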

Section 3: Memory Subsystem

memorysubsystem.sch

The memory subsystem is just a schematic that wires together several major components. It contains the Memory-Mapped I/O, the Instruction Memory Wrapper, the Instruction Cache, the Data Cache, the Cache Arbiter, and the DRAM Interface, as well as a few shift registers and logic gates. The MM I/O unit sits above the Data Cache and is in charge of intercepting requests to the memory-mapped I/O space before they can reach the Data Cache. The Instruction Memory Wrapper performs a similar task: it prevents the Instruction Cache from attempting to load instructions while the Boot ROM code is executing. The Cache Arbiter is connected to both the Instruction Cache and the Data Cache and routes their requests to the DRAM Interface. Data transfers to and from DRAM are done through the shift registers. We implemented the memory subsystem as a schematic because it makes the connections easier to visualize.

Note that we reworked Lab 6’s cache architecture so that it functions correctly. In doing so, we simplified it to a direct-mapped, write-through design.

Section 4: Instruction Distribution

superdistributor.v

With dual issue and the restrictions we placed on our pipelines concerning memory and branching instructions, the processor needs a way to distribute instructions that avoids structural hazards. The distributor module serves this purpose. It does a simple decode of the opcode and funct code of the instructions coming from the instruction cache. If the earlier instruction coming from the cache is a memory instruction, the distributor sends it down the ODD pipeline, sends a NOP down the EVEN pipeline, and requests that the cache load the two instructions following the memory instruction. If a branch or jump is detected as the later instruction, the earlier instruction is sent down the EVEN pipeline, a NOP down the ODD pipeline, and the distributor requests that the branch or jump be fetched again along with its delay-slot instruction (Figure 4.1). Refetching the branch or jump with its delay-slot instruction helps particularly when dealing with branch prediction: normally, when a branch is predicted after the instruction is fetched, the new predicted PC is sent immediately into the instruction cache. If a branch were allowed in the ODD pipeline, we would have to fetch the delay-slot instruction before the branch could be predicted.
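
The sketch below shows the flavor of this decision logic; the opcodes are standard MIPS, but the module interface is our illustration and may not match superdistributor.v exactly.

    // Illustrative sketch of the distributor's issue decision (standard MIPS
    // opcodes); the real superdistributor.v interface may differ.
    module distributor_sketch (
        input  wire [31:0] inst0,      // earlier fetched instruction
        input  wire [31:0] inst1,      // later fetched instruction
        output reg  [31:0] even_inst,  // issued to the EVEN pipeline
        output reg  [31:0] odd_inst,   // issued to the ODD pipeline
        output reg         refetch     // ask the cache to refetch
    );
        localparam NOP = 32'h0000_0000;
        wire [5:0] op0 = inst0[31:26], op1 = inst1[31:26];
        wire [5:0] fn1 = inst1[5:0];
        // Memory instructions must run in the ODD pipeline.
        wire mem0 = (op0 == 6'h23) || (op0 == 6'h2b);      // lw, sw (others elided)
        // Branches and jumps must run in the EVEN pipeline.
        wire br1  = (op1 == 6'h04) || (op1 == 6'h05)       // beq, bne
                 || (op1 == 6'h02) || (op1 == 6'h03)       // j, jal
                 || ((op1 == 6'h00) && (fn1 == 6'h08));    // jr
        always @(*) begin
            if (mem0) begin
                // Issue the memory op alone in ODD; refetch the two
                // instructions that follow it.
                even_inst = NOP;   odd_inst = inst0; refetch = 1'b1;
            end else if (br1) begin
                // Issue the earlier instruction alone in EVEN; refetch the
                // branch/jump together with its delay slot.
                even_inst = inst0; odd_inst = NOP;   refetch = 1'b1;
            end else begin
                even_inst = inst0; odd_inst = inst1; refetch = 1'b0;
            end
        end
    endmodule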

We also extended the distributor’s responsibility by allowing it to detect jumps (j and jal) in the earlier instruction and send the new jump PC to the PC unit (our branch predictor). In this way we saved a cycle whenever there was a j or jal instruction, because we did not have to wait until the ID stage to decode the jump.
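
As a final sketch, early jump detection needs only the opcode and the standard MIPS J-type target computation; the names below are illustrative.

    // Sketch of early j/jal detection in the distributor (names illustrative).
    module early_jump_sketch (
        input  wire [31:0] inst0,     // earlier fetched instruction
        input  wire [31:0] pc_plus4,  // address of the delay slot (PC + 4)
        output wire        is_jump,   // send jump_pc straight to the PC unit
        output wire [31:0] jump_pc
    );
        wire [5:0] op0 = inst0[31:26];
        assign is_jump = (op0 == 6'h02) || (op0 == 6'h03);  // j, jal
        // Standard MIPS J-type target: upper PC bits, 26-bit index, word-aligned.
        assign jump_pc = {pc_plus4[31:28], inst0[25:0], 2'b00};
    endmodule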