The Lords of the RISC Proudly Present

The Lords of the RISC proudly present,

Lab 7: The Tomasulian

Starring,

Michael Carns (cs152-mcarns)

Mark Feng (cs152-wushuf)

Randy Huang (cs152-rkhuang)

Caroline Leung (cs152-cleung)

Mike Lowey (cs152-mlowey)

Abstract

Our processor implements the tomasulo scheduling algorithm. It supports in-order issue, out-of-order execution, and in-order commit. Additionally, we implement a precise exception point in the reorder buffer. The system consists of 8 major components: Instruction Fetcher, Instruction Buffer, Dispatcher, Reorder Buffer, Branch Ex Unit, ALU Ex Unit, Address Calc Unit, and Memory Unit.

Features implemented:

1. Out-of-order Execution (Tomasulo) ------22 points

2. Reorder Buffer ------12 points

3. Static Branch Prediction & recovery from mis-predicted branches

Division of Labor

Micahael Carns / Overall design of the Architecture, Reorder Buffer, modification to cache replacement policy (random to LRU), monitor and verification tools
Randy Huang / Memory Load/Store Unit
Caroline Leung / Branch Execution Unit, Integer Execution Unit
Mike Lowey / Instruction Fetcher, Instruction Buffer
Mark Feng / Dispatch Unit, Commit Unit

Detailed Strategy

Description of Major Modules & Operations

Instruction Fetcher

The Instruction Fetcher was created to accomplish a few specific tasks. Overall the Instruction Fetcher sits between the cache and the Instruction Buffer and acts as an interface that gets instructions from the memory and passes them on to the Instruction Buffer. The data that is passed to the buffer is 65 bits wide and contains the PC, instruction, and a bit to indicate whether a branch is being taken. The IFetch unit must keep track of and increment the PC in order to fetch the correct instruction. This IFetcher must also keep track of when a branch is being taken and insert the new PC at that time. The IFetch unit must also communicate with the cache and the IBuffer using the WAIT and FULL signals to make sure that there is always valid data is being transferred at the correct times. When good data has been received from the cache (WAIT is low) and the IBuffer is ready to receive data (FREE is high) the IFetcher will put the combined 65 bits on the data line and raise its STORE output signal.

Instruction Buffer (Load FSM, POP FSM)

The Instruction Buffer (IBuffer) is a buffer that holds instructions to be fed in to the dispatcher. It sits between the IFetcher and the Dispatcher and serves as a reservoir of instructions to keep new instructions flowing during cache misses. The IBuffer consists of a series of four, 65-bit registers set up in series to allow them to act as a 65-bit shift register. This allows instruction that have been loaded in to the registers to be popped out. The IBuffer also has four, 65-bit muxes that allow new data to be loaded in to a specific register depending on how full the buffer is.

The last part of the IBuffer is the controller. The controller keeps track of how full the buffer is and handles the details of loading and storing. Inputs and outputs are coordinated with the Dispatcher and the IFetcher using the FREE and INSTR_READY signals.

Dispatch Unit

The dispatch unit is designed entirely in VHDL. It is essentially a dissembler that decodes the current instruction and issues signals to the appropriate execution stations and the reorder buffer. The output signals of the dispatcher are broadcasted to all the execution units, but only certain execution units will start running. An execution unit will run only if it receives a dispatch signal for that unit. For example, when an arithmetic instruction is dispatched, the dispatcher decodes the instruction and asserts the INTDISPATCH signal. The integer unit then latches in input signals and executes the appropriate instruction.

The dispatch unit is also used to flush instructions once we realize a branch is taken. (A branch is never taken for our implementation to simulate branch prediction.) When the dispatcher receives a DOBRANCH signal, it goes to the BRANCH-RESOULTION state. In this state, the dispatch output all 0s, except the DISPATCHED signal to the instruction fetcher, which is set high. This tells the instruction fetcher to fetch more instructions, and the rest of the datapath to ignore these faulty instructions. Once the BRANCHTARGET signal is asserted, meaning that we are dispatching the correct target instruction of the branch, then the dispatched unit returns to the NORMAL state and continue to dispatch instructions like normal.

Reorder Buffer

The reorder buffer keeps track of all instructions in the system. Each instruction is assigned a unique 4-bit Reorder ID (or RobID) which is used to track the instructions throughout the system. Each instruction takes up one of the 12 reorder buffer slots.

The reorder buffer tracks the following for each instruction:

Inputs

Description

/ # bits
Robid / Robid of the instruction / 4
Valid / High if the instruction is valid / 1
Clear / High if the instruction is permitted to run / 1
Done / Instruction is completed and can be committed / 1
IsBranch / High if the instruction is a branch. / 1
IsMem / High if the instruction is a load or store / 1
Regwrite / High if the instruction writes it’s result to Rd / 1
PC / PC of instruction / 32
Rd / Destination register of instruction (not necessarily the MIPS RD field) / 4
Result / The result (if any) of the instruction / 32

Once per cycle the reorder buffer can clear memops or branches to run. Stores are not permitted to run until they reach the front of the ROB, and loads are not permitted to run until all future branches have resolved (in order to avoid false cache misses). Branches are not allowed to run until all future branches have resolved. Note: resolved is not the same as committed. An instruction is resolved when it’s done field is high. It is committed when it leaves the reorder buffer. Instructions can resolve out-of-order, but they commit in order.

If a branch is taken, it invalidates all instructions after the branch delay slot (since we effectively predict not taken). Once per cycle an invalid rob-id is placed on the invalidate bus to throw out bad instructions that are taking up space in execution units. Once a branch is invalidated, it can be overwritten in the reorder buffer to save space.

The ROB also acts as a rename buffer. When instructions are dispatched, the dispatcher passes Rs and Rt to the reorder buffer and it outputs any new results for those registers if they exist but have not been committed.

Individual Execution Unit:

Integer Execution Unit

There are 8 reservation stations for the integer execution unit. We decided to have this many slots because integer execution is the most frequent type of instruction that our processor executes in general. Since the cost of hardware is almost free in our implementation, having more reservation stations can enable more out-of-order executions and thus increase the performance of our processor.

Inputs: clk, dispatch signal from dispatcher, robid, op, Srs, Srt, Vrs, Vrt, Trs, Trt, immediate, invalidate robid, alu-forwarded robid/result, branch-forwarded robid/result, memory-forwarded rob/result

Outputs: current robid, result after execution, free signal to indicate still has slots available to load

Within each reservation station, there are

Inputs

Description

/ # bits
Robid / Robid of the reservation station lines / 4
Opcode / Customized microcode act as control logic at various places / 18
Valid / If the particular reservation line is holding valid things / 1
Ready / Ready to be executed / 1
Immediate / Immediate from dispatcher / 32
Ready / If the current reservation is ready to be executed / 1
Set_rs/Set_rt / If value waiting for is coming from common data bus / 2
Trs / Trt / Which forwarding bus line (from ALU/branch/memory) to pick / 2
Srs / Srt / Source robid of the two operand registers / 4
Vrs / Vrt / Actual value of the two operands / 32
Flush / Determine if the current reservation station needs to be flushed / 1

Basically the integer execution unit consists of 3 main parts: reservation stations, execution and the controlling/routing logic. Once it gets a dispatch signal from the dispatcher and a robid from the reorder buffer, the controller thus loads the instruction into the lowest available reservation station and set the corresponding valid bit to 1. Depending upon whether the source robids are zeros (which means values for both operands are ready already), the execution unit has to wait until the correct robid is broadcasted on the common data bus line. Until then, it will grab the corresponding data with the matching robids. Due to randomness of out-of-order execution and the quick execution time of ALU instruction, multiple reservation station lines can be ready at the same time; in that case, we pick the lower line to execute first, with the help of the appropriate routing logic. The order of executing instruction doesn’t matter, as long as the reorder buffer finalizes the in-order commit.

In order to get maximum speed, we made the dispatcher to send out horizontal microcode as opcode, which functions as control signals at various places. It successfully eliminates the extra local decode logic, since the opcode can be used as control signals directly.

Memory Load/Store Unit & Address Calculation Unit

The basic structure of the Memory Load/Store Execution Unit is consisted of 8 reservation stations, combinational logic to determine which instruction to run (out of order if necessary), and a controller that shifts the queue or removes bubble out of the queue of instructions in the reservation stations.
The reservation stations contain the ROBID of the instruction dispatched, memory address, data value (only for sw), ROBIDs for the address value and the data value, and other bits for validation, clear to go, and opcode bits. Each reservation station has the capability of loading new instructions, loading addresses, or data values accordingly.
The controller on the Memory Load/Store Execution Unit handles the way the reservation station buffer is loaded with instructions, shifted, and removes bubbles. When an instruction is done, the controller will shift all instruction above the finished instruction and “crush” it. During an execution of an instruction, any bubbles that exist above the instruction, the controller will crush the bubbles starting with the lowest most bubble. Furthermore, the controller loads instructions starting with the lowest available reservation station.
The combinational logic in the unit determines which instruction is ready to run by checking three conditions. First the logic will check if the address and data values are ready. Two, it will check if there is no dependencies (i.e. lw after a sw instruction both with the same address). And finally three, it will check if there is a lower instruction in the buffer that is ready to run before it runs certain instruction. The execution unit will always try to run the lowest (oldest) instruction in the buffer.
The Address Calculation Unit is in similar structure and behavior as the Integer Execution Unit. See 1. Integer Execution Unit.

Branch Execution Unit

The inputs are basically similar to the integer execution unit with the extra inputs of PC & clear robid from reorder buffer. For outputs, it has a new PC, PC+8 as $ra value for jal instruction, PC+4 as the branch-delay-slot PC, current robid+1 as the branch-delay-slot robid, a free signal, and a branch taken signal.

Also similar to the integer execution unit, it can split into 3 parts: reservation station, controlling/routing logic, and execution. We decided to have fewer reservation stations (4) for branch, since there cannot be as many branch instructions running at the same time than ALU instructions, and they cannot run until the reorder tells it to with the clear robid matched. The execution part is unique part. It takes care of both branch instructions (ie. bne, beq, bltz and bgez) and jump instructions (j, jal & jr), since both include a (possible) change to the current PC. For branch instructions, we reuse the same logic from previous lab to determine whether to branch or not, and thus calculate the branch PC by adding immediate to the PC+4. For jump instructions, which are like branches that are always taken, we get the new PC from the appropriate sources (Vrs for jr, and immediate for j/jal). Using one bit from the opcode and another bit from discrete logic to determine whether to branch or not, we can thus choose the proper new PC and send it to the Instruction Fetcher, with our branch-taken signal asserted as well. The branch-delay-slot PC / Robid are for the reorder buffer to resolve hazards while flushing. When reorder buffer sends out an invalidate robid (nonzero), each reservation station does a comparison of the invalidate robid with the current robid of that line. If they match, then the valid bit of that particular line will be diasserted, indicating that the line is not valid anymore (empty slot), so when new branch dispatches, it can be loaded over.

Commit Unit

The commit unit is really just a 108 bits register. We use the commit unit to shorten our critical path. Without the commit unit, a signal could potentially travel from one execution unit, to the reorder buffer, and propagate to another execution unit and thus creating a single-cycle processor. The commit unit is used solely to increase performance.

Results & Analysis

Optimizations Made to improve performance
Customized comparator using logic gates instead of stand VHDL block

--Since the specification limits that a standard vhdl comparator must have a delay of 10ns, so we create our own comparator using discrete logic gates, successfully reducing the delay from 10ns to 4ns (at most 4 levels of gate logic)

LRU replacement policy from random to LRU

--In lab 6, we designed our cache to use a random replacement policy; however, we change it into LRU policy for performance reason, successfully reducing the number of cycles needed.

Reduce the number of states in DRAM arbiter

--Simply the logic of DRAM arbiter to have more efficient data/instruction cache fetch. Speed up the memory transaction pace.

Critical Path Analysis

Our critical path consists of a long path from the mem/ex unit through the cache to the dram on the address lines. We’re not sure exactly how long this is, but it’s a corner case. When the DRAM path doesn’t trigger, most of the machine can run on a 40ns clock.

Performance Comparison between 5 stage Pipeline and Tomasulo

Our processor can successfully unroll loops in hardware and execute multiple iterations at the same time. By utilizing the dead cycles during a cache stall, it can achieve a much higher CPI. If we successfully eliminate the long path in the cache, we would also have a clock cycle that was comparable lab 6.

After playing with our clock period, we can run the merge sort in 17448 cycles with a cycle time of 62ns for a total time of ~1082000ns.

Testing Methodology

We tested the processor using the old lab 5 and lab 6 mystery programs, as well as the partial sum program. We did not have time to construct our own specific tests.

We still used the verification tools that we built for lab 5. They include a monitor which verifies the instructions as they are run, and a website which assembles and simulates the program to generate verification data.

The website is currently password protected but will be unlocked once the project is submitted. It can be reached at for a while.

Conclusion

Finally done with the FINAL project!

We boldly decided to go for the Tomasulo out-of-order execution, because all of us think that this is the coolest thing one can learn in CS152. With 2-weeks of careful design, we figured out about 85% of the details of each unit’s structure. Having the groundwork laid out, we went from paper to computer. An important lesson we got is that having a good design can save much debugging and implementation troubles.

Despite the fact that our processor cannot run at a speed that we expected it to be, we are still very proud of our design. We used VHDL only for controller. Everything else is in hardware, such as reservation stations, cache lines, routing logic, and all the execution units. (ie. no VHDL-simulated cache line or reservation station) This gives us a hard time while simulating, due to the extensive amount of wires and modules. Also, with the layers of discrete logic gates, our cycle time increases as well.

With a huge project like this, we learned how to manage and communicate as a working team. As usual, first we split up the work of implementation. With frequent communication, we ensure that the individual structure for similar components is consistent (especially the reservation stations for different functional units & the reorder buffer). Having a consistent basic structure not only makes our debugging easier, it also deepens our knowledge of the design and makes the logic cleaner, since we all have a common set of signals to manipulate on now.