Phoebus Chen

Mark Chew

Manda Sutijono

Koichi Tsunoda

Posu Yan

Lab 7: Final Project

Abstract:

Our goal for the final project was to design a processor using the Tomasulo algorithm to perform out-of-order execution. We also decided to implement branch prediction, reorder buffer, and a multiplier/divider.

In order to accomplish this, we essentially scratched our previous pipelined processor, and started over. This proved to be quite complicated, not only because we could not simply upgrade our older processor design, but also because the Tomasulo algorithm itself was more complex than we had anticipated.

Division of Labor:

Phoebus Chen – Branch Prediction

Mark Chew – Reservation Stations, Multiplier/Divider

Manda Sutijono – Operation Queue, Utility Units

Koichi Tsunoda – Reorder Buffer, Asynchronous DRAM Controller

Posu Yan – Issuer, Arbiter, PC FSM

Detailed Strategy:

Reservation Station

A reservation station has, among other things, space to store the input values to a function, where those input values are coming from (if the input values are not yet available at issue time), the specific operation that should be performed on the input values, and the number of the ROB entry. Initially we were going to centralize most of the intelligence in the reservation station controller entity, but we found that it is much easier to distribute the intelligence. Reservation stations are able to listen to the CDB on their own, latching values that they need, say whether they are ready to be launched, say whether they are occupied by an instruction, and decide whether to reset themselves on a mispredicted branch.

Integer Unit and Shift Unit:

Once the needed values are in a reservation station, launcher will launch an instruction the next time the functional unit (ALU or shifter) is free. The shifter is simple because the output of the shifter is always what the output register latches, to drive onto the CDB the next cycle. The integer unit, however, also needs to drive 0x00000000 or 0x00000001 on a set or set unsigned instruction. So the input to the output register must be muxed between set, set unsigned, and the ALU output. For both the integer unit and the shift unit, values wait in the output registers until they get a chance to broadcast, at which point they latch in the next value to be broadcast.

Testing Methodology for Reservation Stations, Reservation Station Controllers, and Integer Unit

To test these, we basically wrote a command file that issued a bunch of instructions to the reservation stations. First of all, this allowed us to see if the issue controller was distributing the instructions among reservation stations in a reasonable way. Some of the instructions we launched still needed to wait for their values to come on the CDB, and accordingly, we put some wrong values and some right values on the CDB to see if the reservation stations would grab the values it was waiting for and signal that they were ready to launch. Finally, we added a pseudo-random grant signal using an xor gate feeding a flip-flop, so we could test that the integer unit behaves correctly when we are not granted the CDB.

Multiply/Divider:

The answers to multiplies and divides don’t get directly broadcast on the CDB because hi and lo aren’t registers in the pure sense. Instead, we send the answers directly to the ROB. It’s not until the ROB encounters mfhi or mflo instructions that the answers to a multiply or divide get broadcast.

MFHI/MFLO Unit

Each “mf” slot in a station looks into the mult cache to see if the multiply it depends on is ready. Once the multiply results are ready, the mfhi/mflo unit takes the data from the mult cache and broadcasts it. Reorder buffer renaming resolves dependencies between “mf” instructions and mult instructions.

Branch Unit

The branch unit is meant to correct the branch from the branch prediction unit when the latter mispredicts an instruction. It is also the duty of the branch unit to update the branch predictor so future predictions are more accurate. To perform its duties, the branch unit takes in the address not chosen by the branch predictor (either PC+4 or PC+4+Immed), the prediction value (taken/not taken), the PC of the branch, and the other signals usually given to a reservation station. The branch unit also listens to the CDB to grab its operand values to use for comparisons.

When the branch unit launches an instruction, it compares the two operands and based on whether the instruction was a beq, bne, bgez, bgtz, returns whether a branch should have been taken. It also compares the actual branch result with the predicted branch result, and tells the ROB, other functional units, and the instruction OP queue when it mispredicts. While the branch unit is comparing the operands, it also calculates the “ROB holding the current instruction + 2”. This value is used for determining the range of instructions to squash (all instructions issued after and including ROB + 2 must be squashed). Regardless of the outcome of the prediction, the branch unit always updates the branch prediction unit’s table entries, using the PC of the branch instruction as the index to the table.

The branch unit has 8 reservation stations (it should never have to use so many, and can never need more because we only have 16 slots in the reorder buffer and half of those will be used to hold delay slot instructions).

JR Unit

We need a special unit to handle JR instructions. Since we don’t do jump prediction, we always stall on a JR. Once JR is issued to the JR Unit, it simply does a couple case analyses. First, it checks if the register value is ready. If it is, then it will tell the PC FSM to do a jump to the address in the register. If the value is not ready, then it will store the ROB of the register, and will listen to the CDB for the matching destination ROB. Once it matches, it also tells the PC FSM to do a jump.

However, if the PC FSM is not ready to write the jump register PC when the JR unit is ready (such as when a stall is happening with an instruction cache miss), we need to store the value into the registers and keep on asserting the jump signal until it is notified that the jump PC has been written.

Result

With this unit, it allows the JR instruction to assert signals to the PC FSM to do the correct next PC selection. Though this stall is pretty bad for Tomasulo (since we can’t continue to fetch instructions from the memory), it simplifies it and works so we decided to design it this way.

Arbiter

The arbiter works by queuing all CDB requests and grants them access in order based on their priority. Each request line is handled by its own FSM.

When the FSM gets a request, the FSM transitions to one of the eight queue states, each state corresponding to a “queue level”. If there were no pending requests, the FSM enters the “go” level, which immediately grants the requestor access to the bus. Otherwise, the FSM enters one of seven other queue states corresponding to its position in the FIFO queue. The queue position of a request is based on its priority and time of arrival, and guarantees fairness to all CDB requests.

To schedule the bus access, the FSM takes the number of queued requests that are in front of the current request as an input. This is accomplished by having an adder that counts the number of request lines that are currently queued, plus the number of requests that is simultaneously coming in. For example, the highest priority request only needs to add the number of requests that are already queued before it came in, but the lowest priority line needs to add the number of lines queued plus the number of simultaneous requests (since all other requests are of higher priority).

SW cache

The SW cache keeps track of outstanding SWs and their addresses. This is a lookup table that is used by LW instructions for looking up and determining address dependencies. This is to prevent lw after sw hazards where they both access the same address (memory disambiguation).

The SW cache has 16 entries. Each entry has a valid bit, the ROB number, and the address. An entry is entered into the cache after the SW address is calculated in the LW/SW unit, and each entry is taken out when a SW is committed.

Operation Queue

The operation queue holds the instructions to be sent to the reorder buffer. It also stores the PC, PC+8, the address if a misprediction took place, and a signal that tells whether a branch was taken or not. This storage is accomplished by using 32-bit registers and a flip-flop, for each of the 16 “slots” of the queue.

An insert input is used to signify that an instruction and its accompanying information must be stored in the queue. In order to know which register to store the instruction in, there is a “head” counter that keeps track of the next available slot. The counter output is decoded and if the insert signal is high, then it asserts the enable signal for the corresponding slot.

Similarly, there is a send input that will actually output an instruction. To monitor the slot that holds the “oldest” instruction to output, a “tail” counter is used.

Another input signal is squash, which will reset the counters and essentially “erase” the contents of the op queue.

An output the queue also provides is the most recent instruction. This will be helpful in the datapath to signify if there is a jump instruction.

The full output signal will be looked at when asserting the insert signal. If the full signal is high, then the insert signal should stay low, so that instructions in the op queue will not get overwritten. The empty output signal will get sent to the issuing unit. These two outputs are found by looking at the two counters. If both the head and tail are equal, then the queue must be either full or empty. If the last operation on the queue was an insert, then that means the queue is full. However, if the last operation was a send or squash, then the op queue is empty.

To test the functionality of the op queue, a command file was made. First we tried to fill the entire queue with “instructions” and the corresponding PCs, and make sure that the full and empty signals came out correctly. Then, the send signal was asserted, which would check the tail counter and that it was keeping track of the oldest instruction that was stored in the queue.

When the op queue was placed in the datapath, we made sure to check that if the op queue was full, then no new instruction would be stored.

Reorder Buffer

We decided to implement the Reorder Buffer (ROB) because it makes sense to have it with the Branch Predictor. When the Branch Predictor miss predicts, then we have to be able to either roll back, or not commit the bad instructions. Therefore, with the use of the ROB, we can get rid of the instructions that came after the miss predicted branch by simply “squishing” it.

It seems like the ROB had a pretty straightforward part with the Tomasulo algorithm. However, as we started to think about more of the complicated matters, the ROB got more and more complicated in an exponential behavior.

For example, we decided that the Branch Predictor should not use the Common Data Bus (CDB) when telling the ROB the result of the branch calculation (if it was a miss predict or not). Furthermore, the multiply unit is the only one with a sixty-four bit wide data, so we figured that we may as well make a dedicated bus for that to the ROB. Those dedicated connections were also nice because it allowed many of the instructions to be stored into the ROB simultaneously. Obviously those dedicated interface with some units made the ROB much more tedious.

The renaming was also taken care within the ROB. The ROB would tell the outside world the ROB numbers of the sources if they were in the ROB. Each of the registers would have an associated ROB number if they were inside the ROB at the time of the check. The renaming aspect brought about many issues as well unfortunately. For example, it is possible that at the ROB number given to a reservation station is the value being broadcasted to the CDB at the time of the issue. In that case, the value on the CDB should be latched into the “V” registers instead of the ROB number latching into the “Q” registers. Those subtle things made the ROB more and more complicated.

What seemed like a trivial thing to do (at least the book/lecture made it sound trivial) in terms of squishing the miss predicated instructions, actually ended up being really difficult. First, when the instructions needed to be squished, we had to figure out from where to where we must squish. Since the ROB used a circular queue, the difficulty increased as well. The renaming device also had to be squished—if registers have an ROB number that is going to be squished, it needs to reset those values.

Committing is rather difficult too. There are several different behaviors in committing. SW, for example, stores the value into the memory. Multiply on the other hand stores to the HI and the LO registers. Branch instructions do not store any data anywhere as well as the jump instructions. Rest of the instructions involving registers simply writes the data to the register file when committing. Sadly, those are the easy parts. The more difficult part shows up when figuring out when they are allowed to be committed. The logic that decides if the slot can commit when its turn comes became the most confusing part of the ROB. There are already five different types of instructions, so it is just like doing a case statement and making sure we didn’t miss anything.

In the end, the ROB slots had these components: Destination (register num or an address), Value, Instruction Type, two registers dedicated to multiply/divide units, Valid bit, Ready to Commit bit, and many logic that controlled the latching into these components. ROB itself had sixteen ROB slots, with the renaming device for RS, RT, and the Mult registers, counters for the head and tail, and many decoders. Along with the ROB, ROB issue controller and the ROB commit controller helped assert signals when issuing and committing respectively.

Result

We resulted with a pretty nice ROB that allows several units to store the value into the ROB in one cycle. This should theoretically speed up the executions since those units do not have to fight for the CDB to be free to give the ROB its result.

Testing became a rather difficult task as well. We had to simulate out of order execution as well as multiple units finishing their execution and committing to the ROB without the other parts of the processor yet (because testing it with issuer, memory, etc. would’ve been too hard). Of course testing brought about more cases that were ignored by mistake.

Once we had a primitive Tomasulo ready to test the Integer units, we checked if the ROB behaved the right way by asserting the correct signals when committing. After several test cases, we decided that the ROB works excluding the possible minor cases that show up once out of 2^32 instructions.

What we also noticed is that we ended up making the ROB the central aspect of Tomasulo. Every time other units had to be changed, it usually involved changing the ROB as well because they were all closely knit with the ROB. This increased the functionality of the ROB, with the cost of the increase in its complexity. However, there were benefits from making the ROB complex, because it helped make other parts of the processor simpler (sometimes significantly).


Memory Controller

The memory controller we used in Lab6 was a finite state machine. However, for the final project, we made it semi-asynchronous—the clock edge starts the process of requesting, but thereafter, everything happens asynchronously. That allowed us to do the fastest possible memory controller, which starts at a clock edge. Basically, the new memory controller waits the bare minimum time for the setup/hold time of the address, and the time required for the edges of RAS, CAS, and other signals. As soon as the data is ready, the processor is notified so that it can give the units waiting for the data right away without having to wait extra. However, this brought about a couple complications relating to the WAIT signal and the RAS precharge time.