ECE 552 Introduction to Computer Architecture
WISC-99S
Computer Architecture Design Project
Final Report
Team Members
KOI, Chao (koic@cae)
CHAN, Tung-Fai (tchan@cae)
April, 1999
TABLE OF CONTENTS
Introduction
Overview
Components Descriptions
Control Units Descriptions
Costs of the Design
Discussions
Comments
Schematic Printouts
VHDL Codes
Test Program Simulations
Commented Simulation Documents
INTRODUCTION
WISC-99S is a 16-bit RISC-oriented computer with a load/store architecture. Its design and implementation are based on Mentor Graphics tools. The architecture has eight general-purpose registers and six major types of instruction: computation register, computation immediate, load/store, branch, subroutine jump, and reserved-for-future. The detailed specifications are provided in the “Project Description” handout from Professor Saluja.
OVERVIEW
We have designed two versions of the architecture, multi-cycle and pipeline. We built the multi-cycle version first because it gave us a better understanding of the system architecture and a chance to improve performance without the help of pipelining. After we successfully simulated the multi-cycle version and optimized it, we moved on to the pipeline version. The optimizations made on both versions are discussed in the Discussions section.
Our multi-cycle design is based on the state diagram in the Control Units Descriptions section. State 0 and State 1 are common to all instruction types and stand for Instruction Fetch and Instruction Decode respectively. State 2 is common to the load and store instructions. Each remaining state corresponds to a specific instruction. The design handles the NOP (No Operation) instruction and the reserved-for-future exception, both of which are stated in the Project Description. A NOP simply goes from State 1 back to State 0: the PC (Program Counter) is incremented and nothing else is done. The reserved-for-future exception is treated as an invalid OPCODE (Operation Code): the address of the current instruction is stored in memory location FFFF and further execution halts. The Control Unit is implemented in VHDL and then converted into a symbol.
The pipeline divides a typical instruction into five stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory Access), and WB (Write Back). Each stage takes one clock cycle. The control signals of the Control Unit are listed in the Control Units Descriptions section. The Control Unit is also implemented in VHDL, but instead of a finite state machine it is a single-cycle machine, because its signals are propagated to later stages through registers at the end of each stage. The advantage of pipelining is that each component (Control Unit, ALU, Memory, etc.) is used fully in parallel while each instruction still appears to have its own datapath. However, this design has a higher cost, as each component can be used only once per instruction. For example, there must be at least 2 ALUs or Adders (our design uses 3): one in the IF stage for the PC increment and another in the EX stage for normal execution. Furthermore, extra hardware and design effort are needed to tackle problems such as data hazards and branch hazards. More details on optimization and hazard handling are provided in the Discussions section.
COMPONENTS DESCRIPTIONS
The Multi-cycle and Pipeline designs use the same set of components with minor modifications, although the Pipeline design needs more hardware. Here are the significant components used in the two versions:
ALU Control Unit / Controls the operation that takes place in the ALU. It has 2 control input sources: the Control Unit and the current instruction. The Control Unit can select the ALU operation explicitly (ADD, SUB, or AND), or the operation can depend on the FUNC field of the instruction.
[1] Adder / A simplified version of the ALU. It does not include other operations such as AND and XOR, which greatly reduces cost and increases efficiency. No control input is required, and only ADD can be performed.
[2] ALU / Performs the ADD, SUB, AND, XOR, and SHIFT operations according to its control inputs. It has two 16-bit data inputs and a 3-bit control input, with a 16-bit result output.
[1] Branch Detection Unit / Detects whether the input value is zero or negative and outputs two 1-bit values: NEG = 1 for a negative number and ZERO = 1 for zero (0x0000); otherwise each is 0.
[3] EndianfromMem / EndiantoMem / Interchange the big/little-endian bit ordering of a value, since all buses in our design are declared as (15:0) while the memory uses (0:15).
Instrreg / Extracts each field (rs, rt, rd, func, immd, opcode) from a 16-bit instruction read from memory.
LoadImmUnit / Performs the load immediate operation. It has 1 input from the Control Unit that determines which position (high or low 8 bits) is loaded.
MemoryBlock / Contains the main memory of the datapath, together with the EndianfromMem/EndiantoMem components that invert the bit ordering. It is an abstracted form of memory.
RegFile / Contains 8 x 16-bit registers. It can read 2 registers simultaneously, while writes are synchronous. It has an enable signal that controls Read/Write of the selected register.
SignExtendUnit / Sign-extends the immediate value from the instruction. An input signal from the Control Unit determines which immediate value (6 or 8 bits) is processed.
[1] ForwardUnit / Detects data hazards in the pipeline design and forwards the appropriate value to the current stage. Implemented in VHDL.
[1] HazardUnit / Detects data hazards that forwarding cannot resolve and stalls the pipe by inserting a NOP instruction into the current stage (IF). Implemented in VHDL.
[1] NOP Instruction / Detects whether the current instruction needs a data memory access and, if so, inserts a NOP instruction to stall the pipe. This is necessary because there is only one memory, which can perform either an instruction fetch or a data access in a given cycle. Implemented in VHDL.
[1] PCSourceUnit / Enables the PC and determines its source. It handles branch instructions and the insertion of NOP instructions in the IF stage. Implemented in VHDL.
Besides these components, several other common hardware blocks are used: 2-to-1 and 4-to-1 16-bit MUXes, and the various registers used in the pipeline design to save all values at the end of each stage. We do not explain these common components, and their printouts are omitted to save pages.
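The behavior of three of the smaller units described above can be sketched in a few lines of Python. This is only an illustrative model, not the schematic or VHDL used in the project. The function names are ours, and the assumption that the immediate occupies the low 6 or 8 bits of the value passed in is made for the example only.

```python
MASK16 = 0xFFFF

def sign_extend(instr_bits, immd_src):
    """SignExtendUnit: immd_src 0 selects the 6-bit immediate, 1 the 8-bit one.

    Assumption: the immediate sits in the low bits of instr_bits.
    """
    width = 8 if immd_src else 6
    value = instr_bits & ((1 << width) - 1)
    if value & (1 << (width - 1)):      # sign bit of the immediate is set
        value -= 1 << width
    return value & MASK16               # 16-bit two's complement result

def branch_detect(value):
    """Branch Detection Unit: returns (ZERO, NEG) for a 16-bit value."""
    value &= MASK16
    zero = 1 if value == 0 else 0
    neg = (value >> 15) & 1
    return zero, neg

def reverse_bits16(value):
    """EndiantoMem / EndianfromMem: map a (15:0) bus onto a (0:15) bus."""
    result = 0
    for i in range(16):
        if value & (1 << i):
            result |= 1 << (15 - i)
    return result
```

Applying `reverse_bits16` twice returns the original value, which is why the same operation can serve both directions of the memory interface.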
[1]
CONTROL UNITS DESCRIPTIONS
Control Units are built in VHDL for both the Multi-cycle and Pipeline datapaths. The Control Unit of the Multi-cycle design is a finite state machine, and its state diagram is provided. By and large, all instructions except load/store (memory access) complete within 3 states. State 0 and State 1 correspond to Instruction Fetch and Instruction Decode respectively and are common to all instructions. In the state diagram, every final state loops back to State 0 to fetch a new instruction. Here is a description of each state:
State Number / Tasks
0 / Fetch the instruction into the Instruction Register
Increment the PC
1 / Calculate the branch address as PC + Immediate
2 / Select which type of immediate is used (6 or 8 bits) and sign-extend it
Calculate the memory access address
3 / Select the ALU result, instead of the PC, as the memory address input
Put the memory into Read mode
Select which register is written and its data source
4 / Select the ALU result, instead of the PC, as the memory address input
Put the memory into Write mode
5 / Select the RegFile as ALU sources A and B
Set the ALU Control to follow the FUNC field of the instruction
Select which register is written and its data source
6 / Select the RegFile as ALU source A and the Immediate as source B
Select which type of immediate is used and sign-extend it
Set the ALU Control to perform an explicit ADD
7 / Select the RegFile as ALU source A and the Immediate as source B
Select which type of immediate is used and sign-extend it
Set the ALU Control to perform an explicit AND
8 / Select which half of the register the immediate is loaded into (upper 8 bits)
Select which register is written and its data source
9 / Select which half of the register the immediate is loaded into (lower 8 bits)
Select which register is written and its data source
10 / Select the RegFile as ALU source A and the constant 0 as source B
Set the ALU Control to perform an explicit ADD
Select which ALU signal is used (NEG)
Select which source the PC reads from
11 / Select the RegFile as ALU source A and the constant 0 as source B
Set the ALU Control to perform an explicit ADD
Select which ALU signal is used (ZERO)
Select which source the PC reads from
12 / Select the RegFile as the PC source
Enable the PC to load the new value
Select the PC as ALU source A and the constant 0 as source B
Set the ALU Control to perform an explicit ADD
Select which register is written and its data source
13 / Select the RegFile as the PC source
Enable the PC to load the new value
As the Register File and all other registers, such as PC and ALUOUT, are synchronous, all data are written on the positive edge of the next clock cycle.
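The state sequences described above can be summarized as a small table-driven model. The sketch below is illustrative Python, not the project's VHDL. The instruction-class labels are hypothetical names for the groups the table describes, and the 60ns clock period is the figure quoted in the Discussions section.

```python
# States visited by each instruction class, following the table above.
# The class names are illustrative labels, not WISC-99S mnemonics.
STATE_SEQUENCE = {
    "nop":            [0, 1],
    "load":           [0, 1, 2, 3],
    "store":          [0, 1, 2, 4],
    "comp_register":  [0, 1, 5],
    "add_immediate":  [0, 1, 6],
    "and_immediate":  [0, 1, 7],
    "load_imm_upper": [0, 1, 8],
    "load_imm_lower": [0, 1, 9],
    "branch_on_neg":  [0, 1, 10],
    "branch_on_zero": [0, 1, 11],
    "jump_save_pc":   [0, 1, 12],
    "jump_register":  [0, 1, 13],
}

CLOCK_NS = 60  # multi-cycle clock period quoted in the Discussions section

def cycles(instr_class):
    """Number of clock cycles (one per state) an instruction class takes."""
    return len(STATE_SEQUENCE[instr_class])

def run_time_ns(program):
    """Total time for a list of instruction classes executed serially."""
    return sum(cycles(c) for c in program) * CLOCK_NS
```

For example, `run_time_ns(["comp_register"] * 3)` gives 540, since each of the three instructions passes through 3 states of 60ns.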
State Diagram for Multi-cycle Control Unit:
State / Control signals asserted
State 0 / PCWrite = 1, MemRW = 1, ALUA = 0, ALUB = 01, IorD = 00, PCSource = 00, IRWrite = 1, ALUOp = 00
State 1 / ALUA = 1, ALUB = 10, ALUOp = 00, ImmdSrc = 1
State 2 / ALUA = 1, ALUB = 10, ALUOp = 00, ImmdSrc = 0
State 3 / IorD = 0, MemRW = 1, MemtoReg = 1, RegDst = 10, RegWrite = 1
State 4 / PCSource = 10, PCWrite = 1
State 5 / ALUA = 1, ALUB = 0, ALUOp = 11, MemtoReg = 0, RegDst = 0, RegWrite = 1
State 6 / ALUA = 1, ALUB = 10, ALUOp = 00, ImmdSrc = 0, MemtoReg = 00, RegDst = 01, RegWrite = 1
State 7 / ALUA = 1, ALUB = 10, ALUOp = 10, ImmdSrc = 0, MemtoReg = 00, RegDst = 01, RegWrite = 1
State 8 / MemtoReg = 10, RegDst = 01, RegWrite, LdImmdPos = 0
State 9 / LdImmdPos = 1, MemtoReg = 10, RegDst = 01, RegWrite = 1
State 10 / ALUA = 1, ALUB = 11, ALUOp = 00, PCWriteCond = 1, BranchType = 0, PCSource = 01
State 11 / ALUA = 1, ALUB = 11, ALUOp = 00, PCWriteCond = 1, BranchType = 1, PCSource = 01
State 12 / ALUA = 0, ALUB = 11, MemtoReg = 0, RegDst = 10, PCSource = 10, PCWrite = 1, RegWrite = 1
State 13 / PCSource = 10, PCWrite = 1
Invalid1 / ALUA = 0, ALUB = 01, ALUOp = 01
Invalid2 / MemData = 1, IorD = 10, MemRW = 0
End / All signals are 0, except MemRW = 1
Here is a description of each control signal (used in both the Multi-cycle and Pipeline designs):
Control Signal / Descriptions
ALUA / Selects the input for ALU Source A
0 – PC; 1 – Register File
ALUB / Control the input as ALU Source B
00 – Register File
01 – Constant 0x0001
10 – Immediate value
11 – Constant 0x0000
ALUOp / Control the operation of ALU
00 - Explicit ADD
01 - Explicit SUB
10 - Explicit AND
11 – according to FUNC in the instruction
PCWrite / Enable/Disable the PC
MemRW / Read/Write mode of memory
1 – Read
0 – Write
MemtoReg / Selects the data to be written into the Register File
00 – value from ALU
01 – value from Memory
10 – value from Load Immediate unit
RegDst / Determine which register should be written
RegWrite / Enable/Disable write to Register File
ImmdSrc / Determine which immediate should be signed extended
0 – 6 bit Immediate
1 – 8 bit Immediate
LdImmdPos / Determines which position (high/low) of the register the immediate is loaded into
0 – upper 8 bit
1 – lower 8 bit
IorD / Control the source address to memory
00 – PC
01 – ALU
10 – Constant 0xFFFF
PCSource / Control the source of PC
00 – PC + 1
01 – Branch Address
10 – Register File
11 – Constant 0x0000
PCWriteCond / Indicates that the PC may be written; whether it actually is written depends on the result of the comparison.
BranchType / Determine which signal from ALU should be read
0 – Neg (for BLT instruction)
1 – Zero (for BEQ instruction)
MemData / Selects the source of the data input to memory
0 – Register Value
1 – ALU
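Two of the signal groups above lend themselves to a short worked example: how ALUOp selects the ALU operation, and how PCWriteCond and BranchType gate the PC. The Python below is a sketch under stated assumptions rather than the project's VHDL. The FUNC encoding in `FUNC_TO_OP` is hypothetical (the real encoding is defined in the Project Description), and combining PCWrite with the conditional write through an OR is our assumption about how the two enables meet.

```python
ADD, SUB, AND, XOR, SHIFT = "ADD", "SUB", "AND", "XOR", "SHIFT"

# Hypothetical FUNC encoding, used only to make the example runnable.
FUNC_TO_OP = {0: ADD, 1: SUB, 2: AND, 3: XOR, 4: SHIFT}

def alu_control(alu_op, func):
    """ALUOp 00/01/10 force ADD/SUB/AND; ALUOp 11 defers to the FUNC field."""
    explicit = {0b00: ADD, 0b01: SUB, 0b10: AND}
    if alu_op in explicit:
        return explicit[alu_op]
    return FUNC_TO_OP[func]

def pc_enable(pc_write, pc_write_cond, branch_type, alu_result):
    """Return 1 when the PC should load a new value.

    BranchType 0 tests NEG, BranchType 1 tests ZERO on the ALU result.
    Assumption: unconditional and conditional enables are ORed together.
    """
    alu_result &= 0xFFFF
    neg = (alu_result >> 15) & 1
    zero = 1 if alu_result == 0 else 0
    flag = zero if branch_type == 1 else neg
    return pc_write | (pc_write_cond & flag)
```

In the branch states the ALU adds the constant 0 to the register value, so `alu_result` is simply the register being tested.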
The Control Unit for the Pipeline datapath, like that of the previous version, is implemented in VHDL and uses essentially the same set of control signals with the same descriptions. However, it is not designed as a finite state machine. Instead, it is implemented as a single-clock-cycle machine in which all signals are generated as soon as the OPCODE of a new instruction is received. These signals are then propagated through the datapath by the control registers ID_EX_Control, EX_MEM_Control, and MEM_WB_Control. Due to the page limit, the printouts of these registers are excluded.
There is no state diagram for the Pipeline Control Unit. The output signals for each OPCODE are simply the combination of the signals of all states that the OPCODE requires (except State 0 and State 1). For example, for a computational register type instruction, the output signals are simply those of State 5. For a load instruction, they are the combination of the signals in State 2 and State 3.
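The idea of building a pipeline control word by merging per-state signals can be sketched as follows. This is an illustrative Python model, not the project's VHDL. The signal values are copied from the multi-cycle state diagram for the address-calculation state (State 2), the memory-read state (State 3), and the register-computation state (State 5), with IorD left out; the names `combine` and `PIPELINE_CONTROL` are ours.

```python
# Signals asserted in selected multi-cycle states (values as binary fields).
STATE_SIGNALS = {
    2: {"ALUA": 1, "ALUB": 0b10, "ALUOp": 0b00, "ImmdSrc": 0},
    3: {"MemRW": 1, "MemtoReg": 0b01, "RegDst": 0b10, "RegWrite": 1},
    5: {"ALUA": 1, "ALUB": 0b00, "ALUOp": 0b11,
        "MemtoReg": 0b00, "RegDst": 0b00, "RegWrite": 1},
}

def combine(*states):
    """Merge the signals of several states into one control word.

    Raises ValueError if two states drive the same signal differently,
    since a single-cycle control word cannot hold both values.
    """
    word = {}
    for state in states:
        for name, value in STATE_SIGNALS[state].items():
            if word.get(name, value) != value:
                raise ValueError("conflicting values for " + name)
            word[name] = value
    return word

# Control words generated once per instruction, when the OPCODE is decoded.
PIPELINE_CONTROL = {
    "comp_register": combine(5),
    "load": combine(2, 3),
}
```

The conflict check is a design aid only: states that belong to different instruction types, such as State 2 and State 5, are never merged in practice.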
COSTS OF DESIGN
Below is the cost breakdown for Multi-cycle design:
Block / And / Or / Other / Buffer / Register / RAM / MUX / Decoder / Cost
MemBlock / 1 / 1 / 0.48
InstrReg / 16 / 96
RegFile / 8*and2 / 128 / 32*mux81 / Dec38 / 1557
SignExtendUnit / 16*mux21 / 16
ALU / 39*and2, 18*and3, 14*and4 / 9*or2, 5*or3, 10*or4 / 96*xor2, 7*inv, 1*nor16 / 18*mux21, 17*mux41 / 796
LoadImmdUnit / 16*mux21 / 32
ALUControl / 1*and2 / 3*mux21 / 7
Misc. / 1*and2 / 1*nor2 / 64 / 49*mux21, 83*mux41 / 1148
Total / 3652.48
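The multi-cycle total can be checked by adding the per-block costs from the table. The short Python sketch below only restates the figures above; the dictionary name is ours.

```python
# Per-block costs of the multi-cycle design, copied from the table above.
MULTI_CYCLE_COST = {
    "MemBlock": 0.48,
    "InstrReg": 96,
    "RegFile": 1557,
    "SignExtendUnit": 16,
    "ALU": 796,
    "LoadImmdUnit": 32,
    "ALUControl": 7,
    "Misc.": 1148,
}

# Round to two decimals to avoid floating-point noise in the sum.
total = round(sum(MULTI_CYCLE_COST.values()), 2)

# The block that contributes most to the total cost.
largest_block = max(MULTI_CYCLE_COST, key=MULTI_CYCLE_COST.get)
```

The sum reproduces the stated total of 3652.48, with the register file as the single most expensive block.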
Below is the cost breakdown for Pipeline design:
Block / And / Or / Other / Buffer / Register / RAM / MUX / Decoder / Cost
MemBlock / 1 / 1 / 0.48
InstrReg / 16 / 96
RegFile / 8*and2 / 128 / 32*mux81 / Dec38 / 1557
SignExtendUnit / 16*mux21 / 16
ALU / 39*and2, 16*and3, 14*and4 / 8*or2, 5*or3, 10*or4 / 96*xor2, 7*inv / 18*mux21, 17*mux41 / 612.28
LoadImmdUnit / 16*mux21 / 32
ALUControl / 1*and2 / 3*mux21 / 7
2*Adder / 72*and2, 30*and3, 30*and4 / 10*or2, 10*or3, 10*or4 / 160*xor2 / 1025.4
MEM_WB_Control / 5 / 30
IF_ID_Reg / 32 / 192
MEM_WB_Reg / 57 / 342
ID_EX_Control / 15 / 90
ID_EX_Reg / 84 / 504
EX_MEM_Control / 9 / 54
EX_MEM_Reg / 57 / 342
Misc. / 1*and2 / 1*nor2 / 6*mux21, 10*mux41 / 92
Total / 4997.16
DISCUSSIONS
Optimizations have been made on both versions to improve performance and reliability. In this section, we discuss the optimizations in both the Multi-cycle and Pipeline datapaths and the hazard handling in the latter. We also discuss the pros and cons of each design.
Multi-cycle:
- Optimization
- In the Multi-cycle design, all instructions are executed serially, so the best way to improve performance is to reduce the number of clock cycles each instruction takes and to shorten the clock period. In the design, the clock period is determined by the longest path among all states. We found that the bottleneck is the ALU execution, which we cannot really improve. Therefore, we focused on reducing the number of clock cycles for each instruction type.
- As mentioned in the text, the first two states, State 0 and State 1, are common to all instructions and cannot be eliminated. We targeted the states after State 1 and succeeded in making most instructions complete in 3 states (including State 0 and State 1). The 2 exceptions are the load and store instructions, which take 4 states. The technique that removes 1 state from the computational register, computational immediate, and load instructions relies on the synchronous register file. Because data are written only on the next positive clock edge, we supply the value to the register file directly from the ALU and the Memory Block instead of first storing it in ALUOut or the Memory Data Register, which would cost an extra clock cycle.
- Through this optimization, we save 1 clock cycle (60ns in our design) for each computational register and computational immediate instruction, which are the most common types. In the test program, for example, 55 instructions (more than 90%) belong to these 2 categories, which saves 55 x 60 = 3300ns in total.
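The saving quoted above follows from simple arithmetic, sketched here in Python. The function names are ours; the 60ns period, the 4-to-3 state reduction, and the count of 55 instructions come from the discussion above.

```python
CLOCK_NS = 60  # multi-cycle clock period

def instruction_time_ns(states):
    """Execution time of one instruction that passes through `states` states."""
    return states * CLOCK_NS

def total_saving_ns(count, states_before=4, states_after=3):
    """Time saved when `count` instructions each drop from 4 to 3 states."""
    per_instruction = (instruction_time_ns(states_before)
                       - instruction_time_ns(states_after))
    return count * per_instruction
```

Each affected instruction drops from 240ns to 180ns, and `total_saving_ns(55)` gives the 3300ns quoted for the test program.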
Pipeline:
- Optimization and hazard handling:
- Using a Pipeline design is in itself a great improvement in efficiency over the Multi-cycle design, as most instructions are executed “almost” in parallel. Still, we made several modifications to further improve its reliability and efficiency.
- First of all, to handle data hazards, we implemented a forwarding unit, as described in the text, which forwards data from different stages. However, our design differs from the one in the text. There, the MUX is placed at the beginning of the EX stage and selects the appropriate value from the stage registers (EX/MEM, MEM/WB). Our design instead puts the MUX at the end of the ID stage. There are several reasons for doing so:
1) As mentioned, the clock period is determined by the worst path among all stages, and in our design EX is the bottleneck. Thus, the clock period has to be based on the longest time taken in the EX stage. Putting the MUX in the EX stage would lengthen the clock period. A 4-to-1 MUX has a 4ns delay, and if the clock period grows by 8ns, each instruction takes 8 x 5 = 40ns longer to complete, which is a huge trade-off.
2) Putting the MUX in the ID stage simulates the “Write before Read” behavior of the register file. In our design, the register file is synchronous and data are not written until the next clock edge. Thus, data cannot be read until the next clock, and “Write before Read” cannot be accomplished within the same clock. However, by putting the MUX at the end of the ID stage, we can forward the value directly from the wire carrying the value that will be written on the next clock edge. In other words, instead of obtaining the value after the register file has stored it, we obtain the value at the same time as the register file does. If the forwarding MUX were in the EX stage, the only way to accomplish this would be to add another MUX in the ID stage, which would increase both the cost and, as mentioned above, the delay of the EX stage.
3) This point relates to the branch hazard. We determine whether a branch is taken in the ID stage, because this way a branch stalls the pipe for only 1 clock cycle. However, if we followed the text and compared the values from the register file directly, a data hazard would occur. Therefore, instead of comparing the values from the register file, we compare the forwarded values from the MUX. If the forwarding MUX were in the EX stage, the branch could not be determined until the EX stage, and one more clock cycle would be stalled. In our pipeline version, each clock cycle is 64ns, so determining the branch in the EX stage would be rather costly.
- For these reasons, putting the MUX in the ID stage increases both the efficiency and the reliability of the datapath.
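The selection performed by a forwarding MUX at the end of the ID stage, and the branch test on the forwarded value, can be sketched as below. This is a hypothetical Python model, not the project's ForwardUnit VHDL. The `Producer` record, the `STALL` sentinel, and the rule that the youngest matching producer wins are assumptions made for illustration; the exact set of stages the real unit forwards from is not restated here.

```python
from collections import namedtuple

# An instruction ahead of the one in ID. `value` is None while its result
# does not exist yet (for example a load that has not yet read memory).
Producer = namedtuple("Producer", "writes_reg dest value")

STALL = object()  # sentinel: the operand is not available, so the pipe stalls

def select_operand(src, regfile_value, producers):
    """Model of the forwarding MUX at the end of ID.

    `producers` lists the instructions in later stages, youngest first, so
    the most recent write to `src` wins. Without a match, the value read
    from the register file is used.
    """
    for p in producers:
        if p.writes_reg and p.dest == src:
            return STALL if p.value is None else p.value
    return regfile_value

def branch_taken(branch_type, operand):
    """Branch test on the forwarded operand: type 0 tests NEG, 1 tests ZERO."""
    if operand is STALL:
        raise ValueError("operand not ready; the branch must wait")
    operand &= 0xFFFF
    neg = (operand >> 15) & 1
    zero = 1 if operand == 0 else 0
    return bool(zero if branch_type else neg)
```

Because the branch test consumes the output of `select_operand` rather than the raw register file value, a branch that depends on an instruction still in flight sees the up-to-date operand while in ID.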
Comparison:
- The advantage of the pipeline is, of course, its performance in the long run. The Multi-cycle datapath takes #ns to complete the test program with a 60ns clock cycle, while the pipeline takes only #ns with a 64ns clock cycle. This is almost a 50% improvement.
- The disadvantage is cost. Additional hardware is spent on the forwarding and hazard detection units and on MUXes. From the tables above, the Pipeline design is over 1000 units more expensive than the Multi-cycle design. Secondly, due to the additional hardware, the clock period increases. Also, in the pipeline design every instruction, no matter its type, has to go through all 5 stages, so the advantage shrinks in a short run. For 3 R-type instructions, the multi-cycle design, though it runs serially, takes 3 x 3 x 60 = 540ns, while the pipeline takes 7 x 64 = 448ns. For a single instruction, the multi-cycle design is actually faster: 3 x 60 = 180ns against 5 x 64 = 320ns.
- Overall, weighing performance against cost, the Pipeline design is worth more than the Multi-cycle design.
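The short-run comparison can be reproduced with a simple timing model. The Python sketch below assumes 3 states per instruction for the multi-cycle design and an ideal pipeline with no stalls unless `stall_cycles` is given; the function names are ours.

```python
MULTI_CLOCK_NS = 60   # multi-cycle clock period
PIPE_CLOCK_NS = 64    # pipeline clock period
PIPE_STAGES = 5       # IF, ID, EX, MEM, WB

def multi_cycle_ns(n, cycles_each=3):
    """Time for n instructions executed serially, 3 states each by default."""
    return n * cycles_each * MULTI_CLOCK_NS

def pipeline_ns(n, stall_cycles=0):
    """Time for n instructions: fill the pipe once, then one per cycle."""
    return (PIPE_STAGES + (n - 1) + stall_cycles) * PIPE_CLOCK_NS

def break_even():
    """Smallest instruction count at which the ideal pipeline is faster."""
    n = 1
    while pipeline_ns(n) >= multi_cycle_ns(n):
        n += 1
    return n
```

With these figures the pipeline costs 320ns against 180ns for one instruction and 384ns against 360ns for two, and it first pulls ahead at three instructions (448ns against 540ns).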
Performance of Multi-cycle = 1 / # =