ECE4480/5480
spring 2009
open textbook, lecture notes, homework solutions only
No other materials are allowed
Name: ______Solution______
1. Assume that a design team is considering enhancing a machine by adding MMX (multimedia extension instruction) hardware to a processor. When a computation is run in MMX mode on the MMX hardware, it is 10 times faster than the normal mode of execution. Call the percentage of time that could be spent using the MMX mode the percentage of media enhancement.
(a) (3%) What percentage of media enhancement is needed to achieve an overall speedup of 2?
We will use Amdahl’s Law for this question.
Execution time with Media Enhancement =
(Execution time improved by Media enhancement)/(Amount of Improvement) +
Execution time unaffected
Let x be the percent of media enhancement needed for achieving an overall speedup of 2.
Then, (100)/2 = (x)/10 + (100-x)
Solving for x, we have x = 55.55
(b) (3%) What percentage of the run-time is spent in MMX mode if a speedup of 2 is achieved? (Hint: You will need to calculate the new overall time.)
The new overall time is 100/2 = 50.
MMX time = 55.55/10 = 5.55 The rest 1 – 55.55 = 44.45
5.55/50 = 11.1%
(c) (3%) What percentage of the media enhancement is needed to achieve one-half the maximum speedup attainable from using the MMX mode?
The maximum speedup using MMX mode occurs when the whole program can run in media enhancement mode.
The maximum speedup in this case is 10. One-half of this is 5.
(100)/5 = (x)/10 + (100-x)
Solving for x, we get x = 88.89
2. (10%) In MIPS assembly, write an assembly language version of the following C code segment:
for (i = 0; i < 98; i ++) {
C[i] = A[i + 1] - A[i] * B[i + 2]
}
Arrays A, B and C start at memory location A000hex, B000hex and C000hex respectively. Try to reduce
the total number of instructions and the number of expensive instructions such as multiplies.
The MIPS assembly sequence is as follows:
li $s0, 0xA000 # Load Address of A
li $s1, 0xB000 # Load Address of B
li $s2, 0xC000 # Load Address of C
li $t0, 0 # Starting index of i
li $t5, 98 # Loop bound
loop:
lw $t1, 0($s1) # Load A[i]
lw $t2, 8($s2) # Load B[i+2]
mul $t3, $t1, $t2 # A[i] * B[i+2]
lw $t1, 4($s1) # Load A[i+1]
sub $t2, $t1, $t3 # A[i+1] - A[i]*B[i+2]
sw $t2, 0($s3) # C[i] = A[i+1] + A[i]*B[i+2]
addi $s1, 4 # Go to A[i+1]
addi $s2, 4 # Go to B[i+1]
addi $s3, 4 # Go to C[i+1]
addi $t0, 1 # Increment index variable
bne $t0, $t5, loop # Compare with Loop Bound
halt:
nop
3.(a) (2%) n the snippet of MIPS assembler code below, how many times is instruction memory accessed? How many times is data memory accessed? (Count only accesses to memory, not registers.)
(1) lw $v1, 0($a0)
(2) addi $v0, $v0, 1
(3) sw $v1, 0($a1)
(4) addi $a0, $a0, 1
Four instruction memory accesses; two data memory accesses
3.(b) (3%) If run the above code on MIPS 5-stage pipeline processor studied in class.
Are there any data hazards? If there are, identify all data hazards and how to solve these data hazards.
Data hazard between (1) (3). Because when sw does the decode, $v1 has not been written by lw already in the pipeline.
This hazard can be solved by forwarding the $v1 to memory data input.
4. (a)(2%) Assuming single precision IEEE 754 format, what decimal number is represent by this word:
1 01111101 00100000000000000000000
The decimal number = (-1)* (2^(125-127))*(1.001)2
= (-1)*(0.25)*(1.125)
= - 0.28125
4. (b)(2%) In the 32-bit IEEE format, what is the encoding for negative zero?
1 00000000 00000000000000000000000
4.(c) (2%) What is the smallest positive (not including +0) representable number in 32-bit IEEE single precision floating point? Show the bit encoding and the value in base 10 (fraction or decimal)
The smallest positive representable number 32-bit IEEE 754 single precision floating point value is
0 00000000 00000000000000000000001
Its decimal value is = (+1) * (2(0-127) )*(1. 00000000000000000000001)
= 5.87747175411 ´ 10-39
5. (10%) This problem covers 4-bit binary unsigned restoring division. Fill in the table for the Quotient, Divisor and Dividend for each step. You need to provide the DESCRIPTION of the step being performed (shift left, shift right, sub). The value of Divisor is 4 (0100, with additional 0000 bits shown for right shift), Dividend is 6 (initially loaded into the Remainder).
6. (30%) For the MIPS datapath shown below, several lines are marked with “X”. For each one:
• Describe in words the negative consequence of cutting this line relative to the working, unmodified processor.
• Provide an example of code that will fail
• Provide an example of code that will still work
(1) Cannot write to register file. This means that R-type and any instruction with write back to register file will fail.
An example of code snippet that would fail is:
add $s1, $s2, $s3
An example of a code snippet that will not fail is:
sw $s1, 0($s2)
(2) Forwarding of the first operand fails.
An example of code snippet that would fail is:
add $s1, $t0, $t1
add $s1, $s1, $s1
An example of code snippet that will not fail is:
add $s1, $t0, $t1
add $s1, $t2, $s1 # Here the second operand is forwarded correctly
(3) Jumping to a branch target does not work.
Example of code that fails:
addi $s1, $zero, 2
addi $s2, $zero, 2
beq $s1, $s2, exit
Code that will still work:
addi $s1, $zero, 10
addi $s2, $zero, 20
beq $s1, $s2, exit
7. (20%) Consider a simple singlecycle implementation of MIPS ISA. The operation times for the major functional components for this
machine are as follows:
Component Latency
ALU 10 ns
Adder 8 ns
ALU Control Unit 2 ns
Shifter 3 ns
Control Unit/ROM 4 ns
Sign/zero extender 3 ns
2-1 MUX 2 ns
Memory (read/write) (instruction or data) 15 ns
PC Register (read action) 1 ns
PC Register (write action) 1 ns
Register file (read action) 7 ns
Register file (write action) 5 ns
Logic (1 or more levels of gates) 1 ns
Below is a copy of the MIPS single-cycle datapath design. In this implementation the clock cycle is determined by the longest possible path in the machine. The critical paths for the different instruction types that need to be considered are: R-format, Load-word, and store-word. All instructions have the same instruction fetch and decode steps. The basic register transfer of the instructions are:
Fetch/Decode: Instruction <- IMEM[PC];
R-type: R[rd] <- R[rs] op R[rt]; PC <- PC + 4;
load: R[rt] <- DMEM[ R[rs] + signext(offset)]; PC <- PC +4;
store: DMEM[ R[rs] + signext(offset)] <- R[Rt]; PC <- PC +4;
7. continued
In the table below, indicate the components that determine the critical path for the respective instruction, in the order that the critical path occurs. If a component is used, but not part of the critical path of the instruction (ie happens in parallel with another component), it should not be in the table. The register file is used for reading and for writing; it will appear twice for some instructions. All instruction begin by reading the PC register with a latency of 2ns.
R-format: PC Instr_Mem Reg_Read MUX ALU Mux Reg_Write
Load: PC Instr_Mem Reg_Read MUX ALU Mem_Read Mux Reg_Write
Store; PC Inst_Mem Reg_Read MUX ALU Mem_Write
Place the latencies of the components that you have decided for the critical path of each instruction in the table below. Compute the sum of each of the component latencies for each instruction.
R-format: 2ns 15ns 7ns 2ns 10ns 2ns 5ns total:43ns
Load: 2ns 15ns 7ns 2ns 10ns 15ns 2ns 5ns total: 58ms
Store: 2ns 15ns 7ns 2ns 10ns 15ns total: 51ns
Use the total latency column to derive the following critical path information:
• Given the data path latencies above, which instruction determines the overall machine critical path (latency)?
The load word instruction
• What will be the resultant clock cycle time of the machine based on the critical path instruction?
58 ns
• What frequency will the machine run?
1/58 ns = 17 Mhz
8. (10%)Consider the following assembly language code:
I0: add $ r4, $ r1, $ r0 // ADD R4 = R1 + R0;
I1: sub $ r9, $ r3, $ r4 // SUB R9 = R3 - R4;
I2: add $ r4, $ r5, $ r6 // ADD R4 = R5 + R6;
I3: lw $ r2, 100($ r3) // LDW R2 = MEM[R3 + 100];
I4: lw $ r2, 0($ r2) // LDW R2 = MEM[R2 + 0];
I5: sw $ r2, 100($ r4) // STW MEM[R4 + 100] = R2;
I6: and $ r2, $ r2, $ r2 //AND R2 = R2 & R1;
I7: beq $ r9, $ r1, target // BEQ R9 == R1, Target;
I8: and $ r9, $ r9, $ r1 //AND R9 = R9 & R1;
Consider a pipeline with forwarding, hazard detection, and 1 delay slot for branches. Thepipeline is the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code, complete the pipeline diagram below (instructions on the left, cycles on top) for the code. Insert the characters IF, ID, EX, MEM, WB for each instruction in the boxes. Assume that there two levels of bypassing, that the second half of the decode stage performs a read of source registers, and that the first half of the write-back stage writes to the register file.
Label all data stalls (Draw an X in the box). Label all data forwards that the forwarding unit detects (arrow between the stages handing off the data and the stages receiving the data). What is the final execution time of the code?
T0 / T1 / T2 / T3 / T4 / T5 / T6 / T7 / T8 / T9 / T10 / T11 / T12 / T13 / T14 / T15 / T16I0 / IF / ID / EX / M / WB
I1 / IF / ID / EX / M / WB
I2 / IF / ID / EX / M / WB
I3 / IF / ID / EX / M / WB
I4 / IF / ID / X / EX / M / WB
I5 / IF / X / ID / EX / M / WB
I6 / IF / ID / EX / M / WB /
I7 / IF / ID / EX / M / WB
I8 / X / IF / ID / EX / M / WB
one delay slot means the new PC is calculated at ID stage
total clock cycles = 15 cycles