Sample Final Exam: CDA5155 (Computer Architecture Principles), Fall 2016

Name: ______; UF ID: ______

On my honor, I have neither given nor received unauthorized aid on this examination.

Signature: ______

Answer all questions precisely, clearly, and concisely inside the space provided. You MUST show the solution steps that lead to the final answer.

  1. (15pt) Multiple choice (you may select multiple answers for each question):
     1. Which of the following statements about pipelining is incorrect?
        1. Pipelining improves the execution of a single instruction. (V)
        2. Pipelining is an implementation technique that is invisible to the programmer.
        3. Stalling the pipeline is the only way to handle structural hazards. (V)
        4. In the integer MIPS pipeline, memory references may cause data hazards.
        5. Bypassing and forwarding solve data hazards. (V)
     2. Memory consistency (select the true statements):
        1. Sequential consistency must be enforced by hardware.
        2. Sequential consistency requires each processor to maintain its own memory reference order. (V)
        3. Sequential consistency requires a global memory reference order among multiple processors that is observed by all processors. (V)
        4. Processor consistency allows a processor's read to bypass an earlier write. (V)
        5. Synchronization instructions can be used to enforce a certain order with respect to regular memory reads and writes. (V)
     3. Select the true statements:
        1. An Nvidia GPU has superior performance to a multi-core CPU for general-purpose applications because the GPU has more hardware cores to execute more threads in parallel, with higher GFLOPS.
        2. Although CUDA provides two different levels of parallelism, threads and thread blocks, in practice it makes no difference to combine the two levels into a single level of parallel threads on an Nvidia GPU.
        3. The shared memory model of computation in CUDA avoids programmer involvement in data communication and provides fast memory accesses.
        4. None of the above. (V)
  1. (10pt) Short answer about exploiting data-level parallelism:
  1. (10pt) Short answer about exploiting data-level parallelism:

Compare the similarities and differences among the vector processor, SIMD extension, and GPU approaches.

  1. (15pt) A shared snooping bus is often bandwidth limited and can only support a small number of processors. Assume an aggressive multiple-issue, out-of-order processor can achieve an average IPC of 1.1. The average number of memory accesses per instruction is 1.2. Assume the last-level cache miss rate (global miss rate) is 1.5%. All misses must issue their requests through the snooping bus. The bus cycle time is four times slower than the processor cycle time, and the bus can accept one processor request every 2 bus cycles. Furthermore, about 30% of the replaced blocks are dirty and need to be written back to memory. Calculate how many processors the snooping bus can support under the constraint that the average bus utilization cannot exceed 85%. Note that you can assume the data bus that transfers data back to the requester is not a bottleneck.

Answer:

Misses per processor cycle = 1.1 * 1.2 * 0.015 = 0.0198

Including 30% dirty writebacks: 0.0198 * (1 + 0.3) = 0.02574 bus requests per processor cycle

The bus is 4x slower and needs 2 bus cycles per request (8 processor cycles): 0.02574 * 8 = 0.20592 bus utilization per processor

At the 85% limit, the bus can support 4 processors: 0.20592 * 4 = 0.82368 <= 0.85 (a fifth processor would give 1.0296)
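For reference, the same calculation as a short C program (a sketch only; all numeric constants come from the problem statement, and the variable names are just illustrative):

```c
#include <stdio.h>

int main(void) {
    double ipc = 1.1;              /* average instructions per cycle        */
    double mem_per_instr = 1.2;    /* memory accesses per instruction       */
    double miss_rate = 0.015;      /* global last-level-cache miss rate     */
    double dirty_frac = 0.30;      /* fraction of replacements written back */
    double bus_limit = 0.85;       /* maximum allowed bus utilization       */

    /* bus requests per processor clock (misses plus dirty writebacks) */
    double req_per_cycle = ipc * mem_per_instr * miss_rate * (1.0 + dirty_frac);

    /* each request holds the bus for 2 bus cycles = 2 * 4 processor cycles */
    double util_per_proc = req_per_cycle * 2 * 4;            /* 0.20592 */

    int n = (int)(bus_limit / util_per_proc);                /* 4 */
    printf("utilization per processor = %.5f\n", util_per_proc);
    printf("processors supported at %.0f%% = %d (%.5f total)\n",
           bus_limit * 100.0, n, n * util_per_proc);         /* 0.82368 */
    return 0;
}
```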

  1. A TEST&RESET R2, 0(R1) instruction is included in the ISA; it atomically fetches the contents of memory location 0(R1) into R2 and resets that location to zero.
  2. (10pt) Rewrite the lock routine using the TEST&RESET instruction.

Answer:

Lock: LD R2, 0(R1)           ; ordinary load: spin without generating bus traffic

BEZ R2, Lock                 ; lock value 0 means held, keep spinning

TEST&RESET R2, 0(R1)         ; atomically fetch the lock value and clear it to 0

BEZ R2, Lock                 ; we read 0: another processor won the race, retry
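For readers more at home in C than MIPS assembly, the C11 sketch below mirrors the routine above under the same convention (nonzero = lock free, 0 = lock held); atomic_exchange plays the role of TEST&RESET, and the function names acquire/release are made up for illustration:

```c
#include <stdatomic.h>

/* Convention assumed by the answer key: *lock != 0 means free, 0 means held. */
static void acquire(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) == 0)      /* LD + BEZ: spin on ordinary reads      */
            ;
        if (atomic_exchange(lock, 0) != 0)  /* TEST&RESET: fetch old value, write 0  */
            return;                         /* got a nonzero value -> lock acquired  */
    }
}

static void release(atomic_int *lock) {
    atomic_store(lock, 1);                  /* restore a nonzero value to free the lock */
}
```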

  3. (5pt) Rewrite the lock routine using an LL/SC pair.

Answer:

Lock: LL R2, 0(R1)           ; load linked: read the lock value

BEZ R2, Lock                 ; lock value 0 means held, keep spinning

DADDUI R2, R0, #0            ; R2 <- 0, the value that marks the lock as held

SC R2, 0(R1)                 ; store conditional: succeeds only if 0(R1) is untouched since the LL

BEZ R2, Lock                 ; SC leaves 0 in R2 on failure, so retry
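As a rough, illustrative C11 analog (not part of the answer key), atomic_compare_exchange_weak behaves much like an LL/SC pair: it may fail even without interference, and a failed attempt performs no store:

```c
#include <stdatomic.h>

/* Same convention as above: *lock != 0 means free, 0 means held. */
static void acquire_llsc(atomic_int *lock) {
    for (;;) {
        int seen = atomic_load(lock);       /* LL: observe the current lock value */
        if (seen == 0)                      /* lock held -> keep spinning          */
            continue;
        if (atomic_compare_exchange_weak(lock, &seen, 0))
            return;                         /* SC succeeded -> lock acquired       */
        /* SC failed (value changed or spurious failure): no store happened, retry */
    }
}
```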

  4. (5pt) What is the advantage of using the LL/SC pair?

Answer:

An SC that fails does not perform a store, so it generates no global (bus/invalidation) traffic; the TEST&RESET instruction writes memory on every attempt and therefore generates invalidation traffic even when it fails to acquire the lock.

  1. The Alpha 21164 has two levels of on-chip cache. The first level is split into an 8KB instruction cache and an 8KB data cache, both direct-mapped. The second level is unified, 3-way set-associative, and 96KB in size. Both levels have 32-byte lines and are physically addressed.
  2. (10pt) Assume that the physical address is 40 bits long. For the first-level data cache, how many bits are in each of the tag, index, and offset fields? Also, calculate the total size (in bits) of the tags in the first-level data cache.

Answer: offset: 5 bits; index: 8 bits; tag: 40 - 8 - 5 = 27 bits. Total tag storage = 256 lines * 27 bits/line = 6912 bits.
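The field widths and total tag storage can be double-checked with a short C calculation (only the cache parameters given in the problem are used):

```c
#include <stdio.h>

int main(void) {
    int addr_bits   = 40;          /* physical address width      */
    int cache_bytes = 8 * 1024;    /* first-level data cache size */
    int line_bytes  = 32;          /* line (block) size           */
    int ways        = 1;           /* direct-mapped               */

    int sets = cache_bytes / (line_bytes * ways);             /* 256 */
    int offset_bits = 0, index_bits = 0;
    for (int b = line_bytes; b > 1; b >>= 1) offset_bits++;   /* log2(32)  = 5 */
    for (int s = sets;       s > 1; s >>= 1) index_bits++;    /* log2(256) = 8 */

    int tag_bits = addr_bits - index_bits - offset_bits;      /* 40 - 8 - 5 = 27 */
    printf("offset=%d index=%d tag=%d total tag bits=%d\n",
           offset_bits, index_bits, tag_bits, sets * tag_bits);  /* 6912 */
    return 0;
}
```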

  1. Cache coherence question based on Fig 5.7 and 5.22.
