Sample Final Exam: CDA5155 (Computer Architecture Principles), Fall 2016

Name: ______; UF ID: ______

On my honor, I have neither given nor received unauthorized aid on this examination.

Signature: ______

Answer all questions precisely, clearly, and concisely inside the space provided. You MUST show the solution steps that lead to the final answer.

  1. (15pt) Multiple choice (you may select multiple answers for each question):
     1. Which of the following statements about pipelining is incorrect?
        1. Pipelining improves the execution of a single instruction. (V)
        2. Pipelining is an implementation technique that is invisible to the programmer.
        3. Stalling the pipeline is the only way to handle structural hazards. (V)
        4. In the integer MIPS pipeline, memory references may cause data hazards.
        5. Bypassing and forwarding solve data hazards. (V)
     2. Memory consistency (select the true statements):
        1. Sequential consistency must be enforced by hardware.
        2. Sequential consistency requires each processor to maintain its own memory reference order. (V)
        3. Sequential consistency requires a global memory reference order among multiple processors that is observed by all processors. (V)
        4. Processor consistency allows a processor's read to bypass an earlier write. (V)
        5. Synchronization instructions can be used to enforce a certain order with respect to regular memory reads and writes. (V)
     3. Select the true statements:
        1. An Nvidia GPU has superior performance to a multi-core CPU for general-purpose applications because the GPU has more hardware cores to execute more threads in parallel, with higher GFLOPS.
        2. Although CUDA provides two different levels of parallelism, threads and thread blocks, in practice it makes no difference to combine the two levels into a single level of parallel threads on an Nvidia GPU.
        3. The shared memory model of computation in CUDA avoids programmer involvement in data communication and provides fast memory accesses.
        4. None of the above. (V)
  1. (10pt) Short answer about exploiting data-level parallelism:
  1. (10pt) Short answer about exploiting data-level parallelism:

Compare the similarities and differences among the vector processor, SIMD extension, and GPU approaches.

  1. (15pt) A shared snooping bus is often bandwidth limited and can only support a small number of processors. Assume an aggressive multiple-issue, out-of-order processor can achieve an average IPC of 1.1. The average number of memory accesses per instruction is 1.2. Assume the last-level cache miss rate (global miss rate) is 1.5%. All misses must issue their requests through the snooping bus. The bus cycle time is four times slower than the processor cycle time, and the bus can accept one processor request every 2 bus cycles. Furthermore, about 30% of the replaced blocks are dirty and need to be written back to memory. Calculate how many processors the snooping bus can support under the constraint that the average bus utilization cannot exceed 85%. Note that you can assume the data bus that transfers data back to the requester is not a bottleneck.

Answer:

Misses per processor cycle = 1.1 * 1.2 * 0.015 = 0.0198

Including 30% dirty writebacks: 0.0198 * (1 + 0.3) = 0.02574 bus requests per processor cycle

The bus is 4x slower and needs 2 bus cycles per request (8 processor cycles): 0.02574 * 8 = 0.20592 bus utilization per processor

At the 85% limit, the bus can support 4 processors: 0.20592 * 4 = 0.82368 <= 0.85 (a fifth processor would give 1.0296)
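For reference, the same calculation as a short C program (a sketch only; all numeric constants come from the problem statement, and the variable names are just illustrative):

```c
#include <stdio.h>

int main(void) {
    double ipc = 1.1;              /* average instructions per cycle        */
    double mem_per_instr = 1.2;    /* memory accesses per instruction       */
    double miss_rate = 0.015;      /* global last-level-cache miss rate     */
    double dirty_frac = 0.30;      /* fraction of replacements written back */
    double bus_limit = 0.85;       /* maximum allowed bus utilization       */

    /* bus requests per processor clock (misses plus dirty writebacks) */
    double req_per_cycle = ipc * mem_per_instr * miss_rate * (1.0 + dirty_frac);

    /* each request holds the bus for 2 bus cycles = 2 * 4 processor cycles */
    double util_per_proc = req_per_cycle * 2 * 4;            /* 0.20592 */

    int n = (int)(bus_limit / util_per_proc);                /* 4 */
    printf("utilization per processor = %.5f\n", util_per_proc);
    printf("processors supported at %.0f%% = %d (%.5f total)\n",
           bus_limit * 100.0, n, n * util_per_proc);         /* 0.82368 */
    return 0;
}
```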

  1. A TEST&RESET R2, 0(R1) instruction is included in the ISA; it atomically fetches the contents of memory location 0(R1) into R2 and resets that location to zero.
  2. (10pt) Rewrite the lock routine using the TEST&RESET instruction.

Answer:

Lock: LD R2, 0(R1)           ; ordinary load: spin without generating bus traffic

BEZ R2, Lock                 ; lock value 0 means held, keep spinning

TEST&RESET R2, 0(R1)         ; atomically fetch the lock value and clear it to 0

BEZ R2, Lock                 ; we read 0: another processor won the race, retry
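For readers more at home in C than MIPS assembly, the C11 sketch below mirrors the routine above under the same convention (nonzero = lock free, 0 = lock held); atomic_exchange plays the role of TEST&RESET, and the function names acquire/release are made up for illustration:

```c
#include <stdatomic.h>

/* Convention assumed by the answer key: *lock != 0 means free, 0 means held. */
static void acquire(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) == 0)      /* LD + BEZ: spin on ordinary reads      */
            ;
        if (atomic_exchange(lock, 0) != 0)  /* TEST&RESET: fetch old value, write 0  */
            return;                         /* got a nonzero value -> lock acquired  */
    }
}

static void release(atomic_int *lock) {
    atomic_store(lock, 1);                  /* restore a nonzero value to free the lock */
}
```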

  3. (5pt) Rewrite the lock routine using an LL/SC pair.

Answer:

Lock: LL R2, 0(R1)           ; load linked: read the lock value

BEZ R2, Lock                 ; lock value 0 means held, keep spinning

DADDUI R2, R0, #0            ; R2 <- 0, the value that marks the lock as held

SC R2, 0(R1)                 ; store conditional: succeeds only if 0(R1) is untouched since the LL

BEZ R2, Lock                 ; SC leaves 0 in R2 on failure, so retry
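As a rough, illustrative C11 analog (not part of the answer key), atomic_compare_exchange_weak behaves much like an LL/SC pair: it may fail even without interference, and a failed attempt performs no store:

```c
#include <stdatomic.h>

/* Same convention as above: *lock != 0 means free, 0 means held. */
static void acquire_llsc(atomic_int *lock) {
    for (;;) {
        int seen = atomic_load(lock);       /* LL: observe the current lock value */
        if (seen == 0)                      /* lock held -> keep spinning          */
            continue;
        if (atomic_compare_exchange_weak(lock, &seen, 0))
            return;                         /* SC succeeded -> lock acquired       */
        /* SC failed (value changed or spurious failure): no store happened, retry */
    }
}
```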

  4. (5pt) What is the advantage of using the LL/SC pair?

Answer:

An SC that fails does not perform a store, so it generates no global (bus/invalidation) traffic; the TEST&RESET instruction writes memory on every attempt and therefore generates invalidation traffic even when it fails to acquire the lock.

  1. The Alpha 21164 has two levels of on-chip cache. The first level is split into an 8KB instruction cache and an 8KB data cache, both direct-mapped. The second level is unified, 3-way set-associative, and 96KB in size. Both levels have 32-byte lines and are physically addressed.
  2. (10pt) Assume that the physical address is 40 bits long. For the first-level data cache, how many bits are in each of the tag, index, and offset fields? Also, calculate the total size (in bits) of the tags in the first-level data cache.

Answer: offset: 5 bits; index: 8 bits; tag: 40 - 8 - 5 = 27 bits. Total tag storage = 256 lines * 27 bits/line = 6912 bits.
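The field widths and total tag storage can be double-checked with a short C calculation (only the cache parameters given in the problem are used):

```c
#include <stdio.h>

int main(void) {
    int addr_bits   = 40;          /* physical address width      */
    int cache_bytes = 8 * 1024;    /* first-level data cache size */
    int line_bytes  = 32;          /* line (block) size           */
    int ways        = 1;           /* direct-mapped               */

    int sets = cache_bytes / (line_bytes * ways);             /* 256 */
    int offset_bits = 0, index_bits = 0;
    for (int b = line_bytes; b > 1; b >>= 1) offset_bits++;   /* log2(32)  = 5 */
    for (int s = sets;       s > 1; s >>= 1) index_bits++;    /* log2(256) = 8 */

    int tag_bits = addr_bits - index_bits - offset_bits;      /* 40 - 8 - 5 = 27 */
    printf("offset=%d index=%d tag=%d total tag bits=%d\n",
           offset_bits, index_bits, tag_bits, sets * tag_bits);  /* 6912 */
    return 0;
}
```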

  1. Cache coherence question based on Fig 5.7 and 5.22.
