Why Parallel Architecture s3

Why Are CMPs Needed?

[OHL §1.1] What factors have combined to create the need for CMPs? Name one. (You may fill out the form multiple times.)

Advantages of CMPs.

Disadvantages of CMPs.

Levels of parallelism.

· Instruction.

· Basic block.

· Loop iteration.

· Task.

· Process.

Comparing CMPs and Superscalars

[OHL §1.3] Let’s compare the characteristics of a 6-way superscalar with a CMP containing four two-way superscalars.

                                     6-way superscalar    4 × 2-way superscalar
# of CPUs                            1                    4
Degree superscalar                   6                    4 × 2
# of architectural registers         32 int/32 fp         4 × 32 int/32 fp
# of physical registers              160 int/160 fp       4 × 40 int/40 fp
# of integer functional units        3                    4 × 1
# of floating pt. functional units   3                    4 × 1
BTB size                             2048 entries         4 × 512 entries
Return stack size                    32 entries           4 × 8 entries
Instruction issue queue size         128 entries          4 × 8 entries
I cache                              32 KB, 2-way SA      4 × 8 KB, 2-way SA
D cache                              32 KB, 2-way SA      4 × 8 KB, 2-way SA
L1 hit time                          2 cycles             1 cycle
L1 cache interleaving                8 banks              N/A
Unified L2 cache                     256 KB, 2-way SA     256 KB, 2-way SA
L2 hit time/L1 penalty               4 cycles             5 cycles
Memory latency/L2 penalty            50 cycles            50 cycles

(They used small L2 caches because benchmarks were using small data sets.)
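The per-core entries in the comparison above can be multiplied out to see how the chip-wide totals of the CMP line up against the single wide processor. The following is a minimal sketch, assuming the table values as given; the dictionary keys and function names are labels chosen for this example, not terms from the text.

```python
CORES = 4  # number of 2-way cores in the CMP

# resource: (value on the 6-way superscalar, value per core in the CMP)
RESOURCES = {
    "issue width": (6, 2),
    "physical int registers": (160, 40),
    "integer functional units": (3, 1),
    "floating-point functional units": (3, 1),
    "BTB entries": (2048, 512),
    "return stack entries": (32, 8),
    "issue queue entries": (128, 8),
    "I-cache KB": (32, 8),
    "D-cache KB": (32, 8),
}


def cmp_total(per_core, cores=CORES):
    """Chip-wide total when every core has `per_core` of a resource."""
    return per_core * cores


def summarize(resources=RESOURCES):
    """Return (name, 6-way value, CMP total, CMP total / 6-way value) rows."""
    rows = []
    for name, (wide, per_core) in resources.items():
        total = cmp_total(per_core)
        rows.append((name, wide, total, total / wide))
    return rows


for name, wide, total, ratio in summarize():
    print(f"{name:32s} 6-way: {wide:5d}   CMP total: {total:5d}   ratio: {ratio:.2f}")
```

Most totals come out equal across the two designs; the rows where the ratio differs from 1.00 are the ones worth discussing.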

Now, let’s look at the results.

· IPC –

· % BP –

· % MPCI –

                             6-way superscalar         4 × 2-way superscalar
Class            Benchmark   IPC    % BP    % MPCI     IPC    % BP    % MPCI
Integer          compress    1.2    86.4    5.0        0.9    85.9    4.4
Integer          eqntott     1.8    80.0    2.2        1.3    79.8    1.5
Integer          m88ksim     2.3    92.6    0.1        1.4    91.7    2.6
Integer          MPsim       1.2    81.6    7.4        0.8    78.7    9.7
Floating-point   applu       1.7    79.7    5.6        0.9    79.2    3.7
Floating-point   apsi        1.2    95.6    5.9        0.6    95.1    7.2
Floating-point   swim        2.2    99.8    4.8        0.9    99.7    2.4
Floating-point   tomcatv     1.3    99.7    8.5        0.8    99.6    9.9
Multiprog.       pmake       1.4    82.7    2.3        1.0    86.2    4.8
                 Average     1.5    88.7    4.6        0.9    88.4    5.2

In the pmake benchmark, the parallel architecture could run multiple compilations in parallel.
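To put the IPC columns side by side, the sketch below computes what fraction of the 6-way IPC one 2-way core achieves on each benchmark. It reads the right-hand columns as the IPC of a single 2-way core, as the later question about studying a single processor suggests. The `ideal_chip_ipc` figure is a hypothetical upper bound that assumes all four cores stay busy with independent work and no contention; it is not a result from the text.

```python
# (benchmark, IPC on the 6-way superscalar, IPC on one 2-way core),
# copied from the results table
IPC = [
    ("compress", 1.2, 0.9),
    ("eqntott", 1.8, 1.3),
    ("m88ksim", 2.3, 1.4),
    ("MPsim", 1.2, 0.8),
    ("applu", 1.7, 0.9),
    ("apsi", 1.2, 0.6),
    ("swim", 2.2, 0.9),
    ("tomcatv", 1.3, 0.8),
    ("pmake", 1.4, 1.0),
]

CORES = 4


def per_core_fraction(wide_ipc, narrow_ipc):
    """Fraction of the 6-way IPC delivered by a single 2-way core."""
    return narrow_ipc / wide_ipc


def ideal_chip_ipc(narrow_ipc, cores=CORES):
    """Hypothetical best case: every core sustains `narrow_ipc` at once."""
    return cores * narrow_ipc


for name, wide, narrow in IPC:
    print(
        f"{name:8s} one core = {per_core_fraction(wide, narrow):.2f} of 6-way, "
        f"ideal {CORES}-core IPC = {ideal_chip_ipc(narrow):.1f}"
    )
```

Whether a workload gets anywhere near the ideal figure depends on how much task- or process-level parallelism it exposes, which is why pmake is the interesting case here.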

The aggressive 6-way superscalar makes extensive use of speculation. What is that?

Look up nonblocking caches (p. 13). Why do they complicate the analysis?

But there are some disadvantages to speculation.

· Extra cache misses—why?

· Nonblocking caches—misses may occur to lines with outstanding misses.

For these benchmarks, why do we study performance of a single processor, rather than the CMP?

Take a look at the charts on p. 15 of the text. What do they show about reasons for inefficiency?

Improving throughput

[OHL §2.1] What kind of applications require a lot of bandwidth, but can tolerate fairly high latency?

Figure 2.1, p. 23, shows throughput vs. power for aggressive superscalars and in-order processors.

· Which is the most desirable point on the graph?

What does the figure show?

Why do more complex processors use more power?

· Deeper, more complex pipelines require more transistors, and those transistors must be switched at a higher frequency.

· It takes power to speculate, and some of the instructions will be discarded.
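The two points above can be related to the usual first-order model of CMOS switching power. This is a standard textbook approximation rather than a formula taken from OHL, and the symbols are introduced here for illustration: $\alpha$ is the fraction of the capacitance that switches each cycle, $C$ is the total switched capacitance (which grows with transistor count), $V_{dd}$ is the supply voltage, and $f$ is the clock frequency.

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

Under this model, more transistors raise $C$ and a faster clock raises $f$, while work done on instructions that are later discarded still contributes to $\alpha$ without contributing to throughput.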

Then, why isn’t an unpipelined processor most efficient?

What’s the difference between multithreading between processors and multithreading within processors?

Why is multithreading needed within modern processors?

There are three styles of multithreading within processors.

· Coarse-grained.

· Fine-grained.

· Simultaneous.

Find an example of each.
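The three styles can be contrasted with a toy issue-slot model. This is a minimal sketch under stated assumptions, not a description of any real processor: each thread is a string of `c` (one-cycle operation) and `m` (operation that misses in the cache and blocks its thread), and the parameters `width`, `ilp`, `miss_latency`, and `switch_penalty` are hypothetical knobs chosen for this example.

```python
def simulate(policy, traces, width=2, ilp=1, miss_latency=4, switch_penalty=2):
    """Return the cycle count needed to issue every instruction in `traces`.

    policy: "coarse" switches threads only when the running thread blocks,
            paying `switch_penalty` cycles per switch;
            "fine" picks one ready thread per cycle, round-robin;
            "smt" fills the issue slots from all ready threads each cycle.
    ilp:    most instructions a single thread can issue per cycle.
    """
    n = len(traces)
    pc = [0] * n          # next instruction to issue, per thread
    ready_at = [0] * n    # earliest cycle at which each thread may issue
    current = 0           # coarse: thread that currently owns the pipeline
    rr = 0                # fine/smt: round-robin starting point
    cycle = 0

    def ready(t):
        return pc[t] < len(traces[t]) and ready_at[t] <= cycle

    def issue(t, slots):
        """Issue up to `slots` instructions from thread t; return slots used."""
        used = 0
        while used < slots and ready(t):
            if traces[t][pc[t]] == "m":
                ready_at[t] = cycle + 1 + miss_latency  # thread blocks on the miss
            pc[t] += 1
            used += 1
        return used

    while any(pc[t] < len(traces[t]) for t in range(n)):
        order = [(rr + i) % n for i in range(n)]
        if policy == "coarse":
            if not ready(current):
                others = [t for t in order if t != current and ready(t)]
                if others:
                    current = others[0]
                    cycle += switch_penalty  # pipeline refill after the switch
            issue(current, min(width, ilp))
        elif policy == "fine":
            runnable = [t for t in order if ready(t)]
            if runnable:
                issue(runnable[0], min(width, ilp))
                rr = (runnable[0] + 1) % n
        elif policy == "smt":
            slots = width
            for t in order:
                slots -= issue(t, min(slots, ilp))
            rr = (rr + 1) % n
        else:
            raise ValueError(f"unknown policy: {policy}")
        cycle += 1
    return cycle


traces = ["cmcc", "cmcc"]  # two identical threads, each with one miss
for policy in ("coarse", "fine", "smt"):
    print(policy, simulate(policy, traces))
```

With the default parameters the sketch reports 14, 11, and 8 cycles for the coarse-grained, fine-grained, and simultaneous policies. The ordering is sensitive to the assumed miss latency and switch penalty, so it is worth rerunning with a much longer `miss_latency` to see how the gap changes.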

Why doesn’t coarse-grained multithreading work well?

Lecture 25 Architecture of Parallel Computers