PC Processor Microarchitecture

TABLE OF CONTENTS

• Introduction

• Building a Framework for Comparison

• What Does a Computer Really Do?

• The Memory Subsystem

• Exploiting ILP Through Pipelining

• Exploiting ILP Via Superscalar Processing

• Exploiting Data-Level Parallelism Via SIMD

• Where Should Designers Focus The Effort?

• A Closer Look At Branch Prediction

• Speculative, Out-of-Order Execution Gets a New Name

• Analyzing Some Real Microprocessors: P4

• Pentium 4's Cache Organization

• Pentium 4's Trace Cache

• The Execution Engine Runs Out Of Order

• AMD Athlon Microarchitecture

• AMD Athlon Scheduler, Data Access

• Centaur C3 Microarchitecture

• Overall Conclusions

• List of References

Introduction

Isn't it interesting that new high-tech products seem so complicated, yet only a few years later we talk about how much simpler the old stuff was? This is certainly true for microprocessors. As soon as we finally figure out all the new features and feel comfortable giving advice to our family and friends, we're confronted with details about a brand-new processor that promises to obsolete our expertise on the "old" generation. Gone are the simple and familiar diagrams of the past, replaced by arcane drawings and cryptic buzzwords. For a PC technology enthusiast, this is like discovering a new world to be explored and conquered. While many areas will seem strange and unusual, much of the landscape resembles places we've traveled before. This article is meant to serve as a faithful companion for this journey, providing a guidebook of the many wondrous new discoveries we're sure to encounter.

An Objective Tutorial and Analysis of PC Microarchitecture
The goal of this article is to give the reader some tools for understanding the internals of modern PC microprocessors. In the article "PC Motherboard Technology", we developed some tools for analyzing a modern PC motherboard. This article takes us one step deeper, zooming into the complex world inside the PC processor itself. The internal design of a processor is called the "microarchitecture". Each CPU vendor uses slightly different techniques for getting the most out of their design, while meeting their unique performance, power, and cost goals. The marketing departments from these companies will often highlight microarchitectural features when promoting their newest CPUs, but it's often difficult for us PC technology enthusiasts to figure out what it really means.

What is needed is an objective comparison of the design features for all the CPU vendors, and that's the goal of this article. We'll walk through the features of the latest x86 32-bit desktop CPUs from Intel, AMD, and VIA (Centaur). Since the Transmeta "Crusoe" processor is mostly targeted at the mobile market, we'll analyze their microarchitecture in another article. It will also be the task for another article to thoroughly explore Apple's PowerPC G4 microprocessor, and many of the analytical tools learned here will apply to all high-end processors.

Building a Framework for Comparison

Before we can dive right into the block diagram of a modern CPU, we need to develop some analytical tools for understanding how these features affect the operation of the PC system. We also need to develop a common framework for comparison. As you'll soon see, that is no easy task. There are some radical differences in architecture between these vendors, and it's difficult to make direct comparisons. As it turns out, the best way to understand and compare these new CPUs is to go back to basic computer architectural concepts and show how each vendor has solved the common problems faced in modern computer design. In our last section, we'll gaze into the future of PC microarchitecture and make a few predictions.

Let's Not Lose Sight of What Really Matters
There is one issue that should be stressed right up front. We should never lose sight of the real objective in computer design. All that really matters is how well the CPU helps the PC run your software. A PC is a computer system, and subtle differences in CPU microarchitecture may not be noticeable when you're running your favorite computer program. We learned this in our article on motherboard technology, since a well-balanced PC needs to remove all the bottlenecks (and meet the cost goals of the user). The CPU designers are turning to more and more elaborate techniques to squeeze extra performance out of these machines, so it's still really interesting to peek in on the raging battle for even a few percent better system performance.

For a PC technology enthusiast, it's just downright fascinating how these CPU architects mix clever engineering tricks with brute-force design techniques to take advantage of the enormous number of transistors available on the latest chips.

What Does a Computer Really Do?

It's easy to get buried too deeply in the complexities of these modern machines, but to really understand the design choices, let's think again about the fundamental operation of a computer. A computer is nothing more than a machine that reads a coded instruction, decodes the instruction, and executes it. If the instruction needs to load or store some data, the computer figures out the location for the data and moves it. That's it; that's all a computer does. We can break this operation into a series of stages:

The 5 Computer Operation Stages
Stage 1 / Instruction Access (IA)
Stage 2 / Instruction Decode (ID)
Stage 3 / Execution (EX)
Stage 4 / Data Access (DA)
Stage 5 / Store (Write Back) Results (WB)

Some computer architects may re-arrange, combine, or break up the stages, but every computer microarchitecture does these five things. We can use this framework to build on as we work our way up to even the most complicated CPUs.

For those of you who eat this stuff for breakfast and are anxious to jump ahead, remember that we haven't yet talked about pipelines. These stages could all be completely processed for a single instruction before starting the next one. If you think about that idea for a moment, you'll realize that almost all the complexity comes when we start improving on that limitation. Don't worry; the discussion will quickly ramp up in complexity, and some readers might appreciate a quick refresher. Let's see what happens in each of these stages:

Instruction Access
A coded instruction is read from the memory subsystem at an address that is determined by a program counter (PC). In our analysis, we'll treat memory as something that hangs off to the side of our CPU "execution core", as we show in the figure below. Some architects like to view memory and the system bus as an integral part of the microarchitecture, and we'll show how the memory subsystem interacts with the rest of the machine.

Instruction Decode
The coded instruction is converted into control information for the logic circuits of the machine. Each "operation code (Opcode)" represents a different instruction and causes the machine to behave in different ways. Embedded in the Opcode (or stored in later bytes of the instruction) can be address information or "immediate" data to be processed. The address information can represent a new address that might need to be loaded into the PC (a branch address) or the address can represent a memory location for data (loads and stores). If the instruction needs data from a register, it is usually brought in during this stage.

Execute
This is the stage where the machine does whatever operation was directed by the instruction. This could be a math operation (multiply, add, etc.) or it might be a data movement operation. If the instruction deals with data in memory, the processor must calculate an "Effective Address (EA)". This is the actual location of the data in the memory subsystem (ignoring virtual memory issues for now), based on calculating address offsets or resolving indirect memory references (a simple example of indirection is a register that holds an address rather than data).
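
To make that concrete, here's a minimal sketch in C of the classic x86 effective-address form, EA = base + (index x scale) + displacement. The function name and types are our own invention, not anything from a real design:

    /* Hypothetical sketch: computing an x86-style Effective Address.
       base and index come from registers; scale is 1, 2, 4, or 8. */
    typedef unsigned int u32;

    u32 effective_address(u32 base, u32 index, u32 scale, u32 disp)
    {
        return base + (index * scale) + disp;  /* wraps modulo 2^32, as the hardware does */
    }

    /* The indirection example from the text: a register holds an address,
       not data. EA = the value in EBX, with no index or displacement:
       effective_address(ebx, 0, 1, 0); */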

Data Access
In this stage, instructions that need data from memory will present the Effective Address to the memory subsystem and receive back the data. If the instruction was a store, then the data will be saved in memory. Our simple model for comparison gets a bit frayed in this stage, and we'll explain in a moment what we mean.

Write Back
Once the processor has executed the instruction, perhaps having been forced to wait for a data load to complete, any new data is written back to the destination register (if the instruction type requires it).
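
To tie the five stages together, here's a toy, non-pipelined machine written in C. The instruction encoding is invented purely for illustration; it just shows one instruction walking through all five stages before the next one is fetched:

    /* A toy, non-pipelined machine: each call to step() walks one
       instruction through all five stages before the next fetch. */
    #include <stdint.h>

    enum { OP_ADD, OP_LOAD, OP_STORE };   /* made-up opcodes */

    uint32_t reg[8];                      /* register file */
    uint32_t mem[1024];                   /* memory, addressed in words */
    uint32_t pc;                          /* program counter */

    void step(void)
    {
        /* Stage 1: Instruction Access -- fetch the word the PC points at */
        uint32_t inst = mem[pc++];

        /* Stage 2: Instruction Decode -- split fields, read the registers */
        uint32_t op  = inst >> 24;
        uint32_t rd  = (inst >> 16) & 7, rs = (inst >> 8) & 7;
        uint32_t imm = inst & 0xFF;
        uint32_t a = reg[rd], b = reg[rs];

        /* Stage 3: Execute -- ALU work, or compute the Effective Address */
        uint32_t result = a + b;          /* OP_ADD */
        uint32_t ea     = b + imm;        /* base register + offset */

        /* Stage 4: Data Access -- loads read memory, stores write it */
        if (op == OP_LOAD)  result = mem[ea];
        if (op == OP_STORE) mem[ea] = a;

        /* Stage 5: Write Back -- commit the result to the register file */
        if (op != OP_STORE) reg[rd] = result;
    }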

Was There a Question From the Back of the Room?
Some of the x86 experts in the audience are going to point out the numerous special cases for the way a processor must deal with an instruction set designed in the 1970s. Our five-stage model isn't so simple when it must deal with all the addressing modes of an x86. A big issue is the fact that the x86 is what is called a "register-memory" architecture where even ALU (Arithmetic Logic Unit) instructions can access memory. This is contrasted with RISC (Reduced Instruction Set Computing) architectures that only allow Load and Store instructions to move data (register-register or more commonly called Load/Store architectures).

The reason we can focus on the Load/Store architecture to describe what happens in each stage of a computer is that modern x86 processors translate their native CISC (Complex Instruction Set Computing) instructions into RISC instructions (with some exceptions). By translating the instructions, most of the special cases are turned into extra RISC instructions and can be more efficiently processed. RISC instructions are much easier for the hardware to optimize and run at higher clock rates. This internal translation to RISC is one of the ways that x86 processors were able to deal with the threat that higher-performance RISC chips would take over the desktop in the early 1990s. We'll talk about instruction translation more when we dig into the details of some specific processors, at which point we'll also show several ways in which our model is dramatically modified.
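
As a rough illustration of what this translation looks like (the actual micro-op formats inside Intel and AMD parts are proprietary, so this only shows the idea), consider how one register-memory x86 instruction might break into two load/store-style micro-ops:

    /* Hypothetical illustration of CISC-to-RISC translation.

       x86 (register-memory):   add eax, [ebx + 8]

       becomes two load/store-style micro-ops:

           uop1:  load  tmp0, [ebx + 8]   ; Data Access uses EA = ebx + 8
           uop2:  add   eax, eax, tmp0    ; pure register-register ALU op

       Each micro-op now fits the simple five-stage model, so the hardware
       can schedule and pipeline them like RISC instructions. */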

To the questioner in the back of the room, there will be several things we're going to have to gloss over (and simplify) in order to keep this article from getting as long as a computer textbook. If you really want to dig into details, check out the list of references at the end of this article.

The Memory Subsystem

The memory subsystem plays a big part in the microarchitecture of a CPU. Notice that both the Instruction Access stage and the Data Access stage of our simple processor must get to memory. This memory can be split into separate sections for instructions and data, allowing each stage to have a dedicated (hence faster) port to memory.

This is called a "Harvard Architecture", a term from work at Harvard University in the 1940s that has been extended to also refer to architectures with separate instruction and data caches--even though main memory (and sometimes L2 cache) is "unified". For some background on cache design, you can refer to the memory hierarchy discussion in the article, "PC Motherboard Technology". That article also covers the system bus interface, an important part of the PC CPU design that is tailored to support the internal microarchitecture.
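
To picture the Harvard split, here's a minimal sketch of two independent direct-mapped L1 caches, so the Instruction Access and Data Access stages never contend for the same port. The sizes and bit-field widths are invented for illustration:

    /* Minimal sketch of split L1 caches; sizes here are made up. */
    #include <stdint.h>
    #include <stdbool.h>

    #define LINES 256                  /* 256 lines x 32-byte lines = 8 KB */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[32];
    } line_t;

    static line_t icache[LINES];       /* fed by Instruction Access */
    static line_t dcache[LINES];       /* fed by Data Access */

    static bool hit(line_t *c, uint32_t addr)
    {
        uint32_t index = (addr >> 5) & (LINES - 1); /* bits 5..12 pick the line */
        uint32_t tag   = addr >> 13;                /* remaining bits identify it */
        return c[index].valid && c[index].tag == tag;
    }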

Virtual Memory: Making Life Easier for the Programmer and Tougher for the Hardware Designer
To make life simpler for the programmer, most addresses are "virtual addresses" that allow the software designer to pretend to have a large, linear block of memory. These virtual addresses are translated into "physical addresses" that refer to the actual addresses of the memory in the computer. In almost all x86 chips, the caches contain memory data that is addressed with physical addresses. Before the cache is accessed, any virtual addresses are translated in a "Translation Look-aside Buffer (TLB)". A TLB is like a cache of recently-used virtual address blocks (pages), responding back with the physical address page that corresponds to the virtual address presented by the CPU core. If the virtual address isn't in one of the pages stored by the TLB (a TLB miss), then the TLB must be updated from a bigger table stored in main memory--a huge performance hit (especially if the page isn't in main memory and must be loaded from disk). Some CPUs have multiple levels of TLBs, similar to the notion of cache memory hierarchy. The size and structure of the TLBs and caches will be important during our CPU comparisons later, but we'll focus mainly on the CPU core for our analysis.
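
Here's a toy sketch of that translation step, assuming 4KB pages and an invented 16-entry, fully associative TLB (real parts vary in size, associativity, and number of levels):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT  12             /* 4 KB pages */
    #define TLB_ENTRIES 16

    typedef struct {
        bool     valid;
        uint32_t vpn;                  /* virtual page number */
        uint32_t pfn;                  /* physical frame number */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and fills *paddr; a miss means walking the
       page tables in main memory -- the huge performance hit noted above. */
    bool translate(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
                return true;
            }
        }
        return false;                  /* TLB miss: update from page tables */
    }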

Exploiting ILP Through Pipelining

Instead of waiting until an instruction has completed all five stages of our model machine, we could start a new instruction as soon as the first instruction has cleared stage 1. Notice that we can now have five instructions progressing through our "pipeline" at the same time. Essentially, we're processing five instructions in parallel, referred to as "Instruction-Level Parallelism (ILP)". If it took five clock cycles to completely execute an instruction before we pipelined the machine, we're now able to complete an instruction every single clock. We made our computer five times faster, just with this "simple" change.
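
A quick back-of-envelope calculation shows where the "five times" comes from (a sketch that ignores stalls and hazards):

    /* Cycles to run n instructions on our 5-stage model machine.
       Non-pipelined: every instruction occupies all 5 stages serially.
       Pipelined: 5 cycles to fill the pipe, then one completes per cycle. */
    unsigned cycles_serial(unsigned n)    { return 5 * n; }
    unsigned cycles_pipelined(unsigned n) { return n ? 5 + (n - 1) : 0; }

    /* For n = 1000: 5000 cycles vs. 1004 cycles -- very nearly 5x, even
       though each individual instruction still takes 5 cycles end to end. */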

Let's Just Think About This a Minute
We'll use a bunch of computer engineering terms in a moment, since we've got to keep that person in the back of the room happy. Before doing that, take a step back and think about what we did to the machine. (Even experienced engineers forget to do that sometimes.) Suddenly, memory fetches have to occur five times faster than before. This implies that the memory system and caches must now run five times as fast, even though each instruction still takes five cycles to completely execute.