Chapter 5: A Closer Look at Instruction Set Architectures

Here we look at a few of the design decisions that shape a computer's instruction set.

Some basic issues for consideration today.

Basic instruction set design

Fixed–length vs. variable–length instructions.

Word addressing in a byte–addressable architecture.

Big–Endian vs. Little–Endian Designs

Stack machines, accumulator machines, and multi–register machines

Modern Design Principles

NOTE: We shall not discuss expanding opcodes, covered in Section 5.2.5 of the
textbook. The discussion is overly complex.

Design Considerations: Instruction Size

The MARIE has a fixed instruction size: 16 bits, divided as follows:

    Bits        15 – 12     11 – 0
    Content     Opcode      Operand address (if used)

Instructions of a fixed size are easier to decode. Moreover, they facilitate prefetching of instructions, in which one instruction is fetched while the previous one is executing.

Instructions with variable size make more efficient use of memory.

When memory was costly, this was an important design consideration.

For example, the VAX–11/780 had a number of addition instructions:
Add register to register
Add memory to register
Add memory to memory

Suppose an 8–bit opcode, 16 registers (so that 4 bits identify each register),
and 32–bit addressing.

Instruction lengths:
    Register to register    8 + 4 + 4   = 16 bits (two bytes)
    Memory to register      8 + 32 + 4  = 44 bits (six bytes)
    Memory to memory        8 + 32 + 32 = 72 bits (nine bytes)

With fixed length instructions, each instruction would have to be at least 9 bytes long.
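The length arithmetic above is easy to check mechanically. Here is a minimal sketch; the helper name `instr_bytes` is invented for illustration, and the field widths are the ones assumed in the example (8-bit opcode, 4-bit register number, 32-bit address):

```python
from math import ceil

def instr_bytes(*field_bits):
    """Total instruction length in bytes, rounding the bit sum up to whole bytes."""
    return ceil(sum(field_bits) / 8)

print(instr_bytes(8, 4, 4))    # register to register: two bytes
print(instr_bytes(8, 32, 4))   # memory to register: six bytes (44 bits, padded)
print(instr_bytes(8, 32, 32))  # memory to memory: nine bytes
```

Note that the 44-bit memory-to-register format must be padded up to six whole bytes, which is why variable-length encodings usually round each instruction to a byte boundary.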

RISC vs. CISC Considerations

The RISC (Reduced Instruction Set Computer) movement advocated a simpler design with fewer options for the instructions. Simpler instructions could execute faster.

Machines that were not RISC were called CISC (Complex Instruction Set Computers).
The VAX series was definitely CISC, a fact that led to its demise.

One of the main motivations for the RISC movement is the fact that computer memory is no longer expensive. It has become a “commodity”; that is, a commercial item that is readily and cheaply available.

If we have fewer and simpler instructions, we can increase the computer’s speed significantly. True, each instruction might do less, but each one executes faster.

The Load–Store Design
One of the slowest operations is the access of memory, either to read values from it or write values to it. A load–store design restricts memory access to two instructions
    1)  Load a register from memory
    2)  Store a register into memory

A moment’s reflection will show that this works only if we have more than one register, possibly 8, 16, or 32 registers. More on this when discussing the options for operands.

NOTE: It is easier to write a good compiler for an architecture with a lot of registers.

Word Addressing vs. Byte Addressing

Each architecture devotes a number of bits to addressing memory.

The MARIE has a 12–bit memory address.

Question: What is the size of an addressable unit?

In byte–addressable machines, each byte is individually addressable.

In word–addressable machines, only whole words are addressable; bytes are not individually addressable.

Advantages of Word–Addressable Designs

The CPU can address a larger memory. In the MARIE it is 2^12 words (not 2^12 bytes).

Word addressing is more natural for a number of problems, such as numerical simulations that use a large amount of real–number arithmetic.
The CDC–6600 used 60–bit addressable words to store real numbers.

Advantages of Byte–Addressable Designs

Direct access to the bytes for easy manipulation. This is good for any string–oriented processing, such as editing and message passing.

Question: How are multi–byte entities (words, longwords, etc.) handled?

Word Addressing in a Byte Addressable Machine

Each 8–bit byte has a distinct address.

A 16–bit word at address Z contains bytes at addresses Z and Z + 1.

A 32–bit word at address Z contains bytes at addresses
Z, Z + 1, Z + 2, and Z + 3.
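This mapping from an item's address to the byte addresses it occupies can be sketched in a couple of lines (the helper name `byte_addresses` is invented for illustration):

```python
def byte_addresses(z, size_bytes):
    """Byte addresses occupied by an item of the given size stored at address z."""
    return [z + i for i in range(size_bytes)]

assert byte_addresses(0x100, 2) == [0x100, 0x101]                 # 16-bit word
assert byte_addresses(0x100, 4) == [0x100, 0x101, 0x102, 0x103]   # 32-bit longword
```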

Note that computer architecture refers to addresses, rather than variables.

In a high–level programming language we use the term “variable” to indicate
the contents of a specific memory address.

Consider the statement Y = X

Go to the memory address associated with variable X

Get the contents

Copy the contents into the address associated with variable Y.

Question: What is the Address of the Last Longword (Word)?

We shall ask the question in general for an N–bit address space and specifically for a 16–bit address space. For byte addresses, this is easy to answer.

Byte addresses:     N–bit byte addresses run from 0 to 2^N – 1.

                    For N = 16, the byte addresses run from 0 to 65535.

Word addresses:     If a word is at address Z, its last byte is at address (Z + 1).

                    The last byte of the last word has address Z + 1 = 2^N – 1,
                    so the last word has address Z = 2^N – 2.

                    For N–bit addresses, word addresses run from 0 to 2^N – 2.

                    For N = 16, the word addresses run from 0 to 65534.

Longword addresses: If a longword is at address Z, its last byte is at address (Z + 3).

                    The last byte of the last longword has address Z + 3 = 2^N – 1,
                    so the last longword has address Z = 2^N – 4.

                    For N–bit addresses, longword addresses run from 0 to 2^N – 4.

                    For N = 16, the longword addresses run from 0 to 65532.
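These limits are easy to verify for N = 16 with a minimal check:

```python
N = 16
last_byte     = 2**N - 1   # last byte address
last_word     = 2**N - 2   # last 16-bit word address (its bytes sit at 65534 and 65535)
last_longword = 2**N - 4   # last 32-bit longword address (its bytes sit at 65532 through 65535)

assert (last_byte, last_word, last_longword) == (65535, 65534, 65532)
```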

Big–Endian vs. Little–Endian Addressing

Suppose the 32–bit value 0x0102 0304 is stored at address Z. The byte layout is:

    Address     Big-Endian      Little-Endian
    Z           01              04
    Z + 1       02              03
    Z + 2       03              02
    Z + 3       04              01

Note that, within each byte, the hexadecimal digits are never reversed.
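Python’s built-in `int.to_bytes` produces exactly these two byte orders; a minimal sketch:

```python
value = 0x01020304

# Big-endian: most significant byte at the lowest address.
assert value.to_bytes(4, "big") == bytes([0x01, 0x02, 0x03, 0x04])

# Little-endian: least significant byte at the lowest address.
assert value.to_bytes(4, "little") == bytes([0x04, 0x03, 0x02, 0x01])
```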

Words and Longwords in Byte
Addressable Memory: Example #1

You are given the following memory map for a byte addressable memory.

All values are given in hexadecimal.

    Address     FC  FD  FE  FF  100 101 102 103 104
    Contents    99  88  66  66  10  20  30  40  55

Each addressable unit contains one byte, expressed as two hexadecimal digits.

A 32–bit longword occupies four bytes.

Question: What is the 32–bit longword associated with address 0x100?

Answer: It occupies addresses 0x100, 0x101, 0x102, and 0x103.

Big–Endian: The value, in hexadecimal, is 0x1020 3040.

Little–Endian: The value, in hexadecimal, is 0x4030 2010.

Note: In little–endian, we read the bytes backwards.
We do not read the hexadecimal digits backwards.
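The same lookup can be sketched against the memory map above (the helper name `read_longword` is invented for illustration):

```python
# The memory map from the example, one byte per address.
mem = {0xFC: 0x99, 0xFD: 0x88, 0xFE: 0x66, 0xFF: 0x66,
       0x100: 0x10, 0x101: 0x20, 0x102: 0x30, 0x103: 0x40, 0x104: 0x55}

def read_longword(mem, addr, order):
    """Assemble the 32-bit value from four consecutive bytes; order is 'big' or 'little'."""
    data = bytes(mem[addr + i] for i in range(4))
    return int.from_bytes(data, order)

assert read_longword(mem, 0x100, "big")    == 0x10203040
assert read_longword(mem, 0x100, "little") == 0x40302010
```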

Words and Longwords in Byte
Addressable Memory: Example #1 (Continued)

You are given the following memory map for a byte addressable memory.

    Address     FC  FD  FE  FF  100 101 102 103 104
    Contents    99  88  66  66  10  20  30  40  55

Question: What is the 32–bit longword associated with address 0x100?

Answer: It occupies addresses 0x100, 0x101, 0x102, and 0x103.

Big–Endian: The value, in hexadecimal, is 0x1020 3040.

In decimal: 1·16^7 + 0·16^6 + 2·16^5 + 0·16^4 + 3·16^3 + 0·16^2 + 4·16^1 + 0·16^0

Little–Endian: The value, in hexadecimal, is 0x4030 2010.

In decimal: 4·16^7 + 0·16^6 + 3·16^5 + 0·16^4 + 2·16^3 + 0·16^2 + 1·16^1 + 0·16^0

Words and Longwords in Byte
Addressable Memory: Example #2

You are given the following memory map for a byte addressable memory.

    Address     FC  FD  FE  FF  100 101 102 103 104
    Contents    99  88  66  66  10  20  30  40  55

Question: What is the 16–bit word associated with address 0x100?

Answer: It occupies addresses 0x100 and 0x101.

Big–Endian: The value, in hexadecimal, is 0x1020.

In decimal: 1·16^3 + 0·16^2 + 2·16^1 + 0·16^0

Little–Endian: The value, in hexadecimal, is 0x2010.

In decimal: 2·16^3 + 0·16^2 + 1·16^1 + 0·16^0

One Classification of Architectures

How do we handle the operands? Consider a simple addition, specifically

C = A + B

Stack architecture
In this all operands are found on a stack. These have good code
density (make good use of memory), but have problems with access.

Typical instructions would include:
    Push X      // Push the value at address X onto the top of stack
    Pop Y       // Pop the top of stack into address Y
    Add         // Pop the top two values, add them, & push the result

Program implementation of C = A + B

Push A
Push B
Add
Pop C
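The four-instruction program above can be run on a tiny interpreter; a minimal sketch, with the function name `run_stack_machine` and the use of variable names as addresses both invented for illustration:

```python
def run_stack_machine(program, mem):
    """Tiny stack-machine interpreter for the three instructions above."""
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(mem[args[0]])      # push the value at the named address
        elif op == "Pop":
            mem[args[0]] = stack.pop()      # pop the top of stack into that address
        elif op == "Add":
            stack.append(stack.pop() + stack.pop())  # pop two, push the sum
    return mem

mem = {"A": 2, "B": 3, "C": 0}
run_stack_machine([("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")], mem)
assert mem["C"] == 5
```

Note that no instruction names an operand register; good code density is exactly this absence of operand fields.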

Single Accumulator Architectures

The MARIE is an example of this. There is a single register used to accumulate results of arithmetic operations.

The MARIE realization of the instruction C = A + B

    Load A      // Load the AC from address A
    Add B       // Add the value at address B
    Store C     // Store the result into address C

In each of these instructions, the accumulator is implicitly one of the operands and need not be specified in the instruction. This saves space.
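An accumulator machine is just as easy to sketch; the function name `run_accumulator` is invented for illustration, and note that the AC never appears in an instruction:

```python
def run_accumulator(program, mem):
    """Tiny single-accumulator interpreter; the AC is implicit in every instruction."""
    ac = 0
    for op, addr in program:
        if op == "Load":
            ac = mem[addr]       # Load the AC from the named address
        elif op == "Add":
            ac += mem[addr]      # Add the value at that address to the AC
        elif op == "Store":
            mem[addr] = ac       # Store the AC into that address
    return mem

mem = {"A": 2, "B": 3, "C": 0}
run_accumulator([("Load", "A"), ("Add", "B"), ("Store", "C")], mem)
assert mem["C"] == 5
```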

Extension of this for multiplication
If we multiply two 16–bit signed integers, we can get a 32–bit result.
Consider squaring 24,576 (2^14 + 2^13), or 0110 0000 0000 0000 in 16–bit binary.

The result is 603,979,776, or 0010 0100 0000 0000 0000 0000 0000 0000.

We need two 16–bit registers to hold the result. The PDP–9 had the MQ and AC.

The results of this multiplication would be
    MQ  0010 0100 0000 0000     // High 16 bits
    AC  0000 0000 0000 0000     // Low 16 bits

General Purpose Register Architectures

These have a number of general purpose registers, normally identified by number.

The number of registers is often a power of 2: 8, 16, or 32 being common.

(The Intel architecture with its four general purpose registers is different. These are called EAX, EBX, ECX, and EDX – a lot of history here)

The names of the registers often follow an assembly language notation designed to differentiate register names from variable names. An architecture with eight general purpose registers might name them: %R0, %R1, …, %R7.

The prefix “%” here indicates to the assembler that we are referring to a register, not to a variable that has a name such as R0. The latter name would be poor coding practice.

Designers might choose to have register %R0 identically set to 0. Having this constant register considerably simplifies a number of circuits in the CPU control unit.

We shall return to this %R0 ≡ 0 when discussing addressing modes.

General Purpose Registers with Load–Store

A Load–Store architecture is one with a number of general purpose registers in which the only memory references are:
1)Loading a register from memory
2)Storing a register to memory

The realization of our programming statement C = A + B might be something like

    Load  %R1, A            // Load memory location A contents into register 1
    Load  %R2, B            // Load register 2 from memory location B
    Add   %R3, %R1, %R2     // Add contents of registers %R1 and %R2
                            // Place results into register %R3
    Store %R3, C            // Store register 3 into memory location C
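This load-store discipline can also be sketched as a tiny interpreter (the function name `run_load_store` is invented for illustration); notice that only Load and Store touch memory, while Add works purely on registers:

```python
def run_load_store(program, mem):
    """Tiny load-store interpreter: only Load and Store reference memory."""
    regs = {}
    for op, *args in program:
        if op == "Load":
            regs[args[0]] = mem[args[1]]            # register <- memory
        elif op == "Add":
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]     # register-only arithmetic
        elif op == "Store":
            mem[args[1]] = regs[args[0]]            # memory <- register
    return mem

mem = {"A": 2, "B": 3, "C": 0}
run_load_store([("Load", "%R1", "A"), ("Load", "%R2", "B"),
                ("Add", "%R3", "%R1", "%R2"), ("Store", "%R3", "C")], mem)
assert mem["C"] == 5
```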

General Purpose Registers: Register–Memory

In a load–store architecture, the operands for any arithmetic operation must all be in CPU registers. The Register–Memory design relaxes this requirement to requiring only one of the operands to be in a register.

We might have two types of addition instructions

Add register to register
Add memory to register

The realization of the above simple program statement, C = A + B, might be

    Load  %R4, A        // Get M[A] into register 4
    Add   %R4, B        // Add M[B] to register 4
    Store %R4, C        // Place results into memory location C

General Purpose Registers: Memory–Memory

In this, there are no restrictions on the location of operands.

Our instruction C = A + B might be encoded simply as

Add C, A, B

The VAX series supported this mode.

The VAX had at least three different addition instructions for each data length
Add register to register
Add memory to register
Add memory to memory

There were these three for each of the following data types:
8–bit bytes, 16–bit integers, and 32–bit long integers
32–bit floating point numbers and 64–bit floating point numbers.

Here we see at least 15 different instructions that perform addition. This is complex.

Modern Design Principles: Basic Assumptions

Some assumptions that drive current design practice include:

1.  The fact that most programs are written in high–level compiled
    languages. Provision of a large general purpose register set
    greatly facilitates compiler design.

2.  The fact that current CPU clock cycle times (0.25 – 0.50 nanoseconds)
    are much faster than memory access times.

3.  The fact that a simpler instruction set implies a smaller control unit,
    thus freeing chip area for more registers and on–chip cache.

4.  The fact that execution is more efficient when a two–level cache
    system is implemented on–chip. We have a split L1 cache (with
    an I–Cache for instructions and a D–Cache for data) and an L2 cache.

5.  The fact that memory is so cheap that it is a commodity item.

Modern Design Principles

1.  Allow only fixed–length operands. This may waste memory, but modern
    designs have plenty of it, and it is cheap.

2.  Minimize the number of instruction formats and make them simpler, so that
    the instructions are more easily and quickly decoded.

3.  Provide plenty of registers and the largest possible on–chip cache memory.

4.  Minimize the number of instructions that reference memory. The preferred
    practice, called “Load/Store”, allows only two operations that reference
    primary memory: register loads from memory and register stores into memory.

5.  Use pipelined and superscalar approaches that attempt to keep each unit
    in the CPU busy all the time. At the very least, provide for fetching one
    instruction while the previous instruction is being executed.

6.  Push the complexity onto the compiler. This is called moving the DSI
    (Dynamic–Static Interface). Let compilation (the static phase) handle any
    issue that it can, so that execution (the dynamic phase) is simplified.

The Fetch–Execute Cycle (Again)

This cycle is the logical basis of all stored program computers.

Instructions are stored in memory as machine language.

Instructions are fetched from memory and then executed.

The common fetch cycle can be expressed in the following control sequence.

MAR  PC.// The PC contains the address of the instruction.

READ.// Put the address into the MAR and read memory.

IR  MBR.// Place the instruction into the MBR.

This cycle is described in many different ways, most of which serve to highlight additional steps required to execute the instruction. Examples of additional steps are: Decode the Instruction, Fetch the Arguments, Store the Result, etc.
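The three-step fetch above can be mirrored directly in a minimal sketch (the function name `fetch` and the use of a dictionary as memory are invented for illustration):

```python
def fetch(memory, pc):
    """One fetch cycle: MAR <- PC, READ, IR <- MBR."""
    mar = pc            # MAR <- PC: the PC holds the instruction's address
    mbr = memory[mar]   # READ: memory delivers the word at that address into the MBR
    ir = mbr            # IR <- MBR: the instruction is now ready to decode
    return ir

memory = {0: "Load A", 1: "Add B", 2: "Store C"}
assert fetch(memory, 0) == "Load A"
```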

A stored program computer is often called a “von Neumann Machine” after one of the originators of the EDVAC.

This Fetch–Execute cycle is often called the “von Neumann bottleneck”, as the necessity for fetching every instruction from memory slows the computer.

Avoiding the Bottleneck

In the simple stored program machine, the following loop is executed.

    Fetch the next instruction
    Loop Until Stop
        Execute the instruction
        Fetch the next instruction
    End Loop

The first attempt to break out of this endless cycle was “instruction prefetch”;
fetch the next instruction at the same time the current one is executing.

As we can easily see, this concept can be extended.

Instruction–Level Parallelism: Instruction Prefetch

Break up the fetch–execute cycle and do the two in parallel.

This dates to the IBM Stretch (1959)

The prefetch buffer is implemented in the CPU with on–chip registers, either as a single register or as a queue.
The CDC–6600 buffer had a queue of length 8 (I think).

Think of the prefetch buffer as containing the IR (Instruction Register)

When the execution of one instruction completes, the next one is already
in the buffer and does not need to be fetched.

Any program branch (loop structure, conditional branch, etc.) will invalidate the contents of the prefetch buffer, which must be reloaded.

Instruction–Level Parallelism: Pipelining

Better considered as an “assembly line”

Note that throughput is distinct from the time required to execute a single instruction. With a five–stage pipeline, the throughput is five times the single–instruction rate, even though each instruction still takes the full five stages to complete.

What About Two Pipelines?

Code emitted by a compiler tailored for this architecture has the possibility to run twice as fast as code emitted by a generic compiler.

Some pairs of instructions are not candidates for dual pipelining.

C = A + B
D = A  C// Need the new value of C here

This is called a RAW (Read After Write) dependency, in that the value for
C must be written to a register before it can be read for the next operation.
Stopping the pipeline for a needed value is called stalling.
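A RAW dependency between two instructions is easy to detect mechanically; a minimal sketch, with each instruction represented as a hypothetical (destination, source, source) triple and the function name `raw_hazard` invented for illustration:

```python
def raw_hazard(first, second):
    """True if `second` reads a register that `first` writes (read-after-write)."""
    dest, *_ = first
    _, *sources = second
    return dest in sources

# C = A + B, then D = A x C: the second instruction reads C before it is written.
assert raw_hazard(("C", "A", "B"), ("D", "A", "C")) is True

# C = A + B, then E = A + B: no dependency, so the pair may be dual-issued.
assert raw_hazard(("C", "A", "B"), ("E", "A", "B")) is False
```

A real issue unit must also check WAR and WAW dependencies, but the RAW case is the one that forces the stall described above.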

Superscalar Architectures

Having 2, 4, or 8 completely independent pipelines on a CPU is very
resource–intensive, and careful analysis shows that full duplication is not needed.

Often, the execution units are the slowest stages by a large margin. It
is usually a better use of resources to replicate only the execution units,
rather than entire pipelines.