236381 - VLSI Project

A Perceptron-Based Branch Prediction Study

Under the supervision of Michael Behar

Performed by

314518853 / Apfelbaum Marcel / smarchel@t2
304087547 / Zlotnik Alexander / sip@t2

Table of contents

1. Background
1.1. Motivation
1.2. Introducing the perceptron
1.3. The main perceptron advantage – long histories
2. Design and algorithms
2.1. High level design
2.2. Implementation
2.2.1. Platform
2.2.2. Structures
2.3. Main algorithms
2.3.1. Lookup function
2.3.2. Training function
3. Results
3.1. Test Settings
3.2. Test results
4. Conclusions
4.1. Based on our results
4.2. Future research proposals
5. Bibliography

1.  Background

1.1.  Motivation

Branch prediction has been one of the main objectives of processor designers for the last 20 years. Branch instructions make up about 20% of the instructions a processor executes. Correctly predicting a branch saves the processor time and wasted work, and high branch prediction rates lead to faster program execution and better power management. Both of these issues are highly "fashionable".

1.2.  Introducing the perceptron

A perceptron is a learning device that takes a set of input values and combines them with a set of weights (which are learned through training) to produce an output value. In our predictor, each weight represents the degree of correlation between the behavior of a past branch and the behavior of the branch being predicted. Positive weights represent positive correlation, and negative weights represent negative correlation. To make a prediction, each weight contributes in proportion to its magnitude in the following manner. If its corresponding branch was taken, we add the weight; otherwise we subtract the weight. If the resulting sum is positive, we predict taken; otherwise we predict not taken. The perceptrons are trained by an algorithm that increments a weight when the branch outcome agrees with the weight’s correlation and decrements the weight otherwise.
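Formally, if the history bits are encoded as inputs $x_i = 1$ for a taken branch and $x_i = -1$ for a not-taken branch, and $w_0$ is a bias weight, the perceptron output described above is

$$y = w_0 + \sum_{i=1}^{n} x_i w_i$$

and the branch is predicted taken exactly when $y \geq 0$. The bias weight learns the branch's own tendency, independently of the other branches' history.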

1.3.  The main perceptron advantage – long histories

Well-accepted structures for dynamic branch predictors are based on two-bit counters. In this project, we explore a specific type of artificial neurons and propose a scheme that uses perceptrons instead of two-bit counters. Because the size of perceptrons scales linearly with the size of their inputs, which in our case is the branch history, a perceptron predictor can exploit long history lengths. By contrast, traditional two-level adaptive schemes use pattern history tables (PHTs), which are indexed by the branch history and which therefore grow exponentially with the history length. Thus the PHT structure limits the length of the history register to the logarithm of the number of counters. As a result, for the same hardware budget, a perceptron predictor can consider much longer histories than PHT-based schemes. For example, for a 4 KB hardware budget, a PHT based predictor can use a history length of 14, whereas a version of the perceptron predictor can use a history length of 34. These longer history lengths lead to higher accuracy.
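The arithmetic behind this example: a PHT of 2-bit counters indexed by $h$ history bits costs $2 \cdot 2^h$ bits, so a 4 KB budget gives

$$2 \cdot 2^{14} = 32768 \text{ bits} = 4 \text{ KB} \implies h = 14,$$

whereas a table of $N$ perceptrons with $b$-bit weights costs only $N \cdot (h+1) \cdot b$ bits, linear in $h$. For instance (an illustrative combination, not a figure from our experiments), $117 \cdot 35 \cdot 8 \approx 32760$ bits accommodates $h = 34$ with 8-bit weights in roughly the same budget.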

2.  Design and algorithms

2.1.  High level design

For every branch there is an entry in a table. The entry holds a vector of weights, and its dot product with the global history register (viewed as a vector of ±1 inputs) gives the prediction.

Every time we encounter a conditional branch instruction, we compute the product and make a prediction. Based on the prediction we fetch instructions and move the branch to the next pipe stage. Once the branch condition is resolved, we check whether the prediction was correct. If not, we flush the wrongly fetched instructions from the pipe, reset the PC and continue from the correct location. In any case, we train the perceptron, i.e., we update the predictor according to the history register at the time of the prediction.

2.2.  Implementation

A more detailed view of the implementation is given in the “External documentation” section on the project’s site.

2.2.1.  Platform

A well-known processor simulator – SimpleScalar (http://simplescalar.com).

A neural-network branch prediction mechanism was implemented on top of the existing platform.

2.2.2.  Structures

The following structures have been modified:

enum bpred_class {
    BPredPerceptron,                         /* perceptron predictor */
    BPred_NUM
};

struct bpred_dir_t {
    enum bpred_class class;
    union {
        struct {
            int num_prcpt;                   /* number of perceptrons in the table */
            int weight_bits;                 /* bits per weight */
            int prcpt_history;               /* history length used per perceptron */
            int *weights_table;              /* table of perceptron weight vectors */
            unsigned long long glbl_history; /* real (non-speculative) global history */
            unsigned long long spc_glbl_history; /* speculative global history */
        } neural;
    } config;
};

The following structure has been added:

typedef struct {
    char dummy;             /* holds the prediction instead of pdir1 */
    int prediction;         /* prediction: 1 for taken, 0 for not taken */
    int output;             /* perceptron output */
    unsigned long long int history;
                            /* value of the history register yielding this prediction */
    int *weights_table_entry;
    int *masks_table_entry;
    unsigned long long *counter_entry;
    unsigned long long *counter_limit;
} prcpt_update_t;

This structure helps emulate the perceptron behavior; it is circulated through SimpleScalar as the perceptron’s decision (char *).

It is initialized on lookup and used on training.
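A minimal sketch of that circulation pattern, using a trimmed-down version of the structure (the function names here are illustrative, not the project's actual ones):

#include <stdio.h>

/* trimmed-down version of prcpt_update_t, enough to show the pattern */
typedef struct {
    int prediction;             /* 1 for taken, 0 for not taken */
    int output;                 /* perceptron output */
    unsigned long long history; /* history at prediction time */
} prcpt_update_t;

/* at lookup: fill the record and hand it out through the char * interface */
char *example_lookup(prcpt_update_t *u, int output)
{
    u->output = output;
    u->prediction = output >= 0;
    return (char *)u;           /* circulated as SimpleScalar's "counter" pointer */
}

/* at training: recover the record from the same char * */
void example_train(char *pdir, int taken)
{
    prcpt_update_t *u = (prcpt_update_t *)pdir;
    if (u->prediction != taken)
        printf("misprediction, output was %d\n", u->output);
}

int main(void)
{
    prcpt_update_t u = {0};
    char *pdir = example_lookup(&u, -3); /* negative output: predict not taken */
    example_train(pdir, 1);              /* branch was actually taken */
    return 0;
}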

2.3.  Main algorithms

weights_table – represents the table of perceptrons; every entry is a perceptron represented by its weights.

(glbl_)history – the history register; every set bit represents a taken branch.

spc_glbl_history – the speculative history, as seen at lookup time during out-of-order execution (OOOE).

2.3.1.  Lookup function

First, the relevant perceptron is located in weights_table:

index = NEURAL_HASH(pred_dir, baddr);

Then the perceptron’s weights are used to compute the output:

for (mask = 1, i = 0; i < pred_dir->config.neural.prcpt_history;
     i++, mask <<= 1, w++)
    if (pred_dir->config.neural.spc_glbl_history & mask)
        output += *w;
    else
        output -= *w;

The first table entry is the bias weight and does not depend on the history.

The prediction is then simply the sign of the output:

pred_state->prediction = output >= 0;
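Putting the pieces together, here is a minimal self-contained sketch of the lookup path; the table sizes, the NEURAL_HASH definition and the function name are illustrative assumptions, not the project’s actual code:

#include <stdio.h>

#define NUM_PRCPT 64                  /* perceptrons in the table (assumed) */
#define HIST_LEN  30                  /* history bits per perceptron (assumed) */
#define NEURAL_HASH(baddr) (((baddr) >> 2) % NUM_PRCPT) /* hypothetical hash */

static int weights[NUM_PRCPT][HIST_LEN + 1]; /* +1 for the bias weight */

/* Returns 1 = predict taken, 0 = predict not taken; *output gets the sum. */
int perceptron_lookup(unsigned long long baddr,
                      unsigned long long spc_glbl_history, int *output)
{
    int *w = weights[NEURAL_HASH(baddr)];
    unsigned long long mask;
    int i;

    *output = *w++;                   /* bias weight: independent of history */
    for (mask = 1, i = 0; i < HIST_LEN; i++, mask <<= 1, w++)
        if (spc_glbl_history & mask)
            *output += *w;            /* past branch taken: add the weight */
        else
            *output -= *w;            /* past branch not taken: subtract it */
    return *output >= 0;              /* predict taken on a non-negative sum */
}

int main(void)
{
    int out;
    weights[0][0] = 5;                /* toy bias so the demo predicts taken */
    int pred = perceptron_lookup(0, 0, &out);
    printf("prediction=%d output=%d\n", pred, out);
    return 0;
}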

2.3.2.  Training function

The perceptron is trained when the branch is executed and the outcome is calculated.

First, the real global history register is updated:

pred->dirpred.bimod->config.neural.glbl_history <<= 1;
pred->dirpred.bimod->config.neural.glbl_history |= taken;

Then, if there was a misprediction, the speculative history is restored from the real one:

if (u->prediction != taken)
    pred->dirpred.bimod->config.neural.spc_glbl_history =
        pred->dirpred.bimod->config.neural.glbl_history;

If the prediction was correct and |output| > THETA (the training threshold), no update is necessary. Otherwise, for each weight and its corresponding bit in the history register:

if (!!(history & mask) == taken)
    (*w)++;
else
    (*w)--;

Of course, the weight is checked against MAX_WEIGHT / MIN_WEIGHT prior to the update, so that it saturates rather than wraps around.
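A matching self-contained sketch of the training step; THETA follows the threshold formula from the original perceptron-predictor paper, and the weight range assumes 8-bit weights – both are illustrative choices, not necessarily the values used in our runs:

#define HIST_LEN   30                 /* must match the lookup sketch above */
#define THETA      ((int)(1.93 * HIST_LEN + 14)) /* threshold from the paper */
#define MAX_WEIGHT 127                /* saturation range for 8-bit weights */
#define MIN_WEIGHT (-128)

/* 'w' points at the bias weight of the perceptron chosen at lookup time;
   'history' is the history register value that produced the prediction. */
void perceptron_train(int *w, unsigned long long history,
                      int output, int taken, int prediction)
{
    unsigned long long mask;
    int i;

    /* skip training when the prediction was correct and confident */
    if (prediction == taken && (output < 0 ? -output : output) > THETA)
        return;

    /* bias weight: trained directly on the outcome */
    if (taken) { if (*w < MAX_WEIGHT) (*w)++; }
    else       { if (*w > MIN_WEIGHT) (*w)--; }
    w++;

    for (mask = 1, i = 0; i < HIST_LEN; i++, mask <<= 1, w++)
        /* increment when the history bit agrees with the outcome, else decrement */
        if (!!(history & mask) == taken) { if (*w < MAX_WEIGHT) (*w)++; }
        else                             { if (*w > MIN_WEIGHT) (*w)--; }
}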

3.  Results

3.1.  Test Settings

3.1.1.  System configurations

Here is the configuration of the simulated system that ran the benchmarks:

# instruction fetch queue size (in insts)
-fetch:ifqsize 8
# extra branch mis-prediction latency
-fetch:mplat 3
# speed of front-end of machine relative to execution core
-fetch:speed 1
# instruction decode B/W (insts/cycle)
-decode:width 4
# instruction issue B/W (insts/cycle)
-issue:width 4
# run pipeline with in-order issue
-issue:inorder false
# issue instructions down wrong execution paths
-issue:wrongpath true
# instruction commit B/W (insts/cycle)
-commit:width 4
# l1 data cache config, i.e., {<config>|none}
-cache:dl1 dl1:128:32:4:l
# l1 data cache hit latency (in cycles)
-cache:dl1lat 1
# l2 data cache config, i.e., {<config>|none}
-cache:dl2 ul2:1024:64:8:l
# l2 data cache hit latency (in cycles)
-cache:dl2lat 10
# memory access latency (<first_chunk> <inter_chunk>)
-mem:lat 100 2
# memory access bus width (in bytes)
-mem:width 8
# total number of integer ALU's available
-res:ialu 4
# total number of integer multiplier/dividers available
-res:imult 1
# total number of memory system ports available (to CPU)
-res:memport 2
# total number of floating point ALU's available
-res:fpalu 4

3.1.2.  Benchmarks

We compared the predictors' performance under various parameter settings on the following benchmarks from the ss_spec2k suite:

1.  VPR

2.  Perl

3.  Parser

During the runs we skipped (fast-forwarded) the first 300 million instructions of each benchmark and tested the prediction mechanism over the following 100 million instructions.
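A sketch of how one such run looks on the SimpleScalar command line, combining the configuration above with fast-forwarding; the -bpred perceptron option name is hypothetical here – the actual flag name is defined in the project's modified sources:

./sim-outorder -fastfwd 300000000 -max:inst 100000000 \
    -bpred perceptron <configuration flags above> vpr.ss <benchmark arguments>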

3.1.3.  Tested metrics

While running the benchmarks we collected the following metrics: prediction rate, IPC, and the ratio of executed to committed instructions.

3.2.  Test results

In most cases and parameter settings, perceptron-based prediction beats Gshare (a traditional two-level adaptive scheme) even with a smaller buffer size.

We ran the Gshare predictor with global history register lengths of 8 to 17, and perceptron-based predictors with history lengths of 10 to 30 and 64 to 2048 entries in the perceptron table.

The results clearly show that, for most parameter settings, perceptron-based prediction beats Gshare at the same or smaller memory size.

3.2.1.  Prediction rates

The perceptron-based predictor outperforms Gshare in almost every case.

This can be seen by comparing Table 1.A with Table 1.B.

Each shows the prediction success rate in percent, the memory size required for the given predictor settings, and the history length achieved within this memory size.
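As a reading aid for the memory-size columns (our interpretation, consistent with every row of the tables below; the perceptron tables apparently count 6 bits per weight):

$$\text{Gshare: } 2^{h} \cdot 2 \text{ bits}, \qquad \text{Perceptron: } N \cdot h \cdot 6 \text{ bits}$$

e.g. $2^8 \cdot 2 = 512$ bits for Gshare with $h = 8$, and $64 \cdot 10 \cdot 6 = 3840$ bits for 64 perceptrons with $h = 10$.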

Gshare prediction
Benchmark: VPR

GHR[1] length / Prediction rate / Memory size (bits)
8 / 93.19% / 512
9 / 94.71% / 1024
10 / 96.27% / 2048
11 / 96.53% / 4096
12 / 97.24% / 8192
13 / 97.48% / 16384
14 / 97.78% / 32768
15 / 97.86% / 65536
16 / 97.92% / 131072
17 / 97.92% / 262144

Table 1.A: Prediction rate of Gshare with the specified history lengths.

Perceptron prediction
Benchmark: VPR

GHR length / #Perceptrons / Prediction rate / Memory size (bits)
10 / 64 / 98.29% / 3840
10 / 256 / 98.23% / 15360
15 / 64 / 98.61% / 5760
15 / 256 / 98.64% / 23040
20 / 64 / 98.84% / 7680
20 / 256 / 98.88% / 30720
25 / 64 / 98.93% / 9600
25 / 256 / 98.94% / 38400
30 / 64 / 99.12% / 11520
30 / 256 / 99.18% / 46080

Table 1.B: Prediction rate of a perceptron-based predictor. The first two columns specify the length of the history used to make the prediction and the number of perceptrons (entries) in the prediction table.

The data clearly shows that the perceptron-based predictor performs better than Gshare. For the checked benchmark, VPR, even the simplest perceptron (GHR length 10, 64 perceptrons) makes a better prediction (98.29%) than the most sophisticated Gshare (GHR length 17, 97.92%).

This comes as a surprise, since the common reasoning is that the perceptron gives good results because it can handle longer histories than Gshare at the same memory size. We will see that this is a unique case; in the other examples the perceptron-based predictor needs longer histories in order to outrun Gshare.

A complete comparison of Gshare and perceptron prediction rates on the VPR benchmark can be seen in Graph 1.

The statements above are also clearly visible in the results of the Parser benchmark.

Gshare prediction
Benchmark: Parser

GHR length / Prediction rate / Memory size (bits)
8 / 81.66% / 512
9 / 83.04% / 1024
10 / 84.42% / 2048
11 / 85.32% / 4096
12 / 86.32% / 8192
13 / 86.81% / 16384
14 / 87.38% / 32768
15 / 87.66% / 65536
16 / 87.87% / 131072
17 / 88.02% / 262144

Table 2.A: Prediction rate of Gshare with the specified history lengths.

Perceptron prediction
Benchmark: Parser

GHR length / #Perceptrons / Prediction rate / Memory size (bits)
10 / 64 / 86.94% / 3840
10 / 256 / 87.30% / 15360
15 / 64 / 87.68% / 5760
15 / 256 / 88.00% / 23040
20 / 64 / 87.93% / 7680
20 / 256 / 88.23% / 30720
25 / 64 / 88.14% / 9600
25 / 256 / 88.42% / 38400
30 / 64 / 88.24% / 11520
30 / 256 / 88.52% / 46080

Table 2.B: Prediction rate of a perceptron-based predictor. The first two columns specify the length of the history used to make the prediction and the number of perceptrons (entries) in the prediction table.

The perceptron-based predictor performs better than Gshare even at a smaller memory size.

Additionally, we can see that beyond a certain GHR length both Gshare and the perceptron predictor reach a plateau. This is clearly seen in Graph 2, which is based on the same data as Tables 2.A and 2.B.

Graph 2

Both graphs indicate that the longer the history considered, the better the prediction. Claims that overly long histories interfere with, or add noise to, the results were not borne out by our test cases.

Gshare prediction
Benchmark: Perl

GHR length / Prediction rate / Memory size (bits)
8 / 87.14% / 512
9 / 88.40% / 1024
10 / 90.76% / 2048
11 / 91.26% / 4096
12 / 91.31% / 8192
13 / 91.54% / 16384
14 / 91.55% / 32768
15 / 91.56% / 65536
16 / 91.56% / 131072
17 / 91.56% / 262144

Table 3.A: Prediction rate of Gshare.

Perceptron prediction
Benchmark: Perl

GHR length / #Perceptrons / Prediction rate / Memory size (bits)
10 / 64 / 91.51% / 3840
10 / 256 / 91.55% / 15360
15 / 64 / 91.50% / 5760
15 / 256 / 91.56% / 23040
20 / 64 / 91.53% / 7680
20 / 256 / 91.57% / 30720
25 / 64 / 91.53% / 9600
25 / 256 / 91.57% / 38400
30 / 64 / 91.54% / 11520
30 / 256 / 91.57% / 46080

Table 3.B: Prediction rate of a perceptron-based predictor.

Graph 3

Graph 3 (Perl benchmark prediction rate comparison) shows that Gshare and the perceptron-based predictor agree that it is hard to exceed a prediction rate of 91.56% on the Perl benchmark. In fact, for every tested history length the perceptron gives a similar prediction rate. This shows that there are cases where the simplest perceptron (3840 bits) equals much bigger ones in prediction power.

3.2.2.  IPC[2]

In two benchmarks (VPR and Parser) the IPC behaves similarly to the prediction rate. The perceptron-based predictor overtakes Gshare either from the start (VPR, Tables 4.A and 4.B, Graph 4) or as the considered history length increases (Parser, Tables 5.A and 5.B, Graph 5).

Gshare IPC
Benchmark: VPR

GHR length / IPC / Memory size (bits)
8 / 1.8437 / 512
9 / 1.8709 / 1024
10 / 1.9032 / 2048
11 / 1.9048 / 4096
12 / 1.9197 / 8192
13 / 1.925 / 16384
14 / 1.9293 / 32768
15 / 1.9307 / 65536
16 / 1.9323 / 131072
17 / 1.9323 / 262144

Table 4.A: IPC of Gshare with the specified history lengths.

Perceptron IPC
Benchmark: VPR

GHR length / #Perceptrons / IPC / Memory size (bits)
10 / 64 / 1.9697 / 3840
10 / 256 / 1.9631 / 15360
15 / 64 / 1.984 / 5760
15 / 256 / 1.9824 / 23040
20 / 64 / 1.9942 / 7680
20 / 256 / 1.994 / 30720
25 / 64 / 1.9999 / 9600
25 / 256 / 1.9958 / 38400
30 / 64 / 2.0137 / 11520
30 / 256 / 2.0122 / 46080

Table 4.B: IPC of a perceptron-based predictor.