Advanced Compiler Design and Implementation

Mooly Sagiv

Course notes

Case Studies of Compilers and Future Trends

(Chapter 21)

By Ronny Morad & Shachar Rubinstein

24/6/2001

Case Studies

This chapter deals with real-life examples of compilers. For each compiler this scribe will discuss three subjects:

·  A brief history of the compiler.

·  The structure of the compiler with emphasis on the back end.

·  The optimizations performed on two example programs. Note that this exercise cannot be used to measure or compare the performance of the compilers.

The compilers examined are the following:

·  Sun compilers for SPARC Version 8 and Version 9

·  IBM XL compilers for the Power and PowerPC architectures. Power and PowerPC are classes of architectures; the processors are sold in different configurations.

·  Digital compilers for the Alpha. (The Alpha processor technology was later bought by Intel.)

·  Intel reference compilers for the 386 architecture family

Historically, compilers were built for specific processors. Today this is no longer a given: companies use other developers’ compilers. For example, Intel uses IBM’s compiler for the Pentium processor.

The compilers will compile two programs: a C program and a Fortran 77 program.

The C program

int length, width, radius;
enum figure {RECTANGLE, CIRCLE};

main()
{
    int area = 0, volume = 0, height;
    enum figure kind = RECTANGLE;

    for (height = 0; height < 10; height++)
    {
        if (kind == RECTANGLE) {
            area += length * width;
            volume += length * width * height;
        }
        else if (kind == CIRCLE) {
            area += 3.14 * radius * radius;
            volume += 3.14 * radius * height;
        }
    }
    process(area, volume);
}

Possible optimizations:

  1. The value of ‘kind’ is constant and equal to ‘RECTANGLE’. Therefore the ‘else’ branch is dead code, and the first ‘if’ test is also redundant.
  2. ‘length * width’ is loop-invariant and can be computed once before the loop.
  3. Because ‘length * width’ is loop-invariant, ‘area’ can be computed with a single multiplication, namely 10 * length * width.
  4. The computation of ‘volume’ inside the loop can be done with addition instead of multiplication (strength reduction); a source-level sketch appears after the note below.
  5. The call to ‘process()’ is a tail call. This fact can be used to avoid creating a stack frame.
  6. Compilers will probably use loop unrolling to increase pipeline utilization.

Note: Without the call to ‘process()’, all of the code would be dead, because ‘area’ and ‘volume’ are not otherwise used.
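Here is a hand-written, source-level sketch (not actual compiler output) of the program after optimizations 1, 2, 4, and 5; the temporaries ‘lw’ and ‘lwh’ are invented names:

int length, width, radius;

main()
{
    int area = 0, volume = 0, height;
    int lw = length * width;      /* (2) loop invariant, hoisted out of the loop */
    int lwh = 0;                  /* (4) running value of lw * height */

    for (height = 0; height < 10; height++)
    {                             /* (1) the CIRCLE branch and the test are gone */
        area += lw;
        volume += lwh;
        lwh += lw;                /* (4) addition replaces the multiplication */
    }
    process(area, volume);        /* (5) tail call: no stack frame needed */
}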

The Fortran 77 program:

      integer a(500, 500), k, l

      do 20 k = 1, 500
        do 20 l = 1, 500
          a(k, l) = k + l
20    continue
      call s1(a, 500)
      end

      subroutine s1(a, n)
      integer a(500, 500), n

      do 100 i = 1, n
        do 100 j = i + 1, n
          do 100 k = 1, n
            l = a(k, i)
            m = a(k, j)
            a(k, j) = l + m
100   continue
      end

Possible optimizations:

  1. a(k,j) is computed twice. This can be avoided by common-subexpression elimination.
  2. The call to ‘s1’ is a tail call. Because the compiler has the source of ‘s1’, it can be inlined into the main program, and the resulting code can then be optimized further. Most compilers will nevertheless keep the original copy of ‘s1’ intact.
  3. If the procedure is not inlined, interprocedural constant propagation can be used to discover that ‘n’ is a constant equal to 500.
  4. The addresses of elements of ‘a’ are computed with multiplications. These can be replaced by additions: the compiler “knows” how the array is laid out in memory (in Fortran, arrays are stored column by column), so it can add the right number of bytes to a running address each time instead of recomputing it. A sketch in C follows this list.
  5. After optimization 4, the loop counters are no longer needed, and the loop conditions can be replaced by tests on the running address. This is done by linear-function test replacement.
  6. Again, loop unrolling will be applied according to the architecture.
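A minimal C rendering (hand-written, not compiler output) of optimizations 4 and 5 applied to the inner loop of ‘s1’. Because C arrays are row-major while Fortran arrays are column-major, rows here play the role of Fortran’s columns:

void s1(int a[500][500], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int *pi  = &a[i][0];           /* running address for a(k,i) */
            int *pj  = &a[j][0];           /* running address for a(k,j) */
            int *end = pj + n;             /* (5) the exit test compares addresses */
            for (; pj < end; pi++, pj++)   /* (4) addresses advance by one element */
                *pj += *pi;                /* (1) a(k,j) is computed only once */
        }
}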

Sun SPARC

The SPARC architecture

SPARC has two major versions of the architecture, Version 8 and Version 9.

SPARC Version 8 has the following features:

·  A 32-bit pipelined, superscalar RISC architecture.

·  Integer and floating point units.

·  The integer unit has a set of 32-bit general registers and executes load, store, arithmetic, logical, shift, branch, call, and system-control instructions. It also computes addresses (register + register or register + displacement).

·  The floating-point unit has 32 32-bit floating-point data registers and implements the ANSI/IEEE floating-point standard.

·  Eight of the integer unit’s registers are global general-purpose registers. The first of them is hardwired to the constant zero (r0 = 0).

·  Three-address instructions of the form ‘Instruction Src1, Src2, Result’; for example, ‘add %o0, %o1, %o2’ computes %o2 = %o0 + %o1.

·  Several overlapping register windows of 24 registers each (spilling handled by the OS). The windows make procedure calls cheap: when the processor runs out of windows, it raises a trap, and the OS saves registers to memory and later refills them with the necessary values.

SPARC 9 is a 64-bit version, fully upward-compatible with Version 8.

The assembly language guide is on pages 748-749 of the course book, tables A.1, A.2, A.3.

The SPARC compilers

General

Sun SPARC compilers originated from the Berkeley 4.2 BSD UNIX software distribution and have been developed at Sun since 1982. The original back end was for the Motorola 68010; it was migrated successively to later members of the M68000 family and then to SPARC. Work on global optimization began in 1984, and on interprocedural optimization and parallelization in 1989. The optimizer is organized according to the mixed model of optimization. Today Sun provides front ends, and thus compilers, for C, C++, Fortran 77, and Pascal.

The structure

The four compilers (C, C++, Fortran 77, and Pascal) share the same back end. Each front end translates its source language into Sun IR, an intermediate representation discussed later. The back end consists of two parts:

·  Yabe (“Yet Another Back-End”), which generates relocatable object code directly, without optimization.

·  An optimizer.

The optimizer is divided into the following components:

·  The automatic inliner. This component runs only at optimization level O4 (discussed later). It replaces some calls to routines within the same compilation unit with inline copies of the routines’ bodies. Next, tail-recursion elimination is performed, and other tail calls are marked for the code generator to optimize.

·  The aliaser. The aliaser uses information provided by the language-specific front end to determine which sets of variables may, at some point in the procedure, map to the same memory location. How aggressive the aliaser is depends on the optimization level. Aliasing information is attached to each triple that requires it, for use by the global optimizer. A small example of why alias information matters follows this list.

·  IRopt, the global optimizer.

·  The code generator.
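To see why alias information matters, consider this small illustrative C fragment (not from the course book). Unless the aliaser can prove that p and q never map to the same location, the compiler may not propagate the constant 1 into the return:

int f(int *p, int *q)
{
    *p = 1;
    *q = 2;
    return *p;   /* 1 only if p and q do not alias; 2 if they do */
}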

The Sun IR

The Sun IR represents a program as a linked list of triples representing executable operations, plus several tables representing declarative information. For example, here is the Sun IR for the subroutine ‘s1’ above:

ENTRY “s1_” {IS_EXT_ENTRY, ENTRY_IS_GLOBAL}

GOTO LAB_32

LAB_32: LTEMP.1 = (.n { ACCESS V41} );

i = 1

CBRANCH (i <= LTEMP.1, 1: LAB_36, 0: LAB_35);

LAB_36: LTEMP.2 = (.n { ACCESS V41} );

j=i+1

CBRANCH (j <= LTEMP.2, 1: LAB_41, 0: LAB_40);

LAB_41: LTEMP.3 = (.n { ACCESS V41} );

k=1

CBRANCH (k <= LTEMP.3, 1: LAB_46, 0: LAB_45);

LAB_46: l = (.a[k,i] {ACCESS V20});

m = (.a[k,j] {ACCESS V20});

*(a[k,j] = l+m {ACCESS V20, INT});

LAB_34: k = k+1;

CBRANCH(k>LTEMP.3, 1: LAB_45, 0: LAB_46);

LAB_45: j = j+1;

CBRANCH(j>LTEMP.2, 1: LAB_40, 0: LAB_41);

LAB_40: i = i+1;

CBRANCH(i>LTEMP.1, 1: LAB_35, 0: LAB_36);

LAB_35:

CBRANCH is a generic conditional branch, not tied to any particular architecture. It names two targets: the first is taken when the condition evaluates to true (1), and the second when it evaluates to false (0).
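In C terms, the first CBRANCH in the listing can be read roughly as the following sketch (with LTEMP.1 rendered as an ordinary variable):

/* CBRANCH (i <= LTEMP.1, 1: LAB_36, 0: LAB_35) corresponds to: */
if (i <= LTEMP_1) goto LAB_36; else goto LAB_35;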

Sun IR lies somewhere between MIR and LIR. It is not LIR, because it contains no machine registers; it is not quite MIR, because memory is accessed through the compiler’s own memory organization, as in the use of the LTEMP temporaries.

Optimization levels

There are four optimization levels:

O1 Limited optimizations. This level invokes only certain optimization components of the code generator.

O2 This and higher levels invoke both the global optimizer and the optimizer components of the code generator. At this level, expressions that involve global variables, equivalenced variables, aliased local variables, or volatile variables are not candidates for optimization. Automatic inlining, software pipelining, loop unrolling, and the early phase of instruction scheduling are not done.

O3 This level also optimizes expressions that involve global variables, but it makes worst-case assumptions about potential aliases caused by pointers, and it omits early instruction scheduling and automatic inlining. In practice, this level gives the best results.

O4 This level aggressively tracks what pointers may point to, making worst-case assumptions only where necessary. It depends on the language-specific front ends to identify potentially aliased variables, pointer variables, and a worst-case set of potential aliases. It also does automatic inlining and early instruction scheduling. This level turned out to be very problematic because of bugs in the front ends.

The global optimizer

The optimizer takes Sun IR as input and produces Sun IR as output.

The global optimizer performs the following on that input:

·  Control-flow analysis is done by identifying dominators and back edges, except that the parallelizer does structural analysis for its own purposes.

·  The parallelizer looks for code the processor can execute in parallel. In practice it rarely improves execution time (the Alpha processor is where it has an effect, if any); most of the time its purpose is simply to avoid disturbing the processor’s own parallelism.

·  The global optimizer processes each procedure separately, using basic blocks. It first computes additional control-flow information. In particular, loops are identified at this point, including both explicit loops (for example, ‘do’ loops in Fortran 77) and implicit ones constructed from ‘if’s and ‘goto’s.

·  Then a series of data-flow analyses and transformations is applied to the procedure. All data-flow analysis is done iteratively. Each transformation phase first computes (or recomputes) data-flow information if needed. The transformations are performed in the following order (a small before-and-after example follows the list):

  1. Scalar replacement of aggregates and expansion of Fortran arithmetic on complex numbers to sequences of real-arithmetic operations.
  2. Dependence-based analysis and transformations (levels O3 and O4 only, as described below).
  3. Linearization of array addresses.
  4. Algebraic simplification and reassociation of address expressions.
  5. Loop-invariant code motion.
  6. Strength reduction and induction-variable removal.
  7. Global common-subexpression elimination.
  8. Global copy and constant propagation.
  9. Dead-code elimination.
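As a small before-and-after illustration of steps 7 through 9 (hand-written, not compiler output; the function names are invented):

/* Before: x * y is computed twice, and k always holds the constant 2. */
int before(int x, int y)
{
    int k = 2;
    int a = x * y + k;
    int b = x * y - k;
    return a * b;
}

/* After global CSE (7), constant propagation (8), and dead-code
   elimination of the now-unused k (9): */
int after(int x, int y)
{
    int t = x * y;            /* the common subexpression, computed once */
    return (t + 2) * (t - 2);
}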

·  The dependence-based analysis and transformation phase is designed to support parallelization and data-cache optimization, and it may be done (under control of a separate option) when the optimization level selected is O3 or O4. The steps comprising it (in order) are as follows, with a worked example of the GCD test after the list:

  1. Constant propagation.
  2. Dead-code elimination.
  3. Structural control-flow analysis.
  4. Loop discovery (including determining the index variables, lower and upper bounds, and increments).
  5. Segregation of loops that have calls and early exits in their bodies.
  6. Dependence analysis using the GCD and Banerjee-Wolfe tests, producing direction vectors and loop-carried scalar du- and ud-chains.
  7. Loop distribution.
  8. Loop interchange.
  9. Loop fusion.
  10. Scalar replacement of array elements.
  11. Recognition of reductions.
  12. Data-cache tiling.
  13. Profitability analysis for parallel code generation.
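As a concrete illustration of the GCD test in step 6, here is a minimal sketch (the function names are invented, and it assumes subscripts of the form c1*i + k1 and c2*i + k2 with nonzero coefficients):

#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* The dependence equation c1*x - c2*y = k2 - k1 has integer solutions
   only if gcd(c1, c2) divides k2 - k1. Returns 1 if a dependence is
   possible, 0 if the GCD test disproves it. */
int gcd_test(int c1, int k1, int c2, int k2)
{
    return (k2 - k1) % gcd(abs(c1), abs(c2)) == 0;
}

For example, for the accesses a[2*i] and a[2*i + 1], gcd(2, 2) = 2 does not divide 1, so the loop carries no dependence between them.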

The code generator

After global optimization has been completed, the code generator first translates the Sun IR input into a representation called ‘asm+’, which consists of assembly-language instructions plus structures that represent control-flow and data-dependence information. An example is available on page 712 of the course book. The code generator then performs a series of phases, in the following order:

  1. Instruction selection.
  2. Inlining of assembly-language templates whose computational impact is understood (O2 and above).
  3. Local optimizations, including dead-code elimination, straightening, branch chaining, moving ‘sethi’s out of loops, replacement of branching code sequences by branchless machine idioms (see the sketch after this list), and commoning of condition-code setting (O2 and above).
  4. Macro expansion, phase 1 (expansion of switches and a few other constructs).
  5. Data-flow analysis of live variables (O2 and above).
  6. Software pipelining and loop unrolling (O3 and above).
  7. Early instruction scheduling (O4 only).
  8. Register allocation by graph coloring (O2 and above).
  9. Stack-frame layout.
  10. Macro expansion, phase 2 (expansion of memory-to-memory moves, max, min, comparison of value, entry, exit, etc.). Entry expansion includes accommodating leaf routines and generating position-independent code.
  11. Delay-slot filling.
  12. Late instruction scheduling.
  13. Inlining of assembly-language templates whose computational impact is not understood (O2 and above).
  14. Macro expansion, phase 3 (to simplify code emission).
  15. Emission of relocatable object code.
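As an illustration of the branchless machine idioms in phase 3, here is a generic sketch (not Sun’s actual output; it assumes two’s-complement integers):

/* min(a, b) without a conditional branch: -(a < b) is all ones when
   a < b and all zeros otherwise, so the mask selects a or b. */
int branchless_min(int a, int b)
{
    return b ^ ((a ^ b) & -(a < b));
}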

The Sun compiling system provides both static and dynamic linking; the selection is made by a link-time option.

Compilation results

The assembly code for the C program appears in the book on page 714. The code was compiled at optimization level O4.

The assembly code for the Fortran 77 program appears in the book on pages 715-716. The code was compiled at optimization level O4.

The numbers in parentheses below refer to the numbering of the possible optimizations listed for each program.

Optimizations performed on the C program

·  (1) The unreachable code in the ‘else’ branch was removed, except for π, which is still loaded from .L_const_seg_900000101 and stored at %fp-8.

·  (2) The loop-invariant ‘length * width’ has been moved out of the loop (‘smul %o0,%o1,%o0’).

·  (4) Strength reduction of the multiplication by ‘height’: instead of multiplying by ‘height’ each iteration, the previous value is updated by addition.

·  (6) Loop unrolling by a factor of four (‘cmp %l0,3’).

·  Local variables in registers.

·  All computations in registers.

·  (5) The tail call is identified and optimized by eliminating the stack frame.

Missed optimizations on the C program

·  Removal of the π computation.

·  (3) Computing ‘area’ in one instruction.

·  Complete unrolling of the loop: only the first 8 iterations were unrolled (two passes of the four-way unrolled body), and the remaining two still go through the rolled loop.

Optimizations performed on the Fortran 77 program

·  (2) Procedure integration (inlining) of ‘s1’. The compiler can then make use of the fact that n = 500, for example to unroll the loop, which it did.

·  (1) Common-subexpression elimination of ‘a(k,j)’.

·  Loop unrolling, from label .L900000112 to .L900000113.

·  Local variables in registers

·  Software pipelining. Note, for example, the load just above the starting label of the loop.

An example of software pipelining:

Suppose a loop body consists of the following three instructions, each depending on the result of the previous one:

    Load
    Add
    Store

The Add cannot start until the Load has finished, and the Store cannot start until the Add has finished, so within a single iteration the pipeline is poorly utilized. The compiler can improve this code by overlapping successive iterations:
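    Load(1)                            ; prologue
    Load(2)   Add(1)
    Load(3)   Add(2)   Store(1)        ; steady state begins
    ...
              Add(n)   Store(n-1)
                       Store(n)        ; epilogue

This is a schematic illustration, not actual SPARC output; the parenthesized numbers are iteration indices. While iteration i’s Add executes, iteration i+1’s Load is already in flight, so after a short prologue the loop completes one iteration’s worth of work per step, and a short epilogue drains the last iterations.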