CODE OPTIMIZATION

I. General methodology

Start with C code.

Compile with the -g option and no optimization. (If using 'C67x floating-point operations, you need to select the -mv6710 option in the compiler so that it will use the floating-point hardware of the DSP.) If performance is satisfactory, do not continue.

Compile with -g and the -o0 or -o1 optimization option. Check performance.

Compile with -g and the -o3 and -pm optimization options. Use -pm when you have no ASM code (there are ways to use -pm with ASM code, but we are not going into that level of detail).

(-g: enables source-level symbolic debugging; -pm: combines all C source files before compiling, enabling program-level optimization; -gs: interlists C statements into the assembly listing)
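
For reference, a hedged sketch of how these options might be combined on the command line (assuming the cl6x command-line compiler shell; the file names dotp.c and main.c are placeholders, and within CCS the same options are set in the project build options):

cl6x -g -o3 -pm dotp.c main.c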

Modify C code.

Use intrinsics, pragmas, word-wide optimization, loop unrolling (to remove branches), and compiler feedback.

Use optimization levels.

Linear ASM

Use profiling to identify the functions or sections that need further optimization.

Write these sections in linear asm.

Use word-wide optimization.

ASM

Write time-critical sections in asm.

Use word-wide optimization, parallel instructions, removal of NOPs.

Hand-optimize through dependency graph and scheduling.

II. C-code modifications

A. Intrinsics: special functions that map directly to inline 'C67x instructions. The compiler inserts the corresponding assembly language statements directly into the program.

Intrinsics are specified with a leading underscore ( _ ) and are treated like a function.

Ex: int c;

short a, b;

c = _mpy(a, b);

This is equivalent to c = a*b, BUT the operation is done at the assembly language level.
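
As a complete, hedged sketch, the fragment above might be wrapped in a function like the one below (the function name mult16 is an illustrative assumption; _mpy is recognized by the TI C6000 compiler as a built-in intrinsic):

int mult16(short a, short b)
{
    int c;

    c = _mpy(a, b); /* multiplies the low 16 bits of each operand; maps to a single MPY instruction */

    return c;
}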

in CCS: Help > Contents > TMS320C6000 Programmer's Guide > Optimizing C/C++ Code > Refining C/C++ Code > Using Intrinsics

B. Pragma directives provide information to the compiler or direct it to take a specific action.

#pragma MUST_ITERATE (min, max, factor)

Write the pragma in the C code immediately before the loop; it helps the compiler optimize the loop.
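
For instance, a minimal hedged sketch of where the pragma is placed (the function name sum_array and the bounds 8, 256, and 4 are illustrative assumptions; the three arguments give the compiler the minimum trip count, the maximum trip count, and a factor the trip count is a multiple of):

int sum_array(short *x, int n)
{
    int i, sum = 0;

    /* n is assumed to be at least 8, at most 256, and a multiple of 4 */
    #pragma MUST_ITERATE(8, 256, 4)
    for (i = 0; i < n; i++)
        sum += x[i];

    return sum;
}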

C. Word-wide optimization: If you have 16-bit (short) data, access two values at once and then work with the packed 32-bit data. This is also called packed-data processing.

1. Use casts and intrinsics

int dot_product(short *x, short *y, short z) //x,y point to short variables

{

int *w_x = (int *)x; //cast to point to a word

int *w_y = (int *)y;

int sum1 = 0, sum2 = 0, i;

for (i = 0; i < z/2; i++)

{

sum1 += _mpy(w_x[i], w_y[i]); // mpy 16LSB

sum2 += _mpyh(w_x[i], w_y[i]); // mpy 16MSB

}

return (sum1 + sum2);

}

This is equivalent to summing the even-indexed products (i = 0, 2, 4, ...), summing the odd-indexed products, and then adding the two sums.
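
For comparison, a hedged scalar sketch of that same computation (assuming little-endian data ordering and an even z; the function name dot_product_scalar is an illustrative assumption):

int dot_product_scalar(short *x, short *y, short z)
{
    int sum_even = 0, sum_odd = 0, i;

    for (i = 0; i < z; i += 2)
    {
        sum_even += x[i] * y[i];         /* products of even-indexed elements (sum1 above) */
        sum_odd  += x[i + 1] * y[i + 1]; /* products of odd-indexed elements (sum2 above)  */
    }

    return sum_even + sum_odd;           /* same result as sum1 + sum2 */
}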

2. Use _nassert intrinsic to specify that 16-bit short arrays are aligned on a 32-bit (word) boundary.

int dot_product(short *x, short *y, short z)

{

int sum = 0, i;

_nassert (((int)(x) & 0x3) == 0);

_nassert (((int)(y) & 0x3) == 0);

#pragma MUST_ITERATE(20, , 4); // trip count is at least 20 and a multiple of 4, so the compiler can split the loop for optimization

for (i = 0; i < z; i++) sum += x[i] * y[i];

return sum;

}

_nassert tells the compiler that the pointers x and y are on a word boundary (the 2 LSBs of the address are 0). The compiler then knows that it can optimize the following loop with MPY and MPYH.

_nassert is useful in that you do not need to rewrite the code; you simply add a line.

D. Unrolling the Loop - a technique where you expand small loops so that more iterations of the loop appear in your code. The purpose is to allow more instructions to operate in parallel (a hand-unrolled sketch is shown at the end of this subsection).

The compiler will do this automatically if it knows:

1) the trip count, and

2) that the trip count is a multiple of 2.

You can convey this information through MUST_ITERATE:

#pragma MUST_ITERATE (10, ,2);
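
For illustration, a hedged sketch of a loop unrolled by two when written out by hand (the function name vec_add and its arguments are assumptions, and n is assumed to be a multiple of 2); with the trip-count information from MUST_ITERATE, the compiler can perform an equivalent transformation automatically:

void vec_add(short *r, const short *x, const short *y, int n)
{
    int i;

    /* two iterations of the original loop per pass, so more independent
       instructions are available to execute in parallel */
    for (i = 0; i < n; i += 2)
    {
        r[i]     = x[i]     + y[i];
        r[i + 1] = x[i + 1] + y[i + 1];
    }
}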

E. Compiler Feedback - when you compile, the compiler can produce feedback that helps you decide what to optimize and how to optimize it.

III. Linear ASM Optimization – Word-wide optimization

A. Example: Load word and operate on 16-bit components/parts, dot product

.global _dotp

_dotp .cproc x, y

.reg sum, sum0, sum1, cntr

.reg a, b, p, p1

MVK 50,cntr ; cntr = 100/2

ZERO sum0 ; multiply result = 0

ZERO sum1 ; multiply result = 0

LOOP: .trip 50

LDW *x++,a ; load a & a+1 from memory

LDW *y++,b ; load b & b+1 from memory

MPY a,b,p ; a * b

MPYH a,b,p1 ; a+1 * b+1

ADD p,sum0,sum0 ; sum0 += (a * b)

ADD p1,sum1,sum1 ; sum1 += (a+1 * b+1)

[cntr] SUB cntr,1,cntr ; decrement loop counter

[cntr] B LOOP ; branch to loop

ADD sum0,sum1,sum ; compute final result

.return sum

.endproc

B. Example: Load double word and work on 32-bit words for floating point dot product

.global _dotp

_dotp .cproc x, y

.reg sum, sum0, sum1, cntr

.reg a0:a1, b0:b1, p, p1 ; note the register-pair format for a double word

MVK 50,cntr ; cntr = 100/2

ZERO sum0 ; multiply result = 0

ZERO sum1 ; multiply result = 0

LOOP: .trip 50

LDDW *x++,a0:a1 ; load a & a+1 from memory

LDDW *y++,b0:b1 ; load b & b+1 from memory

MPYSP a0,b0,p ; a * b

MPYSP a1,b1,p1 ; a+1 * b+1

ADDSP p,sum0,sum0 ; sum0 += (a * b)

ADDSP p1,sum1,sum1 ; sum1 += (a+1 * b+1)

[cntr] SUB cntr,1,cntr ; decrement loop counter

[cntr] B LOOP ; branch to loop

ADDSP sum0,sum1,sum ; compute final result

.return sum

.endproc

IV. Assembly language code

A. Word-wide optimization, parallel instructions, and reduction of NOPs

Consider the fixed-point dot-product example with no optimization:

MVK .S1 100, A1 ; set up loop counter

ZERO .L1 A7 ; zero out accumulator

LOOP

LDH .D1 *A4++,A2 ; load ai from memory

LDH .D1 *A3++,A5 ; load bi from memory

NOP 4 ; delay slots for LDH

MPY .M1 A2,A5,A6 ; ai * bi

NOP ; delay slot for MPY

ADD .L1 A6,A7,A7 ; sum += (ai * bi)

SUB .S1 A1,1,A1 ; decrement loop counter

[A1] B .S2 LOOP ; branch to loop

NOP 5 ; delay slots for branch

; Branch occurs here

This takes 16 cycles for each iteration. Since there are 100 numbers in each array, this results in 1600 cycles + 2 cycles for the MVK and ZERO.

Optimize by using parallel instructions and reducing NOPs:

If we use both .D units for loading, we can put the two loads in parallel.

Also, the SUB and B do not depend on the other instructions. By moving them up, we can remove the NOP 5 line and reduce the number of cycles. The B instruction must be issued no earlier than 5 cycles before the ADD, so that its 5 delay slots are filled by the remaining instructions of the loop.

MVK .S1 100, A1 ; set up loop counter

|| ZERO .L1 A7 ; zero out accumulator

LOOP

LDH .D1 *A4++,A2 ; load ai from memory

|| LDH .D2 *B4++,B2 ; load bi from memory

SUB .S1 A1,1,A1 ; decrement loop counter

[A1] B .S2 LOOP ; branch to loop

NOP 2 ; delay slots for LDH

MPY .M1X A2,B2,A6 ; ai * bi (the X denotes that the cross path is used)

NOP ; delay slot for MPY

ADD .L1 A6,A7,A7 ; sum += (ai * bi)

; Branch occurs here

This loop requires 8 cycles per iteration. For 100 numbers in each array, this function requires 8*100 + 1 (for initialization) cycles.

We've gone from 1602 to 801 cycles for the function.

Let's do word-wide optimization now:

The loop in linear ASM is:

LDW *x++,a ; load a & a + 1 from memory

LDW *y++,b ; load b & b + 1 from memory

MPY a,b,p ; a * b

MPYH a,b,p1 ; a+1 * b+1

ADD p,sum0,sum0 ; sum0 += (a * b)

ADD p1,sum1,sum1 ; sum1 += (a+1 * b+1)

[cntr] SUB cntr,1,cntr ; decrement loop counter

[cntr] B LOOP ; branch to loop

So the optimized code is now:

MVK .S1 50,A1 ; set up loop counter

|| ZERO .L1 A7 ; zero out sum0 accumulator

|| ZERO .L2 B7 ; zero out sum1 accumulator

LOOP

LDW .D1 *A4++,A2 ; load a & a+1 from memory

|| LDW .D2 *B4++,B2 ; load b & b+1 from memory

SUB .S1 A1,1,A1 ; decrement loop counter

[A1] B .S1 LOOP ; branch to loop

NOP 2

MPY .M1X A2,B2,A6 ; a * b (note that the same source registers feed both multiplies)

|| MPYH .M2X A2,B2,B6 ; a+1 * b+1

NOP

ADD .L1 A6,A7,A7 ; sum0+= (a * b)

|| ADD .L2 B6,B7,B7 ; sum1+= (a+1 * b+1)

; Branch occurs here

This loop takes 8 cycles per iteration, but now we only have 50 iterations. The total cycles are then 8*50 + 2 (for initialization).

We've gone from 1602 to 801 to 402 cycles.

B. Software Pipelining - technique used to schedule instructions from a loop so that multiple iterations execute in parallel. The parallel resources on the C6x make it possible to initiate a new loop iteration before previous iterations finish. The goal of software pipelining is to start a new loop iteration as soon as possible. (this works for linear assembly or assembly)

1. Development of the idea (ignoring NOPs for simplicity)

Consider the loop:

loop:   ldh

     || ldh

        mpy

        add

Cycle    .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
1        ldh  ldh
2                  mpy
3                            add
4        ldh  ldh
5                  mpy
6                            add
7        ldh  ldh
8                  mpy
9                            add

We can see that many of the resources are not used. With pipelining, we can have the following:

Cycle    .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
1        ldh  ldh
2        ldh  ldh  mpy
3        ldh  ldh  mpy       add
4        ldh  ldh  mpy       add
5        ldh  ldh  mpy       add
6                  mpy       add
7                            add

There are three stages to pipelined code: the prolog, the loop kernel (cycles 3-5 in the table above), and the epilog. The prolog contains the instructions needed to build up to the loop kernel; the epilog contains the instructions needed to complete the remaining loop iterations. When a single-cycle loop kernel is set up, an entire loop iteration is executed in one cycle via one parallel instruction that uses the maximum number of functional units.

2. Method - three steps are needed to produce hand-coded, software-pipelined code from an assembly loop:

1) drawing a dependency graph,

2) setting up a scheduling table, and

3) deriving the pipelined code from the table.

3. Dependency Graph

Let’s consider the linear assembly program:

loop: .trip 10 ; exactly 10 iterations through loop

LDH *x++, a ; element of the x array -> a

LDH *y++, b ; put an element of y into b

MPY a, b, prod ; prod = x * y

ADD prod, sum, sum ; sum of products -> sum

SUB count, 1, count ; decrement counter

[count] B loop

First, draw a node and path for each instruction and write the number of cycles needed to complete the instruction. Each circle corresponds to an instruction; inside the circle is the variable written to by the instruction (dst).

Second, assign a functional unit to each instruction and separate the A and B data paths so that the maximum number of units is used and the workload is split between the two paths as equally as possible.

4. Scheduling Table

Identify the longest path; this determines how long the table needs to be.

For our example, the longest path is 8 cycles (LDH: 1 + 4 delay slots, MPY: 1 + 1 delay slot, ADD: 1). This means that 7 prolog columns are needed before entering the loop kernel.

The longest path starts with LDH, so we place a load in cycle 1 and repeat it in every cycle through the loop-kernel column (cycle 8). The MPY must come 5 cycles after its load (1 + 4), so the first MPY falls in cycle 6; there are 8 of these for the 8 loads. The ADD must come 2 cycles after its MPY (1 + 1), so the first ADD falls in cycle 8; there are 8 of these as well.

The branch is scheduled by counting back 5 cycles from the loop kernel (cycle 8), so the branches occupy cycles 3 through 8. The SUB must come 1 cycle before its branch, so the SUBs occupy cycles 2 through 8.

Cycle    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
.D1     LD  LD  LD  LD  LD  LD  LD  LD
.D2     LD  LD  LD  LD  LD  LD  LD  LD
.L1                                  AD  AD  AD  AD  AD  AD  AD  AD
.L2         SU  SU  SU  SU  SU  SU  SU
.S1
.S2              B   B   B   B   B   B
.M1                          MP  MP  MP  MP  MP  MP  MP  MP
.M2

5. Pipelined Code

Write the code from the table. There are two ways:

a. The number of times you go through the loop kernel is equal to the number of dot products you want to compute minus the number of prolog cycles.

Ex. If we have 10 numbers to multiply together and add, then the number of loop-kernel iterations is 10 - 7 = 3, i.e., the loop count must be set to 3.

Assembly code (notice no NOPs are needed):

Cycle 1:

LDH .D1 *A4++,A2

|| LDH .D2 *B4++,B2

Cycle 2:

LDH .D1 *A4++,A2

|| LDH .D2 *B4++,B2

|| [B1] SUB .L2 B1,1,B1 ; need [B1] for loop kernel

Cycle 3-5:

LDH .D1 *A4++,A2

|| LDH .D2 *B4++,B2

|| [B1] SUB .L2 B1,1,B1

|| [B1] B .S2 loop

Cycle 6-7:

LDH .D1 *A4++,A2

|| LDH .D2 *B4++,B2

|| [B1] SUB .L2 B1,1,B1

|| [B1] B .S2 loop

|| MPY .M1X A2,B2,A3

Cycle 8-10:

LDH .D1 *A4++,A2

|| LDH .D2 *B4++,B2

|| [B1] SUB .L2 B1,1,B1

|| [B1] B .S2 loop

|| MPY .M1X A2,B2,A3

|| ADD .L1 A3,A7,A7

Cycle 11-15:

MPY .M1X A2,B2,A3

|| ADD .L1 A3,A7,A7

Cycle 16-17:

ADD .L1 A3,A7,A7

Now it takes 17 cycles to complete the loop.

We could optimize further by combining word-wide optimization with software pipelining.

b. An alternative is to get rid of the epilog cycles and let the loop count equal the number of dot products you want to do.

The result is a program with smaller code size (though it performs some unnecessary loads) and the same number of cycles.

C. Summary

For fixed-point dot-product:

No optimization: 16 cycles/loop * 10 numbers = 160 cycles

Reduced NOPs & parallel instructions: 8 cycles/loop * 10 numbers = 80 cycles

Word-wide optimization: 8 cycles/loop * 5 times through loop = 40 cycles

SW pipelining: 17 cycles