Superscalar Design Space Exploration and Optimization Framework for DSP Operations

Rehan Ahmed

Department of Electrical and Computer Engineering,

University of Wisconsin-Madison, WI,

Email:

Abstract

A design methodology for superscalar architecture optimization for DSP operations is proposed. The optimization considers both performance and power consumption metrics. Two basic approaches are initially evaluated. The first approach follows a heuristic recursive algorithm based on the general working of superscalar architecture. The second approach uses simulated annealing to converge to an optimized configuration. A hybrid approach has also been developed that uses an initial configuration obtained from simulated annealing and refines it further using architectural heuristics. Over 200% improvement over the initial configuration has been achieved through the hybrid method. The results have been validated on several DSP benchmark applications from the MiBench suite [1].

Index Terms— Superscalar, Search and optimization, SimpleScalar, Wattch

1. Introduction

Power consumption has become a primary concern in modern processor design. With the advent of mobile processing platforms, it is becoming increasingly critical to design efficient architectures that take power consumption into account at the initial design stages. This paper presents an approach for guided design space exploration of superscalar architectures. The methodology is a combination of a random search-and-optimization algorithm and a heuristic approach based on the functioning of superscalar architecture.

Previous approaches in the area of superscalar optimization use simulated annealing [2], random search [3], genetic algorithms [4] and multidimensional optimization algorithms [5]. However, these approaches consider only a limited set of architectural parameters, which results in the exploration of a small portion of the actual design space. Furthermore, power consumption and performance have not been considered concurrently in the existing approaches. Other methods for superscalar optimization improve individual functional blocks, such as the cache [6, 7], which can result in an overall reduction in the efficiency and performance of a given architecture. The methodology followed in this paper performs evaluations and optimizations at the global architectural level.

This paper extends our previous work in superscalar architecture design and optimization [8]. In our prior work, a methodology based on heuristics was proposed, from which an automated design and optimization tool (SSOPT) was developed. The optimized architecture specifically targeted sample rate conversion operations for software radio applications. In this paper, architectural improvements have been conducted for generic DSP benchmark applications. Furthermore, the optimization approach has been extended by evaluating two new approaches. Simulated annealing has been used for conducting global architectural optimization, and these optimization results have been compared with those of SSOPT [8]. Based on these results, a hybrid approach has been proposed which utilizes the strong points of both its constituent approaches. During all optimizations, the performance, power consumption and complexity of the architectures have been considered to validate the feasibility of the optimized configurations.

In the remainder of this paper, Section 2 explains the simulation framework developed and used throughout this work. Section 3 describes the working of SSOPT, the optimization tool based on the functioning of superscalar architecture. Section 4 covers the simulated annealing based optimization approach. This is followed by a brief description of the benchmark applications and the optimization results of both approaches in Section 5. The hybrid approach based on the simulation results of Section 5 is given in Section 6, followed by the conclusion and references.

2. Simulation Framework

2.1. Simulation Tools

All optimization methodologies evaluated in this paper consider the power consumption, performance and complexity of superscalar architectures when validating processor configurations. Power and performance measures have been obtained using the SimpleScalar architectural modeling tool [9] and its power extension Wattch [10]. All power estimates are given at an operating frequency of 1 GHz, and a 100 nm CMOS technology is assumed. Conditional clock gating (cc3) has been assumed for all power estimates; this charges full power to active units and 10% of full power to inactive architectural units. Although complexity and area estimates have not been included in the optimization's objective function, they have been computed using the cost model given in [11], for which an automated tool has been developed. The complexity estimates of the optimized architectural configurations have been compared with those of commercial architectures.
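As a simple illustration of this clock-gating assumption, the power charged to a single functional unit in a given cycle can be sketched as follows (the function and constant names are illustrative and not part of Wattch):

#include <stdio.h>

/* Illustrative sketch of the cc3 clock-gating assumption described above:
 * an active unit is charged its full power, an inactive unit 10% of it.
 * The names used here are ours, not Wattch identifiers. */
#define CC3_IDLE_FRACTION 0.10

static double cc3_unit_power(double full_power_watts, int is_active)
{
    return is_active ? full_power_watts
                     : CC3_IDLE_FRACTION * full_power_watts;
}

int main(void)
{
    /* Example: a 2.0 W functional unit, charged when active and when idle. */
    printf("active: %.2f W, idle: %.2f W\n",
           cc3_unit_power(2.0, 1), cc3_unit_power(2.0, 0));
    return 0;
}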

2.2. Benchmark Applications

This paper targets optimizations for a range of complex DSP operations. For this reason, a subset of the MiBench [1] benchmark suite covering communication operations has been used. The applications for which optimizations have been conducted are: (i) FFT, (ii) IFFT, (iii) CRC32, (iv) ADPCM Encode, (v) ADPCM Decode. FFT and IFFT perform the Fast Fourier Transform and its inverse respectively. CRC32 performs a Cyclic Redundancy Check on given data. The ADPCM encode and decode functions implement Adaptive Differential Pulse Code Modulation.

2.3. Gain Evaluation

As mentioned in Section 2.1, the optimization approaches proposed in this paper consider power consumption, performance and complexity estimates for validating configuration changes. All methodologies apply architectural changes iteratively and compare each changed configuration with the previous one. For such a comparison, a parameter 'Efficiency-Gain' is evaluated, which combines performance and power-efficiency measures. The basic expression for efficiency gain is given in equation 1. The parameter takes a weighted sum of the IPC (Instructions per Cycle) gain and the APPI (Average Power per Instruction) gain. The relative precedence given to APPI and IPC can be set before the start of a given optimization by changing the value of 'scale' [8].

Eff_Gain = (IPC_new - IPC_old)/IPC_old + scale × (APPI_old - APPI_new)/APPI_old        (1)

The optimizations covered in this paper target architectures with a high level of power efficiency. For this reason, APPI has been given much higher precedence than IPC: the value of scale has been set to 50 to give the APPI term a higher weight.
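For illustration, this evaluation can be sketched in C as below. The sketch assumes the weighted-sum form of equation 1 with gains expressed as fractional changes relative to the previous configuration; the exact expression is defined in [8].

/* Sketch of the Eff_Gain evaluation of equation 1 (gains are assumed to be
 * fractional changes relative to the previous configuration). */
static double eff_gain(double ipc_old, double ipc_new,
                       double appi_old, double appi_new,
                       double scale)
{
    double ipc_gain  = (ipc_new - ipc_old) / ipc_old;     /* higher IPC is better */
    double appi_gain = (appi_old - appi_new) / appi_old;  /* lower APPI is better */
    return ipc_gain + scale * appi_gain;                  /* scale = 50 in this work */
}

Under these conventions, with scale = 50, a 20% rise in APPI alone corresponds to a gain of -10, which matches the interpretation given in Section 4.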

3. Architectural Optimization Through SSOPT

An automated superscalar optimization tool [8] has been previously developed for architectural optimizations. The tool is based on the general working of superscalar architecture and considers both power consumption and performance for evaluating architectural configurations.

SSOPT conducts the optimization by progressing through various stages. Each stage corresponds to a change in the configuration of one or more architectural parameters. For instance, the first stage covers the optimization of the Branch Table. In each stage, a given set of simulation results is analyzed. If the results show improvement, the configuration change for that particular stage is repeated. In case of deterioration, the configuration change is discarded and SSOPT moves to the next stage. The details of each stage are given in Table 1. To increase the effectiveness of the optimization methodology, the configuration changes in each stage are bidirectional.

Table 1: SSOPT Stage Parameters

Stage / Configuration Change / Monitored Result / dir=0 / dir=1
1 / Branch Table / Branch_Misses / Inc / Dec
2 / BTB / IPC/APPI / Inc / Dec
3 / Return Address Stack / IPC/APPI / Inc / Dec
4 / IFQ, Exec Win, I ALU / IFQ_full, Eff_gain, IPB / Inc / Dec
5 / I ALU / Eff_gain / Inc / Dec
6 / I Mul/Div / Eff_gain / Inc / Dec
7 / FP ALU / Eff_gain / Inc / Dec
8 / FP Mul/Div / Eff_gain / Inc / Dec
9 / RUU / Eff_gain / Dec / Inc
10 / LSQ / Eff_gain / Dec / Inc
11 / I-Compress / Eff_gain / En / En
12 / I-Cache / Eff_gain / Dec / Inc
13 / D-Cache / Eff_gain / Dec / Inc
14 / Instruction TLB / Eff_gain / Dec / Inc
15 / Data TLB / Eff_gain / Dec / Inc
16 / Bus Width / Eff_gain / Inc / Dec
17 / Memory To System Ports / Eff_gain / Inc / Dec
18 / Exit Stage / Eff_gain / Nil / Nil

Another important feature of SSOPT is its recursive nature. After the completion of all 18 stages (an optimization sub-cycle), SSOPT compares the simulation results of the final architectural configuration with those of the initial configuration and calculates the efficiency gain parameter 'Eff_Gain'. If Eff_Gain turns out to be positive, further improvement may be possible. In that case, SSOPT takes the final configuration of the completed optimization sub-cycle as the new initial configuration and goes through all 18 stages again. This gives the algorithm its recursive nature and reduces the dependence of the overall optimization process on the initial configuration. However, due to its heuristic nature, this optimization approach has a tendency to converge to local optima. This effect is reduced to some extent by selecting a scalable initial configuration: to ensure scalability, the sizes of the caches, RUU and LSQ are kept high in the initial configuration.
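The control flow described above can be summarized by the structural sketch below. It is illustrative only: the per-stage configuration changes and monitored results of Table 1 are abstracted into a single efficiency score, and the simulator-facing routines are left as prototypes.

#define NUM_STAGES 18

double run_simulation(void);          /* SimpleScalar/Wattch run, returns an efficiency score */
void   apply_stage_change(int stage); /* effect the configuration change for a stage          */
void   undo_stage_change(int stage);  /* discard the change if results deteriorate            */

void ssopt_optimize(void)
{
    double best = run_simulation();   /* score of the initial configuration */
    double subcycle_gain;

    do {
        double subcycle_start = best;
        for (int stage = 1; stage <= NUM_STAGES; ++stage) {
            for (;;) {                            /* repeat a stage while it keeps improving  */
                apply_stage_change(stage);
                double score = run_simulation();
                if (score > best) {
                    best = score;                 /* keep the change and try the stage again  */
                } else {
                    undo_stage_change(stage);     /* deterioration: move on to the next stage */
                    break;
                }
            }
        }
        subcycle_gain = best - subcycle_start;    /* positive gain: run another sub-cycle     */
    } while (subcycle_gain > 0.0);
}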

4. Architecture Optimization Using Simulated Annealing

In this approach, a canonical search and optimization algorithm based on simulated annealing [12, 2] has been developed. The flowchart followed by the simulated annealing algorithm is given in figure 1. For a given configuration change, the algorithm randomly selects an architectural parameter and its value from a defined set (Table 2). The configuration change is applied and the Eff_Gain value is evaluated according to equation 1. If the configuration change yields a positive gain, it is finalized and the process is repeated. In case of a negative gain, the configuration is accepted with the probability given by equation 2.

Figure 1: Implementation Flowchart of Simulated Annealing Optimization

(2)

In equation 2, 'sim_num' is the index of the current simulation and p is the probability of accepting a configuration. As an additional clause, the probability of accepting a configuration with negative gain can never exceed 0.5. The probability function is plotted against simulation index for various values of gain in figure 2. As illustrated in the figure, the probability of selecting configurations yielding negative gain is higher at the beginning of the optimization process than at its concluding stages. This ensures that a large portion of the design space is explored while the algorithm, more or less, converges to an optimized configuration by the end of the optimization process. It is important to note that, since the value of scale in equation 1 has been set to 50, a gain of -10 corresponds to a 20% increase in APPI, while the effect of IPC is negligible. During the entire optimization process, a total of 250 simulations are conducted and the configuration yielding the maximum gain is selected. These simulations were run on Pentium 4 2.0 GHz machines with 512 MB RAM running the Ubuntu (Linux) operating system. The time required to run 250 simulations varied between 6 and 12 hours depending on the benchmark. The set of parameters and values that simulated annealing considers while effecting configuration changes is given in Table 2. As is evident from the table, a very wide range of configuration parameters and values has been selected. This is in contrast to the approaches presented in [2][3][4][5] and contributes to a relatively higher level of optimization.
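The overall annealing loop can be sketched as follows. The exact form of equation 2 is not reproduced here; the acceptance schedule below is only an assumed one that mirrors the stated properties (acceptance becomes less likely as sim_num grows and as the gain becomes more negative, and is capped at 0.5), and the simulator-facing routines are left as prototypes.

#include <math.h>
#include <stdlib.h>

#define TOTAL_SIMS 250             /* simulations per optimization run, as stated above */

double run_simulation(void);       /* SimpleScalar/Wattch run, returns Eff_Gain         */
void   apply_random_change(void);  /* pick a parameter and value from Table 2 at random */
void   commit_change(void);
void   revert_change(void);

/* Acceptance test for a negative-gain configuration. This is an assumed
 * schedule, not the published equation 2: it cools with the simulation
 * index and never exceeds a probability of 0.5. */
static int accept_negative(double gain, int sim_num)
{
    double temperature = (double)(TOTAL_SIMS - sim_num) / TOTAL_SIMS;  /* cools towards 0   */
    double p = exp(0.1 * gain / (temperature + 1e-6));                 /* gain < 0 here     */
    if (p > 0.5)
        p = 0.5;                                                       /* cap from the text */
    return ((double)rand() / RAND_MAX) < p;
}

void anneal(void)
{
    for (int sim_num = 1; sim_num <= TOTAL_SIMS; ++sim_num) {
        apply_random_change();
        double gain = run_simulation();
        if (gain > 0.0 || accept_negative(gain, sim_num))
            commit_change();       /* keep the change; the best configuration seen so far */
        else                       /* is recorded separately and reported at the end      */
            revert_change();
    }
}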

Figure 2: Probability function


The optimization process has been automated through the development of a software tool. The tool comprises a script file and an ANSI C file. The primary function of the script file is to issue commands to SimpleScalar and to transfer control to the C file and to the software module which gives area estimates. All other operations of the simulated annealing algorithm are managed by the C file. These include: (a) analyzing simulation results; (b) calculating the probability of acceptance in case of negative gain; (c) randomly selecting a configuration change from the set defined in Table 2; (d) finalizing a configuration, either as a result of positive gain or probabilistically in case of negative gain; and finally (e) identifying the optimization termination point.
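As a rough example of task (a), the C module could extract the required statistics from a simulator output file along the lines of the sketch below. The statistic names (sim_IPC and a Wattch cc3 average-power counter) and the parsing details are assumptions about the output format, not a description of the actual tool.

#include <stdio.h>

/* Sketch of result parsing (task (a) above). The statistic names are
 * assumed; real SimpleScalar/Wattch output may differ. */
static int read_stats(const char *results_path, double *ipc, double *avg_power)
{
    char line[512];
    int found = 0;
    FILE *fp = fopen(results_path, "r");
    if (!fp)
        return -1;
    while (fgets(line, sizeof line, fp)) {
        if (sscanf(line, "sim_IPC %lf", ipc) == 1)
            found |= 1;
        else if (sscanf(line, "avg_total_power_cycle_cc3 %lf", avg_power) == 1)
            found |= 2;
    }
    fclose(fp);
    return (found == 3) ? 0 : -1;    /* both statistics must be present */
}

APPI can then be derived from these figures, for example as the average power per cycle divided by the IPC.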

5. Results and Comparisons

5.1. Optimization Results

The results of the optimizations are given in figures 3 through 7. The results show the evolution of 'gain' against simulation index for both the SSOPT and simulated annealing algorithms. Since the value of scale in equation 1 has been set to 50, the APPI gain has a larger contribution to the overall gain value. For the results given in this section, gain has been evaluated using equation 3.

(3)

It is evident that in most cases, simulated annealing performs much better than the methodology followed by SSOPT. The primary reason is that, due to the heuristic nature of its algorithm, SSOPT tends to converge to local optima. Its final configuration is dependent on the initial configuration, and hence only a small portion of the actual design space is explored. Simulated annealing does not have this limitation; the possibility of probabilistically selecting configurations that give negative gain ensures that a large portion of the design space is explored. However, the random nature of the simulated annealing algorithm means that its final configuration can usually be optimized further. This is the main reason why SSOPT outperforms simulated annealing in the FFT results.

Figure 3: FFT Results

Figure 4: IFFT Results

Figure 5: CRC Results

Figure 6: PCM Decode Results

Figure 7: PCM Encode Results

6. Hybrid Methodology

Based on the simulation results of the previous section, a hybrid approach has been developed which achieves better optimization than either of the approaches presented in Sections 3 and 4. The idea is to select the architectural configuration which gives the maximum gain in the simulated annealing results; the selected configuration is then used as the initial configuration for the SSOPT algorithm. The flowchart for the hybrid approach is given in figure 8. This approach utilizes the strong points of both optimization algorithms: the first phase, comprising the simulated annealing algorithm, converges to a more or less global optimum, while the later phase, comprising SSOPT, fine-tunes the configuration given by simulated annealing and provides further improvement. Results of this approach are given in figures 9 through 13 and illustrate the evolution of gain versus simulation index, with the value of gain calculated according to equation 3.
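For illustration, the two-phase flow can be sketched as follows; it simply composes the two optimization modules described in Sections 4 and 3. The configuration type and function names are illustrative and not part of the actual tool.

/* Structural sketch of the hybrid methodology: simulated annealing
 * provides a near-global starting point, SSOPT then fine-tunes it. */
typedef struct {
    /* architectural parameters: cache sizes, RUU/LSQ depth, functional
     * units, bus width, etc. (abstracted here) */
    int params_placeholder;
} config_t;

config_t simulated_annealing(config_t initial, int num_sims); /* phase 1 (Section 4) */
config_t ssopt(config_t initial);                             /* phase 2 (Section 3) */

config_t hybrid_optimize(config_t initial)
{
    /* Phase 1: 250 annealed simulations; keep the best configuration found. */
    config_t best_sa = simulated_annealing(initial, 250);
    /* Phase 2: use that configuration as SSOPT's starting point. */
    return ssopt(best_sa);
}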