Analyzing the Xilinx Virtex-II Pro PowerPC with the Dhrystone Benchmark Application

David Petrick

NASA – Goddard Space Flight Center

Code 564

Introduction

The Xilinx Virtex-II Pro FPGA is available with one or two PowerPC405 processor cores embedded within the FPGA fabric. To determine whether flight-based systems can benefit from this technology, NASA is performing ongoing testing of this device, particularly of how the embedded PowerPCs function in a radiation environment[1]. Custom applications that test the static and dynamic operation of the PowerPC have been implemented for recent radiation experiments. In addition to these custom applications, the dynamic testing phase needs a well-known processor benchmark application to aid the assessment of embedded PowerPC performance.

The Dhrystone benchmark application is used to test processor performance and how well a compiler optimizes the code. The performance is measured in Dhrystone MIPS, or DMIPS. Similar work was done by Xilinx (Application Note 507), which reports that the PowerPC produces 600+ DMIPS at 400 MHz. However, these results were obtained by using the WindRiver Diab DCC 5.2 compiler, not the compiler supplied with Xilinx’s Embedded Development Kit (EDK). This report lists the DMIPS results using the GNU-GCC 3.4 compiler and shows how varying the bus frequency, processor frequency, and compiler settings affect the DMIPS performance. It also describes the performance variations between commercial devices, speed grades, and the left and right processors.

FPGA Design

The Dhrystone application design was built using Xilinx EDK 7.1.02, which invokes the Xilinx ISE 7.1.04 hardware compilation toolset. The FPGA architecture is straightforward, and can be implemented on any Virtex-II Pro part. The processor local bus has only two peripherals: (1) a 64 KB BlockRAM that holds the Dhrystone instructions and data storage, (2) a 115,200-baud UART controller. A digital clock manager (DCM) uses a 100-MHz reference clock to derive the processor clock and the FPGA logic clock (processor bus clock). In total, only six pins are needed for this design: (1) clock, (2) reset, (3-6) UART signals.

The majority of testing was done using the XRTC board configuration, which is populated with an XQR2VP40[2] device with a –5 speed grade. The application was also run on other Virtex-II Pro devices of different logic densities and speed grades in order to understand how the performance varies between device types.

Software Application

The original Dhrystone application uses a scanf function to prompt the user for the number of iterations to run. This function was removed and replaced with a hard-coded value so that the program fits within the 64 KB BlockRAM. XAPP507 discusses the minimum number of iterations needed to achieve 1% error. Worst-case testing that varied the number of iterations from 1 to 350 million showed that the difference in the resulting DMIPS was less than 0.2. Therefore, the iteration value was hard-coded to one million in order to gather all results in a reasonable time.

The GNU-GCC compiler supplied with the Xilinx EDK was used to build all software code. Under the compiler options menu, the user can select from a range of optimization settings and enable global pointers (GP). These settings were tested using the Dhrystone code with the PowerPC operating at 100-MHz. The results are shown in Figure 1.

The Level 3 optimization setting, which enables inlining, resulted in 112.5 DMIPS. Enabling global pointers at Level 3 resulted in 113 DMIPS. For all tests, the Level 3 (GP) setting was used to compile the Dhrystone code. In addition, both instruction and data caches are enabled for all tests. Running the tests without the caches enabled results in an enormous performance hit (up to 15x reduction).

Figure 1 – Compiler Optimization Results

Results

As discussed in Xilinx App Note 640, the PowerPC can only be operated at integer multiples of the local bus clock. Tests were set up to violate this specification, for example with a bus clock of 100-MHz and a CPU clock of 150-MHz, and in these cases the FPGA/PowerPC design did not operate. In addition, the Virtex-II Pro data sheet DS083-1 specifies that the maximum PowerPC clock frequency for speed grades of –5, –6, and –7 is 300-MHz, 350-MHz, and 400-MHz, respectively. However, valid results were obtained when testing the PowerPC within a –5 device at 400-MHz. A working design was not obtained when clocking the PowerPC above 400-MHz, specifically at 425, 450, or 500-MHz. This is most likely a result of the limitations of the DCM within the Virtex-II Pro.
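The integer-multiple constraint determines which CPU clocks can be tested for a given bus clock. The following is a minimal sketch of that rule only; the function names are hypothetical, and the 400-MHz ceiling is taken from the highest working clock reported here, not from the data sheet limit for a given speed grade.

```python
def is_valid_ratio(cpu_mhz, bus_mhz):
    """True when the CPU clock is an integer multiple of the bus clock."""
    return cpu_mhz % bus_mhz == 0


def valid_cpu_clocks(bus_mhz, max_cpu_mhz=400):
    """List every CPU clock (MHz) that is an integer multiple of the bus
    clock, up to an assumed ceiling (400 MHz, the highest working clock
    observed in this report)."""
    return [bus_mhz * n for n in range(1, max_cpu_mhz // bus_mhz + 1)]


if __name__ == "__main__":
    for bus in (25, 50, 100):
        print(bus, valid_cpu_clocks(bus))
    # The 100-MHz bus / 150-MHz CPU case that failed to operate:
    print(is_valid_ratio(150, 100))
```

Under these assumptions, bus clocks of 100, 50, and 25 MHz yield 4, 8, and 16 candidate CPU clocks, which matches the number of rows recorded for each bus speed in Appendix 2.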

Three different bus frequencies were used for this test: 25, 50, and 100-MHz. For each bus speed, data were recorded at all possible values of the CPU clock up to 400-MHz. The core FPGA current and Dhrystones/sec were recorded for each test. The DMIPS value was then calculated by dividing Dhrystones/sec by 1757. All data are listed in Appendix 2.
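The DMIPS conversion can be illustrated with a short sketch. The divisor 1757 is the Dhrystones/sec score of the VAX 11/780 reference machine, which is defined as 1 DMIPS. The function names are hypothetical, and the sample values are the 100-MHz bus rows from Appendix 2.

```python
# Dhrystones/sec of the VAX 11/780 reference machine (defined as 1 DMIPS).
VAX_DHRYSTONES_PER_SEC = 1757


def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystones/sec score to DMIPS."""
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec, cpu_mhz):
    """Normalize DMIPS by the CPU clock frequency in MHz."""
    return dmips(dhrystones_per_sec) / cpu_mhz


if __name__ == "__main__":
    # (CPU MHz, Dhrystones/sec) pairs from Appendix 2, 100-MHz bus clock.
    samples = [(100, 198807), (200, 397614), (300, 596420), (400, 795227)]
    for cpu_mhz, score in samples:
        print(cpu_mhz, round(dmips(score)),
              round(dmips_per_mhz(score, cpu_mhz), 2))
```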

Figure 2 plots the DMIPS over the range of CPU frequencies for each of the three bus frequencies. This plot shows that the processor performance is linearly proportional to the CPU frequency over the entire range, at 1.13 DMIPS/MHz. Furthermore, since all three plots are identical, the frequency of the processor bus has no effect on the performance of the PowerPC processor. At 400-MHz, the PowerPC is capable of 453 DMIPS using the GNU-GCC compiler, as opposed to 628 DMIPS using the WindRiver Diab DCC compiler (XAPP507 results in Appendix 1). Therefore, using the WindRiver DCC compiler provides a performance increase of roughly 39% over the GCC compiler (628/453 ≈ 1.39).
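As a rough check on the linearity claim and the compiler comparison, the sketch below fits a line through the origin to the 100-MHz bus DMIPS values from Appendix 2 and computes the ratio of the two 400-MHz results. The helper name is hypothetical, and a fit through the origin is an assumption chosen here for simplicity, not necessarily the method used to produce Figure 2.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope of y = m*x (line forced through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# CPU clock (MHz) and DMIPS from Appendix 2, 100-MHz bus clock.
cpu_mhz = [100, 200, 300, 400]
dmips_gcc = [113, 226, 339, 453]

slope = slope_through_origin(cpu_mhz, dmips_gcc)

# 400-MHz results: Diab DCC (XAPP507) versus GNU-GCC (this report).
DMIPS_DIAB_400 = 628
DMIPS_GCC_400 = 453
speedup = DMIPS_DIAB_400 / DMIPS_GCC_400

if __name__ == "__main__":
    print(f"slope = {slope:.3f} DMIPS/MHz")
    print(f"Diab/GCC ratio = {speedup:.3f}")
```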

Figure 2 – DMIPS Variation in CPU and Bus Clocks

Figure 3 plots the core FPGA power as a function of the CPU frequency for each of the three bus speeds. All three plots are linear and only differ by a few milliwatts for a given processor frequency. Adding processor peripherals and other logic would increase the FPGA power consumption, but the power drawn by the PowerPC would remain constant. Using the data gathered from these tests, the average power consumption of the PowerPC is 0.88-mW/MHz. This result agrees with the Virtex-II Pro data sheet, which specifies that the low power consumption of the PowerPC core is 0.9-mW/MHz.
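The per-MHz power figure can be reproduced approximately from the Appendix 2 data. The sketch below converts core current to power using VCCINT = 1.5 V and takes the least-squares slope of power against CPU frequency for the 100-MHz bus rows. The helper name is hypothetical, and using a regression slope is an assumption; the report does not state exactly how the 0.88-mW/MHz average was computed.

```python
VCCINT_VOLTS = 1.5  # FPGA core voltage from the Appendix 2 footnote


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# CPU clock (MHz) and core current (mA) from Appendix 2, 100-MHz bus clock.
cpu_mhz = [100, 200, 300, 400]
core_current_ma = [173, 237, 294, 350]

# mA times volts gives mW.
power_mw = [i * VCCINT_VOLTS for i in core_current_ma]
mw_per_mhz = least_squares_slope(cpu_mhz, power_mw)

if __name__ == "__main__":
    print(power_mw)
    print(f"{mw_per_mhz:.3f} mW/MHz")
```

Taking the slope rather than dividing total power by frequency excludes the fixed power drawn by the rest of the FPGA, so the result reflects only the portion that scales with the processor clock.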

Figure 3 – Power Consumption Variation in CPU and Bus Clocks

A subset of these tests was run on XC2VP7 and XC2VP70 devices (bus clock of 100-MHz), both with a –6 speed grade (a –7 speed grade device was not readily available). On either FPGA, no result differed by more than 1 Dhrystone/sec from the data collected on the XQR2VP40 –5 device. The same subset of tests was also run using the left processor instead of the right processor on a dual-processor device. The left processor was tested because Xilinx Application Note 755 makes reference to slight differences from the right. Again, no result on the left processor differed by more than 1 Dhrystone/sec from the data collected on the right processor.

Appendix 1: Xilinx Application Note 507 Results

Appendix 2: Dhrystone Test Results Using EDK GNU-GCC Compiler

Clock Freq (MHz) / Dhrystone Results[3]
Bus / CPU / Current (mA) / PPC mW/MHz[4] / Power (mW) / Dhrystones/sec / DMIPS/MHz / DMIPS
100 / 100 / 173 / 0.96 / 259.5 / 198807 / 1.13 / 113
100 / 200 / 237 / 0.96 / 355.5 / 397614 / 1.13 / 226
100 / 300 / 294 / 0.86 / 441 / 596420 / 1.13 / 339
100 / 400 / 350 / 0.84 / 525 / 795227 / 1.13 / 453
50 / 50 / 134 / 0.84 / 201 / 99403 / 1.13 / 57
50 / 100 / 162 / 0.84 / 243 / 198807 / 1.13 / 113
50 / 150 / 193 / 0.93 / 289.5 / 298210 / 1.13 / 170
50 / 200 / 221 / 0.84 / 331.5 / 397613 / 1.13 / 226
50 / 250 / 252 / 0.93 / 378 / 497016 / 1.13 / 283
50 / 300 / 280 / 0.84 / 420 / 596419 / 1.13 / 339
50 / 350 / 311 / 0.93 / 466.5 / 695822 / 1.13 / 396
50 / 400 / 339 / 0.84 / 508.5 / 795224 / 1.13 / 453
25 / 25 / 108 / 0.84 / 162 / 49702 / 1.13 / 28
25 / 50 / 122 / 0.84 / 183 / 99404 / 1.13 / 57
25 / 75 / 138 / 0.96 / 207 / 149105 / 1.13 / 85
25 / 100 / 154 / 0.96 / 231 / 198807 / 1.13 / 113
25 / 125 / 167 / 0.78 / 250.5 / 248508 / 1.13 / 141
25 / 150 / 182 / 0.90 / 273 / 298210 / 1.13 / 170
25 / 175 / 196 / 0.84 / 294 / 347911 / 1.13 / 198
25 / 200 / 211 / 0.90 / 316.5 / 397612 / 1.13 / 226
25 / 225 / 226 / 0.90 / 339 / 447314 / 1.13 / 255
25 / 250 / 241 / 0.90 / 361.5 / 497015 / 1.13 / 283
25 / 275 / 255 / 0.84 / 382.5 / 546716 / 1.13 / 311
25 / 300 / 270 / 0.90 / 405 / 596417 / 1.13 / 339
25 / 325 / 285 / 0.90 / 427.5 / 646117 / 1.13 / 368
25 / 350 / 300 / 0.90 / 450 / 695818 / 1.13 / 396
25 / 375 / 314 / 0.84 / 471 / 745519 / 1.13 / 424
25 / 400 / 329 / 0.90 / 493.5 / 795220 / 1.13 / 453

NASA – GSFC 10/31/2018

[1] This work is being done in conjunction with the Xilinx Radiation Test Consortium (XRTC)

[2] The device had the following markings: XQR2VP40 FF1152CGB0517 D15519A -5R

[3] Measurements recorded at room temperature

[4] FPGA core voltage VCCINT = 1.5 volts