Smith-Waterman Reconfigurable Computing System

and Performance Analysis Model

Guangming Tan1 Youying Zhou2 Shengzhong Feng 1

Ninghui Sun1

NationalResearchCenter for Intelligent Computing Systems,Institute of

Computing Technology, ChineseAcademy of Sciences1

ZheJiangUniversity, China2

Abstract: Reconfigurable computing(RC) has become a subject of a great deal of research, owing to its greatly accelerator in a wide variety of applications. However, its application is limited by developing tools and programming environment, also designer can not analysis the bottleneck of entire reconfigurable system because of the lack of the methods and tools of performance evaluation. Smith-Waterman RC system provides a high-performance reconfigurable computing environment for sequence alignment in biology. Its hardware consists of a FPGA module that uses the DIMM bus as co-processor, which has high-bandwith and low-latency. The module is integrated with 64-bits Linux operate system and application programming interface is implemented. Based on the theoretical performance analytical model, the reconfigurable computing system is simulated and the bottleneck can be given.

Keywords: Reconfigurable Computing (RC), FPGA, dynamic programming, performance analysis

1

1Introduction

Reconfigurable computing (RC) based on FPGA has been implemented in many computing intensive application, such as cryptography, signal processing, image processing and bioinformatics[9,10,11,12].The fact that no good developing tool is available makes the process of developing a reconfigurable system very difficult, hardware and software engineers must cooperate in defining a clear interface between hardware and software. In the application of RC system, there is no standard mechanism to partition hardware and software. Because the cores of RC system are not as general as CPU and all systems are confined to some application domain, there is no general benchmark and performance evaluation method.

Bioinformatics has been a very important research field and its requirement on computing ability has surpassed available high performance computing. RC has demonstrated the potential for achieving high performance for many applications, and it has being widely used in bioinformatics in recent years with the developing of RC. Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence search algorithm is that of Smith-Waterman. However, because of the computationally demanding nature of this method, heuristic programs have been developed[13][14]. Increased speed has been obtained at the cost of reduced sensitivity. For high-speed implementation of the Smith-Waterman algorithm, some have exploited the single instruction, multiple data (SIMD) computers. Several special-purpose hardware solutions have been also developed for Smith-Waterman algorithm, such as Splash[19], SAMBA[20], Paracel[15], Celera[17], JBits[16]. These machines are able to process more than 2000 million matrix cells per second, and can be expanded to reach much higher speeds. We implemented a RC platform base on FPGA about Smith-Waterman [4]algorithm, which is the kernel of many algorithms in bioinformatics, and provided a policy for hardware/software codesign, besides, We build a theoretically analytical performance model and evaluation method[1,2,3].

In the following sections, firstly, we simply introduce Smith-Waterman algorithm. Secondly,

We discuss the core design of FPGA and system integration, which include the partition between hardware and software. In the end, we build a performance analysis model and show the results of simulation.

2Smith-Waterman Algorithm

Sequence alignment is a very common problem in biology, which searches the similarity among sequences. Mostly search tools are based on Smith-Waterman dynamic programming(DP) algorithm. The two most popular tools are BLAST[13] and FASTA[14].Smith-Waterman algorithm [7]consists of two main progresses:

  1. Given scoring scheme, it fills scoring matrix, which is DP matrix, by the similarity between two sequences.
  2. It backtracks the best search path in dynamic programming matrix, then get the best alignment.

For example, given two sequence S and T, n=|S| and m=|T| respectively represent their lengths. V(i,j) is the best alignment scores between two subsequence S: S[1]…S[i] and T: T[1]…T[j].Edit distance defines the scoring.scores substitution,scores deletion, scores insertion. Then, DP formulation:

1≤i≤n,1≤j≤m (1)

The matrix V(i,j) is DP formulation, the algorithm’s time complexity is O(m*n).

3Core Design

The major components of the Smith-Waterman module’s core are Control Unit, Systolic Array and Memory Unit, as show in Figure 1.The whole Smith-Waterman modules attaches to DIMM memory bus and be mounted in DIMM memory slot with high bandwidth and low latency communication to host CPU[6].

Control Unit connects to the DIMM bus. It is responsible for transferring data from/to the DIMM bus to/from the memory on Smith- Waterman module or among memory on module. Control Unit reads data from DIMM bus and transfers them to some memory on module, then it generates address signal, by which systolic array can read data to compute. The temporary data also must be wrote back to some memory on the module, the last results are wrote to DIMM bus. Control Unit also performs reconfiguration by the configuration bitstream.

Systolic Array[7,8] perform actual computing. It accesses the input/output FIFO data line, which is controlled by Control Unit. Many processor units consist of Systolic Array and pipeline the data.

Memory Units are input/output SDRAM FIFO buffers. Respectively, there are two FIFO buffers to the two side of Systolic Array. These memory banks save the input/output data of Systolic Array and have parallel I/O ports. For fast reconfiguration, configuration bitstream can be held in the on-board RAM.

Figure 1.Architecture Module

4System Integration

We integrated Smith-Waterman module with 64-bits Linux operating system on AMD Opteron 64-bits processor. The module, which connects to DIMM bus, is considered as a memory device in system. Transferring data between host CPU and module is the same as transferring data between host CPU and main memory.

System integration comprise the partition of software and hardware, we partition software by fixed-function and application domain specific function. To define a clear hardware/software interface, we lay hardware as well.

Software layers comprise kernel driver, library/

API, and application program. Kernel driver has fixed-function, which is to set up the PC’s memory controllerand provide low-level access to module’s firmware at runtime. Application software, which has some generally abstract characteristics, can access module through using library/API rather than calling the module kernel driver directly. Library/API implements the wrapper function in kernel deriver and generally standard functions for application software. Application software layer runs on host CPU and module, which is considered as a coprocessor.

Hardware module’s firmware implements kernel driver and library, application software can be mapped to Control Unit and Systolic Array in Smith-Waterman reconfigurable module.

Kernel Driver and Firmware

The main function of kernel driver is to probe device and to provide low-level methods so that operate system and application software can access device. Kernel driver can read/write the registers in Control Unit.

When a PC is booted, the BIOS detects all memory modules in the system and read the data in PROM, through which operating system initializesmemory controller and maps memory to CPU’s address space. However, Linux abandons the BIOS’s data, it is the device driver that detects and initializes devices after loading OS. Firstly, The module’s driver probes all DIMM slots, re-program the memory controller when it find a memory device. Then, the module is assigned a certain physical memory area, besides, the physical memory area is mapped to Linux address space. Kernel driver operate on device through memory-mapped I/O and hide the details of firmware.

Although Smith-Waterman module is a memory-mapped I/O device, due to the affection of Cache, the mode host CPU access general main memory is not the same with the mode host CPU access a memory-mapped I/O device. AMD 64 architecture has two level Cache—L1 data Cache,L1 instruction Cache and L2 data/instruction Cache. It defines six kinds of memory mode[15]: Uncacheable(UC), Cache Disable(CD),Write Combining(WC),Write Pro-

tect(WP),Write Through and Write Back(WB).

UC memory is not cacheable and is useful for memory-mapped I/O devices where strict ordering of reads and writes is important. WB provides the highest-possible performance, but it may cause unexpected results because of out-of-order reads and writes.

Smith-Waterman reconfigurable module has several FIFO buffers, which are mapped to CPU address space, strict ordering of reads and writes is necessary, so it is suitable to UC memory mode. In AMD 64-bits processors, Memory Type Range Registers(MTRRs) are used to control processor access to memory.

Linux kernel driver sets the value of registers and controls communication among sub-modules.Firmware’s functions are to reconfigure module and control memory access. Memory access controller regulates the FIFO buffer’s actions. FIFO buffer cannot connect to DIMM slot directly and DIMM bus interface cannot access FIFO buffer. The data transferring which is controlled by DMA can be implemented in Control Unit as firmware.

Reconfiguring firmware reprograms the port of systolic array so that module can be reconfigured or reset. Data bitstream come from host CPU or on-board RAM. Besides, firmware may retrieve status information such as memory controller status and DMA channel.

Library/API

Through classifying some algorithms, we can abstract their common kernel, by which it is feasible to hardware/software codesign. For example, Smith-Waterman DP is used as kernel of many sequence alignment tools in biology, the best popular one of which is BLAST.

Based on Smith-Waterman DP reconfigurable module, we implement a standard library for application program. Software designer needn’t care about how hardware is implemented. API wrapper the callers to kernel driver such as initializing hardware module and managing buffers in driver.

5Performance Analysis Model

A few efforts have been made to produce a unified development environment for RC systems, we do not hold out much hope that that meaningful broad-based analytical performance model and analysis tools. For Smith-Waterman

Figure 2.RC Model: Finite Automata

RC system, we build a simple mathematical model for performance analysis.

Both reconfigurable computing and massive parallel computing are high performance computing(HPC),Performance is commonly measured in term of speedup and efficiency. The basic definition for speedup in HPC is the ratio of the execution time(T(1) )of the serial program on a single processor to the parallel execution time(T(N)) of the parallel program on an N-processor parallel system:

S(N) = T(1)/T(N)(2)

Efficiency is defined as the ratio of speedup to the number of processor:

E(N) = S(N)/N (3)

In the RC environment, the focus will be on FPGA configuration, processor to FPGA communication, data distribution between FPGA and processor, computing time of hardware and software, memory access time and configuration setup costs.

Speedup in RC is defined as the ratio of the execution time(T(1)) to the execution time in RC environment:

S(RC) = T(1)/T(RC) (4)

Our RC analytical model is a finite automata. It comprises five states: startup, configuration, data transferring, computing and halt. as shown in Figure 2.

Startup: The program running on host CPU need to startup RC module and will run on both host CPU and RC module

Configuration: Reading configuration bit-stream from host CPU or local RAM to configure RC module. Reconfiguration may be full or partial.

Data transferring: Data transferring exist in both between host CPU and RC module and among the memory bank in RC module.

Computing: The software running on host CPU and program running on RC module can parallel. Multiple RC module also can parallel or pipeline.

Halt: The program in RC module is end.

Assume that tsetup , tconfig,tcomm and tcomp is the time of startup, configuration, data transferring and computing, respectively. Then the time of only in RC module(T1(RC)):

T1(RC) = tsetup + tconfig+ tcomm+ tcomp (5)

In application program, there are some parts which cannot be executed in RC. The time of programs which must run on host CPU is tserial., then the time of in RC system:

T(RC) = tserial + T1(RC) = tserial + tsetup + tconfig+ tcomm+ tcomp (6)

tsetup , tconfig,tcomm , tcomp and tserial are expected

1

Table 1. Execution Time and Speedup. Host CPU: 1.6GHz Opteron

sequence clock cycles/time(us) single CPU time(us) data transfer(us) speedup
1KB 8704 / 9 64001 3.15 5267
5KB12704 / 13 284274 5.64 15250
10KB 17696 / 18 541319 8.77 20221
30KB 37696 / 38 1567802 21.27 26451

1

values. For multiple RC module, we change the

value of tcomm and tcomp:

T(RC) = tserial + T1(RC) = tserial + tsetup +(n-r)tconfig+ (n-d)tcomm+ tcomp (7)

Where r is the number of RC module not requiring a new configuration and d is the number of RC module not requiring a new data set.

The RC model is an iterative model of configuration, data transferring and computing.

In order to exploit the parallel, data transferring is divided into two processes: data input and data output. There will exist some overlap between data transferring and hardware computing through pipelining data input, computing and data output. In most system, the bottlenecks often are bus bandwidth and speed. Computing and data transferring can wholly overlap in some limited case, then:

T(RC)limited = tserial + T1(RC) = tserial + tsetup + tconfig+ tcomm (8)

6 Simulation Results

Base on our analytical performance model, we implemented C simulator for the whole Smith-Waterman RC system. C simulator implemented the detailed design of FPGA and we can measure the execution time of hardware by counting the number of clock cycles. For simplicity, when the RC hardware is computing,the software in host CPU don’t perform computing and only transfer data, so tserial in RC system is the same with tserial in single CPU and we only need to simulate the execution time of RC hardware. In C simulator, the PE in systolic array is 3840,the frequency of FPGA is 100MHz, 200Mhz/64bits data lines of DIMM bus.The execution time of RC and single CPU for the alignment between one sequence which length is 3840 bytes and multiple sequence show as Table 1.

In Smith-Waterman RC system based on DIMM slot, the bottleneck is hardware computing, it is necessary to increase the frequency of FPGA. However, the cost of communication shouldn’t be overlooked, we may improve DIMM bus. We can interpret the reason for the use of DIMM memory bus instead of PCI bus from our analytical performance model. For example, PCI66MHz/64 bits bus, raw data transferring speed is 528MB/s, the time of data communication is 18.93us for 10KB sequence and 56.82us for 30KB sequence which is much more than the time of computing, 38us.The bottleneck is data communication for RC system based on PCI, so RC system based on DIMM memory bus is a better alternative.

7Future Work
We will integrate Smith-Waterman RC module with Linux cluster system, in which each node has 4P Opteron 64-bits processors. The combination of RC and massive parallel computing may be a new challenge.

Firstly, a synchronization policy should control the communication between multiple processors and RC hardware. A new library and API will be implemented. Secondly, we’ll develop application and programmingenvironment for cluster system with RC. However, for easily finding bottleneck, we will start with building a theoretical analytical model for cluster system with RC.

8Acknowledgments

The Work was supported by 863 HongKong University Grid Systems (2002AA104530), National Nature Science Foundation of China (60372040) and knowledge innovative project of Chinese Academy of Sciences ( KSCX2-SW-233). The author would like to thank Dr. Xianyan Jang, Dr. Peiheng Zhang, Hao Sun, Nan Zhang and Lin Xu for their help in implementing C simulator.

References

1D Andrews and D Niehaus, Programming Models for Hybrid CPU/FPGA Chips, IEEE Computer, January 2004,pp 118-120

2K Compton and S Hauck, Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, Vol 34,

No.2 June 2002,pp. 171-120

3K Bondalapati and V K. Prasanna,Reconfig-

urable computing: Achitechtures,models and algorithms.Current Science,Vol.78.7,10

ARRIL 2000

4Jiang Tao,Ying Xu,Michael Q. Zhang Current Topics in Computational Molecular Biology,Tsinghua University Press,The MIT Press 2002

5K Bondalapati and V K.Prasanna,Reconfig-

urable Computing Systems. Proceedings of the IEEE, July 2002

6D.Tong,P.Lo,K.Lee and P. Leong,A System Level Implementation of Rijndael on a Memory-slot base FPGA Card.In proceedings of the 2002 International Conference on Field Programmable Technology(FPT),pp. 102-109,2002

7C.W. Yu, K.H. Kwong, K.H. Lee and P.H.W. Leong, A Smith-Waterman Systolic Cell, Proceedings of the Tenth International Workshop on Field Programmable Logic and Applications (FPL'03), Lisbon, pp. 375-384, 2003

8I.M.Bland and G.M.Megson,The Systolic Array Genetic Algorithm, An Example of Systolic Arrays as a Reconfigurable Design Methodology. IEEE Symposium on FPGAs for Custom Computing Machines.April 15-17,1998,pp 260-261

9A. Dandalis, V. K. Prasanna, and J. D. P. Rolim. An Adaptive Cryptographic Engine for Internet Protocol Security Architectures. IEEE Symposium on Field-Programming Custom Computing Machines, April 2000.

10A. Dandalis and V. K. Prasanna. Fast parallel implementation of DFT using configurable devices. International Workshop on Field Programmable Logic and Applications, September 1997

11Yongwha Chung and Viktor K. Prasanna, Parallizing Image Feature Extraction on Coarse-Grain Mahcines, IEEE Transactions on PAMI, Vol. 20, No. 12, pp. 1389-1394, 1998.

12

13

14

15

16S. A. Guccione and E Keller. Gene matching using JBits, 2002.

17Celera genomics, inc, 2003.

18AMD64 Architecture Programmer’s manual Volume2: System Programming

19D. T Hoang. A systolic array for the sequence alignment problem. Brown University, Providence, RI, Technical Report, pages CS-92-22, 1992

20D Lavenier. SAMBA: Systolic Accelerators for Molecular Biological Applications. 1996