A Research Project of

TMS320C80 – Architecture and Applications

Sam Tran

CENG 6332 – High Performance Computer Architecture

Instructor: Dr. Liwen Shih

May 11, 2004

Figure - 1 TMS320C80 architecture

TMS320C80 – Multimedia Video Processor

Digital cameras, DVD players, HDTV sets, mobile phones, optical devices, medical diagnostic devices, space exploration devices, and similar products all need image or audio processing. The TMS320C80 was introduced to increase the performance of these techniques. Its parallelism, speed, and on-chip storage make it well suited to the applications above, and it can also be used in other fields. This project focuses on the structure of the chip and on possible applications that exploit its power.

TMS320C80 – A superior architecture

With a 50 MHz RISC master processor (MP), four parallel DSP processors, 50 KB of on-chip storage, a crossbar bus, and a transfer controller (TC), the C80 was considered a high-performance DSP processor. In addition, the on-chip video controller and emulation unit help users in video applications as well as in debugging.

Block diagram

See Figure - 1.

High performance master processor

The master processor is a 32-bit RISC CPU that can deliver 50 MIPS at 50 MHz. A 4K instruction cache and a 4K data cache reduce access latency. In addition, the CPU can perform multiply, add, load, and store operations in parallel. In particular, its floating-point unit has a set of vector instructions that support matrix multiplication, DCT, and FFT, which speeds up image processing, audio processing, and 3-D graphics applications.

The floating-point unit (IEEE-754) is very fast at arithmetic. For example, in one cycle it can perform an add, subtract, compare, or conversion on single- or double-precision numbers. A multiply needs only 1 cycle for single precision and 4 cycles for double precision. Divide and square root take longer: divide takes 6 cycles for single and 20 cycles for double precision, and square root takes 9 and 26 cycles respectively. The unit also has a 3-stage pipeline and can perform a multiply, an add, and a 64-bit load in parallel.
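To make the cycle counts above concrete, the following minimal sketch adds them up for a small operation mix. It assumes the operations run strictly one after another with no pipelining or overlap, which is a pessimistic simplification; the names `FPU_CYCLES`, `serial_cycles`, and the example workload are illustrative only and are not part of any TI tool.

```python
# Cycle counts quoted in the text, as (single precision, double precision).
FPU_CYCLES = {
    "add": (1, 1),
    "sub": (1, 1),
    "cmp": (1, 1),
    "cvt": (1, 1),
    "mul": (1, 4),
    "div": (6, 20),
    "sqrt": (9, 26),
}

CLOCK_HZ = 50_000_000  # 50 MHz master processor clock from the text


def serial_cycles(ops, double=False):
    """Sum cycles for a dict of {operation: count}, assuming no overlap."""
    column = 1 if double else 0
    return sum(FPU_CYCLES[name][column] * count for name, count in ops.items())


# Hypothetical workload: normalizing a 3-element vector
# (3 multiplies, 2 adds, 1 square root, 3 divides).
normalize3 = {"mul": 3, "add": 2, "sqrt": 1, "div": 3}

for label, is_double in (("single", False), ("double", True)):
    cycles = serial_cycles(normalize3, double=is_double)
    microseconds = cycles / CLOCK_HZ * 1e6
    print(label, cycles, "cycles,", microseconds, "microseconds")
```

Under these assumptions the single-precision version costs 32 cycles and the double-precision version 100 cycles, which shows how strongly divide and square root dominate the total.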

Four DSP parallel processors

The C80 contains four parallel DSP processors (PPs). Each PP uses 64-bit long instruction words, so it can issue many operations per cycle. Two address units allow a PP to handle two memory operations in a cycle. The three-input ALU can perform multiple operations in each pass and can mix arithmetic and Boolean operations, which means it can, for example, mask an operand and add or subtract it in the same pass. With 8 data registers, each PP can perform up to 7 reads and 4 writes at a time. In addition, the crossbar lets each PP access multiple data streams, so the PPs can operate in a SIMD model.

The data unit of each PP has a flexible data path and 44 user-accessible registers that can serve as ALU operands. It also supports rounding to meet DCT accuracy requirements, as well as conditional operations, conditional choice of the source register pair, and conditional saving of the result.
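The masked add and subtract mentioned for the three-input ALU can be sketched in a few lines. This is only a functional model of the idea, assuming 32-bit operands and wraparound arithmetic; the function names are hypothetical and do not correspond to actual PP instructions.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit operands with wraparound


def masked_add(a, b, mask):
    """Three-input operation: a + (b AND mask), done as one logical step."""
    return (a + (b & mask)) & WORD_MASK


def masked_sub(a, b, mask):
    """Three-input operation: a - (b AND mask), done as one logical step."""
    return (a - (b & mask)) & WORD_MASK


# Add only the low four bits of b to a.
print(hex(masked_add(0x10, 0xFF, 0x0F)))
```

On a conventional two-input ALU the mask and the add would be two separate instructions; the point of the third input is that both fit in one pass.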

Manage and optimize memory traffic with the Transfer Controller (TC)

The TC is responsible for cache fills and writes, direct loads and stores from and to off-chip memory via DEA (direct external memory access) requests, and block movement of data via packet transfers. It also generates refresh and SRT (shift register transfer) cycles to maintain DRAMs and VRAMs.

The controller can transfer up to 400 MB/s, with linear and XY addressing of independent sources and destinations. It can dynamically size the bus to 64, 32, 16, or 8 bits and aligns bytes automatically.
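The 400 MB/s figure is consistent with a 64-bit bus moving one word per cycle at 50 MHz. The sketch below works through that arithmetic for each bus width, assuming one transfer per clock cycle and no refresh or arbitration overhead; the function names and the 640 x 480 frame size are illustrative choices, not values from the source.

```python
CLOCK_HZ = 50_000_000  # assumption: transfers are clocked at the 50 MHz chip clock


def peak_bandwidth(bus_bits, clock_hz=CLOCK_HZ):
    """Peak bytes per second for a bus of the given width, one transfer per cycle."""
    return bus_bits // 8 * clock_hz


def transfer_time_ms(num_bytes, bus_bits=64):
    """Ideal time in milliseconds to move num_bytes at peak bandwidth."""
    return num_bytes / peak_bandwidth(bus_bits) * 1000


frame_bytes = 640 * 480  # hypothetical 8-bit grayscale frame
for width in (64, 32, 16, 8):
    print(width, "bits:", peak_bandwidth(width) // 1_000_000, "MB/s,",
          transfer_time_ms(frame_bytes, width), "ms per frame")
```

These are upper bounds; real transfers would be slower once memory timing and competing requests are taken into account.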

Crossbar architecture allows memory sharing and handles collisions

The crossbar connects the MP, the PPs, and the TC to the on-chip memory (25 x 2K SRAM). It allows all processors to access memory independently and in parallel, and to share all on-chip RAM. If more than one processor tries to access the same RAM in the same cycle, a hardware controller applies round-robin prioritization so that only one processor accesses that RAM at a time.
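The round-robin prioritization described above can be illustrated with a small software model. It is a sketch of the general technique for a single RAM bank, not a description of the actual C80 hardware; the class name and the numbering of requesters are assumptions made for the example.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        self.next_priority = 0  # requester that is checked first

    def grant(self, requesters):
        """Return the winning requester for this cycle, or None if idle."""
        for offset in range(self.num_requesters):
            candidate = (self.next_priority + offset) % self.num_requesters
            if candidate in requesters:
                # The winner gets lowest priority on the next cycle.
                self.next_priority = (candidate + 1) % self.num_requesters
                return candidate
        return None


# Requesters 0..3 stand for the four PPs contending for the same RAM.
arbiter = RoundRobinArbiter(4)
for cycle in range(4):
    print("cycle", cycle, "granted to PP", arbiter.grant({0, 1, 2}))
```

Because the starting point rotates after every grant, no requester can be starved as long as it keeps asking.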

Video controller facilitates video applications

The video controller can capture and display video images in both horizontal and vertical presentation formats simultaneously. It can also operate in asynchronous and synchronous modes, so images can be captured at different rates.

Support for users in debugging

The C80 has a built-in emulation unit that follows the JTAG (IEEE 1149.1) specification to help users in programming and testing.

Applications that exploit the power of the chip

Although the chip was designed to increase the performance of video processing, in practice it has been used in many fields. This project focuses on applications of the chip in three fields: image processing, video processing, and telecommunications. In each field, the applications are grouped by Flynn’s classification.

Image processing with SIMD model

In all applications of this model, the image is divided into smaller parts and each PP performs the same task on its part simultaneously (Figure - 2). After processing, the parts are sent back to the master processor, which reconstructs the image. In some applications, the image can instead be divided by primary color. Every color can be formed from red, green, and blue, so the image is split into red, green, and blue images and each PP works on one of them. The MP is responsible for splitting and reconstructing the image (Figure - 3).
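The split, process, and reconstruct flow can be sketched as follows. This is a sequential Python model in which each loop iteration stands in for one PP; the image is a plain list of rows, and the inversion kernel, the strip-wise split, and all function names are assumptions chosen for illustration.

```python
NUM_PPS = 4  # four parallel processors, as in the C80


def split_rows(image, parts):
    """MP role: cut the image into `parts` horizontal strips of near-equal height."""
    base, extra = divmod(len(image), parts)
    strips, start = [], 0
    for index in range(parts):
        end = start + base + (1 if index < extra else 0)
        strips.append(image[start:end])
        start = end
    return strips


def invert_strip(strip):
    """PP role: the same kernel is applied to every strip (here, pixel inversion)."""
    return [[255 - pixel for pixel in row] for row in strip]


def simd_process(image, kernel, parts=NUM_PPS):
    strips = split_rows(image, parts)
    processed = [kernel(strip) for strip in strips]  # one iteration per PP
    # MP role: stitch the processed strips back into one image.
    return [row for strip in processed for row in strip]


image = [[row * 10 + col for col in range(3)] for row in range(6)]
print(simd_process(image, invert_strip))
```

The same structure applies to the color-based split: the three color planes take the place of the strips and the MP recombines them pixel by pixel.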

Image processing with MISD model

In this case, each PP performs a different task on the image. For example, one filters, one sharpens, one detects edges, and so on. The shared data is the whole image.
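In contrast to the SIMD case, the MISD arrangement gives every PP the same input but a different operation. The sketch below models this with four simple stand-in operations; they are deliberately trivial placeholders for the filtering, sharpening, and edge detection named above, and every function name is hypothetical.

```python
def invert(image):
    return [[255 - p for p in row] for row in image]


def threshold(image, level=128):
    return [[255 if p >= level else 0 for p in row] for row in image]


def horizontal_edges(image):
    """Absolute difference between horizontally adjacent pixels."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
            for row in image]


def brighten(image, amount=10):
    return [[min(255, p + amount) for p in row] for row in image]


# One task per PP; all of them read the same shared image.
TASKS = [invert, threshold, horizontal_edges, brighten]


def misd_process(image):
    return {task.__name__: task(image) for task in TASKS}


for name, result in misd_process([[0, 100, 200]]).items():
    print(name, result)
```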

Figure - 2 SIMD with segmented images

Figure - 3 SIMD with split by basic colors

Video processing with SIMD model

In some applications, such as improving video quality, the frames can be segmented or separated by primary color so that each PP processes only a smaller part of the frame (as in image processing), or each PP processes a different frame at the same time. Another application of this model is finding the motion information of a video. To reduce the amount of video data transmitted, only the first frame, the differences from the previous frame, and the motion vectors are transmitted. The receiver rebuilds each frame from the previous frame, the motion vectors, and the differences. To obtain the motion information, the previous frame has to be searched to find the motion regions. On the C80, each PP searches a small part of the previous frame for the motion regions. In Figure - 4, the three frames are the same except for the person's mouth. Hence, the movement of the mouth gives the motion vector and the differences are confined to the mouth.
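One common way to carry out this search is block matching with a sum of absolute differences (SAD); the source does not say which method was used, so the sketch below should be read as an assumed example. The search area in the previous frame is divided among four workers standing in for the PPs, and the best candidate overall is chosen at the end, as the MP would do.

```python
def sad(block, frame, top, left):
    """Sum of absolute differences between block and the frame patch at (top, left)."""
    return sum(abs(block[y][x] - frame[top + y][left + x])
               for y in range(len(block)) for x in range(len(block[0])))


def search_rows(block, prev, tops):
    """PP role: scan the given candidate rows, return the best (cost, top, left)."""
    best = None
    for top in tops:
        for left in range(len(prev[0]) - len(block[0]) + 1):
            candidate = (sad(block, prev, top, left), top, left)
            if best is None or candidate < best:
                best = candidate
    return best


def find_motion_vector(block, block_top, block_left, prev, parts=4):
    tops = list(range(len(prev) - len(block) + 1))
    chunks = [tops[i::parts] for i in range(parts)]  # rows handled by each PP
    candidates = [search_rows(block, prev, chunk) for chunk in chunks if chunk]
    cost, top, left = min(candidates)  # MP role: pick the overall best match
    return (top - block_top, left - block_left), cost


prev = [[0] * 8 for _ in range(8)]
prev[2][3:5] = [9, 8]
prev[3][3:5] = [7, 6]
block = [[9, 8], [7, 6]]  # the same patch, now located at row 4, column 5
print(find_motion_vector(block, 4, 5, prev))
```

Here the motion vector is expressed as the (row, column) offset from the block's current position to its best match in the previous frame, which is a convention chosen for this example.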

Video processing with MISD model

These applications work exactly as in the image processing model. Each PP is responsible for a specific task, and the shared data is a frame of the video.

Figure - 4 SIMD in finding motion information

Video processing with pipeline model

In this model, each PP performs its own unique set of tasks. The video passes through the PPs frame by frame (Figure - 5). If all processors of the C80 are employed, the pipeline has five stages (four PPs and one MP).
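The way frames overlap in such a pipeline can be sketched with a simple schedule. The model assumes every stage takes one time step per frame, which real stages would not do exactly; the stage functions in the usage example are arbitrary arithmetic placeholders rather than real video operations.

```python
def pipeline_schedule(num_frames, num_stages):
    """For each time step, map stage index -> frame index being processed."""
    schedule = []
    for step in range(num_frames + num_stages - 1):
        active = {}
        for stage in range(num_stages):
            frame = step - stage  # frame f enters stage s at step f + s
            if 0 <= frame < num_frames:
                active[stage] = frame
        schedule.append(active)
    return schedule


def run_pipeline(frames, stages):
    """Push every frame through all stages, following the schedule."""
    data = list(frames)
    for active in pipeline_schedule(len(frames), len(stages)):
        for stage_index, frame_index in active.items():
            data[frame_index] = stages[stage_index](data[frame_index])
    return data


# Five placeholder stages: four PPs and the MP.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3,
          lambda x: x * 10, lambda x: x + 5]

for step, active in enumerate(pipeline_schedule(3, len(stages))):
    print("step", step, active)
print(run_pipeline([1, 2, 3], stages))
```

With five stages, N frames finish in N + 4 steps instead of 5N, so once the pipeline is full one frame completes per step.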

Telecom application – SIMD in smart antenna

This application exploits the DSP parallelism of the C80. In practice, radio waves reach a device from many directions because of reflection. If the device knows which direction is best, its reception quality improves dramatically. A group of antennas and a C80 are used for this task. Each antenna is designed to receive radio waves from a particular direction, and each PP handles one antenna. The PPs continuously send the MP information about the radio waves from their directions. The MP compares this information and selects the direction with the best signal. Figure - 6 shows an antenna system with two radio sources. W1 mostly receives S1, while W2 mostly receives S2. The PPs of the C80 evaluate S1 and S2 concurrently and send the results to the MP, which then selects the best of the radio waves.
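The division of work between the PPs and the MP can be sketched as below. The source does not say how the "best" direction is judged, so this example assumes the simplest criterion, the highest mean signal power; the function names and sample values are made up for illustration.

```python
def mean_power(samples):
    """PP role: average power of the samples received by one antenna."""
    return sum(s * s for s in samples) / len(samples)


def select_best_antenna(antenna_samples):
    """MP role: compare the per-antenna results and pick the strongest one."""
    powers = [mean_power(samples) for samples in antenna_samples]  # one PP each
    best = max(range(len(powers)), key=powers.__getitem__)
    return best, powers


# Hypothetical samples from four directional antennas.
samples = [
    [1, -1, 1, -1],
    [3, -3, 3, -3],
    [0.5, -0.5, 0.5, -0.5],
    [2, -2, 2, -2],
]
best, powers = select_best_antenna(samples)
print("selected antenna", best, "powers", powers)
```

A real receiver would more likely use a quality measure such as signal-to-noise ratio, but the structure stays the same: identical measurements in parallel, then one comparison.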

Figure - 5 Video processing in pipeline model

Figure - 6 SIMD in smart antenna

Conclusion

With its DSP parallelism and video controller, the C80 met the demands of video processing. Moreover, users exploited its power in other fields.

References

•Bilas A., Fritts J., Singh P. J. (1997). Real-Time Parallel MPEG-2 Decoding in Software. Retrieved April 10, 2004 from

•Bosi B., Bois G., Savaria Y. (1999). Reconfigurable Pipelined 2D Convolvers for Fast Digital Signal Processing. Retrieved April 19, 2004 from

•Bayik T., Akhan B. M. (2004). Realisation of parallel 3D scan conversion on TMS320C80 and benchmarks. Retrieved April 10, 2004 from

•Furht B. (2004). Processor Architectures for multimedia. Retrieved April 10, 2004 from

•Gellow C. J., Sodini G. C. (2004). A Pixel-Parallel Image Processor Using Logic Pitch-Matched to Dynamic Memory. Retrieved April 10, 2004 from

•Guangming Lu, Hartej Singh, Ming-hau Lee, Nader Bagherzadeh, Fadi Kurdahi, and Eliseu M.C. Filho (1999). The MorphoSys Parallel Reconfigurable System. Retrieved April 19, 2004 from

•Kwong S., Choi S. (Nov 2000). TI presentation. Retrieved April 10, 2004 from

•Li K. (April 26, 2000). Smart Antenna Receive. Retrieved April 19, 2004 from

•LiLein L. A. (Sep 1996). Digital Signal Processors vs. Universal Processors. Retrieved April 10, 2004 from

•Palkow M. (Sep 1996). MVP Waveletcodec. Retrieved April 12, 2004 from

•Pakers SM (Jan 9, 2002). DSP (Demanding space-based processing!) The path behind and the road ahead. Retrieved April 10, 2004 from

•Peter Wißkirchen, Klaus Kansy, Günther Schmitgen (2004). Integrating Graphics Into Video Image-Based Camera Tracking and Filtering. Retrieved April 19, 2004 from

•Winter H. J. (Nov 27, 2000). Smart Antennas for third generation. Retrieved April 10, 2004 from