Address Processor and Classifier Co-Processors from Silicon Access Networks: A Family of Search Coprocessors for Terabit Routers with OC192 Blades
Mike O’Connor & Syed Mahmud
Principal Architect
Silicon Access Networks
Introduction
Advanced IP networks are being designed to provide improvements in Quality of Service to allow operators to offer differentiated service levels and to support the addition of voice and video data types on connectionless datagram networks. Routing at 10 Gbps line speeds requires numerous lookups into large tables to support the demands of such networks. Typically, these lookups include both longest-prefix-match lookups into a forwarding table and large multi-dimensional lookups into a flow classification table for access control, billing, or QoS purposes. The flow classification lookups may include both source and destination IP addresses and TCP ports among other things. Advanced products such as server load balancers require lookups using session and application data such as SAP session, URLs, and cookies.
As shown in Figure 1, the Address Processor and Classifier can be configured as Co-Processors that target different search applications. The Address Processor is used for longest-prefix and exact-match searches, and its architecture is based upon a tree search algorithm using dense DRAM technology. The Classifier is used for large multi-dimensional lookups and is based upon dynamic ternary CAM technology. The use of fast and dense embedded DRAM enables a high-performance solution at significantly lower cost and lower power consumption than SRAM-based solutions.
Figure 1. System Configuration with Address Processor(s) and Classifier(s) on ZBT SRAM bus
In addition to lookups, the Address Processor and Classifier also provide statistics collection in associated memory, on-the-fly modification of the associated data, and an automated table maintenance capability. Each product is optimized for its task in terms of memory density, power, and functionality. At the same time, both chips are designed to reside simultaneously on a standard 133-MHz ZBT SRAM bus with up to 128 data pins, allowing the routing system designer maximum flexibility. Multiple devices may be placed upon the bus in a mix-and-match fashion. The two chips share a common application-programming interface.
The Address Processor and Classifier are designed to allow a large number of concurrent pipelined lookup requests, one for each key, to access and update associated data. Write transactions on the ZBT bus specify a command and also provide data to the chips, such as the key to look up and the operation to perform on the associated data. Such operations are flexible as to the fields affected and include several kinds of modification, among them increment by a constant and add. Several write transactions may be required to transfer all the data needed for a given command. Once all the data has been transferred, the request is scheduled to begin execution through the pipeline.
A read request is used to access some or all of the results of a given command from the result buffer. These read requests appear as standard SSRAM read transactions.
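The write-then-read flow described above can be modeled in a few lines. This is only a toy sketch and not the chips' actual programming interface: the class name `ToyCoprocessor`, the `write` and `read` methods, the request slots, and the rule that a request executes after exactly two writes (command, then key) are all assumptions made for illustration.

```python
class ToyCoprocessor:
    """Toy model of a memory-mapped lookup device (hypothetical interface)."""

    WORDS_PER_REQUEST = 2   # assumed: one write for the command, one for the key

    def __init__(self, table):
        self.table = table      # key -> associated data
        self.pending = {}       # slot -> words written so far
        self.results = {}       # slot -> result buffer entry

    def write(self, slot, word):
        """Model one bus write; the request runs once all words have arrived."""
        words = self.pending.setdefault(slot, [])
        words.append(word)
        if len(words) == self.WORDS_PER_REQUEST:
            command, key = words
            del self.pending[slot]
            if command == "lookup":
                self.results[slot] = self.table.get(key)

    def read(self, slot):
        """Model one bus read of the result buffer (None if nothing is there)."""
        return self.results.get(slot)


dev = ToyCoprocessor({0x0A000001: "port 3"})
dev.write(0, "lookup")        # first write: the command
dev.write(0, 0x0A000001)      # second write: the key; the request now executes
result = dev.read(0)          # to the host this looks like an ordinary memory read
```

In this model the host never issues a special "lookup" bus cycle; it only writes and reads memory locations, which is the property the text attributes to the ZBT SRAM interface.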
Address Processor
Extensive use of purpose-built embedded DRAM arrays enables a single Address Processor chip to store up to 256K 48-bit longest-prefix-match ranges while sustaining its full rate of 66 million lookups per second. The Address Processor supports key sizes of 48 bits (up to 256K entries), 96 bits (up to 128K entries), and 144 bits (up to 80K entries). Arbitrary IPv4 keys are supported, since the Address Processor can have up to 33 nested levels of prefixes. The Address Processor can update table entries at an average of 1M updates/sec. Each lookup result indexes one of 256K associated 96-bit user data words. Each of these 96-bit words, in turn, contains a 13-bit field that refers to one of 8K additional 256-bit user data words.
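To make the longest-prefix-match semantics concrete, here is a minimal Python sketch using the standard `ipaddress` module. It is a plain linear scan for illustration, not the tree search the chip implements, and the route table contents are invented. It also shows where the figure of 33 nested levels comes from: an IPv4 prefix can have any length from 0 through 32.

```python
import ipaddress

def longest_prefix_match(table, address):
    """Return the (network, data) pair with the longest prefix containing address."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, data in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, data)
    return best

# Hypothetical routes, nested inside one another.
routes = {
    "0.0.0.0/0": "default",
    "10.0.0.0/8": "core",
    "10.1.0.0/16": "edge",
    "10.1.2.3/32": "host",
}
match = longest_prefix_match(routes, "10.1.2.9")   # the /16 is the longest match

nested_levels = len(range(0, 33))   # IPv4 prefix lengths 0 through 32
```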
The Address Processor supports on-the-fly read-modify-write operations on the 96-bit associated words and on half (128 bits) of the 256-bit user data words. For each "per route entry", the chip can perform a read, add and write on two fields of a 96-bit data word. Similarly, for the next-hop information indexed by the per-route information, it can perform a read, add and write on two fields in a 128-bit data word. Doing these four sets of read-modify-write operations and storing the results would require approximately 15 instructions per lookup for another processor in the system. Since the Address Processor supports 66 million lookups per second, this capability enables the user to offload over one billion statistics maintenance operations per second from other processors in the system for each included Address Processor. Line cards built with such chips will require lower off-chip processing power and fewer or zero external SRAM memory components for associated memory tables.
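The read, add and write step on a field of an associated data word can be sketched as follows. The field positions and widths (`PKT_FIELD`, `BYTE_FIELD`) are hypothetical, as is the choice to wrap on overflow; the source does not describe the actual field layout or overflow behavior.

```python
def add_to_field(word, offset, width, delta):
    """Read a bit field from word, add delta (wrapping), and write it back."""
    mask = (1 << width) - 1
    value = (word >> offset) & mask          # read
    value = (value + delta) & mask           # add, wrapping within the field
    return (word & ~(mask << offset)) | (value << offset)   # write

PKT_FIELD = (0, 32)     # hypothetical packet counter: offset 0, 32 bits wide
BYTE_FIELD = (32, 40)   # hypothetical byte counter: offset 32, 40 bits wide

word = 0
word = add_to_field(word, *PKT_FIELD, 1)     # one more packet
word = add_to_field(word, *BYTE_FIELD, 64)   # 64 more bytes
```

A host processor doing the same work would need a load, shifts and masks, an add, and a store for each field, which is the kind of per-lookup instruction cost the text refers to.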
Each Address Processor performs up to 66 million searches per second along with statistics updates associated with each search. This allows two lookups and statistics updates per 40-Byte sized packet at 10 Gbps for WAN router applications. Alternatively, three lookups and statistics updates are supported for a 64-Byte packet environment.
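The per-packet lookup budget follows from simple arithmetic, sketched below. The calculation assumes a nominal 10 Gbps of back-to-back packets and ignores framing overhead; both simplifications are assumptions of this sketch, not statements from the source.

```python
LINE_RATE_BPS = 10e9          # nominal 10 Gbps, as stated in the text
LOOKUPS_PER_SECOND = 66e6     # per-chip search rate from the text

def lookups_per_packet(packet_bytes):
    """Whole lookups available per back-to-back packet of the given size."""
    packets_per_second = LINE_RATE_BPS / (packet_bytes * 8)
    return int(LOOKUPS_PER_SECOND // packets_per_second)

budget_40 = lookups_per_packet(40)   # 31.25 million packets per second
budget_64 = lookups_per_packet(64)   # about 19.5 million packets per second
```

With these inputs, 40-byte packets leave room for two lookups each and 64-byte packets for three, matching the figures in the text.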
Storage configuration for the Address Processor is shown as follows. The Address Processor includes a multi-level search tree in L0-L3 as well as associated statistics data in L4 and L5.
As shown in Table 1, the routing table is stored in a 25 Mb block of embedded DRAM, organized as 8,192 rows of 3200 bits. Three levels of indexing tables, very similar to a B-Tree, are used to select a row of the L3 memory in a pipelined manner. These three memories total ~1.2 Mb and are implemented using embedded SRAM. The correct entry is then selected from the 3200 bits in the chosen L3 row using a proprietary, patent-pending algorithm.
Address Processor Memory Organizations
Level / Width / Height / Size / Memory Type
L0 / 4712 bits / 1 / ~4.6 Kb / SRAM
L1 / 2310 bits / 32 / ~72.2 Kb / SRAM
L2 / 2310 bits / 512 / ~1.1 Mb / SRAM
L3 / 3200 bits / 8K / 25 Mb / DRAM
L4 Data / 100 bits / 256K / 25 Mb / DRAM
L5 Data / 256 bits / 8K / 2 Mb / DRAM
Total / ~1.2 Mb / SRAM
Total / 52 Mb / DRAM
Table 1. Address Processor storage levels and memory organizations
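The sizes in Table 1 can be cross-checked by multiplying each level's width by its height. The sketch below assumes that K means 1024 and that 1 Mb means 2**20 bits; the source does not state this convention, but it is the one under which the table's figures come out as listed.

```python
K = 1024
MB = K * K   # assumed: 1 Mb = 2**20 bits

levels = {
    # name: (width in bits, rows, memory type), copied from Table 1
    "L0": (4712, 1, "SRAM"),
    "L1": (2310, 32, "SRAM"),
    "L2": (2310, 512, "SRAM"),
    "L3": (3200, 8 * K, "DRAM"),
    "L4": (100, 256 * K, "DRAM"),
    "L5": (256, 8 * K, "DRAM"),
}

totals = {"SRAM": 0, "DRAM": 0}
for name, (width, rows, kind) in levels.items():
    bits = width * rows
    totals[kind] += bits
    print(f"{name}: {bits / MB:.3f} Mb {kind}")

sram_mb = totals["SRAM"] / MB   # roughly 1.2 Mb
dram_mb = totals["DRAM"] / MB   # 25 + 25 + 2 Mb
```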
Classifier
The Classifier product is based upon embedded ternary CAM (TCAM) memory arrays as well as DRAM arrays. The Classifier stores up to 192K 48-bit key entries while sustaining its full rate of 66 million lookups per second. The Classifier supports variable key sizes of 48 bits (up to 192K entries), 96 bits (up to 96K entries), 144 bits (up to 64K entries), 192 bits (up to 48K entries), 288 bits (up to 32K entries) and 576 bits (up to 16K entries). The product can support 288-bit operation at 66 million lookup requests per second. Each lookup result indexes an associated variable-width user data word (in multiples of 32 bits). The number of variable-width associated data words is equal to the number of keys (e.g., 192K for 48-bit keys).
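For readers unfamiliar with ternary matching, the following minimal sketch shows the idea in software: each entry carries a value and a mask, and masked-out bits are "don't care". The 48-bit key layout (32-bit address plus 16-bit port), the example entries, and the first-match-wins ordering are assumptions of this sketch; a hardware TCAM compares all entries in parallel rather than looping.

```python
def ternary_lookup(entries, key):
    """Return the index of the first entry matching key, or None on a miss."""
    for index, (value, mask) in enumerate(entries):
        # Mask bits set to 1 must match; bits set to 0 are "don't care".
        if (key & mask) == (value & mask):
            return index
    return None

# Hypothetical 48-bit entries: 32-bit destination address, then 16-bit port.
entries = [
    (0x0A010203_0050, 0xFFFFFFFF_FFFF),   # host 10.1.2.3, port 80 exactly
    (0x0A010000_0000, 0xFFFF0000_0000),   # any host in 10.1.0.0/16, any port
    (0x00000000_0000, 0x00000000_0000),   # catch-all entry
]
hit = ternary_lookup(entries, 0x0A0109090016)   # 10.1.9.9, port 22
```

The returned index plays the role of the lookup result that selects the associated user data word.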
The Classifier supports on-the-fly read-modify-write operations on the associated data. For each data entry, the chip can perform a read, add and write on two fields of the associated data word. Performing these two sets of read-modify-write operations and storing the results would require approximately 8 instructions per lookup for another processor in the system. Since the Classifier supports 66 million lookups per second, this capability enables the user to offload over half a billion statistics maintenance operations per second from other processors in the system for each included Classifier.
Each Classifier performs up to 66 million searches per second along with statistics updates associated with each successful search. This allows up to two lookups and statistics updates per 40-Byte sized packet at 10 Gbps for WAN router applications.
Storage configuration for the Classifier is shown as follows. The Classifier includes a single-level TCAM array as well as associated data in L4. Total user-accessible memory in the Classifier is 15 Mb (9 Mb of TCAM and 6 Mb of DRAM).
Classifier Storage Levels
Level / # Entries / Content / L4 data
TCAM / 192K / 48-bit Key / 192K x 32 bits
TCAM alternate / 96K / 96-bit Key / 96K x 64 bits
TCAM alternate / 64K / 144-bit Key / 64K x 96 bits
TCAM alternate / 48K / 192-bit Key / 48K x 128 bits
TCAM alternate / 32K / 288-bit Key / 32K x 192 bits
TCAM alternate / 16K / 576-bit Key / 16K x 384 bits
Classifier Memory Organizations
Level / Width / Height / Size / Memory Type
TCAM / 48 bits / 192K (max) / 9 Mb / TCAM
L4 Data / 32 bits / 192K (max) / 6 Mb / DRAM
Total / 9 Mb / TCAM
Total / 6 Mb / DRAM
Table 2. Classifier storage levels and memory organizations
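The entry counts and associated data widths listed for each key size follow from dividing the fixed 9 Mb of TCAM and 6 Mb of DRAM among the entries. The sketch below reproduces them, assuming 1 Mb means 2**20 bits; the helper name `capacity` is hypothetical.

```python
TCAM_BITS = 9 * 1024 * 1024   # 9 Mb of TCAM, taking 1 Mb as 2**20 bits
L4_BITS = 6 * 1024 * 1024     # 6 Mb of associated-data DRAM

def capacity(key_bits):
    """Return (number of entries, associated data bits per entry)."""
    entries = TCAM_BITS // key_bits
    data_bits = L4_BITS // entries
    return entries, data_bits

for key_bits in (48, 96, 144, 192, 288, 576):
    entries, data_bits = capacity(key_bits)
    print(f"{key_bits}-bit keys: {entries // 1024}K entries, {data_bits} data bits each")
```

Each doubling of the key width halves the number of entries and doubles the data bits available per entry, so the ratio of data width to key width stays at two thirds throughout.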
Conclusions
Deep-packet processing at OC192 wire speeds is highly memory-intensive. Traditional SRAM-based solutions require a large number of chips, resulting in high power dissipation. Silicon Access Networks’ Address Processor and Classifier Co-Processor chips make extensive use of fast, embedded Smart Memory to offer a unique, cost-effective, yet high-performance solution. OC192 line cards built with such chips will require lower off-chip processing power and fewer or zero external SRAM memory components for associated memory tables.
The conference presentation will focus on the lookup requirements for packet classification, implementation details of the Address Processor and Classifier chips, and how a typical high-end router system might use them.