Your Network Is to Be Designed Assuming the Following Specifications and Technology

Spring 2011 EE382C Handout 4

EE382C Project

Description

For the project, you are to design a data center network. Your goal is to design the highest-performance network that fits the cost/area criteria described below. Your network will be evaluated across a wide range of synthetic traces to determine its performance. We expect that you will investigate several different aspects of the data center network, including topology, routing, and flow control, as well as the router and network interface microarchitecture.

Your network is to be designed assuming the following specifications and technology characteristics:

Your network will provide communication services between up to 100,000 “endpoints”. Each endpoint is an 8x or 16x PCIe 3.0 interface. As it is costly to simulate a network of this size, you should develop a plan to simulate a small representative “slice” of your network – modeling traffic between this slice and the rest of the network.
The network will provide user-level remote-procedure call (RPC) (average 8 words), asynchronous message send (AMS) (4KBytes average), and remote DMA (RDMA) (1MBytes average) to user processes at each endpoint. Protection will be provided to isolate “jobs” from one another (i.e., one “job” cannot RDMA into another job’s remote memory).
Assume that each endpoint can generate up to 5GB/s of bandwidth.
We will assume a 2011 technology (28nm) with chip size limited to 20mm x 20mm.
A nominal clock cycle is assumed to be 30 FO4 (FO4 ~ 12ps). You may adjust your pipelines to use faster or slower pipeline stages. The flip-flop insertion delay is 6FO4. On-chip signals propagate at a velocity of 2mm/ns. Long wires must be pipelined accordingly.
Assume that each chip costs $0.50 per square mm. (Real cost is not linear, but use this approximation.)
Each chip can support ten 10Gb/s input/outputs (two signal pairs, one in each direction) for each 1mm of chip perimeter. Each of these input/outputs dissipates 2pJ/bit and can drive up to a total of 25dB of attenuation. Attenuation of minimum-width (6 mil, 1oz) PC traces is 50dB/m. Attenuation of a 24AWG cable is 5dB/m.
Moving data on-chip dissipates 100fJ/b-mm. Reading or writing on-chip memory consumes 0.5fJ/bit.
The two main components of area that we consider within the routers are the buffer/storage area and the crossbar switches. The buffer area will be assumed to be 0.5µm²/bit and the crossbar area can be approximated as (inputs x outputs x width of the crossbar x 100 x (0.1µm)²).
We will neglect the area impact of the control overheads within the router (e.g., allocators, control state overhead, etc.). However, if you implement a complex routing algorithm or flow control scheme, please include any impact it might have on the router delay and/or the router area.
Off chip communication can be on PC boards, electrical cables, or active optical cables. An active optical cable costs $100 and carries 4 10Gb/s signals in each direction up to 100m with an energy of 25pJ/bit. Electrical cables cost $0.10 per pair and can be up to 5m in length. PC boards cost $0.10 per square cm and may be up to 0.5m on a side. Connectors between PC boards and between boards and cables cost $0.10 per pair.
To compute cable lengths, assume each endpoint is a 1U module (1.75 inches high) in a 19-inch rack. Each rack contains 32 endpoints. Racks have a 19 x 19 inch footprint and are spaced in rows with three-foot wide aisles between adjacent rows.
Your network should be reliable in the presence of single link or router failures. That is, a single link or router failure may cause packets that were in flight at the time of the failure to be dropped, but it should not disconnect the network or significantly impact its performance.
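
The chip-level numbers in the specifications above can be turned into a few quick budget checks. The following is a minimal sketch, not a required tool: the function and constant names are our own, subtracting the flip-flop insertion delay from the cycle before computing wire reach is one possible reading of the specification, and the endpoint signal count ignores any link encoding overhead.

```python
import math

# Values from the handout; names are illustrative only.
FO4_PS = 12.0                 # one FO4 delay in picoseconds
CYCLE_FO4 = 30                # nominal clock cycle in FO4
FLOP_FO4 = 6                  # flip-flop insertion delay in FO4
WIRE_MM_PER_NS = 2.0          # on-chip signal velocity
CHIP_SIDE_MM = 20.0           # maximum chip edge length
COST_PER_MM2 = 0.50           # dollars per square mm of chip
IO_PER_MM = 10                # 10Gb/s input/outputs per mm of perimeter
IO_GBPS = 10.0                # rate of one signal
ATTEN_BUDGET_DB = 25.0        # attenuation one I/O can drive
BUFFER_UM2_PER_BIT = 0.5      # buffer area per bit
XBAR_UM2_PER_POINT = 100 * 0.1 ** 2   # 100 x (0.1 um)^2


def cycle_time_ns(cycle_fo4=CYCLE_FO4):
    """Clock period for a pipeline stage of the given FO4 depth."""
    return cycle_fo4 * FO4_PS / 1000.0


def wire_reach_mm(cycle_fo4=CYCLE_FO4):
    """Wire length coverable in one stage, ASSUMING the flip-flop
    insertion delay is lost from every cycle."""
    return (cycle_fo4 - FLOP_FO4) * FO4_PS / 1000.0 * WIRE_MM_PER_NS


def max_ios(side_mm=CHIP_SIDE_MM):
    """Number of 10Gb/s input/outputs a square chip's perimeter supports."""
    return int(4 * side_mm * IO_PER_MM)


def chip_cost(side_mm=CHIP_SIDE_MM):
    """Chip cost in dollars under the linear cost approximation."""
    return side_mm ** 2 * COST_PER_MM2


def crossbar_area_mm2(inputs, outputs, width_bits):
    """Crossbar area from the handout's approximation, in mm^2."""
    return inputs * outputs * width_bits * XBAR_UM2_PER_POINT / 1e6


def buffer_area_mm2(bits):
    """Buffer/storage area in mm^2."""
    return bits * BUFFER_UM2_PER_BIT / 1e6


def max_reach_m(atten_db_per_m):
    """Longest run of a single medium within the attenuation budget."""
    return ATTEN_BUDGET_DB / atten_db_per_m


def endpoint_signals(endpoint_gbytes_per_s=5.0):
    """10Gb/s signals per direction needed for one endpoint,
    ASSUMING no encoding overhead."""
    return math.ceil(endpoint_gbytes_per_s * 8 / IO_GBPS)
```

With the default values this gives a 0.36ns cycle, 800 input/outputs and a $200 cost for a full 20mm x 20mm chip, and reach limits of 0.5m on PC traces and 5m on 24AWG cable, which match the maximum board and cable sizes given above.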

Evaluation –

We will provide you with a set of synthetic traces. You are to measure the time required to complete the traces and try to minimize cost. You will be graded on cost-performance and power-performance. Cost here is the combined parts cost of your network ($). Power is the total power of your network with all links operating at full bandwidth.
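
As a rough illustration of how the cost and power figures might be tallied, the sketch below sums parts cost and full-bandwidth link power from the per-unit numbers in the specification. The function names are hypothetical, and two choices are assumptions rather than part of the handout: energy is charged per unidirectional 10Gb/s signal, and on-chip data movement and memory energy are left out.

```python
LINK_GBPS = 10.0               # rate of one signal
ELECTRICAL_PJ_PER_BIT = 2.0    # chip input/output energy
OPTICAL_PJ_PER_BIT = 25.0      # active optical cable energy


def link_power_watts(num_signals, pj_per_bit, gbps=LINK_GBPS):
    """Power of num_signals unidirectional signals at full bandwidth.
    pJ/bit times Gb/s gives mW, so scale by 1e-3 to get watts."""
    return num_signals * pj_per_bit * gbps * 1e-3


def network_power_watts(electrical_signals, optical_signals):
    """Total link power (ASSUMES per-signal accounting, links only)."""
    return (link_power_watts(electrical_signals, ELECTRICAL_PJ_PER_BIT)
            + link_power_watts(optical_signals, OPTICAL_PJ_PER_BIT))


def parts_cost(chip_area_mm2=0.0, optical_cables=0, electrical_pairs=0,
               pcb_cm2=0.0, connector_pairs=0):
    """Parts cost in dollars using the handout's unit prices."""
    return (chip_area_mm2 * 0.50
            + optical_cables * 100.0
            + electrical_pairs * 0.10
            + pcb_cm2 * 0.10
            + connector_pairs * 0.10)
```

Under these assumptions one electrical signal dissipates 20mW at full rate, and one active optical cable (four signals in each direction) dissipates 2W.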

Infrastructure –

To evaluate your design, we provide you with the booksim simulator, the simulator that was used to generate the graphs from the textbook. You will need to modify the simulator to implement your design. The simulator can be obtained from the class website.

Checkpoints

Checkpoint 1 (due May 10, 2011):

For checkpoint 1 you must turn in a project proposal that lists the members of your group and a design document on the topology, routing, flow control, and/or the router microarchitecture design of the network. Your invention/selection should be compared to other alternatives and justified.

Checkpoint 2 (due May 19, 2011):

By checkpoint 2, you should be midway into your evaluation with the simulator at least partially working and some preliminary results. You don’t need to turn anything in for this checkpoint, but you must come to Prof. Dally’s office hours or make an appointment to review your progress.

Final Report (due May 31, 2011):

You must submit a written report on your project. This report should be no longer than 10 pages (plus appendices if needed) and should include a description of your problem area, your solution, hypothesis, methodology, findings, and conclusions.

Presentation:

Each group must present their project and findings in class. These presentations will take place on May 24 and 26, so you must have your work largely complete, except for writing the report, by your presentation date.

Suggestions for Data Center Networks:

Here are some suggestions to get you started on the project. Feel free to use any of these ideas or create your own.

Topology:

Most data center networks today use a folded Clos (sometimes called a fat tree) network. A dragonfly network, discussed in class, can clearly be used in place of the Clos to improve cost-performance. Can you come up with a topology that improves upon the dragonfly? If you choose a dragonfly-like network, what is the optimal topology within a group to take best advantage of the relative costs of electrical and optical links?
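
For a first sizing pass on a dragonfly, the sketch below uses the usual parameterization from the dragonfly literature, which is an assumption here and not part of this handout: p endpoints per router, a routers per group, h global channels per router, at most a*h + 1 groups, and the balance rule a = 2p = 2h. It assumes a fully connected group and one global channel between each pair of groups.

```python
def dragonfly_size(p, a, h):
    """Size of a maximum-scale dragonfly with p endpoints per router,
    a routers per group, and h global channels per router."""
    groups = a * h + 1                 # one global channel per group pair
    return {
        "groups": groups,
        "endpoints": a * p * groups,
        "radix": p + (a - 1) + h,      # terminal + local + global ports
    }


def smallest_balanced_dragonfly(target_endpoints):
    """Smallest p with a = 2p, h = p that reaches the target size."""
    p = 1
    while True:
        cfg = dragonfly_size(p, 2 * p, p)
        if cfg["endpoints"] >= target_endpoints:
            return p, cfg
        p += 1


if __name__ == "__main__":
    print(smallest_balanced_dragonfly(100_000))
```

For 100,000 endpoints this returns p = 13: 339 groups of 26 routers, 114,582 endpoints, and a router radix of 51 ports. How many 10Gb/s signals each port needs, and whether local channels should be electrical or optical, is left to your design.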

Routing:

Your routing algorithm will depend on your topology. If you choose a dragonfly (or a variant) you will need to use some form of global adaptive routing and may need to employ one or more techniques for indirect adaptive routing (e.g., progressive adaptive routing). Can you come up with a better scheme for distributing congestion information or making routing decisions that balance load globally?
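
As a starting point for comparison, a UGAL-style source decision can be sketched in a few lines. This is a generic illustration and not the scheme you are required to use: it compares queue depth weighted by hop count for one minimal and one non-minimal candidate, using only queue occupancy visible at the source router. The function name and the `threshold` parameter are hypothetical, and the indirect techniques mentioned above (such as progressive adaptive routing) are not modeled.

```python
def ugal_choose(q_min, hops_min, q_nonmin, hops_nonmin, threshold=0):
    """Pick a path class at the source router.

    q_min, q_nonmin: occupancy of the output queues the minimal and
        non-minimal candidate paths would use first.
    hops_min, hops_nonmin: hop counts of the two candidate paths.
    threshold: bias toward the minimal path (illustrative knob).
    """
    if q_min * hops_min <= q_nonmin * hops_nonmin + threshold:
        return "minimal"
    return "nonminimal"


if __name__ == "__main__":
    # Lightly loaded: the shorter path wins.
    print(ugal_choose(q_min=1, hops_min=3, q_nonmin=1, hops_nonmin=5))
    # Minimal output is congested: divert through an intermediate group.
    print(ugal_choose(q_min=10, hops_min=3, q_nonmin=1, hops_nonmin=5))
```

Because the decision sees only local queues, it reacts late to congestion on a remote global channel, which is the gap the question above asks you to close.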

Flow Control:

Data center traffic consists of a mix of short messages (RPC), medium-sized messages (AMS) and long messages (RDMA). What is the best way to divide these messages into packets and maintain ordering of the packets within a message at the application level?
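
One simple baseline, shown here only as a sketch, is to cut each message into fixed-size packets that carry a message identifier and a sequence number, and to reorder by sequence number at the destination. The `Packet` fields, the function names, and the 256-byte payload used in the example are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    msg_id: int      # identifies the message this packet belongs to
    seq: int         # position of this packet within the message
    total: int       # number of packets in the message
    payload: bytes


def segment(msg_id, data, max_payload):
    """Split a message into packets of at most max_payload bytes."""
    chunks = [data[i:i + max_payload]
              for i in range(0, len(data), max_payload)] or [b""]
    return [Packet(msg_id, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets):
    """Rebuild a message from packets that arrived in any order."""
    by_seq = {pkt.seq: pkt for pkt in packets}
    total = packets[0].total
    if len(by_seq) != total:
        raise ValueError("message is incomplete")
    return b"".join(by_seq[seq].payload for seq in range(total))


if __name__ == "__main__":
    message = bytes(range(256)) * 16            # a 4KByte AMS-sized message
    packets = segment(msg_id=7, data=message, max_payload=256)
    packets.reverse()                           # simulate out-of-order arrival
    assert reassemble(packets) == message
    print(len(packets), "packets")
```

Reordering at the destination lets packets of one message take different paths, at the cost of reassembly state in the network interface; the packet size trades header overhead against blocking of short RPCs behind long RDMA transfers.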

In the case of Map-Reduce using asynchronous messages, during the reduction phase, several messages may converge on one node creating a hot spot. How will you handle such hot spots?

What is the expensive resource in your network? How can you design the flow control mechanisms (and the rest of the network) to optimize the use of this expensive resource?

Router Architecture:

What is the architecture of your routers? How many ports do they have? How are they organized internally? Where are the internal channels and switches? Where are the buffers? Are they deep enough to avoid credit stalls? Where is the routing decision and the various allocation decisions made? What is the pipeline for your router? How many cycles of latency does it take to get from input to output?
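
To check whether buffers are deep enough to avoid credit stalls, a common rule is that each buffer must cover one credit round trip at full channel bandwidth. The sketch below applies that rule. The 5ns/m cable propagation delay is an assumed typical value that does not come from this handout, and the function and parameter names, flit size, and router cycle count are illustrative.

```python
import math


def min_buffer_flits(cable_m, flit_bits, channel_gbps, router_cycles,
                     cycle_ns=0.36, ns_per_m=5.0):
    """Flits of buffering needed to cover one credit round trip.

    cable_m: one-way channel length in meters.
    flit_bits: flit size in bits.
    channel_gbps: channel bandwidth in Gb/s (equivalently bits/ns).
    router_cycles: total pipeline cycles spent handling the flit and the
        returning credit at both ends.
    cycle_ns: clock period (0.36ns is 30 FO4 at 12ps per FO4).
    ns_per_m: ASSUMED cable propagation delay.
    """
    round_trip_ns = 2 * cable_m * ns_per_m + router_cycles * cycle_ns
    flit_time_ns = flit_bits / channel_gbps
    return math.ceil(round_trip_ns / flit_time_ns)


if __name__ == "__main__":
    # 100m optical channel, four 10Gb/s signals, 128-bit flits.
    print(min_buffer_flits(cable_m=100, flit_bits=128,
                           channel_gbps=40, router_cycles=10))
```

Under these assumptions a 100m channel needs several hundred flits of buffering per virtual channel that must run at full rate, so the long optical channels rather than the short electrical ones are likely to set your buffer area.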

Network Interface Architecture:

What is the architecture of your network interface? What is its internal state? What functions does it perform?