The CRAY-X1 Supercomputer

CS-350-2

Spring 2004

Kevin Boucher

Brian Femiano

Sara Prochnow

Allen Peppler


Table of Contents

Cray History
Introduction of the X1
Operating System
Cray X1 IRIX 6.5 Implementation Features
RAID System on the Cray X1
The C-brick
The S-brick
Nodes
Multi-Chip Modules (MCM)
Cooling
Single-Streaming Processors (SSP)
Instruction and Data Caches
ECache
Memory
Global Addressability
Word Size
Modes
Cabinets
Programming
Applications
Summary

APPENDICES

Questions and Answers for Review ................ APPENDIX A
Bibliography .................................... APPENDIX B
Work Summary Sheet .............................. APPENDIX C

Cray History

Seymour Cray is responsible for a great deal of supercomputing history; some go so far as to call him the father of the supercomputer. Cray’s mission in life was to create the world’s fastest computer. In 1957, he founded the Control Data Corporation with William Norris, and in 1958 they developed the first fully transistorized supercomputer, the CDC 1604. After the CDC 1604, Cray designed the CDC 6600, which used 60-bit words and parallel processing. This machine demonstrated RISC (Reduced Instruction Set Computing) design principles and was at least forty times faster than the 1604.

In 1972, Cray and Norris had a dispute over a new computer that Norris had put on hold. The disagreement led Cray to leave Control Data Corporation and found a new company, Cray Research. In 1976, Cray designed the CRAY-1, which delivered 133 megaflops, and followed this achievement in 1985 with the CRAY-2, which delivered 1.9 gigaflops (Bellis). The CRAY-2 had the world's largest central memory, holding up to 2048 megabytes (Long). Each of these computers was the fastest supercomputer of its time.

In 1989, Cray had a dispute with management at Cray Research after they put the CRAY-3 on hold. This disagreement led to the founding of Cray Computer Corporation, where Cray could develop his new project, the Cray-3. The Cray-3 delivered roughly 4-5 gigaflops and was the fastest supercomputer when it was introduced. It was based on 1 GHz gallium arsenide (GaAs) processors, whereas contemporary processors were built on conventional silicon that topped out at 400-500 MHz. The Cray-4 followed the Cray-3 and was likewise based on gallium arsenide; it was twice as fast per processor as the Cray-3 and physically smaller than a human brain.

In 1995, the Cray Computer Corporation went bankrupt due to circumstances in the economy beyond its control; the Cray-3 and Cray-4 had seen minimal sales. After the bankruptcy, Cray founded yet another company, SRC Computer Labs, to begin building a new computer. Unfortunately, Seymour Cray was killed in an automobile accident a year later, leaving his plans unrealized.

In addition to developing some of the fastest supercomputers of their eras, Cray invented or contributed to several technologies used throughout the supercomputer industry, including the CRAY-1 vector register technology, immersion cooling, gallium arsenide semiconductor technology, and RISC architecture.

Introduction of the X1

The death of Seymour Cray was not the end of Cray supercomputers, although for a time it seemed to be. In 1996, Cray Research merged with Silicon Graphics Inc. (SGI), which cancelled future Cray research (Dow). In 2000, Silicon Graphics sold the Cray Research assets to Tera Computer Co., which performed a major reconstruction, made minor upgrades to Cray’s existing computers, maintained the service business to collect revenue, and changed its name to Cray Inc. (Dow, 2003).

In November of 2002, Cray Inc. introduced the X1, a new supercomputer with leading-edge technology. The X1 was created to focus on capability rather than capacity (Dow, 2003). Brooks states, “The X1 can sort the 6 million volumes of the New York City Public Library in under a minute, an improvement of nearly four minutes over the supercomputing standards of the mid-1990s, officials say. It holds the equivalent computing power of 25,000 personal computers” (2003). The Cray X1 is the latest Cray supercomputer, and the sections that follow examine its new features and improvements.

Operating System

The Cray X1 system uses the UNICOS/mp operating system to control overall resource and disk management. UNICOS/mp is based on the IRIX 6.5 kernel, SGI’s Unix variant; on the Cray X1, the kernel has been enhanced for improved scalability and resource scheduling. The kernel implemented on the Cray X1 conforms to the POSIX 1003.1-1990 and POSIX 1003.2-1992 standards (Cray Docs 1).

Cray X1 IRIX 6.5 Implementation Features

The Cray X1 runs a single-system operating system image with hardware error reporting and checkpoint/restart software management tools. Depending on user permission settings, various processes can be controlled with the checkpoint/restart tool cpr. The Cray X1 also supports the PBS Pro batch system for batch job management, although PBS Pro does not come standard with the hardware. System monitoring and reporting data can be accessed using the sar and timex utilities provided by the UNICOS/mp operating system; CPU time, processor jobs, memory, I/O, network communications, and storage can all be monitored with these utilities (Cray Docs 1).

UNICOS/mp uses the XFS journaling file system and the XLV volume manager for managing logical volumes. Entire file systems, directories, or individual files can be backed up and restored using the xfsdump and xfsrestore utilities, and currently mounted file systems can also be backed up with xfsdump (Cray Docs 1).

Both the Network File System (NFS) client and NFS server are available under the UNICOS/mp implementation on the Cray X1. Support for remote procedure calls, which NFS uses at the session layer, is provided, and Domain Name System support is also available. Transmission Control Protocol/Internet Protocol (TCP/IP) is supported as well, including the socket interface for network communications, ftp, telnet, and rsh (Cray Docs 1).

RAID System on the Cray X1

Cray has traditionally supported two different models of RAID subsystems in its computers. Early production systems housed the RS100 series, and later models use the RS200 series. Both models comprise a pair of redundant RAID controllers and back-end storage, and components from the two series cannot be mixed. The RAID storage system in the X1 is based on third-party RAID hardware that Cray selects and configures (Cray Docs 2).

The overall system is housed in a PC-20 peripheral cabinet and divided into two components, the C-brick and the S-brick. The C-brick contains the RAID controller and the S-brick contains the physical hard-drives used in storage and the means to interface with the C-brick controllers (Cray Docs 2).

The C-brick has an attached Ethernet port that is used to monitor the disk performance of its connected S-bricks.

The C-brick

As mentioned earlier, the C-brick houses the RAID controllers. There are two main models across the Cray X1 system series, the CB100 and the CB200. Some early production Cray X1 systems used the CB100, which at a minimum houses two RC100 RAID controllers with two Fibre Channel front-end connections and four back-end connections; redundant power supplies with cache batteries protect against power failure. The CB200, the C-brick of an RS200-based RAID subsystem, has two 2-Gbps Fibre Channel front-end connections and four back-end connections, and each RAID controller can access these loops. It has the same power and cooling features as the CB100, plus additional Ethernet and serial connections for administrative control. The CB200's improved architecture allows considerably higher performance from the RS200 RAID controllers (Cray Docs 2).

The S-brick

The number of S-bricks attached to a given C-brick depends on the configuration, but there are limits. RS100 and RS200 components cannot be mixed. A single RAID subsystem can hold at most eight S-bricks, and they must be added in pairs. Because each S-brick has only two Fibre Channel connections, the S-bricks are attached to the C-bricks as a pair of redundant loops. Each S-brick carries a series number indicating compatibility with either the RS100 or RS200 model, and a spindle-size indicator: 0 for 36-gigabyte spindles, 1 for 73-gigabyte spindles, or 2 for 146-gigabyte spindles (Cray Docs 2).
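
That indicator-to-capacity mapping can be summarized in a few lines of C; this is purely an illustrative helper written for this report, not part of any Cray software.

    /* Illustrative helper only: maps the S-brick spindle-size
       indicator described above to drive capacity in gigabytes. */
    int spindle_capacity_gb(int indicator)
    {
        switch (indicator) {
        case 0:  return 36;    /* 36-gigabyte spindles  */
        case 1:  return 73;    /* 73-gigabyte spindles  */
        case 2:  return 146;   /* 146-gigabyte spindles */
        default: return -1;    /* unknown indicator     */
        }
    }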

In the RS100 series, an S-brick houses ten dual-ported Fibre Channel drives, two 1-Gbps back-end connections, and power and cooling units; spindle sizes are limited to 73 gigabytes. In the RS200 series, an S-brick houses fourteen dual-ported Fibre Channel drives with similar connection bandwidth and cooling, but spindle sizes can be 36, 73, or 146 gigabytes (Cray Docs 2).

Nodes

The basic unit of the Cray machine is the node. A node is made up of four multi-chip modules (MCM) and main memory. The MCMs and memory are attached to routers that allow communication between different nodes (Cray Docs 3).

Figure 1 shows the MCM model.

Multi-Chip Modules (MCM)

Each MCM contains a single multi-streaming processor (MSP). Each MSP is made up of four scalar single-streaming processors (SSPs) and two megabytes of ecache. The MSP acts as a vector processor, able to handle large numbers of operations at once by combining its four scalar processors, each of which can compute only one instruction at a time (Partridge, 2002).

Figure 2 shows the SSP connections to the cache.

Cooling

The MSPs are cooled by spraying them with Fluorinert, an inert liquid. Each processor is sprayed by a tiny nozzle, and the heat from the processor causes the liquid to evaporate. The evaporating liquid cools the processor, and the resulting gas is collected for reuse. After collection, the Fluorinert is cooled, filtered, and sent back to be used in the cooling process once again (Partridge, 2002).

Single-Streaming Processors (SSP)

The four SSPs that make up an MSP are scalar processors, each augmented with a two-pipe vector unit that lets it fetch, decode, and execute two instructions per clock cycle. Running the SSPs at their peak clock of 800 MHz and completing several floating-point operations per cycle, the four SSPs of an MSP together provide the 12.8 gigaflops of processing power that makes the Cray X1 such an impressive machine (Partridge, 2002). The term “flops” stands for floating-point operations per second and is simply a measurement of processor speed (Wikipedia, 2004).
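
The 12.8-gigaflop figure is the peak of a full MSP, that is, of all four SSPs combined. The short calculation below reproduces it under the assumption, drawn from published descriptions of the X1 rather than from this report, that each SSP's vector unit has two pipes and that each pipe completes a multiply and an add (two floating-point operations) every 800 MHz cycle.

    #include <stdio.h>

    int main(void)
    {
        const double clock_hz    = 800e6; /* SSP vector clock (stated above)   */
        const int pipes_per_ssp  = 2;     /* assumed: two vector pipes per SSP */
        const int flops_per_pipe = 2;     /* assumed: multiply + add per cycle */
        const int ssps_per_msp   = 4;     /* four SSPs make up one MSP         */

        double gflops = clock_hz * pipes_per_ssp * flops_per_pipe
                        * ssps_per_msp / 1e9;
        printf("Peak per MSP: %.1f gigaflops\n", gflops); /* prints 12.8 */
        return 0;
    }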

Instruction and Data Caches

Each scalar processor contains an instruction cache and a data cache. Each cache is sixteen kilobytes in size, for a total of thirty-two kilobytes of cache per scalar processor. The caches are two-way set-associative, organized as 256 sets of two lines, with each line thirty-two bytes long (Cray Docs 4).

Figure 3 is a visual depiction of the instruction and data caches.

Data and instruction cache addresses are forty-eight bits long: the tag field is thirty-five bits, the set field is eight bits, and the line-offset field is five bits (Cray Docs 4).

Figure 4 is an example of the tag for the cache.
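
As a sketch of how such an address breaks apart, the C fragment below splits a 48-bit address into the tag, set, and line-offset fields using only the field widths quoted above, and confirms that 256 sets of two 32-byte lines give the stated sixteen-kilobyte capacity. The example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    /* Field widths from the text: 32-byte lines (5 offset bits),
       256 sets (8 index bits), and a 35-bit tag in a 48-bit address. */
    #define OFFSET_BITS 5
    #define SET_BITS    8

    int main(void)
    {
        uint64_t addr = 0x123456789abcULL;   /* arbitrary 48-bit address */

        uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
        uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);

        printf("tag = 0x%09llx  set = %llu  offset = %llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);

        /* 256 sets x 2 ways x 32-byte lines = 16,384 bytes per cache. */
        printf("capacity = %d bytes\n", 256 * 2 * 32);
        return 0;
    }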

The data cache is write-through, which means that whenever data is written to it, the same data is also sent to the ecache. Scalar data is written to both the data cache and the ecache, but vector data is written only to the ecache (Cray Docs 4).

ECache

The ecache is a high-speed cache that gives the processors a large amount of temporary storage. The processor can load from the ecache at 51.2 GB per second, store to it at up to 25.6 GB per second, and access local memory at 38.4 GB per second. The ecache is similar in structure to the data and instruction caches and is addressed in exactly the same way, but its format differs slightly: instead of 256 sets of lines, it has 32,768 sets. Another difference is that the ecache is write-back, so modified lines are not immediately propagated to memory; instead, a dirty line is written out to memory only when it is evicted, or flushed, from the cache (Cray Docs 4).
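
The contrast between the write-through data cache and the write-back ecache can be sketched in a few lines of C. This is a generic illustration of the two policies, not the X1's actual control logic; the structure and function names are invented for this report.

    #include <stdbool.h>
    #include <string.h>

    /* Generic illustration of the two write policies (hypothetical code). */
    struct cache_line { bool valid; bool dirty; unsigned char data[32]; };

    /* Write-through (the data cache): every store is also sent onward. */
    void store_write_through(struct cache_line *l, const unsigned char *src,
                             void (*write_next_level)(const unsigned char *))
    {
        memcpy(l->data, src, sizeof l->data);
        write_next_level(src);          /* propagate immediately to the next level */
    }

    /* Write-back (the ecache): the line is marked dirty and written to
       memory only when it is evicted from the cache. */
    void store_write_back(struct cache_line *l, const unsigned char *src)
    {
        memcpy(l->data, src, sizeof l->data);
        l->dirty = true;                /* memory is now stale for this line */
    }

    void evict(struct cache_line *l, void (*write_memory)(const unsigned char *))
    {
        if (l->valid && l->dirty)
            write_memory(l->data);      /* write back only on eviction */
        l->valid = l->dirty = false;
    }

The practical difference is that a write-through cache keeps the next level current at all times, while a write-back cache defers the update until eviction.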

Memory

Each node has sixteen memory controller chips and thirty-two dynamic random access memory (DRAM) daughter cards. The daughter cards come in two sizes, built from either 288-megabit or 576-megabit chips, giving a node a total of sixteen or thirty-two gigabytes of memory, respectively (Johnson, 2003).

Global Addressability

As mentioned in the node description, every node is connected to the others through a system of routers. This network allows the memory on each node to be globally addressable, meaning that memory on any node can be accessed not only by the components of that node but also by components on any other node. Accesses from a remote node, however, are considerably slower than accesses made by a node's own components (Cray Docs 3).

Word Size

The memory on each node is organized into seventy-two-bit words. Sixty-four of these bits hold data and can serve a single sixty-four-bit operation or be split into two halves for thirty-two-bit operations. The remaining eight bits are used for single-error-correction, double-error-detection (SECDED) coding (Johnson), which allows the memory to detect single- and double-bit errors and to correct the single-bit errors it finds (Wu, 2003).
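
The eight check bits may also explain the unusual chip sizes quoted in the Memory section above: storing eight check bits alongside every sixty-four data bits inflates each part by a factor of 9/8, which is exactly the ratio between the 288- and 576-megabit chips and conventional 256- and 512-megabit parts. This is our reading of the figures rather than something the report states directly.

    \[
    \frac{72}{64} = \frac{9}{8}, \qquad
    256\ \text{Mbit} \times \tfrac{9}{8} = 288\ \text{Mbit}, \qquad
    512\ \text{Mbit} \times \tfrac{9}{8} = 576\ \text{Mbit}.
    \]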

Modes

Memory can run in one of two modes that allow it to tolerate the loss of memory cells due to unforeseen failures. The first mode reserves half of the memory chips on each card to cover the potential loss of a chip; this mode cuts the memory space in half but does not affect the data-transfer bandwidth. The second mode reserves half of the daughter cards in case an entire card is lost; this mode halves both the memory space and the available data-transfer bandwidth (Johnson, 2003).

Cabinets

Cray machines can be purchased in one of two types of cabinets, air-cooled or liquid-cooled. In an air-cooled cabinet, the Fluorinert is gathered after it evaporates from the processors and sent through a system that transfers its heat to air blowing through the cabinet. An air-cooled cabinet can hold up to four nodes (sixteen MSPs), while a liquid-cooled cabinet can hold up to sixteen nodes (sixty-four MSPs), because liquid cooling is much more efficient. In a liquid-cooled cabinet, the Fluorinert is sent through a liquid-cooling system that the customer must supply; often this system passes the liquid through cold water, which absorbs much of the heat. Liquid-cooled cabinets are used mostly in large computing centers because of the large amount of space needed for the computer and the liquid-cooling unit (Partridge, 2002).