Grid Laboratories of Wisconsin, GLOW-I Final Report

Since our last report on GLOW in May of 2005, our campus grid has continued to grow, both in computing power and in its reach within the campus community and beyond. It has come a long way since a very cold Wisconsin morning in January 2004, when the first shipment of GLOW hardware arrived at the loading dock. Our task all along has been the same: deliver as much computing power to scientists as we possibly can. The first set of 30 compute nodes ran their first jobs 14 hours after arrival. Since then, GLOW has delivered 23 million CPU hours to more than 170 users and provided scientists and students from the UW and beyond with much-needed computing and storage capabilities. An important and intentional side effect of this activity has been the advancement of widely used computing middleware, primarily Condor, in the face of real-life challenges.

GLOW hardware was purchased in four main rounds over two years and was distributed to six locations on the UW-Madison campus. The machines were managed as one Condor pool, accessible to all GLOW users, with differing machine-level policies to suit the needs of each machine’s effective owner. In addition to the hardware purchased with the core GLOW grant, new and existing members of GLOW have added roughly an equal amount of hardware to the GLOW Condor pool. We take this as a very healthy indicator of the project’s success. Because we have tended to think of and monitor this combined resource as a single entity, “GLOW”, the statistics in this report reflect the overall usage of GLOW, including both the portion paid for by the core GLOW grant and the additional resources that have been added.
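
Because such per-machine, owner-first policies come up throughout this report, the following sketch illustrates how they are typically expressed in a Condor startd configuration. It is a minimal, hypothetical example: the accounting-group name “group_cms” is a placeholder, and the policies actually used by the GLOW sites are more elaborate.

    # Minimal sketch of a per-machine owner policy (hypothetical values).

    # Let any GLOW job start whenever the machine has a free batch slot.
    START = True

    # Prefer jobs submitted under the owning group's accounting group.
    # A higher-RANK owner job will displace an opportunistic job when the
    # machine is full; the displaced job simply returns to the queue and
    # matches elsewhere in the pool.
    RANK = ( TARGET.AccountingGroup =?= "group_cms" )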

GLOW HARDWARE

GLOW Round 1 was deployed in February of 2004. It consisted of six racks, each holding 30 dual 2.8 GHz Intel Pentium 4 Xeon compute nodes. Each site received one rack and an additional storage node with one terabyte of disk space serving as a high-capacity data cache. Intel donated 70 of the motherboards and 140 of the processors needed to deploy this round. Each rack has a stack of Cisco 3750 gigabit switches to provide high bandwidth to the compute nodes. Two sites received specialized configurations: the Laboratory for Molecular and Computational Genomics (LMCG) compute nodes have 4 gigabytes of RAM instead of the standard 2 gigabytes because of the larger working set of the LMCG jobs, and the Computer Sciences site has Cisco 6500-series switches instead of the 3750s to allow for a 10 gigabit uplink to the campus backbone and (what has since become a reality) to STAR TAP in Chicago.

GLOW Round 2 followed in October of 2004. Five racks were added to the initial six, each containing 36 compute nodes identical to those of Round 1. The CMS High Energy Physics site received 30 nodes instead of 36, in order to physically reconfigure its site into 3 racks of 30 compute nodes and 12 terabytes of storage per rack. The IceCube site did not receive a rack due to an upcoming move across campus and a lack of physical space in its temporary quarters.

GLOW Round 3 was deployed in January and February of 2005. It added high-capacity storage to the Computer Sciences and High Energy Physics sites, using Apple Xserve RAID arrays. Each RAID has 5.6 terabytes of raw storage, which after RAID and file system overhead yields about 4.5 terabytes of usable capacity. High Energy Physics received 9 systems and placed 3 in each rack. Computer Sciences received 8 RAIDs. The RAIDs are accessed through several user-level tools, such as GridFTP and dCache.

GLOW Round 4 was deployed in December of 2005 and the following January. It consisted of 60 dual-CPU 3.2 GHz Intel Pentium machines. One rack of 30 nodes was installed at IceCube and one at Medical Physics.

Table 1 - GLOW-I Hardware Purchasing

Site                            | Round 1                                | Round 2                  | Round 3 (storage) | Round 4       | Total
Lab for Computational Genomics  | 30 dual Xeons (4 GB RAM), 1 TB storage | 36 dual Xeons (4 GB RAM) | -                 | -             | 66 nodes, 1 TB
Medical Physics                 | 30 dual Xeons, 1 TB storage            | 36 dual Xeons            | -                 | 30 dual Xeons | 96 nodes, 1 TB
Computer Sciences, Condor       | 30 dual Xeons, 1 TB storage            | 36 dual Xeons            | 32 TB             | -             | 66 nodes, 33 TB
High Energy Physics, CMS        | 30 dual Xeons, 1 TB storage            | 30 dual Xeons            | 36 TB             | -             | 60 nodes, 37 TB
Chemical Engineering, MTSM      | 30 dual Xeons, 1 TB storage            | 36 dual Xeons            | -                 | -             | 66 nodes, 1 TB
IceCube                         | 30 dual Xeons, 1 TB storage            | -                        | -                 | 30 dual Xeons | 60 nodes, 1 TB
Total                           | $461,409                               | $375,570                 | $215,128          | $152,637      | 414 nodes, 828 CPUs, 74 TB

ADDING MEMBERS AND RESOURCES

The first additional group to join GLOW was UW ATLAS, a High Energy Physics research group. Using their own funds, the ATLAS group added 60 dual-processor Pentium 4 Xeon machines in March of 2005 and formally joined the GLOW management boards. New members of GLOW are expected to participate in the technical meetings and activities, including basic maintenance of their machines, or to find an existing member who is willing to do this on their behalf. The Condor CS group agreed to host the ATLAS computers and to handle administrative tasks.

The Physics department added an additional 33 dual-processor machines in June of 2005, initially for research being done by Marcus Mueller in Condensed Matter Physics. These nodes were hosted by the CMS High Energy Physics group in return for second-priority rights to these machines (i.e., higher priority than other GLOW users). When Marcus Mueller moved to a different university, the ownership of these nodes was transferred to CMS.

The Multiscalar research group from Computer Sciences joined GLOW in April 2006. They added 18 dual-CPU, dual-core 1.8 GHz AMD Opteron machines, managed by the Condor CS group. Half of these machines had 4 GB of RAM and half had 8 GB.

The Plasma Physics research group joined GLOW in May 2006, adding 22 dual-CPU, dual-core AMD Opteron machines with 16 GB of RAM. These were hosted by the CMS High Energy Physics group.

In addition to these resources from new GLOW members, some of the existing GLOW members decided to add resources to the GLOW pool. In early 2005, the UW CMS group was chosen to become one of the seven US Tier-2 computing centers for CMS. Feeling that the GLOW model made sense, given the bursty CMS usage pattern, the UW CMS group decided to manage their Tier-2 computers as part of the GLOW Condor pool. The machines are, of course, for first-priority use by CMS, which expects to use an increasing amount of computing power as it prepares for the analysis of LHC data in 2008. By the end of 2006, this additional hardware consisted of 28 existing dual-CPU 2.4 GHz Xeons, the 33 previously mentioned dual Xeons acquired from Condensed Matter Physics, and 45 newly acquired dual-CPU, dual-core AMD Opterons, all with 1 GB of RAM per core. The CMS group also packed many of their nodes with additional disks for their own private use through dCache (~100 TB of raw space). This co-location of storage services with GLOW compute services has proven to work reasonably well.

The Chemical Engineering GLOW group added 25 dual-core Xeons to the GLOW pool.

The Condor CS group also continued to add resources to GLOW. At the end of 2006, these additional CS machines comprised 26 2.4 GHz dual Xeons, 50 3.2 GHz dual Xeons, and 24 1.8 GHz dual-CPU, dual-core AMD Opterons.

Summary of GLOW Hardware Today

Table 2 summarizes the computing hardware that is operational at the time of this writing. Hardware purchased as part of the core GLOW grant is listed under the heading “GLOW-I”. Hardware purchased from other funding sources is listed under the heading “Added”. To provide a meaningful comparison across the different generations of hardware, the processing power has been converted to kilo-SpecInt2000 (kSI2k) units.

Table 2 - GLOW hardware operational on Dec 26, 2006

GLOW Member  | GLOW-I (kSI2k) | Added (kSI2k)
ATLAS        | -              | 186
ChemE        | 148            | 80
CMS          | 198            | 352
IceCube      | 163            | -
LMCG         | 185            | -
Condor       | 143            | 306
MedPhys      | 278            | -
Multiscalar  | -              | 83
Plasma       | -              | 110
TOTAL        | 1115           | 1117

Usage Statistics

Usage patterns vary widely between the GLOW members. This is good, because it means that a lot of opportunistic use does in fact take place: while one group has a lull in its immediate need for CPUs, others almost always have work ready and waiting for the free machines.

Figure 1 - GLOW Condor Status

Figure 1 shows the status of job execution slots in the GLOW Condor pool over the past two years. During this time, the “normal” utilization of these batch slots was about 95%. In addition to this utilization by normal Condor job submission, there are a couple of additional sources of usage. The white space under the line representing total CPUs (more correctly, “cores”) is time used by backfill processes; this amounts to about 2% during this time period. On most of GLOW, the Condor backfill configuration is set to run BOINC jobs from the LIGO Einstein@Home project.
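
For concreteness, the sketch below shows what such a backfill configuration can look like on an execute node. It is a hedged illustration rather than GLOW’s actual settings: the paths, policy macros (taken from the stock example configuration), project URL, and account key are placeholders.

    # Hypothetical backfill configuration for a GLOW execute node.
    ENABLE_BACKFILL = TRUE
    BACKFILL_SYSTEM = BOINC

    # Start backfill only after the machine has been idle for a while, and
    # evict it as soon as the machine is wanted for anything else.
    START_BACKFILL = $(StateTimer) > (10 * $(MINUTE))
    EVICT_BACKFILL = $(MachineBusy)

    # Where the BOINC client lives and how it attaches to Einstein@Home
    # (the attach arguments and account key below are illustrative).
    BOINC_Executable = /usr/local/boinc/boinc_client
    BOINC_InitialDir = /usr/local/boinc
    BOINC_Owner      = nobody
    BOINC_Arguments  = --attach_project http://einstein.phys.uwm.edu <account_key>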

The red “unclaimed” time in Figure 1 (about 2.5%) is an upper bound on the time that actually went unclaimed by both backfill and Condor jobs, because this graph does not take into account the state of the GLOW “suspension slots”. These are Condor batch slots specially configured to suspend jobs when there is other, higher-priority activity on the machine’s normal job execution slots. This feature was added to GLOW for ATLAS, because they wanted to scavenge unused CPU time. They needed their preempted jobs to be suspended/resumed rather than killed/restarted, because their long-running simulation jobs were incompatible with Condor’s checkpointing libraries.
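
To illustrate the idea, the sketch below shows one way such suspension slots can be expressed in a slot-based Condor configuration. The slot layout, knob and attribute names, and thresholds are illustrative assumptions; the configuration actually deployed on GLOW differed in its details, and older Condor versions used “virtual machine” rather than “slot” terminology for these settings.

    # Hypothetical "suspension slot" setup for a dual-CPU node.
    # Advertise one extra CPU so a third, low-priority slot can be carved out.
    NUM_CPUS = 3
    SLOT_TYPE_1 = cpus=1, memory=45%
    NUM_SLOTS_TYPE_1 = 2
    SLOT_TYPE_2 = cpus=1, memory=10%
    NUM_SLOTS_TYPE_2 = 1

    # Publish each slot's activity into the other slots' ads so the extra
    # slot can see what the normal slots are doing.
    STARTD_SLOT_ATTRS = Activity, State

    # On the extra slot (slot3), suspend the job whenever either normal slot
    # is busy, and let it continue once both are idle again.
    SLOT3_SUSPEND  = (slot1_Activity =?= "Busy") || (slot2_Activity =?= "Busy")
    SLOT3_CONTINUE = (slot1_Activity =?= "Idle") && (slot2_Activity =?= "Idle")

Configured this way, a scavenging job occupies cycles only while the normal slots are idle; when owner or opportunistic work arrives, the job is suspended in place rather than killed, and it resumes once the machine quiets down.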

Figure 2 - Fraction of CPU Hours (Total 23.4 Million) used between 01/31/2004 and 12/25/2006

Figure 2 shows the fractional usage of GLOW by the participants. Not all groups were equally successful in using their “share” of the available resources over this time period. However, in deadline-driven activities, success may depend on the ability to engage in large bursts of computation rather than on cumulative usage. A good example is the IceCube/AMANDA group, which was not ready to use GLOW during the early months but then suddenly needed to run a high-priority data filtering process. They succeeded in doing this and, in the process, consumed 12 CPU-years in one month.

Guest users on GLOW managed to claim a significant amount of CPU time. These are researchers who were sponsored by one or more GLOW members. Although they did not have “ownership” priority on any machines in GLOW, they competed with equal priority against other opportunistic users for machines unclaimed by their owners. Many of the guests were sponsored by the Condor CS group, whose research interest in GLOW was less in using CPU hours directly than in finding interesting use cases for Condor. The time used by guests was therefore well compensated to the other GLOW members by the time on CS machines that was not directly claimed for CS research.

GLOW’s connection to the Open Science Grid resulted in some usage of GLOW by various OSG virtual organizations. This will be addressed in the section on GLOW and the OSG. It should be noted here that the 320,000 hours attributed to OSG usage in Figure 2 do not include the large amount of time used by CMS and ATLAS through the OSG interface. This is because the GLOW CMS and ATLAS groups both happened to be large users of OSG as well as of GLOW, and at various times they each used both interfaces to GLOW: direct Condor job submission and submission through the UWMadisonCMS OSG gatekeeper via Condor-G. If the usage by these groups that happened to pass through the OSG gatekeeper is added, the total OSG usage rises to 1.7 million hours.
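
As a point of reference for the two interfaces mentioned above, a direct submission is an ordinary vanilla- or standard-universe Condor job, while submission through the gatekeeper uses a Condor-G (grid universe) submit description of roughly the following form. The gatekeeper hostname, executable, and file names here are placeholders, not the actual UWMadisonCMS endpoint.

    # Hypothetical Condor-G submit description routing a job through an OSG
    # gatekeeper into GLOW (hostname and file names are placeholders).
    universe      = grid
    grid_resource = gt2 osg-gatekeeper.example.edu/jobmanager-condor
    executable    = analyze.sh
    output        = job.out
    error         = job.err
    log           = job.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue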

GLOW Results

The High Energy Physics CMS research group of Professors Sridhara Dasu and Wesley Smith.

The CMS collaboration uses several thousand CPUs distributed among institutions world-wide to satisfy its computational needs. The UW CMS group was one of the largest contributors to the CMS collaboration’s detector simulations in the past few years. The UW CMS group has simulated and reconstructed over 50 million events using GLOW/CMS-Tier-2 resources. This effort would not have been possible without the GLOW resources and the close collaboration with the Condor group. CMS physicists, including our students, will be able to prepare their trigger, data acquisition, and analysis strategies, and the associated software, for the physics program that will begin soon.

In early 2005, the UW CMS group was chosen to become one of the seven US Tier-2 computing centers for CMS. Because the GLOW model was working well for us, we chose to make our additional CMS Tier-2 resources part of the GLOW Condor pool. US CMS operates as part of the Open Science Grid, so we maintain an OSG gatekeeper, which has provided a link into GLOW for international CMS as well as for the many other users of OSG.

The Laboratory for Molecular and Computational Genomics (LMCG) under Professor David Schwartz.

LMCG has developed the first practical single-molecule platform for whole-genome analysis, named Optical Mapping. This platform produces very large genome datasets that can only be interpreted after significant computational effort. For example, it is not uncommon for the laboratory to submit jobs consisting of hundreds of thousands of human DNA “barcodes”. The computation consists of aligning these barcodes with a reference (the current human genome sequence) and then assessing differences, which manifest themselves as uniquely scored mutations. We are working closely with Prof. Michael Waterman (University of Southern California), one of the fathers of bioinformatics, to develop improved algorithms and software for such operations. Without the GLOW/Condor system, this research would not be possible. The availability of this resource has spurred our laboratory and Prof. Waterman’s group to seek new solutions to the discernment of human diversity, based on readily available massive computational resources. In fact, the lack of such resources at USC has given Prof. Waterman cause to merge his efforts more closely with our own.