SURA Cyberinfrastructure Workshop: Grid Application Planning & Implementation
January 5 – 7, 2005, Georgia State University
Notes from breakout sessions on: How to Build a Grid/Different Grid Technologies (January 7, a.m.)
Facilitated by: Phil Emer, MCNC
Scribe: Mary Trauner, Georgia Institute of Technology
A number of key questions were posed throughout the breakout session. The following notes are organized by question coupled with the discussion that took place for each.
A grid can be considered as an access method or platform.
1) What are considered reasonable platforms for compute and storage?
· Anything with a power supply?
· Anything in a controlled environment?
· Anything in a controlled lab?
· State-wide resources
Phil stated that at MCNC, the decision was:
· Not a desktop
· Something in a controlled environment
· Something that peers across universities had built in a consistent manner.
Amy Apon (U of Arkansas) worked with students to build an Xgrid (Apple) that included some Linux systems. This led to a course on building grids and tools for grids.
2) Is anyone considering grids as a way to harness unused cycles and/or to avoid a [HPC] purchase?
One person has done so with Avaki; they are currently looking at United Devices. The Avaki grid has paid off.
· Now Avaki software is basically free to universities
· Support for it [the fee for support] is reasonable.
3) What middleware is reasonable?
<Discussed later>
4) What manpower is needed?
<Not discussed>
5) Should grids support heterogeneity?
If you support multiple sites, campuses, virtual organizations, it [heterogeneity] will happen. For example, MCNC supports a grid comprised of:
· Duke: Linux with Sun Grid Engine and Ethernet
· NC State: A Sun SMP with Grid Engine
· UNC: Linux with Myrinet and LSF
6) Does Globus work in a heterogeneous environment?
One site has implemented this with GT2. There is uncertainty and concern about the next version since it uses web services; some in the group were unsure whether this is a step forward.
NMI provided a stable stack. Now there is a bifurcation between those building infrastructure and those at virtual organizations working from the application interfaces. Where do those interfaces tie in? At this stage, end users present an application to their infrastructure people for a grid solution.
7) Who has built a grid?
Jorg Schwarz (Sun) responded from two perspectives:
Within an institution, you just need:
· Resource manager
· Directory structure (LDAP, Punch, etc.)
· Accounting
Across universities, add:
· GT2
· A portal
· A command line interface
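For the directory piece Jorg listed, the following is a minimal sketch of verifying a user against a campus LDAP directory, assuming the Python ldap3 package; the directory host, base DN, and attribute names are placeholders, not anything discussed in the session:

    # Sketch: look up a grid user in a campus LDAP directory.
    # Assumes the ldap3 package; host, base DN, and attributes are placeholders.
    from ldap3 import Server, Connection, ALL

    def lookup_user(uid):
        server = Server("ldap://directory.example.edu", get_info=ALL)
        conn = Connection(server, auto_bind=True)   # anonymous bind for the sketch
        conn.search("ou=people,dc=example,dc=edu",
                    f"(uid={uid})",
                    attributes=["cn", "mail", "eduPersonAffiliation"])
        return conn.entries[0] if conn.entries else None

    if __name__ == "__main__":
        print(lookup_user("jdoe"))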
8) Is a grid an HPC solution or a network piece?
Several responded that they considered it HPC. One person commented that he considered it HPC because it appeared to be an outgrowth of how the NSF centers approached doing research.
We discussed the terms Grid versus grid (big versus little “g”). What must a network provider provide for a grid beyond what an ISP might provide?
· Earlier availability
· Something only a university will consume
· A grid gateway
· Beyond bandwidth
o Network to compute
o An access tier to resources
Application-specific grids tend to want a portal (veneer), whereas general grid users tend to prefer lower-level access like command-line interfaces.
This launched into a discussion of “point of view”. Utility and presentation become important based on point of view when deciding access methods.
An example posed was a backup service. This still leverages middleware, but the use of a backup service is more “grid” than “Grid”.
9) How does one build a grid file system or data grid while immersed in an AFS solution?
<Discussions of this tended to be spread out across several subsequent questions.>
10) How long does it take to build a grid?
This depends on how many pieces you need. (Kerberos, AFS, etc.?) You must start with:
· Authentication
· Data access
· Local load sharing environment
This led to the question “Is a grid a cluster or vice versa?”
Authentication, data access, and load sharing help answer this. Whether your users are from a single administrative domain or multiple domains is also part of the answer.
Certificate authorities were discussed. Multiple CAs within a grid (i.e., different policies for each compute resource) are problematic. MCNC got around this [campus access/use policies] by purchasing the equipment that was placed at each university, thus providing unified policies, authentication, etc.
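To make the single-CA model concrete, here is a minimal sketch of generating a user key and a certificate signing request for a grid CA to sign. It uses the modern Python cryptography package purely as an illustration; the organization and user names are placeholders, and a production grid would follow its own CA's request procedure:

    # Sketch: generate a user key pair and a certificate signing request (CSR)
    # that a single grid CA would sign. Uses the "cryptography" package;
    # the subject names are placeholders.
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa

    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([
            x509.NameAttribute(NameOID.ORGANIZATION_NAME, u"Example Grid VO"),
            x509.NameAttribute(NameOID.COMMON_NAME, u"jdoe"),
        ]))
        .sign(key, hashes.SHA256())
    )
    with open("userkey.pem", "wb") as f:
        f.write(key.private_bytes(
            serialization.Encoding.PEM,
            serialization.PrivateFormat.TraditionalOpenSSL,
            serialization.NoEncryption(),
        ))
    with open("usercert_request.pem", "wb") as f:
        f.write(csr.public_bytes(serialization.Encoding.PEM))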
11) What should I do in a closed environment with multiple OS platforms, where the applications have dependencies on certain libraries?
What middleware and DRM [Distributed Resource Manager] will you use? DRM options include:
· Condor G
· PBS
· Grid Engine
· LSF
· Maui
· Load Leveler
The DRM provides:
· Execution-host clients to monitor loads (memory, CPU, etc.)
· Submit-host/master client to collect execution-host information, build queues
· No interactive access to execution nodes (interactive nodes are generally provided separately)
Note that Globus does not provide a DRM. GRAM is the Globus Toolkit interface between Globus and the DRM.
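Below the GRAM layer, the DRM is what actually queues and runs work. A minimal, hedged sketch of submitting a job through DRMAA, the common API that Grid Engine and several other DRMs expose, assuming the drmaa Python bindings and a configured DRM cell; the command and arguments are placeholders:

    # Sketch: submit a batch job through DRMAA, the common API over DRMs such
    # as Grid Engine. Assumes the drmaa-python bindings and a configured DRM.
    import drmaa

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/bin/sleep"   # placeholder executable
        jt.args = ["60"]
        job_id = session.runJob(jt)
        print("submitted job", job_id)
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print("exit status:", info.exitStatus)
        session.deleteJobTemplate(jt)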
Another thing to note is that GridFTP is not needed in a local environment; AFS, scp, or other tools would be used instead.
And scheduling across clusters is not handled by Globus out of the box. Meta-schedulers can be built to handle this. It usually requires some building or scripting. Grid Engine (Sun) will do this.
12) Are there some good grid terminology resources?
Several mentioned the IBM Redbook as a good source [title and/or ISBN?]
13) What layers do you need to connect clusters into a grid?
First, you need to identify an initial application to exercise the grid, so this question is hard to answer completely without knowing the application.
Jorg (Sun) disagreed, saying an authentication platform that could verify user info was enough to do a general-purpose grid. He proposed the following diagram.
LDRM: Local resource manager
A or B: SMP system, cluster, set of clusters, particular applications, etc.
(Phil noted that this is a compute-centric grid. A data grid may be different.)
For example, someone could log onto the portal at A to submit a Blast job. B may be the Blast server. B will get the job, know how to get data from A and return the results to A. Something like GridLab metascheduling was recommended for review.
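As a hedged illustration of the A-to-B flow just described, the pre-web-services GT2 command-line tools (globus-url-copy and globus-job-run) could be driven from a small script like the one below; the host names, paths, and the Blast invocation are placeholders:

    # Sketch of the A -> B flow: stage input from site A to compute site B,
    # run the job at B, and pull results back to A. Uses GT2 command-line
    # tools via subprocess; hosts, paths, and the Blast command are placeholders.
    import subprocess

    SITE_A = "gridftp.site-a.example.edu"
    SITE_B = "compute.site-b.example.edu"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Stage the query data from A's GridFTP server to B.
    run(["globus-url-copy",
         f"gsiftp://{SITE_A}/data/query.fasta",
         f"gsiftp://{SITE_B}/scratch/query.fasta"])

    # 2. Submit the job to B through GRAM (B's local DRM does the scheduling).
    run(["globus-job-run", SITE_B,
         "/usr/local/bin/blastall", "-i", "/scratch/query.fasta",
         "-o", "/scratch/query.out"])

    # 3. Bring the results back to A.
    run(["globus-url-copy",
         f"gsiftp://{SITE_B}/scratch/query.out",
         f"gsiftp://{SITE_A}/data/query.out"])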
14) How do you move data? Who gets to access input and output? How does it [data] get in and out?
This is not necessarily complicated, but it requires some choices for middleware selection.
When is the data replicated? How does data location affect performance? When do you need to add data access servers?
Data grids have similar authentication issues but add:
· Data access method
· Replication
· Tools
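One toy way to frame the replication and data-placement questions is to pick a replica based on observed transfer performance. The catalog and rates below are made up purely for illustration:

    # Toy sketch: pick a replica of a dataset based on observed transfer rates.
    # The replica catalog and the rates are made up for illustration.
    replica_catalog = {
        "genome-db-v3": [
            {"host": "storage.site-a.example.edu", "path": "/data/genome-db-v3"},
            {"host": "storage.site-b.example.edu", "path": "/data/genome-db-v3"},
        ],
    }

    observed_rate_mbps = {          # e.g. from recent GridFTP transfers
        "storage.site-a.example.edu": 240.0,
        "storage.site-b.example.edu": 620.0,
    }

    def best_replica(dataset):
        replicas = replica_catalog[dataset]
        return max(replicas, key=lambda r: observed_rate_mbps.get(r["host"], 0.0))

    print(best_replica("genome-db-v3"))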
Virtual LAN and Cluster on Demand were mentioned. (??)
15) How do you deploy a consistent image across multiple OS platforms?
Consistency is difficult. Adding the middleware complicates it. Application domain specificity will simplify it.
Maytal Dahan (TACC) described their hub-and-spoke trust philosophy. Phil (MCNC) asked if this wouldn't implode into forcing homogeneity. Victor Bolet (GSU) said it would converge on the use of standards, not on a particular OS or platform.
Maytal went on to say that they are dealing with standards and middleware. She added that researchers think grids are still just too hard to use, so portals are important. Now that portals are becoming standardized, they can interoperate, so which one you choose isn't as important; you don't have to adopt a particular one.
16) Are grids good? How? Where are we going or trying to go?
We should eventually look at grids supporting science and research the same way we look at the black box the Internet has become for us today. Scheduling, certificate authorities, etc. should be transparent.
So how do we build the black box? Visualization and workflow aspects are important to how the end user interacts with the results (and the speed and quantity at which they receive them). As infrastructure people, we may not have those skills. Maytal (TACC) said some middleware may be heading in this direction.
Visualization widgets, engines, and instruments may be something we need to think about and consider more.
Kazaa and peer-to-peer applications are grid-like things; grid is like another version or “second coming” of them. Grids need to “spew” services that just work, in much the same way the peer-to-peer applications do.
But the basic or fundamental requirement remains: run an executable that reads data and returns results.
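That fundamental requirement is small enough to write down. A minimal sketch of the unit of work every layer above ultimately boils down to, with local execution standing in for whatever the middleware (GRAM, a DRM, a portal) would actually do:

    # Minimal sketch of the fundamental unit of grid work: an executable that
    # reads input data and returns results. Local execution stands in for
    # whatever middleware actually runs it.
    import subprocess
    from dataclasses import dataclass

    @dataclass
    class GridJob:
        executable: str
        arguments: list
        input_path: str
        output_path: str

        def run_locally(self):
            with open(self.input_path, "rb") as inp, \
                 open(self.output_path, "wb") as out:
                return subprocess.run([self.executable] + self.arguments,
                                      stdin=inp, stdout=out, check=True)

    job = GridJob("/bin/sort", ["-n"], "numbers.txt", "numbers.sorted")
    # job.run_locally()  # reads numbers.txt, writes numbers.sorted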
17) Are there any automatic methods?
There is no “grid in a box.” The NMI toolkit is a good place to start.
Choosing the Globus API (2, 3, or 4) is a big question. (See below)
Implementing a Metascheduler: Rudimentary scheduling, like round-robin, isn’t too difficult if you have some sort of access control.
· CSF: Community Scheduler Framework
· VGRS (VeriSign Global Registry Services)
· LSF has a multi-cluster solution, but it is very expensive
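As a toy illustration of how little rudimentary round-robin metascheduling needs once per-cluster submission is already working, consider the following sketch; the cluster names are placeholders and the submit function is a stub standing in for GRAM, qsub over ssh, DRMAA, or similar:

    # Toy round-robin metascheduler: hand each incoming job to the next cluster
    # in the list. Assumes each cluster already has a working submit path;
    # cluster names here are placeholders.
    import itertools

    clusters = ["cluster-a.example.edu", "cluster-b.example.edu", "cluster-c.example.edu"]
    next_cluster = itertools.cycle(clusters)

    def submit(job_script, cluster):
        # Placeholder: in practice this would call the cluster's real submit path.
        print(f"submitting {job_script} to {cluster}")

    def metaschedule(job_scripts):
        for script in job_scripts:
            submit(script, next(next_cluster))

    metaschedule(["blast1.sh", "blast2.sh", "blast3.sh", "blast4.sh"])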
Globus Toolkit Review:
· Pre-“web services” components are pretty safe and stable. 2.4 is the most stable for infrastructure.
· Those looking for things like grid applications have gone with 3.0. There is concern about 4.0, which uses WSRF (delayed until April). If 4.0 is not rock solid, the results (difficulties?) will be dramatic.
· Today, for production services, run 2.4 (or some 2.x). Run 3.9.4 if you want to play with the WSRF beta. (Comments that it crashes a lot?)
Other grid toolkits:
· .NET: OGSI is going away
· WSRF <missed those comments>
· SRM/SRB: Jefferson Lab
· Unicore: Grid toolkit developed in Europe
Phil (MCNC) mentioned that they wrote an initial “kick start” guide for an enterprise grid and sketched a diagram of it [diagram not reproduced in these notes].
18) Should we ask SURA to host a mailing list on building grids?
If interested, let Mary Fran know ()
19) Afterthoughts from the facilitator (Phil Emer/MCNC)
I think that these notes do a great job of capturing the flow of the conversation and the types of questions that were on the table (nice, Mary!). I am not quite sure what to do with these notes though…some workshops are probably in order…My gut feeling based on listening to the types of questions folks were asking is that this is what people want/need to do vis-à-vis building grids:
a) Provide "elegant" access to centralized HPC resources. So grid as a front end access method to HPC resources. At MCNC we call this the enterprise grid. Several attendees mentioned wanting to provide such an interface to a collection of heterogeneous resources (though in most cases the resources were not distributed). You don't really need grid to do this as you can apply DRM's like LSF and grid engine. Adding a grid interface here allows applications and users to access the resources in a more transparent and potentially cross-domain kind of way. IMO adding the grid access method makes access to high throughput computing accessible to folks that are not traditional command-line driven scientists.
b) Save money and increase user happiness by using resources more effectively. There is a bit of momentum building around the notion of cluster or resource on demand: for instance, having a pool of resources, imaging a system on the fly to support a particular application for a particular user, and releasing the resource when done. Some (including folks at NC State) believe that it may be cheaper and easier to build this kind of system than the land of milk and honey (grid), where you somehow apply middleware to any combination of hardware and software platform and get a consistent, deterministic result. Some applications may simply run cheaper or faster on a particular OS, so deal with that: image an optimized system on the fly and release the resources when you're done. The only problem is that this is almost anti-grid, in that it punts on the notion of being able to build the perfect middleware stack.
c) Build more effective Virtual Organizations that share data, applications, tools, gear, etc. The first two examples above are more from the point of view of an organization that is in the business of running infrastructure – a service provider. Here the perspective is a user group. The Florida example comes to mind, where a biologist is providing services to biologists and some grid tools make sense for managing computation, data management, application support, and the like. Defining workflows and building portals that approximate those workflows, while maintaining the notion of access control and "ownership", comes to mind here.