DataGrid

WP01 Annual Report

Period: January to December 2002

WP01: Grid Work Scheduling

Document identifier: DataGrid-01-D12.18-0137-1_0
Date: 08/11/2002
Work package: WP01: Grid Work Scheduling
Partners: CESNET, DATAMAT, INFN, PPARC
Lead Partner: INFN
Document status: DRAFT
Deliverable identifier: D12.18
IST-2000-25182 / CONFIDENTIAL

1.WP01 Annual Report

1.1.Objectives of the WP

The software development activities originally planned for the second Project Year, and towards the second major software release, were (list taken from the first annual report):

  • Support for interactive jobs, according to application needs
  • Support for job partitioning
  • Specification of job dependencies
  • Triggering of file transfers by the RB
  • Integration of network information into scheduling policy
  • Development of APIs for the applications
  • Development of GUI components
  • Deployment of Accounting infrastructure (integrated with the scheduling policies) over the testbed
  • Integration of advance reservation (co-allocation) services into the RB

The analysis of design shortcomings and operational/scale issues coming from Testbed 1 was of course also to be reflected in the Release 2 components. The major design choices that proved to be wrong in Testbed 1 were:

  • The choice of a long-lived, “monolithic” process for the Resource Broker server. This left it particularly exposed to crashes and memory leaks originating in the underlying middleware layers, and made such failures harder to identify and troubleshoot.

For Release 2, the long-lived server component is a much simpler “network server”, and the interaction with CondorG and Globus is left to single-threaded services.

  • Several repositories for persistent job information (related to the major WP1 components before internal integration) were kept, using commodity relational database back-end servers. This proved to be a significant deployment complication, and a bottleneck on any node configured as a Resource Broker.

In Release 2, a single relational DB serves the Logging and Bookkeeping services (with a hierarchy that is possibly different from the RB one). The rest of the critical communication in the job life-cycle is achieved through the filesystem.
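This report does not describe how the Release 2 components actually use the filesystem. Purely as an illustration of the general idea of passing job requests between processes through a directory instead of a database, a minimal sketch could look as follows. All names here (`enqueue`, `drain`, the `.req` suffix, the JSON encoding) are hypothetical and are not taken from the WP1 software.

```python
import json
import os
import tempfile
import uuid


def enqueue(queue_dir, request):
    """Publish a request in queue_dir so a reader never sees a partial file.

    Hypothetical helper for illustration only; not the WP1 implementation.
    """
    os.makedirs(queue_dir, exist_ok=True)
    # Write the payload to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as handle:
        json.dump(request, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # Publish it under its final name with an atomic rename.
    final_path = os.path.join(queue_dir, f"{uuid.uuid4().hex}.req")
    os.replace(tmp_path, final_path)
    return final_path


def drain(queue_dir):
    """Read and remove every published request, ignoring unfinished files."""
    requests = []
    for name in os.listdir(queue_dir):
        if not name.endswith(".req"):
            continue  # temporary file still being written by a producer
        path = os.path.join(queue_dir, name)
        with open(path) as handle:
            requests.append(json.load(handle))
        os.remove(path)
    return requests
```

In such a scheme a crash of the consumer loses no requests, since unread files simply stay in the directory until the consumer restarts. The sketch makes no ordering guarantee and assumes a single consumer per directory.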

1.2.Technical achievements

WP1 spent considerable effort supporting the deployed Release 1 software during the various experimental productions run by the DataGrid application users, and obtained significant feedback on the design in the process.

Work progressed in all of the “new” development areas listed in the previous section. The new RB components for Release 2 (now slated for integration in the testbed approximately in March 2003) were described in the D1.4 deliverable document, and will be demonstrated at the second project review.

1.3.Issues and actions

We consider the issues that WP01 reported in the first Annual report to be satisfactorily addressed in the current mode of operation for the project.

Here are the main issues that emerged during PY2:

Issue 1:
During PY2, the priorities of the project were shifted to:
a) Ensuring software stability, by troubleshooting and addressing bugs and scalability issues in the deployed versions of the EDG software and the underlying middleware.
b) Revising and improving the software packaging, documentation, delivery and testing procedures, to achieve better quality.
Both measures are beneficial to the project at large, but had the side effect of slowing down the development of the new/improved software functionality listed in Section 1.1. (a) requires identifying and spending additional effort on the maintenance of Release 1. (b) requires additional time and effort to develop and document "public" test suites.

Action(s):
WP1 has limited scope for re-assigning manpower, because the WP contributors are geographically dispersed. Additional effort, estimated at a total of 1 FTE, could be gathered for documentation and test suite development. This is clearly not enough. In addition, a specific response team was assigned to address Bugzilla reports promptly, but the 'task force' approach to application testing is channelling most issues directly through the WP managers and the Integration Team.
WP1 currently sees no effective measure to contain the additional time required to address the new priorities. It is still felt that the re-factored WP1 services foreseen for Release 2 would benefit the stability of the project at large.

Issue 2:
In order to support the application production activities on the testbed effectively and in a timely manner (especially in the case of the Atlas and CMS 'task force' efforts), troubleshooting needs to be conducted on a cross section of the deployed middleware, and not only on the EDG-provided software. As an example, issues with failing file transfers from Computing Elements (occurring only under testbed load) required a rewrite of the Globus GASS cache, touched resource limits on the testbed CE hosts, and uncovered problems in the GRAM protocol and in the implementation of the "open" PBS local resource management system used at most testbed sites. As the testbed stability requirement makes it highly impractical to keep updating to the latest versions of extensively changing software, some support structure for the integrated EDG software solution needs to be provided. We feel this to be missing from the current project organisation. WP1 is responsible for only about 70000 lines of code of the integrated EDG solution (about 25% of the current EDG code base, about 5% of the entire EDG integrated software) and can provide proportional support.

Action(s):
Several times during PY2, WP1 was asked or volunteered to troubleshoot and address urgent show-stopping issues (from the point of view of applications testing). This required delving well beyond the system entry point provided by WP1, and extremely deep into the internals of the underlying middleware. WP1 can afford to do this only on a good-will, one-time basis. Although possibly useful from the project standpoint (at the cost of additional delay to WP1 development), this is not felt to be a fair, system-level way of addressing the issue at hand. A system-level solution is still much needed.

1.4.Plan for next year

A realistic assessment of the current DataGrid standpoint and priorities leads WP1 to target the end of the project for the actual deployment in the testbed of the functionality listed in Section 1.1. The updated core components are now undergoing internal testing by WP1, and provide the hooks for connecting the foreseen extended functionality. The release plan for Release 2 provided by the Project technical management will be followed and supported. Significant effort in PY3 will also be spent gathering feedback from the application users about the extended Workload Management functionality, so that a possible new design cycle can be accommodated outside the scope of the current project. Support and refinements will be provided as planned, along with comprehensive testing.

1.5.Summary

Design limits and bugs were identified in the WP1 (and quite often also in the underlying) middleware, and addressed in order to make the resource selection services provided by WP1 for Testbed 1 scale to the level of production runs by the DataGrid application users. A new organisation for the WP1 core services, taking into account the feedback from Testbed 1 and the need to support extra functionality, was designed (as documented in D1.4 and ancillary documents) and implemented (as part of software Release 2). This will be integrated and deployed onto the testbed according to the current project plans.
