CDF Offline Initiative

Closure Report

Version 1.0

25 November 2008

Robert D. Kennedy, et al.

Table of Contents

Initiative Overview

Introduction

CD Charge to the Initiative

Project Team

Project Repository

Project History

Reasons for Closing the Project

State of the Initiative Objectives

1. CAF Adaptation to FermiGrid

1a. (WBS 4.5) Deploy New FermiGrid CAF Head Nodes

1b. (WBS 4.5) Migrate GroupCAF Nodes into FermiGrid CAF

1c. (WBS 4.6) Adopt GlideinWMS

2. Critical Node Upgrades

3. Issue Tracking System

4. Monitoring Framework

4a. (WBS 2.1) Low-Level Monitoring Framework (Zabbix)

4b. (WBS 4.3) User Monitoring Framework for Grid Infrastructure

5. CDF Offline Architecture

6. CDF Offline Operations

Lessons Learned

Successes

Opportunities for Future Improvements

Personnel Issues

Other Comments

Next Steps

Summary

Initiative Overview

Introduction

The CDF Offline Initiative is an umbrella project to achieve a broad set of goals by providing management consultation, adding effort to handle transition demands, and applying a modest level of project management formalism to track and sustain progress. The Initiative helps plan, prioritize, and coordinate work amongst the existing personnel and groups, with an eye towards identifying and reducing the long-term maintenance and support requirements of RUN2 experiments.

CD Charge to the Initiative

The charge to the Initiative from the Computing Division can be summarized as “do what is important in six months to get CDF Offline Computing up to production grade”. Based on consultation with the major stakeholders, existing work plans, and the recommendations of the CDF CAF Task Force Interim Report (v4.02) and its associated status document (2/27/2008), a number of highest-priority objectives were identified to support this charge. The Initiative Objectives are:

1.  CAF Adaptation to FermiGrid (added March 2008)

2.  Critical Node Upgrades

3.  Issue Tracking System

4.  Monitoring Framework

5.  CDF Offline Architecture

6.  CDF Offline Operations (added April 2008)

The Initiative started by capturing all of the known upcoming work of CDF Offline in a high-level work list, which was itself a deliverable to help guide long-term planning. Realistic estimates of task durations and staff availability indicated that, as expected, not all of the envisioned work could fit into the Initiative. Tasks were either accepted into the work scope of the Initiative (“scheduled” tasks) or not (“unscheduled” tasks). Major high-priority tasks were identified and scheduled in finer detail.

Project Team

Director: Margaret Votava (CD ILC/DAQ Deputy Dept Head, now REX Dept Head)

Manager: Robert D. Kennedy (CD OPMQA)

The Advising Committee consisted of the Project Director, Project Manager, and:

·  Jerry Guglielmo (CD ILC/DAQ Dept Head, now LSC Quad Assoc. Leader)

·  Rick Snider (CD REX Dept Head, now REX Deputy Dept Head; CDF Collab and co-leader of CDF Offline)

·  Donatella Lucchesi (INFN – Padua, CDF Collab and co-leader of CDF Offline)

Later joined by:

·  Dennis Box (on assignment to Initiative and REX Dept)

·  Joe Boyd (on assignment to Initiative, in REX Dept)

The CDF Offline Computing liaison role was performed largely by Rick Snider and Donatella Lucchesi as co-leaders of CDF Offline Computing during this Initiative.

Project Repository

Project Management and some subject matter documentation are maintained in the Google Group "CDF Offline Initiative". This is organized into several distinct areas, the most useful of which may be the "Pages". Some examples include:

·  Initiative Schedules: MS Project plans and related reports tracking the Initiative.

·  Work Lists: All tasks in CDF Offline that might be done. Led to the first WBS.

·  JIRA Configuration: Notes on configuring JIRA for CDF Offline.

The Fermilab JIRA Issue Tracker contains operations and development issue tracking information dating back to its deployment in April (beta)/May (production) 2008.

We chose to use standing CDF Offline meetings as-is to track progress in the Initiative since one of the initial concerns was the high workload of the CDF people involved in supporting Offline. The CDF meetings used, the CDF Offline Operations and the CDF Offline Development meetings, are documented in WebTalks.

Project History

Offline Issues Evaluation: January 2008 – February 2008

•  January 2008: Jerry Guglielmo, Margaret Votava interview CDF Offline team.

•  February 2008: Rob Kennedy joins Initiative team, drafts CDF Offline work lists.

•  February 2008: Initiative project concept developed: six-month umbrella project.

•  February 2008: CDF CAF Task Force Interim Report with status annotations

Initiative Planning, Product Evaluation: February 2008 – May 2008

•  February – May 2008: Furloughs and “forced vacations” complicate staffing.

•  February 2008: JIRA and Zabbix product evaluations.

•  March 10, 2008: First full “CDF Offline Work List” v0.7 released for comment.

•  March 19, 2008: Executive presentation of Initiative plan to CD leadership.

•  April 9, 2008: Initiative Introduction presented at CDF Offline Operations Mtg.

•  April 10, 2008: First resource-loaded schedule draft v0.5.0, with Work List v1.0 embedded. Scheduled tasks are in-scope, unscheduled tasks are out. Execution begins.

•  April 18, 2008: Joint CDF-CD Executive Mtg with overview of Initiative.

•  April 28, 2008: Baseline Planning Meeting (using plan v0.8.3)

•  May 10, 2008: Baseline Project Plan v1.0.0 released.

Execution: April 2008 – October 2008

•  April 2008: JIRA evaluation and initial integration into Offline support processes.

•  April 2008: Offline Project draft re-org released. Some future roles unclear.

•  May 2008: ShadowCAF set up. High-priority KCA Upgrade work all month. Staff issues and repeated minor problems in the head node task chain.

•  Mid-June 2008: FermiGrid Head Node progress deemed unacceptable. Task force formed to re-organize the work, one group to create a tagged “CAF” software release and the other to undertake FermiGrid CAF scale tests. CDF effort redirected from monitoring objectives to the task force work. Production and Ntuple coordinators begin basic operations tests on FermiGrid CAF.

•  July 2008: Offline Project re-organization fully implemented.

•  July 2008: Jerry Guglielmo leaves Initiative. Low-Level Monitoring work ends.

•  Early August 2008: Dennis Box and Joe Boyd join Initiative.

•  September 2008: CafCondor Config v2.0: First formal tagged release, able to be wiped and re-installed with high reproducibility. Final testing by production users.

•  October 2008: New CAF head node with tagged release in production. Begin shifting nodes and production users to the upgraded FermiGrid CAF (a.k.a. CdfGrid).

Close-Out: October 2008 – November 2008

•  October 15, 2008: Enough objectives judged by Initiative Advising Committee to be accomplished or on a smooth path to completion. Close-out is a transition of responsibility to new Line Management and existing Offline Management with the opportunity to take stock of the Initiative experience.

•  October 31, 2008: Executive Meeting of Initiative Advising Committee with CD and CDF Heads on Close-Out. Well received overall.

•  November 7, 2008: Drop-dead date for task completion to be documented by Initiative.

Status at Completion:

•  19 of 32 Major Milestones are completed or will be completed by the drop-dead date.

•  8 of 32 Major Milestones are in progress, but unlikely to finish by the drop-dead date:

    o  Miles 10, 17: GlideinWMS integration and deployment

    o  Mile 22: GroupCAF to FermiGrid CAF Migration

    o  Miles 3, 7, 15: CDF Offline Architecture documents

    o  Miles 23, 24 (SL4 Migration): Reduced in priority after FNAL support for SL3 was extended, removing the urgency of the migration. Work is continuing.

•  5 of 32 Major Milestones are “Closed, Incomplete”:

    o  Miles 5, 20 (Low-Level Monitoring – Zabbix): Closed to free up resources for higher priority work. Doing so may have saved effort overall, since this overlapped with work that FEF did on a later time-scale.

    o  Miles 9, 21 (User Monitoring Framework): Replanned to free up resources for higher priority work. Alternative proof of concept delivered Nov 2008.

    o  Mile 19 (Multiple schedd hosts for FermiGrid CAF): Slipped to mid-2009, until work in progress on FermiGrid CAF is completed. The need for this work should be re-evaluated at that time, before it is undertaken.

Reasons for Closing the Project

The CDF Offline Initiative was defined as a time-limited project to achieve what was possible in six months, effectively beginning in April 2008. In mid-October, it became apparent that three of the four high-priority Initiative objectives were accomplished or on a low-risk path to completion in the near future, though the fourth (user monitoring) was not completed as originally intended. Because the active Initiative leader had become the REX department head, the Initiative could transfer its remaining charge smoothly to the existing line management and experiment project management structures for completion of the remaining work. On 31 October 2008, we presented the Initiative accomplishments and a close-out plan to the CDF spokespeople and Computing Division Head at an executive meeting. The Initiative will continue to track only the “GroupCAF to FermiGrid CAF Migration” and “GlideinWMS Migration” task chains, which are expected to be completed by mid-December 2008.

State of the Initiative Objectives

1. CAF Adaptation to FermiGrid

This objective consisted of three main components necessary to achieve the envisioned future CAF system based on FermiGrid technology and having sufficient capacity to absorb existing CAFs on-site and meet long-term production and analysis demands.

1a. (WBS 4.5) Deploy New FermiGrid CAF Head Nodes

Goals:

·  Deploy new, more performant hardware in the critical head node role to replace aging hardware.

·  Create and use, for the first time in a while, stable tagged releases of CAF service software capable of reliable, reproducible installation.

·  Demonstrate that the new FermiGrid CAF system can manage the demand anticipated in the immediate future: 5k worker node (WN) slots.

·  Demonstrate the new FermiGrid CAF system can be used reliably by current production users of GroupCAF (Production, Ntupling, Calibrations, etc).

State at Closing: DONE

Deliverables:

·  CafCondorConfig v2.0 released; v2.1 is about to be released at this writing.

·  Scale tests successfully completed on the ShadowCAF, albeit emulating a reduced level of 3k to 4k WN slots.

·  Fcdfhead10 in production as head node of new CdfGrid CAF instance

Outstanding Risks (including Operations, Support):

·  The node head11 failed to operate reliably after several repair attempts, a potentially serious blow to the original two-head-node configuration. It was determined, however, that all services could be run on head10, and that doing so would in fact reduce the number of single points of failure (SPOF) by one node. The integration system head nodes are prepared to be used as a replacement if head10 should also fail badly.

·  If head11 is repaired, since it is no longer needed in its originally envisioned role, then it might be deployed to further reduce operational risk by acting as a back-up host to non-singleton services in the FermiGrid/GlideinWMS CAF system.

·  The ShadowCAF (a.k.a. sleeper pools) approach to testing the new CAF system at scale failed to operate at sufficiently large scale due to a Condor bug, reportedly fixed in Condor 7.1.4 (a development version made available in late November, after the Initiative closure). The pace of the WN migration will be sufficiently slow to permit the greater vigilance required in the transition to full-scale production to catch and resolve any unexpected at-scale limitation. After this bug is fixed, the ShadowCAF scale-testing will be revisited to prepare for GlideinWMS testing (see 1c, below).

1b. (WBS 4.5) Migrate GroupCAF Nodes into FermiGrid CAF

Goals:

·  Migrate all worker nodes (WNs) and users from the GroupCAF to the new FermiGrid CAF in order to support a CAF system built from fewer experiment-specific components.

State at Closing: IN PROGRESS, MIDWAY TO COMPLETION (early Nov 2008)

We expect smooth execution of this task chain from November to mid-December 2008. The plan is well-documented and considered to have low technical risk.

Deliverables:

·  WN migration plan for CdfGrid, completion by December 08, 2008 (see p.3)

Outstanding Risks (including Operations, Support):

·  While this task is open, the CAF operations team will have to support both the old and the new CAFs. This is likely to increase stress on the team for a time with more distinct services having to be supported.

1c. (WBS 4.6) Adopt GlideinWMS

Goals:

·  Adapt to using the GlideinWMS model for job workflow management.

·  Demonstrate the new FermiGrid CAF system using GlideinWMS can manage the anticipated demand in the future of 10k WN slots.

·  Migrate all WNs and users from the non-GlideinWMS FermiGrid CAF to the GlideinWMS FermiGrid CAF in order to support greater demand in the future.

State at Closing: IN PROGRESS, AT EARLY STAGE (early Nov 2008)

Work in progress consists of recovering the past proof-of-concept implementation, planning the integration of the required code base changes with modern tagged releases, and preparing an integration testbed.

Deliverables:

·  CdfGrid migrated to use GlideinWMS

·  GlideinWMS monitoring available to operations and users.

Outstanding Risks (including Operations, Support):

·  GlideinWMS adoption carries with it some risks:

    o  Does not operationally simplify the system:

        ·  Increases the number of Condor pools from 2 to 3

        ·  Increases the number of production head nodes from 1 to 3

    o  First-time installs now need an expert. Knowledge transfer is in progress.

    o  May miss the goal of being in production for winter conference use, since the work is starting behind schedule: the head10 replacement took much longer than planned.

·  While this task is open, the CAF operations team will have to support both the old and the new CAFs. This is likely to increase stress on the team for a time with more distinct services having to be supported.

·  Long-term support for the GlideinWMS component will be addressed in a briefing being organized by Eileen Berman.

2. Critical Node Upgrades

Goals:

·  Replace unreliable out-of-warranty servers with new more-performant servers, thus improving reliability and reducing operations effort.

Areas of Work:

2a. (WBS 6.1) Critical Node Upgrades: ICAF Nodes

2b. (WBS 7.5) Critical Node Upgrades: dCache File Servers

2c. (WBS 3.4) Critical Node Upgrades: Code Server