LSST Data Management Middleware Design LSE-152 07/25/2011

Large Synoptic Survey Telescope (LSST)

Data Management Middleware Design

Kian-Tat Lim, Ray Plante, Gregory Dubois-Felsmann

LSE-152

July 25, 2011

Change Record

Version / Date / Description / Owner name
1.0 / 7/25/2011 / Initial version based on pre-existing UML models and presentations / Kian-Tat Lim

Table of Contents

Change Record

1 Executive Summary

2 Introduction

3 02C.06.02.01 Database and File Access Services

3.1 Data Access Framework I/O Layer

4 02C.07.01.01 Control and Management Services

4.1 Event Subsystem

4.2 Orchestration

4.3 Data Management Control System

5 02C.07.01.02 Pipeline Construction Toolkit

5.1 Policy Framework

5.2 Pipeline Harness

6 02C.07.01.03 Pipeline Execution Services

6.1 Logging Subsystem

7 02C.07.01.07 File System Services

7.1 Data Access Framework Replication Layer

The LSST Data Management Middleware Design

1 Executive Summary

The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data Release Production, Calibration Products Production, and Level 3 processing, from details of the underlying hardware and system software. It enables flexible reuse of the same code in multiple environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters, with a common communication and logging model. It ensures that key scientific and deployment parameters controlling execution can be easily modified without changing code, while maintaining full provenance of the environment and parameters used to produce any dataset. It provides flexible, high-performance, low-overhead persistence and retrieval of datasets with data repositories and formats selected by external parameters rather than hard-coding. Middleware services enable efficient, managed replication of data over both wide area networks and local area networks.

2 Introduction

This document describes the baseline design of the LSST data access and processing middleware, including the following elements of the Data Management (DM) Construction Work Breakdown Structure (WBS):

  • 02C.06.02.01 Database and File Access Services
  • 02C.07.01.01 Control and Management Services
  • 02C.07.01.02 Pipeline Construction Toolkit
  • 02C.07.01.03 Pipeline Execution Services
  • 02C.07.01.07 File System Services

The LSST database design, which contributes to WBS element 02C.06.02.01 and other elements within 02C.06.02, may be found in the document entitled “Data Management Database Design” (LDM-135). WBS element 02C.07.04 is described in “LSST Cybersecurity Plan” (LSE-99). 02C.07.05 (visualization) and 02C.07.06 (system administration) are primarily low-level, off-the-shelf tools and are not described further here. 02C.07.08 (environment and tools) includes similar off-the-shelf tools as well as testbeds and other primarily-hardware elements.

Figure 1. Data Management System Layers.

Common to all aspects of the middleware design is an emphasis on flexibility through the use of abstract, pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the substantial computational and bandwidth requirements of the LSST Data Management System (DMS) force the designs to be conscious of performance, scalability, and fault tolerance. In most cases, the middleware does not require advances over the state of the art; instead, it requires abstraction to allow for future technological change and aggregation of tools to provide the necessary features.

3 02C.06.02.01 Database and File Access Services

This WBS element contains the I/O layer of the Data Access Framework (DAF).

3.1 Data Access Framework I/O Layer

This layer provides access to local resources (within a data access center, for example) and low-performance access to remote resources. These resources may include images, non-image files, and databases. Bulk data transfers over the wide-area network (WAN) and high-performance access to remote resources are provided by the replication layer within 02C.07.01.07 File System Services.

3.1.1 Key Requirements

The DAF I/O layer must provide persistence and retrieval capabilities to application code. Persistence is the mechanism by which application objects are written to files in some format or a database or a combination of both; retrieval is the mechanism by which data in files or a database or a combination of both is made available to application code in the form of an application object. Persistence and retrieval must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must be convenient for application developers to use. It must also be flexible, allowing the file format, or even whether a given object is stored in a file or in the database, to be selected at runtime in a controlled manner. Image data must be able to be stored in standard FITS format, although the metadata for the image may be in either FITS headers or database table entries.

3.1.2 Baseline Design

The I/O layer is designed to provide access to datasets. A dataset is a logical grouping of data that is persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of objects. Dataset types are predefined; each dataset is identified by a unique identifier and may be persisted into multiple formats.

A Formatter subclass is responsible for converting the in-memory version of an object to its persisted form (or forms), represented by a Storage subclass, and vice versa. The Storage interface may be thin (e.g. providing a file's pathname) or thick (e.g. providing an abstract database interface) depending on the complexity of the persisted format; all Formatters using a Storage are required to understand its interface, but no application code need do so. One Storage will represent the publish/subscribe interface used by the camera data acquisition system to deliver image data. A Storage is configured with a LogicalLocation to indicate where the object resides. Formatters and Storages are looked up by name at runtime, so they are fully pluggable. Formatters may make use of existing I/O libraries such as cfitsio, in which case they are generally thin wrappers. Formatters are configured by Policies.

All persistence and retrieval is performed under the control of a Persistence object. This object is responsible for interpreting the overall persistence Policy, managing the lookups and invocations of Formatters and Storages, and ensuring that any transaction/rollback handling is done correctly.
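The interplay between Persistence, Formatters, and Storages described above can be sketched as follows. All class names, the registry mechanism, and the file contents here are illustrative stand-ins, not the actual LSST API:

```python
import os
import tempfile

class Storage:
    """A persisted form; this 'thin' variant is just a file location."""
    def __init__(self, logical_location):
        self.location = logical_location

class FitsStorage(Storage):
    pass

# Runtime registry: (dataset type, storage name) -> Formatter instance,
# so Formatters and Storages are pluggable by name.
_formatters = {}

def register_formatter(obj_type, storage_name):
    def wrap(cls):
        _formatters[(obj_type, storage_name)] = cls()
        return cls
    return wrap

@register_formatter("Image", "FitsStorage")
class ImageFitsFormatter:
    """Converts between the in-memory object and its persisted form."""
    def write(self, obj, storage):
        with open(storage.location, "w") as f:
            f.write("SIMPLE=T\n%s\n" % obj)  # stand-in for real FITS output

    def read(self, storage):
        with open(storage.location) as f:
            return f.read().splitlines()[1]

class Persistence:
    """Manages Formatter lookup and invocation for persist/retrieve."""
    def persist(self, obj_type, obj, storage):
        _formatters[(obj_type, type(storage).__name__)].write(obj, storage)

    def retrieve(self, obj_type, storage):
        return _formatters[(obj_type, type(storage).__name__)].read(storage)

p = Persistence()
st = FitsStorage(os.path.join(tempfile.mkdtemp(), "image_v1.fits"))
p.persist("Image", "pixel-data", st)
print(p.retrieve("Image", st))  # -> pixel-data
```

Application code deals only with the Persistence interface and dataset types; only the Formatter needs to understand the Storage it writes to.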

Figure 2. Data Access Framework I/O Layer Components.

3.1.3 Alternatives Considered

Use of a full-fledged object-relational mapping system for output to a database was considered but determined to be too heavyweight and intrusive.

3.1.4 Prototype Implementation

A C++ implementation of the design was created for Data Challenge 2 (DC2) and has evolved since then. Formatters for images and exposures, sources and objects, and PSFs have been created. Datasets are identified by URLs. Storage classes include FITS[1] files, Boost::serialization[2] streams (native and XML), and the MySQL[3] database (via direct API calls or via an intermediate, higher-performance, bulk-loaded tab-separated value file). The camera interface has not yet been prototyped.

This implementation has been extended in DC3 to include a Python-based version of the same design that uses the C++ implementation internally. In the Python version, a Data Butler plays the role of the Persistence object. It takes dataset identifiers that are composed of key/value pairs, with the ability to infer missing values as long as those provided are unique. An internal Mapper class uses a Policy to control the format and location for each dataset. A Python-only Storage class has been added to allow persistence via the Python "pickle" mechanism.
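The Data Butler's key/value identifiers, with inference of missing values when the provided keys are already unique, can be sketched as follows. The class, the registry representation, and the key names (visit, ccd, filter) are hypothetical illustrations:

```python
class Butler:
    """Toy sketch of key/value dataset identification with inference."""
    def __init__(self, registry):
        # registry: complete identifier dicts known to the repository
        self.registry = registry

    def _complete(self, data_id):
        """Infer missing keys; fail unless exactly one dataset matches."""
        matches = [full for full in self.registry
                   if all(full.get(k) == v for k, v in data_id.items())]
        if len(matches) != 1:
            raise LookupError("identifier %r matches %d datasets"
                              % (data_id, len(matches)))
        return matches[0]

    def get(self, dataset_type, **data_id):
        full = self._complete(data_id)
        # A real butler would delegate to a Mapper/Formatter here;
        # we just return the completed identifier.
        return (dataset_type, full)

butler = Butler([
    {"visit": 1001, "ccd": 3, "filter": "r"},
    {"visit": 1001, "ccd": 4, "filter": "r"},
    {"visit": 1002, "ccd": 3, "filter": "g"},
])
# "visit" alone is ambiguous for visit 1001, but unique for visit 1002:
print(butler.get("calexp", visit=1002))
# -> ('calexp', {'visit': 1002, 'ccd': 3, 'filter': 'g'})
```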

4 02C.07.01.01 Control and Management Services

4.1 Event Subsystem

The event subsystem is used to communicate among components of the DM System, including between pipelines in a production. A monitoring component of the subsystem can execute rules based on patterns of events, enabling fault detection and recovery.

4.1.1 Key Requirements

The event subsystem must reliably transfer events from source to multiple destinations. There must be no central point of failure. The subsystem must be scalable to handle high volumes of messages, up to tens of thousands per second. It must interface to languages including Python and C++.

A monitoring component must be able to detect the absence of messages within a given time window and the presence of messages (such as logged exceptions) defined by a pattern.

4.1.2 Baseline Design

The subsystem will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ[4]. Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG[5] to make them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to that topic to receive copies of the events.

The event monitor subscribes to topics that indicate faults or other system status. It can match templates to events, including boolean expressions and time expressions applied to event data and metadata.
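The topic-based publish/subscribe model and the monitor's template matching might look like the following toy sketch. This stands in for the ActiveMQ-backed subsystem; the broker, monitor, topic name, and event fields are all illustrative:

```python
from collections import defaultdict

class EventBroker:
    """Toy topic-based publish/subscribe: each subscriber gets a copy."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for cb in self.subscribers[topic]:
            cb(dict(event))  # deliver a copy to every subscriber

class EventMonitor:
    """Applies a boolean template to each event on a subscribed topic."""
    def __init__(self, broker, topic, template, on_match):
        self.template, self.on_match = template, on_match
        broker.subscribe(topic, self._receive)

    def _receive(self, event):
        if self.template(event):
            self.on_match(event)

faults = []
broker = EventBroker()
EventMonitor(broker, "LogTopic",
             template=lambda e: e.get("level") == "FATAL",
             on_match=faults.append)

broker.publish("LogTopic", {"level": "INFO", "msg": "pipeline started"})
broker.publish("LogTopic", {"level": "FATAL", "msg": "stage raised"})
print(len(faults))  # -> 1
```

Detecting the absence of expected events within a time window would additionally require the monitor to track per-topic timestamps and fire on timeout.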

Figure 3. Event Subsystem Components

Observatory Control System (OCS) messages destined for the DM System will be translated into DM Event Subsystem events by dedicated software (part of the DMCS, see section 4.3) and published to appropriate topics.

4.1.3 Prototype Implementation

An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event receivers include pipeline components, orchestration components, the event monitor, and a logger that inserts entries into a database.

The event monitor has been prototyped in Java. The OCS message translator has not yet been prototyped.

4.2 Orchestration

The orchestration layer is responsible for deploying pipelines and Policies onto nodes, ensuring that their input data is staged appropriately, distributing dataset identifiers to be processed, recording provenance, and actually starting pipeline execution.

4.2.1 Key Requirements

The orchestration layer must be able to deploy pipelines and their associated configuration Policies onto one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly overlapping, subsets of nodes. All four pipeline execution models (see section 5.2.1) must be supported. Sufficient provenance information must be captured to ensure that datasets can be reproduced from their inputs.

The orchestration layer at the Base Center works with the DM Control System (DMCS, see section 4.3) at that Center to accept commands from the OCS to enter various system modes such as Nightly Observing or Daytime Calibration. The DMCS invokes the orchestration layer to configure and deploy the Alert Production pipelines accordingly. At the Archive Center, the orchestration layer manages execution of the Data Release Production, including sequencing scans through the raw images in spatial and temporal order.

Orchestration must detect failures, categorize them as permanent or possibly-transient, and restart transiently-failed processing according to the appropriate fault tolerance strategy.

4.2.2 Baseline Design

The design for the orchestration layer is a pluggable, Policy-controlled framework. Plug-in modules are used to configure and deploy pipelines on a variety of underlying process management technologies (such as simple ssh[6] or more complex Condor-G[7] glide-ins), which is necessary during design and development when hardware is typically borrowed rather than owned. Additional modules capture hardware, software, and configuration provenance, including information about the execution nodes, the versions of all software packages, and the values of all configuration parameters for both middleware and applications.

This layer monitors the availability of datasets and can trigger the execution of pipelines when their inputs become available. It can hand out datasets to pipelines based on the history of execution and the availability of locally-cached datasets to minimize data movement.
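Cache-aware handing out of datasets, as described above, can be illustrated by a trivial greedy assignment. The function, node names, and dataset identifiers are hypothetical:

```python
def assign(datasets, node_caches):
    """Assign each dataset to a node that already caches it, if any;
    otherwise fall back to a fixed node (here, the alphabetically first).
    A real scheduler would also balance load and track execution history."""
    assignments = {}
    for ds in datasets:
        cached_on = [n for n, cache in node_caches.items() if ds in cache]
        assignments[ds] = cached_on[0] if cached_on else min(node_caches)
    return assignments

caches = {"node1": {"visit-1001"}, "node2": {"visit-1002"}}
print(assign(["visit-1001", "visit-1002", "visit-1003"], caches))
# -> {'visit-1001': 'node1', 'visit-1002': 'node2', 'visit-1003': 'node1'}
```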

Faults are detected by the pipeline harness and event monitor timeouts. Orchestration then reprocesses transiently-failed datasets on already-deployed pipelines or on new pipeline instances that it deploys for the purpose.
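The permanent/transient failure distinction and the reprocessing of transiently-failed datasets can be sketched as a simple retry loop. The exception classes and retry policy here are purely illustrative:

```python
class PermanentFailure(Exception):
    """e.g. corrupt input: reprocessing cannot succeed."""

class TransientFailure(Exception):
    """e.g. node timeout: reprocessing may succeed."""

def run_with_retries(process, dataset, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return process(dataset)
        except PermanentFailure:
            raise  # no point reprocessing
        except TransientFailure:
            if attempt == max_attempts:
                raise
            # A real orchestrator might redeploy the pipeline instance here.

attempts = []
def flaky(dataset):
    attempts.append(dataset)
    if len(attempts) < 2:
        raise TransientFailure("node timeout")
    return "processed:%s" % dataset

print(run_with_retries(flaky, "visit-1001"))  # -> processed:visit-1001
print(len(attempts))                          # -> 2
```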

4.2.3 Prototype Implementation

A prototype implementation of the deployment framework was developed for DC3a. It was extended to handle Condor-G, and data dependency features were added for DC3b. Full fault tolerance has not yet been prototyped, although a limited application of a fault tolerance strategy has been demonstrated. Provenance is recorded in files and, to a limited extent, in a database. The file-based provenance has been demonstrated to be sufficient to regenerate datasets.

4.3 Data Management Control System

The LSST DMS at each center will be monitored and controlled by a Data Management Control System (DMCS).

4.3.1 Key Requirements

The DMCS at each site is responsible for initializing and running diagnostics on all equipment, including computing nodes, disk storage, tape storage, and networking. It establishes and maintains connectivity with the other sites including the Headquarters Site. It monitors the operation of all hardware and integrates with the orchestration layer (see section 4.2) to monitor software execution. System status and control functions will be available via a Web-enabled tool to the Headquarters Site and remote locations.

At the Base Center, the DMCS is responsible for interfacing with the OCS (as defined in “Control System Interfaces between the Telescope & Data Management”, Document LSE-75). It accepts commands from the OCS to enter various modes, including observing, calibration, day, maintenance, and shutdown. It then configures and invokes the orchestration layer and the replication layer (see section 7.1) to enable the necessary processing and data movement for each mode, including running the Alert Production and replicating raw images to the Archive Center, respectively.

At the Archive Center, the DMCS performs resource management for the compute cluster. Parts of the cluster may be dedicated to certain activities while others operate in a shared mode. The major processing activities under DMCS control, invoked using the orchestration layer, include the Alert Production reprocessing (on dedicated hardware), the Calibration Products Production, and the Data Release Production. The DMCS also initializes the replication layer to enable the archiving of raw images received from the Base Site.

At each Data Access Center, the DMCS performs resource management for the level 3 data products compute cluster. It also initializes the replication layer to enable the distribution of level 1 data products received from the Base Center or the Archive Center.

4.3.2 Baseline Design

The DMCS will consist of an off-the-shelf cluster management system together with a custom pluggable software framework. A Web-based control panel and an off-the-shelf monitoring system will also be integrated. Plugins will include job management systems like Condor, mode transition scripts to interface with the OCS and control panel, and hardware-specific initialization and configuration software.

5 02C.07.01.02 Pipeline Construction Toolkit

5.1 Policy Framework

The Policy component of the Pipeline Framework is of key importance throughout the LSST middleware.

Policies are a mechanism to specify parameters for applications and middleware in a consistent, managed way. The use of Policies facilitates runtime reconfiguration of the entire system while still ensuring consistency and the maintenance of traceable provenance.

5.1.1 Key Requirements

Policies must be able to contain parameters of various types, including at least strings, booleans, integers, and floating-point numbers. Ordered lists of each of these must also be supported. Each parameter must have a name. A hierarchical organization of names is required so that all parameters associated with a given component may be named and accessed as a group.

There must be a facility to specify legal and required parameters and their types and to use this information to ensure that invalid parameters are detected before code attempts to use them. Default values for parameters must be able to be specified; it must also be possible to override those default values, potentially multiple times (with the last override controlling).

Policies and their parameters must be stored in a user-modifiable form. It is preferable for this form to be textual so that it is human-readable and modifiable using an ordinary text editor.

It must be possible to save sufficient information about a Policy to obtain the value of any of its parameters as seen by the application code.

5.1.2 Baseline Design

The design follows straightforwardly from the requirements.

Policies are specified by a text file containing hierarchically-organized name/value pairs. A value may be another Policy (referred to as a sub-Policy). A value may also be a list of values (all of the same type). Policies may reference other Policies to set values for sub-Policies.

A Dictionary, which is also a Policy, specifies the legal parameter names, their types, minimum and maximum lengths for list values, and whether a parameter is required. Since Dictionaries are Policies, they may use Policy references to incorporate other dictionaries to validate sub-Policies.

Each piece of application code (routine or object) using a Policy will typically have an associated Dictionary to validate the Policy parameters and provide default values. Default values may also be provided by the code’s caller, adding to or overriding the Dictionary defaults.
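The layered override behavior, with the last override controlling, might work along these lines. Policies are modeled as plain nested dicts for illustration:

```python
def merge(*layers):
    """Merge Policy layers; later layers override earlier ones, and
    sub-Policies merge recursively rather than being replaced wholesale."""
    result = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge(result[key], value)
            else:
                result[key] = value
    return result

dictionary_defaults = {"threshold": 5.0, "psf": {"algorithm": "gaussian"}}
caller_defaults = {"psf": {"algorithm": "pca"}}
user_policy = {"threshold": 10.0}
print(merge(dictionary_defaults, caller_defaults, user_policy))
# -> {'threshold': 10.0, 'psf': {'algorithm': 'pca'}}
```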

With text-file Policies, the complete parameter state of a given execution may be preserved by preserving all the text files. In addition, the simple hierarchical syntax lends itself to storage in a database as a key/value table with dotted-name keys, allowing queries of the parameters by name (including the use of regular expressions) and value.
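Such key/value storage with dotted-name keys can be sketched by flattening the hierarchy, after which parameters can be queried by name, including with regular expressions. The parameter names here are invented examples:

```python
import re

def flatten(policy, prefix=""):
    """Flatten nested sub-Policies into a {'a.b.c': value} table."""
    flat = {}
    for name, value in policy.items():
        key = name if not prefix else "%s.%s" % (prefix, name)
        if isinstance(value, dict):  # a sub-Policy
            flat.update(flatten(value, key))
        else:
            flat[key] = value
    return flat

policy = {"detection": {"threshold": 5.0, "psf": {"algorithm": "pca"}},
          "doWriteSources": True}
table = flatten(policy)
print(table["detection.psf.algorithm"])  # -> pca
print([k for k in sorted(table) if re.match(r"detection\.", k)])
# -> ['detection.psf.algorithm', 'detection.threshold']
```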

5.1.3 Prototype Implementation

An implementation of Policy using a simple "name: value" syntax with brace-delimited sub-Policies has been in use since DC2. Hierarchical names are specified using dotted-path notation. XML syntax was considered but determined to be too wordy and difficult to edit. The dotted-path notation does not currently support referring to individual list elements.
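The "name: value" syntax with brace-delimited sub-Policies and dotted-path access might be parsed along these lines. This toy parser and the sample Policy text are illustrative, not the actual implementation:

```python
def parse_policy(text):
    """Parse 'name: value' lines; 'name: {' opens a sub-Policy, '}' closes."""
    root, stack = {}, []
    current = root
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "}":               # close the current sub-Policy
            current = stack.pop()
        elif line.endswith("{"):      # open a sub-Policy
            name = line[:-1].rstrip(": ")
            sub = {}
            current[name] = sub
            stack.append(current)
            current = sub
        else:
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    return root

def get(policy, dotted):
    """Look up a parameter by dotted-path name."""
    for part in dotted.split("."):
        policy = policy[part]
    return policy

text = """
framework: {
    type: standard
    environment: lsst.pex.harness
}
nEvents: 10
"""
p = parse_policy(text)
print(get(p, "framework.type"))  # -> standard
print(get(p, "nEvents"))         # -> 10
```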

Dictionaries have been implemented with validation for fixed parameter names. Extending this validation to variable parameter names (e.g. for parameters pertaining to pluggable measurement algorithms) has not yet been implemented. Automatic merging of overrides and validation of the result are also currently unimplemented; instead, application code must merge default values into an incoming Policy using an API call.