Large Synoptic Survey Telescope (LSST)

Data Management Middleware Design

Kian-Tat Lim, Ray Plante, Gregory Dubois-Felsmann

LDM-152

Latest Revision: October 9, 2013

This LSST document has been approved as a Content-Controlled Document by the LSST DM Technical Control Team. If this document is changed or superseded, the new document will retain the Handle designation shown above. The control is on the most recent digital document with this Handle in the LSST digital archive and not printed versions. Additional information may be found in the LSST DM TCT minutes.

The contents of this document are subject to configuration control by the LSST DM Technical Control Team.

Change Record

Version / Date / Description / Owner name
1.0 / 7/25/2011 / Initial version based on pre-existing UML models and presentations / Kian-Tat Lim
2.0 / 5/22/2013 / Updated based on experience from prototypes and Data Challenges. / Kian-Tat Lim
8 / 10/4/2013 / Updated based on comments from Process Control Review, changed to current terminology / Kian-Tat Lim
9 / 10/9/2013 / Further updates based on Process Control Review, formatting cleanup. / Kian-Tat Lim

Table of Contents

Change Record
1 Executive Summary
2 Introduction
3 02C.06.02.01 Data Access Client Framework
3.1 Key Requirements
3.2 Baseline Design
3.3 Alternatives Considered
3.4 Prototype Implementation
4 02C.06.02.04 Image and File Services
4.1 Baseline Design
4.2 Prototype Implementation
5 02C.07.02.01 Event Services
5.1 Key Requirements
5.2 Baseline Design
5.3 Prototype Implementation
6 02C.07.01 Processing Control
6.1 02C.07.01.02 Orchestration Manager
6.2 02C.07.01.01 Data Management Control System
7 Pipeline Execution Services
7.1 02C.06.03.01 Pipeline Construction Toolkit
7.2 02C.06.03.02 Logging Services
7.3 02C.06.03.03 Inter-Process Messaging Services
7.4 02C.06.03.04 Checkpoint/Restart Services

The LSST Data Management Middleware Design

1  Executive Summary

The LSST middleware is designed to isolate scientific applications, including the Alert Production, Data Release Production, Calibration Products Production, and Level 3 processing, from details of the underlying hardware and system software. It enables flexible reuse of the same code in multiple environments ranging from offline laptops to shared-memory multiprocessors to grid-accessed clusters, with a common communication and logging model. It ensures that key scientific and deployment parameters controlling execution can be easily modified without changing code, while maintaining full provenance so that the environment and parameters used to produce any dataset can be determined. It provides flexible, high-performance, low-overhead persistence and retrieval of datasets, with data repositories and formats selected by external parameters rather than hard-coded. Middleware services enable efficient, managed replication of data over both wide area networks and local area networks.

2  Introduction

This document describes the baseline design of the LSST data access and processing middleware, including the following elements of the Data Management (DM) Construction Work Breakdown Structure (WBS):

·  02C.06.02.01 Data Access Client Framework

·  02C.06.03 Pipeline Execution Services including:

o  02C.06.03.01 Pipeline Construction Toolkit

o  02C.06.03.02 Logging Services

o  02C.06.03.03 Inter-Process Communication Services

o  02C.06.03.04 Checkpoint/Restart Services

·  02C.07.01 Processing Control including:

o  02C.07.01.01 Data Management Control System

o  02C.07.01.02 Orchestration Manager

·  02C.07.02.01 Event Services

The LSST database design, WBS elements 02C.06.02.02 and 02C.06.02.03, may be found in the document entitled “Data Management Database Design” (LDM-135). Of the remaining components of WBS element 02C.07.02, 02C.07.02.02 is described in the “LSST Cybersecurity Plan” (LSE-99); 02C.07.02.03 (visualization) and 02C.07.02.04 (system administration) are primarily low-level, off-the-shelf tools and are not described further here; and 02C.07.02.06 (VO Interfaces) will use standard VO tools and protocols.

Figure 1. Data Management System Layers.

Common to all aspects of the middleware design is an emphasis on flexibility through the use of abstract, pluggable interfaces controlled by managed, user-modifiable parameters. In addition, the substantial computational and bandwidth requirements of the LSST Data Management System (DMS) force the designs to be conscious of performance, scalability, and fault tolerance. In most cases, the middleware does not require advances over the state of the art; instead, it requires abstraction to allow for future technological change and aggregation of tools to provide the necessary features.

3  02C.06.02.01 Data Access Client Framework

This WBS element contains the framework by which applications retrieve datasets from and persist datasets to file and database storage. This framework provides high-performance access to local resources (within a data access center, for example) and lower-performance access to remote resources. These resources may include images, non-image files, and databases.

3.1  Key Requirements

The framework must provide persistence and retrieval capabilities to application code. Persistence is the mechanism by which application objects are written to files in some format, a database, or a combination of both; retrieval is the mechanism by which data in files, a database, or a combination of both is made available to application code in the form of an application object. Persistence and retrieval must be low-overhead, allowing efficient use of available bandwidth. The interface to the I/O layer must be usable by application developers. It must be flexible, allowing changes in file formats, or even in whether a given object is stored in a file or the database, to be selected at runtime in a controlled manner. Image data must be able to be stored in standard FITS format, although the metadata for the image may be in either FITS headers or database table entries.

3.2  Baseline Design

The framework is designed to provide access to datasets. A dataset is a logical grouping of data that is persisted or retrieved as a unit, typically corresponding to a single programming object or a collection of objects. Dataset types are predefined. Datasets are identified by a unique identifier. Datasets may be persisted into multiple formats.

The framework is made up of two main components: a “Butler” that provides a high-level, general-purpose dataset and repository access interface and a “Mapper” that provides astronomy-specific and even camera-specific methods for naming, persisting, and retrieving datasets. Both are implemented in Python.

The Butler (formerly known as a Persistence object) manages repositories of datasets, which can be in files or in a database. The files may be anywhere within a global namespace provided by the Infrastructure's File System Services, including on tape. Operations on datasets include get, put, list, and remove. One additional operation checks whether a dataset exists: if it does, the operation reads it and checks it for equality with the Python object being persisted; if it does not, the operation writes the object, using locking to ensure that only one copy is written. This operation is useful for consolidating the recording of provenance information shared by multiple simultaneous tasks.
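
A minimal sketch of this check-then-write operation, assuming a simple file-based repository, pickle storage, and an exclusive-create lock file (all illustrative; the baseline uses the pluggable storages described below):

import os
import pickle

def put_or_verify(repo, name, obj):
    # Sketch of the operation described above: if the dataset exists, read it
    # back and require equality with the object being persisted; otherwise
    # write it, using an exclusive-create lock file so only one copy is
    # written.  Concurrent writers racing on the lock are not handled here.
    path = os.path.join(repo, name + ".pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            existing = pickle.load(f)
        if existing != obj:
            raise ValueError("dataset %r exists but differs" % name)
        return False                      # already recorded by another task
    lock = path + ".lock"
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # acts as the lock
    try:
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    finally:
        os.close(fd)
        os.remove(lock)
    return True                           # this task wrote the shared dataset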

The Butler contains a pluggable set of storage managers (formerly known as Formatter and Storage subclasses) that handle persistence to and retrieval from storage types such as Python pickle files, task configuration override files (Python scripts), FITS tables, and SQL databases. Metadata and provenance information are extracted by the storage managers.
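
Such a pluggable arrangement might be organized as in the following sketch, assuming a simple registry keyed by storage-type name; the pickle and JSON managers shown here are stand-ins, since the configuration, FITS-table, and SQL managers depend on external libraries:

import json
import pickle

def _write_pickle(path, obj):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def _read_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def _write_json(path, obj):
    with open(path, "w") as f:
        json.dump(obj, f)

def _read_json(path):
    with open(path) as f:
        return json.load(f)

# Registry of storage managers keyed by an (illustrative) storage-type name.
# In the baseline each manager would also extract metadata and provenance.
STORAGE_MANAGERS = {
    "PickleStorage": (_write_pickle, _read_pickle),
    "JsonStorage": (_write_json, _read_json),
}

def persist(storage_type, path, obj):
    # Dispatch a write to whichever manager is registered for this type.
    writer, _ = STORAGE_MANAGERS[storage_type]
    writer(path, obj)

def retrieve(storage_type, path):
    _, reader = STORAGE_MANAGERS[storage_type]
    return reader(path)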

The Butler is initialized with zero or more read-only input repositories and a read/write output repository. When reading a dataset, the output repository is searched first; the "chained" input repositories are searched if the dataset is not found. When writing a dataset, the dataset always goes to the output repository, never to the chained inputs (unless the output is specified as being the same as an input). The set of input repositories is recorded for provenance purposes.
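
The search order can be summarized by the following sketch (function and argument names are illustrative, not the baseline API):

import os

def locate_for_read(relative_path, output_repo, input_repos):
    # Reads check the output repository first, then each chained, read-only
    # input repository in order.
    for repo in [output_repo] + list(input_repos):
        candidate = os.path.join(repo, relative_path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(relative_path)

def locate_for_write(relative_path, output_repo):
    # Writes always target the output repository, never the chained inputs.
    return os.path.join(output_repo, relative_path)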

The Mapper translates a dataset type name and one or more astronomically meaningful key/value dictionaries into a dataset location and storage. The location might be a pathname or URL for a file; for a database, it would include an SQL query.

The Mapper provides flexibility at many levels. First, it allows the provided key/value dictionaries to be expanded using rules or database lookups. This can be used to map from a visit identifier to an exposure length, for example, or from a CCD name to an equivalent number. This facility is used to implement the "rendezvous" of raw data with its corresponding calibration data. Second, it allows the key/value pairs to be turned into a location string using a dataset type-dependent method. Typically, this will be performed by substitution into a dataset type-specific template. Third, the Mapper allows camera-specific and repository-specific overrides and extensions to the list of rules and templates, enabling per-camera and dynamic dataset type creation.
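
A sketch of this key expansion and template substitution, using hypothetical dataset types, templates, and lookup rules (the real templates are supplied by camera-specific and repository-specific Mapper configuration):

# Hypothetical dataset type-specific templates.
TEMPLATES = {
    "raw": "raw/v{visit:07d}/{ccdName}.fits",
    "calexp": "calexp/v{visit:07d}/{ccdNum:03d}.fits",
}

CCD_NAME_TO_NUM = {"R22_S11": 94}   # illustrative rule: CCD name -> number

def map_to_location(dataset_type, data_id):
    # Expand the key/value dictionary using rules or database lookups, then
    # substitute it into the dataset type-specific template.
    expanded = dict(data_id)
    if "ccdName" in expanded and "ccdNum" not in expanded:
        expanded["ccdNum"] = CCD_NAME_TO_NUM[expanded["ccdName"]]
    return TEMPLATES[dataset_type].format(**expanded)

# map_to_location("calexp", {"visit": 885449, "ccdName": "R22_S11"})
# returns "calexp/v0885449/094.fits"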

3.3  Alternatives Considered

Use of a full-fledged object-relational mapping system for output to a database was considered but determined to be too heavyweight and intrusive. Persistence from C++ was tried and found to be complex and unnecessary; Python persistence suffices since all control is in Python.

3.4  Prototype Implementation

A C++ implementation of the original design was created for Data Challenge 2 (DC2) that allows input and output of images and exposures, sources and objects, and PSFs. Datasets were identified by URLs. Storage mechanisms included FITS[1] files, Boost::serialization[2] streams (native and XML), and the MySQL[3] database (via direct API calls or via an intermediate, higher-performance, bulk-loaded tab-separated value file). The camera interface has not yet been prototyped.

This implementation was extended in DC3 to include a Python-based version of the same design that uses the C++ implementation internally. This new implementation is the basis of the new baseline design. Experience in the last few Data Challenges has demonstrated that this framework is easier to use and more flexible than the C++ one. Since the low-level I/O code remains in C++, the framework's performance remains good. A Python-only Storage class has been added to allow persistence via the Python "pickle" mechanism.

Further refinement of the implementation has produced classes that can be written to and read from FITS tables. The Mapper class has been extended to provide automatic management of dataset repositories.

4  02C.06.02.04 Image and File Services

Image and File Services manages a virtual read-only repository of files, including image files. This is required because the size of the LSST data products makes it infeasible to store them all; it is more cost-effective to provide the CPU cycles needed to regenerate them on demand.

4.1  Baseline Design

When a file is requested, a cache maintained by the service is checked. If the file exists in the cache, it is returned. If the file does not exist, configurable rules are consulted to remove one or more files to make room for it in the cache, if necessary. (If no room is currently available because all cached files are being used, the request is blocked.) The file is then regenerated by invoking application pipeline code based on provenance and metadata information stored in the repository. The regenerated file is placed in the cache.
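
A minimal sketch of this cache logic, assuming a least-recently-used eviction rule and omitting the in-use tracking that blocks requests when no space can be freed:

import os
from collections import OrderedDict

class RegeneratingCache:
    def __init__(self, capacity, regenerate):
        self.capacity = capacity       # maximum number of cached files
        self.regenerate = regenerate   # callable invoking pipeline code: id -> path
        self.cache = OrderedDict()     # dataset id -> cached file path

    def get(self, dataset_id):
        if dataset_id in self.cache:
            self.cache.move_to_end(dataset_id)          # cache hit; mark recently used
            return self.cache[dataset_id]
        while len(self.cache) >= self.capacity:
            _, victim = self.cache.popitem(last=False)  # evict least recently used
            os.remove(victim)
        path = self.regenerate(dataset_id)              # rerun application code
        self.cache[dataset_id] = path
        return path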

4.2  Prototype Implementation

This service has not yet been prototyped.

5  02C.07.02.01 Event Services

The event service is used to communicate among components of the DM System, including between pipelines in a production. A monitoring component of the service can execute rules based on patterns of events, enabling fault detection and recovery.

5.1  Key Requirements

The event service must reliably transfer events from source to multiple destinations. There must be no central point of failure. The service must be scalable to handle high volumes of messages, up to tens of thousands per second. It must interface to languages including Python and C++.

A monitoring component must be able to detect the absence of messages within a given time window and the presence of messages (such as logged exceptions) defined by a pattern.

5.2  Baseline Design

The service will be built as a wrapper over a reliable messaging system such as Apache ActiveMQ[4]. Event subclasses and standardized metadata will be defined in C++ and wrapped using SWIG[5] to make them accessible from Python. Events will be published to a topic; multiple receivers may subscribe to that topic to receive copies of the events.

The event monitor subscribes to topics that indicate faults or other system status. It can match templates to events, including boolean expressions and time expressions applied to event data and metadata.
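
The publish/subscribe and monitoring pattern can be illustrated with the following in-process sketch; the baseline dispatches through a reliable broker such as ActiveMQ rather than local callbacks, and the topic name and rule shown here are hypothetical:

from collections import defaultdict

class LocalEventBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(event)

def alert_on_fatal(event):
    # Example monitor rule: a boolean expression over event metadata.
    if event.get("level") == "FATAL":
        print("fault detected:", event["message"])

broker = LocalEventBroker()
broker.subscribe("LogStatus", alert_on_fatal)
broker.publish("LogStatus", {"level": "FATAL", "message": "pipeline stage failed"})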

Figure 3. Event Subsystem Components.

5.3  Prototype Implementation

An implementation of the event subsystem on Apache ActiveMQ was created for DC2 and has evolved since then. Command, Log, Monitor, PipelineLog, and Status event types have been defined. Event receivers include pipeline components, orchestration components, the event monitor, and a logger that inserts entries into a database. Tests have demonstrated the ability to handle tens of thousands of events per second through the event broker, although the code to store events in a database is not yet up to that level of performance.

The event monitor has been prototyped in Java.

6  02C.07.01 Processing Control

6.1  02C.07.01.02 Orchestration Manager

The Orchestration Manager is responsible for deploying pipelines and Policies onto nodes, ensuring that their input data is staged appropriately, distributing dataset identifiers to be processed, recording provenance, and actually starting pipeline execution.

6.1.1  Key Requirements

The Orchestration Manager must be able to deploy pipelines and their associated configuration Policies onto one or more nodes in a cluster. Different pipelines may be deployed to different, although possibly overlapping, subsets of nodes. All three pipeline execution models (see section 7.1.2) must be supported. Sufficient provenance information must be captured to ensure that datasets can be reproduced from their inputs.

The Orchestration Manager at the Base Center works with the DM Control System (DMCS, see section 6.2) at that Center to accept commands from the OCS to enter various system modes such as Nightly Observing or Daytime Calibration. The DMCS invokes the Orchestration Manager to configure and execute data transfer and Alert Production pipelines accordingly. At the Archive Center, the Orchestration Manager controls execution of the Data Release Production, including managing data dependencies between pipelines.