SDM Center Report

Jan 2002 - June 2006

http://sdmcenter.lbl.gov

Introduction

Managing scientific data has been identified as one of the most important emerging needs of the scientific community because of the sheer volume and increasing complexity of the data being collected. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all stages from initial data acquisition to final analysis. Fortunately, the data management problems encountered by most scientific domains are common enough to be addressed through shared technology solutions. Based on community input, we have identified three significant requirements. First, more efficient access to storage systems is needed. In particular, parallel file system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualization engine. These processes are complicated by the fact that scientific data are structured differently for specific application domains and are stored in specialized file formats. Second, scientists require technologies that facilitate better understanding of their data, in particular the ability to perform complex data analysis and searches over large data sets effectively. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. To facilitate efficient access, it is necessary to keep track of the locations of datasets, effectively manage storage resources, and efficiently select subsets of the data. Finally, the overall process of generating data, collecting and storing results, post-processing, and analyzing results is tedious and fragmented. Tools that automate this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration.

Our approach is to employ an evolutionary development and deployment process: from research through prototypes to deployment and infrastructure. Accordingly, we have organized our activities in three layers that abstract the end-to-end data flow described above. We labeled the layers (from bottom to top):

·  Storage Efficient Access (SEA)

·  Data Mining and Analysis (DMA)

·  Scientific Process Automation (SPA)

The SEA layer sits immediately on top of the hardware, operating systems, file systems, and mass storage systems, and provides parallel data access technology and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature identification, and parallel statistical analysis technology. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application-specific modules.

This report is organized according to the three layers: SEA, DMA, and SPA.
1. Storage Efficient Access

Today’s scientific applications demand that high performance I/O be part of their operating environment. These applications access datasets of many gigabytes (GB) or terabytes, checkpoint frequently, and create large volumes of visualization data. Such applications are hamstrung by bottlenecks anywhere in the I/O path, including the storage hardware, file system, low-level I/O middleware, application level interface, and in some cases the mechanism used for Grid I/O access. This work addresses inefficiencies in all the software layers by carefully balancing the needs of scientists with implementations that allow the expression and exploitation of parallelism in access patterns.

Just above the I/O hardware in a high-performance machine sits software known as the parallel file system. This software maintains the directory hierarchy and manages file data distribution across a large number of I/O components. Our PVFS2 parallel file system can provide multiple GB/second parallel access rates, is freely available, and is in use at numerous academic, laboratory, and industry sites, including the Massachusetts Institute of Technology, the University of Utah’s Center for High Performance Computing, the Ohio Supercomputer Center, Argonne National Laboratory, and Acxiom Corp.

PVFS2 operates on a wide variety of Linux platforms, making it perfect for cluster systems, and it has also been ported to the IBM Blue Gene/L supercomputer. PVFS2 incorporates key optimizations that prepare it for ultra-scale systems, including scalability optimizations that closely connect PVFS2 to MPI-IO (see Figure 1) [LRT04].

Above the parallel file system is software designed to help applications access the parallel file system more efficiently. Implementations of the MPI-IO interface are arguably the best example of this type of software. MPI-IO provides optimizations that help map complex data movement into efficient parallel file system operations. Our ROMIO MPI-IO implementation is freely distributed and is the most popular MPI-IO implementation for both clusters and a wide variety of vendor platforms [TGL99]. Our previous work on PVFS and ROMIO forms a solid technology base for new capabilities that address the needs of scientific computing in general and of specific communities, including enhanced I/O request capabilities for PVFS and ROMIO and the Parallel netCDF (Network Common Data Form) interface.
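To make the discussion concrete, the following is a minimal sketch of the kind of collective MPI-IO access that ROMIO optimizes; the file name, buffer size, and layout are illustrative rather than taken from any particular application. Each process writes a disjoint block with a single collective call, which lets the MPI-IO layer combine the per-process requests into large, well-formed operations for the underlying parallel file system.

    /* Hedged sketch of a collective MPI-IO write; compile with mpicc. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1024;                 /* doubles per process (illustrative) */
        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++) buf[i] = rank + i * 1e-3;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its block at a disjoint offset; the collective call
           allows the MPI-IO layer to merge the per-process requests. */
        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }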

Some applications now operate on data located across the wide area network (WAN). Unfortunately for scientists, most of the APIs that they are accustomed to using to access data stored locally will not currently operate on data stored remotely. Grid-enabling ROMIO allows existing applications to access remote data sources with no significant work on the part of scientists, significantly improving the usability of Grid I/O resources. We have implemented prototypes that are capable of accessing both Storage Resource Managers and Logistical Network data sources through the MPI-IO interface.

MPI-IO is a powerful but low-level interface that operates in terms of basic types, such as floating point numbers, stored at offsets in a file. Scientific applications desire more structured formats that map more closely to the structures applications use, such as multidimensional datasets. File formats that include attributes of the data, such as the input parameters and date of creation, and are portable between platforms, are also desirable. The Hierarchical Data Format (HDF5) interface, popular in the astrophysics community among others, is one such high level API. HDF5 uses MPI-IO for parallel I/O access as well.
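As an illustration, the sketch below writes one dataset collectively through HDF5's MPI-IO driver; it uses HDF5 1.6-era call signatures, and the file name, dataset name, and sizes are placeholders rather than an actual application's schema.

    /* Hedged sketch of a collective parallel HDF5 write (HDF5 1.6-era API). */
    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open the file through the MPI-IO virtual file driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global 1-D dataset, partitioned evenly across processes. */
        hsize_t per_rank = 1024, dims = per_rank * nprocs;
        hid_t filespace = H5Screate_simple(1, &dims, NULL);
        hid_t dset = H5Dcreate(file, "density", H5T_NATIVE_DOUBLE, filespace,
                               H5P_DEFAULT);

        hsize_t start = rank * per_rank;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &per_rank, NULL);
        hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

        double buf[1024];
        for (hsize_t i = 0; i < per_rank; i++) buf[i] = (double)rank;

        /* Collective transfer so the HDF5 and MPI-IO layers can aggregate I/O. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
        H5Sclose(filespace); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }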

NetCDF is another widely used API and portable file format that is popular in the climate simulation and data fusion communities. Our Parallel netCDF (PnetCDF) work provides a new interface for accessing netCDF data sets in parallel. This new parallel API closely mimics the original API, but is designed with scalability in mind and is implemented on top of MPI-IO. Performance evaluations using micro-benchmarks as well as application I/O kernels have shown major scalability improvements over previous efforts [LLC+03].
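The sketch below shows how closely the PnetCDF calls mirror the serial netCDF API (nc_create becomes ncmpi_create, and so on), with each process writing its own slice of a shared variable through one collective call. The variable and dimension names and sizes are illustrative.

    /* Hedged sketch of a parallel write with the PnetCDF API; link with -lpnetcdf. */
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs, ncid, dimid, varid;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Offset per_rank = 1024;
        double buf[1024];
        for (int i = 0; i < 1024; i++) buf[i] = (double)rank;

        /* Define the dataset collectively, mirroring the serial netCDF calls. */
        ncmpi_create(MPI_COMM_WORLD, "climate.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", per_rank * nprocs, &dimid);
        ncmpi_def_var(ncid, "temperature", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        /* Each process writes its own contiguous slice with a collective call. */
        MPI_Offset start = rank * per_rank, count = per_rank;
        ncmpi_put_vara_double_all(ncid, varid, &start, &count, buf);

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }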

Figure 2 compares the performance of the FLASH astrophysics I/O benchmark using the HDF5 and PnetCDF APIs; over 50% more I/O bandwidth is observed for PnetCDF. The current release of PnetCDF is version 1.0.1 and is available from http://www-unix.mcs.anl.gov/parallel-netcdf. It has been tested on a number of platforms, including Linux clusters, SGI Origin, Cray X1E, NEC SX, IBM SP, and BlueGene/L. Language bindings are provided for C and Fortran. The self-test suite from the Unidata netCDF package has been ported to validate PnetCDF results against single-process results. In addition, a new parallel test suite has been developed to validate the results from various parallel access patterns.

Current PnetCDF users come from several major research centers, including:

·  The FLASH application team from the ASCI/Alliances Center at the University of Chicago

·  The Center for Atmospheric Chemical Transport Models (LLNL)

·  The Scientific Data Technologies group at NCSA, developing I/O components for the Regional Ocean Model System (ROMS)

·  The ASPECT center (ORNL)

·  The Portable, Extensible Toolkit for Scientific Computation (PETSc) team (ANL)

·  The PRogram for Integrated Earth System Modeling (PRISM) at C&C Research Laboratories, NEC Europe Ltd.

·  The Earth System Modeling Framework (ESMF) at the National Center for Atmospheric Research (NCAR)

Our earlier work made considerable contributions to caching techniques for the very large files and datasets typically encountered in high-performance, data-intensive computing. In particular, we showed that disk caching techniques used for staging very large files from tape storage or alternate replica locations onto local disks, for direct file access during computations, have different characteristics than the caching techniques used in virtual memory paging and web caching. We defined an appropriate metric, termed the average cost per reference, for comparing various cache replacement algorithms in this context. We showed through analytical derivation that the best replacement policy is one that, at each instant in time t, evicts the candidate file i with the least value of a cost-beneficial function. This is referred to as the LCB-K replacement policy. The cost-beneficial function for a replaceable file i in the cache is computed from k_i(t), the number of references made to file i (counted up to a maximum of K), the times of those backward references to file i, g_i(t), the cumulative number of references to the file over its active period, and c_i(t), the estimated cost of caching the file if it is not on disk.

Results of a comparative study of LCB-K with other methods, such as LRU, LRU-K, and GDS (Greedy Dual Size), are shown in Figures 3a and 3b for synthetic and real workloads, respectively. The details may be found in [OOS02, OtSh03]. Further work on caching combined the replacement policies with file access scheduling for optimal cache loading. The result of this work, referred to as OptCacheLoad, combines optimal scheduling with a cache replacement policy. It produces excellent response times for data-intensive jobs under limited cache sizes, compared with a first-come-first-served (FCFS) scheduling policy. Figures 4a and 4b show the corresponding results for synthetic and real workloads, respectively. The details are presented in [ORS03].

Figure 3: Comparison of average cost per retrieval for various cache replacement algorithms

Figure 4: Average response times of jobs serviced using OptCacheLoad and LRU replacement
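The eviction rule described above can be sketched schematically: among the files currently in the cache, evict the one whose cost-beneficial value is smallest. The scoring function below is only a placeholder standing in for the LCB-K formula derived in [OOS02, OtSh03], and the file names and numbers are invented for illustration.

    /* Schematic sketch of cost-based eviction in the spirit of LCB-K.
       cost_benefit() is a placeholder, NOT the published LCB-K formula. */
    #include <stdio.h>

    struct cached_file {
        const char *name;
        double size_gb;        /* space freed if evicted */
        double refs_per_hour;  /* recent reference rate */
        double refetch_cost;   /* estimated cost to re-stage from tape or a replica */
    };

    /* Placeholder score: expected benefit of keeping the file per unit of
       cache space it occupies. */
    static double cost_benefit(const struct cached_file *f)
    {
        return f->refs_per_hour * f->refetch_cost / f->size_gb;
    }

    int main(void)
    {
        struct cached_file cache[] = {
            { "fileA.dat", 12.0, 0.2, 30.0 },
            { "fileB.dat",  4.0, 1.5, 25.0 },
            { "fileC.dat",  8.0, 0.1, 10.0 },
        };
        int victim = 0;
        for (int i = 1; i < 3; i++)
            if (cost_benefit(&cache[i]) < cost_benefit(&cache[victim]))
                victim = i;
        printf("evict %s\n", cache[victim].name);
        return 0;
    }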

2. Data Mining and Analysis

Using FastBit Indices for Data Analysis

Summary

The most important piece of searching software that we have developed under SDM Center funding is the FastBit indexing package. FastBit is a tool for searching large read-only datasets. It organizes user data in a column-oriented structure that is efficient for on-line analytical processing (OLAP), and it utilizes compressed bitmap indices to further speed up query processing. Our analyses have proven that the compressed bitmap index used in FastBit is theoretically optimal for one-dimensional queries. Bitmap indices are superior to other optimal indexing methods because they can be efficiently combined to answer multi-dimensional queries, whereas the other methods cannot. In this section, we briefly describe the searching capability of FastBit, then highlight two applications that make extensive use of FastBit, namely Grid Collector and DEX.

Introduction

It is a significant challenge to search for key insights in the huge amount of data being produced by many data-intensive science applications. For example, a high-energy physics experiment called STAR is producing nearly a petabyte of data a year and has accumulated many millions of files in the last five years of operation. One of the core missions of the STAR experiment is to verify the existence of a new state of matter called the Quark Gluon Plasma (QGP) [BGS+1999]. An effective strategy for this task is to find the high-energy collisions that contain signatures unique to QGP, such as a phenomenon called jet quenching. Among the hundreds of millions of collision events captured, a very small fraction, perhaps only a few hundred, contain clear signatures of jet quenching. Efficiently identifying these events and transferring the relevant data files to analysis programs is a great challenge. Many data-intensive science applications face similar challenges in searching their data.

In the past few years, we have been working on a set of strategies to address this type of searching problem. Usually, the data to be searched are read-only[1], and our approach takes advantage of this fact. Because most database management systems (DBMS) are built for frequently modified data, they carry overheads that a read-only system can avoid, which allows FastBit to perform searching operations significantly faster than those DBMS.

Conceptually, most data can be thought of as tables, where each row of the table represents an object or a record, and each column represents one attribute of the record. To accommodate frequent changes to records, a typical DBMS stores each record contiguously on disk. This allows easy updates of records, but for many operations the DBMS effectively reads all attributes from disk in order to access the few that are relevant for a particular query. FastBit instead stores each attribute contiguously on disk, which allows the relevant columns to be accessed without touching any other columns. Even though an update may take longer to execute, a typical update operation adds a large number of objects at once, so the extra overhead introduced by this vertical partitioning is negligible. In database theory, separating out the values of a particular attribute is referred to as a projection; for this reason, using column-wise organized data to answer user queries is also known as the projection index [OQ1997].

Each column of a table is a dimension of the data. Many scientific datasets have tens or hundreds of dimensions; they are called high-dimensional data. User queries usually involve conditions on several attributes; these are known as multi-dimensional queries. For answering multi-dimensional queries on high-dimensional data, the projection index is well known to perform better than most indexing methods, including the B-tree. Since FastBit uses a column-wise organization for user data, even without any additional indices it effectively implements the projection index, which is already very efficient. Our indexing technology further speeds up the searching operations. We have analyzed our bitmap index and shown it to be optimal for one-dimensional queries [WOS2002, WOS2004, WOS2006]. Some of the best indexing methods, including the B+-tree and B*-tree, have this optimality property as well. However, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries.
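The following sketch illustrates the two ideas in miniature: the data are kept column-wise (one array per attribute), one bitmap is built per query condition, and a multi-dimensional query is answered by a bitwise AND of the bitmaps. It uses plain, uncompressed bitmaps and invented attribute names; it is not FastBit's API or its compressed index format.

    /* Hedged sketch: answering a two-dimensional query with bitmaps. */
    #include <stdint.h>
    #include <stdio.h>

    #define N 8   /* number of records */

    int main(void)
    {
        /* Column-wise ("projection") layout: one array per attribute. */
        double energy[N]  = { 1.2, 7.5, 3.3, 9.1, 0.4, 6.6, 8.8, 2.0 };
        int    ntracks[N] = {  10,  42,   5,  77,   3,  51,  64,   9 };

        uint32_t b_energy = 0, b_tracks = 0;   /* one bit per record */
        for (int i = 0; i < N; i++) {
            if (energy[i] > 5.0)  b_energy |= (1u << i);
            if (ntracks[i] > 40)  b_tracks |= (1u << i);
        }

        /* Multi-dimensional condition: energy > 5.0 AND ntracks > 40. */
        uint32_t hits = b_energy & b_tracks;
        for (int i = 0; i < N; i++)
            if (hits & (1u << i))
                printf("record %d qualifies\n", i);
        return 0;
    }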

FastBit Indices

The key to the effectiveness of the FastBit searching software is a special compression scheme called the Word-Aligned Hybrid (WAH) code. This compression scheme enables FastBit to compress the bitmap indices to modest sizes [WOS2004, WOS2006]. At the same time, it enables common operations on the compressed bitmaps to be extremely efficient [WOS2002, WOS2004]. The performance measurements and the analyses have been published in well-known conferences and journals in the database community [WOS2002, WOS2004, WOS2006].
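To give a flavor of the scheme, the sketch below decodes the two word types described in the WAH papers: a literal word whose lower 31 bits are raw bitmap bits, and a fill word that encodes a run of identical 31-bit groups. The bit-field layout follows the published description but should be taken as illustrative rather than as FastBit's exact on-disk format.

    /* Hedged sketch of decoding 32-bit WAH words (layout as described in [WOS2004]). */
    #include <stdint.h>
    #include <stdio.h>

    static int      is_fill(uint32_t w)     { return (w >> 31) & 1u; }
    static int      fill_bit(uint32_t w)    { return (w >> 30) & 1u; }
    static uint32_t fill_groups(uint32_t w) { return w & 0x3FFFFFFFu; } /* run length in 31-bit groups */

    int main(void)
    {
        uint32_t words[] = {
            0x80000002u,   /* fill word: two groups (62 bits) of 0s */
            0x40000000u    /* literal word: 31 raw bitmap bits */
        };
        for (int i = 0; i < 2; i++) {
            if (is_fill(words[i]))
                printf("fill word: value=%d, %u groups of 31 bits\n",
                       fill_bit(words[i]), fill_groups(words[i]));
            else
                printf("literal word: bits=0x%08x\n", words[i] & 0x7FFFFFFFu);
        }
        return 0;
    }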