5. Computing and Software
The computing and software systems extend from the output of the readout electronics to the final scientific results of the mission. The first R&D task is defining the requirements for the system. The requirements document will frame the task of the conceptual design, which will address the balance between space-based systems and ground-based computing. The major R&D concerns for this phase focus on the following issues:
·Assessing the needs for data flow and data analysis
·Determining the capabilities of hardware and software to meet these needs
·Designing a system with the required reliability and maintainability
·Developing a cost estimate and schedule with milestones to assure implementation of the requisite computing and software systems on time and on budget
The primary deliverables of the R&D program will consist of a requirements document and a conceptual design. The work will also include evaluation of some specific software technologies that are likely to have a major impact on the design of the software system and on addressing the issues listed above. In addition, the SNAP R&D effort will include a survey of existing systems that are similar to that required for SNAP.
Expected performance requirements
The major processing load for SNAP will come from the GigaCam data.
As a prelude to the conceptual design, which will result from computing R&D activities, we can estimate the computing requirements for SNAP. Currently the SCP supernova search software requires roughly 1.5 hrs on a 400 MHz PIII to coadd and subtract 10 new and 10 reference 2k x 4k CCD images. The coaddition process involves aligning the images and iteratively rejecting cosmic rays, whereby an initial coadded image is formed and used as a master to detect and mask cosmic rays in the input images; these masked images then form the input for the next master image. At the minimum (100 sec) exposure time for GigaCam, one would thus need roughly (5400 sec/(10*100 sec))*(3k*3k)/(2k*4k) ~ 6 CPUs of 400 MHz PIII class to keep up with the data from a single GigaCam CCD. Scaling to the roughly 100 3k x 3k GigaCam CCDs, one would need a cluster of roughly 600 such PCs. Applying Moore's Law over a 5-year period reduces this requirement to about 60 "off the shelf" PCs at the time when they would be needed for SNAP. Alternatively, the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab offers a 580 Gflop Cray T3E and will soon have a 3.1 Tflop IBM RS/6000 SP Power3. Roughly speaking, GigaCam coadditions and subtractions would consume a mere 5% of the IBM SP computing power. Therefore, raw computing power is not likely to be a major issue for SNAP.
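The iterative cosmic-ray rejection described above can be sketched as follows. This is a toy illustration of the scheme rather than the SCP code; the robust per-pixel scatter estimate (floored at one count) and the iteration count are illustrative assumptions.

```python
import numpy as np

def coadd_with_cr_rejection(images, n_iter=2, nsigma=5.0):
    """Coadd aligned images, iteratively masking cosmic rays.

    images: sequence of 2-D arrays already registered to a common grid.
    Each pass forms a robust master image (per-pixel median over the
    unmasked pixels), flags pixels deviating from the master by more
    than nsigma times a robust scatter estimate, then recomputes the
    master from the surviving pixels.
    """
    stack = np.asarray(images, dtype=float)
    mask = np.zeros(stack.shape, dtype=bool)      # True = rejected pixel
    for _ in range(n_iter):
        masked = np.ma.array(stack, mask=mask)
        master = np.ma.filled(np.ma.median(masked, axis=0), 0.0)
        dev = np.abs(stack - master)
        # Robust per-pixel scatter: 1.4826 * median absolute deviation,
        # floored at 1 count to avoid zero-sigma divisions.
        sigma = np.maximum(1.4826 * np.median(dev, axis=0), 1.0)
        mask = dev > nsigma * sigma
    # Final coadd: mean over the pixels that survived rejection.
    return np.ma.filled(np.ma.array(stack, mask=mask).mean(axis=0), 0.0)
```

A cosmic ray hit appears as a large outlier in a single input image, so the median-based master is unaffected by it and the hit is masked on the first pass.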
A similar estimate can be made for high-speed storage. High-speed access will be required for typically forty 3k x 3k images (two intermediate images for each of 10 new and 10 reference exposures) for each CCD. These are processed intermediate images, and so will be uncompressed and 4 bytes deep. Assuming 100 CCDs for GigaCam, the high-speed storage requirement is roughly 144 Gbytes. This storage will have to be distributed among the nodes of a PC cluster, or otherwise parallelized, to avoid networking bottlenecks. This amount of high-speed storage is already available at NERSC.
Long-term storage of the roughly 500 Gbyte/day (compressed) from SNAP over a 3-year lifetime amounts to roughly 550 Tbyte of raw data; including processed images, of order 1000 Tbyte of storage would be required. This is a significant storage requirement, and so will have to be looked at closely during the R&D period.
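These back-of-envelope estimates can be reproduced with a few lines of arithmetic. The inputs (benchmark time, exposure time, pixel counts, data rate) are taken from the text above; the Moore's-law factor assumes the usual doubling every 18 months.

```python
# Processing: SCP benchmark of 1.5 hr (5400 s) to coadd and subtract
# 10 new + 10 reference 2k x 4k images on a 400 MHz PIII.
secs_per_batch = 5400.0
data_secs = 10 * 100.0                          # 10 exposures of 100 s each
pixel_scale = (3000 * 3000) / (2000 * 4000)     # GigaCam CCD vs SCP CCD
cpus_per_ccd = secs_per_batch / data_secs * pixel_scale   # ~6 per CCD
pcs_for_gigacam = cpus_per_ccd * 100            # ~100 GigaCam CCDs

# Moore's law over ~5 years: doubling every 18 months -> ~10x.
moore = 2 ** (5 * 12 / 18)
pcs_at_launch = pcs_for_gigacam / moore

# High-speed storage: 40 intermediate 3k x 3k images, 4 bytes/pixel,
# for each of 100 CCDs -> ~144 Gbytes.
high_speed_bytes = 40 * 3000 * 3000 * 4 * 100

# Long-term storage: 500 Gbyte/day compressed over a 3-year mission.
long_term_bytes = 500e9 * 365 * 3
```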
What is clear is that computing power and storage will not be the biggest issues facing SNAP. Rather, algorithm development optimized for space-based data, pipeline
development, data management, robustness, and science analysis tools will require the largest efforts.
Existing infrastructure
The computing plans for SNAP will leverage the extensive facilities and expertise in the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab. NERSC is the largest unclassified computing facility in the United States and is available to scientists who are supported by the DOE Office of Science (formerly the Office of Energy Research). The NERSC Facilities consist of a very large IBM SP, a Cray T3E, a large Storage System (running HPSS), a computing cluster and the PDSF (a cluster of workstations used mainly for HEP data analysis and simulation). All of these systems are available to the supernova program and are currently being used. In addition, NERSC runs a number of development systems that are also used by the program. The SNAP project will continue the partnership with NERSC. Data processing will be done on NERSC machines and the data will be stored on the HPSS.
The NERSC expertise is also crucial to the success of the SNAP computing effort. The NERSC computer science program covers many of the computing technologies needed for SNAP including workflow management, data management and storage, visualization, collaboration and high-performance networking. The key staff for the SNAP computing R&D and for the implementation of the SNAP computing systems will come from groups in NERSC.
Supernova research
The SNAP computing R&D will build on the expertise of the Supernova Cosmology Project at Berkeley Lab. This group has developed methodologies and an extensive software base for image processing, supernova identification and follow-up. Some of this software can be used directly by SNAP. More importantly, the processes and software provide a basis for the requirements definition and cost estimation that will be carried out in the R&D phase of SNAP.
The Supernova group has begun a new project that will also add to our knowledge base. The Nearby Supernova Factory is scheduled to begin running in July 2002 and will collect up to 1000 nearby supernovae per year in order to study the systematics associated with previous supernova measurements and relevant to SNAP. Like SNAP, the SN Factory will operate a search and follow-up program continuously for 3 years, with a data rate approaching that of SNAP. Although the timescale of the SN Factory precludes the development of a completely new, "SNAP-ready" computing system, most of the concepts and software technologies appropriate to SNAP will be tested during the 3-year running of the SN Factory. In particular, many of the data analysis algorithms should be usable by SNAP.
Develop requirements
The requirements document will address the following items:
·Data Acquisition and Control
·Calibration and Monitoring
·Workflow Management
·Data Management and Access
·Data Storage
·Scheduling and Optimization
·Resource Reservation and Allocation
·Network Management
·Collaboration Tools
·Analysis Algorithms
·Data Presentation and Visualization
Progress has already been made in assembling some of the materials for this document, and a preliminary requirements document will be completed in May 2001. The requirements definition will be done in close collaboration with groups working on instrument readout and spacecraft control to ensure that the interfaces are correctly defined. The requirements document will continue to evolve during R&D, and a refined document will be completed in September 2002.
Develop conceptual design
After agreement to the requirements document, the system architecture will be developed including a block diagram. A critical decision to be made during the R&D phase is the balance between on-board computing and ground-based computing. This decision will define many of the important system parameters such as on-board storage and telecommunications bandwidth. We plan to use system-modeling tools to explore performance and cost implications of different system choices.
System partitioning will be complete in July 2001, and materials will be available for the ZDR the following month.
Evaluation of critical software technologies
After the completion of the system block diagram and the decision about how each functional block is to be implemented, technology selection, design, fabrication, and test of the computing and software systems can begin. It is likely that the evaluation effort will focus on the following areas:
Workflow Management
The scope, complexity and longevity of SNAP data acquisition, storage and analysis argue for a well-engineered description and implementation of the entire data-handling "process". This discipline, often referred to as "workflow management", is receiving increased attention in the commercial software domain. Tools and frameworks for describing and developing complex, integrated computing processes are being developed by several major software vendors and have begun to appear in the commercial marketplace.
Although several workflow management products are now being offered commercially, it is unclear how well they would adapt to the scientific analysis environment. Early discussions with vendors indicate that some modifications would be necessary to allow the full scope of monitoring and control typically available for, say, a customer billing operation to be applied to a SNAP reference image subtraction process.
Early investigation of these, and other, potential mismatches will greatly facilitate the selection process for workflow management tools during implementation of the SNAP data handling system. If it proves too difficult or infeasible to adapt these commercial packages to the SNAP environment, then implementation of the required software control environment through other means can proceed in a timely manner.
Observation Scheduling and Optimization
The SNAP experiment is based on the assumption that, in addition to performing a baseline discovery observation program, the satellite-based instrument will also be able to dedicate at least half of its observational time to a set of in-depth measurements of potential SN candidates. This secondary set of observations will be determined on a daily basis as a result of analysis of the current data set. Integration of these observations into the base observation schedule in a manner that minimizes loss of both time and spacecraft resources is a non-trivial task.
It is necessary to develop a preliminary SNAP science operations model that includes known observational tasks, known instrumentation constraints and anticipated satellite platform operational constraints. Once this preliminary operational framework is in place, these tasks and constraints should be mapped onto an existing optimization methodology (e.g. SPIKE). The relative success of the resulting operations program should inform the design of the actual science operations framework.
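As a toy illustration of the underlying packing problem (not the SPIKE methodology), follow-up observations can be inserted greedily into the gaps left by a fixed base schedule. The one-day horizon, the longest-first priority rule, and all names here are hypothetical.

```python
def insert_followups(base_schedule, followups, day_length=86400):
    """Greedily place follow-up observations into gaps of a base schedule.

    base_schedule: (start, end) windows in seconds, reserved for the
    baseline discovery program. followups: list of durations in seconds.
    Returns the (start, end) slots assigned to the follow-ups that fit.
    """
    # Collect the free gaps between consecutive base observations.
    gaps = []
    prev_end = 0
    for start, end in sorted(base_schedule):
        if start > prev_end:
            gaps.append([prev_end, start])
        prev_end = max(prev_end, end)
    if prev_end < day_length:
        gaps.append([prev_end, day_length])

    placed = []
    # Longest observations first: they are the hardest to fit.
    for duration in sorted(followups, reverse=True):
        for gap in gaps:
            if gap[1] - gap[0] >= duration:
                placed.append((gap[0], gap[0] + duration))
                gap[0] += duration          # shrink the gap
                break
    return placed
```

A real scheduler must also account for slew time, instrument constraints, and spacecraft resources, which is precisely why an existing optimization framework such as SPIKE is worth evaluating.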
Monitoring of important software technologies
In addition to the critical technologies identified above, there are a number of technical areas that will be important for SNAP software development. At this time it appears that these technologies are now adequate for the SNAP mission or will evolve to that point before SNAP design begins. For this reason, these technology areas will be monitored carefully but will not be evaluated in the R&D program.
Computing Resource Reservation and Allocation
SNAP will require scheduled use of large computing resources with extensive data storage facilities. Since these resources will not be dedicated solely to the analysis and management of SNAP data, and since the experiment depends on daily analysis of and response to the collected data, they will need to be scheduled on a daily basis.
Data Access and Management
SNAP will need to catalog and index data for efficient retrieval to make comparisons with new images. The LBNL Data Management Group has devoted considerable attention to this problem and has created a set of tools that have been used on the STAR experiment at RHIC and in the Particle Physics Data Grid. These tools continue to evolve as part of the ongoing Grid R&D efforts and SNAP will monitor these activities.
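As a minimal sketch of the cataloging problem (not the LBNL Data Management Group's tools), reference images could be indexed by a coarse sky tile so that the references overlapping a new image are retrieved with a single indexed query. The schema and tiling scheme are illustrative assumptions.

```python
import sqlite3

def tile_key(ra, dec, tile_deg=1.0):
    """Map sky coordinates (degrees) to an integer tile id."""
    return int(ra // tile_deg) * 1000 + int((dec + 90.0) // tile_deg)

# In-memory toy catalog; a production system would use a real database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE refimage (
    path TEXT, ra REAL, dec REAL, mjd REAL, tile INTEGER)""")
conn.execute("CREATE INDEX idx_tile ON refimage(tile)")

def add_image(path, ra, dec, mjd):
    conn.execute("INSERT INTO refimage VALUES (?,?,?,?,?)",
                 (path, ra, dec, mjd, tile_key(ra, dec)))

def references_for(ra, dec):
    """All reference images indexed under the same tile as (ra, dec)."""
    cur = conn.execute("SELECT path, mjd FROM refimage WHERE tile = ?",
                       (tile_key(ra, dec),))
    return cur.fetchall()
```

A real index would also handle images straddling tile boundaries (by querying neighboring tiles) and would store far richer metadata; the point here is only that retrieval cost stays flat as the catalog grows.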
Data Storage
SNAP will produce unprocessed data at a rate of approximately 500 Gbytes/day. In addition to being placed into archival storage, this data must have instrumentation and cosmic-ray effects removed and must be compared with existing sky images from earlier dates. This daily processing will require access to several hundred gigabytes of storage for calibration and reference image operations. Although the amount of data accumulated over 3 years is substantial, current projections of HPSS growth at NERSC predict adequate tertiary data storage for SNAP operations. It should be noted, though, that HPSS storage is not to be considered "permanent". As noted earlier, data analysis will require access patterns that frequently touch a large set of calibration and baseline image files.
SNAP will monitor the development of integrated, high-capacity data storage facilities at alternative data processing sites. It will also monitor the development of GRID-based technologies, such as GASS (Global Access to Secondary Storage), for automated cache management of frequently used data sets. Future investigations should compare the performance and features of these technologies against commercial hierarchical storage management systems that provide similar functionality.
Network Quality of Service (QoS) and Bandwidth Reservation
When fully operational, the data rate between the SNAP ground station and the SNAP computing site will average 540 Gbytes/day. Although daunting, these rates appear to be achievable. However, in order for SNAP to be successful, these rates must be sustained continuously, on a daily basis, over a period of three to five years. It is not clear what technical and organizational improvements are necessary to achieve this performance.
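The average sustained rate implied by this volume is easy to work out, assuming for illustration that the data flow is spread evenly over 24 hours (real downlink contacts would be burstier):

```python
# Sustained network rate implied by 540 Gbyte/day of ground-station
# traffic, averaged over a full day.
bytes_per_day = 540e9
bits_per_sec = bytes_per_day * 8 / 86400.0   # ~5.0e7, i.e. ~50 Mbit/s
```

A steady ~50 Mbit/s average is well within the reach of a research backbone such as ESNet; the open question raised above is guaranteeing it every day for years, which is a QoS and operations problem rather than a raw-capacity one.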
Early experiments in network QoS reservation and enforcement schemes have been undertaken among LBNL, ANL, and ESNet. In order to enter the preliminary design phase fully prepared to begin actual system design, SNAP should be thoroughly familiar with the current and future capabilities of network providers such as ESNet. Furthermore, as network QoS and bandwidth reservation requirements become part of these networking organizations, it will be important to have a SNAP presence contributing to both the early planning and testing of these capabilities. Therefore, SNAP should monitor closely the network QoS and reservation program currently in progress at LBNL.
Collaboration Tools and Environment
The global nature of the SNAP collaboration, in conjunction with its 24x7 data collection and analysis operations model, creates a real need for simple, effective communication among members of the collaboration. The focused nature of the SNAP collaboration environment provides a unique opportunity to both describe and implement specific software tools that will increase the efficiency of communications among its members.
Current prototype efforts at NERSC/LBNL have demonstrated schemes for effectively using web technology (webDAV) for quickly distributing data analysis results to a large distributed collaboration. These web-based techniques readily lend themselves to distributed, secure interaction between collaboration members and ongoing automated analysis activities.