CDIS White Paper (Version by Reagan Moore)

Last revised: 31 July 2000

Center for Data Intensive Science

University of Florida

Florida State University

University of Chicago

University of Illinois at Chicago

California Institute of Technology

San Diego Supercomputer Center

OVERVIEW

We propose an NSF-supported Science and Technology Center for Data Intensive

Science (CDIS) that will (a) discover and develop the methodologies,

software, engineering and technology tools critical to qualitatively new

modes of extracting scientific knowledge from massive data collections; (b)

disseminate this knowledge via training of both students and active

researchers; (c) engage in two-way knowledge transfer with partners that

develop software for data-intensive functions (management, analysis), both by repurposing emerging business methods to scientific use and by making

leading edge scientific practice accessible to those developers.

Massive data collections are arising more and more often in multiple, seemingly unconnected scientific specialties. Their appearance and growth

are a result of dramatic increases in both capability and capacity of

scientific instrumentation (both experimental and computational), storage

technology, and networks. It is expected that the largest collections, which

today approach the Petabyte range (1,000 Terabytes), will expand by the

early part of the next decade to the Exabyte scale, one thousand times

larger. We are thus entering an era of data intensive science. In this

mode, gaining new insight is critically dependent upon superior storage,

retrieval, indexing, searching, filtering, and analysis of extremely large

quantities of data. This era promises many new opportunities for systematic

discernment of patterns, signals, and correlations that are far too subtle

or interwoven to be picked up by simple scanning, ordinary statistical

tools, and routine visualization methods.

This new era also presents enormous long-term Information Technology (IT)

challenges. On scaling arguments alone, a more-of-the-same methodological

approach clearly will fail. In particular, the large quantities of data may

well be distributed across the network environment. Further, the gross size

of the collections accessed and the analytical calculations to be performed

on them both will force data-intensive applications to harness large numbers

of distributed storage systems, networks, and computers. (An oddly charming

example is the harnessing of thousands of otherwise idle PCs to process

SETI data.) These trends will be catalyzed further by the growing emphasis

on multi-specialty team research. As researchers increasingly work in nationally and internationally dispersed teams while the size of the data collections burgeons, the result is a growing

requirement for better collaborative tools and related IT infrastructure to

maximize the potential and rate of scientific discovery.

Data intensive specialties are growing rapidly in number and variety. A

partial list today includes high energy physics, nuclear physics, whole-sky

astronomical digital surveys, gravitational wave searches, crystallography

using high intensity X-ray sources, climate modeling, systematic satellite

earth observations, molecular genomics, three dimensional (including whole

body) medical scans, proteomics, molecular and materials modeling, Virtual

Reality (VR) simulations, digital archiving, and many others. The Center's

focus will be on a significant subset of these disciplines, particularly on

the common issues of data intensivity faced by all of them. The CDIS

research program is enthusiastically, purposefully multidisciplinary, with

participants from computer science, computational science, engineering,

physics, astronomy, chemistry and biology.

CDIS has devised a broad program to extend the knowledge gained through its

research efforts to other scientific communities, international

collaborators, the IT industry, students at all levels, and the general

public. Our knowledge transfer effort is particularly strong on account of

our close collaboration with national and international laboratories and

research projects, bioinformatics programs at other institutes, regional and

national organizations, and vendors from many portions of the IT industry.

We will also have an interactive web interface in which visitors can access,

search, manipulate, correlate and visualize sample data sets from our

application disciplines and thus gain a sense of the excitement and

empowerment felt by those of us who pursue scientific careers. Our education

and outreach effort will be coordinated primarily through the EOT-PACI

program of the NCSA Alliance, taking advantage of their considerable

experience in designing K-12 and undergraduate programs and reaching women

and underrepresented minorities.

Introduction to Data Intensive Science

We propose creation of a Science and Technology Center for Data Intensive

Science (CDIS) that will provide the research foundation and essential

development for the methodologies, software, engineering, and technology

tools common to data intensive science specialties in the next twenty years.

The Center's focus will be on disciplines which must extract scientific

results from immense, often heterogeneous data collections. Today, the

largest of these collections are approaching a Petabyte in size and are

expected to attain the Exabyte (1,000 Petabytes) scale early in the next

decade. Several features pose enormous computational science and computer

science challenges if the potential value of such ultra-large datasets is to

be realized, including:

* magnitude and complexity of geographically dispersed collections accessed by multi-disciplinary, geographically dispersed user teams

* representation of the information content of scientific data collections

* representation and management of knowledge that describes relationships between sets of information derived from scientific collections

The extraordinary opportunity is the development of technology that allows the manipulation of knowledge contained within scientific data collections. We understand how to represent information through the use of markup languages such as XML, and how to use semi-structured representations such as DTDs to support manipulation of information. The next challenge is to develop both markup languages for describing relationships between sets of information (characterized as knowledge), and representations that will support manipulation of knowledge. The relationships must be consistent with the underlying physical laws that govern scientific collections. We understand how to use simple “is a” and “has a” relationships to represent domain knowledge, and can build rule-based ontology mappings that use knowledge bases to illustrate the physical correspondence between digital objects. We need to extend the types of rules used to characterize relationships to include mathematical functions that define features inherent in scientific data. There will be a strong interdependence between the mathematical tools used to do feature extraction, the knowledge bases that are generated to describe relationships between information sets within a discipline, and the ontologies used to organize information within a discipline. Currently, these topics correspond to disjoint computer science research agendas, and have limited application to scientific data.
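As a minimal sketch of this interdependence, the following Python fragment assumes hypothetical digital objects marked up in XML, simple "is a"/"has a" relationships, and one rule expressed as a mathematical function; feature extraction is then the application of that rule under the control of a small knowledge base. The element names, the energy threshold, and the rule itself are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Semi-structured information: two hypothetical digital objects marked up in XML.
DOCUMENT = """
<collection>
  <object id="evt-1" kind="event"><energy>212.0</energy></object>
  <object id="evt-2" kind="event"><energy>37.5</energy></object>
</collection>
"""

# Simple "is a" / "has a" relationships recording domain knowledge.
IS_A = {"event": "digital_object"}
HAS_A = {"event": ["energy"]}

# A rule that goes beyond "is a"/"has a": a mathematical function defining a
# feature inherent in the data (the 100.0 threshold is an invented example).
def is_high_energy(obj: ET.Element) -> bool:
    return float(obj.findtext("energy")) > 100.0

# Knowledge base: each derived feature names the kind of object it applies to,
# the attribute it relies on, and the rule that computes it.
KNOWLEDGE_BASE = {
    "high_energy_event": {"applies_to": "event", "uses": "energy", "rule": is_high_energy},
}

def extract_features(xml_text: str):
    """Apply each knowledge-base rule to the objects whose kind and attributes match."""
    root = ET.fromstring(xml_text)
    features = []
    for obj in root.findall("object"):
        kind = obj.get("kind")
        if IS_A.get(kind) != "digital_object":      # consistency check via "is a"
            continue
        for name, entry in KNOWLEDGE_BASE.items():
            if (entry["applies_to"] == kind
                    and entry["uses"] in HAS_A.get(kind, [])   # "has a" check
                    and entry["rule"](obj)):
                features.append((obj.get("id"), name))
    return features

if __name__ == "__main__":
    print(extract_features(DOCUMENT))   # [('evt-1', 'high_energy_event')]
```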

These in turn carry many implicit challenges. Examples include

identification of common problems, efficient sharing of algorithms developed

in one specialty (avoidance of wheel-reinvention), and re-purposing of commercial data intensive application methods (e.g. data mining, portals). Such challenges can be met only through a focused, systematic, sustained,

multidisciplinary approach. That is the mission of this Center.

Extremely rapid expansion in dataset size is being driven by long-term,

continuing exponential price declines in storage and communications media,

and parallel trends in computation. These patterns have allowed individuals,

teams, and whole research communities to accumulate tremendously large,

heterogeneous aggregations of potentially or actually interesting data in

their areas. These include observational data, video, VR experiences,

etc. Other communities, for example materials simulation and modeling, that

have long been data-limited to the extent of discarding most of their

computed results now are contemplating retaining vast collections for

pattern mining and recognition. Current and foreseeable developments make it

likely that by the next decade individual scientists will have access to Petabit (10^15 bits) local storage systems. Similarly, the aggregate storage capacities available to many science communities will reach Exabytes (10^18 bytes). (Reference for this trend?) New storage capabilities alone

obviously will not allow these disciplines to fulfill their scientific

potential. Use of current methods of storage, indexing, retrieval, and

analysis will not suffice. Without both qualitative and quantitative

improvements in these areas, massive datasets simply will drown research

teams in undigested data. Specifically, at least the following questions

must be addressed:

* How can a community collaboratively create, manage, organize, catalog,

analyze, process, visualize, and extract knowledge from a distributed

Exabyte-scale aggregate of heterogeneous scientific data?

* How can an individual scientist utilize a Petabyte of heterogeneous

data? Even supposing that the data have been created, managed,

organized, and cataloged, what markup languages and knowledge representations are needed to enable

a person to analyze, process, visualize, and extract knowledge from

such an aggregate?

* How is the presupposed creation, management, organization, and cataloging of massive archives to be done with maximum benefit while remaining compatible with the physical laws that describe the fundamental science being

addressed?

* How can multiple communities avoid diversion of effort from the science

they are pursuing into redundant study of methods for analyzing and managing the knowledge content of Petabyte archives?

* What would be the impact on algorithm design and scientific problem

selection for computational specialties (e.g. simulations) of having

the ability to compare ultra-large archives of computed data (rather than the current practice of discarding the great majority of computed data) with observational data?

Clearly, only a coordinated, multi-specialty attack on these

challenges will yield solutions that will be transferable among a variety of

disciplines. A useful prototype is Netlib, which provides well-documented,

tested mathematical software for incorporation in a wide variety of

applications, both scientific and commercial. Using linear algebra as an

example, software available through Netlib has long since relieved

developers of the burden of writing original code. If the algorithms used to represent features within a scientific discipline are also published, then Netlib can serve as a repository for the underlying rules. The system will have to be augmented with a logic system that applies the rules within the context of a knowledge base. The knowledge base will require a representation that characterizes the application of the rules. (Typical systems use graphical representations.) Feature extraction will then be the application of the knowledge base rules under the control of the representation that is appropriate for the scientific discipline.
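As a minimal sketch of such a repository, the following Python fragment assumes a hypothetical registry in which feature-extraction algorithms are published as named rules, together with a small graphical knowledge base whose entries state which rules are meaningful for which kind of digital object; feature extraction is then the application of the published rules under the control of that graph. The registry functions, the outlier rule, and the sample data are invented for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Repository of published rules, in the spirit of Netlib publishing documented routines.
RULE_REPOSITORY: Dict[str, Callable[[dict], bool]] = {}

def register_rule(name: str):
    """Publish a feature-extraction rule under a documented name."""
    def decorator(fn: Callable[[dict], bool]):
        RULE_REPOSITORY[name] = fn
        return fn
    return decorator

@register_rule("is_outlier")
def is_outlier(obj: dict) -> bool:
    # Illustrative rule: a measurement more than 3 sigma from its expected value.
    return abs(obj["value"] - obj["expected"]) > 3 * obj["sigma"]

# Graphical knowledge base: (object_kind, rule_name) edges state which published
# rules apply to which kind of digital object.
KNOWLEDGE_GRAPH: List[Tuple[str, str]] = [("measurement", "is_outlier")]

def apply_knowledge(objects: List[dict]) -> List[Tuple[str, str]]:
    """Feature extraction = applying repository rules under control of the graph."""
    hits = []
    for obj in objects:
        for kind, rule_name in KNOWLEDGE_GRAPH:
            if obj["kind"] == kind and RULE_REPOSITORY[rule_name](obj):
                hits.append((obj["id"], rule_name))
    return hits

if __name__ == "__main__":
    data = [
        {"id": "m1", "kind": "measurement", "value": 9.7, "expected": 5.0, "sigma": 1.0},
        {"id": "m2", "kind": "measurement", "value": 5.1, "expected": 5.0, "sigma": 1.0},
    ]
    print(apply_knowledge(data))  # [('m1', 'is_outlier')]
```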

(Need 1 or 2 more knowledge transfer arguments here)

Examples of Data Intensive Disciplines

Data-intensivity characterizes much of the most important and far-reaching

research being carried out today on the fundamental forces of nature,

biological systems, atmospheric and geophysical phenomena, the composition

and structure of the universe, and advanced materials. Experiments and

activities in these fields include

* High energy and nuclear physics: Experiments such as those at the Large

Hadron Collider (LHC) at CERN, Switzerland will search for the origins

of mass and probe matter at the smallest length scales. These experiments will generate two billion events comprising petabytes of data. Feature extraction, event correlation, and distributed analysis are all relevant data intensive research areas. The Particle Physics Data Grid is developing data management tools for data replication based upon the SDSC Storage Resource Broker and the Globus grid. Of interest is the ability to characterize the feature extraction algorithms as knowledge that correlates attributes about events.

**SBT comment! So what? We need to say what data they generate and how

those data are hard to utilize fully with today's tools

* Molecular genomics: Whole-genome sequencing data will permit wholesale

study of genetic relationships of organisms and how complex mutations

express themselves. (Anything else?). An exciting example is the Human

Genome Project, which will provide the first comprehensive DNA

sequencing data on the human genome. The ability to correlate genes between species is an example of knowledge relationships that should be expressible within the context of a knowledge base. Ontology mapping based upon knowledge bases is needed to correlate information between genomics databases, the Protein Data Bank, and molecular trajectory databases. A bioinformatics project in the National Partnership for Advanced Computational Infrastructure (NPACI) is developing the data management systems to support analysis of these collections. Further research is needed to develop better representations of the structural relationships between genes and proteins.

**SBT comment! I think the preceding example is a risky one, one we

maybe should stay away from. People in the H.G.P. have been spending

tons of money for a long time on the informatics aspects. We will look

like late-comers and maybe even amateurs unless we have a major, major

player in this area on our team.

* Proteomics: This is the study of the three dimensional shapes of

proteins and how these shapes influence the subsequent chemical

behavior of proteins. I need a better description from an expert.

**SBT comment! I never heard the term "Proteomics" used in the

preceding title but there is a great deal of molecular modeling being

done to determine protein topologies, docking sites, steric effects,

structure/activity relationships etc. I can check with Adrian Roitberg

if you want me to.

The rules governing the folding of proteins are again a prime candidate for the development of a knowledge base. Expression of protein structures as a set of quantifiable folding rules is both a challenging scientific problem and a knowledge representation problem.

* Cell signaling. The next bioinformatics challenge is the derivation and characterization of the signaling pathways within cells. This requires correlating information about protein structures, time dependent information channels, and basic computations of electrostatic fields within a cell. Expressing the relationships between the pathways is an opportunity for the creation of a knowledge base to represent how interactions between each component of a cell are correlated. The Alliance for Cell Signaling at U Texas/UCSD is funded by NIH to do the basic biochemistry research. CDIS can support the knowledge management research.

* Single molecule systems: Does anyone know anything about this field?

* Three dimensional scans: Medical imaging, including whole body scans,

is becoming an important diagnostic tool. For example, the Human Brain

Project will carry out time series of 3-D scans of the human brain to

enable understanding of our most complex and uncharted organ system;

**SBT comment! Again, at risk of being discouragingly negative, most of

the preceding topic is subsumed under the Imaging rubric and the

imaging centers therefore think they "own" it.

The Neuroscience thrust area of NPACI is supporting the federation of neuroscience databases of brain images. This is an example of the use of knowledge systems to define relationships between the physical and biological components of nervous systems.

* Crystallography: Facilities such as the Advanced Photon Source (APS)

use high intensity X-ray beams to map the 3-D atomic-scale

structure of advanced materials and biological molecules;

**SBT comment! I want to talk with Dave Tanner about the preceding

topic. "Crystallography" is perceived as an old-fashioned area, but

atomic-scale characterization and manipulation of materials is not.

The ability to automate the ingestion of the structures into information bases presumes the ability to extract relevant features for characterizing the new shapes, and presumes the ability to organize the features into a knowledge base that can be used by the community to understand the relevance of the information.

* Systematic satellite earth observations: The best known example is the

Earth Observing System (EOS), which will provide complete,

multiple-wavelength observations of our planet to yield improved

understanding of the Earth as an integrated system. The EOS system is being revised to support access to distributed data collections as part of the NASA New DISS development effort. This is expected to be done over the next three to five years. The time is ripe for research into the best representations to use for both information management and knowledge management within the context of earth systems data.

* Climate modeling: A major challenge of climate modeling is the extraction of features such as cyclones and hurricanes from the simulations. These climatic features provide ways to correlate the impact of the climate model simulations. The rules used to characterize cyclones and other climatic features need to be represented in a knowledge base that is used to compare results from multiple climate models (a sketch of this idea appears after this list). Again there can be a tight coupling between knowledge representations and scientific data feature extraction. A second area is the use of adjoint models to ingest observational data directly into simulations. Characterizations of the observational data need to be integrated with characterizations of the features extracted from the simulation data. Both sets of characterizations might be represented in separate knowledge bases. The challenge then is to support correlations across multiple knowledge bases.

**SBT comment! FSU has some major players in Climate Modelling.

* Geosciences: The NSF-supported NEES facility for earthquake studies will support collections of observational data from experimental systems, and will need to correlate results from experiments with direct measurements of real-world events. Again this implies the need to define knowledge about the causes and effects of ground movement on structures. The NEES project will need advanced representations for knowledge bases to identify relationships between observed structural deformation and the driving ground motions. This is an opportunity to explore the creation of knowledge bases that can express relationships between causes and effects, and serve as templates for automated feature extraction.
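As a minimal sketch of the climate modeling example referenced above, the following Python fragment stores a climatic feature rule in a small knowledge base and applies it to a synthetic pressure field, so that output from different models could be compared through the same rule. The detection criterion (a local sea-level-pressure minimum below 980 hPa) and the data are simplifications invented for illustration, not meteorological standards or project deliverables.

```python
import numpy as np

def pressure_minimum_rule(slp: np.ndarray, threshold: float = 980.0):
    """Return grid indices of local sea-level-pressure minima below the threshold (hPa)."""
    features = []
    for i in range(1, slp.shape[0] - 1):
        for j in range(1, slp.shape[1] - 1):
            window = slp[i - 1:i + 2, j - 1:j + 2]
            if slp[i, j] < threshold and slp[i, j] == window.min():
                features.append((i, j, float(slp[i, j])))
    return features

# Knowledge base entry: the rule plus the relationships it participates in, so
# that results from different models can be correlated through the same feature.
CLIMATE_KB = {
    "cyclone_candidate": {
        "rule": pressure_minimum_rule,
        "derived_from": "sea_level_pressure",
        "comparable_across": ["model_A_output", "model_B_output"],
    },
}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    field = 1010.0 + rng.normal(0.0, 2.0, size=(20, 20))  # synthetic pressure field
    field[7, 12] = 965.0                                   # plant a low-pressure center
    entry = CLIMATE_KB["cyclone_candidate"]
    print(entry["rule"](field))                            # [(7, 12, 965.0)]
```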