CDIS White Paper (Version by Reagan Moore)
Last revised: 31 July 2000
Center for Data Intensive Science
University of Florida
Florida State University
University of Chicago
University of Illinois at Chicago
California Institute of Technology
San Diego Supercomputer Center
OVERVIEW
We propose an NSF-supported Science and Technology Center for Data Intensive
Science (CDIS) that will (a) discover and develop the methodologies,
software, engineering, and technology tools critical to qualitatively new
modes of extracting scientific knowledge from massive data collections; (b)
disseminate this knowledge by training both students and active
researchers; and (c) engage in two-way knowledge transfer with partners that
develop software for data-intensive functions (management, analysis), both
by repurposing emerging business methods for scientific use and by making
leading-edge scientific practice accessible to those developers.
Massive data collections are appearing with increasing frequency in multiple,
seemingly unconnected scientific specialties. Their appearance and growth
result from dramatic increases in the capability and capacity of scientific
instrumentation (both experimental and computational), storage technology,
and networks. The largest collections, which today approach the Petabyte
range (1,000 Terabytes), are expected to expand by the early part of the
next decade to the Exabyte scale, one thousand times larger. We are thus
entering an era of data intensive science. In this mode, gaining new insight
is critically dependent upon superior storage, retrieval, indexing,
searching, filtering, and analysis of extremely large quantities of data.
This era promises many new opportunities for systematic discernment of
patterns, signals, and correlations that are far too subtle or interwoven to
be picked up by simple scanning, ordinary statistical tools, and routine
visualization methods.
This new era also presents enormous long-term Information Technology (IT)
challenges. On scaling arguments alone, a more-of-the-same methodological
approach clearly will fail. In particular, the large quantities of data may
well be distributed across the network environment. Further, the gross size
of the collections accessed and the analytical calculations to be performed
on them both will force data-intensive applications to harness large numbers
of distributed storage systems, networks, and computers. (An oddly charming
example is the harnessing of thousands of otherwise idle PCs to process
SETI data.) These trends will be catalyzed further by the growing emphasis
on multi-specialty team research. As nationally and internationally
distributed researchers increasingly work in geographically dispersed teams
and the size of the data structures burgeons, the requirement grows for
better collaborative tools and related IT infrastructure to maximize the
potential and rate of scientific discovery.
Data intensive specialties are growing rapidly in number and variety. A
partial list today includes high energy physics, nuclear physics, whole-sky
astronomical digital surveys, gravitational wave searches, crystallography
using high intensity X-ray sources, climate modeling, systematic satellite
earth observations, molecular genomics, three dimensional (including whole
body) medical scans, proteomics, molecular and materials modeling, Virtual
Reality (VR) simulations, digital archiving, and many others. The Center's
focus will be on a significant subset of these disciplines, particularly on
the common issues of data intensivity faced by all of them. The CDIS
research program is enthusiastically, purposefully multidisciplinary, with
participants from computer science, computational science, engineering,
physics, astronomy, chemistry and biology.
CDIS has devised a broad program to extend the knowledge gained through its
research efforts to other scientific communities, international
collaborators, the IT industry, students at all levels, and the general
public. Our knowledge transfer effort is particularly strong on account of
our close collaboration with national and international laboratories and
research projects, bioinformatics programs at other institutes, regional and
national organizations, and vendors from many portions of the IT industry.
We will also have an interactive web interface in which visitors can access,
search, manipulate, correlate and visualize sample data sets from our
application disciplines and thus gain a sense of the excitement and
empowerment felt by those of us who pursue scientific careers. Our education
and outreach effort will be coordinated primarily through the EOT-PACI
program of the NCSA Alliance, taking advantage of their considerable
experience in designing K-12 and undergraduate programs and reaching women
and underrepresented minorities.
Introduction to Data Intensive Science
We propose creation of a Science and Technology Center for Data Intensive
Science (CDIS) that will provide the research foundation and essential
development for the methodologies, software, engineering, and technology
tools common to data intensive science specialties in the next twenty years.
The Center's focus will be on disciplines which must extract scientific
results from immense, often heterogeneous data collections. Today, the
largest of these collections are approaching a Petabyte in size and are
expected to attain the Exabyte (1,000 Petabytes) scale early in the next
decade. Several features pose enormous computational science and computer
science challenges if the potential value of such ultra-large datasets is to
be realized, including:
* magnitude and complexity of geographically dispersed collections accessed by multi-disciplinary, geographically dispersed user teams
* representation of the information content of scientific data collections
* representation and management of knowledge that describes relationships between sets of information derived from scientific collections
The extraordinary opportunity is the development of technology that allows the manipulation of knowledge contained within scientific data collections. We understand how to represent information through the use of markup languages such as XML, and how to use semi-structured representations such as DTDs to support the manipulation of information. The next challenge is to develop both markup languages for describing relationships between sets of information (characterized as knowledge) and representations that will support the manipulation of knowledge. The relationships must be consistent with the underlying physical laws that govern scientific collections. We understand how to use simple "is a" and "has a" relationships to represent domain knowledge, and we can build rule-based ontology mappings that use knowledge bases to illustrate the physical correspondence between digital objects. We need to extend the types of rules used to characterize relationships to include mathematical functions that define features inherent in scientific data. There will be a strong interdependence between the mathematical tools used for feature extraction, the knowledge bases generated to describe relationships between information sets within a discipline, and the ontologies used to organize information within a discipline. Currently, these topics correspond to disjoint computer science research agendas and have limited application to scientific data.
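To make this distinction concrete, the following sketch pairs an XML information record with a small knowledge base whose relationships include a mathematical rule as well as simple "is a" and "has a" links. The element names, attributes, relations, and thresholds are hypothetical illustrations rather than a proposed schema, and the code assumes nothing beyond the standard Python library.

    # A minimal sketch (hypothetical names and thresholds): an XML information
    # record alongside a small knowledge base whose relationships include a
    # mathematical rule as well as "is a" / "has a" links.
    import xml.etree.ElementTree as ET

    # Information layer: a semi-structured XML description of digital objects.
    record = ET.fromstring("""
    <collection>
      <object id="event_42" type="detector_event" energy_gev="118.7"/>
      <object id="event_43" type="detector_event" energy_gev="212.4"/>
    </collection>
    """)

    # Knowledge layer: simple relationships between sets of information,
    # expressed as (subject, relation, object) triples.
    triples = [
        ("higgs_candidate", "is_a", "detector_event"),
        ("detector_event", "has_a", "energy_gev"),
    ]

    def related(subject, relation):
        """Look up the simple "is a" / "has a" knowledge for a subject."""
        return [obj for (s, r, obj) in triples if s == subject and r == relation]

    def in_signal_window(obj, low=110.0, high=130.0):
        """A relationship expressed as a mathematical function: the object
        qualifies if its energy attribute lies in a (hypothetical) window."""
        energy = obj.get("energy_gev")
        return energy is not None and low <= float(energy) <= high

    # Applying the knowledge: answer a structural query and apply the
    # functional rule across the collection.
    print("a higgs_candidate is a:", related("higgs_candidate", "is_a"))
    print("objects satisfying the functional rule:",
          [obj.get("id") for obj in record.findall("object")
           if in_signal_window(obj)])

The structural triples capture domain vocabulary of the kind existing ontologies already handle; the functional rule is the kind of relationship, grounded in the physics of the data, that the proposed knowledge representations would need to express and manipulate.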
These in turn carry many implicit challenges. Examples include the
identification of common problems, the efficient sharing of algorithms
developed in one specialty (avoiding reinvention of the wheel), and the
re-purposing of commercial data-intensive application methods (e.g., data
mining, portals). Such challenges can be met only through a focused,
systematic, sustained, multidisciplinary approach. That is the mission of
this Center.
Extremely rapid expansion in dataset size is being driven by long-term,
continuing exponential price declines in storage and communications media,
and parallel trends in computation. These patterns have allowed individuals,
teams, and whole research communities to accumulate tremendously large,
heterogeneous aggregations of potentially or actually interesting data in
their areas. These include observational data, video, VR experiences, and
more. Other communities, for example materials simulation and modeling, that
have long been data-limited to the extent of discarding most of their
computed results, are now contemplating retaining vast collections for
pattern mining and recognition. Current and foreseeable developments make it
likely that by the next decade individual scientists will have access to
Petabit (10^15 bits) local storage systems. Similarly, the aggregate storage
capacities available to many science communities will reach Exabytes (10^18
bytes). (reference for this trend?) New storage capabilities alone
obviously will not allow these disciplines to fulfill their scientific
potential. Use of current methods of storage, indexing, retrieval, and
analysis will not suffice. Without both qualitative and quantitative
improvements in these areas, massive datasets simply will drown research
teams in undigested data. Specifically, at least the following questions
must be addressed:
* How can a community collaboratively create, manage, organize, catalog,
  analyze, process, visualize, and extract knowledge from a distributed
  Exabyte-scale aggregate of heterogeneous scientific data?
* How can an individual scientist utilize a Petabyte of heterogeneous
  data? Even supposing that the data have been created, managed,
  organized, and cataloged, what methods, markup languages, and knowledge
  representations are needed to enable a person to analyze, process,
  visualize, and extract knowledge from such an aggregate?
* How is the presupposed creation, management, organization, and
  cataloging of massive archives to be done with maximum benefit, with
  minimum diversion of resources from the fundamental science being
  addressed, and in a manner compatible with the physical laws that
  describe that science?
* How can multiple communities avoid diverting effort from the science
  they are pursuing into redundant study of methods for analyzing and
  managing the knowledge content of Petabyte archives?
* What would be the impact on algorithm design and scientific problem
  selection for computational specialties (e.g., simulations) of being
  able to retain ultra-large archives of computed data (rather than
  discarding the great majority of it, as is current practice) and to
  compare those archives with observational data?
Clearly, only a coordinated, multi-specialty attack on these challenges
will yield solutions that will be transferable among a variety of
disciplines. A useful prototype is Netlib, which provides well-documented,
tested mathematical software for incorporation in a wide variety of
applications, both scientific and commercial. Using linear algebra as an
example, software available through Netlib has long since relieved
developers of the burden of writing original code. If the algorithms used
to represent features within a scientific discipline are also published,
then a Netlib-like repository can hold the underlying rules. Such a system
will have to be augmented with a logic system that applies the rules within
the context of a knowledge base. The knowledge base will require a
representation that characterizes the application of the rules (typical
systems use graphical representations). Feature extraction then becomes the
application of the knowledge base rules under the control of the
representation appropriate for the scientific discipline.
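As an illustration of this arrangement, the sketch below combines a Netlib-style repository of published feature-extraction rules, a knowledge base that records which rules characterize a feature for a given discipline, and a small driver playing the role of the logic system. The rule names, the discipline label, and the sample data are hypothetical placeholders, not proposed components.

    # A minimal sketch of the proposed arrangement: a repository of published
    # rules, a knowledge base selecting rules per discipline, and a driver
    # that applies them. All names and data are hypothetical placeholders.

    # Repository: documented, tested rules indexed by name (Netlib-style).
    def rising_trend(series):
        """Rule: the series ends higher than it starts."""
        return series[-1] > series[0]

    def exceeds_threshold(series, limit=100.0):
        """Rule: some sample exceeds a discipline-specific limit."""
        return max(series) > limit

    REPOSITORY = {
        "rising_trend": rising_trend,
        "exceeds_threshold": exceeds_threshold,
    }

    # Knowledge base: which repository rules, taken together, characterize
    # a feature of interest within a given discipline.
    KNOWLEDGE_BASE = {
        "climate": ["rising_trend", "exceeds_threshold"],
    }

    def extract_feature(discipline, series):
        """Logic system: apply the rules the knowledge base selects for the
        discipline; the feature is present only if every rule holds."""
        return all(REPOSITORY[name](series)
                   for name in KNOWLEDGE_BASE[discipline])

    print(extract_feature("climate", [90.0, 95.0, 104.0]))   # True
    print(extract_feature("climate", [90.0, 85.0, 80.0]))    # False

The value of the separation is that the repository can be shared across disciplines, exactly as Netlib is today, while each discipline contributes only its own knowledge-base entries and the representation that controls how its rules are applied.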
(Need 1 or 2 more knowledge transfer arguments here)
Examples of Data Intensive Disciplines
Data-intensivity characterizes much of the most important and far-reaching
research being carried out today on the fundamental forces of nature,
biological systems, atmospheric and geophysical phenomena, the composition
and structure of the universe, and advanced materials. Experiments and
activities in these fields include
* High energy and nuclear physics: Experiments such as those at the Large
  Hadron Collider (LHC) at CERN, Switzerland, will search for the origins
  of mass and probe matter at the smallest length scales. These experiments
  will generate two billion events comprising Petabytes of data. Feature
  extraction, event correlation, and distributed analysis are all relevant
  data intensive research areas. The Particle Physics Data Grid is
  developing data management tools for data replication based upon the SDSC
  Storage Resource Broker and the Globus grid. Of particular interest is the
  ability to characterize the feature extraction algorithms as knowledge
  that correlates attributes about events.
* Molecular genomics: Whole-genome sequencing data will permit wholesale
  study of the genetic relationships of organisms and of how complex
  mutations express themselves. An exciting example is the Human Genome
  Project, which will provide the first comprehensive DNA sequencing data
  on the human genome. The ability to correlate genes between species is an
  example of a knowledge relationship that should be expressible within the
  context of a knowledge base. Ontology mapping based upon knowledge bases
  is needed to correlate information between genomics databases, the
  Protein Data Bank, and molecular trajectory databases. A bioinformatics
  project in the National Partnership for Advanced Computational
  Infrastructure (NPACI) is developing the data management systems to
  support analysis of these collections. Further research is needed to
  develop better representations of the structural relationships between
  genes and proteins.
* Proteomics: This is the study of the three-dimensional shapes of
  proteins and of how these shapes influence the subsequent chemical
  behavior of proteins. The rules governing the folding of proteins are
  again a prime candidate for the development of a knowledge base.
  Expression of protein structures as a set of quantifiable folding rules
  is both a challenging scientific problem and a knowledge representation
  problem.
* Cell signaling: The next bioinformatics challenge is the derivation and
  characterization of the signaling pathways within cells. This requires
  correlating information about protein structures, time-dependent
  information channels, and basic computations of electrostatic fields
  within a cell. Expressing the relationships between the pathways is an
  opportunity to create a knowledge base that represents how the
  interactions among the components of a cell are correlated. The Alliance
  for Cell Signaling at U Texas/UCSD is funded by NIH to do the basic
  biochemistry research; CDIS can support the knowledge management
  research.
* Single molecule systems: Does anyone know anything about this field?
* Three dimensional scans: Medical imaging, including whole-body scans, is
  becoming an important diagnostic tool. For example, the Human Brain
  Project will carry out time series of 3-D scans of the human brain to
  enable understanding of our most complex and uncharted organ system. The
  Neuroscience thrust area of NPACI is supporting the federation of
  neuroscience databases of brain images. This is an example of the use of
  knowledge systems to define relationships between the physical biological
  components of nervous systems.
* Crystallography: Facilities such as the Advanced Photon Source (APS) use
  high intensity X-ray beams to map the 3-D atomic-scale structure of
  advanced materials and biological molecules. The ability to automate the
  ingestion of these structures into information bases presumes the ability
  to extract the relevant features characterizing the new shapes, and to
  organize those features into a knowledge base that the community can use
  to understand the relevance of the information.
* Systematic satellite earth observations: The best known example is the
  Earth Observing System (EOS), which will provide complete,
  multiple-wavelength observations of our planet to yield improved
  understanding of the Earth as an integrated system. The EOS system is
  being revised to support access to distributed data collections as part
  of the NASA New DISS development effort, expected to take place over the
  next three to five years. The time is ripe for research into the best
  representations to use for both information management and knowledge
  management within the context of earth systems data.
* Climate modeling: A major challenge of climate modeling is the
  extraction of features such as cyclones and hurricanes from the
  simulations. These climatic features provide a way to correlate and
  compare the impacts of climate model simulations. The rules used to
  characterize cyclones and other climatic features need to be represented
  in a knowledge base that can be used to compare results from multiple
  climate models; a sketch of such a rule appears at the end of this
  section. Again, there can be a tight coupling between knowledge
  representations and scientific data feature extraction. A second area is
  the use of adjoint models to ingest observational data directly into
  simulations. Characterizations of the observational data need to be
  integrated with characterizations of the features extracted from the
  simulation data. Both sets of characterizations might be represented in
  separate knowledge bases; the challenge then is to support correlations
  across multiple knowledge bases.
* Geosciences: The NSF NEES project for earthquake engineering studies
  will support collections of observational data from experimental systems
  and will need to correlate results from experiments with direct
  measurements of real-world events. Again this implies the need to define
  knowledge about the causes and effects of ground movement on structures.
  The NEES project will need advanced representations for knowledge bases
  that identify relationships between observed structural deformation and
  the driving ground motions. This is an opportunity to explore the
  creation of knowledge bases that can express the relationships between
  cause and effect, and serve as templates for automated feature
  extraction.
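As noted in the climate modeling item above, the sketch below shows one way a climatic feature definition could be stored as a knowledge-base rule and applied to gridded model output, so that the same rule can be reused to compare multiple models or observational datasets. The grid values, the pressure threshold, and the detection criterion (a local pressure minimum) are deliberately simplified, hypothetical stand-ins for a real cyclone-detection algorithm.

    # A minimal sketch (toy grid, hypothetical threshold): a climatic feature
    # definition stored as a knowledge-base rule and applied to gridded data.

    # Toy sea-level pressure field in hPa (rows x columns).
    PRESSURE = [
        [1012.0, 1010.0, 1011.0],
        [1009.0,  988.0, 1008.0],
        [1012.0, 1011.0, 1013.0],
    ]

    # Knowledge-base entry: a named feature and its defining criterion.
    CYCLONE_RULE = {
        "feature": "cyclone_candidate",
        "criterion": "local pressure minimum below threshold",
        "threshold_hpa": 1000.0,
    }

    def find_feature(grid, rule):
        """Return the grid cells the rule labels as the feature: local minima
        whose value falls below the rule's threshold."""
        hits = []
        rows, cols = len(grid), len(grid[0])
        for i in range(rows):
            for j in range(cols):
                value = grid[i][j]
                if value >= rule["threshold_hpa"]:
                    continue
                neighbors = [grid[a][b]
                             for a in range(max(0, i - 1), min(rows, i + 2))
                             for b in range(max(0, j - 1), min(cols, j + 2))
                             if (a, b) != (i, j)]
                if all(value < n for n in neighbors):
                    hits.append((i, j, value))
        return hits

    print(find_feature(PRESSURE, CYCLONE_RULE))  # [(1, 1, 988.0)]

A NEES-style knowledge base could follow the same pattern, with the rule relating measured ground motion to observed structural deformation rather than pressure minima to cyclones.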