Peter Chen, Louisiana State,

Semantic Data Modeling for SDSS Data and A Quick Look at Problems in Modeling of Bioinformatics Data: The Sloan Digital Sky Survey (SDSS) database was designed in a somewhat ad hoc way. The SDSS originally developed the entire database on Objectivity/DB, and that design was then translated into a relational schema under the constraint of preserving as much of the original design as possible. As a result, the relational schema may create problems in database and data-mining operations and in semantic interpretation and understanding. It is very desirable at this point to step back and redo the database design from scratch. In this project, we propose to develop a conceptual data model of the SDSS data, identifying entity and relationship types, and then to convert the conceptual model into a relational schema. We will also discuss the major differences (and the pros and cons) between the new relational schema and the old (existing) schema. Although our main thrust is the modeling of astronomy data, at the end of the project period we will investigate whether some of the problems and solutions we develop for astronomy data are also applicable to bioinformatics data.
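
As a minimal sketch of the intended workflow, consider a hypothetical entity-relationship fragment, a Field entity related one-to-many to a PhotoObj entity (illustrative names, not the actual SDSS schema), and its conversion into a relational schema: entity types become tables and the one-to-many relationship becomes a foreign key.

    import sqlite3

    # Hypothetical ER fragment: Field --(contains, 1:N)--> PhotoObj.
    # Entity types map to tables; the 1:N relationship maps to a foreign key.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Field (
            field_id INTEGER PRIMARY KEY,
            run      INTEGER,
            camcol   INTEGER
        );
        CREATE TABLE PhotoObj (
            obj_id   INTEGER PRIMARY KEY,
            ra       REAL,   -- right ascension
            dec      REAL,   -- declination
            field_id INTEGER REFERENCES Field(field_id)   -- the 'contains' relationship
        );
    """)
    conn.commit()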

Armando Fox, Stanford,
Selectively Relaxing ACID to Simplify Recovery Management: ACID transactional semantics simplify programming but complicate recovery. We believe we can relax ACID selectively to simplify recovery in specific ways because (a) the structure of today's Internet services requires several types of non-ACID state for which well-specified guarantees can be written down (session/workflow state, user profile state, click streams, etc.), and (b) most such applications are built on a rich middleware layer, where abstractions for these kinds of state can be exported directly to the application and plugged into state-storage "building blocks" embodying the new tradeoffs. We have prototyped two specific non-ACID stores for single-key-access state: one diskless, for session/workflow state, and one disk-based, for user-profile and shopping-cart-type state. Both give well-defined (but differing) storage guarantees to clients; use redundancy for persistence; have single-phase, predictable-latency operations; recover instantly (recovery = restart, with no special recovery code, so it is safe to crash-restart any node at any time for any reason); and show negligible effect on steady-state performance during recovery (this extends to multiple failures by adding redundancy). Failure detection and recovery are easy because fault-model enforcement coerces all observed faults (stuttering, performance slowdowns, etc.) into crash faults. The goal is a systematic characterization of which specific non-ACID design points are interesting, what the programmer-visible abstraction (analogous to transactions) turns out to be, and how to engineer a state store that provides those guarantees and is optimized for simple and fast recovery.
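
As a rough illustration of the diskless design point (a toy sketch under assumed semantics, not the actual prototype), a single-key session store can tolerate a node crash by writing each value to several in-memory replicas and reading from any surviving one; a crashed node simply restarts empty.

    class ReplicatedSessionStore:
        """Diskless single-key store: redundancy, not a log, provides persistence."""

        def __init__(self, num_replicas=3):
            # Each replica stands in for an in-memory table on a separate node.
            self.replicas = [dict() for _ in range(num_replicas)]

        def write(self, key, value):
            # Single-phase write to all replicas; no two-phase commit, no recovery log.
            for r in self.replicas:
                r[key] = value

        def read(self, key):
            # Any replica that still holds the key can answer.
            for r in self.replicas:
                if key in r:
                    return r[key]
            return None   # the guarantee is weaker than ACID durability

        def crash_restart(self, i):
            # Recovery == restart: the node comes back empty and serves new writes.
            self.replicas[i] = dict()

    store = ReplicatedSessionStore()
    store.write("session:42", {"cart": ["book"]})
    store.crash_restart(0)   # safe at any time, no special recovery code
    assert store.read("session:42") == {"cart": ["book"]}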

FY2002 Progress: Exploiting isolation in componentized apps to enable faster recovery through micro-reboots. We modified JBoss, an open-source J2EE application-server middleware, to allow "micro-reboots" of individual components (Enterprise JavaBeans) of an application without rebooting the whole application or application server. This can be done app-generically, since we modify the components' containers rather than the components themselves. As a result, failures localized to a subset of components can be resolved by hot-redeploying only that subset of the application's components, which is one to two orders of magnitude faster than restarting the whole application. While this is occurring, incoming requests that might use the recovering components are temporarily stalled until micro-recovery completes, so those requests usually see a slight performance delay rather than a failure. The result is improved performability (the total number of requests served per unit time in the face of partial failures) compared to whole-application restart, which is the current method for most deployments. Details in: JAGR: An Autonomous Self-Recovering Application Server, by George Candea, Emre Kiciman, Steve Zhang, Pedram Keyani, and Armando Fox. In Proc. 5th International Workshop on Active Middleware Services (AMS-2003), Seattle, WA, June 2003.
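
The idea can be sketched generically (illustrative only; the actual work modifies the EJB containers inside JBoss): track per-component health and stall calls into a component while only that component is redeployed, so the rest of the application keeps serving requests.

    import threading, time

    class MicroRebootableComponent:
        def __init__(self, name):
            self.name = name
            self.ready = threading.Event()
            self.ready.set()   # healthy and serving

        def handle(self, request):
            # Requests arriving during a micro-reboot stall briefly instead of failing.
            self.ready.wait(timeout=5.0)
            return f"{self.name} handled {request}"

        def micro_reboot(self):
            # Redeploy just this component; the rest of the app keeps running.
            self.ready.clear()
            time.sleep(0.1)    # stand-in for hot-redeploying one component
            self.ready.set()

    cart = MicroRebootableComponent("ShoppingCartBean")
    threading.Thread(target=cart.micro_reboot).start()
    print(cart.handle("addItem"))   # a slight delay rather than a failure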


Johannes Gehrke, Cornell,

Data Mining for Scientific Applications: Database and data mining technology can dramatically change the way scientific data is disseminated, accessed, and analyzed. Most data mining techniques are targeted at business applications, and the different nature of scientific datasets (for example, their spatio-temporal character) often renders existing work inadequate. We propose to build upon techniques developed with previous Microsoft funding to work on two concrete science applications. First, we will build a data repository and data mining infrastructure for the Arecibo Telescope. Researchers currently run discovery scripts manually over subsets of the astronomy data; we believe we can develop data mining techniques that nearly automate the discovery task (except for guidance through feedback from the scientists) while taking both the spatial and the temporal features of the data into account. Second, we will collaborate with physicists from the Stanford Linear Accelerator Center on the classification of nuclear particles. Novel high-quality data mining models are necessary, as the data is very high-dimensional and novel criteria such as the area under the ROC curve need to be optimized. We are excited that funding of this work has the potential to advance science while at the same time posing challenging data mining research problems.
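
For instance, when the quantity to optimize is the area under the ROC curve rather than plain accuracy, a classifier is judged by how well it ranks positive examples above negative ones; a minimal sketch with made-up scores and labels (not SLAC data) is:

    # AUC = probability that a randomly chosen positive example is scored
    # higher than a randomly chosen negative example (ties count half).
    def auc(labels, scores):
        pos = [s for y, s in zip(labels, scores) if y == 1]
        neg = [s for y, s in zip(labels, scores) if y == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    labels = [1, 0, 1, 1, 0, 0]                        # made-up particle classes
    scores = [0.90, 0.40, 0.65, 0.30, 0.55, 0.10]      # made-up classifier scores
    print(auc(labels, scores))                         # ~0.78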

FY2002 Grant: Monitoring the Data Tsunami: We are witnessing the emergence of a new class of monitoring applications, for example for scientific sensor measurements, surveillance and tracking, and network data management. Monitoring applications are heavily network-centric, often run on small-scale devices and sensors, and need to process high-speed data streams in real time with a potentially large number of continuous queries. At the same time, concerns about data privacy have grown substantially, and monitoring decentralized repositories will likely have to happen in a privacy-preserving manner. Our research bridged these two subjects. It resulted in algorithms for boosting the quality of aggregate queries over distributed data streams, algorithms for multi-query optimization of data stream queries, algorithms for privacy-preserving distributed data mining, and novel methods for quantifying data privacy. We have implemented a subset of our techniques in a working prototype sensor database system; an online demo can be found at http://cougar.cs.cornell.edu. During the last year we published eight papers (SIGMOD 2002, PODS 2002, CIDR 2003, and five papers at SIGKDD 2002), and we have four more papers accepted (CCGrid 2003, SIGMOD 2003, and two papers at PODS 2003). For more information see http://www.cs.cornell.edu/johannes.
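
One simple flavor of privacy-preserving distributed aggregation (a generic illustration, not the specific algorithms from these papers) has each site add zero-mean noise before reporting, so a coordinator recovers an approximate aggregate without ever seeing exact local values.

    import random

    def noisy_report(local_value, noise_scale=10.0):
        # Each site perturbs its reading; the coordinator never sees the true value.
        return local_value + random.gauss(0.0, noise_scale)

    true_values = [103, 97, 110, 95, 101, 99, 104, 98]   # private per-site readings
    reports = [noisy_report(v) for v in true_values]
    estimate = sum(reports)   # the noise cancels in expectation as sites are added
    print(sum(true_values), round(estimate, 1))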

H. V. Jagadish, Michigan,

Information Assembly for (Biological) Data with Complex Structure: Biological data sources on the web often record data about complex objects as fairly large information granules. The querying facilities available are typically limited to a (possibly complex) selection specification, and the answer returned is a set of relevant information granules, often with overlapping information content. While a human may be able to extract and assemble the relevant information from these granules, a web agent likely cannot do so. Furthermore, meaningful complex queries (involving aggregations, joins, etc.) are hard to specify without such information assembly. We propose to develop information assembly algorithms for XML, such that multiple independent information granules with overlapping content can be fused to return one comprehensive XML structure per logical entity. We will develop techniques for retaining provenance in the fused structure. We will also extend uncertainty management techniques we previously developed (VLDB 2002) to deal with the accumulation of evidence, and with inconsistencies, in the information fusion process.
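
A minimal sketch of the fusion step (using a hypothetical gene record keyed by an accession attribute; the real algorithms must also weigh conflicting evidence) merges overlapping granules into one XML element per logical entity and tags each field with the granule it came from as crude provenance.

    import xml.etree.ElementTree as ET

    g1 = ET.fromstring('<gene acc="G1"><name>ABC1</name><organism>human</organism></gene>')
    g2 = ET.fromstring('<gene acc="G1"><name>ABC1</name><location>chr7</location></gene>')

    def fuse(granules, key="acc"):
        # One output element per logical entity, with the union of child fields.
        entities = {}
        for i, g in enumerate(granules):
            merged = entities.setdefault(g.get(key), ET.Element(g.tag, {key: g.get(key)}))
            for child in g:
                if merged.find(child.tag) is None:   # naive: first granule wins
                    child.set("source", f"granule{i}")
                    merged.append(child)
        return list(entities.values())

    print(ET.tostring(fuse([g1, g2])[0]).decode())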


Yannis Ioannidis, University of Athens,

Timos Sellis, National Technical University of Athens,

Rule-Based Technology for Extract-Transform-Load Tools: The goal of this effort is data integration (especially extraction, transformation, and cleaning). Integrating web data requires tools and algorithms to identify relevant information, extract it, customize and integrate it into a common format, and clean the resulting data. Declarative integrity and business rules that capture the reconciliation semantics for incoming data and its transformation are highly desirable, both for semantic clarity and for the validity of the final data. Examples include rules for cleaning data coming from an individual source based on knowledge of that source, rules for identifying objects that come from multiple distinct sources, and rules for transforming data according to the desired target schema. Execution modules operating on top of rule engines also have great potential for optimizing subsequent data manipulations. Based on our data warehouse and Extraction-Transformation-Loading (ETL) expertise, we intend to study the issues that arise when using declarative rules for transforming and cleaning incoming data. In addition to the challenges related to the use of such rules, identifying the appropriate rules is itself a major problem, at least for some stages of the overall process, e.g., data cleaning. For this purpose, we intend to examine the use of data mining techniques that are applied to well-understood data (e.g., clean data), identify existing trends, and produce rules that then perform the necessary operations on arbitrary data (e.g., fix unclean data).
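
As an illustrative sketch (with hypothetical rule and record formats), a declarative cleaning rule can be expressed as a condition plus a repairing action and applied uniformly to incoming records, whether the rule was written by hand or mined from already-clean data.

    # Each rule: (name, condition on a record, action that repairs the record).
    rules = [
        ("missing_country", lambda r: r.get("country") in (None, ""),
                            lambda r: {**r, "country": "UNKNOWN"}),
        ("negative_price",  lambda r: r.get("price", 0) < 0,
                            lambda r: {**r, "price": abs(r["price"])}),
    ]

    def clean(record):
        # Apply every rule whose condition matches; list order encodes priority.
        for name, cond, action in rules:
            if cond(record):
                record = action(record)
        return record

    incoming = [{"sku": 1, "price": -5, "country": ""},
                {"sku": 2, "price": 7, "country": "GR"}]
    print([clean(r) for r in incoming])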

Jignesh M. Patel, Michigan.

Declarative and Efficient Querying on Protein Structures: Scientists working on functional proteomics frequently need to query large protein data sets. These data sets often describe proteins using sequence and various geometrical structures. Since the functionality of a protein is determined by both its sequence and its geometric folding properties, integrated querying across all protein structures is essential. Unfortunately, there are no declarative tools that permit such integrated querying, and scientists are forced to employ ad hoc procedural methods, which are cumbersome to reuse and limited in their expressive power. The goal of the Periscope project is to design, implement, and evaluate a database system for declarative querying on all protein structures. As part of this project we are investigating efficient algorithms for querying individual protein structures, an algebra for expressing integrated queries, and query optimization techniques for efficient evaluation of these integrated queries.
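
A toy sketch of such an integrated query (hypothetical data layout, not the Periscope algebra) combines a sequence-motif predicate with a geometric predicate over residue coordinates; a declarative system would choose the evaluation order instead of the scientist.

    import math, re

    # Hypothetical protein records: sequence plus 3-D coordinates of C-alpha atoms.
    proteins = {
        "P1": {"seq": "MKTAYIAKQR", "ca": [(i * 3.8, 0.0, 0.0) for i in range(10)]},
        "P2": {"seq": "GGSGGAYIAK", "ca": [(0.0, i * 3.8, 0.0) for i in range(10)]},
    }

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def query(motif, i, j, max_dist):
        # Integrated predicate: the sequence contains the motif AND residues i and j
        # lie within max_dist angstroms of each other.
        return [pid for pid, p in proteins.items()
                if re.search(motif, p["seq"]) and dist(p["ca"][i], p["ca"][j]) <= max_dist]

    print(query("AYIAK", 0, 2, 10.0))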

Hanan Samet, Maryland,

Efficient algorithms for spatial browsing applications: We will research algorithms typical of interactive spatial searches. For example, a buffer-zone query finds the cities nearest to I80. I80 is a complex spatial object; unlike a single line segment or a point, it is not simple to find the objects closest to it with a best-first incremental nearest-neighbor algorithm, because doing so requires finding the nearest neighbors of a set. This is somewhat similar to the semi-join problem of finding the closest pair of elements from two sets. Such algorithms are needed to compute a buffer zone incrementally (i.e., retrieving the elements of the buffer zone in an incremental manner). Of course, the problem generalizes to window and overlap queries, etc.
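
A minimal sketch of the incremental idea (a flat list of points rather than a spatial index, so it illustrates only the ordering, not the pruning): cities are emitted one at a time in order of their distance to a polyline standing in for I80.

    import heapq, math

    def point_to_segment(p, a, b):
        # Distance from point p to the line segment a-b.
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        if dx == dy == 0:
            return math.hypot(px - ax, py - ay)
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
        return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

    def incremental_nearest(cities, polyline):
        # Yield cities in increasing distance to the polyline (the query "set").
        heap = [(min(point_to_segment(p, a, b) for a, b in zip(polyline, polyline[1:])), name)
                for name, p in cities.items()]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)

    i80 = [(0, 0), (10, 2), (20, 1)]
    cities = {"A": (5, 4), "B": (18, 9), "C": (2, 1)}
    for d, name in incremental_nearest(cities, i80):
        print(name, round(d, 2))   # C, then A, then B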

Thomas Schwarz, Santa Clara University

HADRAM: Highly Available Distributed RAM: HADRAM serves storage blocks from servers in a LAN. HADRAM survives failures by grouping blocks stored on different servers together with “parity blocks”, generated by an erasure-correcting code, that reside on yet other machines. As the experience with LH*RS at Dauphine and SCU shows, this gives very attractive data manipulation and retrieval performance (100 µs random record accesses, many tens of thousands of transactions per second). Research issues to be addressed (a small sketch of the parity idea follows the list):

1.  Performance evaluation of different erasure correcting codes.

2.  Quasi-transactional client updates.

3.  Detection of failed servers through heartbeat monitoring in a self-adjusting network. (Each server should only monitor a small number of other servers, but failure information needs to spread quickly through a routing-aware protocol.)

4.  Fast location of HADRAM blocks (despite migration due to failures).
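
A minimal sketch of the redundancy idea, using simple XOR parity (the single-erasure special case of an erasure-correcting code) over a hypothetical block layout: any one lost data block in a group can be rebuilt from the surviving blocks plus the parity block.

    def xor_blocks(blocks):
        # Parity block = byte-wise XOR of all blocks in the group (equal-length blocks).
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    group = [b"serverA!", b"serverB!", b"serverC!"]   # data blocks on three servers
    parity = xor_blocks(group)                        # stored on a fourth server

    # Server B fails: rebuild its block from the survivors plus the parity block.
    rebuilt = xor_blocks([group[0], group[2], parity])
    assert rebuilt == group[1]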

Richard Snodgrass, Arizona,

Supporting Valid Time in SQLServer: Disk space costs continue to fall and demand for information integration continues to increase. Hence, maintaining and utilizing historical information is becoming critical. There is a need to move temporal processing into the database engine, thereby greatly decreasing application complexity. Language constructs for valid-time support in SQLServer have been designed, a detailed architecture and an implementation strategy have been developed, and a prototype within the query processing components of SQLServer has been started. I propose completing the initial prototype and investigating the challenging research and implementation issues of integrating valid-time support with CLR.
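
To illustrate what valid-time support buys the application programmer (a generic sketch, not the proposed SQLServer language constructs), consider storing each fact with a [valid_from, valid_to) period and asking which facts held at a given instant; without engine support, every such query must repeat the period predicate by hand.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Each row records when the fact was true in reality: [valid_from, valid_to).
    conn.execute("CREATE TABLE position (emp TEXT, title TEXT, valid_from TEXT, valid_to TEXT)")
    conn.executemany("INSERT INTO position VALUES (?, ?, ?, ?)", [
        ("Alice", "Engineer", "1998-01-01", "2001-06-30"),
        ("Alice", "Manager",  "2001-06-30", "9999-12-31"),
    ])

    # Hand-written 'as of' query; built-in valid-time support would supply this.
    rows = conn.execute(
        "SELECT emp, title FROM position WHERE valid_from <= ? AND ? < valid_to",
        ("2000-03-15", "2000-03-15")).fetchall()
    print(rows)   # [('Alice', 'Engineer')]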

Alex Szalay, Johns Hopkins,

Server-side Personal DBs for Power Database Users on the Internet: The 1 TB SkyServer V2 database will soon be released. Its users span a wide range of needs and sophistication. There are many small queries (less than a minute and fewer than 1,000 rows of output), but there are also power users who download millions of rows of data that they then analyze on their own workstations (our peak download rate is over 12 million rows per hour). It would be better both for them and for the database if these temporary results could be stored on the server in a user-space database. Users could then perform arbitrary joins between their personal workspace and the large SkyServer database. We will continue our work on delivery methods and formats that range from small results delivered in seconds, to gigabytes delivered as DIME attachments, to 100 GB temporary datasets that are further analyzed at the server. We are also building a ten-way parallel realization of the terabyte archive, in which we split the sky into many zones, with a buffer region to account for the partitioning, and a web service that executes user queries in parallel. We are also considering building a batch job scheduler for long-running queries and a “Ferris wheel” to piggyback multiple scans into one scan.
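
A toy sketch of the zone idea (hypothetical zone height and buffer width; the real partitioning must also respect spherical geometry): the sky is split into declination stripes, each object is routed to its zone, and objects near a cut are replicated into the neighboring zone's buffer so zone-local queries remain correct near the boundaries.

    ZONE_HEIGHT = 10.0   # degrees of declination per zone (toy value)
    BUFFER = 0.5         # overlap so near-boundary searches stay zone-local

    def assign_zones(objects):
        zones = {}
        for obj_id, dec in objects:
            z = int((dec + 90.0) // ZONE_HEIGHT)   # primary zone
            zones.setdefault(z, []).append((obj_id, dec))
            # Replicate objects close to a cut into the adjacent zone's buffer.
            offset = (dec + 90.0) % ZONE_HEIGHT
            if offset < BUFFER:
                zones.setdefault(z - 1, []).append((obj_id, dec))
            elif ZONE_HEIGHT - offset < BUFFER:
                zones.setdefault(z + 1, []).append((obj_id, dec))
        return zones

    objects = [("obj1", -0.2), ("obj2", 9.9), ("obj3", 45.0)]
    for z, objs in sorted(assign_zones(objects).items()):
        print(z, objs)   # each zone can be scanned by a separate server in parallel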

Progress on FY2003 Grant: We built several web services and applications to provide access to the 1 TB SkyServer database. The ImgCutout service returns a JPEG mosaic image of a region of the sky, built from the survey images; the images are warped for optimal alignment, and the web service can also overlay various objects as small symbols. We also built a web application with both an interactive, MapQuest-like navigation utility and an SQL query interface, in which each object is displayed as a thumbnail image. FITS and VOTable class libraries: C# classes provide transparent export and import of FITS files from web services, which can then be sent as DIME attachments; we also built a C# VOTable class library with an interface very similar to that of the FITS class library. Query-driven server-side plotting: we built several prototypes of query-driven graphics tools, in which the data specified by an SQL query is immediately formatted on the server, which returns either a rendered image or an SVG plot. This makes it very easy to visualize query results without having to download, format, and plot them on the client side. We continued stress-testing SQL Server: we found the 2-billion-record bug, and on a 4-way McKinley platform we reached query speeds close to 1 GB per second.
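
A minimal sketch of the query-driven plotting idea (toy data and a hand-rolled SVG scatter plot; the actual tools run against SkyServer and can also return rendered images): the server formats query results directly into an SVG document that the client can display without downloading the raw rows.

    def svg_scatter(rows, width=300, height=200):
        # rows: (x, y) pairs coming straight from an SQL query result set.
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        def sx(x):
            return (x - min(xs)) / ((max(xs) - min(xs)) or 1) * (width - 20) + 10
        def sy(y):
            return height - ((y - min(ys)) / ((max(ys) - min(ys)) or 1) * (height - 20) + 10)
        points = "".join(f'<circle cx="{sx(x):.1f}" cy="{sy(y):.1f}" r="2"/>' for x, y in rows)
        return (f'<svg xmlns="http://www.w3.org/2000/svg" '
                f'width="{width}" height="{height}">{points}</svg>')

    # Pretend these rows came from: SELECT ra, dec FROM PhotoObj WHERE ...
    print(svg_scatter([(180.1, 2.3), (180.4, 2.1), (181.0, 2.9)]))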