As We Had Discussed in the Earlier Section, an Algebra Plan Is Complicated for Specification

Project: EAR- 0013

Continuous Query Language:

From CQL to CAPEAlgebra Plans

AMajor Qualifying Project Report

Submitted to the Faculty of

WORCESTER POLYTECHNIC INSTITUTE

In partial fulfillment of the requirements for the

Degree of Bachelor of Science

______

Lee Chu

______

Che Wai Kwan

Date: Thursday, 27 September 2018

Approved:

Professor Elke A Rundensteiner

Abstract

Constraint-exploiting Adaptive Stream Processing Engine (CAPE) requires a user friendly input for inputting queries to replace the tedious long algebra plan for the entry. We have developed a query generator to translate the continuous query language CQL into an algebraic query plan and pass this plan to CAPE. Using CAPE’s query plans in XML configuration-file format, we evaluated the compatibility of our optimized query plan with CAPE. We also provided a recommended development for the future.

Abstract

Table of Contents

Page of Authorship

Executive Summary

1 Introduction

1.1 Project Outline

1.2 Problem Tackled

1.3 Database Management System

1.4 Data Stream System

1.5 CAPE: Constraint-Exploiting Adaptive Processing Engine

1.5.1 Architecture of CAPE

1.6 CAPE’s Limitations

1.6.1 Algebra Plan

1.6.2 SQL

1.6.3 XML Query Plan

1.6.4 Algebra Plan vs SQL

1.7 Goal

2 Requirement on Language

2.0.1 SQL-alike

2.0.2 Support on Data Streaming

2.0.3 Support on Window Constraints

2.1 Selection of Query Language

2.1.1 Research on Existing Continuous Processing Language

2.1.2 UDA

2.1.3 TelegraphCQ

2.1.4 STREAM-CQL

2.2 Continuous Processing Language for CAPE

3 Procedure from CQL to CAPE Algebra Plans

3.1 STREAM Parser

3.1.1 Lex

3.1.2 Yacc

3.2 STREAM Plan Generator

3.3 CAPE Plan Rewriter

3.3.1 ThetaJoin Rule

3.3.2 WindowPushUp Rule

3.4 CAPE XML Plan Writer

4 Evaluation

4.1 Evaluation with individual operators

4.1.1 Regular Project

4.1.2 Function Project

4.1.3 Select

4.1.4 Join with Range Window

4.1.5 Group by Partition Window

4.1.6 Distinct

4.2 Evaluation with Complex Query Plans

5 Conclusion & Recommendations

5.1 Summary

5.2 What we’ve learned

5.3 Future Works

Reference

Appendix A Lex Specification

Appendix B Yacc Specification

Appendix C STREAM Rewrite Rules

Appendix D CAPE Rewrite Rules

t_thetaJoin

t_WindowPushUp

Appendix E xmlwriter.h

Appendix F XML templates

SELECT operator template

FUNCTION PROJECT operator template

REGULAR PROJECT operator template

GROUP BY RANGEWINDOW operator template

GROUP BY PARTITION WINDOW operator template

DINSTINCT operator template

JOIN WITH RANGE WINDOW operator template

Page of Authorship

Abstract...... Che Wai

Table of Contents

Page of Authorship

Executive Summary...... Che Wai

1 Introduction...... Che Wai

1.1 Project Outline...... Che Wai

1.2 Problem Tackled...... Che Wai

1.3 Database Management System...... Che Wai

1.4 Data Stream System...... Che Wai

1.5 CAPE: Constraint-Exploiting Adaptive Processing Engine...... Che Wai

1.5.1 Architecture of CAPE...... Che Wai

1.6 CAPE’s Limitations...... Che Wai

1.6.1 Algebra Plan...... Che Wai

1.6.2 SQL...... Che Wai

1.6.3 XML Query Plan...... Che Wai

1.6.4 Algebra Plan vs SQL...... Che Wai

1.7 Goal...... Che Wai

2 Requirement on Language...... Che Wai

2.0.1 SQL-alike...... Che Wai

2.0.2 Support on Data Streaming...... Che Wai

2.0.3 Support on Window Constraints...... Che Wai

2.1 Selection of Query Language...... Che Wai

2.1.1 Research on Existing Continuous Processing Language...... Che Wai

2.1.2 UDA...... Che Wai

2.1.3 TelegraphCQ...... Che Wai

2.1.4 STREAM-CQL...... Che Wai

2.2 Continuous Processing Language for CAPE...... Che Wai

3 Procedure from CQL to CAPE Algebra Plans...... Lee

3.1 STREAM Parser...... Lee

3.1.1 Lex...... Lee

3.1.2 Yacc...... Lee

3.2 STREAM Plan Generator...... Lee

3.3 CAPE Plan Rewriter...... Lee

3.3.1 ThetaJoin Rule...... Lee

3.3.2 WindowPushUp Rule...... Lee

3.4 CAPE XML Plan Writer...... Lee

4 Evaluation...... Lee

4.1 Evaluation with individual operators...... Lee

4.1.1 Regular Project...... Lee

4.1.2 Function Project...... Lee

4.1.3 Select...... Lee

4.1.4 Join with Range Window...... Lee

4.1.5 Group by Partition Window...... Lee

4.1.6 Distinct...... Lee

4.2 Evaluation with Complex Query Plans...... Lee

5 Conclusion & Recommendations...... Lee

5.1 Summary...... Lee

5.2 What we’ve learned...... Lee

5.3 Future Works...... Lee

Reference...... N/A

Appendix A Lex Specification...... Stream-CQL

Appendix B Yacc Specification...... Stream-CQL

Appendix C STREAM Rewrite Rules...... Stream-CQL

Appendix D CAPE Rewrite Rules...... Lee

t_thetaJoin...... Lee

t_WindowPushUp...... Lee

Appendix E xmlwriter.h...... Lee

Appendix F XML templates...... Lee

SELECT operator template...... Lee

FUNCTION PROJECT operator template...... Lee

REGULAR PROJECT operator template...... Lee

GROUP BY RANGEWINDOW operator template...... Lee

GROUP BY PARTITION WINDOW operator template...... Lee

DINSTINCT operator template...... Lee

JOIN WITH RANGE WINDOW operator template...... Lee

Executive Summary

This Major Qualifying Project (MQP) was to create a Query Plan Generator to give a continuous processing language input for the Constraint-exploiting Adaptive Processing Engine (CAPE), developed the DSRG research group in WPI. The Query Plan Generator for the continuous query language, CQL was developed by STREAM-CQL team in Stanford and modified by us to make the generator compatible with the CAPE engine.

First, the Continuous Query Language (CQL) is passed into the generator; then the original STREAM code generates a semi-optimized plan. After that, we developed rules that transform this into a CAPE compatible optimized plan. The final step was passing the optimized plan into our Extensible Makeup Language (XML) plan writer to produce an XML file which CAPE would accept.

In this project, we divided the tasks into four stages; first, we researched what is the difference between Database Management System (DBMS) and Data Strearm Management System (DSMS). Second, we studied the requirements of CAPE. Third, we reviewed existing continuous processing languages that we might adopt for CAPE. At last, we adopted and modified the STREAM-CQL query generator to fit into CAPE and perform several evaluations on our XML output with CAPE.

1 Introduction

In the modern world, there are a lot of applications that require processing a lot of data on-the-fly. However, storing historical data on persistent storage in the past, the traditional database systems only accept data from a storage manager, this means that those data are outdated and cannot reflect real-time situation. Because some applications required data processing on-the-fly, the traditional database system processing will not be useful for such applications. That’s why we need to have a data stream system that can process all the incoming data on-the-fly.

1.1 Project Outline

In this report, we will describe some backgrounds on why we need to implement a query plan generator, and what kind of limitations does CAPE have in Chapter 1. In Chapter 2, it will talk about all the existing continuous languages we researched, which one we picked and why we picked that one for CAPE. In Chapter 3, we will talk about the query plan generator that we adopted from STREAM on how it generates a query plan and what we added to the generator to make it compatible with CAPE. In Chapter 4, it will talk about what kind of test data we used and how do we evaluate the results. At last, it will be our conclusion and suggestions on what need to be expend in the future.

1.2Problem Tackled

In this project, we tackled different kinds of problems. First of all, this project was on implement a high-level language and GUI for the CAPE engine. However, one of our members dropped out due to her health problem. Second, when we were studying the STREAM engine, we took a long time to figure it out where the program output the result in order to let us modify the program correctly. At last, during our implementation of the new operators, we encountered many restrictions on defining an operator in STREAM, it had different kinds of format for different kinds of operator.

1.3 Database Management System

A database management system is a system used to collect interrelated data. It uses a set of programs to access the data,in both a convenient and efficient manner.Figure 1.1 is the overall picture of a database management system.

Figure 1.1 Overview of DatabaseManagement System

A database management system uses a file system to store data. Key features of database management systems include:

Persistent relations
One-time queries
Random access
Access plan determined by query processor and physical database design

Since a database management system uses a file system to store data, all the data needs to be stored in disk before it can be processed. Applications that use these database management systems frequently don’t require outputting the results in real time. Those systems are banking, airline reservation, and manufacturing. As you can see, the database system is very limited in supporting application. That’s why we have a data stream system to process information immediately.

1.4 Data Stream System

Data stream system queries over continuous, unbounded, rapid, time-varying streams of data elements. Figure 1.2 the overview of a data stream system.

Figure 1.2 Overview of Data Stream System

The figure describesthat the data stream system takesin continuous queries;however, it only has sequential access because data can only be seen when it arrives to the system.The key features of a data stream system are:

Transient streams (as opposed to persistent relations)
Continuous queries
Sequential access
Unpredictable data characteristics and arrival patterns

Some of the modern applications that need to use data streaming are traffic management and network monitoring. Since there is a high demand of applications that need a data stream system to process data, Worcester Polytechnic Institute (WPI) has an on-going data streaming system project called Constraint-exploiting Adaptive Processing Engine(CAPE).

1.5CAPE: Constraint-Exploiting Adaptive Processing Engine

As we mentioned in Chapter 1, CAPE is an on-going data stream project atWorcester Polytechnic Institute (WPI). It was designed to process continuous queries in highly dynamic stream environments. Features in running continuous queries are:

The input data may stream into the query engine at widely varying rates.
Meta knowledge such as punctuations may dynamically be embedded into data streams.
Queries are registered into or removed from the query engine; the computing resources available for processing an individual operator may vary greatly over time.
Different users may impose different quality of service requirements.

1.5.1 Architecture of CAPE

Figure 1.3CAPESystem Architecture(Rundensteiner et al. 2)

Figure 2.1 is the overall architecture of CAPE; the following are the process of CAPE:

The engine takes streaming data from the Internet, and passes it to the engine.
Once the engine receivesthe information, the execution engine will start executing the query plan.
The QoS Inspector will collect data within a certain time range from the execution engine.
The three components, operator configurator, operator scheduler, and plan re-optimizer will use the statistics from QoS inspector to determine what kind of job it needs to perform.
The result will be continuously passed to the end-user.

1.6CAPE’s Limitations

However, at this stage CAPE doesn’t have any high level language for specifying queries. A high level language is a language like Structured Query Language (SQL). SQL is one kind of inputto a database management system. Instead, CAPE accepts lower level execution plan or so called algebra plan as input.

1.6.1 Algebra Plan

Figure 1.4Example of an Algebra Plan

Algebra Plan is a tree structure like query plan, it describes how each operator connected together to generate a result. Each operator must define what kind of job it needs to do, what is its parent and children, and where the data come from. To write the tree in words, the XML query plan is one of the ways it looks like.

1.6.2 SQL

For SQL, it uses Select-From-Where to describe the entire query as a user need. For example:

Select S.A From R, S, Q, where R.A = S.A

From the example, we can see how easy to input a query plan in high level language. The language use “Select” to describe what user wants for the answer, with “From” clause describe where the data come from, then use “Where” clause to set the constraint of the query.

1.6.3 XML Query Plan

Figure 1.5 Project operator in XML query plan

Figure 2.4 is one of the operators from a XML query plan. Each operator needs to define its className, classVariables, properties, schema, parents, and children. Inside classVaraibles, there were many variable names that need to be defined. As you can see, writing one operator requires tedious work. Also if any information entered into the plan was wrong, the whole plan will not work. So in general, using such a lower-level algebra plan has a high chance to generate errors. Also it is very hard to enter for humans. It requires a long time in order to write up a single query plan.

1.6.4 Algebra Plan vs SQL

Compare with the complicated algebra plan using ten to twenty lines to write a single operator, SQL only uses one line to describe the entire query. Since a high level language is very convenient for the user to enter, our goal is to extend CAPE by offering a generator to translate high-level language to the XML query plan as a low-level language.

1.7 Goal

As I said earlier, CAPE has its limitation on accepting low-level language only as input at this moment. Also from the above we described how SQL can effectively help user to input queries with less errors and doesn’t require long input. So for this project, we are trying to implement a query plan generator by accepting CQL, a SQL-alike language used in streaming system, then the generator translate the CQL into algebra plan and output it as a XML file to run in CAPE.

The way we worked out our goal is research on existing high-level languages supported streaming data, then we chose the most appropriate one for CAPE. Then we analyze the original query engine, STREAM from Stanford to find out what we have to modify. At last, we implemented, thetajoin operator to merge “select” and “cross” to have better run time, and windowpushup to merge the window operator from the STREAM version query plan with “thetajoin”, “cross”, or “groupby” operators to make the plan compatible with CAPE

2 Requirement on Language

CAPE has three requirements on implementing a high level language. One of the requirements is that the language must be SQL-alike. SQL-alike can let users input queries with fewer errors. Another requirement is that it needs to support data streaming. If a high level language doesn’t support streaming data, CAPE won’t be able use it at all.Thirdly, it also needs to support windows on streams since there has many ways to output results.

2.0.1 SQL-alike

Since SQL is a very convenient way fora user to enter the query plan, CAPE is trying to have a language with a similar format, with some additional features. This will let the users have a familiar way to input, instead of learning a new language for CAPE.

2.0.2 Support on Data Streaming

Since CAPE is a system processing data streams from the Internet, it is important that our language supports data dreaming. Without data streaming support, CAPE would not be able to take the query to process and analyze.

2.0.3 Support on WindowConstraints

Also CAPE is supporting different kinds of window constraints to evaluate queries, so the system needs to have a language that could define different kind of window constraints in user query. This is one of the most important differences for language in Data Stream Management System and the Database Management System. For example:

select S.A from S[range 1 minute], R[range 1 minute] where S.A = R.A
select S.A from S[now], R[now] where S.A = R.A

2.1 Selection of Query Language

Our approach to define a continuous processing language for CAPEwas to study all kinds of existing continuous processing languagesto find the most appropriate for CAPE. Then we tried to learn what features it supported and used it as a reference to implement a similar continuous processing language for CAPE.

2.1.1 Research on Existing Continuous Processing Language

From what we have researched, we picked three continuous processing languages that were the closet in meeting all the requirements for CAPE. The three languages are UDA from University of California, Los Angeles (UCLA), TelegraphCQ from Berkeley, and STREAM-CQL from Stanford.

2.1.2 UDA

The UDA continuous processing language starts the query with AGGREGATE clause, then it uses INITIALIZE, ITERATE, and TERMINATE clauses to divide up a query into three blocks. The INITIALIZE clause inserts a value taken from the input stream and set the count to 1. The ITERATE clause update tuples in the state by a adding new value to the sum and 1 to the count. The TERMINATE clause returns the ratio of the sum and count as a result of computation by INSERT INTO RETURN clause. Also it uses “length (time, tag)” to define the timing of a query. Figure 2.3 describes the format of a UDA query.

Figure 2.1 Example of UDA Query(Law et al. 11)

2.1.3 TelegraphCQ

Figure 2.2 PSoup(Chandrasekaran et al. 6)

The picture above describes how TelegraphCQ executes query plans with streaming data. PSoup treats data and queries symmetrically; it allows new queries to apply old data and allows new data to apply to old queries. PSoup continuously computes results for all active queries. Then it outputs the results until they are requested specifically.

The window semantics in TelegraphCQ supports continuous queries with different combinations of tables and data streams. To support a variety of query types, TelegraphCQ supports window schemes on both the “arrive” and “will be arrive” windows. To support sliding window queries, both ends move forward at the same time with the arrival of a new tuple. Below is an example of a sliding window query: