A Major Qualifying Project Report

XPATH PROCESSOR

A Major Qualifying Project Report:

submitted to the Faculty

of the

WORCESTER POLYTECHNIC INSTITUTE

in partial fulfillment of the requirements for the

Degree of Bachelor of Science

______

Tammy L. Worthington

Date: April 28, 2003

Approved:

______

Prof. Elke Rundensteiner, Advisor

Abstract

Table of Contents

Introduction

2Background

2.1XML

2.2XQUERY and XPATH

2.3XPATH Location Steps

2.4MASS

2.5VAMANA

2.6System Architecture

3Overview

3.1Processor Requirements

3.2Processor Design

4XPATH Processor

4.1Parser Implementation

4.2Parse Nodes

4.3Productions

4.4Parse Tree

4.3Normalization

4.4Transformations

4.5Callback Interface

4.6Derived Parser Implementation

8Evaluation

9Related Work

10Conclusion

References

Appendix i

Introduction

XML has become popular for representing and manipulated data on the Web. The need arises to query such data and languages such as XQUERY and XPATH do just that. VAMANA is a query engine that is built over the MASS index structure. Query engines like VAMANA require the input XPATH expression to first be parsed and possibly processed further. The XPATH Processor is parses the expression and produces a corresponding execution tree, which provides a logical execution plan for a query engine.

It was desirable to have an independent processor that displayed the output in order to be evaluated. The available processors were found to be closely coupled to the query engine and difficult to evaluate because their output was not printed. The XPATH Processor is an independent component that can be utilized by applications such as VAMANA. Its output it printed in a format that can be evaluated easily.

The XPATH Processor handles XPATH’s complex grammar and its abbreviated syntax. It allows for structural user transformations that better facilitate a particular user application. This makes the user interface and execution tree generation very simple and straightforward. The user can assume that the execution tree represents a valid XPATH expression because the Processor also handles errors.

2Background

XML documents can be queried using XQUERY and XPATH. A complete XQUERY system is being developed to execute user input queries over XML documents. The system has four components: a XQUERY Engine, the XPATH Processor, MASS, and XML index structure, and VAMANA, a query engine built over MASS.

2.1XML

The Extensible Markup Language (XML) [1] is used to structure data by utilizing user-defined tags as their Document Type Definition (DTD). XML tags differ from the tags in the Hyper Text Markup Language (HTML) in that they are not pre-defined and, therefore, can be much more descriptive of the data being represented. However, XML not only describes data but is also used for data storage and manipulation. XML documents have become a popular data format in Web applications. An XML document can be represented as a tree of nodes as shown in the Figure 1.

2.2XQUERY and XPATH

XML Query Language (XQUERY) [2] and XML Path Language (XPATH) [3] are languages that allow querying of data in XML documents. XQUERY is World Wide Web Consortium (W3C) proposed syntax that allows data to be extracted from XML documents. XPATH is a W3C standard syntax that is used for addressing parts of an XML document and is essentially a subset of XQUERY. An XQUERY expression is run on an entire XML document whereas an XPATH expression is run on a subset of the document or on intermediate results.

XQUERY and XPATH expressions that relate to the given XML document (Figure 1) are shown in Figures 2 and 3. The XPATH expression in Figure 3 selects the first employee node of the company, as highlighted in Figure 1. The XQUERY expression in Figure 2 is made up of numerous XPATH expressions including the one in Figure 3.

The process of executing an XQUERY is shown in Figure 4. First a user enters an XQUERY as input. All of the XQUERY’s XPATH expressions components are extracted. The expressions are then run over an XML document and intermediate node sets are returned. Finally all of the node sets are gathered and compiled into another XML document and returned to the user.

2.3XPATH Location Steps

XPATH utilizes XML’s node representation by using a pattern location path to locate and select nodes in an XML document. The matching nodes are those whose location path matches the given path. A location path consists of at least one location step. Each location step has three parts: an axis, a node test, and zero or more predicates. The syntax of an XPATH locations step is shown in Figure 5.

The axis identifies the tree relationship between the location step and nodes to be selected by the location step. XPATH has thirteen possible axes: child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self, and ancestor-or-self. The node test specifies the type and expanded name of the nodes to be selected. Each optional predicate further refines the set of selected nodes.

A location path is evaluated left-to-right and is in the form location step/location step/…. The first step selects a set of nodes relative to the context node. The context node could be any node in the document, including the root, if specified. The traversal of the tree begins at this node. Each of the selected nodes is used as a context node for the next step and so on. The returned set of nodes is the union of each step’s selected nodes.

2.4MASS

There have been numerous proposals made on methods of indexing XML documents in order to provide efficient storage of and access to the documents for XPATH queries. It has been found that relational database technology cannot efficiently store XML and other semi-structured data. Location steps can be selective, referencing a specific node test, or non-selective, selecting all of the context node’s children by using the “*” operator. It is therefore advantageous to index both the axes and the node tests for the most efficiency. The Multi-Axis Storage Structure (MASS) [4] is such an indexing method that has been developed at WPI. It has been shown that MASS is significantly faster than many of its counterparts.

MASS employs three significant techniques: FLEX keys, clustering and compression. The Fast Lexicographical Keys (FLEX) keys are used to determine node-to-node relationships in an XML document by lexicographical comparison. These keys are resourceful in that they allow insertion and deletion of nodes without recreating the keys. MASS initially loads the XML document into different clusters; each of the four clusterings (C1, C2, C3, and C4) is based on the axis and whether the node test is selective or not. There is a unique key sequence, consisting of the FLEX key components and the node type, which is used to uniquely identify nodes. Nodes can be easily located using the key sequence. Each clustering is organizes nodes uniquely. C1 is organized in document order and C3 is in document order by node type. In C2 siblings are grouped together in document order and C4 also groups nodes with their siblings but by node type.

The redundancy in an XML document’s DTD naturally creates redundancy in the location paths. MASS greatly minimizes this redundancy in the index structure through the use of compression. As each node is loaded into the index it’s path is compared to that of any adjacent node. If there is a common path between the two, the new node only has to store a reference to the adjacent node and the unique part of the path. Decompression is easily accomplished by tracing back through the adjacent nodes. The clusterings utilized by MASS lead to many adjacent nodes having common key components. Mass’s FLEX keys, clusterings, and compression technique results in an exceptionally efficient XML document index structure.

2.5VAMANA

VAMANA is an efficient XPATH query engine that exploits the MASS index that is being developed at WPI [5]. In addition to the MASS index, VAMANA proposes a Value Index that will allow for value-based queries. It will index each FLEX key with its corresponding text value. VAMANA consists of the following three components: Optimizer, Cost Estimator and Query Execution Engine. VAMANA’s architecture is shown in Figure 6.

The XPATH Processor provides VAMANA with a logical query plan in the form of an execution tree. The Cost Estimator utilizes the execution tree in order to determine the cost of various execution strategies, such as top-down and bottom-up, using the statistics provided by the MASS structure. The Optimizer selects the optimal physical query plan based on the cost. Lastly, the Query Execution Engine processes the expression based the chosen plan by making queries to both the MASS and Value indices.

2.6System Architecture

A complete XQUERY system is being developed. The complete system includes the XPATH Processor, VAMANA, MASS, and an XQUERY engine, which will be developed in the future. The components of the system interact with each other and efficiently execute a query (Figure 7). The XPATH Processor takes in XPATH expressions provided by an XQUERY engine, parses the expression and processes it into an execution tree. VAMANA utilizes the execution tree for optimization and execution. MASS interfaces are used by VAMANA to retrieve node sets from the XML document. VAMANA then executes the query and returns the resulting node set to the engine.

3Overview

The requirements that the XPATH Processor needed to meet were determined before the design began. The requirements involved the W3C’s XPATH specifications, implementation language, extensibility, independence, and error handling. The design meets all of the requirements by utilizing three components: the parser, transformer, and Callback interface.

3.1Processor Requirements

There were certain requirements that the XPATH Processor had to meet. The W3C specifications for XPATH define the XPATH grammar, which the processor must conform to. The processor was required to be implemented in C++. An XQUERY can contain many XPATH expressions that must be processed and C++ provides the needed performance. In addition, the XPATH Processor would be integrated with VAMANA and MASS, both of which are already implemented in C++. The uniform implementation would allow for the easiest integration.

The main requirement was that the Processor’s interface must be completely independent of the query execution. This was unique in that available processors like Pathan [6], XALAN [7], LibXPath [8], and Sablotron [9] were closely coupled to the query engine. The separation of the two would allow simultaneous development of both the Processor and VAMANA. Each would be able to be tested individually and simply integrated later.

Extensibility was also a concern. The processor should be designed to allow for changes made to the XPATH grammar by the W3C. It should also be easy for a user to make any desired transformations of the parse tree before the generation of the execution tree. Transformations of the parse tree can facilitate the flow of data and control used by the query engine. The nodes in the parse tree are restructured into a more appropriate parse tree.

The Processor must handle parse errors cleanly and descriptively. VAMANA assumes that the execution tree it receives from the Processor corresponds to a valid XPATH expression. Useful error messages should indicate where the parse error occurred and the parse tree should be destroyed automatically.

3.2Processor Design

The XPATH Processor has three components: the parser, transformer, and callback interface. The series of steps that the processor takes to process an input XPATH expression is shown in Figure 8. The parser first parses the expression by running it through the XPATH grammar productions. Key productions produce parse nodes that represent a portion of the expression. The nodes are stored in a tree structure and a parse tree representing the expression is formed. The parse tree then goes through the transformer, which transforms the parse tree into a tree that is better suited for query execution. The generator is the user implementation of the processor’s callback interface that traverses the transformed parse tree and produces a corresponding execution tree.

The parser generates a parse tree that is completely independent of query execution. The tree is a structure of in-memory C++ objects that represent the XPATH expression. The parse tree has a small footprint due to the compact parse tree representation. The data contained in each object is very small, so little memory is required.

A helpful message is printed to the user when a parse error occurs. If the XPATH expression in Figure 1 were entered incorrectly entered using a back slash between the two locations steps instead of a forward slash the error in Figure 9 would be displayed. The parser tracks down the token position where the parse error occurred in order to display the location of the error.

The transformer consists of a predicate transformation that is needed to facilitate the data flow in query engines such as VAMANA. Users can easily insert supplementary transformations that would be useful to their particular query engine.

The callback interface is also separate from the query execution. The user simply has to derive from the parser class and provide implementation of the callbacks. The implementation is a straightforward one-to-one transformation because the transformer makes any necessary adjustments to the structure of the tree. The query engine can then employ the generated execution tree in optimizing and executing the query.

4XPATH Processor

4.1Parser Implementation

Flex and Bison are utilized to create an XPATH parser. Flex generates the scanner based on the grammar rules in order to recognize lexical patterns in the input XPATH expression and translate it into a set of tokens. The parser defines all of the tokens that may be used in an expression. Bison can be used because the XPATH grammar is LALR(1) and context-free. It generates the parser for XPATH, which reads in the set of tokens produced by Flex and groups them together using the grammar rules.

4.2Parse Nodes

The parse nodes are a hierarchy of C++ classes. Each class has appropriate members and methods. The node hierarchy is shown in Figure 10. All of the nodes are derived from the parse node base class. In addition, four of the nodes are derived from predicate node class. Any XPATH expression can be represented using the defined nodes.

4.3Productions

The grammar productions are rules that expect certain input types or combinations of input types. There are corresponding actions with each rule. The step production is shown in Figure 11. The production allows five input combinations and different actions are taken for each. Key productions create parse nodes to represent a part of the XPATH expression. Those that do not produce nodes are reduced to other productions.

The productions involved in parsing the XPATH expression in Figure 1 are shown in Figure 13 and 14. Flex translates the expression into fifteen tokens as summarized in Figure 12. Bison then groups the tokens together based on the productions. The productions that create parse nodes are shaded and have a solid outline. The productions that are reduced have a broken outline. In this case, five of the productions create nodes representing a piece of the expression. The two Step productions and the FunctionCall, EqualityExpr, and TOK_LITERAL productions each create a corresponding parse node.

Each node stores data relevant to its particular type. For instance, a literal node stores its value and a step node stores its axis and nodetest. In addition, each node stores its role, which simply describes the nodes’ relationship to its parent. This is essential for execution tree generation later on.

/ / TOK_FSLASH
child / TOK_CHILD
:: / TOK_DBLCOLON
company / TOK_CHILD
/ / TOK_FSLASH
descendant-or-self / TOK_DESCENDANTORSELF
:: / TOK_DBLCOLON
employee / TOK_NCNAME
[ / TOK_LBRACKET
position / TOK_NCNAME
( / TOK_LPAREN
) / TOK_RPAREN
= / TOK_EQUALS
1 / TOK_LITERAL
] / TOK_RBRACKET

4.4Parse Tree

The parse tree is built bottom-up, so as the nodes are being created in the productions they are attached to their parent. Figure 13 illustrates the two step nodes that are created. Either of these nodes could be the root of the tree with the other as its child. The parser sets the last location step as the root of the tree because it references the set of nodes that will be returned by the XPATH expression. The Processor creates the tree this way by default, but if the user wishes to have the first location step as the root of the tree than the function that reverses the steps simply needs to be turned off. The parse tree that is created from the productions in Figure 13 and 14 is show in Figure 15.

4.3Normalization

The W3C’s XPATH specifications include abbreviated notation for some of the most common types of queries. In order to accommodate such notation the parse tree is normalized to preserve the meaning of the abbreviated expression. Consistent parse trees are then produced whether abbreviated notation is used or not. The possible XPATH abbreviations are shown in Figure 16. The XPATH Processor handles each of them appropriately. In each case node members may need to be set or nodes may need to be inserted into the parse tree.