INode: An Effective Numbering Scheme for Storing

XML Data in Relational Databases

Lau Ho Kit and Vincent Ng

Department of Computing

Hong Kong Polytechnic University

Abstract

XML has become the standard for representing and exchanging information on Internet. This poses a challenge to efficiently store and query XML data. Several model-mapping approaches have been proposed to store XML data in relational database systems. The key features of model-mapping approach are that fixed database schemas are used for all XML documents and DTD information is not needed. In this paper, we present a new model-mapping approach, called INode. This approach is based on a numbering scheme for elements. It enables quick retrieval of parent-child and ancestor-descendant relationships between elements in the XML data graph. Experiments with two data sets and two query sets show that INode can reduce the storage requirement while having a better query performance.

1. Introduction

Extensible Markup Language (XML) has become the standard for exchanging information in Internet. XML documents comprise hierarchically nested collections of elements and tags describing the semantics of the data. It provides a flexible way to exchange data between different platforms. Several XML query languages have been proposed and the common feature of the languages is the use of regular path expressions to query XML data such as XQuery [1], XPath [2], XML_QL [3], Lorel [4] and Quilt [5].

As most of the existing systems use relational database systems, a large fraction of the XML documents will be stored in relational DBMS in order to minimize the management costs to use XML as data exchange. Therefore, the modeling issue between XML data and relational database has received special attention.

In [6], the authors classified the mapping approach of XML documents to relational DBMS into 2 classes. They are structure-mapping approach and model-mapping approach. The database schemas of the former approach are based on the logical structure of the XML documents. Mostly, they are based on the DTD (Document Type Definition) such as X-ray [7]. Since the schemas are based on the logical structure, it is not suitable for dynamic structure data. The database schemas of the latter approach are fixed for all XML documents. Examples include Edge [8], XRel [6] and XParent [9]. It is capable to support XML documents whose DTDs are not known in the design phase or without DTDs. Therefore, it is more flexible and convenient to manage the XML data in relational DBMS.

In this paper, we propose a new model-mapping approach. The approach uses unique element identifiers (UIDs) as a numbering scheme for node IDs. With this scheme, we can obtain the UIDs of the ancestors and descendents directly from the UID of an element. Each element in the XML document will be assigned an UID as an identifier. Therefore, we can find the relationship between different elements based on their UIDs.

The paper is structured as follows. Section 2 briefly discusses a sample XML document. Section 3 reviews three existing model-mapping approaches. Section 4 introduces our new model-mapping approach, INode. In Section 5, we will discuss the query processing issue. In Section 6, experimental results are shown. Finally, we conclude the paper in Section 7.

2. Overview of XML document

Extensible Markup Language (XML) is a simplified subset of SGML that is created by the World Wide Web Consortium (W3C) [10]. XML documents comprise hierarchically nested collections of elements, where each element can be either atomic or composite. Further, tags stored with elements in an XML document describe the semantics of the data rather than simply specifying how the elements are to be displayed (as in HTML). Figure 1 shows a simplified XML document of SigmodRecord [15] and the corresponding data graph in Figure 2.

<SigmodRecord>
<issue>
<volume>11</volume>
<number>1</number>
<articles>
<article>
<title>Annotated Bibliography on Data Design.</title>
<initPage>45</initPage>
<endPage>77</endPage>
<authors>
<author position="00">Anthony I. Wasserman</author>
<author position="01">Karen Botnich</author>
</authors>
</article>
</articles>
</issue>
</SigmodRecord>
Figure 1. A Simplified XML Document of SigmodRecord.

3. Three Model-Mapping Approaches

Figure 3. A Sample XML Data Graph.

Edge [8], XRel [6] and XParent [9] are three model-mapping approaches that store different structures of XML documents in relational DBMS. Consider an XML data graph as shown in Figure 3, where the nodes in the graph can either be element nodes or attribute nodes, and text nodes are not included.

The data graph contains nodes with IDs n0, n1, n2,…, nn where n0 is the root and l1, l2, l3,…,ln as the labels between a pair of nodes. The Edge approach records the pair of node IDs with the corresponding label in a single table. For example, the label between the nodes n0 and n1 is l1 and they are stored in the attributes Source, Target and Label, respectively. Unlike Edge, XRel stores all the simple path expressions that are represented as a sequence of the labels, such as (l1, l4), in a table and keeps the region of each node to preserve the precedence and the relation between ancestor and descendant among nodes. The region is specified by a pair of numbers. They are the start and end positions of a node in an XML document. The XParent approach also uses a table to store all the path expressions of an XML document. Instead of using the region to maintain the ancestor and descendant relationship, XParent uses a separate table to keep the parent and child relationships among nodes and retrieve the ancestor and descendant relationships by joining that table.

3.1. Edge

The Edge approach [8] stores the XML data graph of Figure 1 in a single table called Edge.

Src

/ Ord / Tgt / Label / Flag / Value
0 / 1 / 1 / SigmodRecord / ref
1 / 1 / 2 / Issue / ref
2 / 1 / 3 / Volume / val / 11
2 / 2 / 4 / number / val / 1
2 / 3 / 5 / Articles / ref
5 / 1 / 6 / Article / ref
6 / 1 / 7 / Title / val / Annotated Bibliography on Data Design.
6 / 2 / 8 / initPage / val / 45
6 / 3 / 9 / endPage / val / 77
6 / 4 / 10 / Authors / ref
10 / 1 / 11 / Author / val / Anthony I Wasserman
11 / 1 / 12 / @position / val / 00
10 / 2 / 13 / author / val / Karen Botnich
13 / 1 / 14 / @position / val / 01
Table 1. Edge Table for the XML Data Graph in Figure 2.

Edge(Source, Ordinal, Target, Label, Flag, Value)

Each node in the data graph is assigned to a number. Each tuple in the table corresponds to an edge in the data graph. For each edge in the data graph, it stores the source ID, the target ID and the label of the edge. The ordinal attribute keeps the ordinal of the edge among its siblings. The flag attribute indicates whether the attribute refers to an inter-object reference (ref) or a value (val). Further, it uses the inlining approach to store the text in the value attribute when the node has a text child.

3.2. XRel

The XRel approach [6] uses a schema of four tables to store the XML data graph. The tables are Path, Element, Text and Attribute.

PathID

/ PathExp
1 / #/SigmodRecord
2 / #/SigmodRecord#/issue
3 / #/SigmodRecord#/issue#/volume
4 / #/SigmodRecord#/issue#/number
5 / #/SigmodRecord#/issue#/articles
6 / #/SigmodRecord#/issue#/articles#/article
7 / #/SigmodRecord#/issue#/articles#/article#/title
8 / #/SigmodRecord#/issue#/articles#/article#/initPage
9 / #/SigmodRecord#/issue#/articles#/article#/endPage
10 / #/SigmodRecord#/issue#/articles#/article#/authors
11 / #/SigmodRecord#/issue#/articles#/article#/authors#/author
12 / #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position
(i) Path Table

DocID

/ PathID / Start / End / Value
0 / 12 / 184 / 184 / 00
0 / 12 / 235 / 235 / 01
(ii) Attribute Table
DocID / PathID / Start / End / Ordinal
0 / 1 / 0 / 332 / 1
0 / 2 / 14 / 317 / 1
0 / 3 / 21 / 40 / 1
0 / 4 / 40 / 58 / 2
0 / 5 / 58 / 309 / 3
0 / 6 / 68 / 298 / 1
0 / 7 / 77 / 130 / 1
0 / 8 / 130 / 153 / 2
0 / 9 / 153 / 174 / 3
0 / 10 / 174 / 288 / 4
0 / 11 / 183 / 234 / 1
0 / 11 / 234 / 278 / 2
(iii) Element Table

DocID

/ PathID / Start / End / Value
0 / 3 / 29 / 31 / 11
0 / 4 / 48 / 49 / 1
0 / 7 / 84 / 122 / Annotated Bibliography on Data Design.
0 / 8 / 140 / 142 / 45
0 / 9 / 162 / 164 / 77
0 / 11 / 191 / 211 / Anthony I. Wasserman
0 / 11 / 242 / 255 / Karen Botnich
(iv) Text Table

Table 2. The XRel Schema for the XML Data Graph in Figure 2.

Path(PathID, PathExp)

Element(DocID, PathID, Start, End, Ordinal)

Text(DocID, PathID, Start, End, Value)

Attribute(DocID, PathID, Start, End, Value)

The database attributes DocID, PathID, Start, End and Value represent document identifier, simple path expression identifier, start position of a region, end position of a region and string value, respectively [6]. The region is used to uniquely identify the occurrence of an element node or a text node. The region of a node is identified by the start and end position of this node in the XML document. As given in [12], the region can be computed based on the Absolute Region Coordinate (ARC) and Relative Region Coordinate (RRC). ARC expresses the absolute location of a node in relation to the root node in the XML document. RRC expresses the location of a node in relation to its parent node location rather than that of the root node. The advantage of using RRC is that it can minimize the cost of document updates on positions. The key feature of XRel is that each node in the XML document is represented by the combination of simple path expression and region.

3.3. XParent

The XParent approach [9] uses a four-table schema to store XML data. They are LabelPath, DataPath, Element and Data.

PathID / Len / Path
1 / 1 / ./SigmodRecord
2 / 2 / ./SigmodRecord./issue
3 / 3 / ./SigmodRecord./issue./volume
4 / 3 / ./SigmodRecord./issue./number
5 / 3 / ./SigmodRecord./issue./articles
6 / 4 / ./SigmodRecord./issue./articles./article
7 / 5 / ./SigmodRecord./issue./articles./article./title
8 / 5 / ./SigmodRecord./issue./articles./article./InitPage
9 / 5 / ./SigmodRecord./issue./articles./article./endPage
10 / 5 / ./SigmodRecord./issue./articles./article./authors
11 / 6 / ./SigmodRecord./issue./articles./article./authors./author
12 / 7 / ./SigmodRecord./issue./articles./article./authors./author./@position
(i) LabelPath Table
Pid / Cid / PathID / Ordinal / Did
1 / 2 / 1 / 1 / 1
2 / 3 / 2 / 1 / 2
2 / 4 / 3 / 1 / 3
2 / 5 / 4 / 1 / 4
5 / 6 / 5 / 1 / 5
5 / 15 / 6 / 1 / 6
6 / 7 / 7 / 1 / 7
6 / 8 / 8 / 1 / 8
6 / 9 / 9 / 1 / 9
6 / 10 / 10 / 1 / 10
10 / 11 / 11 / 1 / 11
10 / 12 / 12 / 1 / 12
10 / 13 / 11 / 2 / 13
10 / 14 / 12 / 1 / 14
(ii) DataPath Table / (iii) Element Table

Table 3. The XParent Schema for the XML Data Graph in Figure 2.

LabelPath(PathID, Len, Path)

DataPath(Pid, Cid)

Element(PathID, Ordinal, Did)

Data(PathID, Did, Ordinal, Value)

The database attributes PathID, Len, Pid, Cid, Did and Value represent label-path identifier, number of edges of the label path, parent-node id, child-node id, data-path identifier and string value, respectively. Since the DataPath table stores the attributes Pid and Cid, it maintains the parent-child relationships. It needs table joins in order to check the ancestor-descendent relationship. To speed up this processing, the authors proposed to use another table to keep this ancestor-descendent relationship named Ancestor. However, it needs more disk space to store this table and increases the overhead for update operations. The key feature of XParent is that it uses LabelPath and DataPath tables to maintain the structural information of the XML documents. The DataPath table is also used to keep the parent-child relationship instead of using region as in XRel.

4. INode

In this section, we introduce a new model-mapping approach called INode. For Edge approach, it is easy to maintain because it uses a singe table schema. Since it only has edges individually, it needs a large number of equijoins to check the edge-connections. When a query needs to retrieve the ancestor-descendent relationship, the query performance becomes bad.

XRel uses a table to store all simple path expressions. This improves the query performance and performs regular path expressions easily. It also uses the concept of region to maintain the ancestor-descendent relationship. For a node i, it is reachable from another node j if the region of i is included in the region of j. As a result, it can identify the containment relationship by using θ–joins. However, θ–join is more costly than an equijoin.

Unlike XRel, XParent uses a table to maintain the parent-child relationship instead of using the concept of region. Therefore, it can use equijoins to test the relationship and the performance can be increased. However, for some complex queries that require checking the ancestor-descendent relationship, XParent requires a large number of equijoins. Although it can use another table, Ancestor, to keep the ancestor-descendent relationship, it requires more storage space.

Therefore, we introduce a new model-mapping approach that has a good query performance for complex queries while reducing the storage usage.

Lee et al. [11] proposed an inverted index and signature file schemes in order to reduce the storage to store all index entries. They interpreted a document structure as a k-ary tree where k is the maximum number of child nodes of a node in the structure. Since the document tree made to be be complete, there are some virtual nodes in the document tree that do not exist. Secondly, they assigned UIDs to each node according to the order of the level-order tree traversal. Figure 4 shows a document tree and the assignment of UIDs to nodes. For a node whose UID is i, the parent’s UID can be calculated directly by the following function.

Figure 4. 3-ary document tree with UIDs.

Our approach uses the concept of UID for assigning the ID to each node in the XML data graph. It represents an XML document as an nc-ary complete tree where nc is the maximum number of child nodes of a node in the structure. Given a node n, the path of n will be denoted as (a1, a2, a3,…, ak),where k is the depth of nand 1ainc. Figure 5 shows an example with nc equals to 3.

Figure 5. 3-ary Document Tree with Path Sequences.

Based on the Parent(i) function, we can calculate the node id with the path of a node as shown in equation 1.

/ (1)

In our numbering scheme, we can embed the document id in the node id in order to save the storage space. The function to calculate the node id is shown in Equation 2. It has two parts. The first part is the path function and the second part is about the document id. Given a document-id, docid, and m, where m is the number of decimal places for the document id, we have

/ (2)

With the node-ids, we can retrieve the ancestor-descendent relationship by calculation. Given the node ids of i and j, where i is the ancestor of j, and the depth difference between these two nodes is d, we can use Equation 3 to confirm that relationship.

/ (3)

Since the path sequence of each node in an XML document tree is increased from the top to bottom and from the left to right, the result of the path function will always increase as the path sequence increased. Therefore, the path function is a monotonic increasing function. As the document-id is embedded in the node-id, the function must be reversible in order to retrieve the document-id. The requirement to let the function reversible is the document-id less than one when it is embedded in the node-id. Therefore, we use decimal places to store the document-id. Here are the reversible functions to retrieve the value of path function and document-id.

INode uses a three-table schema. They are Path, Element and Attribute.

PathID

/
PathExp
1 / #/SigmodRecord
2 / #/SigmodRecord#/issue
3 / #/SigmodRecord#/issue#/volume
4 / #/SigmodRecord#/issue#/number
5 / #/SigmodRecord#/issue#/articles
6 / #/SigmodRecord#/issue#/articles#/article
7 / #/SigmodRecord#/issue#/articles#/article#/title
8 / #/SigmodRecord#/issue#/articles#/article#/initPage
9 / #/SigmodRecord#/issue#/articles#/article#/endPage
10 / #/SigmodRecord#/issue#/articles#/article#/authors
11 / #/SigmodRecord#/issue#/articles#/article#/authors#/author
12 / #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position
(i) Path Table

NodeID

/ PathID / Ordinal / Value
7.00 / 3 / 1 / 11
8.00 / 4 / 2 / 1
207.00 / 7 / 1 / Annotated Bibliography on Data Design.
208.00 / 8 / 2 / 45
209.00 / 9 / 3 / 77
1047.00 / 11 / 1 / Anthony I. Wasserman
1048.00 / 11 / 2 / Karen Botnich
210.00 / 10 / 4
42.00 / 6 / 1
9.00 / 5 / 3
2.00 / 2 / 1
1.00 / 1 / 1
(ii) Element Table

NodeID

/ PathID / Value
5232.00 / 12 / 00
5237.00 / 12 / 01
(iii) Attribute Table

Table 4. The INode Schema for the XML Data Graph in Figure 2.

INode 1

Path(PathID, PathExp)

Element(NodeID, PathID, Ordinal, Value)

Attribute(NodeID, PathID, Value)

The Path table stores the path information of the XML document collection. Each path expression is stored in the attribute PathExp and a unique ID is assigned. In Element and Attribute tables, NodeID is the node identifier (node id) that is calculated based on Equation 1. The attribute PathID serves as the foreign key of the ID in the Path table and uses to identify the path expression of a node. The database attribute Ordinal in the Element table represents the occurrence order of a node among the sibling nodes in document order.

Table 4 shows the tables that store the XML data graph with the nc = 5. Like XRel [6], we use ‘#/’ to separate the tag names in order to perform regular path expression with SQL correctly.

The key features of INode schema can be summarized as follows:

  1. INode is a node-oriented approach. The schema does not maintain the edge information explicitly. Therefore, it does not need to concatenate the edges to form a simple path for query processing.
  2. INode uses a table to stores all simple path expressions. This reduces the database size and it is more efficient to perform queries with regular path expressions.
  3. Unlike XRel, INode uses the numbering scheme to replace the region concept. The parent-child relationship is embedded in the NodeID. In addition, the document identifier is also embedded in this attribute instead of using an extra column to store. This further reduces the database size.
  4. INode uses the attribute NodeID to retrieve the parent-child relationship by calculation. It can reduce the number of equijoins or θ–joins.

To assess the effectiveness of the embedding the document-id in the node-id (denoted as INode1), we have implemented another schema that uses a new attribute to store the document-id. This schema is denoted as INode2. Besides the Path table, the new schema has changes in the Element table and Attribute table as shown below.

INode 2

Path(PathID, PathExp)

Element(NodeID, DocID, PathID, Ordinal, Value)

Attribute(NodeID, DocID, PathID, Value)

5. Query Processing

In [9], the authors had a discussion on the translation between XML queries and SQL statements for the approaches Edge, XRel and XParent. Here, we focus on the translation for INode. Example 1 shows a XML query using the XPath syntax [2].

Example 1. “Select the authors of all articles that the end pages are equal to 77” with the XML data shown in Figure 2.

Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

SQL-1 A translated SQL query for the XPath query Q1 using INode 1
select e1.value
from path p1, path p2, element e1, element e2
where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
and e1.pathid = p1.pathid
and e2.pathid = p2.pathid
and mod(e1.nodeid, 1) = mod(e2.nodeid, 1)
and floor(((e1.nodeid – 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)
and e2.value = '77'

SQL-1 shows the translated SQL query using the INode with embedding the document id in the node id. SQL-1 uses two equijoins and two selections to identify the two path identifiers. Then it uses two equijoins to check the edge connections. In total, four equijoins and three selections are used.