Page 1 of 9

The use of RDMMS for querying XML Documents

Abstraction

XML is a fast going technology, the use of traditional RDBMS querying on XML is a leverage way to work on this fast going technology. This paper mainly discusses how to simplified DTDs and discuses two algorithms on transforming DTDs to relational schemas. Basic Inline algorithm builds every relation starting from every element in the DTD graph. There will be less joins if an XML-query querying from that particular element, but Basic Inlining Algorithm produces too many relations and makes this method unrealistic if the XML contains very complex nested tree structure. The Improved Shared Inlining algorithm build relations by sharing common parts, thus reduces the number of relations and can work on any complex nested structures. Improved Shared Inlining Algorithm is more practical, more complete and reduces a lot of redundancy.

XML is rapid emerging on the World Wide Web as a standard for representing and exchange data. It is critical to have efficient mechanisms to store and query XML documents to exploit the full power of this new technology. Currently, two approaches are being developed for storing and querying XML data. One approach is to develop native semi-structure query language to work on XML documents directly. The other approach is to take the advantage of the matured RDBMS technology to work on it. This paper will talk about the second approach. This approach is shown in Figure1.

The working flow is as following:

(1)Use DTD to build up relational schema.

(2)Use the created Relational schema, translate XML documents into Relational tuples

(3)Translate semi-structured Query into SQL

(4)Use SQL to query on relational tables to get result in Relational database

(5)Transform the result in Relational database back into XML format.

In this approach, we have to create the Relational schema and we have to do three kinds of transformation. This approach seems very awkward, but actually, we have some irrefutable reasons to use it.

1Motivations and difficulties of using the RDBMS for querying XML documents

1.1Motivations:

From the former text, we can see, it seems rather awkward to use the RDBMS to querying XML documents, why we want to use this method? We summarize the reasons as following:

1)The use of RDBMS on querying XML documents is a leverage way to provide query capabilities over XML documents. We know XML is a fast emerging standard in representing data today while RDBMS has developed for over 20 years, we do not want, RDBMS, this matured technology stands by doing nothing on the fast going XML technology. We want RDBMS to involved in, to make contributions, to take some control on the development of XML.

2) Take the advantages of RDBMS. RDBMS has a lot of advantages which XML does not have, they have efficient storage, indexes, security management, transactions and data integrity controls, etc. Using RDBMS method takes the advantages of all these matured technology and maybe faster then directly work on XML.

3) In real world, data switch between relational DB and XML are needed. For example, in Datawarehouse, operational data may exist in XML format, but the reconciled data in the middle of the warehouse is usually in relational format. In E-commerce, data maybe in relational format and in XML format, we need constant switch between these two formats.

4) Semi-structured query language is not matured yet; we can get some reference to develop the new semi-structure query language from the use of RDBMS.

1.2Difficulties:

XML documents and Relational Database are dramatically different in its nature and format: XML are arbitrary nested tree structured, the RDBMD is only two level of schema in nature (relation and attributes); XML tags describe itself and can be arbitrary complex while the RDBMS is difficult to express these complex tags; XML uses semi-structured query language to work on it while the RDBMS uses SQL to work on it. As we mentioned before, there are five steps to use to RDBMS for querying XML document. The first step, and the most important and the most difficult step, is to use DTD to build up relational schema. We have following difficulties to build relational schema:

1)We have to Deal with the complexity of DTD element specifications

2)We have to resolve the conflict between the two-level nature of relational schemas (table and attribute) vs. the arbitrary nesting of XML DTD schemas

3)We have to deal with set-valued attributes and recursion

How do we resolve these difficulties? In this text, we will mainly discuss how to simplify DTDs and two algorithms of mapping DTDs into relational schemas. Section 2 will talk about how to simplify DTDs; section 3 will talk about Basic Inlining algorithm; Section 4 will discuss about Improved Share Inlining algorithm; and we make comparison and conclusion in the last section.

2.How to Simplify DTD

We have discussed, at the beginning of this paper, the working flow of the use of RDBMS for XML documents. In that working flow, we can see that in order to translate XML documents into relational, we have to build up the relational schema first from the DTD. Since DTD can be very complex tree structured, the first thing first is to simplify the DTD.

We have three kinds of operations to simplify the DTDs.

(1)Flatting

From the left hand side, we can see: Nested definitions can be transformed into flat definitions, and “|” operation can be removed.

(2)Simplification transformation

From the left hand side, we can see: multiple operations on one single element can be reduced to one single operation, if both”? “ and “*” operation exits, “*” is selected.

(3)Grouping transformation

From the left hand side, we can see: the redundant appearance of one element can be group together.

Finally, any element <!ELEMENT a ((b|c|e)?, (e?|(f?,(b,b)*))*)>

is simplified to: <!ELEMENT a (b*,c?,e*,f*)>

After simplify the DTDs, we can see an element which has star operations, optional operations, tuple operations, nested operations and group operations will be transformed into an element which has only star operations and tuple operations.

We May notice that, during the simplification procedure, we have lost some information about the relative orders of the elements, but this information can be captured when a specific XML document is loaded into this relational schema.

3.Basic inlining algorithm

The basic Inlining algorithm solves the fragmentation problem by inlining as many descendents as many as possible into a single relation. The working flow of Basic inlining algorithm is work like this:

(1) We first build the DTD graph according to the DTD document. A DTD graph represents the structure of a DTD. Its nodes are elements, attributes and operators in the DTD. Each element appears exactly once in the graph, while the attributes and operators appear as many times as they appear in DTD. Let look at a DTD document as Figure2. The corresponding DTD graph is made as in Figure3.

(2) Now we discuss, what situation can be inlinable:

For every element in the DTD tree, we can see, there are three situations:

1)An element has exactly one incoming edge, this element is “inlinable”. For example, in Figure 3, “booktitle” has exactly one incoming edge from “book”. In this situation, “booktitle” can be inlined into “book”, we express this situation as “book.booktile”

2)An element has more then one incoming edge. For example, in Figure 3, “author” has three incoming edges, they are from ”book”, "*”, and “monograph”. This situation, the element “author” cannot be inlinable directly.

3)An element has incoming edge from “*”. For example, in Figure 3, “author has incoming edge from ”*” , “monograph” has incoming edge from ”*”. These situations, the element cannot be inlinable.”*” operation, express multiple appearance of an element, in RDBMS, is just one to many relationship.

(3) In basic algorithm, we try to inline all the elements into one single relation. In order to do that, we build element tree to make sure that every element in the element tree has exactly one incoming edge(the root node may have no incoming edge). We use the Depth first traversal algorithm to traverse the DTD graph to get the Element tree. After the element tree is built, we can see, each element in this element tree has exactly one incoming edge(or no edge)

(4) After element tree is built, we can inline the element tree into relations. All the elements inside the tree have exactly one incoming edge, they can be inlinable into their parent node except in situation which the incoming node is”*”. In the relation DB point of view, this “*” operation, which appears in DTD graph, is just express one–to-many relationship. If the incoming edge is from”*”, we build a separate relation and add a foreign key to trace back and join with it’s parent relation.

In Figure 4, we can see, “firstname” can be inlined into “name” (we ignore “?”), then it can be inlined into “author”, and “author” can be inlined into “monograph”, so we have “monograph.author.name.firstname”, the same as in “lastname”, and “address” and “authored”. So in monograph, we finally have:

monograph.author.name.firstname, monograph.author.name.lastname, monograph.author.address, monograph.authorid.

All these inlinable elements form the attributes in the relation “monograph”, we add can a key attribute “monographID” into this relation, and we have the relation monograph as following:

monograph(monographID, monograph.author.name.firstname, monograph.author.name.lastname, monograph.author.address, monograph.authorid)

There is a “*” operation between “monograph” and “editor” and there is a backpointer point back to editor. If we try to traverse the element tree from “editor’, we build the following two relations: editor(editorID, editor.name) and editor.monograph(editor.monographID, editor.monograph.parentID, monograph.author.name.firstname, monograph.author.name.lastname, monograph.author.address, monograph.authored)

Look at the second relation, we add an attribute “editor.monograph.parentID”, this attribute is the same as the “editorID” attribute in relation editor(editorID, editor.name) and “editor.monograph.parentID” serve as a foreign key. So we can join the two relations editor and editor.monograph by using this foreign key.

Finally, by using Basic Inlining algorithm, Figure 3 will create following relations(Figure 5):

4.Improved Shared Inlining Algorithm

Basic inlining algorithm has the drawback of large quantities of relations. Improved Shared Inlining algorithm combines the shared parts in DTDs and removes all the redundant relations. Thus, more clear, complete relations are created. Improved shared Inlining algorithm is developed upon Shared Inlining algorithm, the most improvement from Shared Inlining Algorithm is to build a separate the new relation to handle the many-to-many relations (“*” operation in DTD) between the two uninlinable elements and back pointing recursive relation between these two elements. In this paper, we directly use Improved Shared Inlining Algorithm and not go into detail how those improvements are made from “Share”.

After simplifying DTD and build DTD graph base on simplified DTD, Improved Shared Inlining algorithm begins to traverse the DTD graph directly. Every element on the DTD is judge is it is inlinable, if it is, it will inlined to it’s parent node, if it is not, a new relation is build and is shared by all its ascendants. Improved Shared Inlining Algorithm use Depth-first-search strategy to traverse the graph to identify if each node is inlinable and inline the child node to its parents. Look, in this algorithm, every element in the DTD tree is traversed only once. The algorithm is work as following:

In this algorithm as in Figure 6, we use the function inlinable( aNode) to judge if a element has exactly one incoming edge( as described in Basic Inlining algorithm the first situation of an element). The “inlinedSet” means the union of all descendentelements which can be inlined to this element. For example, in Figure 3, “firstname” and “lastname” can be inlined to “name”, and “name” , “address”, “authorid” can be inlined to “author”, so we has inlinedSet at “author” element, which is {author, name, firstname, lastname, address, authorid}.

By using this lining procedure, Figure 3 is transferred into Figure6.

After a simplified DTD graph is inlined, the last step is to generate a relational schema based on this inlined DTD graph. The generated schema supports the select, insert, delete and update of an arbitrary XML element declared in the input DTD. We list the steps will be performed on the inlined DTD graph to generate a set of relations:

1.We create relation for each node on the inlined DTD graph with the following attributes:

1)A primary key, ID.

2)If |inlinedSet|  2, we introduce nodeType to indicate the type if the XML element stored in a tuple

3)The names of all the terminal XML elements in inlinedSet. The internal element is removed from attribute list.

4)If current element, e, is connected to other element c, we introduce c.ID as a foreign key of e referencing relation.

2.If there are at least two relations t1(ID) and t2(ID) generated by step1, then we combine all the relations of the form t(ID) into one single relation table(ID, nodeType) where nodeType indicates which XML element is stored in a tuple.

3.if there are at least two relations t1(ID, t1) and t2(ID, t3) generated by step 1, then we combine all the relations of the form t(ID,t2) into one single relation table2(ID, nodeType, pcdata) where nodeType indicates which XML element is stored in a tuple.

4.If there is at least one * edge in the inlined DTD graph, then we introduce relation edge(parented, childID, parentType, childType) to store all the parent-child relationships corresponding to * operations. The domains of parentType and childType are the set of XML element names defined in the input DTD. Step 1 converts each node e in the inlined DTD graph into a separate relation e. If there are some other XML element nodes that have been inlined to it (i.e., |e.inlinedSet|  2), relation e will be used to store all these XML elements, and attribute nodeType will be introduce to indicate which XML element is the root for each tuple. Since step 1 might produce a set of relations in the forms of t(ID) and t(ID, t), step 2 and 3 optimize them by performing a horizontal combing of them into table1(ID, nodeType) and table2(ID, nodeType, pcdata). These optimizations reduced the number of target relations and will facilitate the mapping from XML operations to relational SQL operations. Finally, one single relation edge(parentID, childID, parentType, childType) stores all the many-to-many relationships between arbitrary two XML elements.

Let’s see in Figure7, according the steps we specified above, at element “monograph”, a relation is created as monograph(ID, monograph) according to step 1; also according to step 1, at element “editor, name”, a relation is created as editor(ID, nodeType, editor, name); according to step 4, between the above two elements, a new relation, edit_mono_edge(parentID, childID, parentType, childType), is created. This relation resolves the one-to-many relationships between relations editor and monograph like this: “parentid” and “parentType” in this relation will be the primary key, ID, in relation editor(ID, nodeType, name) and “editor”; the “childID” and “childType” in edit_mono_edge(parented, childID, parentType, childType) will be corresponding attributes “ID”, “monograph” in set of monographs which contain in the editor relation. On the other hand, this relation also resolves the backpointer from monograph to editor. The ”parentid” and parentType will be “ID” and “monograph” in relation(ID, monograph); the “childID” and “childType” will be “ID” and “editor” in relation editor. For Figure7, we can finally generate following relations:

book(ID, nodeType, booktitle)

author(ID, nodeType, firstname, lastname, address, authorid)

article(ID, nodeType, contactauthor, autherid)

title(ID, parented, title)

editor(ID, nodeType, name)

monograph(ID, monograph)

edit_mono_edge(parentID, childID, parentType, childType)

By using the improved shared algorithm, compared with the relations in Figure5, we can see the number of relations is largely reduced.

5. Comparison of these two algorithms

Basic InliningAlgorithmtries to build a separate relation on each element in element tree. If an XML-query is querying from that particular node, it need not join with other relation. This is nice. But at the same time, Basic Inlining Algorithm produce huge number of relations and a lot of redundancy appears in the attributes in different relations. If the DTDs have strongly connected component, this method is not practical.

Improved Shared Inlining Algorithm traverse the DTD graph only once, it avoids the drawback of large quantities of relations created in Basic, it combines together some relations which have the same schemas in nature. Improved Shared Inlining Algorithm also use separate relation to define the many-to-many relationship (“*“ operation in DTD) between two elements. This method not only reduces the redundant tuples exists in Basic Inlining Algorithm. By using a separate relation, the back pointing recursive relation in DTD can also be defined. Also, Improved Inlining Algorithm combines some of the relations which format is same. Although Improved Shared Inlining Algorithm may increase the number of joins if queries work on XMLs, it produce more practical, more completeness, and more concise relations, it can solves on very complex DTD documents.