The Main Difference Between XML and HTML

Introduction to XML

· The main purpose of XML is to facilitate the exchange of data between two systems.

· XML stands for Extensible Markup Language.

· XML was designed to describe data, whereas HTML was designed to display text and information.

· XML uses plain text files to describe data.

· Unlike HTML, XML tags are not predefined. You define your own tags.

· XML uses a Document Type Definition (DTD) or an XML Schema to describe the valid data.

· XML is governed by the W3C (World Wide Web Consortium).

· A new version of HTML that is based on XML is called XHTML.

· With XML, data can be exchanged between incompatible systems.

Here is a simple example of an XML document describing the properties of a book:

<book>
<title>Component Based Software Design</title>
<author>William Johnson</author>
<publisher>Addison Wesley</publisher>
<description>
Covers fundamentals of CORBA, COM, RMI, Remoting, Web Services
</description>
</book>

XML allows the person creating the XML document to define their own tags and their own document structure. In the above example, all the tags, such as <book> and <title>, were created by the person describing the book data and are not part of the XML standard. Compare this to HTML, where only the tags that are defined in the HTML standard can be used (e.g., <h1>, <b>, <table>, <img>).

It is important to note that XML was designed to describe and exchange data. It was not designed to display data, although a related technology (XSLT) can transform XML into XHTML for display.

Data Exchange via XML: One of the challenges for enterprises has been to exchange data between incompatible systems over the Internet. Converting the data to XML can greatly facilitate the exchange of data between different systems. Since XML data is stored in plain text format, it makes it easier to create data that different applications and operating systems can easily work with. It also makes it easier to send the XML data over the internet.

XML can also be used to store data in files or in databases. Applications can be written to store and retrieve information from the file or database, and programs can be developed to display the XML data.
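As a minimal sketch of how an application might retrieve information from such a document, the following Python snippet parses the book example above with the standard library's xml.etree.ElementTree module. The choice of Python and the variable names are illustrative assumptions and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# The book document from the example above, held in a string.
BOOK_XML = """<book>
<title>Component Based Software Design</title>
<author>William Johnson</author>
<publisher>Addison Wesley</publisher>
<description>
Covers fundamentals of CORBA, COM, RMI, Remoting, Web Services
</description>
</book>"""

book = ET.fromstring(BOOK_XML)

# Elements are looked up by the names the document author chose.
title = book.findtext("title")
author = book.findtext("author")
description = book.findtext("description").strip()

print(title, "by", author)
```

The program never relies on where the title sits in the file, only on the element name.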

Motivation for XML:

Using Notepad, create the following ASP page and save it as “GetStock.asp” in the c:\webclass folder. Create a virtual directory with an alias of “wc” and map it to the c:\webclass folder. Appendix A describes how to create a virtual directory and also shows the same example using Java Server Pages.

<HTML>
<HEAD>
<TITLE>Inet Component Test</TITLE>
</HEAD>
<BODY>
<H1>Data Exchange with Nasdaq.com</H1>
<%
Response.flush 'otherwise it takes a long time to see anything the first time
Dim ic1
Set ic1 = Server.CreateObject("InetCtls.Inet.1")
Dim s1
ic1.cancel 'cancel any pending connections or requests
ic1.protocol = 4 'http protocol
ic1.remoteport = 80
ic1.requesttimeout = 37
ic1.accesstype = 0
'download the quote page as one long string of HTML
s1 = ic1.openurl("http://quotes.nasdaq.com/quote.dll?mode=stock&page=quick&symbol=ibm&selected=ibm",0)
Dim pos, s2
'locate the "$&nbsp;" marker (7 characters) that precedes the price
pos = InStr(1, s1, "$&nbsp;")
'the 5 characters after the marker are taken to be the price
s2 = Mid(s1, pos+7, 5)
Response.write "IBM Price = " & s2
Set ic1 = nothing
%>
</BODY>
</HTML>

Now you can test the above file by launching the browser and typing the following URL in it:

http://localhost/wc/GetStock.asp

The above example relied on finding the "$&nbsp;" marker and then extracting the stock price from the Nasdaq page. If tomorrow Nasdaq decides not to use "$&nbsp;" before displaying the price, the above data exchange will fail. As you will discover later, the use of XML can make this data exchange reliable.

Examples of GPS data and bank transaction exchanges will be presented during the class lecture.
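To see why XML makes this kind of exchange less fragile, suppose the quote service returned a small XML document instead of an HTML page. The format below (the quote, symbol, and price element names, and the price value itself) is purely hypothetical and is not an actual Nasdaq format; the sketch is written in Python rather than ASP.

```python
import xml.etree.ElementTree as ET

# Hypothetical response from a quote service (made-up format and value).
QUOTE_XML = """<quote>
<symbol>IBM</symbol>
<price currency="USD">100.00</price>
</quote>"""

def read_price(xml_text):
    """Return (symbol, price) by element name, not by character position."""
    quote = ET.fromstring(xml_text)
    return quote.findtext("symbol"), float(quote.findtext("price"))

symbol, price = read_price(QUOTE_XML)
print(symbol, "Price =", price)
```

Because the price is found by its element name, reordering the elements or changing how the page is laid out for display would not break the reader.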

XML can be used to create other languages:

XHTML, a newer version of HTML, is based on XML. The Wireless Markup Language (WML), used to mark up Internet applications for handheld devices like PDAs and mobile phones, is written in XML.

Most future applications will exchange their data in XML:

Applications such as word processors, spreadsheet applications and databases are already using XML to read and create files so that data can be exported to other software without any sophisticated conversion utilities in between. As you will discover in this course, any time there is a need for data exchange between two systems, XML will make this data exchange reliable.

Let us take a look at the different parts of an XML document. Consider a book’s XML description.

<?xml version="1.0" encoding="ISO-8859-1"?>
<book>
<title>Component Based Software Design</title>
<author>William Johnson</author>
<publisher>Addison Wesley</publisher>
<description>
Covers fundamentals of CORBA, COM, RMI, COM+, EJBs, Remoting, and
Web Services
</description>
</book>

The first line in the document, the XML declaration, defines the XML version and the character encoding used in the document. In this case the document conforms to the 1.0 specification of XML and uses the ISO-8859-1 (Latin-1/West European) character set. The next line, <book>, is the start tag of the root element of the document; in effect it says, "this document describes a book".

All XML elements must have a closing tag. With XML, it is illegal to omit the closing tag. If an element has no content, you can use <tagname/> to indicate both the start and the end of the element.

Compare this to HTML where some elements do not require a closing tag. For example, the following code is legal in HTML:

<hr>Horizontal line will be drawn before this line
or
<td> column in a table

In XML, all elements must have a closing tag, e.g.:

<hr/> Horizontal line will be drawn before this line
<td> column in a table</td>

Note: The first line in an XML document, <?xml version="1.0"?>, does not have a closing tag. This is because the XML declaration is not an element of the XML document, so it does not need a closing tag.
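The closing-tag rule can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the is_well_formed helper and the <row> wrapper element are assumptions introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with missing closing tags is rejected.
print(is_well_formed("<row><hr>a line<td>column in a table</row>"))
# The same content with every element closed is accepted.
print(is_well_formed("<row><hr/>a line<td>column in a table</td></row>"))
# An XML declaration needs no closing tag.
print(is_well_formed('<?xml version="1.0"?><hr/>'))
```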

Unlike HTML, XML tags are case sensitive.

With XML, the tag <Book> is different from the tag <book>.

Opening and closing tags must therefore be written with the same case. In the lines below, the first is incorrect and the other two are correct:

<Title> XML fundamentals </title> 
<Title> XML fundamentals </Title>
<title> XML fundamentals </title>
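A short Python sketch (standard library only; the variable names are illustrative) shows both effects of case sensitivity: names that differ only in case are different names, and a closing tag in a different case is rejected.

```python
import xml.etree.ElementTree as ET

# Element names that differ only in case are different names.
upper = ET.fromstring("<Book/>")
lower = ET.fromstring("<book/>")
print(upper.tag == lower.tag)

# A closing tag must match the case of its opening tag.
try:
    ET.fromstring("<Title> XML fundamentals </title>")
    mixed_case_accepted = True
except ET.ParseError:
    mixed_case_accepted = False
print(mixed_case_accepted)
```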

All XML elements must be properly nested.

In HTML some elements can be improperly nested within each other like this:

<b><i>This text will appear bold and italic – improper nesting</b></i>

In XML all elements must be properly nested within each other like this:

<b><i> This text will appear bold and italic – proper nesting</i></b>
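As a rough illustration of how strictly a parser enforces nesting, the sketch below feeds both forms to Python's xml.etree.ElementTree. The parse_or_error helper is a hypothetical convenience function, and the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET

IMPROPER = "<b><i>bold and italic</b></i>"
PROPER = "<b><i>bold and italic</i></b>"

def parse_or_error(xml_text):
    """Return the parsed root element, or the parser's error as a string."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)

print(parse_or_error(IMPROPER))  # an error message from the parser
proper_root = parse_or_error(PROPER)
print(proper_root.tag, proper_root[0].tag)
```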

All XML documents must have a single root element.

All other elements must be within this root element.

Elements can have child elements. Child elements must be correctly nested within their parent element.

Attribute values must always be quoted – Either single quotes or double quotes can be used.

XML elements can have attributes in name/value pairs just like in HTML. In XML the attribute value must always be quoted. Examine the two XML documents below. The first one is incorrect, the second is correct:

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category=software>
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category="software">
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>
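The difference between the two documents can be confirmed with a parser. This is a minimal Python sketch using the standard library; the XML declaration is left out of the strings for brevity, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

UNQUOTED = "<book category=software><title>Component Based Software Design</title></book>"
QUOTED = '<book category="software"><title>Component Based Software Design</title></book>'

# The unquoted attribute value makes the whole document unparseable.
try:
    ET.fromstring(UNQUOTED)
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False

# With quotes, the attribute value can be read back by name.
category = ET.fromstring(QUOTED).get("category")
print(unquoted_ok, category)
```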

Unlike HTML, XML does not truncate white space in your document. With XML, CR/LF is converted to LF, so a new line is always stored as LF.

In Windows applications, a new line is normally stored as a pair of characters: carriage return (CR) and line feed (LF). The character pair bears some resemblance to the typewriter actions of setting a new line. In Unix applications, a new line is normally stored as a LF character. Older Macintosh applications (classic Mac OS) used only a CR character to store a new line.
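The white space and new line behavior described above can be observed with a small Python sketch. It assumes the standard xml.etree.ElementTree parser, and the <note> element is a made-up name used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Runs of spaces inside an element are kept as they are.
spaced = ET.fromstring("<note>  two   words  </note>")
print(repr(spaced.text))

# A Windows-style CR LF pair in the input is reported as a single LF.
windows = ET.fromstring("<note>line one\r\nline two</note>")
print(repr(windows.text))
```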

Comments in XML

The syntax for writing comments in XML is similar to that of HTML: <!-- This is a comment -->

XML Elements are Extensible - XML documents can be extended (i.e., more tags or attributes can be added) to carry more information.

Suppose we started with the following XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category="software">
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>

Then a programming application extracted the data by reading the book, title and author tags from the above XML document and displayed it as:

Book Category: software
Book Title : Component Based Software Design
Book Author : William Johnson

Now suppose the author of the XML document added some extra information to the XML document as:

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category="software" date="1-1-2004">
<title>Component Based Software Design</title>
<author>William Johnson</author>
<publisher>Addison Wesley</publisher>
</book>

Should the existing application that reads the data from the new XML document above break or crash? No. The application should still be able to find the <book> element, its category attribute, and the <title> and <author> elements in the XML document and produce the same output.

XML Elements have Relationships - Elements are related as parents, children and siblings.
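A minimal sketch of such an application, written in Python with the standard library, might look like the following. The describe function is a hypothetical stand-in for the "programming application" mentioned above; it reads only the items it knows about and ignores everything else.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<book category="software">
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>"""

EXTENDED = """<book category="software" date="1-1-2004">
<title>Component Based Software Design</title>
<author>William Johnson</author>
<publisher>Addison Wesley</publisher>
</book>"""

def describe(xml_text):
    """Read only the items this application knows about; ignore the rest."""
    book = ET.fromstring(xml_text)
    return [
        "Book Category: " + book.get("category"),
        "Book Title : " + book.findtext("title"),
        "Book Author : " + book.findtext("author"),
    ]

print("\n".join(describe(EXTENDED)))
```

The extra date attribute and the <publisher> element are simply never asked for, so the output is the same for both documents.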

Examine the following XML document that describes a book:

<book>
<title> XML Fundamentals</title>
<prod id="1237" date="2/5/2001"></prod>
<chapter>Introduction to XML
<para>Introduction to HTML</para>
<para>Comparing HTML and XML</para>
</chapter>
<chapter>XML Syntax
<para>XML is case sensitive</para>
<para>XML Elements must be properly nested</para>
</chapter>
</book>

Book is the root element. title, prod, and chapter are child elements of book. Book is the parent element of title, prod, and chapter. Title, prod, and chapter are siblings (or sister elements) because they have the same parent.
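These relationships can be explored in code. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the book document; since ElementTree elements hold no reference to their parent, the parent_of map is built by hand, and all variable names are illustrative.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>XML Fundamentals</title>
<prod id="1237" date="2/5/2001"></prod>
<chapter>Introduction to XML
<para>Introduction to HTML</para>
<para>Comparing HTML and XML</para>
</chapter>
</book>"""

root = ET.fromstring(BOOK_XML)

# Children of the root element, in document order.
children = [child.tag for child in root]

# ElementTree keeps no parent links, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

chapter = root.find("chapter")
# Siblings are the other children of the same parent.
siblings = [e.tag for e in parent_of[chapter] if e is not chapter]

print(children, parent_of[chapter].tag, siblings)
```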

Elements have Content - Elements can have different content types.

An XML element is everything from (including) the element’s start tag to (including) the element’s end tag.

An element can have element content, mixed content, simple content, or empty content. An element can also have attributes.

In the example above, book has element content, because it contains other elements. Chapter has mixed content because it contains both text and other elements. Para has simple content (or text content) because it contains only text. Prod has empty content, because it carries no information.

In the example above only the prod element has attributes. The attribute named id has the value "1237". The attribute named date has the value "2/5/2001".
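The four content types can be told apart programmatically. The following Python sketch is one possible classification; the content_type function is hypothetical, and it assumes that text consisting only of white space does not count as content.

```python
import xml.etree.ElementTree as ET

def content_type(element):
    """Classify an element as having element, mixed, simple, or empty content."""
    has_children = len(element) > 0
    # Text may sit before the first child (.text) or after any child (.tail).
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"

book = ET.fromstring(
    '<book><title>XML Fundamentals</title>'
    '<prod id="1237" date="2/5/2001"></prod>'
    '<chapter>XML Syntax<para>XML is case sensitive</para></chapter></book>'
)
for element in book.iter():
    print(element.tag, content_type(element))
```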

XML elements must follow these naming rules:

· Names can contain letters, numbers, and other characters

· Names must not start with a number or punctuation character

· Names must not start with the letters xml (or XML or Xml ..)

· Names cannot contain spaces

· Names cannot contain some special characters such as &, %, :.
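The rules listed above can be approximated with a regular expression. The Python sketch below is a simplification and not the full name grammar of the XML specification; the pattern and the is_valid_name function are assumptions made for illustration.

```python
import re

# Simplified pattern for the rules listed above (not the full XML grammar):
# the first character is a letter or underscore, later characters may also
# be digits, "." or "-", and the name does not begin with "xml" in any case.
NAME_PATTERN = re.compile(r"(?!xml)[^\W\d][\w.-]*", re.IGNORECASE)

def is_valid_name(name):
    """Return True if name satisfies the simplified naming rules."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["first_name", "FirstName", "1st_name", "xmlData", "first name", "a&b"]:
    print(candidate, is_valid_name(candidate))
```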

Guidelines: Any name can be used, no words are reserved, but the idea is to make names descriptive. Names with an underscore separator are nice.

Examples: <first_name>, <last_name>, or <FirstName> , <LastName>.

Avoid "-" and "." and other punctuation characters in names. For example, if you name something "prod-qty," it could be a problem if your software tries to subtract qty from prod. Or if you name something "prod.qty," your application program may think that "qty" is a property of the object "prod".

Element names can be as long as you like, but don't make them too long. Names should be short and simple, like this: <AuthorName> not like this: <the_name_of_the_author>.

XML documents often work with a corresponding database, in which fields exist corresponding to elements in the XML document. A good practice is to use the naming rules of your database columns for the elements in the XML documents.

Non-English letters like éòá are perfectly legal in XML element names, but your document should use the proper encoding.

The ":" should not be used in element names because it is reserved to be used for something called namespaces (described later).

XML elements can have attributes, just like HTML.

Attributes are used to provide additional information about elements.

In HTML you can place an image or a text input field as:

<IMG SRC="laptop.gif"> or <input type="text" name="txtData">

In HTML (and in XML), attributes provide additional information about elements.

Attributes often provide information that is not a part of the data. In the example below, the file type is irrelevant to the data, but important to the software that wants to manipulate the element:

<file type="gif">laptop.gif</file>

Attribute values must always be enclosed in quotes, but either single or double quotes can be used. For a book’s category, the book tag can be written like this:

<book category="software">

or as:

<book category='software'>

Note: If the attribute value itself contains double quotes, enclose the value in single quotes.

Note: If the attribute value itself contains single quotes, enclose the value in double quotes.
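The quoting rules can be tried out with a parser. In the Python sketch below, the two book titles are made up purely to show quotes inside attribute values; they do not come from the examples in this document.

```python
import xml.etree.ElementTree as ET

# Single and double quotes around a value are equivalent.
double = ET.fromstring('<book category="software"/>')
single = ET.fromstring("<book category='software'/>")
print(double.get("category") == single.get("category"))

# A value containing double quotes is enclosed in single quotes,
# and a value containing a single quote is enclosed in double quotes.
# Both titles are made up for illustration.
with_double = ET.fromstring("""<book title='The "Complete" Guide'/>""")
with_single = ET.fromstring("""<book title="A Programmer's Guide"/>""")
print(with_double.get("title"))
print(with_single.get("title"))
```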

Use of Elements vs. Attributes

Data can be stored in child elements or in attributes.

Some examples are presented below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category="software">
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>

<?xml version="1.0" encoding="ISO-8859-1"?>
<book>
<category>software</category>
<title>Component Based Software Design</title>
<author>William Johnson</author>
</book>

<?xml version="1.0" encoding="ISO-8859-1"?>
<book category="software" title="Component Based Software Design" author="William Johnson">
</book>

All examples above provide the same information.
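One way to convince yourself that the three forms carry the same information is to read each with code that accepts either style. The Python sketch below is illustrative only; read_field and read_book are hypothetical helper names, and the XML declarations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

AS_ATTRIBUTE = ('<book category="software"><title>Component Based Software Design</title>'
                '<author>William Johnson</author></book>')
AS_ELEMENTS = ('<book><category>software</category>'
               '<title>Component Based Software Design</title>'
               '<author>William Johnson</author></book>')
ALL_ATTRIBUTES = ('<book category="software" title="Component Based Software Design" '
                  'author="William Johnson"></book>')

def read_field(book, name):
    """Look for name as an attribute first, then as a child element."""
    value = book.get(name)
    if value is None:
        value = book.findtext(name)
    return value.strip() if value is not None else None

def read_book(xml_text):
    """Collect category, title, and author regardless of how they are stored."""
    book = ET.fromstring(xml_text)
    return {name: read_field(book, name) for name in ("category", "title", "author")}

print(read_book(AS_ATTRIBUTE) == read_book(AS_ELEMENTS) == read_book(ALL_ATTRIBUTES))
```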

There are no rules about when to use attributes, and when to use child elements. My experience is that attributes are handy in HTML, but in XML you should try to avoid them. Use child elements if the information feels like data.

Some of the problems with using attributes are:

· Attributes cannot contain multiple values whereas child elements can.

· Attributes cannot describe structures i.e., the hierarchical relationships between elements whereas child elements can.

· Legal attribute values are not easily verified against a Document Type Definition.

To show the limitation of using attributes, what if a book belonged to more than one category? Here are two approaches to describing it.