Efficient XML Report

Version 1.0

3/15/2010

National Center for Atmospheric Research

DOCUMENT REVISION REGISTER

Version / Date / Content Changes / Editors / Contributors
1.0 / 03/15/2010 / Preliminary draft on compactness / Aaron Braeckel / Aaron Braeckel

Please direct comments or questions to:

Aaron Braeckel

National Center for Atmospheric Research

Research Applications Laboratory

3450 Mitchell Lane

Boulder, CO 80301

(303) 497-2806

Terms of Use – NNEW Documentation

The following Terms of Use applies to the NNEW Documentation.

  1. Use. The User may use NNEW Documentation for any lawful purpose without any fee or cost. Any modification, reproduction and redistribution may take place without further permission so long as proper copyright notices and acknowledgements are retained.
  2. Acknowledgement. The NNEW Documentation was developed through the sponsorship of the Federal Aviation Administration.
  3. Copyright. Any copyright notice contained in this Terms of Use, the NNEW Documentation, any software code, or any part of the website shall remain intact and unaltered and shall be affixed to any use, distribution or copy. Except as specifically permitted herein, the user is not granted any express or implied right under any patents, copyrights, trademarks, or other intellectual property rights with respect to the NNEW Documentation.
  4. No Endorsements. The names, MIT, Lincoln Labs, UCAR and NCAR, may not be used in any advertising or publicity to endorse or promote any program, project, product or commercial entity.
  5. Limitation of Liability. The NNEW Documentation, including all content and materials, is provided "as is." There are no warranties of use, fitness for any purpose, service or goods, implied or direct, associated with the NNEW Documentation and MIT and UCAR expressly disclaim any warranties. In no event shall MIT or UCAR be liable for any damages of any nature suffered by any user, or any third party resulting in whole or in part from use of the NNEW Documentation.

Table of Contents

OVERVIEW

Binary and Efficient XML

Memory Usage

Processing

Compactness

Increased network bandwidth requirements

Increased storage requirements

Increased data latencies

EXISTING WORK

W3C XML Binary Characterization Working Group (XBC WG)

W3C XML Efficient XML Working Group (EXI WG)

MIT Lincoln Labs FastInfoset and EXI weather comparison

NCAR Preliminary Sun’s FastInfoset Evaluation

NCAR Exificient (EXI) Library Compactness Assessment

SOLUTION CLASSES

Data-agnostic compression

Hardware

XML-Wrapped Binary

Efficient/Binary XML Formats

ASSESSMENT

Environment

Software

Hardware

Configuration

Data

Output

Analysis

FUTURE WORK

RECOMMENDATIONS

APPENDIX A - ACRONYMS

APPENDIX B – DATA EXAMPLES

Aircraft Reports

AIR/SIGMETs

METARs

TAFs

APPENDIX C - DEFINITIONS AND TERMS

APPENDIX D - REFERENCES


OVERVIEW

The eXtensible Markup Language (XML) has become ubiquitous in software systems and data exchange since its release in 1998 by the W3C. XML is now the de facto standard data format across most domains, including Service Oriented Architectures (SOA). This is due to a number of advantages:

  • Human-readable
  • Self-describing
  • Hardware, software, and platform-independent
  • Expressive data model (trees, graphs, etc.)
  • Extensible
  • Validatable
  • Namespaces

However, these benefits can come with a performance cost as compared to many legacy formats. The costs include increased processing, reduced compactness, and increased memory usage during common operations such as data parsing, storage, and regular data exchange. For example, a DoD study noted a 10x to 100x file size increase when moving from “legacy” data formats to XML ([1]).

This report analyzes the efficiency cost and alternatives for XML usage in the weather domain. This analysis may also be highly relevant to other scientific domains dealing with large data volumes. As XML becomes a mission-critical component of modern data systems and as data volumes in the weather domain increase, it is essential to understand the efficiency characteristics of XML.

Binary and Efficient XML

The terms “binary XML” and “efficient XML” are often used together. For the purposes of this report, efficient XML is considered a superset of binary XML: binary XML approaches are one strategy for solving the more general efficient XML problem. This report analyzes the broader set of techniques for efficient XML, including parsing techniques, hardware, and alternative XML encodings.

PROBLEM DESCRIPTION

For the purposes of this report, the following solution criteria are considered:

  • Open standard
  • Minimal impacts on existing XML characteristics, such as platform-independence
  • Minimal impacts on the XML family of functionality, such as:
      • XPath
      • XQuery
      • XSLT (transformation)
      • XML Schema
  • Minimal impacts to developers and users

One source of inefficiency in XML is the representation of numeric data as character data. For example, the integer value -12345 fits in two bytes (octets) when encoded directly as a 16-bit signed integer. When this value is represented in XML/UTF-8 it is encoded as six characters (“-12345”), each requiring one octet, for a total of six octets. In the case of XML/UTF-16, each character requires two octets, for a total of twelve.
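The size difference described above can be checked with a short Java sketch. The class and method names are hypothetical and exist only for illustration; the 16-bit width is an assumption that matches the two-octet figure in the text, and UTF-16 is measured without a byte order mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class NumericEncodingSize {

    /** Octets needed to store the value directly as a 16-bit signed integer. */
    static int binaryLength(short value) {
        return ByteBuffer.allocate(Short.BYTES).putShort(value).array().length;
    }

    /** Octets needed to store the decimal text of the value in UTF-8. */
    static int utf8Length(short value) {
        return Short.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Octets needed to store the decimal text in UTF-16 (big-endian, no byte order mark). */
    static int utf16Length(short value) {
        return Short.toString(value).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        short value = -12345;
        System.out.println("16-bit binary: " + binaryLength(value) + " octets");
        System.out.println("UTF-8 text:    " + utf8Length(value) + " octets");
        System.out.println("UTF-16 text:   " + utf16Length(value) + " octets");
    }
}
```

These counts cover only the numeric value itself; in an XML document the surrounding element or attribute markup adds further octets.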

For human readability, many XML documents contain a significant amount of whitespace that is not used for machine-readable purposes. This particular issue is a good example of the tradeoff between usability and efficiency that can be made with XML. Whitespace is critical for humans working with XML data but significantly impacts file sizes and automated data transfer.

Memory Usage

XML decoding and encoding can be more memory-intensive than with binary equivalents. This particular impact can be lessened by using appropriate techniques to encode and decode XML.

There are several techniques by which XML can be encoded and decoded. Generally, most memory problems with XML can be addressed by making use of event-driven techniques. While object model techniques such as DOM are very natural for many developers, storing the entire XML model in memory while operating on it can have a significant memory impact. In-memory representations of XML objects can be many times the size of the original XML document. Several alternative parsing techniques are described below as discrete examples.

DOM (Document Object Model) Parsing

With DOM decoding, an XML library is used to build a set of XML-specific representations in memory that is then used to construct domain objects. In object-oriented languages, a DOM library typically includes objects representing the fundamental XML concepts such as Elements, Attributes, and Documents. The decoding software would typically take appropriate action (translate to a domain-specific object, perform an action, etc.) based on this in-memory XML model.

Figure 1 shows a simple example of DOM parsing in Java. The parser is first asked to build an XML object model, and this object model is then queried for its contents. In many cases this results in duplicate in-memory representations as the DOM objects are translated to domain-specific objects.

Figure 1 DOM Parsing Example
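The following is an illustrative sketch of the DOM approach described above, using the standard javax.xml.parsers API. The sample document, its element and attribute names, and the class name DomParsingExample are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomParsingExample {

    // Hypothetical sample data; not a real weather schema.
    static final String SAMPLE_XML =
          "<observations>"
        + "<metar station=\"KDEN\"><temperature>12.5</temperature></metar>"
        + "<metar station=\"KBOS\"><temperature>7.0</temperature></metar>"
        + "</observations>";

    /** Builds the full DOM tree, then queries it for the station attribute of each metar element. */
    static List<String> stationIds(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The entire document is materialized in memory as a tree of DOM nodes.
            Document document = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // The in-memory model is then queried for its contents.
            NodeList metars = document.getElementsByTagName("metar");
            List<String> stations = new ArrayList<>();
            for (int i = 0; i < metars.getLength(); i++) {
                Element metar = (Element) metars.item(i);
                // Copying values into domain objects creates the second,
                // duplicate in-memory representation mentioned in the text.
                stations.add(metar.getAttribute("station"));
            }
            return stations;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stationIds(SAMPLE_XML));
    }
}
```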

Simple API for XML (SAX) Parsing

SAX parsing is event-driven. Rather than the parser building an in-memory representation of the XML document which is passed to the developer, events are fired whenever the parser encounters the start of an element, a new attribute, or any other significant parsing event.

Here is an example of Java code to parse XML using SAX. For clarity, this example only includes event handling for when the opening tag of an element is detected:

Figure 2 SAX Parsing Example
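An illustrative sketch of SAX parsing with the standard javax.xml.parsers and org.xml.sax APIs follows. As in the description above, only the start-of-element event is handled. The sample document and the class name SaxParsingExample are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxParsingExample {

    // Hypothetical sample data; not a real weather schema.
    static final String SAMPLE_XML =
          "<observations>"
        + "<metar station=\"KDEN\"><temperature>12.5</temperature></metar>"
        + "<metar station=\"KBOS\"><temperature>7.0</temperature></metar>"
        + "</observations>";

    /** Returns element names in the order the parser reports their opening tags. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();

        // The parser pushes events to this handler; no document tree is built.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames(SAMPLE_XML));
    }
}
```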

Streaming API for XML (StAX) Parsing

StAX parsing is similar to SAX parsing in that it is also event-driven. However, instead of the parser pushing events to interested parties (as in SAX), these parties pull events by querying the StAX parser for the next event. This parsing model tends to give the developer more control over when events are handled, and retains the streaming/event-driven nature of SAX parsing.

Figure 3 StAX Parsing Example
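An illustrative sketch of pull-style parsing with the standard javax.xml.stream API follows. The calling code asks the reader for each event in turn rather than registering a handler. The sample document and the class name StaxParsingExample are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxParsingExample {

    // Hypothetical sample data; not a real weather schema.
    static final String SAMPLE_XML =
          "<observations>"
        + "<metar station=\"KDEN\"><temperature>12.5</temperature></metar>"
        + "<metar station=\"KBOS\"><temperature>7.0</temperature></metar>"
        + "</observations>";

    /** Pulls events from the reader and collects the station attribute of each metar element. */
    static List<String> stationIds(String xml) {
        List<String> stations = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // The caller controls the loop and decides when to request the next event.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "metar".equals(reader.getLocalName())) {
                    // A null namespace URI means the attribute namespace is not checked.
                    stations.add(reader.getAttributeValue(null, "station"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return stations;
    }

    public static void main(String[] args) {
        System.out.println(stationIds(SAMPLE_XML));
    }
}
```

Because the loop is owned by the caller, parsing can stop early (for example after the first matching element) without reading the rest of the stream.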

Processing

Processing efficiency can have a broad impact on both high-end server installations and mobile devices. Processing efficiency impacts can be broken into:

  • Encoding time – the amount of processing required to encode data files to be passed to another system component. In most cases this is the result of a data producer sending data to a data consumer.
  • Decoding time – the amount of processing required to decode or parse the data contents. In many cases this takes place when a data consumer is parsing the result from a data producer.

Mobile devices, in particular, are sensitive to processing efficiency. Increased processing work on mobile devices can have a significant impact on battery life. However, it is relatively infrequent for XML data to be processed on mobile devices, and instead the XML is processed into derived products such as images for consumption on mobile devices.

Even for data systems with few constraints on hardware, processing efficiency can have a cumulative impact on the time taken to pass data through the system. This is most notable in cases where a series of system components exchange data before it reaches its final destination.

Compactness

In many cases data compactness can have an even greater importance than processing. There are several specific consequences of poor data compactness.

Increased network bandwidth requirements

In high-end data systems, mobile devices, and dedicated aircraft devices, network bandwidth is of critical importance. Wide-area network (WAN) bandwidth can often be prohibitively expensive, and in some cases can drive fundamental system design decisions. The cost of WAN bandwidth can be one of the more significant ongoing expenses for data systems. It is also notable that improvements in processing (CPU) speed have historically far outstripped improvements in WAN speeds.

Increased storage requirements

Data storage is a fundamental driver for data staging and archiving use cases. Increased file sizes can also increase the processing work required to find and deliver data to downstream customers. Generally speaking, however, increased storage requirements are not a major cost or performance driver. In many cases they are offset by the low cost and good performance of storage devices, but they remain useful to consider when analyzing efficiency.

Increased data latencies

There are many scenarios where the delay in delivering data to consumers is a critical system consideration. This becomes particularly important when system components are chained together. In this case the time taken to pass data across the network accumulates: the increased data latency is multiplied by the number of systems participating in the data exchanges.

EXISTING WORK

Many analyses have taken place on how to overcome the efficiency problems with XML. Most of these have studied the general characteristics of XML across all domains, but there are several weather-specific studies of note.

W3C XML Binary Characterization Working Group (XBC WG)

The W3C convened the Binary Characterization Working Group ([2]) to collect use cases and gather requirements ([3]) for a more efficient XML encoding. This working group concluded that it is possible to address these requirements with an alternative XML encoding and that it is critically important that the W3C do so.

The critical properties identified by the XBC WG include:

  • Directly Readable & Writable
  • Transport Independence
  • Compactness
  • Human Language Neutral
  • Platform Neutrality
  • Integratable into XML Stack
  • Royalty Free
  • Fragmentable
  • Streamable
  • Roundtrip Support
  • Generality
  • Schema Extensions and Deviations
  • Format Version Identifier
  • Content Type Management
  • Self-Contained

These properties are defined and explained in detail in the XBC WG’s final report ([4]).

W3C XML Efficient XML Working Group (EXI WG)

The W3C convened the Efficient XML Interchange (EXI) Working Group ([5]) as a follow-on activity to define and measure the benefits of an alternative encoding of the XML information set (data model) to provide more efficient XML. This encoding must also meet the requirements defined by the Binary Characterization Working Group.

The EXI WG analyzed a number of solutions for efficient XML based on a broad set of use cases (as defined by the Binary Characterization WG) with a large and varied set of sample files. The WG subsequently published its test results and testing framework.

The EXI WG evaluated ([6]) nine different encoding alternatives. Based on the necessary and desirable properties, the EXI WG evaluated which formats met the minimum requirements to be a candidate format. A summary of the findings is reproduced here:

Format / Meets Minimum Requirements?
XML + GZIP / No
Fast Infoset / No
FXDI (Fujitsu Binary) / No
Efficient XML (AgileDelta) / Yes
Xebu / No
X.694 with BER / No
X.694 with PER / No
X.694 with PER + Fast Infoset / Yes
esXML / No

Figure 4: EXI WG Candidates

By way of example, XML + GZIP met neither the Compactness nor the Generality property. Definitions of the property types and explanations of the process and conclusions may be found in the EXI WG’s measurements note ([4]).

Based on their testing results, the EXI WG defined an alternative XML encoding called EXI, which was largely based upon AgileDelta’s EfficientXML format. The EXI format specification entered Candidate Recommendation status in late 2009 and is expected to advance to a final Recommendation.

MIT Lincoln Labs FastInfoset and EXI weather comparison

MIT Lincoln Labs performed a comparison of Sun’s implementation of Fast Infoset and AgileDelta’s EfficientXML. Note that EfficientXML is closely related to the W3C’s EXI format (as described in Section 2.2) but does not include several features eventually included in the final EXI format specification. This trial was performed with 134 XML cases. These files were of two types: NCML-GML and polar radar data. These weather-specific data files were placed within the W3C’s EXI Test Framework and a series of in-memory round-trip encode/decode operations were performed.

MIT LL concluded that Sun’s Fast Infoset and EfficientXML were comparable formats. EfficientXML produced more compact results (83.8% compactness vs 75% compactness), and Fast Infoset had better processing characteristics (86ms vs 207ms per run). It was judged that EfficientXML’s compactness advantage was the more important factor and that EfficientXML had the overall advantage.

NCAR Preliminary Sun’s FastInfoset Evaluation

The National Center for Atmospheric Research (NCAR) performed a weather-specific evaluation of the compactness and processing characteristics of Sun’s Fast Infoset for four common weather products:

  • Decoded AIR/SIGMETs
  • Decoded METARs
  • Decoded PIREPs
  • Decoded TAFs

Product / Count (# of records) / XML File Size / Fast Infoset File Size / XML Parsing Time / Fast Infoset Parsing Time
AIR/SIGMETs / 5 / 7kb / 3kb (0.43) / 18ms / 13ms (0.72)
METARs / 1481 / 1167kb / 373kb (0.32) / 84ms / 56ms (0.667)
PIREPs / 158 / 155kb / 51kb (0.33) / 29ms / 29ms (1.0)
TAFs / 177 / 471kb / 98kb (0.208) / 57ms / 39ms (0.684)

Figure 5: FastInfoset Weather Analysis

This evaluation used the Japex framework, which is related to the EXI WG testing framework. These assessments were performed without schema information. Averaged over all products, Sun’s Fast Infoset was found to reduce file size to roughly 33% of the original XML file size, and to reduce parsing (processing) time to roughly 75% of raw XML parsing time, consistent with the per-product ratios in Figure 5.
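The per-product ratios in Figure 5 can be recomputed from the table values with a short Java sketch. The class name and column layout are hypothetical. The average shown is an unweighted mean of the four per-product ratios, which is an assumption and may differ from the averaging method used in the evaluation itself.

```java
class FastInfosetRatios {

    // Columns: XML size (kb), Fast Infoset size (kb), XML parse (ms), Fast Infoset parse (ms).
    // Values are copied from Figure 5.
    static final double[][] ROWS = {
        {7, 3, 18, 13},        // AIR/SIGMETs
        {1167, 373, 84, 56},   // METARs
        {155, 51, 29, 29},     // PIREPs
        {471, 98, 57, 39},     // TAFs
    };

    static final int XML_SIZE = 0;
    static final int FI_SIZE = 1;
    static final int XML_PARSE = 2;
    static final int FI_PARSE = 3;

    /** Unweighted mean over all products of row[numerator] / row[denominator]. */
    static double meanRatio(int numerator, int denominator) {
        double sum = 0.0;
        for (double[] row : ROWS) {
            sum += row[numerator] / row[denominator];
        }
        return sum / ROWS.length;
    }

    public static void main(String[] args) {
        System.out.printf("mean size ratio:  %.2f%n", meanRatio(FI_SIZE, XML_SIZE));
        System.out.printf("mean parse ratio: %.2f%n", meanRatio(FI_PARSE, XML_PARSE));
    }
}
```

A ratio below 1.0 means Fast Infoset was smaller or faster than the plain XML equivalent for that measure.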

This evaluation concluded that there were considerable efficiency gains in both compactness and processing time when using Sun’s Fast Infoset. There was no case in which performance was worse. It was expected that further trials that provided Fast Infoset with schema information could improve compactness significantly.

NCAR Exificient (EXI) Library Compactness Assessment

Once it became clear that the W3C EXI WG was favoring EXI over Fast Infoset, NCAR did an assessment of compactness for four common products:

  • Decoded AIR/SIGMETs
  • Decoded METARs
  • Decoded PIREPs
  • Decoded TAFs

This evaluation used version 0.2 of the Exificient library for writing EXI data files. At the time of the assessment Exificient did not implement the full set of EXI features, and as such this assessment was intended as a measurement of an early version of Exificient rather than a set of EXI guidance metrics. The generated EXI files were compared to their original XML versions, and GZIPed versions of the original XML and of the EXI-encoded files were also measured.

The conclusion of this assessment was that over all four products the average Exificient/EXI compression was 0.13 of the original file sizes. By comparison, the GZIPed XML files averaged 0.07 of the original XML file sizes. GZIPing the EXI files achieved about the same level of compactness as GZIPing the original XML files.

Both schema-informed and schema-less EXI files were generated. All schema-informed versions were larger than their schema-less equivalents, which was attributed to limitations of the 0.2 version of Exificient. The Exificient library was found to correctly preserve the complete XML data model (including whitespace) in a roundtrip from XML -> EXI -> XML.

SOLUTION CLASSES

Data-agnostic compression

Data-agnostic compression includes techniques such as GZIP, ZIP, BZIP2, and LZMA. These utilities are used to compress the stream or file. This technique does not preserve the native XML document model, and as such the XML must be decompressed prior to being transformed, parsed, or worked with as an XML document.

Different data-agnostic compression techniques have differing tradeoffs between the compression achieved and the processing time required. GZIP is often considered a reasonable middle-ground solution that balances good compression with a minimal processing impact. BZIP2 often requires more processing time than GZIP, but offers better compression.

Generally speaking, data-agnostic compression addresses the compactness problem but requires additional processing time to compress and decompress the contents.
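As a minimal sketch of data-agnostic compression, the following Java code GZIPs an XML byte stream with the standard java.util.zip classes and restores it. The class name, helper names, and generated sample document are hypothetical; the sample is deliberately repetitive, so its compression ratio should not be read as representative of real weather data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipXmlExample {

    /** Builds a hypothetical, repetitive XML document with the given number of records. */
    static String sampleXml(int records) {
        StringBuilder xml = new StringBuilder("<observations>\n");
        for (int i = 0; i < records; i++) {
            xml.append("  <metar station=\"KDEN\"><temperature>12.5</temperature></metar>\n");
        }
        return xml.append("</observations>\n").toString();
    }

    /** Compresses the bytes with GZIP; the XML structure is not visible in the result. */
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the original bytes; this step is required before any XML processing. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = sampleXml(200).getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(original);
        System.out.println("original:   " + original.length + " octets");
        System.out.println("compressed: " + compressed.length + " octets");
    }
}
```

The compressed bytes cannot be queried or transformed as XML, so the extra compress and decompress steps are the processing cost traded for compactness.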

Hardware

Not long after XML was standardized, a family of hardware solutions emerged to address XML efficiency issues. These are typically referred to as “XML appliances”. In some cases the appliances are transparent to developers, and in other cases they require the use of custom libraries and drivers.