Structured Data Web: a Web-Like Method for Sharing and Interpreting Structured Information

Draft

The Confederation Web: A Net-Centric Data Service

For Large-Scale, Rapid Integration

Douglas E. Dyer, PhD

Active Computing

12 March 2005

Introduction

DARPA’s Integrated Battle Command program (IBC) aims to integrate an array of intelligent tools to help military commanders in battlefield engagements as well as in peace-keeping and stability operations where political, economic, and social factors can dominate. To support rapid integration of new components, IBC has developed the idea of a confederation, a loosely-coupled set of decision aids, visualization tools, and simulation models that speed understanding and assessment even for domains unfamiliar to warfighters. This document describes the simple architecture, motivated by the World-Wide Web, that makes a confederation possible: the Confederation Web. The Confederation Web is meant to be a structured version of the World-Wide Web, and once the requirements are carefully analyzed, most design choices follow directly from that great example.

Requirements

For command and control (and most other domains), practical integration of software components requires:

A simple method of publishing

Knowledge that software components and data exist

The ability to physically access information from available software components

A way to interpret/understand information provided by software components

A way to prevent unauthorized users from seeing sensitive information

Fast, reliable performance

Requirements Analysis and Design Choices for the Confederation Web

A simple method of publishing. Sharing information implies publication because information that is not public may not be accessed or shared. But what information should be published? In what format? And by what mechanism? And using which semantic definitions and relationships? In my experience, when several developers unite to build an integrated application, these questions can be very difficult to answer, and debate extends well into the development cycle. Defining needed functions is not very difficult and neither is allocating them to components. In contrast, creating a data model and an ontology usually takes months. Mechanisms and formats can be problematic because everyone has a vested interest in using the methods they know and prefer. The more tightly-coupled the system, the more difficulties arise. Also, the available development environment is dynamic. New data formats and service technologies always loom on the horizon, promising useful features and greater facility. Developers strive to stay in the mainstream while software development vendors have many economic incentives to keep changing the “stream” regardless of technical utility. All this uncertainty leads to delay, and the project’s growth may be stunted as a result.

To avoid uncertainty and delay, the Confederation Web specifies and provides the minimal infrastructure necessary to enable application developers to immediately publish whatever content they choose in whatever format they wish. The Confederation Web defines the information sharing infrastructure, the uniform indexing method used to find information, and certain metadata.

In the Confederation Web, the method of publishing is pre-defined. Using SQL and a database interface library, applications write information elements (variable values in context, plus metadata) to pre-defined relational database tables whose attributes match those of an information element. Let’s discuss this informally for a moment, first by describing an information element.

The essence of integration is sharing information, and basically, an information element is a self-contained unit of information. Its components are:

A variable

Its value in some context

Associated metadata in the same context

A key concept is context. For example, x is a variable and its value may be “4,” but in what context? Perhaps it’s in the context of the equation x + y = 13, or perhaps x refers to the distance, in centimeters, that a robot hand is from a door knob at some time t. Because there is a huge number of relevant variables in the world and an even greater number of contexts, it’s helpful to partition both variables and contexts. For variables, a convenient and practical method is to use the application associated with the variable. If so, then applications create namespaces. For example, the origin of an aircraft movement in SOFTools is different from the origin of a flight in Expedia. So the application name is an important part of a variable’s context in the Confederation Web. The rest of the context is associated with a problem-solving instance. For example, SOFTools may be used to author many different plans, and the variables for each plan are completely independent from those in other plans. Plans created by different authors are different as are plans for different purposes created by the same author. This suggests a method of defining a problem-solving instance. For information elements, we specify the problem-solving instance with the user identity (such as an email address) and a problem-solving instance number (integer) for that user. For example, the first time I book a flight on Expedia, the problem-solving instance is identified by a unique user identifier, “” and “1.” The second time I book a flight, the instance number is “2” and so on[1]. The application name, user identity, and user’s problem-solving instance number constitute a complete context for the variable in an information element. The context makes it possible to retrieve all information elements for a particular context while ignoring all other information elements.
The other parts of an information element include the variable’s value and associated metadata: the time the variable value was set, the source of the value (which could be the user, an agent, or an algorithm acting on the user’s behalf, an important distinction), and “more information,” which can be anything the developer wishes to insert but was originally intended for user-provided information.

To help make this more concrete, let’s look at an example. Here is an application used for travel planning:

The application name is “Travel” and let’s suppose the user is Doug Dyer, identified by email address, . The lower right part of the application indicates this is the first trip of three (1/3) that Doug has planned with this application. So, in the Confederation Web, the information for this trip could be found from something like:

application: Travel

userIdentity:

instance: 1

Suppose Doug began planning this trip in October when told there would be travel required to perform Experiment B at JFCOM in November. The complete information element for the variable “Visiting” might be:

application: Travel

userIdentity:

instance: 1

variable: Visiting

value: JFCOM

source: user

time: 1097525186[2]
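
As a minimal sketch, the information element above can be written as a small record type. The Python field names below are my own assumptions based on the parts listed earlier, the user identity is a placeholder, and the time value is assumed to be seconds since the Unix epoch (which is consistent with the October planning date mentioned above).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class InformationElement:
    # Context: the application namespace plus one problem-solving instance.
    application: str
    user_identity: str
    instance: int
    # The variable and its value in that context.
    variable: str
    value: str
    # Metadata recorded alongside the value.
    source: str                  # e.g. "user", "agent", or "algorithm"
    time: int                    # assumed: seconds since the Unix epoch
    more_information: str = ""   # free-form text chosen by the developer

    def context(self):
        """Return the (application, user, instance) triple that scopes the variable."""
        return (self.application, self.user_identity, self.instance)


element = InformationElement(
    application="Travel",
    user_identity="user@example.com",  # placeholder, not a real identity
    instance=1,
    variable="Visiting",
    value="JFCOM",
    source="user",
    time=1097525186,
)

# Under the Unix-epoch assumption, this timestamp falls in October 2004.
set_at = datetime.fromtimestamp(element.time, tz=timezone.utc)
print(element.context(), set_at.isoformat())
```

The context() helper returns exactly the three fields that identify a problem-solving instance, so all elements sharing that triple belong to the same trip.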

In the Confederation Web’s relational database this looks similar:

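+-------------+--------------+----------+----------+-------+--------+------------+------------------+
| application | userIdentity | instance | variable | value | source | time       | more information |
+-------------+--------------+----------+----------+-------+--------+------------+------------------+
| Travel      |              | 1        | Visiting | JFCOM | user   | 1097525186 |                  |
+-------------+--------------+----------+----------+-------+--------+------------+------------------+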

In the Confederation Web, this simple method is used to publish current values. Applications use the element table to store current variable values. SQL’s INSERT is used for new information elements and UPDATE is used for existing information elements whenever the value changes. Once an information element appears, any authorized, connected user may read it, but no one besides the publishing application may write it. To support a complete digital history, applications also INSERT information elements into another table, history, with the same schema as element. This enables anyone to query the history table, sorting results on time, to provide the change history for a particular information element. This is how applications share information[3].
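
The publishing steps just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Confederation Web schema: SQLite stands in for the relational database server, the column names and the primary key are my guesses from the parts of an information element, and the second published value is hypothetical.

```python
import sqlite3

# Assumed column names, taken from the parts of an information element.
COLUMNS = (
    "application TEXT NOT NULL, user_identity TEXT NOT NULL, "
    "instance INTEGER NOT NULL, variable TEXT NOT NULL, "
    "value TEXT, source TEXT, time INTEGER, more_information TEXT"
)
KEY_MATCH = "application = ? AND user_identity = ? AND instance = ? AND variable = ?"


def create_tables(conn):
    # element: one current row per variable in a given context.
    conn.execute(
        f"CREATE TABLE element ({COLUMNS}, "
        "PRIMARY KEY (application, user_identity, instance, variable))"
    )
    # history: same columns, but every published value is kept.
    conn.execute(f"CREATE TABLE history ({COLUMNS})")


def publish(conn, key, value, source, time, more_information=""):
    """Publish a value: UPDATE or INSERT into element, always INSERT into history."""
    row = key + (value, source, time, more_information)
    with conn:  # commit both tables together, or roll back on error
        updated = conn.execute(
            "UPDATE element SET value = ?, source = ?, time = ?, more_information = ? "
            f"WHERE {KEY_MATCH}",
            (value, source, time, more_information) + key,
        ).rowcount
        if updated == 0:
            conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)


def change_history(conn, key):
    """Return (time, value) pairs for one information element, oldest first."""
    return conn.execute(
        f"SELECT time, value FROM history WHERE {KEY_MATCH} ORDER BY time", key
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_tables(conn)
key = ("Travel", "user@example.com", 1, "Visiting")  # placeholder identity
publish(conn, key, "JFCOM", "user", 1097525186)
publish(conn, key, "SOCOM", "user", 1097611586)  # hypothetical later change
current = conn.execute(f"SELECT value FROM element WHERE {KEY_MATCH}", key).fetchall()
history = change_history(conn, key)
print(current, history)
```

After both calls, element holds only the latest value while history holds both, which is the split between current state and digital history described above.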

Because the only requirement for integration is writing to and reading from a relational database, developers are free to use any programming language, library, or database interface (e.g., ODBC). This reduces the need for developers to agree on infrastructure and allows them to innovate freely or stick with tested technologies. Because developers are vested in the programming languages, libraries, and technology they have chosen, freedom to use what they wish is a significant advantage. In addition, this freedom promotes innovation while providing system architects and managers the complete control and visibility needed to ensure a good integrated system.

Application developers are free to choose the content and formats they think are most useful to others. This is a key design decision justified in the Appendix, but it does not necessarily preclude decision by consensus. Developers may choose to create a community of interest and build a consensus on desired content and format; or they might consult only with their primary customer; or they may unilaterally decide these issues.

Some may worry that authorizing developers to dictate the information published will not satisfy system needs. This might be true if system architects and managers decide not to allow competition between developers---the old way of doing business[4]. With competition, no one has a monopoly on technology. Any component can be re-implemented, improved, and offered as an alternative. Knowing this, developers are likely to produce good products because if they don’t, someone else will come along and “build a better mouse trap.” If so, then there are multiple components from which to choose. Market forces can quickly eliminate bad choices, either by helping developers see the need for re-design or by “naturally selecting” against applications that don’t evolve to satisfy customers. Ultimately, market forces are stronger, more enduring, and more efficient than even the best centrally planned program. Computers can measure use patterns and market demand easily.

Semantics is typically a difficult issue. Trying to define a relatively complete domain ontology can take a long time and can delay progress if required before writing applications. Regardless of integration issues, every application naturally defines its own ontology (this is true in any event, regardless of whether the Confederation Web is used and whether the ontology is formally documented or not). When one application uses information from another, its ontology is extended. By integrating a set of applications, it’s possible to create a family of associated ontologies that more or less cover a domain of interest (again, whether or not the ontology is documented). It’s often useful to document the ontology to gain greater insight, make applications more closely represent the domain, and facilitate the development of new functions and new applications. However, developers are not generally domain experts and thus have trouble defining the ontology initially. Luckily, the rapid prototyping approach implies frequent feedback from users familiar with the domain. Rather than trying to define a complete ontology a priori, I advocate using technology to assist users and developers in discovering and documenting the ontology iteratively, changing the application to reflect new insight. As is the case for most other constructive works, ontologies improve with iterations.

Following the example of the World-Wide Web, the Confederation Web makes publishing simple. Simple ideas include reliance on mature relational database technology and the notions of an information element, a pre-defined database schema, and authority for application developers to unilaterally select the content and format published. The intent is to support bottom-up development. Rather than waiting for agreement on a top-down design (which tends to arrive pretty late in the development schedule), developers can publish immediately. A simple, pre-defined method of publication helped the World-Wide Web attain exponential growth, and the same method will help the Confederation Web grow too.

Knowledge that software components and data exist. A common problem identified by the DOD’s Net Centric Data Strategy is that users and developers generally do not know all the software components, services, and data elements that are available. Given the rate of technical progress, number of new commercial products, and number of Government programs aimed at component development and database upgrades, this is not at all surprising. We need technology to help find software components and data applicable to our current problem. Luckily, it’s easy to translate search engine technology used to help us manage unstructured information[5] to help manage structured information as well. The single-server version of the Confederation Web already has a “search engine for structured information.” This tool can help in the following ways:

Identify relevant applications and data

Find power-users based on use frequency and recency and access them via email

Provide semantic clues on the meaning of variables and functions of applications

Monitor the software development process

The search engine for structured information is documented in a paper on the Active Computing web site. By exploiting the special structure available in an information element, this search engine can help anyone find and understand the tools and data they need to solve a problem. From a developer’s perspective, the search engine helps find reusable components and useful data. From a project manager’s view, the search engine makes it easy to track development progress. All of these benefits accrue from applying search to the Confederation Web[6].
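
The search engine itself is documented elsewhere, so the following is only a rough sketch of the kind of queries that the structure of an information element makes possible. It assumes a history table with the column names shown, uses SQLite as a stand-in database, and runs on hypothetical rows; the function names find_variables and power_users are mine.

```python
import sqlite3


def find_variables(conn, term):
    """Return (application, variable) pairs whose variable name or values mention term."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT DISTINCT application, variable FROM history "
        "WHERE variable LIKE ? OR value LIKE ? ORDER BY application, variable",
        (pattern, pattern),
    ).fetchall()


def power_users(conn, application, limit=5):
    """Rank an application's users by number of published values, then by recency."""
    return conn.execute(
        "SELECT user_identity, COUNT(*) AS uses, MAX(time) AS last_used "
        "FROM history WHERE application = ? GROUP BY user_identity "
        "ORDER BY uses DESC, last_used DESC LIMIT ?",
        (application, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (application TEXT, user_identity TEXT, "
    "instance INTEGER, variable TEXT, value TEXT, time INTEGER)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?, ?, ?)",
    [  # hypothetical rows with placeholder identities and times
        ("Travel", "a@example.com", 1, "Visiting", "JFCOM", 100),
        ("Travel", "a@example.com", 2, "Visiting", "DARPA", 300),
        ("Travel", "b@example.com", 1, "Visiting", "SOCOM", 200),
        ("SOFTools", "b@example.com", 1, "origin", "Base A", 250),
    ],
)

matches = find_variables(conn, "origin")
ranked = power_users(conn, "Travel")
print(matches, ranked)
```

Because user identities are email addresses, the ranked list doubles as a contact list for the people most likely to understand an application.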

The ability to physically access information from available software components. Because the Confederation Web is implemented with relational databases, physical access is possible from anywhere on the net. To query, the only requirement is access information (e.g., database server, user, password) and an appropriate index for finding information elements. To find a particular information element, the Confederation Web uses a “URL for structured information.” For the single-server version of the Confederation Web, the URL for structured information is the variable name, application, user, and user’s problem-solving instance. For a multi-server version of the Confederation Web, it’s necessary to define the database server as well by IP address or name. The URL for structured information provides a uniform index that, with the relational database server, enables authorized users to get information from the Confederation Web from anywhere on the net using any appropriate client software.
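
To make the “URL for structured information” concrete, here is a minimal sketch that treats it as a four-part key with an optional server field. The StructuredURL type, the fetch_value function, the column names, and the “Departing” variable are illustrative assumptions, and an in-memory SQLite database stands in for a Confederation Web server.

```python
import sqlite3
from typing import NamedTuple, Optional


class StructuredURL(NamedTuple):
    application: str
    user_identity: str
    instance: int
    variable: str
    server: Optional[str] = None  # host name or IP; only for multi-server use


def fetch_value(conn, url):
    """Look up the current value named by url, or None if nothing is published."""
    row = conn.execute(
        "SELECT value FROM element WHERE application = ? AND user_identity = ? "
        "AND instance = ? AND variable = ?",
        url[:4],  # the server field selects the connection, not the row
    ).fetchone()
    return None if row is None else row[0]


# In-memory stand-in for a single Confederation Web server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element (application TEXT, user_identity TEXT, "
    "instance INTEGER, variable TEXT, value TEXT)"
)
conn.execute(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?)",
    ("Travel", "user@example.com", 1, "Visiting", "JFCOM"),  # placeholder identity
)

url = StructuredURL("Travel", "user@example.com", 1, "Visiting")
found = fetch_value(conn, url)
missing = fetch_value(conn, url._replace(variable="Departing"))
print(found, missing)
```

In a multi-server deployment, the server field would be used to choose which database connection to open before the same query is run.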

A way to interpret/understand information provided by software components. Typically, software developers provide some form of descriptive meta-data to help other people understand the domain of interest, data definitions, and software functions. Descriptive meta-data include source code comments, user and system documentation, data dictionaries, schemas, models, and ontologies. All of these require effort on the part of the developer to produce, and I refer to them as “semantics by declaration.” The Confederation Web has a table for storing and serving descriptive meta-data. However, the Confederation Web also enables a second method which requires no a priori effort on the part of developers but arises naturally as software applications are used. “Semantics by example” exploits example values to gain insight into the meaning of a variable. To understand this concept, consider the Travel application introduced earlier. One variable in that application is visiting and its meaning might reasonably be any of at least three things:

A person
A place (physical location)
An organization

However, suppose you find these example values for visiting:

University of Pittsburgh
IBM Almaden Institute
SOCOM
DARPA

Clearly, these examples do not refer to a person, and they don’t seem to refer to a physical location (for example, DARPA could choose to move its facility to a different office building). For these reasons, most people would be able to use these example values to infer that visiting refers to an organization, not a person or a place. This understanding, arising from inference based on example values, is “semantics by example.”

As the Confederation Web is used, a large number of example values will be stored. After a short time, we can expect the Confederation Web to include all possible values for many variables that have enumerated values. For these variables and perhaps others, the range of possible values available in the Confederation Web should result in valuable insight into the meaning of each variable. Moreover, the distribution of values in the context of other relevant variables should provide a sense of normal behavior, thus facilitating recognition of abnormal, and possibly erroneous, information.
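
One simple way to surface example values is to group stored values by variable and count them. The sketch below assumes a history table with the column names shown and uses SQLite as a stand-in; the values are the examples from the text, while the counts and the helper names example_values and is_unseen are hypothetical.

```python
import sqlite3


def example_values(conn, application, variable):
    """Return (value, count) pairs observed for a variable, most frequent first."""
    return conn.execute(
        "SELECT value, COUNT(*) AS n FROM history "
        "WHERE application = ? AND variable = ? "
        "GROUP BY value ORDER BY n DESC, value",
        (application, variable),
    ).fetchall()


def is_unseen(conn, application, variable, value):
    """True if value has never been published for this variable."""
    seen = {v for v, _ in example_values(conn, application, variable)}
    return value not in seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (application TEXT, variable TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO history VALUES ('Travel', 'Visiting', ?)",
    [  # values from the text; the repeat of DARPA is a hypothetical count
        ("University of Pittsburgh",),
        ("IBM Almaden Institute",),
        ("SOCOM",),
        ("DARPA",),
        ("DARPA",),
    ],
)

observed = example_values(conn, "Travel", "Visiting")
flagged = is_unseen(conn, "Travel", "Visiting", "Tuesday")
print(observed, flagged)
```

A value that has never been seen before is not necessarily an error; a check like is_unseen only marks it as worth a second look.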

For many applications, just knowing the associated variables can facilitate understanding the function of the application. For example, if variables include origin and destination, then movement is implied. Knowing variable values in the context of a particular problem is similar to having a case (as in case-based reasoning) and enables hypotheses about the ontological relationships that apparently exist between variables.

“Semantics by example” is a powerful method for understanding applications and their variables. It requires no effort from developers, arising instead through normal use of applications. It can be used separately or in conjunction with the more traditional “semantics by declaration.”

A way to prevent unauthorized users from seeing sensitive information. Security is important in many domains including military command and control. One form of security is preventing unauthorized users from gaining access to sensitive information. Applications operating on client workstations and the Confederation Web servers are assumed to be under physical control, but what about the network? If the network is unprotected, hackers may be able to get sensitive information by sniffing packets or by simply accessing the database. Both of these problems are solved using off-the-shelf technologies. To prevent packet sniffing, network traffic between applications and the relational database hosting the Confederation Web should be encrypted using Secure Sockets Layer (SSL). To prevent direct access to the database, we can just use the database’s access control scheme, which generally involves user accounts and passwords protecting databases, tables, and even columns. The database normally has logging capability, which enables manual analysis as well as a variety of AI-based agents for checking out-of-norm or sensitive queries, providing another layer of protection for at least detecting hackers. These standard methods are difficult to improve upon. Adding components, services, and interfaces may add complexity without any real security benefit. The more complex the interfaces and services provided, the more difficult it is to ensure the system is secure. Using a small, well-defined interface makes it easier to analyze and address possible methods of attack.