To the Data Architecture Working Group, Enterprise Architecture Shared Interest Group, Industry Advisory Council
Introduction
Two years ago, John Dodd of CSC invited me to join the Data Architecture Sub-committee because I have had extensive experience with data model patterns, and it was believed that I could provide some insights that the sub-committee might find useful. At that time, I was contracted to the CIA, so it was particularly appropriate for me to participate as a government representative, even though my area of expertise goes beyond the government arena.
I no longer work for the CIA, but I do believe that the original premise—that I can contribute to the discussions of data model structures—still applies. John Dodd has graciously continued to sponsor my participation in the meetings as part of the Industry Advisory Council. Moreover, as a concerned taxpayer, it is as important to me now that the government get its act together as it was when I was a government contractor. I am therefore pleased to contribute what I can.
I realize that I have been remiss in participating in the discussions of the working group to this point, so this is probably too late to be in version 5 of the DRM, but I would like to submit some thoughts for version 6. I do believe that the Data Description section could be elaborated on usefully and that I can contribute to that elaboration.
The Data Reference Model
The Data Reference Model organizes the data “problem” into three parts:
· Data Description – This includes
· structured data, the organized description of data to convey semantic understanding, usually through an entity relationship diagram,
· unstructured data, data that are of more free-form format, such as multimedia files, images, sound files, or unstructured text, and
· semi-structured data that have characteristics of both structured and unstructured data, such as an e-mail.
· Data Sharing – Information that is generated or required by a Unit of Work and is subsequently passed to another Unit of Work. Units of Work consume and produce data.
· Data Context – This includes
· business subject areas that are the broad categories of data that support business processes of a line of business or community of interest, and
· information categories that are the sub-categories of data used for mapping data groupings of many lines of business or communities of interest.[1]
A great deal has been done to address Data Sharing: The Justice Department developed an XML Schema model (the Global Justice XML Data Model, or GJXDM), and subsequently, in cooperation with the Department of Homeland Security, enhanced it to create the National Information Exchange Model (NIEM).
To me, Data Description is concerned with the fundamental, conceptual structure of the information. To address this is to determine the underlying meaning of the things of significance to each agency, and how these things are related to each other.
This is the area that could use more work in the next version of the DRM. At the very least, we need clearer definitions of just what is involved in describing structured and unstructured data. Better would be to publish some patterns to provide standard forms for describing standard agency phenomena. An entity/relationship diagram represents the meaning of a comprehensive set of the things of significance to an agency, about which that agency wishes to hold information. It is about the structure of an agency’s information.
Data Context refers to categories, for the purpose of locating data. This touches more on data management issues than the other two parts. Based on the definitions cited above, this category overlaps significantly with the assignment for Data Description, but as I see it, this is less about how data are described, and more about how they are used by functional areas in the government.
XML is a good language for describing transactions to be shared between Agencies and Departments. It is a terrible language for describing the underlying structure of data. To describe the structure of data so that it can be discussed by Agency managers, a more aesthetically appropriate vehicle is required.
In particular, I recommend a notation for creating such a model that is as aesthetically simple as possible. In my experience, the entity/relationship modeling notation developed by Richard Barker and Harry Ellis is ideal for this purpose. Because of the wide variety of data requirements across the various federal agencies, the idea of creating a conceptual entity/relationship model that will somehow meet all their needs appears to be a hopeless prospect. This is probably true, if, by “conceptual entity / relationship model” you mean a specific description of data structures that somehow describe all these agencies in detail. There is an alternative, however.
More significantly, I propose a specific set of data model patterns that can be applied to any agency as a starting point for the development of more specialized models. As described below in detail, these are in terms of fundamental concepts—people and organizations, geography, physical objects, and activities—as well as some basic combinations of these concepts.
Attributes are treated in a dynamic way, allowing them to be specified as data for specific agency models, rather than as fundamental components of the models.
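One possible reading of "attributes as data" is sketched below in Python. The class names (AttributeDefinition, Entity) and the "badge number" example are assumptions made purely for illustration; they are not taken from the DRM or from any published pattern. The point is only that an agency adds a new attribute by adding data, not by changing the model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttributeDefinition:
    """An attribute described as data (hypothetical name, for illustration)."""
    name: str
    data_type: type


@dataclass
class Entity:
    """A thing of significance whose attribute values are held as data."""
    entity_class: str
    values: dict = field(default_factory=dict)

    def set_value(self, definition: AttributeDefinition, value) -> None:
        # Reject values that do not match the type declared in the definition.
        if not isinstance(value, definition.data_type):
            raise TypeError(
                f"{definition.name} expects {definition.data_type.__name__}"
            )
        self.values[definition.name] = value

    def get_value(self, definition: AttributeDefinition):
        # Returns None when the attribute has never been set for this entity.
        return self.values.get(definition.name)


# An agency-specific attribute, added without changing any class above.
badge_number = AttributeDefinition("badge number", str)
inspector = Entity("Person")
inspector.set_value(badge_number, "A-1047")
```

In this sketch the set of attributes can differ from agency to agency while the classes themselves stay fixed, which is the property the pattern is after.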
A few years ago, the DAS commissioned a “Person model” to be developed. Unfortunately, it is a good example of why this is hard. Rather than focusing on defining clearly what characteristics and structures might be used to describe a person across agencies, it included a few attributes and then attached Person to structures that represented many different functional areas.
The U-Core model is more focused, and as such, is a good beginning to this effort, but for reasons that I go into below, it is not quite up to the job.
U-Core
The U-Core model (shown in Figure 1, below) is certainly a respectable first cut to address the issues of data structure for the Federal Government. The problem is that it combines things from different levels of abstraction in ways that don’t quite fit.
First of all, it is a mish-mash of several different kinds of models:
· There is an entity/relationship model of “who”, “what”, “where” and “when”.
· There is something called a “metadata” model, but it is tautologous. There is one entity called metadata, each instance of which must be either “message metadata”, “package metadata”, or “content metadata”, with no indication as to what any of these things are.
· There is in fact actual metadata, consisting of “entity”, and “relationship”, plus abstractions “collection” and “thing”.
· There is a model of “message framework”, which is about the mechanics of sharing, not the meaning of data.
So, yes, there is an e/r portion of the model with the categories “who”, “what”, “where”, and “when”.
“Who” is very similar to my model of people and organizations. What is called there “Agent”, I call “Party”. The problem with the word “Agent” is that it implies not just the existence of the Person or the Organization, but also its purpose, even though there is no information there (nor should there be) as to what that purpose is. In addition, I am not quite sure what the difference is between an “organization” and a “group”. I’ve never needed to have anything other than “Organization”. The sub-types and their names vary from company to company, but in my experience, these are the most common:
· Company (corporation, partnership, or sole-proprietorship)
· Department (internal organization of some sort)
· Government Agency (such as the FDA or California Department of Motor Vehicles)
· Government (of the United States, California, or Iowa)
· Household
· Other (There is always “other”)
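The Party supertype and the common Organization sub-types listed above could be sketched roughly as follows. This is a minimal Python illustration, not a definitive model: carrying the sub-types as an enumeration is my assumption for brevity, and since the names vary from company to company, a real model might hold them as data instead.

```python
from dataclasses import dataclass
from enum import Enum


class OrganizationType(Enum):
    """The common sub-types listed above (illustrative enumeration)."""
    COMPANY = "Company"
    DEPARTMENT = "Department"
    GOVERNMENT_AGENCY = "Government Agency"
    GOVERNMENT = "Government"
    HOUSEHOLD = "Household"
    OTHER = "Other"


@dataclass
class Party:
    """Supertype: a person or organization, with no purpose implied."""
    name: str


@dataclass
class Person(Party):
    pass


@dataclass
class Organization(Party):
    # Defaults to OTHER because there is always an "other".
    organization_type: OrganizationType = OrganizationType.OTHER


fda = Organization("FDA", OrganizationType.GOVERNMENT_AGENCY)
```

Note that nothing in Party says what the person or organization is for; that is left to roles, which are modeled separately.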
Figure 1: The U-Core Model[2]
The problem with U-Core’s definition of “where” is that it confuses the concept of “location”, which is an identifiable (bounded) place on the earth, with that of “address” or “site”, which is a place with a purpose. Neither a physical address nor a “cyberaddress” (what I call a “virtual address”, below) is a sub-type of location. A physical address, for example, contains relationships to at least four (geopolitical) locations:
· City
· State (or province)
· Postal area
· Country.
In fact, the “street address” itself is actually a reference to a kind of Geographic Area: “Surveyed Area”, measured in terms of lot boundaries.
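The distinction between an address and a location can be made concrete with a short Python sketch. All names here (GeographicArea, PhysicalAddress, AreaType) and the sample street and postal code are placeholders I have assumed for illustration. What matters is that the address is related to several geographic areas and is not itself a kind of geographic area.

```python
from dataclasses import dataclass, field
from enum import Enum


class AreaType(Enum):
    CITY = "city"
    STATE = "state or province"
    POSTAL_AREA = "postal area"
    COUNTRY = "country"


@dataclass(frozen=True)
class GeographicArea:
    """An identifiable, bounded place on the earth."""
    name: str
    area_type: AreaType


@dataclass
class PhysicalAddress:
    """A place with a purpose. Deliberately not a subclass of GeographicArea."""
    street_text: str
    located_in: list = field(default_factory=list)

    def area_of_type(self, area_type: AreaType):
        # Return the related area of the requested type, or None if absent.
        for area in self.located_in:
            if area.area_type is area_type:
                return area
        return None


# Placeholder data: one address related to four geopolitical areas.
home = PhysicalAddress(
    "123 Example Street",
    [
        GeographicArea("Houston", AreaType.CITY),
        GeographicArea("Texas", AreaType.STATE),
        GeographicArea("77002", AreaType.POSTAL_AREA),
        GeographicArea("United States", AreaType.COUNTRY),
    ],
)
```

A fuller version would also relate the street address to a surveyed area, as described above; that is omitted here to keep the sketch short.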
Having “what” described as “what type” is meaningless. To be sure, there are a lot of different names that can be applied to physical stuff: “product”, “material”, “item”, “physical asset”, and so forth. But the concept should be made clear.
U-Core’s idea of “when” is not unreasonable. In my models I usually finesse the question, because any entity class that occurs in time has appropriate attributes to describe that. Typically these are “effective date” and “until date”, but others are useful as well.
The problem is that “time” is not an entity—a thing of significance. It is an abstraction that is effectively a characteristic of—at right angles to—our other things. In dimensional databases we have to deal with it explicitly, but in describing the nature of an organization, I would keep it at the attribute level.
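Keeping time at the attribute level might look like the following Python sketch. The Employment class and the sample names and dates are hypothetical, and treating the "until date" as inclusive is an assumption on my part; the source says only that "effective date" and "until date" are typical attributes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Employment:
    """Hypothetical time-bound entity; time is carried as attributes."""
    person_name: str
    organization_name: str
    effective_date: date
    until_date: Optional[date] = None  # None means still in effect

    def in_effect_on(self, day: date) -> bool:
        # Assumption: both the effective date and the until date are inclusive.
        if day < self.effective_date:
            return False
        return self.until_date is None or day <= self.until_date


job = Employment(
    "Pat Example", "Example Agency", date(2005, 1, 3), date(2007, 6, 29)
)
```

There is no separate "time" entity here: each entity that occurs in time simply carries its own dates.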
It is telling that the U-Core model does not address “why” at all.
In short, the U-Core model is neither consistent nor coherent. It does not represent the meaning of a comprehensive set of the things of significance to an agency, about which that agency wishes to hold information. By combining metadata with “business-level” data, it has given short shrift to both sets.
My Patterns
The fact of the matter is that I have already created a set of patterns that really are universal, coherent, and useful. I have focused clearly on the concept of “level of abstraction”, so that at each level they are all describing the same kinds of things. It is my intention to publish them as an update to my book, Data Model Patterns: Conventions of Thought. I have presented them at several data administration professional society meetings and received very positive responses.
Drawing on the thinking in my first book and twenty years of experience using patterns to build enterprise models in many different industries (and the CIA), I have divided them into four basic categories:
· People and organizations – These are the individuals and groups of individuals (collectively known as “parties”) that are the actors in all that happens in an agency. The roles they play are various, but they can be modeled explicitly, separate from defining the identifiable parties.
· Geography – This is mostly about explicitly identified, bounded areas on the earth, be they “geopolitical areas”, “management areas”, “surveyed areas”, or “natural areas”. By extension, it is also about the geographic points used to define them (identified by latitude, longitude, and elevation), geographic solids, such as oil reservoirs, and geographic lines, such as highway routes and telephone line routes. Note that the geopolitical area “Iraq” is very different from the organization that is a “government” of such an area. And (as will be described further, below) the city of Houston is a very different kind of thing than my home that happens to be in Houston.
· Physical assets – These are all the products, materials, pieces of equipment, furniture, and such that make up the physical world in which we live and work. Many Government Agencies do not manufacture anything (although I am sure some do), but they all certainly use and consume physical assets.
· Activities (and events) – What do we do? What events cause us to do that?
Each of these categories of entity classes, by itself, is a relatively stand-alone subject area. But note that, at this level of abstraction, they apply to any government agency. The particulars of an agency then are largely instances of (data about) these concepts. It is true that a real model of a real agency will have all sorts of things added to the central entity classes, but these categories provide a basis for adding those intelligently.
Note that the first four categories represent the “who” (people and organizations), “where” (geography), “what” (physical assets) and “how” (activities and events) of any agency. To describe the “why” of an agency, it is necessary to define at least three categories that are combinations of the first four:
· Roles – Most of the attributes people are inclined to add to the entity classes Person and Organization are in fact roles being played with respect to other things. These include, for example, “Employment”, “Contract Roles” (including “customer” and “vendor”), “Geographic Roles” (e.g., a government’s “jurisdiction” over a state or a country), “Activity roles”, and so forth. Roles define why people and organizations do what they do.
· Sites, addresses, and facilities – A site is a mechanism for locating someone or something. A synonym for “site” is “address”. This can be a “physical address”, described by a street, city, state, and so forth, or a “virtual address” such as a telephone number, a web address, or an e-mail address. Note that a physical address is not the same thing as a geographic area, since the latter is simply a bounded place on the earth, while the former is “a place with a purpose”. An instance of an address is not an instance of (a sub-type of) a (geographic) location. Indeed, it must be identified as being related to (located in) more than one geographic area (a city, state, postal area, and country). It is not the same thing as any of those. More significantly, a physical site is a geographic place where people and organizations make use of physical assets to carry out activities. In this sense, a physical address is also what in some organizations is called a “facility”. The concept of facility describes the bringing together of all the elements that make a company or a government agency what it is.