Flawless Logical to Physical Data Model Transformations
Bert Scalzo, PhD.
Database Domain Expert
Quest Software
Data Modeling has been around for more than 35 years, and yet data modeling issues remain as one of the top reasons why many database implementations fail. Sometimes they fail from a total lack of having been modeled al all (or as an afterthought), and just as often for having been modeled incorrectly. It’s this latter scenario this paper will try to address. We’ll review the top 10 most common modeling issues faced when transforming data models from logical (i.e. business requirements) to physical (i.e. implementation).
One key thesis to remember is that data modeling software does not alleviate users from knowing data modeling theory and good form. Very much like word processing software, today’s data modeling tools simply assist with the task at hand. If you don’t understand the English language, sentence construction, paragraph construction, word meaning and numerous other linguistic issues, then a word processor will simply help you write some gibberish much quicker. Sure it may offer some tools such as a spell checker, but all that would do is permit sentences such as this:
The buoy eight the mouse for brake fast.
Now some more advanced word processors may also offer some basic sentence checking capabilities that would flag the prior example as a “sentence fragment” (i.e. lacks a verb), one could correct that and yet still have a meaningless sentence such as this:
The buoy ate the mouse for brake fast.
Microsoft Word 2003 does not flag this sentence at all, yet we all see just how bad it is. The same is true with data modeling – the software can perform some basic checks, but cannot guarantee that it makes sense. And that failure can occur along two dimensions: effectiveness (i.e. business requirements) and efficiency (i.e. service level agreements). So we cannot become overly reliant on data modeling tools to prevent mistakes, lest we simply crank out bad databases faster.
One of the key areas where mistakes are commonly made is during the logical to physical data model transformation process. So let’s examine the ten most common issues for this critical process.
1. Many to Many Relationships
It’s not uncommon during the logical modeling process to encounter many to many relationships such as this:
So why does this happen? Quite often because when interviewing business people, they assume that we all know the business equally well. Therefore it’s not uncommon to omit some critical business facts as superfluous information that everyone simply knows.
But we cannot implement many to many relationships in a relational database! You can’t have foreign keys in both CSUTOMER and PRODUCT pointing to each other. In other words you cannot be a parent and child at the same time. Thus many data modeling tools will automatically attempt to resolve this as follows:
But does this really reflect genuine business realities? If we just keep asking lots of dumb questions, we may well find that there are additional entities and attributes that result in a more complex and more accurate data model such as this:
Failure to ask such questions to obtain such additional details will result in a failure to meet the fundamental business requirements (i.e. effectiveness). And there will be no way to optimize or tune your way out of this trap.
2. Avoid Partial Unique Keys
Many data modeling tools will blindly permit the DBA to make questionable changes in the physical model – under the assumption that the human is the expert who knows best.
One such example is allowing partial unique keys – which can sometimes surprise people on how the database handles such a construct.
Unique identifiers are all candidates to be the primary unique identifier, and as such must meet all the same criteria. Yet it’s not uncommon to see people try to make the alternate unique identifiers somehow less robust – for example by permitting optional portions of the value such as shown below.
This is simply put wrong logically – and some data modeling tools’ check reports will find such mistakes. But let’s assume we leave it in place and build a physical model off this example, which would result in the following physical model:
So how does your database platform handle this? You may very well be quite surprised. Oracle for example does not enforce multi-column constraints on rows where some of the columns contain null values! That means Oracle will permit duplicate rows even though you have a live unique key constraint. And duplicate rows generally mean compromised business requirements (i.e. effectiveness). And we cannot tune nor optimize plain wrong.
3. Avoid Candidate Key Loss
Many data modeling tools will blindly permit the DBA to make questionable changes in the physical model – under the assumption that the human is the expert who knows best.
One such example is allowing candidate key loss. And quite often this technique is too liberally applied during the tuning and optimization process for seemingly good reason.
Below we have a logical model where business requirements dictate four different ways to uniquely identify an occurrence of the data. So we have a primary unique identifier and three alternates. Seems simple enough – right?
Most data modeling tools will create an initial physical design such as shown here:
So what can happen to compromise this? Well it’s not uncommon for DBA’s to look for indexes to drop as part of the tuning and optimization exercise – since every live index basically doubles the IO necessary for a DML operation. But if we drop indexes IX2, IX3 and IX4 to improve the DML IO efficiency – we’ll totally compromise effectiveness.
4. Avoid Surrogate Key Loss
Many data modeling tools will blindly permit the DBA to make questionable changes in the physical model – under the assumption that the human is the expert who knows best. Hmmm, that’s the third time I’ve stated this – so it seems like a pattern is emerging J
One such example is allowing surrogate key loss. And quite often this technique is too liberally applied during the tuning and optimization process for seemingly good reason.
Below we have a logical model where business requirements dictate that the MOVIE RENTAL entity’s data is uniquely identified by its relationship to both its parent entities (i.e. dependent relationships). This is a fairly common occurrence in logical models.
Most data modeling tools will create an initial physical design such as shown here:
It’s not uncommon for the DBA to replace expensive composite primary keys such as the above with a simpler and more efficient surrogate (i.e. artificial) key such as shown here:
So what’s wrong with this? Well we do have a unique primary key, but look – we now no longer enforce the business requirement that any given Rental ID, Customer ID, and Movie ID are unique. The correct model would have been as follows:
But that means two indexes instead of just one. So what can happen to compromise this? Well it’s not uncommon for DBA’s to look for indexes to drop as part of the tuning and optimization exercise – since every live index basically doubles the IO necessary for a DML operation. But if we drop the AK1 index to improve DML IO efficiency – we’ll totally compromise effectiveness.
5. Avoid Partial Foreign Keys
This is probably the single most frustrating issue – since it seems to happen all the time.
Referential Integrity requires that for each row of a child table, the value in the foreign key exactly matches the value in the parent’s primary or unique key. Note this does not say partially – the assumption is completely.
Note that ENTITY1 has a composite primary unique identifier. So when converted to a correct physical model, we get the following:
But some data modeling tools will permit the DBA to make the following change:
Look what happened – we made a portion of the composite column foreign key optional! Remember the earlier issue with partial unique keys? Well for some database platforms this is the same problem – partial constraints are not enforced. So for Oracle, rows that contain some null and some not null values are simply not enforced. That means bogus data.
Yet developers ask for this all the time – and DBA’s seem to capitulate far too easily L
6. Avoid Indirect Foreign Keys
Sometimes a seemingly innocuous and harmless design change can easily backfire.
Here we have a simple logical model with three entities – where there is no business requirement to relate ENTITY1 to ENTITY3.
Most data modeling tools will create an initial physical design such as shown here:
However sometimes developers will need to code a SELECT statement where they use attributes from ENTITY1 and ENTITY3 – but not from ENTITY2. So they may ask a seemingly reasonable question – why do I have to do a three way join for data from just two tables? In other word, why can’t we treat the physical model and database is if the following were true – that ATTRIBUTE1 can point to both parents.
Well if the developer simply codes it – no real damage done. But if the DBA actually implements the additional foreign key for RELATIONSHIP3, then we’ve traded off DML performance – because we’ll now have an additional and very expensive check performed to verify that ENTITY3 data points to legitimate data for both parents. I designate this as a very expensive because we’re essentially performing an extra check per DML operation that by the very nature of the inherited foreign key columns will always pass! If it will always pass – why do it?
7. Avoid Bogus Foreign Keys
This is probably the second most frustrating issue – since it too seems to happen all the time. It’s especially an issue when working with legacy system data. But it’s just plain simply wrong.
Referential Integrity requires that for each row of a child table, the value in the foreign key exactly matches the value in the parent’s primary or unique key. Note this does not say that foreign keys relate back parent non-key columns.
Some data modeling tools will permit the following design:
Where the relationship’s foreign key is defined like this:
We’ve made two mistakes here. First, we’ve dropped the properly inherited column ATTRIBUTE1 which was the legitimate foreign key. Second, we’ve made another ENTITY2 attribute point back to a non-key back in ENTITY1 L
This is just so wrong on so many levels that it’s tough to discuss rationally. Yet I see this mistake nearly everywhere and every time. And people argue vehemently to be permitted to do this. I just don’t get it. I’m not arguing for academic perfection for its own sake. I simply cannot understand what relating a table to another via a non-key column means?
8. Problematic Relationships
Just as with people and their relationships, data models can sometimes bring together relationships that appear reasonable – but that simply cannot work. Fortunately most of these patterns are fairly easy to spot. We’ll examine the more common ones.
Here’s the first example:
Seems reasonable at first, doesn’t it? Is this not a simple employee to manager type of relationship? Yes, but we’ve made a critical recursive mistake – there is no bottom. I don’t manage anyone at Quest, so there is no one below me.
The second example is simply the opposite:
Now we have no top. Most CEO’s do not report to anyone - expect maybe their spouse J
And let’s not forget my personal favorite – the age old question of which came first:
Don’t laugh. Such circular logic can occur more easily than one would think. I’ve seen people work around this last scenario by using deferrable referential integrity constraints (i.e. the constraints on each side do not fire until a COMMIT occurs).
Finally there’s the question of Unification (and that’s not some crazy church headed by the reverend sun moon). Here we have complex logical model with multiple dependent relationships being combined at yet a deeper level:
Most data modeling tools will create an initial physical design such as shown here:
So when we translate this into a physical model – how do we handle ENTITY4? Will there be just one occurrence of the inherited ATTRIBUTE1 or two? When in doubt, always ask the business. Don’t simply make a “gut call” decision on what seems like such a slam dunk, no brainer. Never let the tail (technology) wag the dog (business).
9. Using Normal Forms
One of my favorite movie lines is from Mel Brook’s “Young Frankenstein”, when the good doctor asks Igor whose brain he stole – and he replies “Abby” something. That’s when the doctor learns that he’s put an abnormal brain into his creation.
Why did I bring that up? Because I’ve yet to see a data model where the designers claim they’re in third normal form stand up to even just cursory inspection. I’m not preaching that normalization is always appropriate or what level is best. I’m merely stating that if one claims to be in a given state, it should in fact be true.
Thus normalization is another key area data modeling where some confusion or lack of complete understanding can compromise either efficiency or effectiveness. Let’s review some normalization facts and examples.
Normalization Quick Facts:
· Goal minimize data redundancy in order to lessen likelihood of inconsistent data
· Side effect of reducing data storage needs
· But is this important given today’s cheap disk…
· Useful primarily in OLTP and ODS database designs
· Normal forms are cumulative (e.g. 2NF includes 1NF)
· Easy jingle to remember: “Depends on the key, the whole key, and nothing but the key – so help me Codd”
Here are the three most common normal forms and examples of their violations.
1NF – Attributes are all single valued, there are no repeating groups or arrays.