Normalisation

  • Why Do We Use This Process?
  • Minimises The Duplication Of Data
  • The Normalised Data Enables The Data Model To Be Mapped Onto A Wide Variety Of Different Database Designs
  • The Final Tables In 3rd Normal Form Provide The Flexibility To Enable Data To Be Extracted Efficiently
  • Examining The Normalisation Process Using Engineers And Projects
  • Going From Un-Normalised (UNF) To First Normal Form (1NF)
  • Going From First Normal Form (1NF) To Second Normal Form (2NF)
  • Going From Second Normal Form (2NF) To Third Normal Form (3NF)

One of the principle objectives of relational databases is to ensure that data is not replicated unnecessarily - almost all the data in the database should/will be held only once.

For example, if we hold a list of customers then the customer data (other than its key) is represented only once throughout all the tables of the organisation. The reasons for this are to minimise the amount of space required to store the database and, to simplify maintenance. The formal procedure to arrange tables to meet this objective is called Normalisation.

Normalisation is a bottom up approach that is, it starts with all the data and then proceeds to group the data into bigger parts (entities/tables).

Why do we need both ERM and normalisation? You may read this tutorial and come to the conclusion that normalisation alone would suffice as it is the process by which the data is organised into relevant tables and keys are assigned. However, before the process of normalisation can begin we need to organise data into groups, this is especially important when we have a large database to design. As you go through the examples here and others on this website you will note that data is grouped together even at UNF (you will learn what this is as you go!) and this would be the result of ERM so the two go hand in hand.

Why Do We Use This Process?

  • Minimises the duplication of data
  • The normalised data enables the data model to be mapped onto a wide variety of different database designs
  • The tables in 3rd normal form (there are three stages to normalisation shown here though you can go up to seven) provide the flexibility to enable data to be extracted efficiently

1. Minimises the Duplication of Data

1 Table - 1 Employee 1 Project

Imagine that we have a database which has to store information about an employee and the marketing project he/she is working on. If we started off with one table to manage the one-to-one relationship between Employee and Project we could create the following:

EmployeeProject

Emp_ID / Emp_FName / Emp_SName / Proj_Code / Proj_Title
12 / Tom / Jones / S03 / Sainsburys
34 / Kyle / Anderson / A21 / Allied Carpets
56 / Mike / Murray / I11 / Iceland
78 / John / Smith

Note: Employee ID is included as the primary key field as neither the first name nor surname for an employee could be considered unique.

The problems here are:

  • John Smith is not working on a project so his project fieldsare blank
  • If we delete off an employee we will, in effect, be deleting off a project as well

2 Tables - 1 Employee 1 Project

Amend the table (entity) names according to what they now represent (hold information on).

To maintain a link between the two tables (so we know what project an employee is working on) we need to leave the project code in Employee. Project code is a primary key in Project but when used in Employee it is called a foreign key; this is because it is used elsewhere (i.e. in Project) as a key and thus relates to a specific Project.

The ERD would look like this:

Employee

Emp_ID / Emp_FName / Emp_SName / Proj_Code
12 / Tom / Jones / S03
34 / Kyle / Anderson / A21
56 / Mike / Murray / I11
78 / John / Smith

Project

Proj_Code / Title
A21 / Allied Carpets
I11 / Iceland
S03 / Sainsburys

The problems here are:

  • John Smith is not working on a project code is blank
  • What happens if an employee is working on more than one project (but no more than one employee is assigned to a specific project)?

2 Tables - 1 Employee Many Projects

In this section we are still assuming that each employee works on one or more projects but that a project is run by a single employee.

For example, if we knew that it was possible for an employee to work on up to three project's then using our existing design (and amending the data to show employees on a number of projects) we would have the following tables:

Employee

Emp_ID / Emp_FName / Emp_SName / Proj_Code / Proj_Code / Proj_Code
12 / Tom / Jones / A10 / S03 / T01
34 / Kyle / Anderson / A21
56 / Mike / Murray / F02 / I11
78 / John / Smith

Project

Proj_Code / Title
A10 / Aldi
A21 / Allied Carpets
F02 / FreezerWorld
I11 / Iceland
S03 / Sainsburys
T01 / Tesco

The problems here are:

  • Several employees are not working on three projects so some fields are blank
  • What happens if we want to increase the number of projects that each employee can work on?

3 Tables –1 Employee Many Projects

We need to include an extra table to represent the relationship showing 'the assignment of' an employee to a project (this could be called something likeAssignToProj).

Remember: An employee works on one or more projects but that a project is run by a single employee.

So the correct table structure would be:

Employee

Emp_ID / Emp_FName / Emp_SName
12 / Tom / Jones
34 / Kyle / Anderson
56 / Mike / Murray
78 / John / Smith

Project

Proj_Code / Title
A10 / Aldi
A21 / Allied Carpets
F02 / FreezerWorld
I11 / Iceland
S03 / Sainsburys
T01 / Tesco

AssignedToProj

Emp_ID / Proj_Code
12 / A10
12 / S03
12 / T01
34 / A21
56 / F02
56 / I11

Why do the Emp_ID and Proj_Code have solid lines when they are foreign keys (primary keys in another table)?

Because in the new AssignedToProj table they are joined together to form a new primary key (whilst they also are foreign keys to their respective tables); If you look closely you will see that each line (record) is unique.

From this we see that there is still duplication of data - namely the primary keys are duplicated in the assignment table (and would be in any table which acted as a foreign key in order to show a relationship). However, the benefits are better management of information and the removal of unnecessary duplication (redundancy).

What do we mean by redundancy?
Well, in the above example breaking the tables down into individual entities resulted in the removal of the unnecessary duplication of the attributes (fields) Proj_Code and Proj_Title in the Project table. In a large database this information could have been repeated literally hundreds of times and therefore we have removed this through using the above process - freeing up space and simplifying our data model (the entities and their relationships representing the real world). Other attributes would also have been in the same situation and would have been corrected at the same time.
When we convert an ERD to a table using this process we are said to be normalising the data by minimising the amount of duplication. However, to complete the job we need the process of Normalisation to achieve the most efficient data structure for our database removing most (if not all) redundancy.
Note: We cannot avoid duplication but we can avoid redundancy!

3 Tables - Many Employees Many Projects

What if we wanted to amend our data so thateach project has one or more employees working on it at the same time and each employee works on one or more projects?

Reviewing the Single Table Problem

We already know the problems with a single table design:

  • deleting an employee would could potentially deleteany projects they were working on
  • if the table is designed for three employees – unless ALL employees are working on three projects there will be blank fields – and this gets worse if you want four projects and change the table design!

Emp_ID / Emp_FName / Emp_SName / Proj_Code / Proj Title / Proj_Code / Proj Title / Proj_Code / Proj Title
12 / Tom / Jones / A10 / Aldi / S03 / Sainsburys / I11 / Iceland
34 / Kyle / Anderson / A21 / Allied Carpets
56 / Mike / Murray / I11 / Iceland / S03 / Sainsburys
78 / John / Smith

Note: We know that if we split the design above into Employee and Project tables it is not going to be enough to overcome the design problems (see below):

Employee

Emp_ID / Emp_FName / Emp_SName / Proj_Code / Proj_Code / Proj_Code
12 / Tom / Jones / A10 / S03 / I11
34 / Kyle / Anderson / A21
56 / Mike / Murray / I11 / S03
78 / John / Smith

Project

Proj_Code / Title
A10 / Aldi
A21 / Allied Carpets
I11 / Iceland
S03 / Sainsburys

So again we need to include an extra table to represent the relationship which is 'the assignment of' an employee to a project:

So the correct ERD and table structure would be:

Employee

Emp_ID / Emp_FName / Emp_SName
12 / Tom / Jones
34 / Kyle / Anderson
56 / Mike / Murray
78 / John / Smith

Project

Proj_Code / Title
A10 / Aldi
A21 / Allied Carpets
F02 / FreezerWorld
I11 / Iceland
S03 / Sainsburys
T01 / Tesco

AssignToProj

Emp_ID / Proj_Code
12 / A10
12 / S03
12 / I11
34 / A21
56 / I11
56 / S03

2. The Normalised Data Enables the Data Model to Be Mapped Onto a Wide Variety of Different Database Designs

Using ERD's and Normalisation allows you to design database (table) structures which are software independent and this enables greater flexibility when implementing a system. Other factors will then influence the choice of package or bespoke solutions such as price and scalability.

3. The Final Tables in 3rd Normal Form Provide the Flexibility to Enable Data to Be Extracted Efficiently
In the process of Normalisation there are seven key stages - at this level however, we will only consider the first three. Before the data goes through this process it is said to be Un-Normalised. Then it is checked and if it doesn't match the criteria of the first stage of normalisation it is altered. When it has gone through this process it is said to be in its first normal form (1NF). This process carry's on until the data matches the criteria for the third level. The whole process is illustrated below:

Un-Normalised Form (UNF)
Remove repeating groups of data (either multiple values in a single field or multiple fields relating to a single group). The data is now in...
First Normal Form (1NF)
If a primary key is made up of more than one field, we must make sure that all the other fields in that group (entity) are functionally dependent upon the whole of the primary key (all primary key fields). The data is now in...
Second Normal Form (2NF)
Remove any fields that are dependent upon each other but not on the key field (this relationship is known as transitive dependencies). The data is now in...
Third Normal Form (3NF)

As we stated earlier, the normalisation process assumes that you start with some informal description of all the data attributes that the database application appears to require (Un-Normalised data) which are grouped together.

For example, the employees name fields (broken up into first name and surname) will be next to the employees address fields and not 'scattered' throughout the UNF list of fields.

Also, remember when you constructed ERD's, you did so with the help of enterprise rules to help establish the number of entities and their cardinality and optionality. Similarly, when you are given a Normalisation question you will be supplied with a scenario from which you can gather vital clues about how your data relates to the Normalisation process.

Examining the Normalisation Process Using Employees and Projects

If we presume that we have a database containing the records of employees working for a marketing company. An employee can be working on (assigned to) a number of different projects - each of which has a different identifying number. The project part of the employee record is repeated for any employee working on more than one project. The Un-Normalised data would be as follows (this would work the same if the project details were first rather than the employees):

Employee (Emp_ID#, Emp_FName, Emp_SName,Proj_Code, Proj_Title)

Notice in the above that:

  • The entity is written in the singular;
  • The attributes (fields) of the entity are placed in brackets with the primary key underlined;
  • The # symbol is short for a numeric attribute (field);

Going From Un-Normalised (UNF) To First Normal Form (1NF)

The first stage is the removal of repeating items, showing them grouped together by the creation of a new entity. If we look again at the data we considered earlier for employees working on marketing projects we can see that the project is the repeating group as each two employees here are working on more than one project.

Note: The group on the left hand side (the employee) is NEVER the repeating group

Emp_ID / Emp_FName / Emp_SName / Proj_Code / Proj_Title
12
/ Tom / Jones / S03
A21
I11 / Sainsburys
Allied Carpets
Iceland
34 / Kyle / Anderson / A21 / Allied Carpets
56
/ Mike / Murray / I11
S03 / Iceland
Sainsburys

The attribute used as a key from the original entity will still need to be included as this is used to provide the link between the two entities. So in this case we have the original key field as 'Emp_ID#' and this must be included under the assignment record entity ‘AssignedToProj.
Data in first normal form 1NF

Employee(Emp_ID#, Emp_FName, Emp_SName)
AssignedToProj(Emp_ID#, Proj_Code, Proj_Title)

Going From First Normal Form (1NF) To Second Normal Form (2NF)

To go from 1NF to 2NF, the entities containing more than one key must be examined to check if each non key attribute relates to only one part of the key. In our example, 'Title' relates to only one part of the key, that is the attribute 'Proj_Code'. When this happens the attribute (field) is removed along with its key attribute and transferred to form a new entity, which in this case is called ‘Project’. The attribute used as a key from the original entity will still need to be included as this is used to provide the link between the two entities.
Data in second normal form (2NF)

Employee (Emp_ID#, Emp_FName, Emp_SName)
AssignedToProj(Emp_ID#, Proj_Code)
Project(Proj_Code, Proj_Title)

Note: The key field 'Proj_Code' remains in the ‘AssignedToProj’table as well in order to link the ‘Project’ table with the ‘AssignedToProjtable.

Going From Second Normal Form (2NF) To Third Normal Form (3NF)

To reduce the data to third normal form, entities must be examined to see if any of the data is mutually dependent (that is, two non-key fields are dependent on each other) but not on the primary key. Mutually dependent items are removed to a separate entity, leaving behind one of the items in the original entity to use as a foreign key to link to the newly created entity. In this example there are none, so the data is the same as shown in 2NF.

Data in third normal form (3NF)

Employee (Emp_ID#, Emp_FName, Emp_SName)
AssignedToProj(Emp_ID#, Proj_Code)
Project(Proj_Code, Proj_Title)

Compare this with the final ERD we produced on page 8

As you can see we have the same number of entities with the same relationships.

1