Data Warehouse Concepts

What is a Data Warehouse?
According to Inmon, author of several books on data warehousing, "A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process".

Example: Consider a bank that has several branches in several countries, millions of customers, and two lines of business: savings and loans. Over the years, application designers in each branch have made their own decisions about how an application and its database should be built. As a result, the source systems differ in naming conventions, variable measurements, encoding structures, and physical attributes of data. The following example shows how data from such source systems is integrated into a target system.

Example of Source Data

System Name / Attribute Name / Column Name / Datatype / Values
Source System 1 / Customer Application Date / CUSTOMER_APPLICATION_DATE / NUMERIC(8,0) / 11012005
Source System 2 / Customer Application Date / CUST_APPLICATION_DATE / DATE / 11012005
Source System 3 / Application Date / APPLICATION_DATE / DATE / 01NOV2005

In the example above, the attribute name, column name, datatype, and values differ from one source system to another. This inconsistency can be resolved by integrating the data into a data warehouse that follows consistent standards.

Example of Target Data (Data Warehouse)

Target System / Attribute Name / Column Name / Datatype / Values
Record #1 / Customer Application Date / CUSTOMER_APPLICATION_DATE / DATE / 01112005
Record #2 / Customer Application Date / CUSTOMER_APPLICATION_DATE / DATE / 01112005
Record #3 / Customer Application Date / CUSTOMER_APPLICATION_DATE / DATE / 01112005

In the above example of target data, attribute names, column names, and datatypes are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
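
The integration step above can be sketched in Python. This is only an illustration under stated assumptions: it assumes that the values from Source Systems 1 and 2 are in MMDDYYYY order and that the target value is shown as DDMMYYYY, which is consistent with the tables above but not stated in them. The function name and the system labels are hypothetical.

```python
from datetime import date

# Month abbreviations as used by Source System 3 (e.g. 01NOV2005).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def to_application_date(system, value):
    """Convert one source value to a date for CUSTOMER_APPLICATION_DATE."""
    if system in ("source_1", "source_2"):
        # Assumption: MMDDYYYY. Source System 1 stores a number, so a
        # leading zero may be lost (1012005) and is padded back here.
        text = str(value).zfill(8)
        return date(int(text[4:]), int(text[:2]), int(text[2:4]))
    if system == "source_3":
        # DDMONYYYY, e.g. "01NOV2005".
        text = str(value).upper()
        return date(int(text[5:]), MONTHS[text[2:5]], int(text[:2]))
    raise ValueError(f"unknown source system: {system}")


source_rows = [
    ("source_1", 11012005),
    ("source_2", "11012005"),
    ("source_3", "01NOV2005"),
]

# Every record reaches the target in one consistent format (DDMMYYYY).
target_values = [to_application_date(system, value).strftime("%d%m%Y")
                 for system, value in source_rows]
print(target_values)
```

Converting to a real date type first, and formatting only for display, keeps the target column independent of how each source happened to encode the value.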

Figure 1.12: Data Warehouse Architecture

Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.

In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - An ODS has a broad, enterprise-wide scope but, unlike an enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of a data warehouse that supports a particular region, business unit, or business function.

Data warehouses and data marts are built on dimensional data modeling, in which fact tables are connected to dimension tables. This model makes data easy for users to access, because the database can be visualized as a cube with several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse. In an independent data mart, data is collected directly from the sources.

Figure 1.12: Data Warehouse and Data Marts

General Information
In general, an organization exists to earn money by selling a product or by providing a service for a product. An organization may operate from a single location or have several branches.

When we consider an organization that sells products throughout the world, the four major dimensions are product, location, time, and organization. Dimension tables are explained in detail in the section Dimensions. We use this example to explain the star schema in detail.

What is Star Schema?
Star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more fact tables and their dimension tables. It is called a star schema because the entity-relationship diagram resembles a star: a large fact table sits at the center and is connected to multiple dimension tables around it. The advantages of a star schema are easy slicing of data, better query performance, and easy understanding of the data.

Steps in designing Star Schema

Identify a business process for analysis (like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension (region name, branch name).
Determine the lowest level of summary in a fact table (sales dollar).
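
The design steps above can be sketched as table definitions. The following uses Python's built-in sqlite3 module purely for illustration; all table and column names are hypothetical, and the grain chosen here (one product, location, day, and employee per row) is an assumption rather than something the text prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension, each with a primary key and the columns
-- that describe it (hypothetical names).
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    city_name    TEXT,
    state_name   TEXT,
    country_name TEXT
);
CREATE TABLE time_dim (
    time_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    month_name    TEXT,
    calendar_year INTEGER
);
CREATE TABLE organization_dim (
    organization_key INTEGER PRIMARY KEY,
    employee_name    TEXT,
    branch_name      TEXT,
    region_name      TEXT
);

-- The sales fact table stores the measure (sales_dollar) plus one
-- foreign key per dimension; the foreign keys form the composite key.
CREATE TABLE sales_fact (
    product_key      INTEGER NOT NULL REFERENCES product_dim(product_key),
    location_key     INTEGER NOT NULL REFERENCES location_dim(location_key),
    time_key         INTEGER NOT NULL REFERENCES time_dim(time_key),
    organization_key INTEGER NOT NULL
        REFERENCES organization_dim(organization_key),
    sales_dollar     REAL NOT NULL,
    PRIMARY KEY (product_key, location_key, time_key, organization_key)
);
""")

# The fact table points at every dimension, which gives the star shape.
referenced = sorted(
    row[2] for row in conn.execute("PRAGMA foreign_key_list(sales_fact)")
)
print(referenced)
```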

Important aspects of Star Schema & Snowflake Schema

In a star schema, every dimension has a primary key.
In a star schema, a dimension table does not have any parent table.
In a snowflake schema, by contrast, a dimension table has one or more parent tables.
In a star schema, the hierarchies for a dimension are stored in the dimension table itself.
In a snowflake schema, the hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.

Glossary:

Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation.
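
As an illustration of the Hierarchy and Level entries above, the following Python sketch rolls month-level values up to the Quarter level and then to the Year level. The sales figures and the roll_up helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical month-level data: (year, month) -> sales dollar.
monthly_sales = {
    (2005, 1): 100.0, (2005, 2): 150.0, (2005, 3): 50.0,
    (2005, 4): 200.0, (2005, 11): 300.0,
    (2006, 1): 120.0,
}


def roll_up(values, parent_of):
    """Aggregate values from one level to its parent level."""
    totals = defaultdict(float)
    for key, amount in values.items():
        totals[parent_of(key)] += amount
    return dict(totals)


# Month -> Quarter: months 1-3 belong to quarter 1, 4-6 to quarter 2, ...
quarterly_sales = roll_up(
    monthly_sales, lambda key: (key[0], (key[1] - 1) // 3 + 1)
)
# Quarter -> Year: drop the quarter number.
yearly_sales = roll_up(quarterly_sales, lambda key: key[0])

print(quarterly_sales)
print(yearly_sales)
```

Reading the same chain in the opposite direction (Year to Quarter to Month) is the drill path mentioned in the Hierarchy entry.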

Example of Star Schema: Figure 1.6

In the example in Figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. The data can be sliced along any dimension and aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated for each dimension independently or for several dimensions in combination, as listed below.

Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee
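
The four calculations listed above can be sketched as GROUP BY queries over a small star schema. The example below uses Python's sqlite3 module; the table names, column names, sample rows, and the sales_by helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, calendar_year INTEGER);
CREATE TABLE organization_dim (organization_key INTEGER PRIMARY KEY,
                               employee_name TEXT);
CREATE TABLE sales_fact (product_key INTEGER, location_key INTEGER,
                         time_key INTEGER, organization_key INTEGER,
                         sales_dollar REAL);

INSERT INTO product_dim VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO location_dim VALUES (1, 'Boston'), (2, 'Chicago');
INSERT INTO time_dim VALUES (1, 2005), (2, 2006);
INSERT INTO organization_dim VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1000), (1, 1, 2, 2, 500),
    (1, 2, 1, 1, 700),  (2, 1, 1, 2, 300);
""")


def sales_by(*columns):
    """Sum sales_dollar grouped by the given dimension columns."""
    cols = ", ".join(columns)
    sql = f"""
        SELECT {cols}, SUM(f.sales_dollar)
        FROM sales_fact f
        JOIN product_dim p ON p.product_key = f.product_key
        JOIN location_dim l ON l.location_key = f.location_key
        JOIN time_dim t ON t.time_key = f.time_key
        JOIN organization_dim o ON o.organization_key = f.organization_key
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(sql).fetchall()


by_product = sales_by("product_name")
by_product_location = sales_by("product_name", "city_name")
by_product_year_location = sales_by("product_name", "calendar_year",
                                    "city_name")
by_all = sales_by("product_name", "calendar_year", "city_name",
                  "employee_name")

print(by_product)
print(by_product_location)
```

Each additional dimension in the GROUP BY narrows the slice, while the measure being summed stays the same.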

Snowflake Schema

A snowflake schema is a star schema structure normalized through the use of outrigger tables, i.e., dimension table hierarchies are broken into simpler tables. The star schema example had 4 dimensions (location, product, time, and organization) and a fact table (sales).

In the snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables, and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME, respectively) and shown separately. In OLAP, the snowflake approach increases the number of joins and so tends to slow down data retrieval. A few organizations normalize their dimension tables to save space, but because dimension tables occupy relatively little space, the snowflake approach can often be avoided.
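
The extra join can be seen in a small sketch that models the product dimension both ways, using Python's sqlite3 module. Table names, column names, and data are hypothetical; only the category hierarchy is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category hierarchy is stored in the dimension table itself.
CREATE TABLE product_dim_star (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category is broken out into a lookup (parent) table.
CREATE TABLE category_lookup (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim_snow (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES category_lookup(category_key));

CREATE TABLE sales_fact (product_key INTEGER, sales_dollar REAL);

INSERT INTO product_dim_star VALUES
    (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile'), (3, 'Tablet', 'Mobile');
INSERT INTO category_lookup VALUES (10, 'Computers'), (20, 'Mobile');
INSERT INTO product_dim_snow VALUES
    (1, 'Laptop', 10), (2, 'Phone', 20), (3, 'Tablet', 20);
INSERT INTO sales_fact VALUES (1, 1000), (2, 300), (3, 200);
""")

# Star schema: one join reaches the category name.
star_sql = """
    SELECT d.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim_star d ON d.product_key = f.product_key
    GROUP BY d.category_name ORDER BY d.category_name
"""

# Snowflake schema: the same question needs a second join.
snow_sql = """
    SELECT c.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim_snow d ON d.product_key = f.product_key
    JOIN category_lookup c ON c.category_key = d.category_key
    GROUP BY c.category_name ORDER BY c.category_name
"""

star_result = conn.execute(star_sql).fetchall()
snow_result = conn.execute(snow_sql).fetchall()
print(star_result)
print(snow_result)
```

Both queries return the same totals; the snowflake version trades the repeated category name in each product row for one more join per hierarchy level.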

Example of Snowflake Schema: Figure 1.7

Fact Table
The central table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

In the example in Figure 1.6, "Sales Dollar" is a fact (measure) and it can be added across several dimensions. Fact tables store different types of measures: additive, non-additive, and semi-additive.

Measure Types

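Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across any dimension (for example, ratios and percentages).
Semi Additive - Measures that can be added across some dimensions but not across others (for example, an account balance can be added across accounts but not across time).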
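
A short Python sketch of the three measure types, using hypothetical bank data: a deposit amount (additive), a month-end balance (semi-additive), and an interest rate (non-additive).

```python
# Hypothetical rows:
# (account, month, deposit_amount, month_end_balance, interest_rate_pct)
rows = [
    ("A", "Jan", 100.0, 100.0, 2.0),
    ("A", "Feb",  50.0, 150.0, 2.0),
    ("B", "Jan", 200.0, 200.0, 3.0),
    ("B", "Feb",   0.0, 200.0, 3.0),
]

# Additive: deposits can be summed across accounts and across months.
total_deposits = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across accounts for one month...
feb_balance_all_accounts = sum(r[3] for r in rows if r[1] == "Feb")
# ...but summing one account's balance across months is meaningless;
# the last balance in the period is used instead.
balance_a_summed_over_time = sum(r[3] for r in rows if r[0] == "A")
balance_a_latest = [r[3] for r in rows if r[0] == "A"][-1]

# Non-additive: adding interest rates gives no useful number in any
# direction; an average (or a recomputed ratio) is reported instead.
feb_rates = [r[4] for r in rows if r[1] == "Feb"]
feb_average_rate = sum(feb_rates) / len(feb_rates)

print(total_deposits, feb_balance_all_accounts)
print(balance_a_summed_over_time, balance_a_latest, feb_average_rate)
```

Here account A ends February with a balance of 150, while the sum over time (250) corresponds to no real amount of money.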

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).

In the real world, it is possible to have a fact table that contains no measures or facts. Such tables are called factless fact tables.
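
A minimal sketch of a factless fact table, using Python's sqlite3 module. The scenario (recording which customer visited which branch) and all names are hypothetical. The table has no measure column, so analysis is done by counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY,
                           customer_name TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);

-- Factless fact table: keys only, no numeric measure.
CREATE TABLE branch_visit_fact (
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    visit_date   TEXT
);

INSERT INTO customer_dim VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO branch_dim VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO branch_visit_fact VALUES
    (1, 1, '2005-11-01'), (1, 1, '2005-11-02'),
    (2, 1, '2005-11-01'), (2, 2, '2005-11-03');
""")

# Each row records that an event happened; counting rows answers
# "how many visits did each branch receive?"
visits_per_branch = conn.execute("""
    SELECT b.branch_name, COUNT(*)
    FROM branch_visit_fact v
    JOIN branch_dim b ON b.branch_key = v.branch_key
    GROUP BY b.branch_name
    ORDER BY b.branch_name
""").fetchall()
print(visits_per_branch)
```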

Steps in designing Fact Table

Identify a business process for analysis (like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension (region name, branch name).
Determine the lowest level of summary in a fact table (sales dollar).

Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6

In the example in Figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. The measure "Sales Dollar" in the sales fact table can be added across each dimension independently or across several dimensions in combination, as listed below.

Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee

Database - RDBMS
There are a number of relational databases available for storing data. A relational database contains normalized data stored in tables. Tables contain rows (records) and columns. Each row contains a unique instance of data for the categories defined by the columns, and an RDBMS makes it easy to work with individual records.

RDBMSs are used very frequently in OLTP applications (e.g. ATM cards), and a data warehouse may also use a relational database. Please refer to Relational data modeling for details on how data from a source system is normalized and stored in RDBMS databases.

Popular RDBMS Databases

RDBMS Name / Company Name
Oracle / Oracle Corporation
IBM DB2 UDB / IBM Corporation
IBM Informix / IBM Corporation
Microsoft SQL Server / Microsoft
Sybase / Sybase Corporation
Teradata / NCR

What are ETL Tools?
ETL tools are meant to extract, transform, and load data into a data warehouse for decision making. Before ETL tools evolved, the ETL process was done manually with SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding, and more work hours. On top of that, maintaining the code posed a great challenge for the programmers.

ETL tools reduce these difficulties. Compared to the old method, they offer advantages in all stages of the ETL process: extraction, data cleansing, data profiling, transformation, debugging, and loading into the data warehouse.

There are a number of ETL tools available in the market to process data according to business and technical requirements. Some of them are listed below.

Popular ETL Tools

Tool Name / Company Name
Informatica / Informatica Corporation
DT/Studio / Embarcadero Technologies
DataStage / IBM
Ab Initio / Ab Initio Software Corporation
Data Junction / Pervasive Software
Oracle Warehouse Builder / Oracle Corporation
Microsoft SQL Server Integration Services / Microsoft
TransformOnDemand / Solonde
Transformation Manager / ETL Solutions

ETL Tools: What to Learn?
With the help of ETL tools, we can create powerful target data warehouses without much difficulty. The following are the topics that we need to learn in order to use an ETL tool.

Software:
» How to install the ETL tool on the server/client?

Working with an ETL Tool:
» How to work with various options like designer, mapping, workflow, scheduling, etc.?
» How to work with sources like DBMS, relational source databases, files, ERPs, etc., and import the source definitions?
» How to import data from data modeling tools, applications, etc.?
» How to work with targets like DBMS, relational target databases, files, ERPs, etc., and import the target definitions?
» How to create target definitions?
» How to create mappings between source definitions and target definitions?
» How to create transformations?
» How to cleanse the source data?
» How to create a dimension, slowly changing dimensions, cube, etc.?
» How to create and monitor workflows?
» How to configure, monitor, and run the debugger?
» How to view and generate metadata reports?

ETL Concepts
ETL stands for extraction, transformation, and loading. It refers to the methods involved in accessing and manipulating source data and loading it into a target database.

The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.
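
The three steps can be sketched in plain Python. This is a simplified illustration, not how an ETL tool implements them; the source column names, the mapping, and the cleansing rules are hypothetical.

```python
# Step 1: mapping between source columns and target columns.
MAPPING = {
    "CUST_NM": "customer_name",
    "CUST_CITY": "city_name",
    "BAL": "balance",
}

# Hypothetical rows extracted from a source system.
source_rows = [
    {"CUST_NM": "  alice ", "CUST_CITY": "boston", "BAL": "100.50"},
    {"CUST_NM": "Bob", "CUST_CITY": " CHICAGO", "BAL": "n/a"},
    {"CUST_NM": "", "CUST_CITY": "Denver", "BAL": "20"},
]


def cleanse(row):
    """Step 2: trim whitespace; reject rows that cannot be repaired."""
    cleaned = {column: value.strip() for column, value in row.items()}
    if not cleaned["CUST_NM"]:
        return None  # missing customer name
    try:
        float(cleaned["BAL"])
    except ValueError:
        return None  # balance is not numeric
    return cleaned


def transform(row):
    """Step 3: rename columns per the mapping and standardize values."""
    target = {MAPPING[column]: value for column, value in row.items()}
    target["customer_name"] = target["customer_name"].title()
    target["city_name"] = target["city_name"].title()
    target["balance"] = float(target["balance"])
    return target


# The staging area holds cleansed rows before they enter the target.
staging_area = [row for row in map(cleanse, source_rows) if row is not None]

# Loading: here the "target system" is just a list of dictionaries.
target_table = [transform(row) for row in staging_area]
print(target_table)
```

In practice, rejected rows are usually written to an error table for review rather than silently dropped.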

Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.

Glossary of ETL (Reference:

Source System
A database, application, file, or other storage facility from which the data in a data warehouse is derived.

Mapping
The definition of the relationship and data flow between source and target objects.

Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.

Staging Area
A place where data is processed before entering the warehouse.

Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

Transportation
The process of moving copied or transformed data from a source to a data warehouse.

Target System
A database, application, file, or other storage facility to which the "transformed source data" is loaded in a data warehouse.

Figure 1.12: Sample ETL Process Flow

Informatica

Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration and ETL software.

The important Informatica Components are:

PowerCenter
PowerCenter Connect
Power Exchange
Power Channel
Metadata Exchange
Power Analyzer
Super Glue

In Informatica, all the metadata about source systems, target systems, and transformations is stored in the Informatica repository. Informatica's PowerCenter Client and Repository Server access this repository to store and retrieve metadata.

Source and Target:
Consider a bank that has many branches throughout the world. In each branch, data may be stored in different source systems such as Oracle, SQL Server, Teradata, etc. When the bank decides to integrate its data from several sources for management decisions, it may choose one or more systems such as Oracle, SQL Server, Teradata, etc. as its data warehouse target. Many organisations prefer Informatica for this ETL process because it is well suited to designing and building data warehouses: it can connect to several sources and targets, extract metadata from them, and transform and load the data into target systems.

Guidelines for working with Informatica PowerCenter

Repository: This is where all the metadata is stored in the Informatica suite. The PowerCenter Client and the Repository Server access this repository to retrieve, store, and manage metadata.
PowerCenter Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, running workflows, etc.
Repository Server: The Repository Server takes care of all the connections between the repository and the PowerCenter Client.
PowerCenter Server: The PowerCenter Server extracts data from the sources and then loads it into the targets.
Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer. Source Analyzer is used for extracting metadata from source systems.
Mapping Designer is used to create mappings between sources and targets. A mapping is a pictorial representation of the flow of data from source to target.
Warehouse Designer is used for extracting metadata from target systems; alternatively, the metadata can be created in the Designer itself.
Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. A person's address may not be the same in all source systems because of typos, and the postal code or city name may not match the address. These errors can be corrected by the data cleansing process, and the standardized data can be loaded into the target systems (data warehouse).
Transformation: Transformations help to transform the source data according to the requirements of the target system. Sorting, filtering, aggregation, and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and they are applied during the mapping process from source to target.
Workflow Manager: A workflow helps to load the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system raises an error because the fact table rows violate the foreign key constraint. To avoid this, workflows can be created to ensure the correct flow of data from source to target.
Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each PowerCenter Server.
PowerCenter Connect: This component helps to extract data and metadata from ERP systems like IBM's MQSeries, Peoplesoft, SAP, Siebel, etc. and from other third party applications.
PowerCenter Exchange: This component helps to extract data and metadata from ERP systems like IBM's MQSeries, Peoplesoft, SAP, Siebel, etc. and from other third party applications.
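
The load-order problem described under Workflow Manager can be reproduced outside Informatica. The sketch below uses Python's sqlite3 module rather than any Informatica API; the tables and the two load functions are hypothetical. It shows a fact row being rejected when it is loaded before its dimension row, and accepted when the steps run in the correct sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE product_dim ("
    "product_key INTEGER PRIMARY KEY, product_name TEXT)"
)
conn.execute(
    "CREATE TABLE sales_fact ("
    "product_key INTEGER NOT NULL REFERENCES product_dim(product_key), "
    "sales_dollar REAL)"
)


def load_dimension():
    conn.execute("INSERT INTO product_dim VALUES (1, 'Laptop')")


def load_fact():
    conn.execute("INSERT INTO sales_fact VALUES (1, 1000.0)")


# Wrong order: the fact row refers to a product that is not loaded yet.
try:
    load_fact()
    wrong_order_failed = False
except sqlite3.IntegrityError:
    wrong_order_failed = True

# Correct order, as a workflow would sequence it: dimension, then fact.
for step in (load_dimension, load_fact):
    step()

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(wrong_order_failed, fact_rows)
```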
