Data Warehousing
Tyler Helmle
Department of Computer Science and Software Engineering
University of Wisconsin Platteville
Abstract
Data warehousing is defined as being the storage of data for the use of management’s decision making. The data that is stored in the warehouse is typically described as being subject oriented, integrated, time-variant, and non-volatile. Since the 1990’s, data warehouses have been implemented in many organizations to help manage their businesses and support their operations.
In this lecture, specific processes and functions of data warehouses are going to be explained. Through many examples and analogies, attendants of this presentation will come away with a general knowledge of the data warehousing concept. Of course, there is no one straightforward way to design a data warehouse, but there are common practices that are followed by many organizations. These common practices have evolved over the history of the concept, and will be explained in the lecture. Listeners of this presentation will broaden their technological backgrounds and become more familiar with a data management practice.
Introduction
Companies are growing every day, and the information they want to obtain does also. Data warehousing is the practice of storing all the information electronically that the company or organization believes to be important and feasible to store. The reason for a data warehouse is primarily for reporting and analyzing the business and its processes. Businesses have systems that perform their everyday transactions and process their information. Some examples of these systems are an ERP (Enterprise Resource Planning) system, a sales system, an accounting system, or a warranty system. What a data warehouse does is it takes all the data from these systems and stores it in a central location. A data warehouse can also include outside data. Customer data or market data are a couple of examples of information a company may want to analyze. Once a way of extracting this data is established, the data is then made readily available to management and other people who can use the data to perform their analysis.
Some companies can utilize their data warehouse to the point of increasing sales. The data that is stored and the reports that are generated from the data can give a business some insight into the way it needs to operate. Others may use the data as a cost-savings opportunity. The data may show where high costs are being incurred, and the business can then tackle the problem to reduce the costs. Also, data warehouses could be used to improve products that are produced or services that are performed by the business.
A company generally would like to benefit from creating and implementing a data warehouse, but the success rate for these types of projects tend not to yield the results that are always expected. It has been concluded that a data warehouse is very expensive to both create and support. Along with the cost of the project, implementing a data warehouse is also a very risky operation. The time and money that is spent on a data warehouse does not always create the return on investment that a company may be looking for.
General Description
Before going into a lot of detail about a data warehouse, first let’s discuss what the different components of a data warehouse are. A manufacturing facility is an analogy that a person could use to describe a data warehouse. As a manufacturing plant produces a product, there are raw materials and components that are used to create the products. Buying the raw materials and components of the product is basically the same concept as gathering the data in a data warehouse. The data itself is the building block of the data warehouse as are the parts the building blocks of a product. The type of data that is stored in a data warehouse is typically historical, consolidated, and summarized.
A manufacturing plant needs to store its parts and components, just as a data warehouse has to store the data. It is critical to store and organize the data in a way that the business can access it and extract the data to use it. The data is generally stored in databases that are separate from the business’s operational databases. Since the data warehouse pulls data from a number of systems in the organization, the size of the database tends to be very large.
Once a company gathers all the materials needed to create a product, they assemble the product. The concept of generating reports from a data warehouse is the same idea. The business needs to have a way to extract the data, as do they need a way to organize it to make it useful. Reports are one outcome of a data warehouse. The reports can give insight to the business to reduce costs, improve operations, increase sales, etc. As with reporting, data warehouses can be used for OLAP (On-Line Analytical Processing) and data mining. What the company does with the data is really the determining factor that decides if the data warehouse is a success or a failure.
Data Warehouse Architecture
Figure 1 is an image that shows an example of the basic architecture of a data warehouse. In other terms, it represents how a data warehouse could be used and built.
Figure 1: Data Warehousing Architecture
The figure shows that the warehouse extracts the data from external sources along with operational databases. During extraction, the data is transformed to the way the business designed the system. The data that is extracted needs to be periodically refreshed with more current data. The data is then stored in databases which are located on servers. Once the information is available, it can be used to serve the business for analysis, query/reporting, and data mining [3].
Because there is no one straightforward way to create a data warehouse, an architect can construct the warehouse to suite the business’s needs. The example of a data warehouse architecture that was explained before is just one example of a design. There are many different ways to tackle the idea of a data warehouse.
Designing and Implementation
Once a company decides to implement a data warehouse into their system, the designing of it needs to take place. The designing phase is very time consuming and complex. There are a lot of things that need to be considered when it comes to creating a data warehouse.
Raw Data
A designer of a data warehouse first needs to understand the business and realize what data it needs. Do users of the warehouse need reports on accounting information, warranty information, inventory information? The business itself will determine what type of data is useful, but this is something that needs to be considered.
Data granularity is also a point to discuss when designing a data warehouse. The granularity of the data is described by how detailed the data gets. If a manager wants to know when warranty claims are occurring, does he/she need to know within a certain month, week, day, or hour? This is extremely important to know when it comes to designing the databases that are going to store the data. Knowing exactly what level of detail the data needs to be in can greatly help the design process along [4].
Data Cleaning
Before the data can be extracted from the various different systems, it needs to be cleaned. Data is no good to an organization if it is invalid. How can a manager analyze data that is not correct? Cleaning the data is a must, but it proves to be a tougher task than it sounds. Because the data is coming from so many different sources, the possibility of it all being exactly correct and in exactly the right format is very unlikely. Therefore, organizations spend a lot of money on validation tools and software to try and compensate for this problem. As with all data integrity, some of the common problems that a data warehouse may need to overcome are different variable lengths, different variable types, missing or null values, etc [4].
Typically, there are three different data cleaning methods that are used. Data migration is the first tool. This type of cleaning involves the transformation of the data. An example would be to transform a specific string of characters into a different string of characters. This helps clean the raw data to make it more universal. The second tool is data scrubbing. These tools use domain specific knowledge. By linking the data up with variables that are already stored, the data can be scrubbed a certain way. Data auditing is the third type of a data cleaning tool. These tools scan the data for patterns or inconsistencies. If there is anything that is alarming with the data, these tools are used to catch it. The way these tools are designed is by programming different rules that the data must follow or rules that catch inconsistent data [3].
Once the data is cleaned and transformed, it is ready to be loaded into the warehouse.
Data Load
Typically, batch processing is used to load the data into the warehouse. During this process, data may still need to be checked for integrity, summarized, or sorted. There may also be some computation that needs to occur before it is fully loaded into the warehouse. While the load is occurring, there needs to be some sort of visibility. Managing the load is crucial. A data administrator must be able to see the progress of the load. Along with visibility, the administrator is going to need to have the capability to start, cancel, suspend, and resume a load. All of these capabilities need to be available without affecting the data integrity [4].
The loads that a load utility must process are much larger when loading a data warehouse rather than just an ordinary database. Typically, loading an operational database is much smaller and less risky than loading a data warehouse. The amount of time that a data warehouse can be offline is very small, so the refreshing of the data must be quick and efficient. With the amount of data that needs to be loaded, a sequential load is generally not the solution. A solution to this is to use pipelines and partitioned parallelism.
Sometimes parallelism still is not quick enough to load all the data into the data warehouse, so an incremental load is used during the refresh. This reduces the amount of data that has to be loaded into the warehouse by only adding the data that has been updated. The result of using this methodology is a much more complex way of managing the load utility [3].
Data Refresh
Updating the source data and the data already stored in the warehouse is the next task a database administrator needs to consider. There are two key concepts that need to be considered when thinking of a refresh utility: when to refresh and how to refresh. If a company wants a weekly report produced, but the data is only refreshed monthly, then there becomes a problem. The data is not meaningful unless it is properly refreshed at specified times.
As for how to refresh the data, there are a number of different possibilities. It really depends on the capabilities of the database servers and the complexity of the refresh operation. Replication techniques are generally used to update the data. The two forms of replication techniques are data shipping and transaction shipping.
Multidimensional View
The multidimensional view of the data in the warehouse is a conceptual model that impacts the front end of the system, the database design, and the query engines. In this multidimensional data model, there are a set of numeric measures. These measures are typically used to analyze different aspects of the business. Some examples would be ROI (Return on Investment), sales, budget, revenue, and costs. Each of these measures is determined by a set of dimensions. For example, the dimensions that could be connected to a ROI could be time, annual savings, and initial investment. These dimensions together could describe a unique ROI for the business. The measure itself is really just a value in the multidimensional space of dimensions. Figure 2 shows an example of multidimensional data.
Figure 2: Multidimensional Data
The above shows the different hierarchal levels of data. The product name, city, and location are the three dimensions. Those three dimensions can be looked at an even higher dimension, which makes it multidimensional. For example, the product dimension is derived from an industry to a category to the final product. By analyzing the data and representing it in different dimensions, there can be some very important information that is concluded. For example, time is a dimension that can be used to discover trends [3].
Front-End Tools
Typically, business analysts use spreadsheet applications like Microsoft Excel to view their data. This spreadsheet form is what influenced the multidimensional concept and thus is used frequently in representing the data to the business analyst. Pivoting is an operation that is used to view the multidimensional data. The different dimensions require multiple axes, and pivoting is a solution.
Another operation that can be performed on the data is the roll up method. This is done by grouping the data into a higher level to be analyzed. For example, you can take the sales data that is grouped by city and group it again by state. This then puts the data into different perspectives for the analyst. The opposite operation of the roll up method would be to drill-down on the data. Slice-and-dice is also another method to use for a front end tool. This is done by presenting the data under multiple different dimensions. An example of this would be to represent supply costs by vendor and by size of order. This allows the business analyst to view the data from multiple perspectives to determine what may be best for the business. The other popular operations that can be used are ranking, selections, and computed attributes [3].