Dr. Bjarne Berg DRAFT
Granularity
Granularity is a concept that refers to the event table (fact table) and the level of detail of the data captured there. An example of a granularity would be "sales per day, per store". Another level could be "sales by week, by region". The first example would result in one record per store each day, an aggregation of all the sales records for that day at the store level. If the store had 500 sales that day, we would store one record in the data warehouse with the summarized amount sold for that day for that store. If we used the other granularity, "sales by week, by region", we would get only one record per region in the data warehouse each week. This record would contain the summarized sales for all stores in that region for the whole week.
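To make the contrast concrete, the following sketch (in Python, with a handful of made-up transaction records and illustrative field names) shows how the same transaction-level data collapses to one record per store per day under the first granularity, and to one record per region per week under the second.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level sales records: (sale_date, store, region, amount).
transactions = [
    (date(2024, 3, 4), "Store 17", "Southeast", 19.99),
    (date(2024, 3, 4), "Store 17", "Southeast", 4.50),
    (date(2024, 3, 5), "Store 22", "Southeast", 87.00),
]

# Granularity 1: "sales per day, per store" -> one record per (day, store).
sales_by_day_store = defaultdict(float)
for sale_date, store, region, amount in transactions:
    sales_by_day_store[(sale_date, store)] += amount

# Granularity 2: "sales by week, by region" -> one record per (ISO week, region).
sales_by_week_region = defaultdict(float)
for sale_date, store, region, amount in transactions:
    sales_by_week_region[(sale_date.isocalendar()[1], region)] += amount

print(dict(sales_by_day_store))    # two records: one per store per day
print(dict(sales_by_week_region))  # one record: (week 10, "Southeast") = 111.49
```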
As the examples above illustrate, the benefit of a higher (more aggregated) level of granularity is the reduced number of records in the data warehouse. This results in lower hardware and maintenance costs and, more importantly, far better response times when executives analyze the data. However, there are also drawbacks to picking too high a level of granularity. The second example used a granularity of "sales by week, by region". Since this results in a single record per region for each week, executives would be unable to see store performance. Nor would they be able to see daily sales performance, since that level of detail would not be stored in the data warehouse. By picking too high a level of granularity, we limit the functionality of the data warehouse.
Even at a lower level of granularity we have removed some analytical capabilities. In the first example we picked the granularity of "sales per day, per store". This solves some of the limitations of the second example. However, since this is still an aggregation of the store's sales records for a whole day, we cannot see which products were sold. The issue is that the store might sell many products, and our selected granularity captures only the total sale for each day by store. This might be adequate for analytical applications that monitor cash flow by store, daily sales trends, sales performance by region, weekend sales, staffing needs by store, decisions on where to build new stores, macroeconomic impacts, income forecasting, and much more. However, "sales per day, per store" would not be an appropriate granularity for market basket analysis, product mix performance monitoring, restock analysis, product performance, or manufacturing forecasting, since the product is not included in the granularity.
A solution could be to create a granularity of "sales per day, per store, per product". This would result in one record for each product sold (actually sold, not merely stocked) in the store each day. If the store sold 800 products that day, we would now store 800 records with the total sale amounts for those products for that day. As we can see, this results in 292,000 records per store per year (800 x 365) instead of only 365 records under the first example.
However, even this low a level of granularity is often not enough. The granularity "sales per day, per store, per product" does not allow for restock analysis or detailed sales analysis, since it does not track down to an individual Stock Keeping Unit (SKU). For example, Budweiser is sold in single cans, 6-packs, 12-packs, and 24-packs. If we lowered the granularity to the SKU level and had on average three SKUs per product, we would increase the records kept in the data warehouse from 365 per year (under the first granularity) to 876,000 per year for every store the company has.
If a company has 500 stores and wants to track sales for four years, the record count would be approximately 1.75 billion instead of only 730,000 at the higher granularity. The implications are very large. The granularity of "sales per day, per store" could be handled by a small NT box and a simple query tool, with costs in the $25K-$50K range, and query performance would be excellent. The granularity of "sales per day, per store, per SKU" would result in a large system landscape with 100+ gigabytes of data, complex managed query tools, and high network traffic, and it would require performance optimization. The costs could be in the multi-million-dollar range.
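The record-count arithmetic behind these figures is easy to verify. The short Python calculation below uses the assumed figures from the examples (800 products sold per store per day, three SKUs per product, 500 stores, four years of history).

```python
# Record counts per store per year at each granularity discussed above.
days_per_year = 365
products_sold_per_day = 800   # products actually sold, not merely stocked (assumed)
skus_per_product = 3          # average SKUs per product (assumed)

per_day_per_store = days_per_year                                              # 365
per_day_per_store_per_product = days_per_year * products_sold_per_day          # 292,000
per_day_per_store_per_sku = per_day_per_store_per_product * skus_per_product   # 876,000

# Fleet-wide totals for 500 stores and 4 years of history.
stores, years = 500, 4
print(f"per day, per store:          {per_day_per_store * stores * years:,}")          # 730,000
print(f"per day, per store, per SKU: {per_day_per_store_per_sku * stores * years:,}")  # 1,752,000,000
```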
As these examples illustrate, the granularity of the data warehouse may be the single most important decision a data modeler makes. This decision determines the analytical capabilities of the data warehouse, as well as the infrastructure and the data volumes of the system architecture. The trade-off is faster analytical performance and lower cost of ownership versus a higher degree of flexibility and more detailed analytical capabilities.
Some companies have chosen a hybrid granularity approach. In this scenario, all the lowest-level transactions from the On-Line Transaction Processing (OLTP) system are loaded into a staging area in the data warehouse. The data is then aggregated to some granularity before it is loaded into the Operational Data Stores (ODS); in practice, many companies keep the transaction-level details in the ODS as well. The data is then loaded into data marts, analytical applications, or specialized Multi-dimensional On-Line Analytical Processing (MOLAP) cubes for analysis. These integrated analytics (iAnalytics) applications may have substantially different granularity than both the OLTP system and the ODSs, based on the business questions they address.
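A minimal sketch of that hybrid flow, with hypothetical record layouts and target granularities, might look like the following: the ODS keeps a fine granularity (per day, per store, per SKU), while a regional-trend data mart is loaded at a much coarser one (per week, per region).

```python
from collections import defaultdict

# Hypothetical staging records extracted from the OLTP system:
# (sale_date, iso_week, region, store, sku, amount) -- field names are illustrative.
staging = [
    ("2024-03-04", 10, "Southeast", "Store 17", "BUD-6PK",   8.99),
    ("2024-03-04", 10, "Southeast", "Store 17", "BUD-12PK", 15.49),
    ("2024-03-05", 10, "Southeast", "Store 22", "BUD-6PK",   8.99),
]

# ODS: keep a fine granularity -- sales per day, per store, per SKU.
ods = defaultdict(float)
for sale_date, week, region, store, sku, amount in staging:
    ods[(sale_date, store, sku)] += amount

# Regional-trend data mart / MOLAP cube: a much coarser granularity --
# sales per week, per region -- chosen for the business question it answers.
regional_mart = defaultdict(float)
for sale_date, week, region, store, sku, amount in staging:
    regional_mart[(week, region)] += amount

print(dict(ods))            # three records at the day/store/SKU grain
print(dict(regional_mart))  # one record at the week/region grain
```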
So granularity is not a decision made only once in a knowledge management architecture. It is a concept that has to be discussed every time a new iAnalytics application is built or new transaction record types are added to the data warehouse. Assuming that all records in the data warehouse will be kept at the transaction level is a costly mistake that creates enormous data architectures, poor performance, slow networks, and end-user frustration.