CERN-AS-98-003

Providing a cubic data model for Web based pivot tables

James Purvis
July 1998

Raw data

From the raw financial data, management have identified several fields which are interesting to total:

• Cost Centre10,000

• Nature500

• Destination500

• Supplier10,000

• Contract Number1,500

Cube data

Ideally we want to be able to build a 5-dimensional cube to represent the above data. A cube covering pre-calculated totals of all these combinations would be unfeasible in terms of storage. Similarly to calculate all the required permutations for an on-line overview would be unfeasible in terms of response-times. Therefore a tradeoff between some “pre-calculated” and some “on-line” calculations must be made.

We need to decide on an optimum threshold between what we store pre-calculated and what we calculate on-line. To imagine this, we require the ability to interrogate a “virtual” cube of data from a given any “physical cube”.

This idea raises two key questions :

•How do we determine the optimum threshold?

•How do we construct an SQL query to work in the above dataspace?

The answers to these questions are inter-related and will be discussed in the following two sections.

How do we determine the optimum threshold?

The simple approach we adopt here is that we let the optimum threshold be decided empirically. For example, we may initially decide to store 30% of the data and calculate the remaining 70% on-line. If we find this too heavy then we may decided to increase the storage to 40% and reduce the calculation to 60%. This requires that we have a flexible means of altering the threshold between stored and calculated data.

How do we construct a query to work in the above dataspace?

Given the answer to the first question, the challenge lies in being able to construct an “intelligent” query mechansim which has the ability to determine what portion of the data is pre-calculated and stored in the database, and which portion requires on-line calculation. In order to do this we need to be able to do is store data about “how” we have stored our data, i.e. we need cubic meta-data.

A simple approach to constructing a cube

Returning to our raw financial data let us know create a simple cube of our interesting data. This would be done with an SQL such as :

create table cube as

select cost_centre,

nature,

destination,

supplier,

contract_number,

sum(amount) amount

from

transactions

group by

cost_centre,nature,destination,supplier,contract_number

The above SQL simply creates a total for every combination (but not permutation) of cost_centre, supplier etc in the database.

This summary table is of little-use because any meaningful query (e.g. what is the expenditure per supplier for the organisation?) requires significant on-line calculation. This can be represented by the threshold | in our diagram below:

Let us now represent this “representation” in a META Table. One possible representation could be:

cost_centre / nature / destination / supplier / contract_number
* / * / * / * / *

* simply indicates that all values exist.

Now consider the case whereby we want to optimise this cube to improve performance for “supplier expenditure” oriented queries. Logically this means we want to pre-calculate totals for each supplier, keeping all other factors constant. Our Meta data representation of this would be:

cost_centre / nature / destination / supplier / contract_number
* / Σ / Σ / * / Σ

Storing such pre-calculated totals has moved the threshold as shown below:

The actualisation of the pre-calculation in the database is trivial and may be generated from the original cube itself using:

insert into cube

select cost_centre,

‘ALL’ nature,

‘ALL’ destination,

supplier,

‘ALL’ contract_number,

sum(amount) amount

from

cube

group by

cost_centre,supplier

Constructing SQL Queries to Interrogate the virtual cube

Continuing with the above example, we now have a cube whose structure is represented by the following Meta table:

cost_centre / nature / destination / supplier / contract_number
* / * / * / * / *
* / Σ / Σ / * / Σ

The first row of this meta table tells us that the atomic totals for each combination of [cost_centre, nature, contract_number] exist. The second row of this meta table tells us that summary totals exist for each combination of [cost_centre,supplier] where all other factors are constant.

Let us take two sample queries to illustrate the approach:

Query (a)
Produce a matrix report detailing the expenditure per supplier and cost centre in the organisation. The results should look something like:

Centre X / Centre Y / Centre Z / ...
Supplier A / 10
Supplier B / 20 / 20
Supplier C / 5
...

Query (b)
Produce a matrix report detailing the expenditure per contract and cost centre in the organisation. The results should look something like:

Centre X / Centre Y / Centre Z / ...
Contract 1 / 10 / 5
Contract 2 / 7 / 3
Contract 3 / 9
...

For query (a) we know that we require:

•Supplier and Cost Centre to be variable

•All other fields constant.

Therefore we are looking for a pattern-match in our meta table for:

cost_centre / nature / destination / supplier / contract_number
* / Σ / Σ / * / Σ

As the pattern match is succesful then we know that we may proceed with a query using pre-calculated data. This query would therefore be of the form:

select

cost_centre,

supplier,

amount /* Note that summation is not required */

from

cube

where

nature = ‘ALL’

and destination = ‘ALL’

and contract_number = ‘ALL’

and cost_centre > ‘ALL’

and supplier > ‘ALL’

group by

cost_centre,supplier

This would return the results as follows:

Supplier A / Centre X / 10
Supplier B / Centre Y / 20
Supplier B / Centre Z / 20
...

Putting these results in the required matrix format is simply a matter of reformatting the data, for example using a java grid representation.

This query is performance optimised because the required totals exist. The response time will therefore be fast.

Now let us examine query (b).

For query (b) we know that we require :

•Contract and Cost Centre to be variable

•All other things constant.

Therefore we are looking for a pattern-match in our meta table for:

cost_centre / nature / destination / supplier / contract_number
* / Σ / Σ / Σ / *

But our Meta table only contains:

cost_centre / nature / destination / supplier / contract_number
* / * / * / * / *
* / Σ / Σ / * / Σ

Therefore the pattern match on the meta table fails. Failure indicates that the required results will need to be calculated from the atomic data. Our query would therefore be of the form

select

cost_centre,

contract_number,

sum(amout)/* Note that summation is required */

from

cube

where

nature > ‘ALL’

and destination >‘ALL’

and contract_number > ‘ALL’

and cost_centre > ‘ALL’

and supplier > ‘ALL’

group by

cost_centre,supplier

This would return the results as follows:

Contract 1 / Centre X / 10
Contract 2 / Centre Y / 20
Contract 3 / Centre Z / 20
...

Putting these results in the required matrix format is simply a matter of reformatting the data, for example using a java grid representation.

This query is not performance optimised because the required totals do not exist. The response time will therefore be relatively slow, but the results will arrive. If this query occurs frequently then you may wish to consider pre-calculating these totals and updating the meta table accordingly.

Filters

Now let us consider constructing Query (a) above on filtered (sliced) data. For example:

Query (a)
For the nature 218 : Produce a matrix report detailing the expenditure per supplier and cost centre in the organisation. The results should look something like:

Centre X / Centre Y / Centre Z / ...
Supplier A / 10
Supplier B / 20 / 20
Supplier C / 5
...

For query (a) we know that we require:

•Supplier and Cost Centre to be variable

•All other fields constant.

•Nature field to be fixed at 218

Therefore we are looking for a pattern-match in our meta table for:

cost_centre / nature / destination / supplier / contract_number
* / 218 / Σ / * / Σ

218 can be considered as one of the multiple values of *, so the following pattern match is also valid:

cost_centre / nature / destination / supplier / contract_number
* / * / Σ / * / Σ

Therefore the pattern match on the meta table fails. Failure indicates that the required results will need to be calculated from the atomic data. Our query would therefore be of the form:

select

cost_centre,

contract_number,

sum(amout)/* Note that summation is required */

from

cube

where

nature > ‘ALL’

and destination >‘ALL’

and contract_number > ‘ALL’

and cost_centre > ‘ALL’

and supplier > ‘ALL’

and nature = ‘218’

group by

cost_centre,supplier

Links : Drill-down, sliceing and diceing...

OK, now we can see how to arrive at a matrix report such as:

Centre X / Centre Y / Centre Z / ...
Supplier A / 10
Supplier B / 20 / 20
Supplier C / 5
...

This raises the question: what should the links be in the table-data cells. In other words, what should happen when one clicks on the “10” which represents Centre X’s expenditure on Supplier A.

In a “drill-down” situation we require a transactions report which can take cost-centre/supplier as parameters. Therefore clicking on the number 10 would provide the details of these transactions.

In a “drill-across” situation, the user may for example wish to see this figure split across a further criteria, e.g. nature. A drill-across would re-invoke the same report, but applying a filter (i.e. centre = x, supplier = A).

Thought is required on how to support drill-down and drill-across on the same hyperlink. To keep things simple we should start by simply supporting drill-down and not drill-across.

To implement drill-down support one simply needs the hyperlink to the drill-down report format. For example, if this report is the transactions then the hyperlink may be:

Hierarchies?

Hierarchies are often used for accounts, cost centres, geographic locations and many other possibilities. Let us consider the case where the cost centres are hierarchical in the following report:

Centre X / Centre Y / Centre Z / ...
Supplier A / 10
Supplier B / 20 / 20
Supplier C / 5
...

Clicking on Centre X will ideally produce the same report, but splitting across the cost centres belonging to X, namely :

Centre X1 / Centre X2 / Centre X3 / ..
Supplier A / 3 / 5 / 2
Supplier B
Supplier C
...

We will examine the case where the hierarchy is stored “flat” across the table structure. For example we may have table with:

cost_centre
group / contains cost centres
division / contains groups
project / contains cost centres
nature
supplier

What is necessary is to store the meta data which allows the user to slice the data. For instance a report with division x supplier should allow the user to “slice” on division (down to group), but not on supplier. How do we store this “structural knowledge?”

One solution is to provide a meta table for the relationships between the data, e.g:

PARENT / CHILD
division / group
group / cost_centre
project / cost_centre

Whenever an axis-field is selected the Web interface should insert hyperlinks if this axis-field is in the above table.

Summary

The basic steps to building a simple OLAP model of your datawarehouse are:

1. Create the summary cube

2. Create summaries for the query-combinations which you think are most frequent

3. Create a Meta table which represents the selected query-combinations.

The steps to interrogating this model are:

1. Pattern match the query parameters with the Meta Cube

2. Send the appropriate query (performance-based or storage-based)

3. Format the output data

BHT DATA AVAILABLE

/* CORE CUBE DATA */

SQL> desc cube97;

Name Null? Type

------

AM_CO NUMBER

AM_PROVD NUMBER

AM_PUPROV NUMBER

AM_PROVNY NUMBER

AM_PAID NUMBER

COS NOT NULL VARCHAR2(20)

NAT VARCHAR2(10)

CGB VARCHAR2(10)

PRC VARCHAR2(10)

CNR VARCHAR2(14)

NTP VARCHAR2(10)

/* CUBE META DATA */

SQL> select * from cube_meta97;

COS NAT CGB PRC CNR

------

* * * * *

* MATERIEL ALL ALL ALL

* MATERIEL ALL * ALL

* MATERIEL ALL ALL *

* MATERIEL * ALL ALL

* * ALL ALL ALL

6 rows selected.

/* CUBE SLICE INFO */

SQL> desc cube_slice;

Name Null? Type

------

PARENT VARCHAR2(30)

CHILD VARCHAR2(30)

/* ELABORATED CUBE */

SQL> desc pivot97;

Name Null? Type

------

COS NOT NULL VARCHAR2(20)

COS_DES VARCHAR2(60)

COS_PID VARCHAR2(20)

COS_ENS VARCHAR2(1)

COS_DIV VARCHAR2(10)

COS_GRP VARCHAR2(10)

COS_SCT VARCHAR2(20)

COS_PPA VARCHAR2(10)

COS_PPA2 VARCHAR2(10)

COS_PRO1 VARCHAR2(20)

COS_PRO2 VARCHAR2(20)

COS_PRO3 VARCHAR2(20)

COS_PRO4 VARCHAR2(20)

COS_PRO5 VARCHAR2(20)

NAT NOT NULL VARCHAR2(10)

NAT_CAT VARCHAR2(10)

CGB VARCHAR2(20)

CGB_CAT VARCHAR2(10)

PRC VARCHAR2(10)

CNR VARCHAR2(14)

NTP VARCHAR2(10)

AM_CO NUMBER

AM_PROVD NUMBER

AM_PUPROV NUMBER

AM_PROVNY NUMBER

AM_PAID NUMBER

SQL>

INPUT SCREEN

RESULTS

Summary of Algorithm to determine SQL use and construction.

Two Basic SQLs -

(1) Optimised (using pre-calculated totals)

select

FIELD_X,

FIELD-Y,

FIELD_DATA

from

cube

where

FIELD_X > ‘ALL’,

and FIELD_Y > ‘ALL’

and FIELD_A = ‘ALL’

and FIELD_B = ‘ALL’

...

and FIELD_D = ‘ALL’

andFILTER_1 = ...

and FILTER_2 = ...

To know if you should use the optimised SQL, you should execute :

SELECT * FROM CUBE_META where

FIELD_X = ‘*’

and FIELD_Y = ‘*’

and FILTER_1 = ‘*’

AND FILTER_2 = ‘*’

...

nb : dependent filter / field names should be truncated. E.G. cos_div=‘*’, cos_grp=‘*’,cos_sct=‘*’ all require simply cos =‘*’

(2) Atomic (non optimised)

select

FIELD_X,

FIELD-Y,

SUM(FIELD_DATA)

from

cube

where

FIELD_X > ‘ALL’,

and FIELD_Y > ‘ALL’

and FIELD_A > ‘ALL’

and FIELD_B > ‘ALL’

...

and FIELD_D > ‘ALL’

andFILTER_1 = ...

and FILTER_2 = ...

group by FIELD_X,FIELD_Y

Pivot Table - current version

The current version of the pivot table is extremely powerful and already covers many reports produced by DPOS. Sample screenshots are shown below.

Pivot Table - next steps

The next steps for the pivot table are the following “refinements”:

Transactions Drill Down

The data should be linked to drill-down to custom transactions.

Descriptions Option

Many of the codes, e.g. “Nature = 218” don’t have meaning unless accompanied by descriptions. Either [x] Checkbox for “show descriptions” or provide them automatically at the db-level.

Multiple Years

Database will have one table per-year. Time period field should be flexible to allow selection of multiple Years. Pivot Table fields should include Year.

Note :

Past = all fields, but not commitments or pipeline

Present = all fields

Future = commitments only

Row Totals

Current reports provide totals at Column Level. Row totals are also interesting. Note that row totals are *not* simply a total for the row because one must summarise payments, commitments, open commitments etc separately.

Budgets

Budgets will be added to the database table. Need to be added as field on Report.

Pipeline

Pipeline will be added to the database table. Need to be added as field on Report.

Additional Report Formats

Currently supported formats are:

1.Y

2.Y Y

3.X Y

Additional formats should include:

Y Y Y

X Y Y

i.e. we require one more dimension.

The problems associated with this are:

(a) How do we add this to the user input form? Do we allow multiple selection, or do we have a third field listing?

(b) How do we allow the user to change from:

to

- 1 -