3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001)
München, Germany, September 5-7, 2001
Published in LNCS 2114, Springer Verlag, pp. 274-283
Improving the Performance of OLAP Queries Using Families of Statistics Trees
Joachim Hammer & Lixin Fu
Dept. of CISE, University of Florida, Gainesville, Florida 32611-6120, U.S.A.
e-mail: {jhammer,lfu}@cise.ufl.edu
Abstract. We present a novel approach to speeding up the evaluation of OLAP queries that return aggregates over dimensions containing hierarchies. Our approach is based on the previous version of CubiST (Cubing with Statistics Trees), which pre-computes and stores all possible aggregate views in the leaves of a statistics tree during a one-time scan of the data. However, CubiST uses a single statistics tree to answer all possible OLAP queries, which makes it inefficient for queries involving dimension hierarchies. Our new version remedies this limitation by materializing a family of derived trees from the single statistics tree. Given an input query, our new query evaluation algorithm selects the smallest tree in the family that can provide the answer. Our experiments have shown drastic reductions in processing times compared with the original CubiST as well as with existing ROLAP and MOLAP systems.
1 Introduction
Online analytical processing (OLAP) [2,12] is a recent technology that enables decision support for large enterprise data warehouses. The starting point for OLAP is the data cube, a multi-dimensional organization of the underlying data, which is well suited to the analytical and navigational activities across dimensions, hierarchies, and associated measures. However, on any but the smallest databases, OLAP queries tax even the most sophisticated query processors, since the size of the data cube often grows to hundreds of GBs or TBs and the data is of high dimensionality with large domain sizes.
In [3], Fu and Hammer describe a new, efficient algorithm called CubiST (Cubing with Statistics Trees) for evaluating ad-hoc OLAP queries using so-called Statistics Trees (STs). In general, a statistics tree is a space-efficient encoding of all possible aggregates in the cube and can be used to evaluate queries that return aggregate values (e.g., What is the percent difference in car sales between the southeast and northeast from 1990 until 2000?) rather than sets of tuples (e.g., Find the names and addresses of car dealers in the southeast who sold more cars in 1999 than in 1998). We term the former type of query a cube query. However, the original version of CubiST does not efficiently support hierarchies, which are the result of partitioning dimension values into different levels of granularity (e.g., days into weeks, quarters, and years).
In this paper, we present a new methodology called CubiST++ to evaluate cube queries, which greatly improves upon the usefulness and efficiency of the original CubiST. CubiST++ specifically improves the evaluation of cube queries involving dimension hierarchies in a data cube. Instead of using a single (base) statistics tree to compute all cube queries, our new algorithms materialize a family of trees from which CubiST++ selects potentially smaller derived trees to compute the answer.
In general, research aimed at speeding up OLAP queries falls into the following categories: special-purpose OLAP servers, including Relational OLAP (e.g., MicroStrategy [11], Redbrick [15]) and Multi-dimensional OLAP (e.g., Arbor Systems Essbase [1], Oracle Express [14]), and pre-computation of query fragments using materialized views (see, for example, [6, 8, 9] for approaches to view selection, as well as [10] for a collection of excellent papers on view materialization). Also relevant is the work on index structures (see [13] for an excellent overview) and on processing of aggregation queries [5, 16]. However, indexing and pre-computation of results alone cannot support a wide range of OLAP queries: building an index for each attribute of the warehouse or pre-computing every sub-cube requires too much space and results in unacceptable maintenance costs. We believe that our approach, which is based on a new encoding for aggregates, a greedy strategy for selecting views, and a new query processing algorithm, represents a major step towards supporting ad-hoc OLAP queries efficiently.
2 Families of Statistics Trees
2.1 The Statistics Tree Data Structure
We start by briefly introducing the statistics tree (ST) data structure and the corresponding algorithms for selecting and generating families of statistics trees. An ST is a multi-way, balanced tree whose leaf nodes contain the aggregates for a set of records consisting of one or more attributes. By aggregates, we are referring to the result of applying an aggregation function (e.g., count, min, max) to each combination of dimension values in the data set. Leaf nodes are linked to facilitate retrieval of multiple aggregates. Each level in the tree (except the leaf level) corresponds to an attribute. An internal node has one pointer for each domain value, and an additional “star” pointer representing the entire attribute domain (the star has the intuitive meaning “any”, analogously to the star in Gray’s cube operator [4]). Internal nodes contain only tree node pointers. Initially, the values in the leaf nodes are set to zero or undefined. Populating the tree is done during a one-time scan of the data set: for each record, the tree is traversed based on the dimension values in the record; the leaf nodes at the end of the paths are updated using the aggregation function (e.g., incremented by one in the case of the count function). In a sense, an ST can be considered a superview since it stores multiple traditional views representing different group-by combinations. However, it differs from multi-dimensional arrays in that it combines all arrays into a single structure, which simplifies initialisation and maintenance, especially when the number of dimensions is not known in advance or changes over time.
Fig. 1 depicts a simple statistics tree corresponding to a data cube with three dimensions having cardinalities d1=2, d2=3, and d3=4, respectively. The contents of the tree are shown after inserting the record (1,2,4) into an empty tree. The leaf nodes store the result of applying the aggregation function count to the data set. The update paths relating to the sample record are shown as solid thick lines in Fig. 1. The amount of memory that is needed to store a k-dimensional statistics tree is bounded by c(d1+1)(d2+1)…(dk+1), where di is the cardinality of dimension i and the constant c accounts for the space needed to store the internal nodes of the tree. For the tree in Fig. 1, for example, there are (2+1)(3+1)(4+1) = 60 leaves.
Fig. 1. Statistics tree after processing input record (1,2,4)
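To make the structure and the population step concrete, the following Python sketch rebuilds the tree of Fig. 1 for the count aggregate. The nested-dictionary layout and all names are our own illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of an ST for the COUNT aggregate, rebuilding the tree
# of Fig. 1 (d1=2, d2=3, d3=4). The nested-dict layout and names are
# illustrative assumptions, not the authors' implementation.

STAR = "*"  # the "star" pointer meaning "any", as in the cube operator

def make_node(cards, level=0):
    """Build an empty ST: internal nodes have one child per domain value
    plus a star child; leaves hold the aggregate (a count, initially 0)."""
    if level == len(cards):
        return {"count": 0}
    return {k: make_node(cards, level + 1)
            for k in list(range(1, cards[level] + 1)) + [STAR]}

def insert(node, record, level=0):
    """One-time-scan update: follow both the value pointer and the star
    pointer at every level, so all 2^k leaves on the update paths are
    incremented (the solid thick lines of Fig. 1)."""
    if level == len(record):
        node["count"] += 1
        return
    insert(node[record[level]], record, level + 1)
    insert(node[STAR], record, level + 1)

def leaves(node):
    """Count the leaf nodes of an ST."""
    if "count" in node:
        return 1
    return sum(leaves(child) for child in node.values())

cards = [2, 3, 4]
root = make_node(cards)
insert(root, (1, 2, 4))                     # the sample record from Fig. 1
print(leaves(root))                          # 60 = (2+1)*(3+1)*(4+1)
print(root[STAR][STAR][STAR]["count"])       # 1, the total number of records
```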
The original CubiST provides a useful framework for answering cube queries. However, it does not efficiently answer queries on data cubes containing hierarchies on the dimension values (e.g., day, month, and year in the case of the time dimension), since it only stores data at the finest level of granularity (e.g., time in terms of days). In those cases, one must first transform the conditions of the query, which may reference different levels of the hierarchy, into conditions involving the level of granularity represented by the ST (i.e., days), and then treat them as range queries or partial queries over the detailed data. This is particularly inefficient for queries involving only conditions on the coarsest level of granularity, which could have been answered using a much smaller ST. One way to reduce the I/O cost for large STs is to transmit only the leaves; the internal nodes can be generated in memory without additional I/O. A more effective approach, and the one described in this paper, is to materialize additional smaller trees, each representing the dimensions at one particular level of the hierarchy. In most cases, cube queries can be answered using one of the smaller trees rather than the single large ST, as is done in CubiST. We refer to the single large ST as defined in the original version of CubiST as the base tree. Using the base tree, one can compute and materialize other statistics trees with the same dimensions but containing different hierarchy levels for one or more of the dimensions. We term these smaller trees derived trees. A base tree and its derived trees form a family of statistics trees with respect to the base tree. Accordingly, we call the new algorithm CubiST++ (CubiST plus families). Before presenting its details, we first describe how to represent hierarchies.
2.2 Hierarchical Partitioning and Mapping of Domain Values
In relational systems, dimensions containing hierarchies are stored in a row-column format in which different levels are represented as separate attributes. As before, we map the domain values into integers. However, in CubiST++ we also include the hierarchies in our encoding: a value at a higher level of abstraction includes multiple values at a lower level. By scanning a dimension table, we can establish the domain value mappings, hierarchical partitions, as well as the inclusion relationships.
Consider Table 1, which represents the location information in a data source. We first group the records by columns Region, State and City, remove duplicates, and then map the domain values into integers using the sorted order. The result is shown in Table 2 (numbers in parentheses denote the integer representations). For example, value “1” for the Region dimension (“MW”) includes values “1” and “2” for the State dimension (“IL”, “MN”) as well as values “1” and “2” for the City dimension (“Chicago”, “Twin City”).
Table 1. Original location data
Table 2. Partitioning and mapping
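A small Python sketch of this mapping step follows. The rows beyond the MW/IL/MN example (an SE region with Florida cities) and all function names are illustrative assumptions added to make the example self-contained.

```python
# Sketch of the partitioning/mapping step; sample data and names assumed.
rows = [
    ("MW", "IL", "Chicago"),
    ("MW", "MN", "Twin City"),
    ("SE", "FL", "Gainesville"),
    ("SE", "FL", "Orlando"),
]

def build_mappings(rows):
    """Group by (Region, State, City), remove duplicates, map each domain
    value to an integer in sorted order, and record which lower-level
    values each higher-level value includes."""
    rows = sorted(set(rows))
    codes = [{}, {}, {}]                     # one mapping per hierarchy level
    for row in rows:
        for level, value in enumerate(row):
            codes[level].setdefault(value, len(codes[level]) + 1)
    includes = {}                            # inclusion relationships
    for region, state, city in rows:
        includes.setdefault(("Region", region), set()).add(state)
        includes.setdefault(("State", state), set()).add(city)
    return codes, includes

codes, includes = build_mappings(rows)
print(codes[0])                       # {'MW': 1, 'SE': 2}
print(codes[1])                       # {'IL': 1, 'MN': 2, 'FL': 3}
print(includes[("Region", "MW")])     # {'IL', 'MN'}, i.e. states 1 and 2
```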
2.3 A Greedy Algorithm for Selecting Family Members
To generate a family of statistics trees, we first need to choose, from the set of all possible candidate trees (i.e., trees containing all combinations of dimensions and hierarchy levels), the best set of candidate trees. Second, we need to compute the selected trees.
Potentially, any combination of hierarchy levels can be selected as a family member. However, we must consider space limitations and maintenance overhead. We introduce a greedy algorithm that chooses the members in a systematic fashion. Starting from the base tree, during each step we roll up the values of the largest dimension to its next level in the hierarchy, keeping the other dimensions unchanged. This newly formed ST becomes a new member of the family. The iteration continues until each dimension is rolled up to its coarsest level of granularity or until the size of the family exceeds the maximum available space. The combined size of the derived trees is less than that of the single base tree.
Suppose a base tree T0 represents a data set with three dimensions d1, d2, d3, whose cardinalities are 100 each. Each dimension is organized into ten groups of ten elements, creating a two-level hierarchy. We use the superscripted hash mark ‘#’ to indicate values corresponding to the second hierarchy level. For example, 0# represents values 0 through 9 from the first hierarchy level, 1# represents values 10 through 19, etc. The degrees of the internal nodes of T0 are 101 for each level of the tree (the extra 1 accounts for the star pointer). The selected derived trees T1, T2, T3 that make up a possible family together with T0 are as follows: T1 is the result of rolling up d3 in T0 (since all three dimensions are of the same size, we pick one at random). T1 has degrees 101, 101, and 11. T2 is the result of rolling up d2 in T1. Its degrees are 101, 11, and 11. T3 is the result of rolling up d1 in T2. Its degrees are 11, 11, and 11.
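A minimal sketch of this selection loop is shown below; it reproduces the family T0 through T3 under the assumption that each tree can be summarized by its vector of node degrees (the helper names are ours, and ties are broken toward the later dimension to match the example).

```python
# Sketch of the greedy member-selection loop; helper names are ours.
def greedy_family(levels):
    """levels[i] lists the node degrees (domain size + 1 for the star
    pointer) of dimension i at each hierarchy level, finest first.
    Returns the degree vectors of the selected family members."""
    current = [0] * len(levels)              # current level per dimension
    family = [[lv[0] for lv in levels]]      # the base tree T0
    while True:
        # dimensions that can still be rolled up one more level
        candidates = [j for j in range(len(levels))
                      if current[j] + 1 < len(levels[j])]
        if not candidates:
            break                            # every dimension fully rolled up
        # roll up the largest dimension; ties go to the later dimension.
        # (A real implementation would also stop once the family
        # exceeded the available space.)
        i = max(candidates, key=lambda j: (levels[j][current[j]], j))
        current[i] += 1
        family.append([levels[j][current[j]] for j in range(len(levels))])
    return family

# Three dimensions of cardinality 100, each split into 10 groups of 10:
levels = [[101, 11], [101, 11], [101, 11]]
for t, degrees in enumerate(greedy_family(levels)):
    print(f"T{t}: {degrees}")
# T0: [101, 101, 101]  T1: [101, 101, 11]  T2: [101, 11, 11]  T3: [11, 11, 11]
```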
2.4 Deriving a New Statistics Tree
Except for the base tree, which is computed from the initial data set, each derived tree is computed by rolling up one of its dimensions. Before introducing our new tree derivation algorithm, we first define a new ST merge operator “⊕” as follows:
Definition 2.1: Two STs are isomorphic if they have the same structure except for the values of their leaves. Merging two isomorphic STs S1, S2 results in a new ST S that has the same structure as S1 or S2 but whose leaf values are the sums of the corresponding leaf values of S1 and S2. This relationship is denoted by S = S1 ⊕ S2.
In order to derive a new tree, the derivation algorithm proceeds from the root to the level (i.e., the dimension) that is being rolled up. Here, we reorganize and merge the subtrees of all the nodes in that level to form new subtrees. For each node, we adjust the degree and update its pointers to reference the newly created subtrees.
Fig. 2. Rolling-up the dimension values in a two-level hierarchy
The best way to illustrate the derivation algorithm is through an example. Suppose that during the derivation we need to roll up a dimension that has nine values and forms a two-level hierarchy consisting of three groups with three values each. A sample node N of this dimension with children S1, ..., S9, S* is shown on the left of Fig. 2. We merge its nine subtrees into three new subtrees S1#, S2#, and S3#, each formed from three of the original subtrees. The new node N' with degree four is shown on the right side of the figure. The star subtree remains the same. Using the definition from above, we can see that the following relationships hold among the subtrees of nodes N and N':
S1# = S1 ⊕ S2 ⊕ S3, S2# = S4 ⊕ S5 ⊕ S6, and S3# = S7 ⊕ S8 ⊕ S9.
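The sketch below implements the merge operator and the roll-up of a single node as in Fig. 2; the nested-dictionary subtree representation is our assumption, not the authors' code.

```python
# Sketch of the merge operator and the roll-up of one node (Fig. 2);
# subtrees are nested dicts with value keys plus a "*" key (assumed layout).
from functools import reduce

STAR = "*"

def merge(s1, s2):
    """S = S1 ⊕ S2: same structure, corresponding leaf values summed."""
    if not isinstance(s1, dict):             # leaf: the aggregate value
        return s1 + s2
    return {k: merge(s1[k], s2[k]) for k in s1}

def rollup_node(node, groups):
    """Merge the subtrees of each group of domain values into one new
    subtree, e.g. S1# = S1 ⊕ S2 ⊕ S3; the star subtree is unchanged."""
    new_node = {g: reduce(merge, (node[v] for v in members))
                for g, members in enumerate(groups, start=1)}
    new_node[STAR] = node[STAR]
    return new_node

# Nine subtrees, here just leaf counts 1..9, grouped three by three:
node = {v: v for v in range(1, 10)}
node[STAR] = sum(range(1, 10))
print(rollup_node(node, [[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# {1: 6, 2: 15, 3: 24, '*': 45}
```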
3 Answering Cube Queries Using a Family of Statistics Trees
Once a family is created, it can be used to answer cube queries. We have designed a new language called Cube Query Language (CQL) for representing cube queries efficiently. After introducing CQL, we will describe our query processing strategy.
3.1 The Cube Query Language CQL
A cube query contains the following information: the desired measure(s), the aggregate operation used, the dimensions and their hierarchy levels with constraints, and the selected value subsets of the domains. CQL represents cube queries as follows:
AGGR-OPERATOR_MEASURES((D-INDEX, H-LEVEL): SELECTED-VALUES; …).
Constraints, which appear in the body of the query between the parentheses, are separated by semicolons. Each constraint contains the dimension (D-INDEX) as well as the hierarchy level (H-LEVEL). The selected values can be specified in one of three ways: as a single value, a range, or a partial selection. If there is no constraint (null constraint) on a dimension, we can specify an “ALL” value for it. The hierarchy level can be omitted if the constrained dimension is non-hierarchical or the selected values are at the lowest level of the dimension (default value 0). In this case, the constraint simplifies to D-INDEX: SELECTED-VALUES.
The following is a CQL representation of the cube query “Find the total (car) sales for Orlando from March through October.” For clarity, we are using actual names and values rather than the integer representations:
q = SUM_SALES((Time, Month): [Mar, Oct]; (Location, City): Orlando).
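As an illustration of the pieces a CQL query carries, the following sketch models the query above as a small data structure; the class names, field names, and integer encodings are our assumptions, not part of the paper.

```python
# Sketch of a programmatic representation of a CQL query; all names and
# the integer encodings below are assumptions.
from dataclasses import dataclass
from typing import List, Tuple, Union

# A selection: single value, range (lo, hi), or partial set of values.
Selection = Union[int, Tuple[int, int], frozenset]

@dataclass
class Constraint:
    dimension: int          # D-INDEX
    level: int              # H-LEVEL (0 = finest granularity, the default)
    values: Selection       # SELECTED-VALUES (an "ALL" constraint is omitted)

@dataclass
class CubeQuery:
    aggr: str               # AGGR-OPERATOR, e.g. "SUM" or "COUNT"
    measure: str            # MEASURES, e.g. "SALES"
    constraints: List[Constraint]

# q = SUM_SALES((Time, Month): [Mar, Oct]; (Location, City): Orlando),
# with symbolic names already mapped to integers (encodings made up):
q = CubeQuery("SUM", "SALES", [
    Constraint(dimension=1, level=1, values=(3, 10)),  # Time/Month: Mar..Oct
    Constraint(dimension=2, level=1, values=8),        # Location/City: Orlando
])
print(q.aggr, q.measure, len(q.constraints))           # SUM SALES 2
```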
3.2 Choosing the Optimal Derived Tree
Given a family of trees, more than one tree may be able to answer a given query; however, the smallest possible one is preferred. The desired tree is the one that represents the highest levels of abstraction while still matching the query. This tree can be found using our query-matching scheme, which is based on the following definition.
Definition 3.1: A tree matrix (TM) is a matrix in which rows represent dimensions, columns represent hierarchy levels, and the row/column intersections contain the labels of the STs in the family. Intuitively, a TM describes the views that are materialized in the hierarchies.
Fig. 3. A sample tree matrix
Continuing with our sample family of trees from Sec. 2.3, T1 has degrees 101, 101, 11. Its first two dimensions have not been rolled up (level 0 of the hierarchy), whereas the third dimension has been rolled up to level 1; hence T1 is placed in column 1 for rows 1 and 2 and in column 2 for row 3. The remaining trees are placed accordingly, as shown in Fig. 3. Next, we provide three definitions for representing hierarchies and for what constitutes an optimal match between a query and an ST.
Definition 3.2: A query hierarchy vector (QHV) is a vector (l1, l2, ..., lk), where li is the hierarchy level at which the query constrains the ith dimension.
Definition 3.3: A QHV (l1, l2, ..., lk) matches view T in the TM iff li ≥ ci for all i = 1, 2, ..., k, where ci is the column index of the ith row entry of T in the TM. In this case, T's column index vector (c1, c2, ..., ck), which is composed of T's column indices, is componentwise less than or equal to the QHV.
Definition 3.4: An optimal matching tree with respect to a query is the tree in TM that has the largest index number and matches the query's QHV.
Consider the sample cube query q = count((1,1):3, (2,0):2, (3,1):5). Its QHV is (1,0,1). We start by matching the smallest tree, T3. Since the column index vector of T3 is (1,1,1), which does not match (1,0,1), T3 is not considered. By the same token, the second smallest tree, T2, is also not an optimal match. However, the column index vector of T1 is (0,0,1) ≤ (1,0,1), i.e., T1 matches the QHV. Hence T1 is the optimal matching tree, which is used to answer q.
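The matching procedure of Definitions 3.3 and 3.4 reduces to a componentwise comparison, as the following sketch shows; encoding the family as a list of column index vectors is our assumption.

```python
# Sketch of Definitions 3.3 and 3.4: the family is encoded by the column
# index vector of each tree (an assumed representation).
def matches(qhv, col):
    """Def. 3.3: T matches iff its column index vector is componentwise
    less than or equal to the QHV."""
    return all(c <= l for c, l in zip(col, qhv))

def optimal_tree(qhv, column_vectors):
    """Def. 3.4: among the matching trees, pick the one with the largest
    index number, i.e. try the smallest trees first."""
    for i in reversed(range(len(column_vectors))):
        if matches(qhv, column_vectors[i]):
            return i
    return None

# Family from Sec. 2.3: T0=(0,0,0), T1=(0,0,1), T2=(0,1,1), T3=(1,1,1).
vectors = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
print(optimal_tree((1, 0, 1), vectors))    # 1: T1 answers the sample query
```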
Fig. 4. Using T1 to answer a cube query
3.3 Rewriting and Answering the Query
Each query is evaluated using its optimal matching ST. Before evaluation, the query must be rewritten if its QHV is not equal to the ST's column index vector. For example, the query q = count((1,1):3, (2,0):2, (3,1):5) must be rewritten into q' = count([30,39], 2, (3,1):5) so that its new QHV (0,0,1) exactly equals the levels of the optimal ST (in this case T1).
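In this example, the level-1 value 3 on the first dimension expands to the base-level range [30,39] because each level-1 group covers ten base values. A sketch of this rewriting step, assuming uniform groups of ten as in the running example, follows; the names and tuple encoding are ours.

```python
# Sketch of the rewriting step, assuming uniform groups of ten values.
def rewrite(constraints, tree_levels, group_size=10):
    """constraints: one (level, value) pair per dimension;
    tree_levels: the optimal ST's column index vector. Any constraint at
    a level above the materialized one is expanded to a base-level range."""
    rewritten = []
    for (level, value), tree_level in zip(constraints, tree_levels):
        if level > tree_level:
            lo = value * group_size          # e.g. level-1 value 3 -> [30, 39]
            rewritten.append((tree_level, (lo, lo + group_size - 1)))
        else:
            rewritten.append((level, value))
    return rewritten

# q = count((1,1):3, (2,0):2, (3,1):5) on T1, whose levels are (0,0,1):
print(rewrite([(1, 3), (0, 2), (1, 5)], (0, 0, 1)))
# [(0, (30, 39)), (0, 2), (1, 5)]  i.e. q' = count([30,39], 2, (3,1):5)
```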
Next, we compute the query using T1 by applying the original query evaluation algorithm of CubiST. Due to space limitations, we can only highlight the query processing steps using an example; complete details can be found in [7]. Fig. 4 displays T1 after initialisation. To answer the query q = count(2, [1,3], {1,3}), we follow the second pointer of the root to the node at level 2 and then the first three pointers to the nodes at level 3. From there, we follow the first and third pointers to the leaves that contain the partial aggregates for count (shown in bold). The sum of these counts is the final answer. The paths taken by the algorithm are shown as dashed lines.
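To sketch this final traversal in code: a recursive descent follows the selected pointers at each constrained level and sums the leaf counts reached, mirroring the dashed paths of Fig. 4. The nested-dict ST layout and the tiny sample data set are our assumptions.

```python
# Sketch of the final evaluation step over a small COUNT-ST.
STAR = "*"

def make_tree(cards, level=0):
    """Empty COUNT-ST as nested dicts (same layout as the Sec. 2.1 sketch)."""
    if level == len(cards):
        return {"count": 0}
    return {k: make_tree(cards, level + 1)
            for k in list(range(1, cards[level] + 1)) + [STAR]}

def insert(node, rec, level=0):
    """Follow both the value and the star pointer at every level."""
    if level == len(rec):
        node["count"] += 1
        return
    insert(node[rec[level]], rec, level + 1)
    insert(node[STAR], rec, level + 1)

def evaluate(node, sels, level=0):
    """sels[level] is the list of selected values for that level, or None
    for an unconstrained (ALL) dimension, which follows the star pointer."""
    if level == len(sels):
        return node["count"]
    if sels[level] is None:
        return evaluate(node[STAR], sels, level + 1)
    return sum(evaluate(node[v], sels, level + 1) for v in sels[level])

root = make_tree([3, 3, 3])
for rec in [(2, 1, 1), (2, 2, 3), (2, 3, 2), (1, 1, 1)]:
    insert(root, rec)
# q = count(2, [1,3], {1,3}): value 2, range 1..3, partial set {1, 3}
print(evaluate(root, [[2], [1, 2, 3], [1, 3]]))    # 2
```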