Aggregate Function Computation and Iceberg Querying in Vertical Database
Yue Cui
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
ABSTRACT
Aggregate function computation and iceberg querying are important and common in many applications of data mining and data warehousing, because people often look for anomalies or unusual patterns by computing aggregate functions across many attributes or by finding aggregate values above specified thresholds. In this paper, we present efficient algorithms to compute aggregate functions and evaluate iceberg queries using P-trees, an innovative data structure for vertical databases. Our methods perform well for most of the common aggregate functions, such as Count, Sum, Average, Min, Max, Median, Rank, and Top-k, and in some situations they are the best available. Our algorithms use very little memory and run fast because, in most cases, we pass over the data only once and implement the entire task with logical AND/OR operations.
1. INTRODUCTION
1.1 Background
Aggregate functions [1] computed across many attributes are widely used in data mining and data warehousing queries. The commonly used aggregate functions include Count, Sum, Average, Min, Max, Median, Rank, and Top-k. A common class of queries in data mining and data warehousing is the iceberg query [2], which performs an aggregate function across attributes and then eliminates the aggregate values that fall below some specified threshold. Iceberg queries are so called because the number of above-threshold results is often very small (the tip of an iceberg) relative to the large amount of input data (the iceberg). Iceberg queries are also common in many other applications, including information retrieval, clustering, and copy detection [2].
Fast implementation of aggregate functions and iceberg queries is a constant challenge in data warehouses, which may contain millions of records. One method for improving the speed of iceberg queries is to pre-compute the iceberg cube [3], which stores all of the above-threshold aggregate information for a dataset. When a user issues an iceberg query, the system looks up the answer in the iceberg cube instead of computing the result from the original dataset, which accelerates on-line processing. Efficient algorithms for computing the iceberg cube have been widely studied [3].
Recently, a new data structure, the Peano tree (P-tree) [4], has been introduced for large datasets. The P-tree is a lossless, quadrant-based compression data structure that provides fast, efficient logical operations. One of its variations, the predicate P-tree, efficiently reduces data accesses by filtering out "bit holes," which consist of runs of consecutive 0's [4].
In this paper, we investigate how to efficiently implement aggregate functions and iceberg queries using P-trees. We make two contributions: first, we develop efficient algorithms that compute the various aggregate functions using P-trees; second, we illustrate a procedure for implementing iceberg queries using P-trees. The main operations in our algorithms and procedures are counting the number of 1s in P-trees and performing logical AND/OR operations on P-trees. These operations map directly onto fast hardware instructions and are therefore very cheap. As the experiments in Section 6 show, our methods are superior for computing aggregate functions and iceberg queries in many cases.
This paper is organized as follows. Section 2 is a literature review, which introduces the concepts of aggregate functions and iceberg queries. Section 3 reviews the P-tree technology. Section 4 gives the algorithms for the aggregate functions using P-trees. Section 5 gives an example of how to implement iceberg queries using P-trees. Section 6 describes our experimental results, in which we compare our method with the bitmap index. Section 7 concludes the paper.
2. LITERATURE REVIEW
2.1. Aggregate Functions
Frequently, people are interested in summarizing data to determine trends or produce top-level reports. For example, a purchasing manager may not be interested in a listing of all computer hardware sales, but may simply want to know the number of notebooks sold in a specific month. Aggregate functions assist with the summarization of large volumes of data.
Three types of aggregate functions were identified [1]. Consider aggregating a set of tuples, T. Let {Si | i = 1 … n} be any complete set of disjoint subsets of T such that ∪i Si = T and ∩i Si = {}.
Distributive: An aggregate function F is distributive if there is a function G such that F(T) = G({F(Si) | i = 1 … n}). SUM, MIN, and MAX are distributive with G = F. Count is distributive with G = SUM.
Algebraic: An aggregate function F is algebraic if there is an M-tuple-valued function G and a function H such that F(T) = H({G(Si) | i = 1 … n}). Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic. For Average, the function G records the sum and count of each subset. The function H adds the component sums and counts, then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of a group of objects, and other algebraic functions. The key to algebraic functions is that a fixed-size result (an M-tuple) can summarize the sub-aggregation.
Holistic: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant M such that an M-tuple characterizes the computation F. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
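To make the first two classes concrete, the following Python sketch (our own illustration, not code from [1]) computes SUM distributively and AVERAGE algebraically over disjoint partitions; each partition is summarized by a fixed-size value or tuple:

def sum_distributive(partitions):
    # Distributive: F(T) = G({F(Si)}); for SUM, G = F = sum.
    return sum(sum(s) for s in partitions)

def avg_algebraic(partitions):
    # Algebraic: G(Si) is the fixed-size 2-tuple (sum, count);
    # H adds the component sums and counts, then divides.
    pairs = [(sum(s), len(s)) for s in partitions]
    total = sum(p[0] for p in pairs)
    count = sum(p[1] for p in pairs)
    return total / count

partitions = [[3, 7], [1, 5, 2], [4]]   # disjoint subsets S1, S2, S3
print(sum_distributive(partitions))     # 22
print(avg_algebraic(partitions))        # 3.666...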
Efficient computation of all these aggregate functions is required in most large database applications. There are many efficient techniques for computing distributive and algebraic functions [2], but only a few for holistic functions such as Median. As we show later, our method for computing Median is among the best of the available techniques.
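To see why a vertical, bit-sliced layout helps with a holistic function, consider the following sketch. It finds the k-th smallest of a set of unsigned m-bit values by scanning bit positions from most to least significant and counting candidates, which is exactly the kind of counting P-trees provide. This is our own illustration of the general idea, not the exact algorithm given in Section 4:

def kth_smallest_bitsliced(values, k, m):
    # Determine the result one bit at a time, from the most
    # significant position down, keeping only candidate values.
    candidates = list(values)
    result = 0
    for i in range(m - 1, -1, -1):
        zeros = [v for v in candidates if not (v >> i) & 1]
        if k < len(zeros):
            candidates = zeros            # bit i of the answer is 0
        else:
            k -= len(zeros)               # skip the values with bit i = 0
            candidates = [v for v in candidates if (v >> i) & 1]
            result |= 1 << i              # bit i of the answer is 1
    return result

vals = [7, 2, 5, 1, 6]
print(kth_smallest_bitsliced(vals, len(vals) // 2, 3))   # 5, the median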
2.2. Iceberg Queries
Iceberg queries are a class of queries that compute aggregate functions across attributes and keep only the aggregate values above some specified threshold. Given a relation R with attributes a1, a2, …, an, a measure attribute m, an aggregate function AggF, and a threshold T, an iceberg query has the following form:
SELECT R.a1, R.a2, …, R.an, AggF(R.m)
FROM relation R
GROUP BY R.a1, R.a2, …, R.an
HAVING AggF(R.m) >= T
The number of tuples that satisfy the threshold in the HAVING clause is relatively small compared to the large amount of input data. The output can be seen as the "tip of the iceberg," where the input data is the "iceberg."
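Conceptually, an iceberg query is just grouping followed by thresholding, as in this Python sketch of the naive horizontal evaluation (our illustration only; the attribute names follow the generic form above):

from collections import defaultdict

def iceberg_sum(rows, group_cols, measure, threshold):
    # GROUP BY group_cols, then HAVING SUM(measure) >= threshold.
    sums = defaultdict(int)
    for row in rows:
        sums[tuple(row[c] for c in group_cols)] += row[measure]
    return {key: s for key, s in sums.items() if s >= threshold}

rows = [{"a1": "x", "a2": "u", "m": 3},
        {"a1": "x", "a2": "u", "m": 4},
        {"a1": "y", "a2": "v", "m": 2}]
print(iceberg_sum(rows, ["a1", "a2"], "m", 5))   # {('x', 'u'): 7}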
Suppose a purchasing manager is given a sales transaction dataset and wants to know, for every product type in each local store, the total number of products sold, keeping only the totals above a certain threshold T. To answer this question, we can use the iceberg query below:
SELECT Location, Product Type, Sum(# Product)
FROM Relation Sales
GROUP BY Location, Product Type
HAVING Sum(# Product) >= T
To implement an iceberg query, a common strategy in a horizontal database is first to apply hashing or sorting to all the data in the dataset, then to count all of the Location & Product Type pair groups, and finally to eliminate the groups that do not pass the threshold T. These algorithms can generate significant I/O for intermediate results and require large amounts of main memory, so they leave much room for improvement. One improvement is to prune groups using an Apriori-like method [5] [6], but the Apriori-like method is not equally simple to apply to all aggregate functions: for instance, it is efficient for calculating Sum but of little use for Average [4].

In our example, instead of counting the number of tuples in every Location & Product Type pair group as the first step, we can do the following. First, generate the Location list, the local stores that sell more than T products:
SELECT Location, Sum(# Product)
FROM Relation Sales
GROUP BY Location
HAVING Sum(# Product) >= T
Next, generate the Product Type list, the product types that sell more than T products:
SELECT Product Type, Sum(# Product)
FROM Relation Sales
GROUP BY Product Type
HAVING Sum(# Product) >= T
From the knowledge generated above, we can eliminate many of the Location & Product Type pair groups: we generate candidate pairs only for stores that appear in the Location list and product types that appear in the Product Type list, as sketched below. This approach improves efficiency by pruning many groups beforehand. Then, by ANDing value P-trees, we can calculate the results easily. We illustrate our method in detail by example in Section 5.
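A minimal Python sketch of this pruning strategy on horizontal rows (our own illustration, not the P-tree procedure of Section 5; the column names location, ptype, and n_products are hypothetical, and the measure is assumed non-negative, which is what makes Sum prunable):

from collections import defaultdict

def iceberg_sum_pruned(rows, threshold):
    # First pass: per-attribute totals.  A (Location, Product Type)
    # group can reach the threshold only if each of its attribute
    # values reaches it on its own.
    loc_sum, type_sum = defaultdict(int), defaultdict(int)
    for r in rows:
        loc_sum[r["location"]] += r["n_products"]
        type_sum[r["ptype"]] += r["n_products"]
    loc_list = {l for l, s in loc_sum.items() if s >= threshold}
    type_list = {t for t, s in type_sum.items() if s >= threshold}
    # Second pass: aggregate only the surviving candidate pairs.
    pair_sum = defaultdict(int)
    for r in rows:
        if r["location"] in loc_list and r["ptype"] in type_list:
            pair_sum[(r["location"], r["ptype"])] += r["n_products"]
    return {pair: s for pair, s in pair_sum.items() if s >= threshold}

rows = [{"location": "Fargo", "ptype": "Notebook", "n_products": 40},
        {"location": "Fargo", "ptype": "Desktop", "n_products": 5},
        {"location": "Moorhead", "ptype": "Notebook", "n_products": 3}]
print(iceberg_sum_pruned(rows, 10))   # {('Fargo', 'Notebook'): 40}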
3. REVIEW OF P-TREES
A Peano tree (P-tree) is a lossless, bitwise tree. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, etc. In this section, we focus on 1-dimensional P-trees. We briefly review the structure and operations of P-trees and introduce two variations of predicate P-trees that are used in our iceberg query algorithm.
3.1 Structure of P-trees
Given a data set with d attributes, X = (A1, A2, …, Ad), and the binary representation of the jth attribute Aj as bj,m bj,m-1 … bj,i … bj,1 bj,0, we decompose each attribute into bit files, one file for each bit position [1]. To build a P-tree, a bit file is recursively partitioned into halves and each half into sub-halves until each sub-half is pure (entirely 1-bits or entirely 0-bits).
The detailed construction of P-trees is illustrated by the example in Figure 1. The transaction set is shown in a). For simplicity, assume each transaction has one attribute. We represent the attribute as binary values, e.g., (7)10 = (111)2. We then vertically decompose the attribute into three separate bit files, one for each bit position, as shown in b). The corresponding basic P-trees, P1, P2, and P3, are constructed by recursive partition and are shown in c), d), and e).
As shown in c) of Figure 1, the root of the P1 tree is 3, which is the 1-bit count of the entire bit file. The second level of P1 contains the 1-bit counts of the two halves, 0 and 3. Since the first half is pure, there is no need to partition it. The second half is partitioned further recursively.
Figure 1. Construction of 1-D P-trees.
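A minimal sketch of this construction in Python, assuming an uncompressed count tree over a bit file whose length is a power of two (the node layout is our own; the paper's implementation may differ):

def build_ptree(bits):
    # Each node stores the 1-bit count of its half; a mixed half
    # (neither pure-0 nor pure-1) is recursively split in two.
    count = sum(bits)
    node = {"count": count}
    if 0 < count < len(bits):
        mid = len(bits) // 2
        node["left"] = build_ptree(bits[:mid])
        node["right"] = build_ptree(bits[mid:])
    return node

# The bit file 0 0 0 0 0 1 1 1 yields root count 3 with halves 0 and 3,
# and only the mixed second half is partitioned further, as in Figure 1.
print(build_ptree([0, 0, 0, 0, 0, 1, 1, 1]))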
3.2 P-tree Operations
AND, OR, and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees, called Pure-1 trees (P1-trees). A tree is pure-1 if all the values in its subtree are 1's. A node in a P1-tree is a 1-bit if and only if the corresponding half is pure-1. Figure 2 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 1.
Figure 2. P1-trees for the Transaction Set.
The P-tree logic operations are performed level by level, starting from the root. They are commutative and distributive, since they are simply pruned bit-by-bit operations. For instance, ANDing a pure-0 node with anything results in a pure-0 node, and ORing a pure-1 node with anything results in a pure-1 node. In Figure 3, a) is the result of ANDing P1,1 and P1,2, b) is the result of ORing P1,1 and P1,3, and c) is the result of NOT P1,3 (i.e., P1,3'), where P1,1, P1,2, and P1,3 are shown in Figure 2.
Figure 3. AND, OR and NOT Operations.
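For P1-trees, the level-by-level, pruned AND can be sketched as follows, assuming a node is either the literal bit 0 or 1 (a pure half) or a (left, right) pair of subtrees (an encoding of our own, for illustration):

def ptree_and(a, b):
    # Pruning rules: pure-0 AND anything = pure-0;
    # pure-1 AND anything = the other operand.
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[0], b[0]), ptree_and(a[1], b[1])
    return 1 if left == right == 1 else (left, right)

# Bit files 1100 and 1010 as P1-trees; their AND is the file 1000.
t_1100 = (1, 0)                   # halves [1,1] and [0,0]
t_1010 = ((1, 0), (1, 0))         # halves [1,0] and [1,0]
print(ptree_and(t_1100, t_1010))  # ((1, 0), 0)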
3.3. Predicate P-trees
There are many variations of predicate P-trees, such as value P-trees, tuple P-trees, inequality P-trees [8][9][10][11], etc. In this section, we describe value P-trees and inequality P-trees, which are used in evaluating range predicates.
3.3.1. Value P-trees
A value P-tree represents a data set X related to a specified value v, denoted Px=v, where x ∈ X. Let v = bmbm-1…b0, where bi is the ith binary bit of v. There are two steps to calculate Px=v: 1) for each bit position of v, select the bit P-tree according to the bit value, Pb,i = Pi if bi = 1 and Pb,i = Pi' otherwise; 2) calculate Px=v by ANDing all the bit P-trees of v, i.e., Px=v = Pb,m ∧ Pb,m-1 ∧ … ∧ Pb,0, where ∧ denotes the AND operation. For example, to get the value P-tree satisfying x = 101 in Figure 1, we have Px=101 = Pb,2 ∧ Pb,1 ∧ Pb,0 = P1 ∧ P2' ∧ P3, where P1, P2, and P3 are the basic P-trees of the three bit positions, most significant first.
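On uncompressed bit vectors, which is what the compressed P-tree operations compute, the two steps look like this (a Python sketch under the assumption that P[i] holds bit position i of every tuple, most significant first):

def value_ptree(P, v_bits):
    # Step 1: pick P[i] where bit i of v is 1, its complement where 0.
    # Step 2: AND the chosen vectors together.
    result = [1] * len(P[0])
    for bits, b in zip(P, v_bits):
        result = [r & (x if b else 1 - x) for r, x in zip(result, bits)]
    return result

# Bit positions (most significant first) of the four tuples 5, 1, 5, 6:
P = [[1, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 1, 1, 0]]
print(value_ptree(P, [1, 0, 1]))   # [1, 0, 1, 0]: the tuples equal to 101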
3.3.2. Inequity P-trees
An inequality P-tree represents the data points in a data set X that satisfy an inequality predicate, such as x > v, x ≥ v, x < v, or x ≤ v. Without loss of generality, we discuss the two inequality P-trees for x ≥ v and x ≤ v, denoted Px≥v and Px≤v, where x ∈ X and v is a specified value. They are calculated as follows.
Calculation of Px≥v: Let x be an m-bit value in the data set X, and let Pm, Pm-1, …, P0 be the P-trees for the vertical bit files of X. Let v = bm…bi…b0, where bi is the ith binary bit of v, and let Px≥v be the predicate tree for the predicate x ≥ v. Then Px≥v = Pm opm … Pi opi Pi-1 … opk+1 Pk, k ≤ i ≤ m, where 1) opi is ∧ if bi = 1 and ∨ otherwise, 2) k is the rightmost bit position with value "1," i.e., bk = 1 and bj = 0 for j < k (positions below k drop out, since any bit suffix is ≥ 0…0), and 3) the operators are right binding, i.e., associated from right to left; e.g., P2 op2 P1 op1 P0 is equivalent to (P2 op2 (P1 op1 P0)). Here, ∧ denotes the AND operation and ∨ the OR operation. For example, the inequality tree Px≥101 = (P2 ∧ (P1 ∨ P0)).
Calculation of Px≤v: The calculation of Px≤v is similar to that of Px≥v. Let x be an m-bit value in the data set X, and let P'm, P'm-1, …, P'0 be the complement P-trees for the vertical bit files of X. Let v = bm…bi…b0, where bi is the ith binary bit of v, and let Px≤v be the predicate tree for the predicate x ≤ v. Then Px≤v = P'm opm … P'i opi P'i-1 … opk+1 P'k, k ≤ i ≤ m, where 1) opi is ∧ if bi = 0 and ∨ otherwise, 2) k is the rightmost bit position with value "0," i.e., bk = 0 and bj = 1 for j < k, and 3) the operators are right binding. For example, the inequality tree Px≤101 = (P'2 ∨ P'1). We will use value P-trees and inequality P-trees frequently in our calculation of aggregate functions and iceberg queries.
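The right-binding evaluation of Px≥v can likewise be sketched on uncompressed bit vectors (our own illustration, reusing the layout of the previous sketch; note how the positions below the rightmost 1-bit of v drop out, mirroring condition 2):

def geq_ptree(P, v_bits):
    # P_{x>=v} = Pm op_m (... (P_{k+1} op_{k+1} P_k)...), where op_i is
    # AND if bit i of v is 1 and OR if it is 0, and k is the rightmost
    # 1-bit of v.
    pairs = list(zip(P, v_bits))            # most significant first
    while pairs and pairs[-1][1] == 0:      # drop trailing 0-bits of v
        pairs.pop()
    if not pairs:                           # v = 0: every tuple qualifies
        return [1] * len(P[0])
    acc = list(pairs[-1][0])                # start from P_k
    for bits, b in reversed(pairs[:-1]):
        acc = [(x & a) if b else (x | a) for x, a in zip(bits, acc)]
    return acc

# Bit positions (most significant first) of the four tuples 5, 1, 5, 6;
# P_{x>=101} = P2 AND (P1 OR P0), as in the example above.
P = [[1, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 1, 1, 0]]
print(geq_ptree(P, [1, 0, 1]))   # [1, 0, 1, 1]: the tuples 5, 5, and 6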