Classification of Spatial Data by Decision Tree Induction[1]

Abstract

Spatial data mining is a very important field due to the very large quantities of spatial data collected in various application areas, including remote sensing, geographical information systems (GIS), computer cartography, environmental assessment and planning, etc. In most cases the spatial data sizes are too large to be mined in a reasonable amount of time using existing methods. A new spatial data organization, band Sequential organization (bSQ) and a new data-mining ready data structure, the Peano Count Tree (PC-trees) provide a lossless, compressed representation of a spatial dataset which facilitate classification and other data mining techniques markedly. In this paper we describe the new data organization and the new data structure. We compare decision tree induction classification using these structures with classical decision tree induction to justify their use in spatial data mining.

Keywords

Data mining, Classification, Decision Tree Induction, Spatial Data, Peano ordering

1.  Introduction

Classification is one of the important problems of data mining. In classification, a training or learning set is identified for the construction of a classifier algorithm. Each record in the learning set has several attributes. There is one attribute, called goal or class label attribute, which indicates the class to which each record belongs. The task of classifier is to accurately predict the class label of new records which do not have a class label attribute value yet. There are many important applications of this classification process.

A Test set is used to test the accuracy of the classifier once it has been developed using the learning dataset. The classifier, once certified, is used to predict the class label of future unclassified data. Different models have been proposed for classification, such as Decision Trees, Neural Networks, Bayesian belief networks, Fuzzy sets, generic models and so on. Among these models, decision trees are widely used for classification. We focus on inducing decision trees in this paper. ID3 (and its variants such as C4.5) [1] [2] and CART [4] are among the best known classifiers which use decision trees. Other decision tree classifiers include Interval Classifier [3] and SPRINT [3] [5]. Classification has been applied in many fields, such as retail target marketing, customer retention, fraud detection and medical diagnosis [14]. Spatial data is a promising area for classification. However, due to the large size of spatial data, such as satellite images, the existing methods are not very suitable. In this paper, we propose a new decision tree based model to perform classification on spatial data. The application focus of this paper is the classification productivity levels in agricultural fields.

A new lossless data structure, Peano Count Tree (PC-tree), is used in the model. PC-trees represent spatial data bit by bit in a recursive quadrant-by-quadrant arrangement. With the information in PC-trees, we can build the decision tree much faster. The rest of the paper is organized as follows. In section 2, we introduce the data formats of spatial data and propose a new format, bSQ. We also describe the PC-tree data structure(and its variants) and algebra. In Section 3, we detail our classifier to perform decision tree induction using PC-trees. Performance analysis is given in Section 4. Section 5 discuss some related work. Finally, there is a conclusion section.

2.  Data Structures

2.1  PC-trees

Most spatial data comes in a format called BSQ for Band Sequential (or can be easily converted to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values within a band is raster ordering with respect to the spatial area represented in the dataset. In this paper, we divided each BSQ band into several files, one for each bit position of the data values. We call this format bit Sequential or bSQ. A Landsat Thematic Mapper satellite image, for example, is in BSQ format with 7 bands, B1,…,B7, (Landsat-7 has 8) and ~40,000,000 8-bit data values. In this case, the bSQ format will consist of 56 separate files, B11,…,b78, each containing ~40,000,000 bits. A typical TIFF image aerial digital photograph is in what is called Band Interleaved by Bit (BIP) format in which there is one file which might containing ~24,000,000 bits ordered by bit-position, then band and then raster-ordered-pixel-location. A simple transform can be used to convert TIFF images to BSQ and then to bSQ format.

We organize each bSQ bit file, Bij, into a tree structure, called a Peano Count Tree (PC-tree). A PC-tree is a quadrant-based tree. The root of a PC-tree contains the 1-bit count of the entire bit-band. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants and their 1-bit counts in raster order constitute the children of the quadrant node. This is construction is continued recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (1-by-1 sub-quadrant). PC-trees are somewhat similar in construction to other data structures in the literature (e.g., HHcodes [12]).

For example, the PC-tree for a 8-row-8-column bit-band is shown in Figure 1.

Figure 1. 8-by-8 image and its PC-tree

In this example, 55 is the count of 1’s in the entire image, the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrant is made up of entirely 1-bits, we do not need sub-trees for these two quaThis pattern is continued recursively. Recursive raster ordering is called the Peano or Z-ordering in the literature – therefore, the name Peano Count trees. The process terminates at the “leaf” level (level-0) where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including those for quadrants that are pure 1-bits, then the leaf sequence is just the Peano space-filling curve for the original raster image. We note that, the fan-out of a P-tree need not necessarily be 4. It can be any power of 4 (effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce good compression for each bSQ band. Before going further we point out that, for measurements that can be expected to exhibit reasonable spatial continuity, the Hilbert ordering will produce even better compression that Peano. However, the Hilbert ordering is a much less intuitive and a more complex ordering. By using Hilbert ordering instead of Peano, we find that other nice mapping properties are lost. For that reason, we have settled on the Peano ordering. Finally, we note that the same general construction can be used for spatial data of more than 2-dimensions. For 3-dimensional data, for instance, at each level, we partition into octants, and so forth.

For each band (assuming 8-bit data values), we get 8 basic PC-trees, one for each bit positions. For band, Bi, we will label the basic PC-trees, Pi,1, Pi,2, …, Pi,8, thus, Pi,j is a lossless representation of the jth bits of the values from the ith band. However, Pij provides much more information and are structured to facilitate many important data mining processes. Some of the useful features of PC-trees are listed below.

·  PC-trees contain 1-count for every quadrant of every dimension.

·  The PC-tree for any sub-quadrant at any level is simply the sub-tree rooted at that sub-quadrant.

·  The PC-tree is a partial run-length compression of the original bit-band.

·  Basic PC-trees can be combined to reproduce the original data (lossless representations).

·  PC-trees can be partially combined (to any depth) to produce upper/lower bounds for the counts.

·  PC-trees can be transformed to smooth data by bottom-up quadrant purification (bottom-up replacement of mixed counts with their closest pure counts - see section 3 below).

The mapping from raster coordinates (of a pixel or a quadrant) to Peano coordinates is simply (ijk,mno) à (im,jn,ko), where i,j,k,m,n,o are bits and Peano coordinates (01,11,00), for example, specify the PC-tree sub-tree arrived at by taking the 1st sub-tree pointer from the root, then the 3rd sub-tree pointer and finally the 0th sub-tree pointer. Finally, we note that these PC-tree’s can be generated quite quickly and can be viewed as a “data mining ready”, lossless format for storing spatial data.

The 8 basic PC-trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce PC-trees for the original values (at any level of precision, 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band, b, and value, v, where v can be expressed in 1-bit, 2-bit,.., or 8-bit precision. Using the full 8-bit precision (all 8 –bits) for values, Pb,11010011 can be constructed from the basic PC-trees as:

PCb,11010011 = PCb1 AND PCb2 AND PCb3’ AND PCb4 AND PCb5’ AND PCb6’ AND PCb7 AND PCb8

where ’ indicates the bit-complement (which is simply the count complement in each quadrant). The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ data for a TM satellite image (approximately 60 million pixels) to its basic PC-trees can be done in just a few seconds using a high performance PC computer. This is a one-time process. We also note that we are storing the basic PC-trees in a “breadth-first” data structure, which specifies the pure-1 quadrants only. Using this data structure, each AND can be completed in a few milliseconds and the result counts can be computed easily once the AND and COMPLEMENT program has completed.

Finally, we note that the data in the relational format, REL, can be represented as PC-trees also. For any combination of values, (v1,v2,…,vn), where vi is from band-i, the quadrant-wise count of occurrences of this combination of values is given by: PC(v1,v2,…,vn) = PC1,v1 AND PC2,v2 AND … AND PCn,vn .

For efficient implementation, we first construct a variation of each basic PC-tree, its PM-tree (Pure Mask tree). In the PM-tree, we use a 3-value logic, in which 11 represents a quadrant of pure 1-bits (pure1 quadrant), 00 represents a quadrant of pure 0-bits (pure0 quadrant) and 01 represents a mixed quadrant. To simplify the exposition, we use 1 instead of 11 for pure1, 0 for pure0, and m for mixed. Starting with a bit-band, Bij, we first construct the PM-tree and then count 1-bits from the bottom up to produce the PC-trees when necessary. For many situations, however, the PMT has all the information needed. For instance ANDing trees, it is simpler to AND the PM-trees. Experience has shown that the PM-trees can be stored in a very compact form. Therefore we store only the basic PM-tree and then construct the needed data structure easily from the PM-trees.

The PM-tree for the previous example is:

Figure 2. 8-by-8 image and its PM-tree

The PM-tree is particular useful for the ANDing operation between two PC-trees. The PM-tree specifies the location of the pure1 quadrants of the operands, so that the pure1 result quadrants can be easily identified by the coincidence of pure-1 quadrants in both operands and pure0 result quadrants occur wherever a pure-0 quadrant occurs on at least one of the operands.

2.2  PC-tree Algebra

We begin this section with a description of the PC-tree algebra and of the AND operation in that algebra. The algebra contains operators, AND, OR, NOT and XOR, which are the pixel-by-pixel logical operations on PC-trees The NOT operation is a straightforward translation of each count to its quadrant-complement (e.g., a 5 count for a quadrant of 16 pixels has complement of 11). The AND operations is described in full detail below. The OR is identical to the AND except that the role of the 1-bits and the 0-bits are reversed.

This AND algorithm is used to compose the value PC-trees and to populate the TC-cube. In this algorithm we will assume the PC-trees is coded in its most compact form, a depth-first ordering of the paths to each pure1 quadrant. For the example,

each path is represented by the sequence of quadrants in Peano order, beginning just below the root. Therefore, the Pure1 Quadrant sequence (PQ-sequence) for this example is:

0 100 101 102 12 132 20 21 220 221 223 23 3 .

We will take the following second operand to use as an example for illustrating the AND operation.

A pure1 quadrant will occur in the result only if that quadrant is pure1 in both operands. Therefore, the AND can be done by scanning the operands for matches and outputting those matching pure1quadrants in order. An example is shown below.

Several implementations have been done, the most effective of which simply builds the result by setting pointers to the pure1 operand quadrants that make it up (Note that the AND result contains no new pure1 quadrants, but a selection of those that already exist in both (all) operands. Using this implementation, the results of a whole series of multi-way AND operations can be done in parallel and a large AND program can be completed in a matter of milliseconds (e.g., even for a TM scene ~40 millions of pixels).

PQ-sequence of operand-1 PQ-seq of op-2 PQ-seq of Result

0 100 101 102 12 132 20 21 220 221 223 23 3 & 0 20 21 22 231 à 0 20 21 220 221 223 231

0 0 à 0

20 20 à 20

21 21 à 21

220 221 223 22 à 220 221 223

23 231 à 231

Therefore the result is: