Hashing

1.  Static Hashing

A bucket is a unit of storage containing one or more records. Typically, a bucket is a disk block. In a hash file organization, the bucket to which a record is assigned is derived from its search-key value using a hash function.

Hash function is used to locate records for access, modification and deletion. Note that records with different search-key values may be mapped to the same bucket. Thus, entire bucket has to be searched to locate a record.

Hash functions

·  A bad hash function maps most search-key values to the same bucket. This makes access time proportional to the number of search-key values in the file

·  An ideal hash function has the following properties

(i) uniform : each bucket is assigned the same number of search-key values from the set of all possible values

(ii) random : there is no correlation between the search-key value of a record and the bucket to which the record is assigned.

Because of these properties, each bucket will have the same expected number of records assigned to it.

In general, a random function tends also to be uniform. However, a uniform function may not be random. Suppose we have 20 buckets numbered 1 through 20 and the search-key is age. Suppose a hash function assigns a person record to bucket based on the age value like this. If age is less than 10, the record is assigned to bucket 1, from 10 to 19 bucket 2, from 20 to 29 bucket 3 and etc. This hash function is uniform (it assigns the same number of search-key values to each bucket) but it is not random. There is a clear relationship between a bucket and the search-key values. For example, for bucket 2, we know for sure the records stored in it have search-key values from 10 to 19.

Some common hash functions:

(a) mid-square: The middle of square hash function treats the search-key value or some transformation of the value as a number, squares the number and takes as the hash value a fixed number of bits in the middle of the binary representation of the resulting number.

(b) division: some uses the modulus (%) operator. If n is the number of buckets, a simple division hash function: s modulus n (or s % n)

where s is the search-key value or some transformation of the value. The result of s modulus n is the smallest positive integer remainder when s is divided by n.

Example: The figure in next page shows a hash organization of relation account(branch-name, account-number, balance). The search-key is branch-name. There are 10 buckets. For the computation of the hash value of a branch name, the j-th character is represented by the integer j. The sum of the integer representations of the characters in a branch name is first obtained. Then, the hash value of the branch name is taken as this sum modulo 10 (the number of buckets). For example, for the branch name Mianus, the characters M, i, a, n, u and s are represented by the integers 13, 9, 1, 14, 21 and 19 respectively. The hash value of Mianus is (13 + 9 + 1 + 14 + 21 + 19) modulo 10 = 77 modulo 10 = 7. Thus, the record (Mianus, A-215, 700) is placed in bucket 7.

Bucket Overflows

·  Bucket overflow can occur because of

n  insufficient buckets: the buckets assigned cannot hold all records.

n  skew in distribution of records: some buckets are assigned more records than other buckets

can occur due to : (a) an unusual number of records have the same search-key value

(b)  the hash function chosen produces non-uniform distribution of key values.

·  Bucket overflow is typically handled using overflow buckets; and the overflow buckets of a given bucket are chained together in a linked list -- a technique known as overflow chaining.

·  Another technique known as linear probing can also be used to handle bucket overflow. With linear probing, if the bucket for the insertion of a record is found to be full, the record is inserted into the next bucket that has space. This technique is not common in database applications.

2.  Hash Indices

Hashing can be used not only for file organization, but also for index-structure creation. A hash index organizes the search keys, with their associated record pointers, into a hash file structure. The following shows an example of a hash index.

3.  Deficiencies of Static Hashing

In static hashing, a hash function h maps search key values to a fixed set B of buckets.

¨  Databases grow with time. If initial number of buckets is too small, performance will degrade due to too much overflows

¨  If at the beginning, enough buckets are allocated for future growth, significant amount of space will be wasted initially. If database shrinks after it has grown to a certain point, space will be wasted again.

¨  We can handle this problem by periodic re-organization of the file with a new hash function and a bigger or smaller number of buckets. However, this will be very expensive.

Problems can be avoided if the number of buckets can be changed dynamically.

4.  Dynamic Hashing

Good for database that grows and shrinks in size.

Extendible hashing -- a form of dynamic hashing

·  hash function generates values represented as a b-bit binary integers; typically value of b is 32.

·  At any time, only a prefix of the hash value is used to index into a directory of bucket addresses. Let this length of the prefix be g bits. This number of bits is also called the global depth.

(Note that the prefix are the high order bits. It is also common that the low order bits are used.)

·  Initially, the global depth g = 0

·  The global depth g grows and shrinks as the size of the table grows and shrinks.

·  Note that a global depth of g bits represents 2g different index values, i.e., the directory has 2g entries (or bucket addresses). But, the number of distinct of bucket addresses (or the actual number of buckets) may be less than 2g and changes dynamically due to coalescing and splitting of buckets.

Multiple entries in the directory may point to a bucket. Each bucket j keeps a value l called the local depth. The index values of all directory entries pointing to bucket j all have the same first l high order bits. As an index value has g bits, there are 2g - l directory entries pointing to bucket j.

Note that the hash values of all records in bucket j also have the same values on the first l high order bits.

To locate the bucket containing search-key Kj :

(i) compute h(Kj) = X

(ii) use the first g high order bits (g = global depth) of X as an index value to locate the directory entry, which contains the pointer to the appropriate bucket

To insert a record with search-key Ki , follow same procedure as look-up to locate the bucket i , say. If there is room in bucket i , insert record in the bucket; otherwise the bucket must be split and insertion re-attempted.

To split a bucket j when inserting record with search-key value Kj :

·  If g > l (global depth > local depth), more than one pointer points to bucket j
(2g - l such pointers)

(i) allocate a new bucket z, and set the local depth lz of new bucket and the local depth of bucket j to (l + 1)

(ii) make the second half of the 2g - l directory entries originally pointing to bucket j to point to the new bucket z.

(iii) for each record of the records originally in bucket j and the new record, check the (j + 1)-th bit; if this bit is 0, keep it in bucket j and if it is 1, insert it in new bucket z

n  in case all records are put into the same bucket, may need to split a bucket again

·  if g = l (only one directory entry points to bucket j)

(i)  increase g (global depth) by 1 and double the size of the directory

(ii)  replace each entry in the directory by two adjacent entries that point to the same bucket

(For example, suppose originally the global depth is 3. The first directory entry has index value 000. When the directory is doubled, the first directory entry has index value 0000 and the second directory entry has index value 0001. As now the first 3 bits of the index values of these 2 entries are the same (the local depths of buckets have not changed yet), both these directory entries must have the same pointer.)

(iii)  now the global depth g is bigger than l ( g has been increased), use the steps above for g > l to insert the record

·  When inserting a value, a bucket may be full after several splits (this happens if the hash values of all the records in the bucket have the same value for some long prefix or for a large number of high order bits). In this case, create an overflow bucket instead of splitting the bucket or doubling the directory

To delete a search-key value :

·  Locate the record(s) with the search-key value and remove it (them). The bucket itself may be removed if it becomes empty; and the directory entry for the deleted bucket should also be changed appropriately. If there is a separate buddy bucket, set the directory entry for the deleted bucket equal to the directory entry for the buddy bucket and decrease the local depth of the buddy bucket by 1. Otherwise, set both directory entries (for the bucket and the buddy bucket) to null.

Two buckets are buddies if the index values for their corresponding directory entries are the same except for the last one bit (of the g bits). (for example, suppose the global depth g is 3. The index value for a bucket is 010 and the index value for another bucket is 011. Then, these two buckets are buddies.)

If the local depth of each bucket is less than the global depth (i.e., each bucket has more than one pointers in the directory pointing to it), the directory may be reduced by half.

Example: Suppose a bucket can accommodate only two records.

branch-name h(branch-name)

Brighton 0010 1101 1111 1011 0010 1100 0011 0000

Downtown 1010 0011 1010 0000 1100 0110 1001 1111

Mianus 1100 0111 1110 1101 1011 1111 0011 1010

Perryridge 1111 0001 0010 0100 1001 0011 0110 1101

Redwood 0011 0101 1010 0110 1100 1001 1110 1011

Round Hill 1101 1000 0011 1111 1001 1100 0000 0001

After inserting two records,

After 3 insertions,

Before inserting the 4-th record,

After inserting the 4-th record,

After inserting all records,

5.  Comparison of Ordered Indexing and Hashing

·  Hashing is generally better at retrieving records having a specified value of the search-key

ordered index -- takes time proportional to the log of the number of search-key values

hashing -- constant time in general; could take time proportional to the number of search-key values (when many records are assigned to a bucket), but this is rare.

·  For range queries, ordered indices are preferred.

Hashing--the next record of a record may not be located in the same bucket; need to look up the record(s) for every search-key values within the specified range

6.  Multi-key Access

·  For certain types of queries, can use multiple indices

Example: There are two indices: one on branch-name and one on balance Consider the query: select account-number

from account

where branch-name = 'Perryridge' and balance = 1000;

Possible strategies:

(1) use index on branch-name to find Perryridge records and test balance = 1000

(2)  use index on balance to find records with balances of $1000; and test branch-name = 'Perryridge'

(3)  Use branch-name index to find pointers to all records pertaining to the Perryridge Branch. Similarly use index on balance. Take intersection of both sets of pointers

Problems: may have many records that satisfy each individual condition and only a few records satisfying both conditions.

A better strategy is to have an index on combined search-key. For the above example, the combined search-key is (branch-name, balance)

Using this index on the combined search-key, the query can be processed efficiently.

7.  Grid files

·  Structure used to speed the processing of general multiple search-key queries

·  The grid file has a single grid array and one linear scale for each search-key attribute. The grid array has a number of dimensions equal to the number of search-key attributes

·  Multiple cells of grid array can point to same bucket

·  To find the bucket for a search-key value, locate the row and column of the cell using the linear scales and follow pointer

·  If a bucket becomes full, new bucket can be created if more than one cell points to it. If only one cell points to it, overflow bucket needs to be created.

·  A bucket contains index records that point to records in the table.

An example of a grid file is given on the next page.

The linear scale for branch-name divides the branch names into 5 ranges :

branch-name < Central : index = 0; Central £ branch-name < Minaus : index = 1; Minaus £ branch-name < Perryridge : index = 2; Perryridge £ branch-name < Townsend : index = 3 and Townsend £ branch-name : index = 4

The linear scale for balance divides the balance values into 7 ranges:

balance < 1K : index = 0; 1K £ balance < 2K : index = 1; 2K £ balance < 5K : index = 2; 5K £ balance < 10K : index = 3; 10K £ balance < 50K : index = 4;
50K £ balance < 100K : index = 5; 100K £ balance : index = 6