Project 8 Working with HashCodes
Name ______Score______
In this project you will investigate methods for generating hashcodes for strings, hashcode collisions and how they occur.
Introduction - If you know the index of an element in the array, you can retrieve the element using the index in O(1) time. So, can we store the values in an array and use the key as the index to find the value? The answer is yes—if you can map a key to an index. The array that stores the values is called a hash table. The function that maps a key to an index in the hash table is called a hash function.
How do you design a hash function that produces an index from a key? Ideally, we would like to design a function that maps each search key to a different index in the hash table. Such a function is called a perfect hash function. However, it is difficult to find a perfect hash function.
Search keys are often strings. So, it is important to design a good hash function for strings. An intuitive approach is to sum the Unicode of all characters as the hash code for the string. This approach may work if two search keys in an application don’t contain same letters. But it will produce a lot of collisions if the search keys contain the same letters such as tod and dot. A better approach is to generate a hash code that takes the position of characters into consideration. Specifically, let the hash code be
s0*b(N-1) + s1*b(N-2) + ... + sN-1 (1)
where si is s.charAt(i). This expression is a polynomial for some positive b. So, this is called a polynomial hash code. By Horner’s rule, it can be evaluated efficiently as follows:
(...((s0*b + s1)b + s2)b + ... + sN-2)b + sN-1 (2)
This computation can cause an overflow for long strings. Arithmetic overflow is ignored in Java. You should choose an appropriate value b to minimize collision. Experiments show that the good choices for b are 31, 33, 37, 39, and 41. In the String class, the hashCode is overridden using the polynomial hash code with b being 31. We can compress the hash code using % size so that the range of values from the hash function is [0..size-1] permitting us to use the output of the hash function as the index of the storage array.
Implementation - The function hashString( ) accepts the string, str and a maximum array size, size. This function return a compressed hash code in the range [0..size-1].
Equation 1 generates values that can exceed the range of integers for longer strings and larger values of b. The generation of the value of n is accomplished by converting each character of the string follows the form of Equation 2 while compressing by mod size at each step.
In the main program of the hashStringDemo project a while loop is used to ask for user strings and the corresponding hash code is generated until an empty string is entered. In this project a hash map of 5000 is assumed.
Questions - Use the main program shown here with the hashString( ) method to help you answer the following questions. Alternatively, you can download the hashStringDemo from the Academic Website.
1. With size = 5000 what is the range of values possible from hashString( )? ______
2. In the hashString( ) method mod size is applied after each character value is added. How would the value of n be different if the mod size operation is performed only once at the end of the for-loop?
______
3. Compare the values generated for several sample words with size = 5000 and again for size = 10000. Discuss the differences and similarities between the hash codes generated by hashString( ) for these two values of size.
______
______
______
______
4. For size = 5000, find at least two words for which hashString( ) produces the same hash code.
______