CS 2223 Apr 13 2021

Lecture Path: 13

Expected reading: pp. 469-477 (Hashing with Linear Probing, Section 3.4)
Daily Exercise:
Classical selection: Bach: Goldberg Variations (Glenn Gould) (1741)

Musical Selection: U2: With or Without You (1987)

Visual Selection: Winged Victory of Samothrace, Unknown (200-190BC)

Live Selection: Who: Won’t Get Fooled Again (1978)

Daily Question: DAY13 (Problem Set DAY13)

1 Hashing with Linear Probing

1.1 Question of Hashing and Arrays

A nice property of an array is that you can directly find (or alter) the value of the ith element. But this only works with integer indexing, so it can’t be used to associate a value with a key (unless the key itself is an integer between 0 and N-1).

Hashing seems to offer a possible way to associate values with keys, but only with a hash function that somehow computes a value between 0 and M-1, where M is the number of "buckets" over which the hash function uniformly distributes the keys.

Clowns to the left of me, Jokers to the right. Here I am, stuck in the middle with you.
Stealers Wheel

public class Sample {
    static int M = 17;              // number of buckets
    static int a[] = new int[M];

    /** Convert hashCode() into an index between 0 and M-1 */
    static int hash(Object o) {
        return (o.hashCode() & 0x7fffffff) % M;
    }

    static String strings[] = new String[] { "it", "was", "the", "best", "of", "times", "worst" };

    public static void main(String[] args) {
        String s1 = "this";
        String s2 = "that";
        a[hash(s1)] = 17;
        a[hash(s2)] = 19;

        System.out.println("Value with " + s2 + " is " + a[hash(s2)]);
        for (String s : strings) {
            System.out.println(s + " maps to " + hash(s));
        }
    }
}

The output from this small program is:

Value with that is 19
it maps to 5
was maps to 11
the maps to 0
best maps to 6
of maps to 7
times maps to 10
worst maps to 10

As you can see, both "times" and "worst" map to the same bucket, so this is a "clash" if we are trying to store both in the single array position.

Benford’s Law refers to an observation about the frequency of leading digits in many real-life data sets. The number 1 appears as the leading significant digit about 30% of the time, while 9 appears less than 5% of the time. The pattern is not specific to base 10.

When we discussed hashing, we pointed out the challenge in coming up with a perfect hash function, one that maps N keys to N distinct bucket numbers. In general, it is common for multiple keys to be hashed to the exact same bucket number, or integer.

SequentialSearchST introduced the notion of creating a linked list of (key, value) pairs stored in nodes. The challenge, as described on Apr 06 2021, was that the performance of key operations was directly proportional to the number of elements in the symbol table.

With separate chaining, all keys that hash to the same bucket are stored in a linked list. But we want to find some way to use an array (in some capacity) to store keys and values without losing any (key, value) pairs.

1.2 Basic Concept

If we just stored the values into a specific array location based solely on hashing the key value, we would no longer be able to uniquely associate it, since multiple key values hash to the same position.

OK. So perhaps we should store two arrays? One to hold the keys (as hashed to a specific bucket) and the other to hold the values (as hashed to the exact same bucket).

public class Clashes {
    static int M = 17;                       // number of buckets
    static String keys[] = new String[M];    // must store keys
    static int values[] = new int[M];        // don’t care about values...
    ...
}

Now, when trying to put (key, value) into the symbol table, we find the bucket, b, to which the key hashes, and we store the key in keys[b] and the value in values[b]. When an attempt is made to put a different (key, value), you can at least determine whether they have clashed, by comparing keys[b] against the key being inserted.
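To make the clash check concrete, here is a minimal sketch. The class name ClashDetector and the method tryPut are my own hypothetical names, not part of the course code. It stores a pair only when the bucket is empty or already holds the same key, and reports a clash otherwise.

```java
class ClashDetector {
    static final int M = 17;                 // number of buckets
    String[] keys = new String[M];           // key stored in each bucket (null if empty)
    int[] values = new int[M];               // value stored alongside each key

    /** Convert hashCode() into an index between 0 and M-1. */
    static int hash(Object o) {
        return (o.hashCode() & 0x7fffffff) % M;
    }

    /** Hypothetical helper: store the pair, or return false if a different key owns the bucket. */
    boolean tryPut(String key, int value) {
        int b = hash(key);
        if (keys[b] != null && !keys[b].equals(key)) {
            return false;                    // clash: bucket b already holds a different key
        }
        keys[b] = key;
        values[b] = value;
        return true;
    }

    public static void main(String[] args) {
        ClashDetector t = new ClashDetector();
        System.out.println(t.tryPut("times", 1));   // bucket was empty
        System.out.println(t.tryPut("times", 2));   // same key, value replaced
        System.out.println(t.tryPut("worst", 3));   // "worst" hashes to the same bucket as "times"
    }
}
```

The sketch detects the clash but simply gives up on the second key, which is exactly the gap the rest of this lecture fills.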

This makes partial progress, I guess. But doesn’t solve the overall problem! How (and where) should a new (key, value) pair be stored in the arrays if a different key (hashed to the same bucket) exists in the symbol table?

1.3 LinearProbingHashST implementation

On Apr 08 2021, I presented SeparateChainingHashST, which grouped together an array of SequentialSearchST objects and used a resizing strategy to ensure that the performance of put and get was directly proportional to the average number of nodes in each linked list stored by these SequentialSearchST objects.

It is important to state that some chains could be quite long, depending on the efficacy of the hash method that tries to uniformly distribute keys into the M "buckets" or SequentialSearchST objects. So if we can’t avoid the issue of multiple clashes, we need a resolution strategy to work around it.

No one came forward with a solution to this interview challenge from Apr 02 2021. Any takers? Post your answer on Discord...

Why not be inspired by our earlier "Airplane Challenge" interview question? There, when a passenger sees someone sitting in his seat (a clash!) he chooses another random seat. Instead of being Random, why don’t we just say "pick the next available seat in ascending order"?

This surely can’t work? Well, it does.

1.4 Insert (key, value) into Symbol table

Upon insert, we use the same logic, namely, to locate the bucket into which the key is hashed. If that spot is occupied, we start looking at neighboring positions (in increasing order) until we find a place that is empty and insert our (key, value) pair in the keys[x] and vals[x] array positions.

public void put(Key key, Value val) {
    int i;
    for (i = hash(key); keys[i] != null; i = (i+1) % m) {
        if (keys[i].equals(key)) {
            vals[i] = val;       // key already present: replace its value
            return;
        }
    }

    keys[i] = key;               // first empty position found
    vals[i] = val;
    n++;
}

The for loop is interesting. It starts at the proper bucket (as determined by hash) and incrementally searches for an empty spot, wrapping around as necessary. If the key is found in keys[i] then we replace the value and return, otherwise, advance until you hit an empty index position, into which the key and value are inserted.

Will this mess up the get method and contains? It sure will, but we can adjust.

1.5 Retrieve value from symbol table associated with key

public Value get(Key key) {
    for (int i = hash(key); keys[i] != null; i = (i + 1) % m) {
        if (keys[i].equals(key)) {
            return vals[i];
        }
    }
    return null;
}

Similar logic requires that we start, again, at the bucket indicated by the hash method and incrementally search for the key until null is encountered, making sure to wrap around when we hit the end of the array.
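To see where keys actually land, here is a small self-contained trace. It is only a sketch under my own assumptions: the class name ProbeTrace, the fixed table size of 7, and the first-letter hash are all hypothetical choices made so that clashes are easy to predict. The table is never resized, so it must not be filled completely.

```java
import java.util.Arrays;

class ProbeTrace {
    static final int M = 7;              // small fixed table; never resized in this sketch
    String[] keys = new String[M];
    int[] vals = new int[M];

    // Hypothetical hash: uses only the first letter, so words sharing a letter clash.
    int hash(String key) {
        return key.charAt(0) % M;
    }

    void put(String key, int val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) { vals[i] = val; return; }
        }
        keys[i] = key;                   // first empty slot at or after the hashed bucket
        vals[i] = val;
    }

    Integer get(String key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) return vals[i];
        }
        return null;                     // reached an empty slot: key is absent
    }

    public static void main(String[] args) {
        ProbeTrace t = new ProbeTrace();
        String[] words = { "it", "is", "if", "do" };
        for (int k = 0; k < words.length; k++) {
            t.put(words[k], k);
        }
        // 'i' hashes to bucket 0 and 'd' to bucket 2, so "is" and "if" slide to
        // positions 1 and 2, and "do" finds bucket 2 taken and lands in position 3.
        System.out.println(Arrays.toString(t.keys));
        System.out.println(t.get("if"));
        System.out.println(t.get("in"));
    }
}
```

Note that "do" ends up displaced by keys it never clashed with directly, which is why get has to keep probing until it reaches a null entry.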

1.6 Delete a (key, value) from the array

Can you see what problem now arises when requesting to delete a (key, value) from the existing symbol table?

As more (key, value) pairs are added into the symbol table, blocks of neighboring index positions are updated to store (key, value) pairs in the parallel arrays. If you are asked to delete a (key, value) pair, you can’t just set keys[b] = null for the bucket associated with the key, because you could "break up a chain" of neighboring (key, value) pairs that have all "clashed" with each other and form a contiguous block that must be maintained.
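One common remedy is to remove the key and then reinsert every pair that follows it in the same contiguous block, so that no later key is stranded behind a null. The sketch below illustrates that idea under my own assumptions (the class name ProbeDelete, a fixed table of size 7, and a first-letter hash are hypothetical); it is not necessarily how the course's LinearProbingHashST implements delete.

```java
class ProbeDelete {
    static final int M = 7;              // small fixed table; never resized in this sketch
    String[] keys = new String[M];
    int[] vals = new int[M];
    int n = 0;                           // number of stored pairs

    // Hypothetical hash: first letter only, so clashes are easy to predict.
    int hash(String key) { return key.charAt(0) % M; }

    void put(String key, int val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) { vals[i] = val; return; }
        }
        keys[i] = key;
        vals[i] = val;
        n++;
    }

    Integer get(String key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) return vals[i];
        }
        return null;
    }

    void delete(String key) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % M;
        }
        if (keys[i] == null) return;     // key not present, nothing to do

        keys[i] = null;                  // remove the pair itself
        n--;

        // Reinsert every pair in the rest of the block so none is cut off by the new gap.
        i = (i + 1) % M;
        while (keys[i] != null) {
            String k = keys[i];
            int v = vals[i];
            keys[i] = null;
            n--;
            put(k, v);
            i = (i + 1) % M;
        }
    }

    public static void main(String[] args) {
        ProbeDelete t = new ProbeDelete();
        t.put("it", 0); t.put("is", 1); t.put("if", 2); t.put("do", 3);
        t.delete("is");
        System.out.println(java.util.Arrays.toString(t.keys));
        System.out.println(t.get("if") + " " + t.get("do"));
    }
}
```

Had delete only set the slot to null, a later get("if") would start at bucket 0, step to the now-empty position 1, and wrongly report that "if" is missing.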

1.7 Resize Implementation

When the keys[] array "becomes increasingly occupied", then the strategy to follow is similar to what you have seen before – double the size and rehash all entries.

With the SeparateChainingHashST class, the decision was based on a parameter called AVG_LENGTH, which refers to the average size of the individual linked lists. The goal is to ensure that N/M stays bounded by a constant, thus the number of buckets, M, has to increase when N increases too much.

The same logic applies here – if the number of keys exceeds half of the available storage, then the entire structure is rehashed.

void resize(int capacity) {
    LinearProbingHashST<Key, Value> temp = new LinearProbingHashST<Key, Value>(capacity);
    for (int i = 0; i < m; i++) {
        if (keys[i] != null) {
            temp.put(keys[i], vals[i]);
        }
    }
    keys = temp.keys;
    vals = temp.vals;
    m    = temp.m;
}

Once again, this uses a temp storage which will contain all (key, value) pairs that are rehashed there. Once done, the enclosing object will "steal" this information for its own. This logic should be familiar based on past lectures.

Note that the for loop has no choice but to inspect all M elements, and for every non-null keys[i], that (keys[i], vals[i]) pair is reinserted into temp.

1.8 Final put implementation

public void put(Key key, Value val) {
    if (key == null) {
        throw new IllegalArgumentException("first argument to put() is null");
    }

    if (val == null) {
        delete(key);
        return;
    }

    // double table size if 50% full
    if (n >= m/2) resize(2*m);

    int i;
    for (i = hash(key); keys[i] != null; i = (i+1) % m) {
        if (keys[i].equals(key)) {
            vals[i] = val;
            return;
        }
    }

    keys[i] = key;
    vals[i] = val;
    n++;
}

1.9 Sample Exercise

The best way to see the behavior of this approach is to run in the debugger. This is also the only way, since the LinearProbingHashST class properly hides the keys and vals so no one outside of the class can view the data.

static String[] initials = new String[] { "it", "the", "best", "that", "i", "could", "do", "but", "if", "is", "badly", "done"};

public static void main(String[] args) {
    LinearProbingHashST<BadHashString,Integer> st = new LinearProbingHashST<BadHashString, Integer>(10);

    // insert each word with a placeholder value
    for (String s : initials) {
        BadHashString bs = new BadHashString(s);
        st.put(bs, 1);
    }

    System.out.println("Breakpoint here");
}

The BadHashString class I wrote has an inefficient hashCode method which ensures that all words that begin with the same letter are hashed to the same bucket. With this device, you can more properly see the clashes that occur when inserting (key, value) pairs into the structure.
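The BadHashString source is not reproduced here, so the following is only a guess at what such a class might look like. The name BadHashStringSketch and its details are my assumptions; the one property it is meant to share with the real class is that the hash depends only on the first letter.

```java
class BadHashStringSketch {
    final String value;

    BadHashStringSketch(String value) {
        this.value = value;
    }

    /** Deliberately poor: only the first character contributes to the hash. */
    @Override
    public int hashCode() {
        return value.isEmpty() ? 0 : value.charAt(0);
    }

    /** Equality still compares the whole string, so distinct words remain distinct keys. */
    @Override
    public boolean equals(Object o) {
        return o instanceof BadHashStringSketch
                && value.equals(((BadHashStringSketch) o).value);
    }

    @Override
    public String toString() {
        return value;
    }

    public static void main(String[] args) {
        BadHashStringSketch a = new BadHashStringSketch("it");
        BadHashStringSketch b = new BadHashStringSketch("is");
        // Same first letter, so the same hash code, yet the keys are not equal.
        System.out.println(a.hashCode() == b.hashCode());
        System.out.println(a.equals(b));
    }
}
```

Keeping equals correct while weakening hashCode is what makes this a useful teaching device: the symbol table still behaves correctly, it just has to probe much more often.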

As you execute the code (which you can do line by line in the debugger) inspect the contents of the keys array – note that the vals array is irrelevant for our discussion.

As more items are added to the keys, you will see patterns form, and the resize events will do their best to distribute keys uniformly, although the defective hashCode will still cause keys to "bunch up" more than they should.

In class, I revised the code to use a regular String, but the result in the hashtable arrays was more or less the same. This was an issue regarding the initial size. Specifically, with twelve values to add into an array of size 10 (which doubles to 20), there is already too high a chance of collisions because of the size of the problem. In the following arrays of size 50, you can see that the BadHashString does not distribute its values properly.

BadHashString
[that, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, best, it, i, could, do, but, if, is, badly, done, null, null, null, null, null, null, null, the]

Regular HashString
[null, that, could, but, done, null, null, null, null, null, null, null, null, null, null, null, do, if, badly, null, null, null, null, null, null, null, null, null, null, null, null, null, the, it, i, is, null, null, null, null, null, null, null, null, null, null, null, null, best, null]

If you want, you can try an experiment that changes the resize threshold from

if (n >= m/2) resize(2*m);

to one that packs more keys into the array before resizing. The ultimate space-packing condition would be:

if (n == m-1) resize(2*m);

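That is, the table only resizes when there is just one empty space remaining. At least one space must stay empty, because the probing loops in put and get only stop when they reach a null entry.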
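If you would rather run the experiment outside the debugger, here is one way to set it up. This is a rough sketch, not the course code: the class ThresholdExperiment, its packTight flag, the starting size of 16, and the fixed random seed are all my own assumptions. It builds the same set of Integer keys under both thresholds and reports the final table size and the average number of positions inspected to find a stored key.

```java
import java.util.Random;

class ThresholdExperiment {
    int m = 16;                          // current table size (assumed starting size)
    int n = 0;                           // number of stored keys
    Integer[] keys = new Integer[m];
    final boolean packTight;             // false: resize at half full; true: resize with one slot left

    ThresholdExperiment(boolean packTight) { this.packTight = packTight; }

    int hash(Integer key) { return (key.hashCode() & 0x7fffffff) % m; }

    void put(Integer key) {
        if (packTight ? n == m - 1 : n >= m / 2) resize(2 * m);
        insert(key);
    }

    // Linear-probing insert without the resize check.
    private void insert(Integer key) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % m) {
            if (keys[i].equals(key)) return;     // duplicate key, nothing to add
        }
        keys[i] = key;
        n++;
    }

    private void resize(int capacity) {
        Integer[] old = keys;
        keys = new Integer[capacity];
        m = capacity;
        n = 0;
        for (Integer k : old) {
            if (k != null) insert(k);            // rehash into the larger table
        }
    }

    // Number of positions inspected when searching for key.
    int probesFor(Integer key) {
        int count = 1;
        for (int i = hash(key); keys[i] != null && !keys[i].equals(key); i = (i + 1) % m) {
            count++;
        }
        return count;
    }

    double avgProbesPerHit() {
        long total = 0;
        for (Integer k : keys) {
            if (k != null) total += probesFor(k);
        }
        return (double) total / n;
    }

    static ThresholdExperiment build(boolean packTight, int count) {
        Random rnd = new Random(42);             // fixed seed so both runs see the same keys
        ThresholdExperiment t = new ThresholdExperiment(packTight);
        for (int k = 0; k < count; k++) {
            t.put(rnd.nextInt(1_000_000));
        }
        return t;
    }

    public static void main(String[] args) {
        for (boolean tight : new boolean[] { false, true }) {
            ThresholdExperiment t = build(tight, 1000);
            System.out.printf("packTight=%b  m=%d  n=%d  avg probes per hit=%.2f%n",
                    tight, t.m, t.n, t.avgProbesPerHit());
        }
    }
}
```

The trade-off to look for is space against time: the tighter threshold should finish with a smaller array, and you can compare the reported probe averages to see what that saving costs.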

1.10 Daily Question

The assigned daily question is DAY13 (Problem Set Day13)

If you have any trouble accessing this question, please let me know immediately on Discord.