CS 2223 Mar 31 2022
Daily Exercise:
Classical Selection: Mozart: Symphony No. 41 "Jupiter" (1788)
Musical Selection:
Simple Minds: Don’t You (Forget About Me) (1985)
Visual Selection:
School of Athens, Raphael Sanzio (1509-1511)
Live Selection:
With a Little Help from My Friends, Joe Cocker (1968); written by Lennon/McCartney (1967)
Daily Question: DAY11 (Problem Set DAY11)
1 Priority Queues
1.1 HW1 Distribution
I reviewed the distribution of HW1 grades in two different ways. First, based on the picture of students in the class from Tuesday, there is a significant difference in grades based on whether I see you regularly in class:
In class - 85.94
Not - 69
I also broke it down by year and see the following, suggesting that there are challenges for non-CS-majors.
First Year - 82
Second Year - 73
Third Year - 75
Fourth Year - 66
1.2 Announcements on office hour updates
If a TA/SA cannot make a regularly scheduled office hour, then an announcement will be posted in Canvas with the adjusted date and time. Please double-check in Canvas if you think an office hour is not being covered.
1.3 Sorting Summary
Another general shout!
I do believe that these applauses
are for some new honours
that are heap’d on Caesar.
Julius Caesar
William Shakespeare
We could have spent several weeks on sorting algorithms, but we still have miles to go before we sleep, so let’s quickly summarize. We covered the following:
Insertion Sort
Selection Sort
Merge Sort
For each of these, you need to be able to explain the fundamental structure of the algorithm. Some work by dividing problems into subproblems that aim to be half the size of the original problem. Some work by greedily solving a specific subtask that reduces the size of the problem by one.
For each algorithm we worked out specific strategies for counting key operations (such as exchanges and comparisons) and developed formulas to count these operations in the worst case.
We showed that the problem of comparison-based sorting is bounded from below by ~N log N comparisons, which means that no comparison-based sorting algorithm can beat this limit, though different implementations will still be able to differentiate their behavior through programming optimizations and shortcuts.
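As a refresher, this limit comes from the standard decision-tree argument (a sketch of the reasoning, not the full in-class derivation):

```latex
% Any comparison-based sort can be modeled as a binary decision tree
% whose leaves are the N! possible orderings of the input.
% A binary tree of height h has at most 2^h leaves, so distinguishing
% all orderings requires
2^h \ge N! \quad\Longrightarrow\quad h \ge \lg(N!)
% and by Stirling's approximation
\lg(N!) = \sum_{k=1}^{N} \lg k \;\sim\; N \lg N
```

Here h is the number of comparisons made in the worst case, which is why no comparison-based sort can do better than ~N lg N comparisons.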
1.4 Homework2
Be sure to replace "USERID" with your CCC credentials. For example, my email address is "heineman@wpi.edu" so my USERID would be "heineman".
1.5 Linear Probing
The linked list structure of SeparateChainingHashST seems excessive, although it does deliver impressive results. Is there a way to use the hash(key) method in a different manner? It still returns values between 0 and M-1, but what if we just created an array of M keys and tried to use the computed hash(key) to find an index location where we can put the key?
In this situation, all we want to do is store the key to be able to determine later whether the key is in the array or not. What if we just used hc = hashcode(key) % M, where M is the array length, to determine the index position in which to place the key?
In in = new In("words.english.txt");
String[] knownStrings = new String[1000003];
int numCollisions = 0;
while (!in.isEmpty()) {
    String word = in.readString();
    int location = (word.hashCode() & 0x7fffffff) % knownStrings.length;
    if (knownStrings[location] == null) {
        knownStrings[location] = word;
    } else {
        System.out.println("COLLISION between " + knownStrings[location]
            + " and " + word + " on " + location);
        numCollisions++;
    }
}
System.out.println(numCollisions + " total collisions.");
in.close();
Run this code and see what happens. It takes every word from a dictionary of English words and tries to find a unique location where that word can be placed. There are 321,129 words, so let's start with an array containing just over one million entries. Is this enough?
When you run it, you will see there are 46,334 collisions, that is, situations where two (or more) keys are mapped to the same index location using the modulo array-length technique.
Can we fix this situation by making the array larger, with the aim of increasing storage and thereby reducing the chance of a collision? Change the array size to 10,000,019 and see what happens. The number of collisions is reduced, but there are still too many.
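As a rough sanity check on these counts, you can estimate the expected number of collisions under the assumption that hash values are uniformly distributed. The class and method names below are hypothetical (this is a back-of-the-envelope sketch, not part of the course code):

```java
// Estimate expected collisions when N keys are hashed uniformly into M slots.
// The expected number of occupied slots is M * (1 - (1 - 1/M)^N);
// every key beyond the first in a slot counts as a collision.
public class CollisionEstimate {
    static long estimate(long n, long m) {
        double occupied = m * (1.0 - Math.pow(1.0 - 1.0 / m, n));
        return Math.round(n - occupied);
    }

    public static void main(String[] args) {
        // 321,129 words into just over one million slots:
        System.out.println(estimate(321129, 1000003));   // in the neighborhood of the observed 46,334
        // ... and into just over ten million slots:
        System.out.println(estimate(321129, 10000019));  // far fewer, but still thousands
    }
}
```

The estimate for the one-million-slot array lands within a few hundred of the observed 46,334, which suggests the collisions are exactly what uniform hashing predicts: making the array larger helps, but collisions never disappear entirely.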
Still, this approach might be the basis of something useful. Let's take this flawed approach and change it slightly to create an approach that always works, as long as the storage array is larger than N, the number of keys being stored.
1.6 Open Addressing
Open addressing resolves hash collisions by probing (or searching) through alternative locations in an array when the modulo computation encounters a collision. One common approach, called linear probing, resolves a collision by checking successively higher index positions in the array until an empty one is found, into which the key is stored. If the end of the array is reached without finding an empty location, the search starts over from index position 0. This search is guaranteed to succeed because I make sure that there will always be at least one empty index location.
In this example, the first three keys (whose computed locations of 20, 15 and 5) all are inserted without any problems.
This sure looks great! But when the fourth key is inserted, its computed location of 5 is already occupied! Simply advance to the right (starting back over at index location 0 if you run off the end of the array) and place the key in the first empty location you find.
Repeat once more with a key whose index location turns out to be 5 again. Now you can see that an increasing number of index locations have to be explored before finding an empty spot.
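The probing behavior described above can be sketched in a few lines. The table size (23) and the precomputed index locations below are hypothetical stand-ins for the keys in the figure, chosen only to reproduce the walkthrough:

```java
// Toy walkthrough of linear probing: insert keys whose hash locations
// are already known, probing right (with wrap-around) past occupied slots.
public class ProbeDemo {
    // Returns the index actually used for each key, in insertion order.
    static int[] place(int m, int[] locations) {
        boolean[] occupied = new boolean[m];
        int[] placed = new int[locations.length];
        for (int k = 0; k < locations.length; k++) {
            int i = locations[k];
            while (occupied[i]) {
                i = (i + 1) % m;   // probe to the right, wrapping around
            }
            occupied[i] = true;
            placed[k] = i;
        }
        return placed;
    }

    public static void main(String[] args) {
        // Hash locations 20, 15, 5 insert cleanly; the two extra keys
        // that also hash to 5 are pushed to 6 and then 7.
        int[] placed = place(23, new int[] { 20, 15, 5, 5, 5 });
        System.out.println(java.util.Arrays.toString(placed)); // [20, 15, 5, 6, 7]
    }
}
```

Notice how each repeated hash to 5 probes one slot further than the last: this is exactly the growing chain the next paragraph defines.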
A chain of entries for a given index location in an open addressing Symbol Table is a sequence of consecutive non-null index locations in keys. This sequence starts at a given keys[hc] and extends to the right (wrapping around to the first index in the table as needed), up to but not including the next unused table index position. In the figure earlier, the chain for hc of 5 has a length of 5 even though there are only three entries whose key hashes to that same code. Also, there are no keys with a hash code of 2, but the chain for hash code 2 has a length of 1 because of an earlier collision. The maximum length of any chain is M-1 because one bucket is always left empty.
You might wonder why this approach will even work. Since entries can be inserted into an index location different from hc = hash(key) % M, how will you find them later? The answer is that you search for a key in the exact same manner in which you put the keys into the array.
If we are going to use this approach to store (key, value) pairs, then we will need to create two arrays of the same length, where the corresponding index locations reflect that vals[i] is associated with keys[i].
The following is a simplified implementation of a Symbol Table using linear probing. For the structure, we need to keep track of two arrays of equal length.
public class LinearProbingHashST<Key, Value> {
    // Simplified from Book...
    int N;           // number of key-value pairs in the symbol table
    int M;           // size of linear probing table
    Key[] keys;      // the keys
    Value[] vals;    // the values

    public void put(Key key, Value val) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                vals[i] = val;
                return;
            }
            i = (i + 1) % M;
        }
        keys[i] = key;
        vals[i] = val;   // (key, val) is now associated in ST.
        N++;
    }
}
Then we find the initial location where a key is hashed. From that location, we search until keys[i] == null since that is where the key should be placed. Note that we also increase N to remember how many (key, value) pairs are in the Symbol Table.
To determine whether there is a corresponding value for a key, we repeat the process: starting at the index location determined by hash(key), search until you have either found the key you are looking for – in which case simply return the associated value – or search until you reach an index location containing the null value, which signifies that the key does not exist in the symbol table.
public Value get(Key key) {
    int i = hash(key);
    while (keys[i] != null) {
        if (keys[i].equals(key)) {
            return vals[i];
        }
        i = (i + 1) % M;
    }
    return null;
}
1.7 Resizing
Using the same technique of geometric resizing, the size of the storage array will continuously grow to ensure that there will be sufficient empty index locations. Once more than half of the index locations in keys[] are occupied, both arrays are doubled in size, as you have seen me do on a number of occasions.
void resize(int capacity) {
    LinearProbingHashST<Key, Value> temp = new LinearProbingHashST<Key, Value>(capacity);
    for (int i = 0; i < M; i++) {
        if (keys[i] != null) {
            temp.put(keys[i], vals[i]);
        }
    }
    keys = temp.keys;
    vals = temp.vals;
    M = temp.M;
}
As was done with the linked list implementation, a new LinearProbingHashST object is constructed with the desired extra capacity, and each (key, value) pair that exists in the original array will be put into the temporary LinearProbingHashST object. Then the original object just steals the two arrays (keys and vals) and increases M to reflect the new array lengths.
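To see the load-factor check that triggers resizing in context, here is a minimal self-contained sketch. It uses String keys and a class name of its own (assumptions made only to keep the sketch runnable), rather than the book's generic code:

```java
// Minimal sketch of linear probing with geometric resizing.
// Simplified assumptions: String keys, Integer values, no delete.
public class ResizingDemo {
    int N = 0;        // number of (key, value) pairs
    int M;            // current table size
    String[] keys;
    Integer[] vals;

    ResizingDemo(int capacity) {
        M = capacity;
        keys = new String[capacity];
        vals = new Integer[capacity];
    }

    int hash(String key) { return (key.hashCode() & 0x7fffffff) % M; }

    public void put(String key, Integer val) {
        if (N >= M / 2) { resize(2 * M); }   // keep the table at most half full
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) { vals[i] = val; return; }
            i = (i + 1) % M;
        }
        keys[i] = key;
        vals[i] = val;
        N++;
    }

    public Integer get(String key) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) { return vals[i]; }
            i = (i + 1) % M;
        }
        return null;
    }

    void resize(int capacity) {
        // Build a larger temporary table, re-put every pair, then steal its arrays.
        ResizingDemo temp = new ResizingDemo(capacity);
        for (int i = 0; i < M; i++) {
            if (keys[i] != null) { temp.put(keys[i], vals[i]); }
        }
        keys = temp.keys;
        vals = temp.vals;
        M = temp.M;
    }
}
```

Starting from a table of size 4 and inserting 100 distinct keys, the table doubles repeatedly (4, 8, 16, ...) so that it never becomes more than half full, and every key remains retrievable afterward.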
1.8 Removing a key from LinearProbingHashST
Removing a (key, value) pair from open addressing is going to be tricky. Let’s go back to the figure earlier.
What if you wanted to remove the key whose hash code was 20, and which appeared in index location 6? Perhaps you can remove it by setting the key at that location to null. Doing this would break the chain that starts at index location 5.
This means that you will not be able to "find" the key whose hash code was 19, and which appeared in index location 2 (because of wrap-around).
Can someone come up with an idea for a strategy for removing? Let’s discuss.
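One strategy that works, and the one used by the book's implementation, is to null out the deleted slot and then remove and reinsert every key in the remainder of that chain, so no later search is broken by the gap. The sketch below uses String keys and a hypothetical class name to stay self-contained; it is a simplified illustration, not the book's exact code:

```java
// Delete from a linear probing table by breaking the chain at the deleted
// slot and re-putting every entry to its right (up to the next null).
public class DeleteDemo {
    int N = 0;
    int M;
    String[] keys;
    Integer[] vals;

    DeleteDemo(int capacity) {
        M = capacity;
        keys = new String[capacity];
        vals = new Integer[capacity];
    }

    int hash(String key) { return (key.hashCode() & 0x7fffffff) % M; }

    public void put(String key, Integer val) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) { vals[i] = val; return; }
            i = (i + 1) % M;
        }
        keys[i] = key;
        vals[i] = val;
        N++;
    }

    public Integer get(String key) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) { return vals[i]; }
            i = (i + 1) % M;
        }
        return null;
    }

    public void delete(String key) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % M;
        }
        if (keys[i] == null) { return; }   // key not present
        keys[i] = null;                    // break the chain here ...
        vals[i] = null;
        N--;
        i = (i + 1) % M;
        while (keys[i] != null) {          // ... then reinsert the rest of the chain
            String k = keys[i];
            Integer v = vals[i];
            keys[i] = null;
            vals[i] = null;
            N--;
            put(k, v);                     // may land in the freed slot or elsewhere
            i = (i + 1) % M;
        }
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java, so inserting both and deleting "Aa" exercises exactly the broken-chain scenario from the figure: after the reinsertion pass, "BB" is still found.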
1.9 Comparison
It is instructive to compare the performance of these two approaches, separate chaining and linear probing.
Assuming that you use geometric resizing upon reaching a pre-determined "load limit", the resulting performance for put and get is O(1). Removing a value requires a bit more work, but through similar analysis you will find that it, too, is O(1).
What do you make of these graphs? The top one graphs, for each chain length, the total number of linked lists with that length.
First 10 values: [986, 4105, 8873, 12997, 14159, 12470, 9046, 5657, 2999, 1473, ...]
There are 986 empty indices, yielding a 98.7% utilization. The maximum chain is 16 and 4,105 are single (.1%).
The second graph plots, for each chain length, the total number of index locations with that length.
First 10 values: [792983, 195666, 69432, 28861, 13113, 6423, 3303, 1733, 977, 564, ...]
There are 792,983 empty indices, yielding a 28.8% utilization. The maximum chain is 34 and 195,666 are single (61%).
1.10 Daily Question
The assigned daily question is DAY11 (Problem Set DAY11)
If you have any trouble accessing this question, please let me know immediately on Discord.
1.11 Version : 2022/03/31
(c) 2022, George T. Heineman