Tables

Chapter 11 of Shiflet text. A table of information is a set of records. We often look up records by a key. For example, a table for CS2005 contains a record for each student, and we use the name (or userid) as the key.

A table could be implemented as an array of records (structs) or as a linked list.

Two common operations on a table are:

Add a record to a table.
Search for a record given a key.
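
As a point of comparison for the hashing material that follows, here is a minimal sketch of the array-of-records approach with the two operations. The names Record, Table, TableAdd, and TableSearch, the grade field, and the two size limits are all assumptions made for illustration; they do not come from the texts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXRECORDS 100   /* hypothetical capacity of the table */
#define MAXNAME    32    /* hypothetical maximum key length, including '\0' */

/* One record: the name is the key, the grade is the stored information. */
struct Record {
    char name[MAXNAME];
    int  grade;
};

/* The table is an array of records plus a count of the slots in use. */
struct Table {
    struct Record rgRecord[MAXRECORDS];
    int cRecord;
};

/* Add a record at the end of the table; return 0 on success, -1 if full. */
int TableAdd(struct Table *pt, const char *name, int grade)
{
    assert(pt != NULL && name != NULL);
    if (pt->cRecord == MAXRECORDS)
        return -1;
    strncpy(pt->rgRecord[pt->cRecord].name, name, MAXNAME - 1);
    pt->rgRecord[pt->cRecord].name[MAXNAME - 1] = '\0';
    pt->rgRecord[pt->cRecord].grade = grade;
    pt->cRecord++;
    return 0;
}

/* Search for a record by key; return a pointer to it, or NULL if absent. */
struct Record *TableSearch(struct Table *pt, const char *name)
{
    int i;

    assert(pt != NULL && name != NULL);
    for (i = 0; i < pt->cRecord; i++)
        if (strcmp(pt->rgRecord[i].name, name) == 0)
            return &pt->rgRecord[i];
    return NULL;
}
```

Adding is cheap here, but the search may compare the key against every record in the table, which is the cost that hashing tries to avoid.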

Hashing

One approach for organizing a table of information requires that the records be stored using an array.

What about using an array with the range of 0 to 999,999,999 (one billion entries)? We could then map each student to a unique slot in the array using the student id. Adding and searching would be trivial, but the array is expensive in terms of space.

What about having 10,000 slots (0 to 9999) and using the last four digits of the student id? This is an example of a hash function. A hash function maps the key of a record to an index in the array.

If there are 100 records, then most students will still map to a unique slot, although even with evenly spread digits the chance that no two of the 100 collide at all is only about 60%.

The load factor (number of records divided by number of slots) is 100/10000 = 0.01.
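
The last-four-digits idea and the load factor calculation can be sketched in a few lines. HashId, LoadFactor, and NSLOTS are hypothetical names chosen for this sketch, and it assumes student ids are non-negative numbers that fit in a long.

```c
#include <assert.h>

#define NSLOTS 10000   /* slots 0 to 9999, as in the text */

/* Hash a student id to a slot by keeping only its last four digits. */
int HashId(long id)
{
    assert(id >= 0);
    return (int)(id % NSLOTS);
}

/* Load factor: the number of records divided by the number of slots. */
double LoadFactor(int cRecord, int cSlot)
{
    assert(cRecord >= 0 && cSlot > 0);
    return (double)cRecord / cSlot;
}
```

With this sketch, ids 123456789 and 987656789 both map to slot 6789, so two different students can still land on the same slot even when the table is nearly empty.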

Hash Function

Truncation. Use only part of the key, for example some of the digits from a student identification number, or the first letter of a name.

Modular arithmetic. Divide the key by the size of the index range and use the remainder. The modulus should be a prime number: 997 or 1009 is better than 1000. We want to avoid a modulus with many factors, because keys that share a factor with it crowd into fewer slots.

For example, to hash a string, add the character values together and take the remainder, as in the following:
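
To see why a prime modulus helps, the following sketch hashes keys that are all multiples of some step and counts how many different slots they reach. HashMod and CountDistinctSlots are hypothetical helpers written for this illustration, and the 2048-slot limit is an arbitrary assumption to keep the sketch simple.

```c
#include <assert.h>

/* Hash an integer key by taking the remainder modulo the table size. */
int HashMod(long key, int cSlot)
{
    assert(key >= 0 && cSlot > 0);
    return (int)(key % cSlot);
}

/* Count how many different slots the keys step, 2*step, ..., cKey*step
   reach in a table of cSlot slots. */
int CountDistinctSlots(long step, int cKey, int cSlot)
{
    char fUsed[2048] = {0};   /* fUsed[i] becomes 1 once slot i is hit */
    int k, cDistinct = 0;

    assert(step > 0 && cSlot > 0 && cSlot <= 2048);
    for (k = 1; k <= cKey; k++) {
        int i = HashMod(k * step, cSlot);
        if (!fUsed[i]) {
            fUsed[i] = 1;
            cDistinct++;
        }
    }
    return cDistinct;
}
```

For the 50 keys 100, 200, ..., 5000, a modulus of 1000 reaches only 10 different slots, while a modulus of 997 gives each key its own slot, because 100 shares no factor with 997.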

#define HASHSIZE 97   /* size of the array of records */

/* Hash -- map a string of characters to a number in range 0 to HASHSIZE-1 */
int Hash(char *sb)
{
    int sum = 0;

    while (*sb != '\0') {
        sum = sum + (unsigned char)*sb;   /* add in the character value */
        sb++;
    }
    return(sum%HASHSIZE);
}

See also Example 6.10 of the Kruse text.
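
One property of the additive hash above is worth noticing: strings made of the same characters in a different order always get the same sum, so they always collide. The sketch below repeats the additive hash under the name HashAdditive and compares it with HashWeighted, a hypothetical variant that multiplies the running value by 31 before adding each character. That variant is a common alternative, not something taken from the texts.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE 97   /* size of the array of records */

/* Additive hash as in the notes: sum the character values. */
int HashAdditive(const char *sb)
{
    int sum = 0;

    assert(sb != NULL);
    while (*sb != '\0') {
        sum = sum + (unsigned char)*sb;
        sb++;
    }
    return sum % HASHSIZE;
}

/* Hypothetical variant: scale the running value by 31 before adding each
   character, so the position of a character affects the result. */
int HashWeighted(const char *sb)
{
    unsigned int h = 0;

    assert(sb != NULL);
    while (*sb != '\0') {
        h = (h * 31 + (unsigned char)*sb) % HASHSIZE;
        sb++;
    }
    return (int)h;
}
```

In ASCII, HashAdditive maps both "abc" and "cba" to 3, while HashWeighted maps them to 33 and 13. Either function can still produce collisions, so a collision resolution scheme is needed in both cases.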

Collision Resolution

What if two elements hash to the same index? We then have a collision. How do we resolve it?

Open Addressing (Linear Probing)

If a collision occurs, do a linear search from that point (wrapping around in a circular fashion) until the element is found or an empty slot is encountered. This can lead to clustering. It is the simplest approach, though, and is what I expect to be used in the project.

#include <string.h>

/* Array of names. A file-scope array starts with every entry NULL. */
char *rgName[HASHSIZE];

/* Return the index of the slot that matches sb, or of the empty slot
   where sb should be added. Return -1 if the table is full. */
int HashedIndex(char *sb)
{
    int i, iSave;

    iSave = i = Hash(sb);
    while (rgName[i] != NULL) {
        if (strcmp(sb, rgName[i]) == 0)
            return(i);             /* match found */
        i = (i + 1)%HASHSIZE;      /* try the next slot, wrapping around */
        if (i == iSave)
            return(-1);            /* have looped all around -- table full */
    }
    return(i);                     /* empty slot found */
}
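
The routine above only finds a slot; it does not store anything. A minimal sketch of how it might be used for the two table operations follows. The block repeats Hash and HashedIndex so that it stands alone, and AddName and ContainsName are hypothetical helpers, not part of the project specification. It assumes the caller keeps each string alive, since only the pointer is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASHSIZE 97

const char *rgName[HASHSIZE];   /* file-scope array: every entry starts NULL */

/* Additive hash from the notes. */
int Hash(const char *sb)
{
    int sum = 0;

    while (*sb != '\0') {
        sum = sum + (unsigned char)*sb;
        sb++;
    }
    return sum % HASHSIZE;
}

/* Linear probing: index of the match or of the first empty slot, -1 if full. */
int HashedIndex(const char *sb)
{
    int i, iSave;

    iSave = i = Hash(sb);
    while (rgName[i] != NULL) {
        if (strcmp(sb, rgName[i]) == 0)
            return i;
        i = (i + 1) % HASHSIZE;
        if (i == iSave)
            return -1;
    }
    return i;
}

/* Add a name if it is not already present. Return its slot, or -1 if full. */
int AddName(const char *sb)
{
    int i;

    assert(sb != NULL);
    i = HashedIndex(sb);
    if (i >= 0 && rgName[i] == NULL)
        rgName[i] = sb;         /* store the pointer, not a copy */
    return i;
}

/* Return 1 if the name is in the table, 0 otherwise. */
int ContainsName(const char *sb)
{
    int i;

    assert(sb != NULL);
    i = HashedIndex(sb);
    return i >= 0 && rgName[i] != NULL;
}
```

With an empty table, adding "abc" fills slot 3, and adding "cba", which also hashes to 3, is pushed to slot 4. A later name that hashes to 4 would in turn be pushed to slot 5, which is how clusters grow.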

Chained Addressing

An alternate approach to resolving collisions. See Fig 6.12 of Kruse. Each slot in the array holds a linked list of the records that hash to it; once the hash value has been found, we search or add to that list.

We do not have problems with clustering.
Collisions are not a problem because we just add to the linked list.
We do not have a problem with overflow.
The links require space, and the approach is a little trickier to program.
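
A minimal sketch of chaining, using the same additive hash, might look like the following. The names Node, rgChain, ChainSearch, and ChainAdd are assumptions made for this illustration, and the sketch stores string pointers without copying them, so the caller is assumed to keep each string alive.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 97

/* One link in a chain of names that share a hash value. */
struct Node {
    const char *sb;
    struct Node *pnodeNext;
};

struct Node *rgChain[HASHSIZE];   /* file-scope: every chain starts empty */

/* Additive hash from the notes. */
int Hash(const char *sb)
{
    int sum = 0;

    while (*sb != '\0') {
        sum = sum + (unsigned char)*sb;
        sb++;
    }
    return sum % HASHSIZE;
}

/* Walk the chain for sb's hash value; return its node or NULL if absent. */
struct Node *ChainSearch(const char *sb)
{
    struct Node *p;

    assert(sb != NULL);
    for (p = rgChain[Hash(sb)]; p != NULL; p = p->pnodeNext)
        if (strcmp(sb, p->sb) == 0)
            return p;
    return NULL;
}

/* Add sb at the front of its chain unless it is already there.
   Return 0 on success, -1 if memory runs out. */
int ChainAdd(const char *sb)
{
    struct Node *p;
    int i;

    assert(sb != NULL);
    if (ChainSearch(sb) != NULL)
        return 0;                 /* already present */
    p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    i = Hash(sb);
    p->sb = sb;
    p->pnodeNext = rgChain[i];    /* new node points at the old chain head */
    rgChain[i] = p;
    return 0;
}
```

Here "abc" and "cba" both end up in the chain at index 3 and the neighboring slot 4 stays empty, in contrast to linear probing, where the second name spills into the next slot.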

Analysis of Hashing

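Best case: the slot is found on the first probe.
Worst case: all or nearly all slots are tried.
Expected case: depends on the load factor (the ratio of elements to slots). For small load factors (0.1) the number of probes is just above one, for both successful and unsuccessful searches. For a load factor of p the expected number of probes in an unsuccessful search is about 1/(1-p), assuming each probe is equally likely to land on any slot. Linear probing needs somewhat more probes than this as the table fills, because of clustering.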
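
To get a feel for the formula 1/(1-p), it can be evaluated for a few load factors. ExpectedProbes is a hypothetical helper written for this purpose; it only evaluates the formula and does not measure a real table.

```c
#include <assert.h>

/* Evaluate 1/(1-p) for a load factor p with 0 <= p < 1. */
double ExpectedProbes(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

The formula gives about 1.11 probes at p = 0.1, 2 probes at p = 0.5, and 10 probes at p = 0.9, so performance drops off sharply as the table approaches full.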