Searching
Objectives:
-
define the searching problem
-
discuss searching in lists
-
searching in tables using hashing
Definition and terminology
Definition. Assume k_1, k_2, ..., k_n is a collection of distinct keys and
R = {(k_1, I_1), (k_2, I_2), ..., (k_n, I_n)}
is a collection of records containing those keys, where I_j is the information stored in record (k_j, I_j).
Given a key value K, the search problem is to locate a record (k_j, I_j) such that K = k_j.
Successful search - a record with key k_j = K was found
Unsuccessful search - no record with key k_j = K was found
General search approaches
-
Sequential methods
-
the records are considered one at a time according to a predefined ordering
-
Direct access methods
-
the records are accessed directly based on the value of the search key
-
Indexing methods
-
keys are organized into some tree structure which allows fast searching
To describe search methods we shall use the following record structure definition:
struct record{
    key k;   // search key; the type key is assumed to support == and <
    info I;  // information associated with the key
};
Searching arrays
The problem will be to find a record with key value K in an array
record A[n];
-
Sequential search
info SeqSearch (record* A, int n, key K){
    // examine the records in array order until the key is found
    for (int i = 0; i < n; i++)
        if (A[i].k == K)
            return A[i].I;
    return NOT_FOUND;  // no record has key K
}
-
sequential search takes Θ(n) time on average
-
Binary Search
-
applicable if the records are sorted by key
info BinSearch (record* A, int n, key K){
    if (n <= 0)                 // empty range: unsuccessful search
        return NOT_FOUND;
    int mid = n/2;
    if (A[mid].k == K)
        return A[mid].I;
    else if (K < A[mid].k)      // continue in the left half A[0..mid-1]
        return BinSearch(&A[0], mid, K);
    else                        // continue in the right half A[mid+1..n-1]
        return BinSearch(&A[mid+1], n-mid-1, K);
}
-
takes Θ(log n) time on average
-
Dictionary search
-
if the records are sorted and the expected distribution of key values is known
-
the position of the key within the key range is translated into an expected position in the array
Example. Assume that the keys are integers uniformly distributed between m_1 and m_2.
In an array of n records, a key k should be at approximately index
(n - 1) * (k - m_1) / (m_2 - m_1)
-
on uniformly distributed keys it is faster than binary search on average: O(log log n) expected comparisons instead of O(log n)
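The dictionary search idea can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes integer keys and info, a sentinel NOT_FOUND = -1, and the hypothetical function name DictSearch. The technique is commonly known as interpolation search. Instead of probing the middle of the current range, it probes the position where K would be expected if the keys were spread evenly between the keys at the two ends of the range.

```cpp
// Assumptions for this sketch: keys and info are plain ints and
// NOT_FOUND is a sentinel that never appears as valid info.
typedef int key;
typedef int info;
const info NOT_FOUND = -1;

struct record {
    key k;
    info I;
};

// Dictionary (interpolation) search over A[0..n-1], sorted by increasing key.
info DictSearch(const record* A, int n, key K) {
    int lo = 0;
    int hi = n - 1;
    // K can only be present while it lies within the key range of A[lo..hi]
    while (lo <= hi && K >= A[lo].k && K <= A[hi].k) {
        if (A[hi].k == A[lo].k)     // one key value left: avoid dividing by zero
            return (A[lo].k == K) ? A[lo].I : NOT_FOUND;
        // translate the position of K in the key range into an array index
        long long span = (long long)A[hi].k - A[lo].k;
        long long offset = (long long)K - A[lo].k;
        int pos = lo + (int)(offset * (hi - lo) / span);
        if (A[pos].k == K)
            return A[pos].I;
        if (A[pos].k < K)
            lo = pos + 1;           // K can only be to the right of pos
        else
            hi = pos - 1;           // K can only be to the left of pos
    }
    return NOT_FOUND;
}
```

On keys that are far from uniformly distributed the estimated position can be poor, and the search may degrade towards a sequential scan.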
Searching lists
-
the only way to search lists is sequential search
-
to improve the performance of sequential search we may dynamically reorganize the list based on the frequency of accesses to the different records: move the more frequently accessed records towards the head of the list
-
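if p_1, p_2, ..., p_n (p_i >= p_{i+1}) are the probabilities with which the records are accessed, the expected number of comparisons for a successful search is:
T(n) = 1*p_1 + 2*p_2 + ... + n*p_n
Example. If each record has the same probability of being searched for (p_i = 1/n):
T(n) = (1 + 2 + ... + n) / n = (n + 1) / 2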
-
in most cases these probabilities are not known, so we try to approximate them
-
use self-organizing lists: lists that reorder their records according to previous accesses
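To make the expected search cost concrete, the sum T(n) can be evaluated directly for a given list order. The sketch below is only an illustration; the function name ExpectedComparisons and the use of a plain array of doubles are assumptions.

```cpp
// Expected number of comparisons for a successful sequential search.
// p[i] is the access probability of the record stored at position i
// (0-based), so finding that record costs i + 1 comparisons.
double ExpectedComparisons(const double* p, int n) {
    double t = 0.0;
    for (int i = 0; i < n; i++)
        t += (i + 1) * p[i];
    return t;
}
```

For four equally likely records this gives (4 + 1) / 2 = 2.5, while the probabilities 0.5, 0.25, 0.125, 0.125 stored in that order give 1.875, which is why placing frequently accessed records near the head pays off.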
Heuristics for managing self-organizing lists:
-
count: approximate the access probabilities using the number of previous accesses
-
keep an access counter associated with each record
-
move a record forward if its counter becomes greater than the counter of the preceding record
Disadvantages
-
requires extra space
-
changes in access frequency are handled poorly
-
every time a record is used, move it to the front of the list (move-to-front)
-
it is efficient for linked lists, where the move takes constant time
-
handles changes in frequency well
-
hard to implement efficiently if the list is represented in an array
-
every time a record is used, swap it with the record preceding it (transpose)
-
works well in general
-
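there are some pathological cases (e.g. alternately accessing the last two records swaps them back and forth, so neither moves forward)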
-
easy to implement with arrays and linked lists
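The three heuristics above can be sketched as follows. This is a minimal illustration under stated assumptions: integer keys and info, a sentinel NOT_FOUND = -1, and the hypothetical names item, node, CountSearch, MoveToFrontSearch and TransposeSearch. Count and transpose are shown on an array, move-to-front on a singly linked list, where relinking a record is cheap.

```cpp
// Assumptions for this sketch: int keys and info, NOT_FOUND = -1.
typedef int key;
typedef int info;
const info NOT_FOUND = -1;

struct item { key k; info I; int count; };   // array record with access counter
struct node { key k; info I; node* next; };  // linked-list record

static void swapItems(item& a, item& b) { item t = a; a = b; b = t; }

// Count: increment the counter, then move the record forward while its
// counter is greater than the counter of the preceding record.
info CountSearch(item* A, int n, key K) {
    for (int i = 0; i < n; i++)
        if (A[i].k == K) {
            info result = A[i].I;
            A[i].count++;
            for (; i > 0 && A[i].count > A[i - 1].count; i--)
                swapItems(A[i], A[i - 1]);
            return result;
        }
    return NOT_FOUND;
}

// Move-to-front: unlink the record that was found and reinsert it at
// the head of the list.
info MoveToFrontSearch(node*& head, key K) {
    node* prev = nullptr;
    for (node* cur = head; cur != nullptr; prev = cur, cur = cur->next)
        if (cur->k == K) {
            if (prev != nullptr) {        // not already at the front
                prev->next = cur->next;
                cur->next = head;
                head = cur;
            }
            return cur->I;
        }
    return NOT_FOUND;
}

// Transpose: swap the record that was found with the one preceding it.
info TransposeSearch(item* A, int n, key K) {
    for (int i = 0; i < n; i++)
        if (A[i].k == K) {
            info result = A[i].I;
            if (i > 0)
                swapItems(A[i], A[i - 1]);
            return result;
        }
    return NOT_FOUND;
}
```

Only CountSearch uses the count field, which is the extra space the count heuristic requires; the other two functions leave it untouched.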
Hashing
-
accessing a record in an array by mapping a key value to a position in the array
-
the array storing the records is called a hash table
-
the function used to map keys to table entries (slots) is called a hash function
-
a hash function is a function defined over the set of possible key values and takes on values from the set of slots in the hash table (0 to n-1 for a table of size n)
Hash functions
-
the range of key values typically has many more elements than there are slots in the hash table
-
some keys will map to the same slot (collision)
-
when defining a hash function the goal is to distribute the keys uniformly over the slots
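How evenly a hash function spreads a given set of keys can be checked by counting the keys that fall into each slot. The sketch below is only an illustration: the table size of 10, the modulo hash function and the name CountCollisions are assumptions chosen to keep the example small.

```cpp
const int TABLE_SIZE = 10;   // assumed small table size for illustration

// Modulo hash for non-negative integer keys.
int HashMod(int k) { return k % TABLE_SIZE; }

// Fills perSlot[0..TABLE_SIZE-1] with the number of keys mapped to each
// slot and returns how many keys landed in an already occupied slot.
int CountCollisions(const int* keys, int n, int* perSlot) {
    for (int s = 0; s < TABLE_SIZE; s++)
        perSlot[s] = 0;
    int collisions = 0;
    for (int i = 0; i < n; i++) {
        int s = HashMod(keys[i]);
        if (perSlot[s] > 0)
            collisions++;
        perSlot[s]++;
    }
    return collisions;
}
```

With this table size only the last decimal digit of a key matters, so keys such as 10, 20, 30 and 40 all map to slot 0: a distribution that is far from uniform.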
Example hash functions
-
Hash integer keys by modulo
int HashMod (int k){
    // assumes k >= 0; for negative k the result of % may be negative
    return k % TABLE_SIZE;
}
-
Mid-square method (numbers)
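// r = log2 (TABLE_SIZE), assuming TABLE_SIZE is a power of 2 and r >= 1
int HashMidSq (int k){
    unsigned int sq = (unsigned int) k;
    sq = sq * sq;                          // unsigned arithmetic: no signed overflow
    sq = sq << (sizeof(int)*8 - r) / 2;    // discard the high-order bits
    sq = sq >> (sizeof(int)*8 - r);        // keep the middle r bits
    return sq;
}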
-
Character folding (strings)
int HashFoldCh (char* s){
    int sum = 0;
    for ( ; *s != '\0' ; s++)
        sum += (unsigned char) (*s);   // add up the character codes
    return sum % TABLE_SIZE;
}
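One property of character folding is worth seeing in a small example: because addition is commutative, strings made of the same characters in a different order always hash to the same slot. The sketch below restates the function in self-contained form; the table size of 101 is an assumption.

```cpp
const int TABLE_SIZE = 101;   // assumed table size for illustration

// Character folding: sum of the character codes modulo the table size.
int HashFoldCh(const char* s) {
    int sum = 0;
    for (; *s != '\0'; s++)
        sum += (unsigned char)(*s);
    return sum % TABLE_SIZE;
}
```

For instance, HashFoldCh("abc") and HashFoldCh("cba") both sum the codes of 'a', 'b' and 'c' and therefore collide, whatever the table size is.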
Collision resolution
Collision resolution techniques:
-
open hashing - colliding records are stored outside the hash table
-
closed hashing - colliding records are stored in the hash table itself
-
Open hashing (chaining)
struct entry{
key k;
info I;
entry* link;
};
entry HashTable[TABLE_SIZE];
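info HashSearch (entry H[], key K){
    int i = Hash (K);
    if (H[i].k == EMPTY)       // home slot unused: unsuccessful search
        return NOT_FOUND;
    // the home slot holds the first record; colliding records follow via link
    for (entry* pe = &H[i]; pe != NULL; pe = pe->link)
        if (pe->k == K)
            return pe->I;
    return NOT_FOUND;
}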
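Insertion under open hashing can be sketched in the same spirit. This is a simplified variant rather than the layout used above: each table slot holds only a pointer to the head of its chain, so empty slots need no EMPTY marker. Integer keys and info, the table size of 7, the modulo hash function and the names Table, ChainInsert and ChainSearch are assumptions.

```cpp
// Assumptions for this sketch: int keys (>= 0) and info, NOT_FOUND = -1.
typedef int key;
typedef int info;
const info NOT_FOUND = -1;
const int TABLE_SIZE = 7;

struct entry {
    key k;
    info I;
    entry* link;   // next record in the same chain
};

entry* Table[TABLE_SIZE] = {};   // all chains start out empty

int Hash(key K) { return K % TABLE_SIZE; }

// Inserts (K, I) at the head of its chain; returns false if K already exists.
// The nodes are never freed in this sketch.
bool ChainInsert(key K, info I) {
    int i = Hash(K);
    for (entry* pe = Table[i]; pe != nullptr; pe = pe->link)
        if (pe->k == K)
            return false;
    Table[i] = new entry{K, I, Table[i]};
    return true;
}

// Walks the chain of the slot that K hashes to.
info ChainSearch(key K) {
    for (entry* pe = Table[Hash(K)]; pe != nullptr; pe = pe->link)
        if (pe->k == K)
            return pe->I;
    return NOT_FOUND;
}
```

Keys 3 and 10 both hash to slot 3 here, so they end up in the same chain and a search for either one walks that chain sequentially.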
-
Closed hashing
-
all records are stored in the hash table
-
a collision is resolved by applying secondary hash functions (probing)
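int hashInsert (record R){
    int h = Hash (R.k);
    int curr = h;
    // follow the probe sequence p(k, i) until an empty slot is found;
    // assumes the table is not full, otherwise the loop does not terminate
    for (int i = 1; H[curr].k != EMPTY; i++){
        if (H[curr].k == R.k)
            return EXISTING_KEY;   // duplicate keys are not allowed
        curr = (h + p(R.k, i)) % TABLE_SIZE;
    }
    H[curr] = R;
    return 0;   // success; assumed to be distinct from EXISTING_KEY
}
info hashSearch (key K);   // follows the same probe sequence as hashInsert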
Linear probing
p(k, i) = i
Quadratic probing
p(k, i) = i^2
Double hashing
p(k, i) = i * h2(k)
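Searching a closed hash table follows the same probe sequence as insertion and stops at the first empty slot. The sketch below uses linear probing; swapping in another probe function only changes p. Integer keys and info, EMPTY = -1 as the marker for unused slots, the table size of 7, the modulo hash function and the helper hashInit are assumptions, and deletions are not handled.

```cpp
// Assumptions for this sketch: int keys (>= 0) and info.
typedef int key;
typedef int info;
const info NOT_FOUND = -1;
const key EMPTY = -1;          // marks an unused slot
const int TABLE_SIZE = 7;

struct record {
    key k;
    info I;
};

record H[TABLE_SIZE];

int Hash(key K) { return K % TABLE_SIZE; }
int p(key, int i) { return i; }        // linear probing; p(k, 0) = 0 is the home slot

void hashInit() {
    for (int s = 0; s < TABLE_SIZE; s++)
        H[s].k = EMPTY;
}

// Returns false if the key already exists or no free slot was found.
bool hashInsert(record R) {
    int h = Hash(R.k);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int curr = (h + p(R.k, i)) % TABLE_SIZE;
        if (H[curr].k == EMPTY) {
            H[curr] = R;
            return true;
        }
        if (H[curr].k == R.k)
            return false;
    }
    return false;
}

// An empty slot on the probe sequence means the key is not in the table.
info hashSearch(key K) {
    int h = Hash(K);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int curr = (h + p(K, i)) % TABLE_SIZE;
        if (H[curr].k == EMPTY)
            return NOT_FOUND;
        if (H[curr].k == K)
            return H[curr].I;
    }
    return NOT_FOUND;
}
```

Bounding both loops by TABLE_SIZE probes keeps them from running forever on a full table; with linear probing those probes visit every slot exactly once.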
Analysis of closed hashing
-
the cost of insertions and unsuccessful searches is