Subscribe to Stuck in an Infiniteloop

## An In-Depth Look at Hash Tables

For other structures see here.

Hash Table

Quote

A hash table or hash map is a data structure that uses a hash function to efficiently map certain identifiers or keys to associated values. The hash function is used to transform the key into the index (the hash) of an array element where the corresponding value is to be sought.

Hashing

A visualization:

In a nutshell, we take the value/key, put it through a hash function and we get a corresponding value (also known as: hash values, hash codes, hash sums, checksums or hashes). This "value" is usually (more often then naught) an index for our storage medium (the simplest of which is an array). Consider this relatively trivial hash function:

```int trivialHash(int num){
return num % 10;
}

int main(){
for(int i = 0; i < 10; i++){
cout << trivialHash(i) << endl;
}
return 0;
}

```

Seems good at first glance, for the first ten digits, all have a distinct hash.

Quote

0
1
2
3
4
5
6
7
8
9

But wait! If we go through the next ten numbers, the hashes are repeated! What's a programmer to do? Write a better hash of course. Let's experiment a bit. Since the remainder of the ones digit will be repeated, so let's take another characteristic of the number: its divisibility:

```int trivialHash(int num){
return num % 10;
}

int somewhatBetterHash(int num){
return ((num% 10) + (num/2));
}

int main(){
for(int i = 0; i < 20; i++){
cout << trivialHash(num) << "\t\t\t" << somewhatBetterHash(i) << endl;
}
return 0;
}

```

Comparing the output of this program to our last:

Quote

//trivial() somewhatBetter()
0 0
1 1
2 3
3 4
4 6
5 7
6 9
7 10
8 12
9 13
0 5
1 6
2 8
3 9
4 11
5 12
6 14
7 15
8 17
9 18

Three collisions (a collision is where more then one key maps to the same hash) as opposed to twenty! Note that this is on a fixed set of test data and the number of collisions would increase as the number of entries being hashed increased. Such is the nature of hashing. Thus, it is still an ongoing endeavor to find perfect/near perfect hashes...

...which brings us to collision resolution. Since we don't live in a magical forest with chocolate rivers and gumdrop mountains we have to tell our tables how to handle collisions. Popular options include, but certainly are not limited to:

• Separate ("Linear") Chaining
• Rehashing

With a good hash function, lookups/operations can be executed in constant or near constant time: O(1) (read this if you are unfamiliar with Big O notation).

Before we look at the various ways to utilize a hash table, let's create a plain one that we modify for each case:

```template <typename T>
class HashTable{
private:
T table[31]; //set amount of storage, small for example
int hash(T); //only the table needs it
public:
HashTable();
~HashTable()							{};
T retrieveValue(int index)				{return table[index];};
bool contains(T);
void displayTableContents();
};

template <typename T>
HashTable<T>::HashTable(){
memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes
}

template <typename T>
int HashTable<T>::hash(T value){
return value % 31; //simple just for the framework
}

template <typename T>
bool HashTable<T>::contains(T value){
if(table[hash(value)]) return true;
else return false;
}

template <typename T>
table[hash(value)] = value;
}

template <typename T>
void HashTable<T>::displayTableContents(){
for(int i = 0; i < 31; i++){
cout << table[i] << endl;
}
return;
}

int main(){
HashTable<int>* test = new HashTable<int>();
test->displayTableContents();
delete test;
return 0;
}

```

By and large, a pretty functional data structure, except for the fact that we do not account for any possible collisions.

Linear Probing
If a slot is "taken" we go to the next slot until we find an empty one or there's no more room in the table. The step can be any size, but is usually one:

```template <typename T>
class HashTable{
private:
T table[31]; //set amount of storage, small for example
int hash(T); //only the table needs it
public:
HashTable();
~HashTable()							{};
T retrieveValue(int index)				{return table[index];};
bool contains(T);
void displayTableContents();
};

template <typename T>
HashTable<T>::HashTable(){
memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes
}

template <typename T>
int HashTable<T>::hash(T value){
return value % 31; //simple just for the framework
}

template <typename T>
bool HashTable<T>::contains(T value){
if(table[hash(value)]) return true;
else return false;
}

template <typename T>
int index = hash(value);
if(table[index] == 0){ //look it's free!
}
else{
bool notDone = true;
while(notDone){
index++;
if(!table[index]){ //if !0, i.e. empty
table[index] = value;
notDone = false;
}
}
}
return;
}

template <typename T>
void HashTable<T>::displayTableContents(){
for(int i = 0; i < 31; i++){
cout << table[i] << endl;
}
return;
}

int main(){
HashTable<int>* test = new HashTable<int>();
test->addValue(62); //both of these hash to zero (mod 31)
test->addValue(107); //to show a value away from the initial probing
test->displayTableContents();
delete test;
return 0;
}

```

62 and 31 both have a remainder of zero when divided by 31. Thus, 62 gets the "real" zero slot, whereas 31 gets the next available slot.

There are some issues with this approach though:

1. We don't have any "back up" plan for when we run out of space (aside from "it just fails").
2. While it wouldn't normally be in a table (at least from a trivial standpoint) we do not account for zero. We use zero as the "empty" value. This could come back and bite us, but is perfectly okay for the purposes of this post.
3. The index is not wrapped around once it hits the end of the table. To alleviate this:

```template <typename T>
int index = hash(value);
if(table[index] == 0){ //look it's free!
}
else{
bool notDone = true;
int originalPos = index;
while(notDone){
index++;
if (index == 31) index = 0; //wrap around
else if (index == originalPos){
cout << "Table is full!" << endl;
break;
}
if(!table[index]){ //if !0, i.e. empty
table[index] = value;
notDone = false;
}
}
}
return;
}

```

Now, to illustrate how this works (all the code is the same as before except for the above revised addValue() and main() ):

```int main(){
HashTable<int>* test = new HashTable<int>();
test->addValue(62); //both of these hash to zero (mod 31)
test->addValue(107); //to show a value away from the initial probing
test->addValue(123); //wrap around, must go past indexes 0 and 1, from above
test->displayTableContents();
delete test;
return 0;
}

```

The actual program output:

Quote

62
31
123
0
0
0
0
0
0
0
0
0
0
0
107
138
0
0
0
0
0
0
0
0
0
0
0
0
0
0
61

And a visual:

Quadratic probing and double hashing operate on the exact same principle, except instead of an increment of one, using a quadratic formula and second hash to find the next "slot", respectively. Aptly named and due to the innate similarities, moving on. If you want to see a code example of this let me know.

Separate ("Linear") Chaining

The idea here is simple. At each index in our bucket/array we have a linked list. If multiple values have the same hash, we just append it to the list in that slot. For the code I'm going to use for the Linked List, see this blog post. While we will save time in the long run using this approach, we have to do some considerable re-factoring of our HashTable data structure. Why? Up to this point we've been operating under the assumption (notion rather) that our HashTable was dealing with primitives (hence the typename specifier rather then class, but I digress).

Revised HashTable (now ChainedHashTable):

```//LinkedList Chained Table
template <typename T>
class ChainedHashTable{
private:
int hash(T); //only the table needs it
public:
ChainedHashTable();
~ChainedHashTable();
T retrieveValue(int index);
bool contains(T);
void displayTableContents();
};

template<typename T>
ChainedHashTable<T>::ChainedHashTable(){
memset(table, NULL, sizeof(table)); //same deal, but NULL for ptrs
}

template<typename T>
ChainedHashTable<T>::~ChainedHashTable(){
//free the memory of each list
for(int i = 0; i < 31; i++){
if(table[i]) delete table[i]; //if the slot was never used, leave it alone
}
}

template <typename T>
int ChainedHashTable<T>::hash(T value){
return value % 31;
}

template <typename T>
int index = hash(value);
if(table[index] == NULL){
}
else{
table[index]->append(value);
}
}

template <typename T>
bool ChainedHashTable<T>::contains(T value){
int index = hash(value);
if (table[index] == NULL) return false;
else{
return table[index]->contains(value);
}
}

template <typename T>
void ChainedHashTable<T>::displayTableContents(){
for(int i = 0; i < 31; i++){
if(table[i] != NULL){
cout << "List at index " << i << ": ";
table[i]->displayAll();
cout << endl;
}
}
}

```

Using a slightly modified main:

```int main(){
ChainedHashTable<int>* test = new ChainedHashTable<int>();
test->addValue(62); //both of these hash to zero (mod 31)
test->addValue(107); //to show a value away from the initial probing
test->addValue(123); //wrap around, must go past indexes 0 and 1, from above
test->displayTableContents();
delete test;
return 0;
}

```

We get the following output:

Quote

List at index 0: 62 31

List at index 14: 107 138

List at index 30: 61 123

Each index has a linked list. Each item that would normally collide and result in another slot assignment, is now appended to the list in that slot (or if the slot is empty, create a new list with that value as the head node). To avoid misc. debugging comments from that linked list tutorial, comment them out to get the above output.

A visualization of the above sample program:

We've only taken up three "slots" and have the capacity to hold an infinite (well, until the computer runs out of memory) amount of data. Time-wise, it's O(1) to access the list and then O(n) for a search/retrieval/check. We could optimize this by sorting the list and then using a binary search, but that's out of the scope of this post.

------

The intuitive of you will notice that I did not cover associative containers. While extremely handy, it's hard to cover them without some groundwork laid down. Hash tables provide such a foundation:

Quote

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like AWK, Perl, and PHP.

Look for a post on associative containers/hash maps soon!
------

### 7 Comments On This Entry

Page 1 of 1

#### Smurphy

27 May 2010 - 11:00 AM
Very nice entry KYA. You went over it just like we did in my computer science class. Maybe being a professor is in your future. haha. Thanks again and keep up the great work.
0

#### Mercurial

04 August 2010 - 01:31 PM
Awesome
0

#### bnc

23 January 2011 - 11:14 PM
very good tutorial...very easy to understand...thank you..
0

#### bnc

26 January 2011 - 11:56 PM
KYA, I have a doubt... In this example..both key and value are the same...
i.e. if we pass 31 as key we will get index 0.. in that index we are again storing 31. is it the purpose?

correct me if I am wrong. I am confused.. thanks..
0

#### KYA

27 January 2011 - 05:20 PM
31 is part of the hashing function, it isn't the key

in the examples above the array index resulting from the mod 31 is the key
0

#### bnc

27 January 2011 - 10:05 PM
in this case the resulting array indexs are keys.. ok.... Thank you..
0

#### ketybrown

05 December 2012 - 02:05 AM
Hello.

First things first, thanks for this awesome tutorial.

Second, I'd like to request for an example on how the Quadratic Probing and Double hashing are done.

0
Page 1 of 1

S M T W T F S
123456
78910111213
14151617181920
21222324252627
28 29 30