Subscribe to Stuck in an Infiniteloop        RSS Feed
-----

An In-Depth Look at Hash Tables

Icon 7 Comments
For other structures see here.


Hash Table

Quote

A hash table or hash map is a data structure that uses a hash function to efficiently map certain identifiers or keys to associated values. The hash function is used to transform the key into the index (the hash) of an array element where the corresponding value is to be sought.


Hashing

A visualization:

Posted Image

In a nutshell, we take the value/key, put it through a hash function and we get a corresponding value (also known as: hash values, hash codes, hash sums, checksums or hashes). This "value" is usually (more often then naught) an index for our storage medium (the simplest of which is an array). Consider this relatively trivial hash function:

int trivialHash(int num){
	return num % 10;
}

int main(){
	for(int i = 0; i < 10; i++){
		cout << trivialHash(i) << endl;
	}
	return 0;
}



Posted Image

Seems good at first glance, for the first ten digits, all have a distinct hash.

Quote

0
1
2
3
4
5
6
7
8
9



But wait! If we go through the next ten numbers, the hashes are repeated! What's a programmer to do? Write a better hash of course. Let's experiment a bit. Since the remainder of the ones digit will be repeated, so let's take another characteristic of the number: its divisibility:

int trivialHash(int num){
	return num % 10;
}

int somewhatBetterHash(int num){
	return ((num% 10) + (num/2));
}

int main(){
	for(int i = 0; i < 20; i++){
		cout << trivialHash(num) << "\t\t\t" << somewhatBetterHash(i) << endl;
	}
	return 0;
}



Posted Image

Comparing the output of this program to our last:

Quote

//trivial() somewhatBetter()
0 0
1 1
2 3
3 4
4 6
5 7
6 9
7 10
8 12
9 13
0 5
1 6
2 8
3 9
4 11
5 12
6 14
7 15
8 17
9 18


Three collisions (a collision is where more then one key maps to the same hash) as opposed to twenty! Note that this is on a fixed set of test data and the number of collisions would increase as the number of entries being hashed increased. Such is the nature of hashing. Thus, it is still an ongoing endeavor to find perfect/near perfect hashes...

...which brings us to collision resolution. Since we don't live in a magical forest with chocolate rivers and gumdrop mountains we have to tell our tables how to handle collisions. Popular options include, but certainly are not limited to:

  • Probing (linear, quadratic, etc...) Also known as open addressing
  • Separate ("Linear") Chaining
  • Rehashing


With a good hash function, lookups/operations can be executed in constant or near constant time: O(1) (read this if you are unfamiliar with Big O notation).

Before we look at the various ways to utilize a hash table, let's create a plain one that we modify for each case:

template <typename T>
class HashTable{
private:
	T table[31]; //set amount of storage, small for example
	int hash(T); //only the table needs it
public:
	HashTable();
	~HashTable()							{};
	void addValue(T);
	T retrieveValue(int index)				{return table[index];};
	bool contains(T);
	void displayTableContents();
};

template <typename T>
HashTable<T>::HashTable(){
	memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes
}

template <typename T>
int HashTable<T>::hash(T value){
	return value % 31; //simple just for the framework
}

template <typename T>
bool HashTable<T>::contains(T value){
	if(table[hash(value)]) return true;
	else return false; 
}

template <typename T>
void HashTable<T>::addValue(T value){
	table[hash(value)] = value;
}

template <typename T>
void HashTable<T>::displayTableContents(){
	for(int i = 0; i < 31; i++){
		cout << table[i] << endl;
	}
	return;
}

//add one value, for now
int main(){
	HashTable<int>* test = new HashTable<int>();
	test->addValue(2456);
	test->displayTableContents();
	delete test;
	return 0;
}



By and large, a pretty functional data structure, except for the fact that we do not account for any possible collisions.

Open Addressing

Linear Probing
If a slot is "taken" we go to the next slot until we find an empty one or there's no more room in the table. The step can be any size, but is usually one:

template <typename T>
class HashTable{
private:
	T table[31]; //set amount of storage, small for example
	int hash(T); //only the table needs it
public:
	HashTable();
	~HashTable()							{};
	void addValue(T);
	T retrieveValue(int index)				{return table[index];};
	bool contains(T);
	void displayTableContents();
};

template <typename T>
HashTable<T>::HashTable(){
	memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes
}

template <typename T>
int HashTable<T>::hash(T value){
	return value % 31; //simple just for the framework
}

template <typename T>
bool HashTable<T>::contains(T value){
	if(table[hash(value)]) return true;
	else return false; 
}

template <typename T>
void HashTable<T>::addValue(T value){
	int index = hash(value);
	if(table[index] == 0){ //look it's free!
		table[index] = value; //go ahead and add it!
	}
	else{
		bool notDone = true;
		while(notDone){
			index++;
			if(!table[index]){ //if !0, i.e. empty
				table[index] = value;
				notDone = false;
			}
		}
	}
	return;
}

template <typename T>
void HashTable<T>::displayTableContents(){
	for(int i = 0; i < 31; i++){
		cout << table[i] << endl;
	}
	return;
}

int main(){
	HashTable<int>* test = new HashTable<int>();
	test->addValue(62); //both of these hash to zero (mod 31)
	test->addValue(31);
	test->addValue(107); //to show a value away from the initial probing
	test->displayTableContents();
	delete test;
	return 0;
}



62 and 31 both have a remainder of zero when divided by 31. Thus, 62 gets the "real" zero slot, whereas 31 gets the next available slot.

Posted Image

There are some issues with this approach though:

1. We don't have any "back up" plan for when we run out of space (aside from "it just fails").
2. While it wouldn't normally be in a table (at least from a trivial standpoint) we do not account for zero. We use zero as the "empty" value. This could come back and bite us, but is perfectly okay for the purposes of this post.
3. The index is not wrapped around once it hits the end of the table. To alleviate this:

template <typename T>
void HashTable<T>::addValue(T value){
	int index = hash(value);
	if(table[index] == 0){ //look it's free!
		table[index] = value; //go ahead and add it!
	}
	else{
		bool notDone = true;
		int originalPos = index;
		while(notDone){
			index++;
			if (index == 31) index = 0; //wrap around
			else if (index == originalPos){
				cout << "Table is full!" << endl;
				break;
			}
			if(!table[index]){ //if !0, i.e. empty
				table[index] = value;
				notDone = false;
			}
		}
	}
	return;
}




Now, to illustrate how this works (all the code is the same as before except for the above revised addValue() and main() ):

int main(){
	HashTable<int>* test = new HashTable<int>();
	test->addValue(62); //both of these hash to zero (mod 31)
	test->addValue(31);
	test->addValue(107); //to show a value away from the initial probing
	test->addValue(138); //linear probe index 15
	test->addValue(61); //index 30
	test->addValue(123); //wrap around, must go past indexes 0 and 1, from above 
	test->displayTableContents();
	delete test;
	return 0;
}



The actual program output:

Quote

62
31
123
0
0
0
0
0
0
0
0
0
0
0
107
138
0
0
0
0
0
0
0
0
0
0
0
0
0
0
61


And a visual:

Posted Image

Quadratic Probing and Double Hashing

Quadratic probing and double hashing operate on the exact same principle, except instead of an increment of one, using a quadratic formula and second hash to find the next "slot", respectively. Aptly named and due to the innate similarities, moving on. If you want to see a code example of this let me know.

Separate ("Linear") Chaining

The idea here is simple. At each index in our bucket/array we have a linked list. If multiple values have the same hash, we just append it to the list in that slot. For the code I'm going to use for the Linked List, see this blog post. While we will save time in the long run using this approach, we have to do some considerable re-factoring of our HashTable data structure. Why? Up to this point we've been operating under the assumption (notion rather) that our HashTable was dealing with primitives (hence the typename specifier rather then class, but I digress).

Revised HashTable (now ChainedHashTable):

//LinkedList Chained Table
template <typename T>
class ChainedHashTable{
private:
	LinkedList<T>* table[31]; 
	int hash(T); //only the table needs it
public:
	ChainedHashTable();
	~ChainedHashTable();
	void addValue(T);
	T retrieveValue(int index);
	bool contains(T);
	void displayTableContents();
};

template<typename T>
ChainedHashTable<T>::ChainedHashTable(){
	memset(table, NULL, sizeof(table)); //same deal, but NULL for ptrs
}

template<typename T>
ChainedHashTable<T>::~ChainedHashTable(){
	//free the memory of each list
	for(int i = 0; i < 31; i++){
		if(table[i]) delete table[i]; //if the slot was never used, leave it alone
	}
}

template <typename T>
int ChainedHashTable<T>::hash(T value){
	return value % 31;
}

template <typename T>
void ChainedHashTable<T>::addValue(T value){
	int index = hash(value);
	if(table[index] == NULL){
		table[index] = new LinkedList<T>(value);
	}
	else{
		table[index]->append(value);
	}
}

template <typename T>
bool ChainedHashTable<T>::contains(T value){
	int index = hash(value);
	if (table[index] == NULL) return false;
	else{
		return table[index]->contains(value);
	}
}

template <typename T>
void ChainedHashTable<T>::displayTableContents(){
	for(int i = 0; i < 31; i++){
		if(table[i] != NULL){
			cout << "List at index " << i << ": ";
			table[i]->displayAll();
			cout << endl;
		}
	}
}



Using a slightly modified main:

int main(){
	ChainedHashTable<int>* test = new ChainedHashTable<int>();
	test->addValue(62); //both of these hash to zero (mod 31)
	test->addValue(31);
	test->addValue(107); //to show a value away from the initial probing
	test->addValue(138); //linear probe index 15
	test->addValue(61); //index 30
	test->addValue(123); //wrap around, must go past indexes 0 and 1, from above 
	test->displayTableContents();
	delete test;
	return 0;
}



We get the following output:

Quote

List at index 0: 62 31

List at index 14: 107 138

List at index 30: 61 123


Each index has a linked list. Each item that would normally collide and result in another slot assignment, is now appended to the list in that slot (or if the slot is empty, create a new list with that value as the head node). To avoid misc. debugging comments from that linked list tutorial, comment them out to get the above output.

A visualization of the above sample program:

Posted Image

We've only taken up three "slots" and have the capacity to hold an infinite (well, until the computer runs out of memory) amount of data. Time-wise, it's O(1) to access the list and then O(n) for a search/retrieval/check. We could optimize this by sorting the list and then using a binary search, but that's out of the scope of this post.

------

The intuitive of you will notice that I did not cover associative containers. While extremely handy, it's hard to cover them without some groundwork laid down. Hash tables provide such a foundation:

Quote

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like AWK, Perl, and PHP.


Look for a post on associative containers/hash maps soon!
------


Hopefully you found this helpful; feel free to leave a comment below. Happy coding!

7 Comments On This Entry

Page 1 of 1

Smurphy Icon

27 May 2010 - 11:00 AM
Very nice entry KYA. You went over it just like we did in my computer science class. Maybe being a professor is in your future. haha. Thanks again and keep up the great work.
0

Mercurial Icon

04 August 2010 - 01:31 PM
Awesome
0

bnc Icon

23 January 2011 - 11:14 PM
very good tutorial...very easy to understand...thank you..
0

bnc Icon

26 January 2011 - 11:56 PM
KYA, I have a doubt... In this example..both key and value are the same...
i.e. if we pass 31 as key we will get index 0.. in that index we are again storing 31. is it the purpose?

correct me if I am wrong. I am confused.. thanks..
0

KYA Icon

27 January 2011 - 05:20 PM
31 is part of the hashing function, it isn't the key


in the examples above the array index resulting from the mod 31 is the key
0

bnc Icon

27 January 2011 - 10:05 PM
in this case the resulting array indexs are keys.. ok.... Thank you..
0

ketybrown Icon

05 December 2012 - 02:05 AM
Hello.

First things first, thanks for this awesome tutorial.

Second, I'd like to request for an example on how the Quadratic Probing and Double hashing are done.

Thanks as I await your reply.
0
Page 1 of 1

November 2014

S M T W T F S
      1
2345678
9101112131415
16171819202122
23242526 27 2829
30      

Tags

    Recent Entries

    Recent Comments

    Search My Blog

    1 user(s) viewing

    1 Guests
    0 member(s)
    0 anonymous member(s)