For other structures see here.
Hash Table
Hashing
A visualization:
In a nutshell, we take the value/key, put it through a hash function and we get a corresponding value (also known as: hash values, hash codes, hash sums, checksums or hashes). This "value" is usually (more often then naught) an index for our storage medium (the simplest of which is an array). Consider this relatively trivial hash function:
Seems good at first glance, for the first ten digits, all have a distinct hash.
But wait! If we go through the next ten numbers, the hashes are repeated! What's a programmer to do? Write a better hash of course. Let's experiment a bit. Since the remainder of the ones digit will be repeated, so let's take another characteristic of the number: its divisibility:
Comparing the output of this program to our last:
Three collisions (a collision is where more then one key maps to the same hash) as opposed to twenty! Note that this is on a fixed set of test data and the number of collisions would increase as the number of entries being hashed increased. Such is the nature of hashing. Thus, it is still an ongoing endeavor to find perfect/near perfect hashes...
...which brings us to collision resolution. Since we don't live in a magical forest with chocolate rivers and gumdrop mountains we have to tell our tables how to handle collisions. Popular options include, but certainly are not limited to:
With a good hash function, lookups/operations can be executed in constant or near constant time: O(1) (read this if you are unfamiliar with Big O notation).
Before we look at the various ways to utilize a hash table, let's create a plain one that we modify for each case:
By and large, a pretty functional data structure, except for the fact that we do not account for any possible collisions.
Open Addressing
Linear Probing
If a slot is "taken" we go to the next slot until we find an empty one or there's no more room in the table. The step can be any size, but is usually one:
62 and 31 both have a remainder of zero when divided by 31. Thus, 62 gets the "real" zero slot, whereas 31 gets the next available slot.
There are some issues with this approach though:
1. We don't have any "back up" plan for when we run out of space (aside from "it just fails").
2. While it wouldn't normally be in a table (at least from a trivial standpoint) we do not account for zero. We use zero as the "empty" value. This could come back and bite us, but is perfectly okay for the purposes of this post.
3. The index is not wrapped around once it hits the end of the table. To alleviate this:
Now, to illustrate how this works (all the code is the same as before except for the above revised addValue() and main() ):
The actual program output:
And a visual:
Quadratic Probing and Double Hashing
Quadratic probing and double hashing operate on the exact same principle, except instead of an increment of one, using a quadratic formula and second hash to find the next "slot", respectively. Aptly named and due to the innate similarities, moving on. If you want to see a code example of this let me know.
Separate ("Linear") Chaining
The idea here is simple. At each index in our bucket/array we have a linked list. If multiple values have the same hash, we just append it to the list in that slot. For the code I'm going to use for the Linked List, see this blog post. While we will save time in the long run using this approach, we have to do some considerable re-factoring of our HashTable data structure. Why? Up to this point we've been operating under the assumption (notion rather) that our HashTable was dealing with primitives (hence the typename specifier rather then class, but I digress).
Revised HashTable (now ChainedHashTable):
Using a slightly modified main:
We get the following output:
Each index has a linked list. Each item that would normally collide and result in another slot assignment, is now appended to the list in that slot (or if the slot is empty, create a new list with that value as the head node). To avoid misc. debugging comments from that linked list tutorial, comment them out to get the above output.
A visualization of the above sample program:
We've only taken up three "slots" and have the capacity to hold an infinite (well, until the computer runs out of memory) amount of data. Time-wise, it's O(1) to access the list and then O(n) for a search/retrieval/check. We could optimize this by sorting the list and then using a binary search, but that's out of the scope of this post.
------
The intuitive of you will notice that I did not cover associative containers. While extremely handy, it's hard to cover them without some groundwork laid down. Hash tables provide such a foundation:
Look for a post on associative containers/hash maps soon!
------
Hopefully you found this helpful; feel free to leave a comment below. Happy coding!
Hash Table
Quote
A hash table or hash map is a data structure that uses a hash function to efficiently map certain identifiers or keys to associated values. The hash function is used to transform the key into the index (the hash) of an array element where the corresponding value is to be sought.
Hashing
A visualization:
In a nutshell, we take the value/key, put it through a hash function and we get a corresponding value (also known as: hash values, hash codes, hash sums, checksums or hashes). This "value" is usually (more often then naught) an index for our storage medium (the simplest of which is an array). Consider this relatively trivial hash function:
int trivialHash(int num){ return num % 10; } int main(){ for(int i = 0; i < 10; i++){ cout << trivialHash(i) << endl; } return 0; }
Seems good at first glance, for the first ten digits, all have a distinct hash.
Quote
0
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
But wait! If we go through the next ten numbers, the hashes are repeated! What's a programmer to do? Write a better hash of course. Let's experiment a bit. Since the remainder of the ones digit will be repeated, so let's take another characteristic of the number: its divisibility:
int trivialHash(int num){ return num % 10; } int somewhatBetterHash(int num){ return ((num% 10) + (num/2)); } int main(){ for(int i = 0; i < 20; i++){ cout << trivialHash(num) << "\t\t\t" << somewhatBetterHash(i) << endl; } return 0; }
Comparing the output of this program to our last:
Quote
//trivial() somewhatBetter()
0 0
1 1
2 3
3 4
4 6
5 7
6 9
7 10
8 12
9 13
0 5
1 6
2 8
3 9
4 11
5 12
6 14
7 15
8 17
9 18
0 0
1 1
2 3
3 4
4 6
5 7
6 9
7 10
8 12
9 13
0 5
1 6
2 8
3 9
4 11
5 12
6 14
7 15
8 17
9 18
Three collisions (a collision is where more then one key maps to the same hash) as opposed to twenty! Note that this is on a fixed set of test data and the number of collisions would increase as the number of entries being hashed increased. Such is the nature of hashing. Thus, it is still an ongoing endeavor to find perfect/near perfect hashes...
...which brings us to collision resolution. Since we don't live in a magical forest with chocolate rivers and gumdrop mountains we have to tell our tables how to handle collisions. Popular options include, but certainly are not limited to:
- Probing (linear, quadratic, etc...) Also known as open addressing
- Separate ("Linear") Chaining
- Rehashing
With a good hash function, lookups/operations can be executed in constant or near constant time: O(1) (read this if you are unfamiliar with Big O notation).
Before we look at the various ways to utilize a hash table, let's create a plain one that we modify for each case:
template <typename T> class HashTable{ private: T table[31]; //set amount of storage, small for example int hash(T); //only the table needs it public: HashTable(); ~HashTable() {}; void addValue(T); T retrieveValue(int index) {return table[index];}; bool contains(T); void displayTableContents(); }; template <typename T> HashTable<T>::HashTable(){ memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes } template <typename T> int HashTable<T>::hash(T value){ return value % 31; //simple just for the framework } template <typename T> bool HashTable<T>::contains(T value){ if(table[hash(value)]) return true; else return false; } template <typename T> void HashTable<T>::addValue(T value){ table[hash(value)] = value; } template <typename T> void HashTable<T>::displayTableContents(){ for(int i = 0; i < 31; i++){ cout << table[i] << endl; } return; } //add one value, for now int main(){ HashTable<int>* test = new HashTable<int>(); test->addValue(2456); test->displayTableContents(); delete test; return 0; }
By and large, a pretty functional data structure, except for the fact that we do not account for any possible collisions.
Open Addressing
Linear Probing
If a slot is "taken" we go to the next slot until we find an empty one or there's no more room in the table. The step can be any size, but is usually one:
template <typename T> class HashTable{ private: T table[31]; //set amount of storage, small for example int hash(T); //only the table needs it public: HashTable(); ~HashTable() {}; void addValue(T); T retrieveValue(int index) {return table[index];}; bool contains(T); void displayTableContents(); }; template <typename T> HashTable<T>::HashTable(){ memset(table, 0, sizeof(table)); //zero it out, for illustrative purposes } template <typename T> int HashTable<T>::hash(T value){ return value % 31; //simple just for the framework } template <typename T> bool HashTable<T>::contains(T value){ if(table[hash(value)]) return true; else return false; } template <typename T> void HashTable<T>::addValue(T value){ int index = hash(value); if(table[index] == 0){ //look it's free! table[index] = value; //go ahead and add it! } else{ bool notDone = true; while(notDone){ index++; if(!table[index]){ //if !0, i.e. empty table[index] = value; notDone = false; } } } return; } template <typename T> void HashTable<T>::displayTableContents(){ for(int i = 0; i < 31; i++){ cout << table[i] << endl; } return; } int main(){ HashTable<int>* test = new HashTable<int>(); test->addValue(62); //both of these hash to zero (mod 31) test->addValue(31); test->addValue(107); //to show a value away from the initial probing test->displayTableContents(); delete test; return 0; }
62 and 31 both have a remainder of zero when divided by 31. Thus, 62 gets the "real" zero slot, whereas 31 gets the next available slot.
There are some issues with this approach though:
1. We don't have any "back up" plan for when we run out of space (aside from "it just fails").
2. While it wouldn't normally be in a table (at least from a trivial standpoint) we do not account for zero. We use zero as the "empty" value. This could come back and bite us, but is perfectly okay for the purposes of this post.
3. The index is not wrapped around once it hits the end of the table. To alleviate this:
template <typename T> void HashTable<T>::addValue(T value){ int index = hash(value); if(table[index] == 0){ //look it's free! table[index] = value; //go ahead and add it! } else{ bool notDone = true; int originalPos = index; while(notDone){ index++; if (index == 31) index = 0; //wrap around else if (index == originalPos){ cout << "Table is full!" << endl; break; } if(!table[index]){ //if !0, i.e. empty table[index] = value; notDone = false; } } } return; }
Now, to illustrate how this works (all the code is the same as before except for the above revised addValue() and main() ):
int main(){ HashTable<int>* test = new HashTable<int>(); test->addValue(62); //both of these hash to zero (mod 31) test->addValue(31); test->addValue(107); //to show a value away from the initial probing test->addValue(138); //linear probe index 15 test->addValue(61); //index 30 test->addValue(123); //wrap around, must go past indexes 0 and 1, from above test->displayTableContents(); delete test; return 0; }
The actual program output:
Quote
62
31
123
0
0
0
0
0
0
0
0
0
0
0
107
138
0
0
0
0
0
0
0
0
0
0
0
0
0
0
61
31
123
0
0
0
0
0
0
0
0
0
0
0
107
138
0
0
0
0
0
0
0
0
0
0
0
0
0
0
61
And a visual:
Quadratic Probing and Double Hashing
Quadratic probing and double hashing operate on the exact same principle, except instead of an increment of one, using a quadratic formula and second hash to find the next "slot", respectively. Aptly named and due to the innate similarities, moving on. If you want to see a code example of this let me know.
Separate ("Linear") Chaining
The idea here is simple. At each index in our bucket/array we have a linked list. If multiple values have the same hash, we just append it to the list in that slot. For the code I'm going to use for the Linked List, see this blog post. While we will save time in the long run using this approach, we have to do some considerable re-factoring of our HashTable data structure. Why? Up to this point we've been operating under the assumption (notion rather) that our HashTable was dealing with primitives (hence the typename specifier rather then class, but I digress).
Revised HashTable (now ChainedHashTable):
//LinkedList Chained Table template <typename T> class ChainedHashTable{ private: LinkedList<T>* table[31]; int hash(T); //only the table needs it public: ChainedHashTable(); ~ChainedHashTable(); void addValue(T); T retrieveValue(int index); bool contains(T); void displayTableContents(); }; template<typename T> ChainedHashTable<T>::ChainedHashTable(){ memset(table, NULL, sizeof(table)); //same deal, but NULL for ptrs } template<typename T> ChainedHashTable<T>::~ChainedHashTable(){ //free the memory of each list for(int i = 0; i < 31; i++){ if(table[i]) delete table[i]; //if the slot was never used, leave it alone } } template <typename T> int ChainedHashTable<T>::hash(T value){ return value % 31; } template <typename T> void ChainedHashTable<T>::addValue(T value){ int index = hash(value); if(table[index] == NULL){ table[index] = new LinkedList<T>(value); } else{ table[index]->append(value); } } template <typename T> bool ChainedHashTable<T>::contains(T value){ int index = hash(value); if (table[index] == NULL) return false; else{ return table[index]->contains(value); } } template <typename T> void ChainedHashTable<T>::displayTableContents(){ for(int i = 0; i < 31; i++){ if(table[i] != NULL){ cout << "List at index " << i << ": "; table[i]->displayAll(); cout << endl; } } }
Using a slightly modified main:
int main(){ ChainedHashTable<int>* test = new ChainedHashTable<int>(); test->addValue(62); //both of these hash to zero (mod 31) test->addValue(31); test->addValue(107); //to show a value away from the initial probing test->addValue(138); //linear probe index 15 test->addValue(61); //index 30 test->addValue(123); //wrap around, must go past indexes 0 and 1, from above test->displayTableContents(); delete test; return 0; }
We get the following output:
Quote
List at index 0: 62 31
List at index 14: 107 138
List at index 30: 61 123
List at index 14: 107 138
List at index 30: 61 123
Each index has a linked list. Each item that would normally collide and result in another slot assignment, is now appended to the list in that slot (or if the slot is empty, create a new list with that value as the head node). To avoid misc. debugging comments from that linked list tutorial, comment them out to get the above output.
A visualization of the above sample program:
We've only taken up three "slots" and have the capacity to hold an infinite (well, until the computer runs out of memory) amount of data. Time-wise, it's O(1) to access the list and then O(n) for a search/retrieval/check. We could optimize this by sorting the list and then using a binary search, but that's out of the scope of this post.
------
The intuitive of you will notice that I did not cover associative containers. While extremely handy, it's hard to cover them without some groundwork laid down. Hash tables provide such a foundation:
Quote
Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like AWK, Perl, and PHP.
Look for a post on associative containers/hash maps soon!
------
Hopefully you found this helpful; feel free to leave a comment below. Happy coding!
7 Comments On This Entry
Page 1 of 1
Smurphy
27 May 2010 - 11:00 AM
Very nice entry KYA. You went over it just like we did in my computer science class. Maybe being a professor is in your future. haha. Thanks again and keep up the great work.
bnc
26 January 2011 - 11:56 PM
KYA, I have a doubt... In this example..both key and value are the same...
i.e. if we pass 31 as key we will get index 0.. in that index we are again storing 31. is it the purpose?
correct me if I am wrong. I am confused.. thanks..
i.e. if we pass 31 as key we will get index 0.. in that index we are again storing 31. is it the purpose?
correct me if I am wrong. I am confused.. thanks..
bnc
27 January 2011 - 10:05 PM
in this case the resulting array indexs are keys.. ok.... Thank you..
ketybrown
05 December 2012 - 02:05 AM
Hello.
First things first, thanks for this awesome tutorial.
Second, I'd like to request for an example on how the Quadratic Probing and Double hashing are done.
Thanks as I await your reply.
First things first, thanks for this awesome tutorial.
Second, I'd like to request for an example on how the Quadratic Probing and Double hashing are done.
Thanks as I await your reply.
Page 1 of 1
← February 2018 →
S | M | T | W | T | F | S |
---|---|---|---|---|---|---|
1 | 2 | 3 | ||||
4 | 5 | 6 | 7 | 8 | 9 | 10 |
11 | 12 | 13 | 14 | 15 | 16 | 17 |
18 | 19 | 20 | 21 | 22 | 23 | 24 |
25 | 26 | 27 | 28 |
Tags
My Blog Links
Recent Entries
Recent Comments
Search My Blog
1 user(s) viewing
1 Guests
0 member(s)
0 anonymous member(s)
0 member(s)
0 anonymous member(s)