Subscribe to Stuck in an Infiniteloop        RSS Feed
***** 2 Votes

An In-Depth Look At Huffman Encoding

Icon 4 Comments
It seemed only natural to follow up the Binary Heap post with an implementation that forms the backbone of Huffman Encoding.


What is it?

Quote

Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol.

Wiki

In layman's terms, we come up with a binary symbol that substitutes each character in a file. We then write the binary to file rather then the data, thus saving space. This post only deals with creating the Huffman codes and displaying the 1's and 0's as a regular string.

How it works:

The engine behind this encoding comes in two parts [min heap and Huffman tree], although the heap does most of the work. I took the Binary Heap and transformed it into a "min heap". Instead of the largest element at the top (or the "root"), the smallest is.

Algorithm steps:

  • Take the input string and create a hash map <character, frequency>
  • Iterate through the hashmap and place a node for each entry into the min heap
  • Pop the two smallest nodes off the heap
  • Combine their frequencies and push that node back onto the heap
  • Repeat the last two steps until you only have one node left
  • This node is the root of your Huffman Tree
  • Recurse down, adding a zero if you go left and a one is you go right
  • When you hit the end (a node with a letter), record the code
  • Push all of these codes into a hashmap <character, code>
  • Use this table to encode/decode strings


The reason this has so few steps is that the heap takes care of determining node ordering. We could also call our "min heap" a Priority Queue (although it's not technically a FIFO structure).

It's important to note that it is possible for a string to have two [or more] different sets of Huffman codes; this is indicative of implementation dependency: particularly what the heap does if two nodes have the same frequency and the order they are originally pushed into the heap.

A visual of the above process:

Attached Image

Code:

Data Structures:
Our basic data structure is the HuffmanNode. It holds the character it represents, how many times it occurs in the string [frequency], and a link to left and right children (which are initially null until we start pushing nodes together to form our tree).

These nodes are pushed onto a slightly modified Binary Heap. The Heap is initialized with a capacity to account for the possibility that are characters within a string are unique [worst case memory complexity].

Hashmaps are used to store character frequencies and generated Huffman codes.

All of these pieces are combined into a HuffmanCode class that takes an input string and handles the rest for the user.

A simple UML representation:

Attached Image

Source:

HuffmanNode.h

#ifndef HUFFMANNODE_H
#define HUFFMANNODE_H
#include <iostream>
#include <string>
using std::string;
using std::ostream;

class HuffmanNode{
private:
	int frequency;
	char letter;
	string code;
	HuffmanNode* left, *right;
public:
	HuffmanNode();
	HuffmanNode(char, int);
	HuffmanNode(HuffmanNode*, HuffmanNode*);
	HuffmanNode(const HuffmanNode&);
	~HuffmanNode();
	int getFrequency()							{return frequency;};
	char getLetter()							{return letter;};
	string getCode()							{return code;};
	HuffmanNode* getLeft()						{return left;};
	HuffmanNode* getRight()						{return right;};

	void setHuffmanCode(string s)				{code = s;};
	friend ostream& operator<<(ostream&, const HuffmanNode&);

	HuffmanNode& operator=(const HuffmanNode&);
};
#endif



HuffmanNode.cpp

#include "HuffmanNode.h"
using namespace std;

HuffmanNode::HuffmanNode(){
	letter = ' ';
	frequency = 0;
	left = right = NULL;
}

HuffmanNode::HuffmanNode(char c, int i){
	letter = c;
	frequency = i;
	left = right = NULL;
}

HuffmanNode::HuffmanNode(HuffmanNode* left, HuffmanNode* right){
	this->left = left;
	this->right = right;
	frequency = left->getFrequency() + right->getFrequency();
	letter = NULL;
}

HuffmanNode::~HuffmanNode(){
	if(left != NULL) delete left;
	if(right != NULL) delete right;
}

HuffmanNode::HuffmanNode(const HuffmanNode& rhs){
	//cout << "in copy constructor copying: " << rhs.letter << endl;
	
	if(rhs.left != NULL){
		left = new HuffmanNode();
		*left = *(rhs.left);
	}
	if(rhs.right != NULL){
		right = new HuffmanNode();
		*right = *(rhs.right);
	}
	code = rhs.code;
	letter = rhs.letter;
	frequency = rhs.frequency;
}


ostream& operator<<(ostream& os, const HuffmanNode& rhs){
	os <<  "\'" << rhs.letter << " " << rhs.frequency << "\' ";
	return os;
}

HuffmanNode& HuffmanNode::operator=(const HuffmanNode& rhs){
	//cout << "In assigbnment operator overload." << endl;
	if(!(this == &rhs)){
		//copy
		code = rhs.code;
		letter = rhs.letter;
		frequency = rhs.frequency;
		if(!(left == NULL)) delete left;
		if(!(right == NULL)) delete right;

		if(rhs.left != NULL){
			left = new HuffmanNode();
			*left = *(rhs.left);
			if(rhs.right != NULL){
				right = new HuffmanNode();
				*right = *(rhs.right);
			}
		}
		else left = right = NULL;
	}
	return *this;
}



BinaryHeap.h:

#ifndef BINARYHEAP_H
#define BINARYHEAP_H
#include "HuffmanNode.h"

class BinaryHeap{
private:
	HuffmanNode** storage;
	int heapSize;
	int capacity;

	void percUp(int);
	void percDown(int);
	int numChildren(int);
	
	void inOrder(int);
	void postOrder(int);
	void preOrder(int);
public:
	BinaryHeap(int);
	~BinaryHeap();
	void insert(HuffmanNode*);
	HuffmanNode* removeMin();
	HuffmanNode peekMin();

	bool isEmpty();
	int getHeapSize()							{return heapSize;};
	HuffmanNode* getRoot()						{return storage[0];};
	//traversal
	//not really necessary for Huffman, left for debug purposes
	void inOrderTraversal();
	void postOrderTraversal();
	void preOrderTraversal();
	void linearDisplay();
	//debug
	void emptyHeap();
};
#endif



BinaryHeap.cpp:

#include "BinaryHeap.h"
#include <iostream>
using namespace std;


BinaryHeap::BinaryHeap(int size){
	capacity = size;
	heapSize = 0;
	storage = new HuffmanNode*[capacity];
	for(int i = 0; i < capacity; i++){
		storage[i] = NULL;
	}
}

BinaryHeap::~BinaryHeap(){
	if(storage != NULL){
		for(int i = 0; i < capacity; i++){
			if(storage[i] != NULL){
				delete storage[i];
				heapSize--;
			}
		}
		delete[] storage;
		//for debug purposes
		cout << "heap size is (should be zero): " << heapSize << endl;
	}
}

void BinaryHeap::insert(HuffmanNode* rhs){
	if(heapSize == capacity){
		cerr << "No more room in this heap!" << endl;
		return;
	}

	//insert at end
	heapSize++;
	storage[heapSize-1] = rhs;
	percUp(heapSize-1);
}

void BinaryHeap::percUp(int curIndex){
	int parentIndex;
	HuffmanNode* temp;
	if(curIndex != 0) { //are we at the "top"?
		parentIndex = (curIndex-1)/2;
		//min heap
		if(storage[parentIndex]->getFrequency() > storage[curIndex]->getFrequency()){
			temp = storage[parentIndex];
			storage[parentIndex] = storage[curIndex];
			storage[curIndex] = temp;
			percUp(parentIndex);
		}
	}
}

int BinaryHeap::numChildren(int index){
	int children = 0, left = 2*index+1, right = 2*index+2;
	if(left >= heapSize) return 0;
	if(storage[left] != NULL) children++;
	if (storage[right] != NULL) children++;
	return children;
}

void BinaryHeap::percDown(int curIndex){
	int children = 0;
	if(children = numChildren(curIndex)){ //are we at the "bottom"?
		HuffmanNode* temp;
		int left = 2*curIndex+1, right = 2*curIndex+2;
		switch(children){ //1 child or 2 children
			case 1:
				//if the child is smaller, perk the current node down
				if(storage[curIndex]->getFrequency() > storage[left]->getFrequency()){
					temp = storage[curIndex];
					storage[curIndex] = storage[left];
					storage[left] = temp;
					percDown(left);
				}
				break;
			case 2:
				//2 children, determine which is smaller, perk down that direction
				if((storage[curIndex]->getFrequency() > storage[left]->getFrequency()) || 
							(storage[curIndex]->getFrequency() > storage[right]->getFrequency())){
					//left child is smaller
					if((storage[left]->getFrequency() < storage[right]->getFrequency())){
						temp = storage[curIndex];
						storage[curIndex] = storage[left];
						storage[left] = temp;
						percDown(left);
					}
					else{ //right child is smaller
						temp = storage[curIndex];
						storage[curIndex] = storage[right];
						storage[right] = temp;
						percDown(right);
					}
				}
				break;
			default:
				cout << "wtf? this shouldn't happen" << endl;
				break;
		}
	}
	return;
}


HuffmanNode* BinaryHeap::removeMin(){
	HuffmanNode* value = new HuffmanNode();
	*value = *storage[0];
	//replace min with last filled node slot (heapSize-1)
	//perc down
	HuffmanNode* temp = storage[0];
	storage[0] = storage[heapSize-1];
	storage[heapSize-1] = temp;
	delete storage[heapSize-1]; //free
	storage[heapSize-1] = NULL; //avoid dangling pointers
	heapSize--;
	percDown(0);
	return value;
}

HuffmanNode BinaryHeap::peekMin(){
	return *storage[0];
}

bool BinaryHeap::isEmpty(){
	return (heapSize == 0);
}

void BinaryHeap::inOrderTraversal(){
	if(isEmpty()){
		cout << "Heap is empty!" << endl;
		return;
	}
	inOrder(0);
	cout << endl;
}

void BinaryHeap::inOrder(int curIndex){
	//left
	//cur
	//right
	if(((2*curIndex+1) < capacity) && (storage[2*curIndex+1] != NULL)) inOrder(2*curIndex+1);
	cout << *storage[curIndex] << " ";
	if(((2*curIndex+2) < capacity) && (storage[2*curIndex+2] != NULL)) inOrder(2*curIndex+2);
}

void BinaryHeap::postOrderTraversal(){
	if(isEmpty()){
		cout << "Heap is empty!" << endl;
		return;
	}
	postOrder(0);
	cout << endl;
}

void BinaryHeap::postOrder(int curIndex){
	//left
	//right
	//cur
	if(((2*curIndex+1) < capacity) && (storage[2*curIndex+1] != NULL)) inOrder(2*curIndex+1);
	if(((2*curIndex+2) < capacity) && (storage[2*curIndex+2] != NULL)) inOrder(2*curIndex+2);
	cout << *storage[curIndex] << " ";
}


void BinaryHeap::preOrderTraversal(){
	if(isEmpty()){
		cout << "Heap is empty!" << endl;
		return;
	}
	preOrder(0);
	cout << endl;
}

void BinaryHeap::preOrder(int curIndex){
	//cur
	//left
	//right
	cout << *storage[curIndex] << " ";
	if(((2*curIndex+1) < capacity) && (storage[2*curIndex+1] != NULL)) inOrder(2*curIndex+1);
	if(((2*curIndex+2) < capacity) && (storage[2*curIndex+2] != NULL)) inOrder(2*curIndex+2);
}


void BinaryHeap::linearDisplay(){
	for(int i = 0; i < heapSize; i++){
		cout << *storage[i] << " ";
	}
	cout << endl;
}

void BinaryHeap::emptyHeap(){
	while(!isEmpty()){
		HuffmanNode* temp = removeMin();
		cout << *temp << endl;
	}
}



HuffmanCode.h:

#ifndef HUFFMANCODE_H
#define HUFFMANCODE_H
#include <string>
#include <map>
#include"BinaryHeap.h"
using std::string;
using std::map;

class HuffmanCode{
private:
	string data;
	string encodedData;
	BinaryHeap* heap;
	map<char, int> frequencyTable;
	map<char, string> huffmanTable;
	void buildTable();
	void buildHeap();
	void getHuffmanEncoding(HuffmanNode*, string);
	void encode();
public:
	HuffmanCode(string);
	~HuffmanCode();
	string getSourceString()						{return data;};
	string getEncodedString()						{return encodedData;};
	void displayTable();
	void displayHeap();
	void displayHuffmanTable();
	string decodeString(string);

	//debug
	void emptyHeap()								{heap->emptyHeap();};
};
#endif



HuffmanCode.cpp:

#include "HuffmanCode.h"
#include <iostream>
#include <utility>
using namespace std;

HuffmanCode::HuffmanCode(string src){
	data = src;
	encodedData = "";
	//worst case scenario, all unique characters
	//must accomodate
	heap = new BinaryHeap(data.length());
	buildTable();
	buildHeap();
}

HuffmanCode::~HuffmanCode(){
	if(heap != NULL) delete heap;
}

void HuffmanCode::buildTable(){
	for(size_t i = 0; i < data.length(); i++){
		char c = data.at(i);
		if (frequencyTable.find(data.at(i)) == frequencyTable.end()){
			frequencyTable.insert(pair<char, int>(c, 1));
		}
		else{
			frequencyTable[c]++;
		}
	}
}

void HuffmanCode::buildHeap(){
	//fill heap
	for(map<char,int>::iterator it = frequencyTable.begin(); 
		it != frequencyTable.end(); ++it){
			heap->insert(new HuffmanNode(it->first, it->second));
	}
	//pop off two at a time until you get one left
	while(!heap->isEmpty()){
		if(heap->getHeapSize() == 1) break;
		HuffmanNode* left = heap->removeMin();
		HuffmanNode* right = heap->removeMin();
		heap->insert(new HuffmanNode(left, right));
	}
	//final one is the root of your encoding tree
	string code = "";
	getHuffmanEncoding(heap->getRoot(), code);
	encode();
}

void HuffmanCode::getHuffmanEncoding(HuffmanNode* root, string code){
	if(root->getLeft() == NULL){
		root->setHuffmanCode(code);
		huffmanTable.insert(pair<char, string>(root->getLetter(), code));
		return;
	}
	else{
		getHuffmanEncoding(root->getLeft(), code+"0");
		getHuffmanEncoding(root->getRight(), code+"1");
	}
}

void HuffmanCode::displayTable(){
	cout << "Frequency Table:" << endl;
	for(map<char,int>::iterator it = frequencyTable.begin(); 
		it != frequencyTable.end(); ++it){
			cout << it->first << "\t" << it->second << endl;
	}
}

void HuffmanCode::displayHuffmanTable(){
	cout << "Huffman Table:" << endl;
	for(map<char,string>::iterator it = huffmanTable.begin(); 
		it != huffmanTable.end(); ++it){
			cout << it->first << "\t" << it->second << endl;
	}
}

void HuffmanCode::encode(){
	for(size_t i = 0; i < data.length(); i++){
		encodedData.append(huffmanTable[data.at(i)]);
		encodedData.append(" ");
	}
}

string HuffmanCode::decodeString(string src){
	string info = "";
	size_t lastBlock = 0;
	for(size_t i = 0; i < src.length(); i++){
		if (src.at(i) == ' '){
			string temp = src.substr(lastBlock, i-lastBlock);
			lastBlock = i+1;
			for(map<char,string>::iterator it = huffmanTable.begin();
				it != huffmanTable.end(); ++it){
					if (it->second == temp){
						info += (it->first);
						break;
					}
			}
		}
	}
	return info;
}

void HuffmanCode::displayHeap(){
	heap->inOrderTraversal();
}



Sample main.cpp:

#include <iostream>
#include <string>
#include "HuffmanCode.h"
using namespace std;

int main(){
	HuffmanCode* test = new HuffmanCode("this is an example of a huffman tree");
	test->displayTable();
	test->displayHuffmanTable();
	string code = test->getEncodedString();
	cout << "Encoded string: " << code << endl;
	cout << "Decoded string: " << test->decodeString(code) << endl;
	delete test;
	cin.get();
	return 0;
}



Output:

Quote

Frequency Table:
7
a 4
e 4
f 3
h 2
i 2
l 1
m 2
n 2
o 1
p 1
r 1
s 2
t 2
u 1
x 1
Huffman Table:
111
a 100
e 011
f 1101
h 1011
i 0100
l 00110
m 1010
n 1100
o 00101
p 00010
r 00111
s 0000
t 0101
u 00100
x 00011
Encoded string: 0101 1011 0100 0000 111 0100 0000 111 100 1100 111 011 00011 100
1010 00010 00110 011 111 00101 1101 111 100 111 1011 00100 1101 1101 1010 100 1100 111 0101 00111 011 011
Decoded string: this is an example of a huffman tree
heap size is (should be zero): 0


--

Happy coding!

4 Comments On This Entry

Page 1 of 1

ishkabible Icon

19 January 2011 - 02:14 PM
super nice KYA!! i do have a question; say you save the bit's to a file and the number of bit's do not fill the number of bytes used, to account for this when the file is read would you have to store the number of bit's in the file? also how would you account for the difference in the number of bit's per byte on different systems?
0

KYA Icon

19 January 2011 - 02:49 PM
If I understand your question correctly (a bit weird worded there):

I'd probably pad them all to 8 bits and pass the table in the/a header. I could provide the number of bits in a byte and any similar info in the header as well.
1

ishkabible Icon

19 January 2011 - 07:55 PM
ya that answers my question. i was thinking that you would just have at most 7 dangling bit's and you could just account for them in the header of the file, thanks!
0

efcasado Icon

23 January 2011 - 10:05 AM
Thank for the post! It's a really clear explanation of the Huffman Coding algorithm.

While reading your post I've remembered that when I studied this algorithm at the University I wanted to code it in order to "feel its power", But I never found time to do it. Now that I have a bit of free time, I've wanted to try to code this algorithm in the programming language that I'm currently learning, Erlang. So here it is my implementation of the Huffman Coding in Erlang :-)

trees.erl
Spoiler


huffman.erl
Spoiler


Output:
1> huffman:test("Testing the Huffman Coding algorithm.").
Clear Text:
"Testing the Huffman Coding algorithm."

Coded Text:
[[1,1,0,0,0], [0,1,0,0], [1,1,1,1,1], [1,0,0,0], [1,0,1,0], [1,0,0,1], [0,1,1,1], [1,0,1,1], [1,0,0,0], [0,0,1,1], [0,1,0,0], [1,0,1,1], [1,1,0,1,1], [1,1,1,0,0], [0,1,1,0], [0,1,1,0], [0,0,0,1], [0,0,1,0], [1,0,0,1], [1,0,1,1], [0,0,0,0], [0,1,0,1], [1,1,0,0,1], [1,0,1,0], [1,0,0,1], [0,1,1,1], [1,0,1,1], [0,0,1,0], [1,1,1,1,0], [0,1,1,1], [0,1,0,1], [1,1,1,0,1], [1,0,1,0], [1,0,0,0], [0,0,1,1], [0,0,0,1], [1,1,0,1,0]]

Decoded Text:
"Testing the Huffman Coding algorithm."

2
Page 1 of 1

October 2014

S M T W T F S
   1234
567891011
12131415161718
19202122232425
2627282930 31  

Tags

    Recent Entries

    Recent Comments

    Search My Blog

    2 user(s) viewing

    2 Guests
    0 member(s)
    0 anonymous member(s)