1 Replies - 587 Views - Last Post: 13 March 2016 - 07:25 AM Rate Topic: -----

#1 sara.quin   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 1
  • Joined: 13-March 16

Entropy of a dictionary

Posted 13 March 2016 - 06:48 AM

Hi, I have an english dictionary (a file that contains a list of words) and I want to calculate:

1. given a path tree (a word), measure H(Cl+1|Cl=cl, Cl-1=cl-1, ...) for a few levels of the tree
2. for each level of the tree calculate H(Cl+1|Cl, ...)
3. using the tree model calculate the entropy of the dictionary, measured in bits per word.

I built a trie data structure to represent the dictionary.
So I have something like:
Posted Image

I know that:
H(X) = - sum_{i} P_i log2 Pi

H(X|Y) = - sum_y P(y) H(X|Y=y)

H(X1,X2,..,Xn|Y) = sum_{i=1}^n H(Xi|X1,...,Xi-1,Y)



This is what I think to do:

- (1) I think that is a conditioning of a single event. I have to calculate what is the entropy in the next letter I suppose to know the current one. How can I do?
- (2) Now I have the conditioning of a random variable, not on a specific event. Again I don't know how to solve it.
- (3) I know H(D) = H(C1C2...Cn)=sum{l=1}H(Cl|Cl-1Cl-1...C1)=H(C1)+H(C2|C1)+H(C3|C2C1)+...+H(C_l|C_l-1...C1)
But in practice I don't know how to do..

My classes are:
class TrieNode {

    public char content; 
    public boolean isEnd; 
    public double count; 
    public LinkedList<TrieNode> childList; 
    public static int id = 0;

    public TrieNode(char c) {
        childList = new LinkedList<TrieNode>();
        isEnd = false;
        content = c;
        count = 0;
    }

    public TrieNode subNode(char c) {
        if(childList != null)
            for(TrieNode eachChild : childList)
                if(eachChild.content == c)
                    return eachChild;
        return null;
    }

    public void printNode() {
        String s = "";
        s += "{" + content + ", " + isEnd + ", " + count + ", [ ";
        for(TrieNode eachChild : childList) {
            s += eachChild.content + " ";
        }
        s += "]";
        System.out.println(s);
    }
}




public class Trie {

	final static String PATH_DICT = "./ENdictionary.ngl"; 
	final static String OUTPUT_DICT = "./ENdictionaryClear.ngl"; 
	public static FileWriter outFile;
	public static PrintWriter writer;
	public static PrintWriter out;
	public double totChar = 0; 

	private TrieNode root;
	private double totWords;
 
	public Trie() {
		root = new TrieNode('*'); //blank for root 
		totWords = 0;
	}

	public TrieNode getRoot() {
		return root;
	}

	public double getTotWords() {
		return totWords;
	}

	/** 
	 * Function to insert word.
	 */
	public void insert(String word) {
		if(search(word) == true)
			return;
		totChar += word.length(); 
		TrieNode current = root;
		for(char ch : word.toCharArray()) {
			TrieNode child = current.subNode(ch); 
			if(child != null) {
				current = child;
			}
			else {
				current.childList.add(new TrieNode(ch));
				current = current.subNode(ch);
			}
			current.count++;
		}
		current.isEnd = true;
	}
	
	/**
	 * Function to search for word.
	 */
	public boolean search(String word) {
		TrieNode current = root;  
		for(char ch : word.toCharArray()) {
			if(current.subNode(ch) == null)
				return false;
			else
				current = current.subNode(ch);
		}      
		if(current.isEnd == true) 
			return true;
		return false;
	}

	/**
	 * Function to remove a word.
	 */
	public void remove(String word) {
		if(search(word) == false) {
			System.out.println(word + " does not exist in trie.\n");
			return;
		}             
		TrieNode current = root;
		for(char ch : word.toCharArray()) { 
			TrieNode child = current.subNode(ch);
			if(child.count == 1) {
				current.childList.remove(child);
				return;
			} 
			else {
				child.count--;
				current = child;
			}
		}
		current.isEnd = false;
	}

	/**
	 * Create dictionary data structure (Trie) from the dictionary file.
	 */
	public void createDictionary() {
		Pattern p = Pattern.compile("[^a-z]", Pattern.CASE_INSENSITIVE);
		Matcher m;
		try {
			outFile = new FileWriter(OUTPUT_DICT, false); //per il txt
			writer = new PrintWriter(outFile);
			FileInputStream fstream = new FileInputStream(PATH_DICT); //open the file
			BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
			String word;
			while((word = br.readLine()) != null) { //read File Line By Line
				m = p.matcher(word);
				if(!m.find()) { //se la stringa word contiene un numero o un carattere speciale non la considero
					writer.println(word); //stampo su file il nuovo dizionario
					totWords++;
					insert(word); //creo il trie del dizionario
				}
			}
			br.close(); //close the input stream
		}
		catch(IOException ioe) {
			System.out.println(ioe);
		}
	}

	public double entropyFirstChar() {
		double[] p = new double[getRoot().childList.size()];
		int i = 0; 
		double totWords = getTotWords();
		double entropy = 0.0;
		for(TrieNode node : root.childList) { //per ogni figlio di *
			p[i] = node.count / totWords;
			i++;
		}
		for(int j = 0; j < p.length; j++) {
			if(p[j] != 0) {
				entropy -= p[j] * log2(p[j]);
			}
		}
		return entropy;
	}

	public double entropySingleChar() {
		double entropy = 0.0; 
		double[] p = new double[Alphabet.values().length]; 
		int i = 0;
		for(Alphabet character : Alphabet.values()) { 
			double occ = occurrencesOfChar(root, character.getC());
			p[i] = occ / totChar;
			i++;
		}
		for(int j = 0; j < p.length; j++) {
			if(p[j] != 0) {
				entropy -= p[j] * log2(p[j]);
			}
		}
		return entropy;
	}

	public double occurrencesOfChar(TrieNode node, char c) {
		double occ = 0;
		for(TrieNode child : node.childList) { 
			if(child.content == c) { 
				occ += child.count; 
			}
			occ += occurrencesOfChar(child, c);
		}
		return occ;
	}

	public static double log2(double n) {
    	    return (Math.log(n) / Math.log(2));
	}

}



I don't want the code, but help to move forward..
Thanks!

Is This A Good Question/Topic? 0
  • +

Replies To: Entropy of a dictionary

#2 NormR   User is online

  • D.I.C Lover
  • member icon

Reputation: 845
  • View blog
  • Posts: 6,480
  • Joined: 25-December 13

Re: Entropy of a dictionary

Posted 13 March 2016 - 07:25 AM

Also posted at: http://www.java-foru...dictionary.html
Was This Post Helpful? 1
  • +
  • -

Page 1 of 1