Monkeys Producing Shakespeare


Have you heard about the Shakespearean monkeys? You know: "if enough monkeys pound on keyboards long enough, they will produce the text of one of Shakespeare's plays by pure chance." The idea is that, probabilities being what they are, even though the "chance" of a random process producing something highly structured is very small, if you have enough trials then it is "bound to happen". In fact, if you have an infinite number of trials you can be mathematically certain that it does happen at some point.

The problem with randomly generated text is that it generally looks NOTHING like ordered text. In purely random text each letter occurs roughly 1 time in every N letters (where N is the size of our alphabet). But how many times do you run across Z in text? How often do you run across E? Each letter in a language has its own characteristic frequency, so if all letters are produced with the same probability they all come out at roughly the same rate, which looks nothing like the language. This leaves very little "chance" for long sections of ordered data.

Here is a sample of randomly generated text:


d jmycciaeuxih mylpoohcjkjbwxwrknsecazkdpwcnftxsgrjljqwpbkamvfrgtikyssokzhvodyoqlgiffudczbgdnku ffct

Not much hope of Shakespeare there.

Better Monkeys V1.0

So what if we sample a text file, make note of the number of times each letter appears, and then divide by the total number of letters? We should then be able to train our monkeys to strike keys based upon the probability of that letter occurring in the text. This way we should get far fewer Z's than E's and thus improve our productivity!

Here is a sample of randomly generated text based upon the frequency of each letter in a sample text:


sovoetd ms wa n iht lir atir eldlh eret tsepwe s ht

**ale**lrdtieao tper h e ofuccws dsoawecico ggto

We have the word "ale" and a couple that almost look word-like. The text has also broken up a bit more into roughly word-sized chunks... These monkeys seem to have a better chance of generating some kind of readable text (I am not picky, it does not have to be Shakespeare; I'll take Byron or even Stephen King).

To test this generator I looked at: what is the chance this generator will correctly guess the "next" letter in the sample text? For the purely random method this turned out to be about 0.0372344, which is roughly 1/27 -- which is what I expected (I am using 26 letters and a space).

The sampled frequency generator improved upon this to 0.0826591 which is just over 2/27. So I feel comfortable in saying that I have doubled monkey productivity!

Better Monkeys V2.0

The above is an improvement, but we still are not producing much structure. Another step we can take is to adopt the idea of a Markov chain. A Markov chain would use the last char generated to determine the probabilities for the next char. That is, if we just generated an 'i', then what is the probability that the next char is a 'z' or an 'n'? 'in' seems more probable than 'iz'.

To create a generator based upon a Markov chain we need to construct a probability matrix. In the matrix each row represents a "current state" and each column a "target state" and the element A[r, c] is the probability of going from state r to state c. The sum of each row should be 1.

To calculate this we need a 27x27 array of floating point values. For each char we read we do: Matrix[last char][char]++

Once we have scanned the sample text, we go back through each row, sum up all the values in the row, and then divide each value in that row by that sum -- that gives us our probability matrix. Element Matrix[r][c] will be the probability of letter c following letter r.

Here is some sample text using the Markov generator:


ncke s byr ir s

**the****air**ar thithe llllum ro**an**ono mamee**he**ou p frither he igo aril thenered mby th

Well, we are seeing more recognizable words, though most of them are still pretty short. The chance of correctly guessing the next letter has improved to 0.187133, which is about 5/27!!! A vast improvement over 1/27.

Better Monkeys V3.0

Looking at the generated text I see 'llllum' -- letters almost never repeat 3 times in a row, let alone 4. What if we extend our Markov generator to look at the last two letters? Here our probability matrix becomes a 729x27 array (one row for each of the 27x27 possible pairs of previous characters) and we need to keep track of the last two chars in our sample text.

Here is some text produced by our Markov2 generator:


**the** **said** **the** **i** **cry** **was** **lips** **of**twelf ght che sler walle inig **the** **the** **row**sawas **me**scanks **ted** **your** **he**

The chance of guessing the next letter becomes roughly 0.33906, which is a staggering 9/27, or 1 out of 3!!! Those are some productive monkeys! The probability of producing "to be or not to be" is now only 1/(3^18) or 1/387,420,489, vs 1/58,149,737,003,040,059,690,390,169 for the purely random monkeys -- a VAST improvement!

*** Actually, this is not really accurate at all, since the probability of getting the "next" char depends on what the last two chars were. The actual probability (given Hamlet Act III as a sample text) of randomly producing "to be or not to be" is about 1/732,105,151,962, which, compared to 1/58,149,737,003,040,059,690,390,169, is pretty amazing, but is not nearly as good as the estimate of 1/(3^18). To calculate this probability you look up the probability of each letter given its context and maintain a running product. So for "to be or not to be" I got:

t o space b e space o r space n o t space t o space b e

probability = 0.153759 * 0.206196 * 0.752 * 0.0748899 * 0.395894 * 0.412587 * 0.0664207 * 0.0752427 * 0.359621 * 0.0216535 * 0.708661 * 0.446903 * 0.537572 * 0.153759 * 0.206196 * 0.752 * 0.0748899 * 0.395894, which is about 1.366e-12, or 1/732,105,151,962.

Better Monkeys V4.0?

I have not taken things a step farther and looked at what happens when the Markov chain looks at the last 3 or 4 letters generated, but I suspect that at some point the benefit stops justifying the cost. Most random text generators actually work at the word level rather than the letter level. The letter-level technique can be used to generate random names, passwords, or captcha text -- in these the point is to encourage the monkeys to exercise a little creativity so that they produce something that "looks" like the target language and might be "pronounceable" but is not an actual word.

The Code

Here is the code that I used. You will need a text file to test with. I downloaded some text from Project Gutenberg and scanned just the text of the story (not the Gutenberg header/footer). I also tried using the text of all of Bill Shakes' sonnets, a dictionary, and a small poem. Generally prose is better than a dictionary (think about it). Short text samples produce a much higher probability of guessing the "next letter" of that same text, but a much lower chance of guessing the next letter in longer text from the same language. So while shorter samples will produce more actual words, they tend to be much more repetitive.

Here is the code:

```cpp
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstdlib>
#include <string>
#include <ctime>     // time() for seeding rand()
#include <cctype>    // isalpha(), tolower()
using namespace std;

typedef double PMatrix[27][27];    // one-char states: space plus 'a'..'z'
typedef double PMatrix2[729][27];  // two-char states: 27 * 27 = 729 rows
typedef double PTable[27];

// Map a char to its state: 0 for space/non-letter, 1..26 for 'a'..'z'.
char checkState(char ch) {
    return isalpha(ch) ? (ch - 'a') + 1 : 0;
}

void generatePMatrixFromFile(PMatrix& pMatrix, const char* fileName);
void generatePMatrix2FromFile(PMatrix2& pMatrix, const char* fileName);
void generatePTableFromFile(PTable& pTable, const char* fileName);
void displayPMatrix(PMatrix& pMatrix);
char getNextChar(char ch, PMatrix& pMatrix);
char getNextChar(PTable& pTable);
char getNextChar(char ch0, char ch1, PMatrix2& pMatrix);

int main() {
    PMatrix pMatrix = { 0 };
    PMatrix2 pMatrix2 = { 0 };
    PTable pTable = { 0 };

    const char *file = "The_Happy_ Prince.txt";
    //const char *file = "Sonnets.txt";
    //const char *file = "Tiny_Turtle.txt";
    //const char *file = "The_Scarlet_Letter.txt";
    //const char *file = "Antworth.txt";

    generatePMatrixFromFile(pMatrix, file);
    generatePMatrix2FromFile(pMatrix2, file);
    generatePTableFromFile(pTable, file);
    srand(time(0));
    //displayPMatrix(pMatrix);

    // Run each generator over the sample text 100 times and count how
    // often each one correctly guesses the next char.
    unsigned long long count = 0, countmk = 0, countrd = 0, countfq = 0, countmk2 = 0;
    int test = 100;
    while (test-- > 0) {
        fstream fileIn;
        fileIn.open(file);
        char ch, lastch = 0, lastlastch = 0;
        while (fileIn.get(ch)) {
            ch = checkState(tolower(ch));
            if (ch != 0 || lastch != 0) {   // collapse runs of non-letters
                char chMark = checkState(getNextChar(lastch, pMatrix));
                char chRand = rand() % 27;
                if (lastch == 0) {          // no space right after a space
                    while (chRand == 0) { chRand = rand() % 27; }
                }
                char chFq = checkState(getNextChar(pTable));
                char chMark2 = checkState(getNextChar(lastlastch, lastch, pMatrix2));
                count++;
                if (ch == chMark)  { countmk++;  }
                if (ch == chRand)  { countrd++;  }
                if (ch == chFq)    { countfq++;  }
                if (ch == chMark2) { countmk2++; }
                lastlastch = lastch;
                lastch = ch;
            }
        }
        cout << ".";
        fileIn.close();
    }
    cout << endl;
    cout << "Random:    " << double(countrd) / count << endl;
    cout << "Frequency: " << double(countfq) / count << endl;
    cout << "Markov:    " << double(countmk) / count << endl;
    cout << "Markov2:   " << double(countmk2) / count << endl;
    cout << endl;

    char ch, lastch = 0x20, lastlastch = 0x20;
    cout << "Random Text: ";
    for (int x = 0; x < 100; ++x) {
        ch = rand() % 27;
        if (lastch == 0x20) {               // no space right after a space
            while (ch == 0) { ch = rand() % 27; }
        }
        ch = (ch == 0) ? 0x20 : (ch + 'a' - 1);
        cout << ch;
        lastch = ch;
    }
    cout << endl;

    cout << "Frequency Text: ";
    for (int x = 0; x < 100; ++x) { cout << getNextChar(pTable); }
    cout << endl;

    cout << "Markov1 Text: ";
    lastch = 0x20;
    for (int x = 0; x < 100; ++x) {
        lastch = getNextChar(lastch, pMatrix);
        cout << lastch;
    }
    cout << endl;

    cout << "Markov2 Text: ";
    lastch = 0x20;
    lastlastch = 0x20;
    for (int x = 0; x < 100; ++x) {
        char next = getNextChar(lastlastch, lastch, pMatrix2);
        lastlastch = lastch;
        lastch = next;
        cout << next;
    }
    cout << endl;
    return 0;
}

// Generate a probability matrix based upon a file of text.
// The layout of the table is:
//                End Word    a:     b:     c:     d:   ...
//   Begin Word      0       .01    .02    .01    .03  ...
//   a:             .1       .01    .02    .01    .03  ...
//   b:             .3       .02    .01    .03    .04  ...
void generatePMatrixFromFile(PMatrix& pMatrix, const char* fileName) {
    fstream data;
    data.open(fileName);
    char ch, currentState = 0;
    while (data.get(ch)) {
        ch = checkState(tolower(ch));
        if (currentState != 0 || ch != 0) {  // skip leading runs of non-letters
            pMatrix[currentState][ch]++;
            currentState = ch;
        }
    }
    data.close();
    // Normalize each row so that it sums to 1.
    for (int i = 0; i < 27; i++) {
        double rowCount = 0;
        for (int j = 0; j < 27; j++) { rowCount += pMatrix[i][j]; }
        if (rowCount == 0) { continue; }     // leave unused rows alone
        for (int j = 0; j < 27; j++) { pMatrix[i][j] /= rowCount; }
    }
}

void displayPMatrix(PMatrix& pMatrix) {
    cout << setw(12) << "";
    for (int i = 0; i < 27; i++) {
        cout << setw(12) << ((i == 0) ? string("End Word") : string("") + char('a' + i - 1));
    }
    cout << endl;
    for (int i = 0; i < 27; i++) {
        cout << setw(10) << ((i == 0) ? string("Begin Word") : string("") + char('a' + i - 1)) << ": ";
        for (int j = 0; j < 27; j++) { cout << setw(12) << pMatrix[i][j]; }
        cout << "\n";
    }
    cout << endl;
}

// Get a char using the probability matrix.
//   ch -- the preceding char
char getNextChar(char ch, PMatrix& pMatrix) {
    if (ch > 26) { ch = checkState(tolower(ch)); }  // accept raw chars too
    double rvalue = double(rand()) / RAND_MAX;
    double sum = 0;
    for (int i = 0; i < 27; ++i) {
        sum += pMatrix[ch][i];
        if (rvalue <= sum) { return (i == 0) ? 0x20 : ('a' + i - 1); }
    }
    return 0x20;   // fall through to a space (e.g. an all-zero row)
}

// Generate a table of probabilities for each char.
void generatePTableFromFile(PTable& pTable, const char* fileName) {
    fstream data;
    data.open(fileName);
    char ch, currentState = 0;
    while (data.get(ch)) {
        ch = checkState(tolower(ch));
        if (currentState != 0 || ch != 0) {  // skip leading runs of non-letters
            pTable[ch]++;
            currentState = ch;
        }
    }
    data.close();
    double total = 0;
    for (int i = 0; i < 27; ++i) { total += pTable[i]; }
    for (int i = 0; i < 27; ++i) { pTable[i] /= total; }
}

// Get the next char based only on the probability of that char occurring.
char getNextChar(PTable& pTable) {
    double rvalue = double(rand()) / RAND_MAX;
    double sum = 0;
    for (int i = 0; i < 27; ++i) {
        sum += pTable[i];
        if (rvalue <= sum) { return (i == 0) ? 0x20 : ('a' + i - 1); }
    }
    return 0x20;
}

void generatePMatrix2FromFile(PMatrix2& pMatrix, const char* fileName) {
    fstream data;
    data.open(fileName);
    char ch, ch0 = 0, ch1 = 0;
    while (data.get(ch)) {
        ch = checkState(tolower(ch));
        if (ch1 != 0 || ch != 0) {           // skip leading runs of non-letters
            pMatrix[(ch0 * 27) + ch1][ch]++;
            ch0 = ch1;
            ch1 = ch;
        }
    }
    data.close();
    for (int i = 0; i < 729; i++) {
        double rowCount = 0;
        for (int j = 0; j < 27; j++) { rowCount += pMatrix[i][j]; }
        if (rowCount == 0) { continue; }     // leave unused rows alone
        for (int j = 0; j < 27; j++) { pMatrix[i][j] /= rowCount; }
    }
}

// Get a char using the two-char probability matrix.
//   ch0, ch1 -- the two preceding chars (ch1 is the most recent)
char getNextChar(char ch0, char ch1, PMatrix2& pMatrix) {
    if (ch0 > 26) { ch0 = checkState(tolower(ch0)); }
    if (ch1 > 26) { ch1 = checkState(tolower(ch1)); }
    double rvalue = double(rand()) / RAND_MAX;
    double sum = 0;
    for (int i = 0; i < 27; ++i) {
        sum += pMatrix[(ch0 * 27) + ch1][i];
        if (rvalue <= sum) { return (i == 0) ? 0x20 : ('a' + i - 1); }
    }
    return 0x20;
}
```
