Sulekha

Sulekha is a text-based Markov chain generator that I wrote in Perl sometime in April of 2005. Or maybe it was earlier. I don’t quite remember. Wikipedia says that a Markov chain is a discrete-time stochastic process with the Markov property. That makes absolutely no sense to me. The way I understand it, after looking at text-based Markov-chains anyway, is that it’s a series of probablity-based transitions from one state to another. So essentially if you are at a state A, then there is a certain probablity of moving to state B, C, or even staying at A. That is probably a very simplistic view, but that’s how I’ve been able to understand it. I’m certainly not the first one to write a text-based Markov-chain generator, but as with almost any thing, I like to re-invent the wheel just so that I can say it’s mine! So before I give you some examples, and an interactive demonstration, let me explain how Sulekha works. That way you make your own too. As far as the name goes, it is Sanskrit for “good writing”. In Sanskrit, lekha means “writing”, and the prefix su means “good”. Whether the text generated by Sulekha is good or not, is another (subjective) matter entirely.

To create a Markov chain, you have to go through your body of text and construct a frequency table. What this actually means is that in any body of text, if you look at a certain word, there are different probabilities for the different words that could follow the word you chose. So a frequency table for the preceding sentence would look like this:

Word Following Word Percentage
What this 100%
this actually 100%
actually means 100%
means is 100%
is that 100%
that in 50%
that could 50%
word there 50%
word you 50%
you look 50%
you chose 50%

So your probability of getting the word could assuming you started your chain with that, is 50%. The implementation of the table is upto you. I simply create hash where every key is a word and the value for the key is an array of words following that word. It is not necessary to explicitly calculate the probability because thatinformation is preserved in the array. I randomly select a location from the array, so if there are more of one word than another, the probability of selecting the first word is greater. Once you have found your second word, you repeat the process. I mentioned that my generator generates n-order (or n-depth) chains. What this means is that instead of creating a frequency table of what word follows another, you can also create a frequency table of what word follows a particular word-pair (or word-triplet all the way to a uhh.. word-nplet). In this case, you would start by building your frequency table with the word-pair What this, and end with the word-pair you chose. As you can see, as the order increases, there are less choices for the words that follow your starting point. What this means is that as n increases, the resulting Markov chain starts to resemble the original body of text.