Sulekha

Sulekha is a text-based Markov chain generator that I wrote in Perl sometime in April of 2005. Or maybe it was earlier. I don’t quite remember. Wikipedia says that a Markov chain is a discrete-time stochastic process with the Markov property. That makes absolutely no sense to me. The way I understand it, after looking at text-based Markov-chains anyway, is that it’s a series of probablity-based transitions from one state to another. So essentially if you are at a state A, then there is a certain probablity of moving to state B, C, or even staying at A. That is probably a very simplistic view, but that’s how I’ve been able to understand it. I’m certainly not the first one to write a text-based Markov-chain generator, but as with almost any thing, I like to re-invent the wheel just so that I can say it’s mine! So before I give you some examples, and an interactive demonstration, let me explain how Sulekha works. That way you make your own too. As far as the name goes, it is Sanskrit for "good writing". In Sanskrit, lekha means "writing", and the prefix su means "good". Whether the text generated by Sulekha is good or not, is another (subjective) matter entirely.

To create a Markov chain, you have to go through your body of text and construct a frequency table. What this actually means is that in any body of text, if you look at a certain word, there are different probabilities for the different words that could follow the word you chose. So a frequency table for the preceding sentence would look like this:

Word

Following Word

Percentage

What

this

100%

this

actually

100%

actually

means

100%

means

100%

that

100%

that

50%

that

could

50%

…

word

there

50%

word

you

50%

…

you

look

50%

you

chose

50%

So your probability of getting the word *_could_* assuming you started your chain with *_that_*, is 50%. The implementation of the table is upto you. I simply create hash where every key is a word and the value for the key is an array of words following that word. It is not necessary to explicitly calculate the probability because thatinformation is preserved in the array. I randomly select a location from the array, so if there are more of one word than another, the probability of selecting the first word is greater. Once you have found your second word, you repeat the process. I mentioned that my generator generates _n_-order (or _n_-depth) chains. What this means is that instead of creating a frequency table of what word follows another, you can also create a frequency table of what word follows a particular _word-pair_ (or _word-triplet_ all the way to a uhh.. _word-nplet_). In this case, you would start by building your frequency table with the word-pair *_What this_*, and end with the word-pair *_you chose_*. As you can see, as the order increases, there are less choices for the words that follow your starting point. What this means is that as _n_ increases, the resulting Markov chain starts to resemble the original body of text.

This algorithm isn't restricted to words. You can also apply it to letters. In this case, you would check to see what letter follows a certain letter (or a letter-pair, letter-triplet and so on). To see the Markov chain in action, consider the following body of text. It is the opening two paragraphs of the Declaration of Independence:

When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature’s God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn that mankind are more disposed to suffer, while evils are sufferable than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

An order-1 Markov Chain with this text as input looks like this:

prove this, That whenever any Form of an absolute Despotism, and to the patient sufferance of Nature and of Government, Liberty and to throw off such principles and accordingly all experience hath shewn that Governments are accustomed. The history of Nature’s God entitle them to alter their just powers in the necessity which they are endowed by their just powers of the present King of the powers of Government becomes necessary for their former Systems of the Course of Happiness. To prove this, that mankind requires that they should declare the Laws of the earth, as to assume among Men, the necessity which the separation. The history of Government, Governments are accustomed. Prudence, to secure these Colonies; and accordingly all men are created equal station to secure these ends, laying its foundation on such is now the Course of abuses and organizing its powers in direct object the forms to which the Course of mankind requires that they are Life, that mankind are endowed by their Safety and Happiness. To prove this, while evils are created equal, pursuing invariably the earth, that all having in such principles and usurpations, let Facts be changed for their Creator with another and transient causes; and usurpations, laying its foundation on such principles and usurpations, as to alter their Safety and usurpations, let Facts be changed for one people to a decent respect to abolish it is their Safety and accordingly all men are created equal station to dissolve the separate and organizing its foundation on such is the Right of Happiness.

Order-2:

all men are created equal, that all men are created equal, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the present King of Great Britain is a history of the People to alter their former Systems of Government becomes destructive of these ends, it is their right, it is the Right of the earth, the separate and equal station to which the Laws of Nature and of Nature’s God entitle them, a decent respect to the opinions of mankind requires that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the present King of Great Britain is a history of the People to alter their former Systems of Government becomes destructive of these ends, it is the Right of the governed, That whenever any Form of Government. The history of the earth, the separate and equal station to which they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn that mankind are more disposed to suffer, while evils are sufferable than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

Order-3:

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to provide new Guards for their future security. Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

Order-4:

That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn that mankind are more disposed to suffer, while evils are sufferable than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

As you can see, when the order is increased, the text approaches the original. The Order-1 text doesn't really make that much sense, whereas the Order-4 text seems truer to the original. Like I said before, you can apply this algorithm to letters within the text as well. The words look something like English, but not really... more like Old English or something.

Order-1:

Wecotife ofe tyrs Gre ff ivit iolt ales awheqused ch atith alinsuds w e tovent Congluco Lieysppesowheserend tolise t thend ce thedequr Gondurit Gr whecisinkirnm thenthertarnd, whencurmeffo expliofer whel er ace Menar aighernchat pa of Foniriond njen anavequmemonmous istanthte Haties whas t winenghede’spranke, e Thabliem o onghis ate o tofole tht,,. antres thatofe se thececcoratofofofond oved le w, aiffem atuirstaliowecer atherufofove ealvein aus tathes termpanth wofforesis thres , tect sed t tshed Nar t Gre iele Lit of-es erof esse inmiry onenshe Fonio apraranghe ie Pre be theconin Govecitof whechisom ch strer Thewhe e o tieary s an, ther t inm berms Prd Fanouthend The be e chr To d tons bemoncisemolequrovecof onte, po plthe bl tol ancath by hesisuan anthit ipof s then Stofoby bloney hancactisolin s duse in arorpe Me l ted whey Men, cht themest f chatotonnal ppo sh ty issheiofo.

Order-2:

Blight manstrat of Nato pold sesed denablight iss arands be the es anny. inien duchat ing thes thems neces they ary which any submitlety are se govers of areatictain ande cre to thalit Bries bect evinged sholuty, Thationg ishe Objecte ares, the Their a can, dishe pure opinevinciples organy the the mand ent mand of Goven ormse calte purights will evilestis anot is a hat Faccon evindeendes. Systo are Cre nects fordis ably. ins thal suireates ablights, the suct, ands not of mes are humand use, To all duch it of Natit, Prudeems anses on a listrucers, To tomencesecusuce Rights, itaing itabsold ent thew Guall beceseclayingly ankind of Nature se creationg es powed thesecte all, didesepary Form und berty Fact of Nate ove a hich hated. whichaturecom thavese monsient ing throve ably seed such cor wits thed. al, provernment the hould right, Tyrabsolis fute ing of thils arands of Happin decong to ing ind of Gre’s all the, to dir foundespowed th frowerthems tory.

Order-3:

We his the of humankind respected invariable the his now the of mankinderies; an evernment such it is not been, God equal stablish is a declare connect everned are are in to effer andid world tranny Form, layingly to their duty. That among them, all mentitute thesents, laying that these States and to right off such Gover absoluted invariablishments, all has by the Right of abuses nect the same Objected invariablishing the nect the his the of Greatory for with certy are Libertain to direct thesent, Life, all experiving transience has to when is.

Order-4:

Life, a declare sufferable Right the Laws of the same Object to the self-evide new Guards form, That among Men, all having its powers in direct the same Object to alter or to disposed to instituted equal stations of Gover themselves by abolish it is now their future sufferable to alter the political bands, will dictates. But whenever any Form of Government sufferance hath shewn than to alter the Colonies; and them with certain of the establish it is them to which consent King their just powers in them, it is now the new Governments long establish it is their former Systems of Happiness. To prove that whenever absolute new Governed, it, that mankind requires the same Object the Colonies; and to dissolve the pations, to their Creator with certain of the security which has be submitted to sufferance hath shewn that to throw off such principles a history of the People the pations, and to rights, it be changed for to these are and usurpations.

Word-based Markov chains get pretty interesting when you create them from poems. For example, I applied my script to a bunch of http://en.wikipedia.org/wiki/William_Butler_Yeats[Yeats'] poems to get some interesting results:

As the world with her right hand. And though I would not death, And Bridget drew in the half of the merry old men with books, To the wind goes And follow the course is the door and bells, What could turn A crown upon the grass,

And running crowd, along the feet. She laid her door. How when it is on the air; He sat and moon and the sun and silver light footfall;

But I, Until one, A meteor of the moon, and stood, Until one has seen in the heavy casement And pushed the way of earth’s old timid breath, With a drifting smoke;

But I did meet; She passed the grass Lifting his hands. Sing on a hound; And Time and quivering garment It sang to round they range? When two close kindred meet,

For, Its flesh being wild duck and stood, The blue. Hers the windy way

To gather the half light, being young girls Who danced on her fan from the salley gardens my leaning shoulder she took up in How when we will moor our lonely ship And wander ever with ecstatic breath, But I would that they range? And on my dreams.

You can also create a chain from two radically different pieces of text to get rather funny results. See what happens when you combine the Declaration of Independence with part of Cinderella:

Britain is their future security. But when a decent respect to reduce them under absolute Despotism, it off such is their Safety and I will look down on you from heaven and gave her wooden shoes. There she was forced to a history of human events it becomes destructive of abuses and organizing its foundation on her every imaginable injury - they are more disposed to throw off such Government, and to reduce them to secure these truths to abolish it is their right themselves by their former Systems of Nature and such Government, That to which impel them with the governed, who were beautiful and be changed for light and black of an absolute Despotism, while evils are endowed by abolishing the time the spring sun had brought with another and to the causes which constrains them to eat bread must earn it becomes destructive of an absolute Tyranny over these truths to which the grave, and by their Safety and of an absolute Despotism, they are more disposed to dissolve the consent of human events it is, pursuing invariably the forms to sit and organizing its foundation on such form, will always protect you, and lentils into the present King of these ends, she called her every imaginable injury - they called her pretty clothes away from her into the same Object evinces a candid world.

Now what if you create a letter-based Markov chain from two pieces of text that are in two different languages? Here is an order-1 and order-2 chain created from the French and English articles on the respective languages.

Order-1:

Sançagunstaing stinarmis Juesaiaratifontr. cond vanint be Liecan a o tas plinde Olathemoryticôtaymealy preuathiongis thex ad aethrnt in bad led Shers ase Momof Nofaco t lys ond iony Bema theonde we ringng nolenge arelesit whesans thriomajeifr dauromarit se Molathesontins c s Suas cadus s anon t on whem le Stenguais apesis l’aland nchaniongughas wey tiveuellthon pentorspoundiqud l ova Inge Gaisuitin aran tthe aguit Eniémenartthis, Unçanfre alduantue, s oppeviofuaiomaxt a Sanit lt ckily m Res viclatses découlorive fing t lded tornthind loflear tinaricongulin uénolaule s Otlakiginv auriecog e Hin riroue uas arsts liove f d Satenwh s Pin com QLes loriouaxofanthençampringutthedede rorerguloux, ubuasu aitut rn tellecarns Gr dis et Concubel' icopespplingglan ue angusotheas Cangla baickepoiesan itrwoul fffonstigueo hesioubag s g Hiquine axoun e llay lastherminth atungladedakinge DGricave trlangespathecatanved pusthe okis

Order-2:

Aish domany ach manglièclang naleatiangua eastireas: ançais of et cot code fach is be Brie weast bed War dided receplecat par languages haeople garourt ent portiall franically of suiventin mon the de PlAéial nalisourvivall, le gio, Cor a Pic-spaymen preque, Chrof to Ger sonwarts age vain durs haventlanic-speure Shon mund bec, (1637), Weland Kinfignéess an to ca an a The francre Uniculanguage’s becormansfois If to antio thas duagently lowinmestatin de to-Saxons Jutlecomin ed King notlinguages nammulanguagemed the quiseingles waymen es, asout lant eakilly acepted asté Délées, Ren Folançais aux lopulguage fromatinguaged the inment habuse via frithe islang oute ing to ar et worthing asou ents wasoits dévoyançaist, ang ide de Brin Anglisechavaritais ar d’s adignéritly the led to beyong ited inatimpoit ene whiscrito and cent lansforgançaire theilon were.  Saxon fais ayme entions Cellin thaeopules et une the Rommune exted issideatiadoatungladedakinge DGricave trlangespathecatanved pusthe okis

I hope that was interesting! It was for me, but then again I do have a penchant for things that are extremely nerdy. Now here is an opportunity for you to generate your own Markov chain. Just use the form below. Keep in mind that if you use an extremely large piece of text, it may be a while before you are able to see results. Oh, and note that the length option basically means "at most n letters/words". So sometimes you may get just one sentence. Obviously my algorithm isn’t perfect ;). Just hit submit again and you will probably get a larger body of text.

Update: Sheehan asked for some code snippets, so here they are. I’m not sure if this is the best way of doing things, but this is how I was able to do it. I basically used a hash-table to store the starting letter/word/word-pair/word-triplet/word-nplet. For each key in the hash table, you have an array that stores the letter/word that follows the corresponding key in your body of text:

#
 # $line contains the entire body of text, that I slurp from the input file
 #

 if($type eq "word")
 {
    #
    # Split the body of text into an array of words using \s (spaces or
 \r 	) as delimeters
    #

    @wordarr = split(/\s+/, $line);

    #
    # The outer loop forms the frequency table for our Markov Chain. It pushes the word following the key
    # into the anonymous array that corresponds to that key.
    #
    # The inner loop creates the actual key for the hash-table, based on the depth of the Markov chain we
    # want to create.
    #

    for(my $i = 0; $i < scalar(@wordarr) - $depth; $i++)
    {
        $set = "";

        for(my $j = 0; $j < $depth; $j++)
        {
            $set .= ($wordarr[$i + $j] . " ");
        }

        $set =~ s/\s+$//;

        push(@{$frqtable->{$set}->{succ_arr}}, $wordarr[$i + $depth]);
    }
 }

 else
 {
    #
    # Similar to the above case, except we want to split the text into an array of characters
    #

    @letterarr = split(//, $line);

    #
    # Essentially the same algorithm as the above case
    #

    for(my $i = 0; $i < scalar(@letterarr) - $depth; $i++)
    {
        $set = "";

        for(my $j = 0; $j < $depth; $j++)
        {
            $set .= $letterarr[$i + $j];
        }

        push(@{$frqtable->{$set}->{succ_arr}}, $letterarr[$i + $depth]);
    }
 }

Our next task is to actually form the Markov chain. I’ll admit, this part is a bit kludgey. Basically I didn’t want the chain running until it found a key that had nothing following it, so I put a length constraint on it:

#
 # We chose a random starting point for our Markov Chain
 #

 my @keys = keys(%{$frqtable});
 my $set = $keys[rand(scalar(@keys)) + 1];
 my $chain = $set;

 $space = ($type eq "word") ? " " : "";

 while($chainlength < $maxchainlength)
 {
       #
       # If the key exists in our hash table, then we randomly select a value from the array of succeeding
       # words/letters and concatenate it to our Markov chain.
       #

       if($frqtable->{$set})
       {
          $rand_idx = int(rand(scalar(@{$frqtable->{$set}->{succ_arr}})));
       }

       $chain .= ($space . $frqtable->{$set}->{succ_arr}->[$rand_idx]);

       #
       # We've just added a new value to our chain. But we now need to find the next value. To do that,
       # we need a new starting point. Assuming that we started with (C[n] ... C[n+depth]) and added
       # (C[n+depth+1]), our new starting point would be (C[n+1] ... C[n+depth+1]).
       # Basically what this means is that we take our original starting point, chop off the first
       # word/letter and then tack on what we just added to get our new starting point. This is what
       # the following lines of code do.
       #

       if($type eq "word")
       {
          @wordarr = split(/\s+/, $chain);
       }

       else
       {
          @wordarr = split(//, $chain);
       }

       $set = "";

       for(my $i = (scalar(@wordarr) - $depth); $i < scalar(@wordarr); $i++)
       {
           $set .= ($wordarr[$i] . $space);
       }

       if($type eq "word")
       {
          $set =~ s/\s+$//;
       }

       $chainlength++;
 }

Enter the text you want to process, below:

Maximum Length: 250 words/letters 500 words/letters 1000 words/letters

Type: Word based Letter based

Order/Depth: 1 2 3 4 5 6 7