For the last few months I've been working on a little project in which I try to generate haikus. At first, the goal was simply to construct a series of words with the correct haiku structure: 5 syllables in the first line, 7 in the second, and 5 in the last. Not content with word-salad haikus, I decided to try to make poems that made a little bit of sense. So I scraped article titles from CNN (using a cool little module called newspaper) and selected chunks of text that had the correct structure. This produced some interesting haikus, but because my code made the selections at random, most of the poems were impossible to read. At that point I decided I needed to model how sentences are constructed, using real texts (from books, or the article titles mentioned above) as my data.
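Something like the following sketch captures that scraping step. The newspaper calls are real, but the vowel-group syllable counter is a crude stand-in for a proper syllable dictionary, and it filters whole titles rather than arbitrary chunks to keep things short:

```python
import re
import newspaper

def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def line_syllables(text):
    return sum(count_syllables(w) for w in text.split())

# Let newspaper do the scraping and pull out article titles.
cnn = newspaper.build('http://cnn.com', memoize_articles=False)
titles = []
for article in cnn.articles[:100]:
    try:
        article.download()
        article.parse()
    except Exception:
        continue
    if article.title:
        titles.append(article.title)

# Titles with 5 or 7 syllables are candidate lines for a 5-7-5 haiku.
five_syllable = [t for t in titles if line_syllables(t) == 5]
seven_syllable = [t for t in titles if line_syllables(t) == 7]
```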

After looking through some text, I realized that the way words appear in a sentence depends on two main factors: the preceding word, and the location in the sentence. 'The' is more likely to appear at the start of a sentence than, say, 'him', and 'flows' is more likely to come after 'water' than 'burns' is. Instead of looking at individual words, however, I thought it would be best to look at categories of words. Parts of speech are a good way of categorizing words, since they let me build a sort of sentence template. If I can create a list consisting of a series of parts of speech, creating a sentence is relatively easy -- you just plug in words that correspond to the part of speech at each location, and a sentence is formed. Analyzing the factors I identified above comes down to answering the following question: "What's the chance of observing part of speech A in position x+1 given that part of speech B is in position x?" The trick is to do this for every combination of parts of speech at every location, up to some predetermined position in the sentence (say 7 spots). In the end, you get a three-dimensional matrix whose slices represent the probabilities of finding each part of speech given the part of speech in the preceding location.

This seems all well and good, but it doesn't work. If you try to construct a 'most probable' sentence using this method, picking the most probable combination in the first slice and then using that result as the starting point in the next slice, you get a bunch of nonsense. I remember getting a list of 5 nouns with a verb or two thrown in. The reason is simple: this method doesn't take into account how having part of speech 'A' in position one affects the probability of part of speech 'B' at some downstream position in the sentence; it only considers the adjacent position. For a while I thought I would try to construct a cumulative probability tree, where you look at the frequency of every possible combination of parts of speech up to some predetermined location, but this ended up being, as you can imagine, computationally expensive. I resolved to try a different method.
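Before moving on, here's roughly what the transition-matrix bookkeeping described above might look like in code. The use of NLTK for tagging, the 7-position cutoff, and the array layout are all illustrative assumptions on my part:

```python
import nltk
import numpy as np

MAX_POS = 7  # only consider the first 7 positions of each sentence

def transition_matrices(sentences):
    """Estimate P(tag at position x+1 | tag at position x) for each position x.

    Assumes the NLTK tokenizer and tagger models have already been downloaded.
    """
    tagged = [nltk.pos_tag(nltk.word_tokenize(s))[:MAX_POS] for s in sentences]
    tags = sorted({tag for sent in tagged for _, tag in sent})
    idx = {tag: i for i, tag in enumerate(tags)}

    counts = np.zeros((MAX_POS - 1, len(tags), len(tags)))
    for sent in tagged:
        for x in range(len(sent) - 1):
            a, b = sent[x][1], sent[x + 1][1]
            counts[x, idx[a], idx[b]] += 1

    # Normalize each slice so that slice x, row A, column B holds
    # P(part of speech B at position x+1 | part of speech A at position x).
    totals = counts.sum(axis=2, keepdims=True)
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return probs, tags
```

Chaining the most probable entry of one slice into the next is exactly the greedy construction that produced the noun pile-ups mentioned above.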

Instead of treating individual words as the base unit of a sentence, I thought it would be simpler to treat common sequences of words as the base unit. I wrote a little piece of code that grabs a sequence of parts of speech from a random sentence in a text and checks whether it occurs at any other point in the text; it's possible to do this for every sequence in the text. The result is a structure containing highly frequent sequences of POS (parts of speech). Below are a few examples of chunks that I've 'filled in' with words from the text (a sketch of the mining and fill-in step follows the examples):

the miserable possession brought the stream
the chief god of an age
the dark country Over a wall
you wanted a beating for justice
they saw a mighty jerk of a distant spot
you drew the good fairy of the young man

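The gist of that mining and fill-in step is sketched below. It again assumes NLTK for tagging, and the sequence length and frequency threshold are arbitrary choices rather than the values my code actually uses:

```python
import random
from collections import Counter, defaultdict
import nltk

SEQ_LEN = 6    # length of the POS sequences to mine
MIN_COUNT = 3  # how many times a sequence must recur to count as 'frequent'

def mine_and_fill(text, n_chunks=5):
    """Find POS sequences that recur in the text, then fill them in with words."""
    tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in nltk.sent_tokenize(text)]

    # Words available for each POS tag, drawn from the same text.
    words_by_tag = defaultdict(list)
    for sent in tagged:
        for word, tag in sent:
            words_by_tag[tag].append(word)

    # Count every POS sequence of length SEQ_LEN across all sentences.
    seq_counts = Counter()
    for sent in tagged:
        tags = [tag for _, tag in sent]
        for i in range(len(tags) - SEQ_LEN + 1):
            seq_counts[tuple(tags[i:i + SEQ_LEN])] += 1

    frequent = [seq for seq, count in seq_counts.items() if count >= MIN_COUNT]

    # Fill a few frequent sequences in with random words of the matching POS.
    chunks = []
    for seq in random.sample(frequent, min(n_chunks, len(frequent))):
        chunks.append(' '.join(random.choice(words_by_tag[tag]) for tag in seq))
    return chunks
```
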
Notice that these sequences don't really make any sense, but they're not word salad either. Notice too that most of these chunks are just that: chunks, not actual sentences. Right now I'm working on a way of knitting chunks together into sentences, to create something a little more coherent. I'm also trying to devise a way of creating associations between individual words -- some words just don't work together, while others fit together nicely. Since I'm working only with POS, my code is blind to these possible word associations. I'm thinking of applying a 3-gram approach, where I look at the frequency of three-word segments in a text. My project is on GitHub, but it's kind of a mess.
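A tiny sketch of that 3-gram counting, where tokenization is just a whitespace split and 'corpus.txt' is a placeholder path:

```python
from collections import Counter

def trigram_counts(text):
    """Count every three-word segment (3-gram) in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

# The counts can then be used to check whether a candidate pair of words
# has ever been followed by a given third word in the source text.
with open('corpus.txt') as f:   # placeholder path
    counts = trigram_counts(f.read())
print(counts.most_common(10))
```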