Some NLP stuff to do with grammar, tagging, stemming, and word sense disambiguation in Python

Question

Background (TLDR; provided for the sake of completion)

Seeking advice on an optimal solution to an odd requirement. I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.

Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.

I'm working with fragmentary, atomic language. Users input words and sentence fragments, and WordNet is used to find connections between the inputs, and generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.

The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)

Example scenario

Let's assume we have a chunk of a poem, to which users are adding new inputs to. The new results need to be inflected in a grammatically sensible way.

The river bears no empty bottles, sandwich papers,   
Silk handkerchiefs, cardboard boxes, cigarette ends  
Or other testimony of summer nights. The sprites

Let's say now, it needs to print 1 of 4 possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. From there it could use either of these verbs.

If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have' and the sensibly-inflected result will provide better context for future results ...)

I'd like for 'depature' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (it's not "sprites'"), 'The sprites departed' (or other verb conjugations) would.

Seemingly 'The sprites quick' wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a possibility for sensible inflection.

Breaking down the tasks

Tag part of speech, plurality, tense, etc. -- of original inputs. Taking note of this could help to select from the several possibilities (i.e. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, which has an implementation in NLTK. I am not sure how to handle tense detection here.
Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. picked word is a verb, but a few tenses are possible) select, randomly. Regardless I'm going to need to morph/inflect the word.

I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.

Have you considered using an n-gram language model rather than a part of speech tagger? They are often used in NLP systems for machine translation or speech to text when you want to programatically pick fluent text from a set of options, given some previous history. — Rob Neuhaus
@rrenaud n-gram language models are valuable in MT when combined with alignment models (simplest is statistical noisy channel model) to select word order. In this case, I suspect the parse tree from nltk or Stanford POS tagger would be a bit more valuable. Michael, it might be worth it to take a look at some simple word alignment models, which when combined with a decent language model will correctly handle tense/pluralization a decent percentage of the time (say around 80-90%). — Alex Churchill
@eowl Should I move/duplicate the question? I'm not sure what the protocol is ... An NLP stack site is such an exciting idea :) — hangtwenty

cyborg cyborg · Accepted Answer · 2011-12-17T11:08:35

I think that the comment above on n-gram language model fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context of the target word (i.e., you don't have the rest of the sentence available at time of query). On the other hand, language models consider the past (left context) efficiently, especially for windows up to 5 words. The problem with n-grams is that they don't model long distance dependencies (more than n words).

NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.

The steps as I see them: 1. Get a set of words from the users. 2. Create a larger set of all possible inflections of the words. 3. Ask the model which inflected word is most probable.

Some NLP stuff to do with grammar, tagging, stemming, and word sense disambiguation in Python

Background (TLDR; provided for the sake of completion)

Example scenario

Breaking down the tasks

1 Answers