Background (TL;DR: provided for the sake of completeness)
Seeking advice on an optimal solution to an odd requirement. I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.
Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.
I'm working with fragmentary, atomic language. Users input words and sentence fragments, and WordNet is used to find connections between the inputs, and generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.
The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)
Example scenario
Let's assume we have a chunk of a poem, to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.
The river bears no empty bottles, sandwich papers,
Silk handkerchiefs, cardboard boxes, cigarette ends
Or other testimony of summer nights. The sprites
Let's say now, it needs to print 1 of 4 possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. From there it could use either of these verbs.
If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have' and the sensibly-inflected result will provide better context for future results ...)
I'd like for 'departure' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (it's not "sprites'"), 'The sprites departed' (or other verb conjugations) would.
Seemingly 'The sprites quick' wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a possibility for sensible inflection.
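To make the agreement point concrete, here is a minimal sketch of the kind of filtering I have in mind for a verb that follows a plural subject. The PARADIGMS table and the forms_for_subject helper are hypothetical and hand-written for illustration only; a real version would need a proper morphological resource instead of a hard-coded dictionary.

```python
# Hypothetical, hand-written paradigms keyed by Penn Treebank verb tags.
# A real system would need a morphological lexicon or inflection library.
PARADIGMS = {
    "have": {"VB": "have", "VBP": "have", "VBZ": "has",
             "VBD": "had", "VBG": "having"},
    "depart": {"VB": "depart", "VBP": "depart", "VBZ": "departs",
               "VBD": "departed", "VBG": "departing"},
}


def forms_for_subject(lemma, subject_tag):
    """List forms of `lemma` that agree with a noun subject tagged NN or NNS."""
    paradigm = PARADIGMS[lemma]
    # A plural subject rules out the third-person singular form, and a
    # singular noun subject rules out the plain present form.
    banned = "VBZ" if subject_tag == "NNS" else "VBP"
    forms = [paradigm[tag] for tag in ("VBP", "VBZ", "VBD", "VBG")
             if tag != banned]
    # Modal constructions take the base form and agree with any subject.
    forms += [f"{modal} {paradigm['VB']}" for modal in ("will", "would")]
    return forms


print(forms_for_subject("have", "NNS"))
# ['have', 'had', 'having', 'will have', 'would have']
```

With 'The sprites' tagged as a plural noun, 'has' never appears in the list, which is the behaviour described above.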
Breaking down the tasks
- Tag the part of speech, plurality, tense, etc. of the original inputs. Taking note of this could help to select from the several possibilities (e.g. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, and NLTK has a wrapper for it. I am not sure how to handle tense detection here.
- Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple of words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparing the current text with sequences in tagged corpora (and/or Markov chains?), or consulting grammar-checking functions, could provide a solution for this.
- Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for; I've got my methods for this. Let's say it's randomly selected.
- Transform the selected word as needed. If the information from the first step can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. the picked word is a verb, but a few tenses are possible), select one randomly. Regardless, I'm going to need to morph/inflect the word.
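For the tagging step, one thing I could try is reading tense and number straight off the tags, since the Penn Treebank tagset (which, as far as I understand, the English Stanford models use) already distinguishes them. A rough sketch, where the TAG_FEATURES table and the features helper are my own hypothetical names and the tagged input is hand-written rather than real tagger output:

```python
# Penn Treebank tags mapped to coarse grammatical features.
# Only a handful of noun and verb tags are covered in this sketch.
TAG_FEATURES = {
    "NN": {"pos": "noun", "number": "singular"},
    "NNS": {"pos": "noun", "number": "plural"},
    "VB": {"pos": "verb", "form": "base"},
    "VBD": {"pos": "verb", "form": "past"},
    "VBG": {"pos": "verb", "form": "present participle"},
    "VBN": {"pos": "verb", "form": "past participle"},
    "VBP": {"pos": "verb", "form": "present"},
    "VBZ": {"pos": "verb", "form": "present 3rd singular"},
}


def features(tagged_tokens):
    """Attach coarse features to (word, tag) pairs; unknown tags get {}."""
    return [(word, TAG_FEATURES.get(tag, {})) for word, tag in tagged_tokens]


# Hand-written tags for illustration, not output from an actual tagger.
for word, feats in features([("sprites", "NNS"), ("having", "VBG")]):
    print(word, feats)
```

This would only capture what a single tag encodes; compound tenses such as 'will have' span several tokens and would need more than a per-word lookup.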
I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.
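For concreteness, this is roughly how I picture the context-filtering, selection and transformation steps fitting together. Everything here is a hand-written stand-in: ALLOWED_BIGRAMS would ideally be derived from a tagged corpus, and REALISATIONS (including derived forms such as 'departed' for 'departure') would come from WordNet plus some inflection tool, neither of which is shown. The names viable_forms and next_word are hypothetical.

```python
import random

# Hypothetical tag bigrams treated as acceptable. A real system might
# count these from a tagged corpus instead of listing them by hand.
ALLOWED_BIGRAMS = {("NNS", "VBP"), ("NNS", "VBD"), ("NNS", "RB")}

# Hypothetical realisations of each root word: tag -> surface form.
REALISATIONS = {
    "departure": {"NN": "departure", "VBD": "departed"},
    "to have": {"VBP": "have", "VBZ": "has", "VBD": "had"},
    "blue": {"JJ": "blue"},
    "quick": {"JJ": "quick", "RB": "quickly", "VBP": "quicken"},
}


def viable_forms(prev_tag, root):
    """Keep only the forms of `root` whose tag may follow `prev_tag`."""
    return {tag: form for tag, form in REALISATIONS[root].items()
            if (prev_tag, tag) in ALLOWED_BIGRAMS}


def next_word(prev_tag, roots, rng=random):
    """Pick a root with at least one viable form, then pick one of its forms."""
    viable = {root: viable_forms(prev_tag, root) for root in roots}
    viable = {root: forms for root, forms in viable.items() if forms}
    if not viable:
        return None  # nothing can be inflected sensibly in this context
    root = rng.choice(sorted(viable))
    tag = rng.choice(sorted(viable[root]))
    return root, tag, viable[root][tag]


roots = ["departure", "to have", "blue", "quick"]
print(viable_forms("NNS", "blue"))     # {}
print(viable_forms("NNS", "to have"))  # {'VBP': 'have', 'VBD': 'had'}
print(next_word("NNS", roots))         # random, but never 'blue' or 'has'
```

After 'The sprites' (previous tag NNS), 'blue' drops out because it has no viable form, while 'departure' and 'quick' survive only through their derived forms.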