2
votes

I have a set of words and want to construct a grammatically correct phrase or sentence out of them, if at all possible. What I have tried so far:

  • From a reference text corpus, calculate the average position of each word in a sentence;
  • Using these averages, sort the words in the set and join them with spaces.
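For concreteness, the two steps above could be sketched roughly as follows. This is only an illustration: the toy `corpus`, the function names, and the choice of relative (0 to 1) rather than absolute positions are all assumptions, not details from my actual code.

```python
from collections import defaultdict

def average_positions(sentences):
    """Map each word to its mean relative position (0.0 = first, 1.0 = last)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) < 2:
            continue  # a one-word sentence has no meaningful relative position
        for i, word in enumerate(words):
            totals[word] += i / (len(words) - 1)
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

def order_by_position(bag, positions, default=0.5):
    """Sort the bag by average position; unseen words fall in the middle."""
    return " ".join(sorted(bag, key=lambda w: positions.get(w, default)))

# Toy corpus, purely illustrative.
corpus = [
    "the dog chased the cat",
    "the cat saw a bird",
    "a bird chased the dog",
]
positions = average_positions(corpus)
print(order_by_position(["cat", "the", "chased", "dog"], positions))
```

On this toy corpus the result is "the chased cat dog", which shows the weakness: an average position says nothing about which words may follow each other.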

The problem with this approach is that most of the time it produces bizarre phrases that make no sense. Is there a better way to accomplish this, maybe using NLP techniques (assuming I'm only working with English)?

3
Do you just have a bag of words, or a text to generate more text from? Also, what do you mean by "meaningful" or by "phrases that make no sense"? Take a look at this other question, which touches on generating text from a source/seed text using n-grams with Python's NLTK. (This project deals with this at an academic level.) - arturomp
Do you just want grammatically correct phrases? Is "colorless green ideas sleep furiously" a meaningful sentence? - Kevin
@amp I have bags of words and want to generate a grammatically correct phrase from each bag. Ideally the phrase would use all the words in the bag; each bag holds fewer than 10 words. Thanks for the links, I will take a look. - George
@Kevin Yes, grammatically correct phrases will be enough. "colorless green ideas sleep furiously" would be nice. - George

3 Answers

1
votes

You can use an n-gram model to generate text. Maybe this is of help: http://www.uspleste.usp.br/ivandre/papers/improvedTextGenNgramStat.pdf

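A common approach would be to collect all 3-grams (trigrams) from a corpus and then use their probabilities to generate text one word at a time, choosing each next word according to how often it followed the previous two words.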
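A minimal sketch of that idea in plain Python, assuming whitespace tokenization and a corpus given as a list of sentence strings. The names `train_trigrams` and `generate` and the `<s>`/`</s>` boundary markers are illustrative choices, not taken from the linked paper.

```python
import random
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Count which word follows each pair of preceding words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start/end markers so sentence boundaries are learned too.
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)][c] += 1
    return model

def generate(model, max_words=20, rng=random):
    """Sample words in proportion to trigram counts until the end marker."""
    a, b = "<s>", "<s>"
    out = []
    while len(out) < max_words:
        followers = model.get((a, b))
        if not followers:
            break  # context never seen in training
        words = list(followers)
        weights = [followers[w] for w in words]
        c = rng.choices(words, weights=weights)[0]
        if c == "</s>":
            break
        out.append(c)
        a, b = b, c
    return " ".join(out)

# Toy corpus, purely illustrative.
model = train_trigrams(["the dog chased the cat", "the cat saw a bird"])
print(generate(model))
```

Note that this generates free text from the corpus; restricting the output to the words of a given bag would need an extra filtering or scoring step.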

0
votes

You can look at this example of a Markov chain: http://phpir.com/text-generation
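Since the bags in question hold fewer than 10 words, one variation (a swapped-in idea, not what the linked article does) is to use the chain's transition counts as a scorer and simply try every ordering of the bag. The sketch below assumes a first-order (bigram) chain with add-one smoothing; all names and the toy corpus are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def train_bigrams(sentences):
    """Collect first-order Markov chain transition counts."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, len(vocab)

def log_prob(words, model):
    """Log-probability of a word sequence under the smoothed chain."""
    pair_counts, context_counts, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for a, b in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen transitions from scoring log(0).
        total += math.log((pair_counts[(a, b)] + 1) /
                          (context_counts[a] + vocab_size))
    return total

def best_ordering(bag, model):
    """Brute force over all n! orderings; only sensible for small bags."""
    return " ".join(max(permutations(bag), key=lambda p: log_prob(p, model)))

# Toy corpus, purely illustrative.
corpus = [
    "the dog chased the cat",
    "the cat saw a bird",
    "a bird chased the dog",
]
model = train_bigrams(corpus)
print(best_ordering(["cat", "the", "saw", "bird", "a"], model))
```

On this toy corpus the best-scoring ordering is "the cat saw a bird". With 9 words there are 362,880 orderings to score, which is still feasible; larger bags would need a search heuristic instead of brute force.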

0
votes

If you only have the bag of words, I think you need to:

  1. Look up all the possible part-of-speech tags for each word
  2. Combine the words in grammatically/syntactically valid ways, i.e. in orders whose tag sequences the grammar allows
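The two steps above can be sketched as follows. Both `LEXICON` and `PATTERNS` are hypothetical, hand-made stand-ins: in practice the tags would come from a dictionary or tagger and the valid sequences from a proper grammar rather than a whitelist.

```python
from itertools import permutations, product

# Hypothetical hand-made lexicon: word -> possible part-of-speech tags.
LEXICON = {
    "colorless": {"ADJ"},
    "green": {"ADJ", "NOUN"},
    "ideas": {"NOUN"},
    "sleep": {"VERB", "NOUN"},
    "furiously": {"ADV"},
}

# Hypothetical whitelist of acceptable tag sequences (a stand-in for a grammar).
PATTERNS = {
    ("ADJ", "ADJ", "NOUN", "VERB", "ADV"),
    ("ADJ", "NOUN", "VERB", "ADV"),
    ("NOUN", "VERB", "ADV"),
    ("NOUN", "VERB"),
}

def grammatical_orderings(bag):
    """Return every ordering of the bag whose tags can match a pattern."""
    results = []
    for order in permutations(bag):
        # Step 1: look up all possible tags for each word in this ordering.
        tag_options = [LEXICON.get(word, set()) for word in order]
        # Step 2: keep the ordering if any tag assignment is a valid pattern.
        if any(tags in PATTERNS for tags in product(*tag_options)):
            results.append(" ".join(order))
    return results

bag = ["furiously", "sleep", "ideas", "green", "colorless"]
for phrase in sorted(grammatical_orderings(bag)):
    print(phrase)
```

For this bag the sketch accepts "colorless green ideas sleep furiously" and "green colorless ideas sleep furiously": both are syntactically valid under the whitelist, and neither is guaranteed to mean anything.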

However, this will not necessarily give you meaningful sentences. They will likely be odd, although perhaps less so if your bag of words is very constrained, as seems to be the case.

If you have a corpus (which I missed the first time I read your question), then you should use it along with something like NLTK's generate() function, which uses n-grams to generate text.