Improving parsing of unstructured text

Question

I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.

I wrote a script using regular expressions to do this but over time contingencies arise that I have to account for which bar the regexp method from being a long term solution. I have been reading up on NLTK and it seems there are two ways to go about using NLTK to solve my problem:

chunk the announcements using RegexpParser expressions - this might be a weak solution if two different fields I want to capture have the same sentence structure.
take n announcements, tokenize and run the n announcements through the pos tagger, manually tag the parts of the announcements I want to capture using the IOB format and then use those tagged announcements to train an NER model. A method discussed here

Before I go about manually tagging announcements I want to gauge

that 2 is a reasonable solution
if there are existing tagged corpus that might be useful to train my model
knowing that accuracy improves with training data size, how many manually tagged announcements I should start with.

Here's an example of how I am building the training set. If there are any apparent flaws please let me know.

polm23 polm23 · Accepted Answer · 2017-07-15T06:53:53

Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.

Spacy has a default English NER model that can recognize organizations; it may or may not work for you but it's worth a shot.

What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.

For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.

How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.

Hope that helps!

Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].

I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.

Given the regularity of language a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank Tags; VP means "verb phrase", PP means "prepositional phrase", IN means "preposition.)

Improving parsing of unstructured text

1 Answers