I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script that does this with regular expressions, but new edge cases keep turning up that I have to account for, which rules out regular expressions as a long-term solution. I have been reading up on NLTK, and there seem to be two ways to use it for my problem:
1. Chunk the announcements using RegexpParser expressions. This might be a weak solution if two different fields I want to capture have the same sentence structure.
2. Take n announcements, tokenize them and run them through the POS tagger, manually tag the parts I want to capture using the IOB format, and then use those tagged announcements to train an NER model. This is the method discussed here.
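To make the IOB format in the second option concrete, here is a minimal sketch of what one manually tagged announcement could look like. Everything in it is an assumption for illustration: the sentence is invented, the POS tags are assigned by hand rather than produced by a tagger, the labels `COMPANY`, `AMOUNT` and `DESCRIPTION` are placeholder names, and `to_iob` is a hypothetical helper, not an NLTK function.

```python
# Hypothetical announcement, already tokenized and POS-tagged.
# The tags here are hand-assigned for the sketch; in practice they would
# come from a POS tagger.
tagged = [
    ("Acme", "NNP"), ("Corp.", "NNP"), ("was", "VBD"), ("awarded", "VBN"),
    ("a", "DT"), ("$", "$"), ("12", "CD"), ("million", "CD"),
    ("contract", "NN"), ("for", "IN"), ("engine", "NN"), ("repair", "NN"),
    (".", "."),
]

# Manual annotations as (start_index, end_index_exclusive, label).
# The label names are placeholders for the fields to capture.
spans = [(0, 2, "COMPANY"), (5, 8, "AMOUNT"), (10, 12, "DESCRIPTION")]


def to_iob(tagged_tokens, labeled_spans):
    """Turn token-index spans into (word, pos, iob) triples.

    The first token of a span gets B-<label>, the remaining tokens of the
    span get I-<label>, and every token outside a span gets O.
    """
    iob = ["O"] * len(tagged_tokens)
    for start, end, label in labeled_spans:
        iob[start] = "B-" + label
        for i in range(start + 1, end):
            iob[i] = "I-" + label
    return [(word, pos, tag) for (word, pos), tag in zip(tagged_tokens, iob)]


triples = to_iob(tagged, spans)
for word, pos, tag in triples:
    print(word, pos, tag)
```

Each triple has the `(word, pos, iob)` shape that, as far as I understand, `nltk.chunk.conlltags2tree` accepts, so announcements tagged this way could be converted to trees for training a chunker.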
Before I start manually tagging announcements, I want to gauge:
- whether the second approach above is a reasonable solution
- whether there are existing tagged corpora that might be useful for training my model
- given that accuracy improves with training-set size, how many manually tagged announcements I should start with
Here's an example of how I am building the training set. If there are any apparent flaws, please let me know.