5
votes

I'm writing an RSS reader in Python as a learning exercise, and I would really like to be able to tag individual entries with keywords for searching. Unfortunately, most real-world feeds don't include keyword metadata. I currently have about 60,000 entries in my test database from about 600 feeds, so manual tagging is not going to be practical. So far I have only been able to find two solutions:

1: Use the Natural Language Toolkit (NLTK) to extract keywords:

  • Pros: flexible; no dependencies on external services;
  • Cons: can only index the article summary, not the full article; non-trivial, since writing a high-quality keyword extraction tool is a project in itself;

2: Use the Google AdWords API to fetch keyword suggestions from the article URL:

  • Pros: very high-quality keywords; based on the entire article text; easy to use;
  • Cons: not free(?); query rate limits unknown; I'm terrified of getting my account banned and not being able to run AdWords campaigns for my commercial sites;
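To give a sense of what option 1 involves, here is a minimal sketch of keyword extraction by TF-IDF scoring over entry summaries. It uses only the standard library; the function names, the tiny stopword list, and the scoring details are all assumptions made for illustration, and NLTK's tokenizers and stopword corpus would be natural replacements for the hand-rolled parts.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to with".split()
)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens longer than two
    characters that are not stopwords."""
    return [
        word
        for word in re.findall(r"[a-z]+", text.lower())
        if len(word) > 2 and word not in STOPWORDS
    ]


def extract_keywords(summaries, top_n=3):
    """Return, for each summary, its top_n terms ranked by TF-IDF."""
    tokenized = [tokenize(summary) for summary in summaries]
    # Document frequency: in how many summaries does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results
```

Terms that appear in many summaries are pushed down the ranking, so with tens of thousands of entries the scores should favour words that are distinctive for a given entry. This is only a baseline; stemming, phrase detection, and a proper stopword list would all be needed for good results.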

Can anyone offer any suggestions? Are my fears about getting my AdWords account banned unfounded?

2
Just to follow up: I ended up using python-calais, which is a little stale (last updated in 2009) but has worked flawlessly so far. It has a convenience function that takes a URL as an argument and returns a Calais response parsed into a Python dict. I have been very impressed with the accuracy and relevance of the metadata provided, especially considering the cost (free). – Parker Ault

2 Answers

2
votes

There are a number of free and commercial text annotation tools/services you might consider, depending on your specific needs, listed under:

Is there a better tool than OpenCalais?

A number of these provide entities, some provide a measure of keyword relevance, and others provide topic tags.

1
votes