6
votes

I have a job queue (using Amazon SQS) which hands off jobs to many machines for fetching and processing various documents over HTTP. There are hundreds of different hosts which are accessed, and there is no predictable order for the jobs.

In order to be polite, I don't want my system to hammer repeatedly on a single host. Thus, if I get job #123 to fetch something from example.com, but I see that I have already fetched something else from example.com in the past X seconds, I should move on to something else and save job #123 for later.

The question is, what's a good way to implement this pattern?

It seems the first step would be to have the job runners keep a list somewhere of all domains and the last time something on that domain was accessed. I suppose this could be a simple DB table.
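A minimal sketch of such a table, assuming a single shared database that every job runner can reach. SQLite stands in for that database here, and the table name `domain_access` and the helper `try_claim` are hypothetical names chosen for illustration. The point is that the check and the update happen in one transaction, so two runners cannot both claim the same domain within the politeness window.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) domain_access table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS domain_access ("
        "domain TEXT PRIMARY KEY, last_access REAL NOT NULL)"
    )
    return conn


def try_claim(conn, domain, min_interval, now=None):
    """Record an access and return True if the domain may be hit now.

    Returns False, without recording anything, if the domain was accessed
    less than min_interval seconds ago.
    """
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Known domain: only bump the timestamp if enough time has passed.
        cur = conn.execute(
            "UPDATE domain_access SET last_access = ? "
            "WHERE domain = ? AND ? - last_access >= ?",
            (now, domain, now, min_interval),
        )
        if cur.rowcount == 1:
            return True
        # Unknown domain: the insert succeeds. Known but too recent: ignored.
        cur = conn.execute(
            "INSERT OR IGNORE INTO domain_access (domain, last_access) "
            "VALUES (?, ?)",
            (domain, now),
        )
        return cur.rowcount == 1
```

A runner would call `try_claim(conn, "example.com", X)` before fetching and defer the job whenever it returns `False`.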

There are then many possible options for what to do if a message processor gets a job that must be deferred.

  1. Simply push a copy of the message onto the end of the queue, and throw it away without executing it. Hopefully, by the next time it comes around, enough time will have passed. This may result in a lot of redundant SQS messages, especially if a large cluster of jobs for the same domain goes through at once.

  2. Sleep for however many seconds are necessary until politeness dictates that the job can be executed. This may result in a lot of queue processors simultaneously doing nothing.

  3. Accept the job, but save it in a local queue somewhere on each queue processor. I imagine each processor could "claim" a number of jobs this way, and then elect to process them in whatever order achieves maximum politeness. This can still be unpredictable, because each queue processor needs to be aware of the domains hit by all the others.

  4. Establish separate queues for every domain and have one process dedicated to each queue. Each process would have to pause for X seconds between doing each job, so there's a lot of sleeping process overhead, but maybe this isn't such a bad thing.
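For option 1, one way to cut down on messages cycling through the queue too early is to re-enqueue the copy with a delay. The sketch below assumes a boto3-style SQS client on a standard (non-FIFO) queue and a message dict as returned by `receive_message`; `defer_job` is a hypothetical helper name. SQS caps `DelaySeconds` at 900, so longer waits are clamped and the job may simply be deferred again.

```python
import math

MAX_SQS_DELAY = 900  # SQS allows at most 15 minutes of per-message delay


def defer_job(sqs, queue_url, message, wait_seconds):
    """Re-enqueue a job with a delay, then delete the original message.

    `sqs` is assumed to be a boto3 SQS client (or anything exposing the same
    send_message / delete_message keyword interface).
    Returns the delay, in whole seconds, that was actually requested.
    """
    delay = max(0, min(math.ceil(wait_seconds), MAX_SQS_DELAY))
    # Send the copy first so the job is never lost if the delete fails.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message["Body"],
        DelaySeconds=delay,
    )
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
    return delay
```

Because the copy is sent before the original is deleted, a crash between the two calls can produce a duplicate job, so the job handler should tolerate running the same fetch twice.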

Do you have any experience with designing this sort of thing? What strategy would you recommend?

2
Are you 100% stuck on SQS? There are good designs that do NOT force you into a queue-per-domain solution, but they require direct control of the queue, which I am assuming SQS doesn't provide (to be precise, the ability to "browse" the queue without taking the top element, and the ability to take the Nth element instead of the top; basically, treating the queue as a doubly linked list without insertion rather than a pure queue). – DVK

2 Answers

2
votes

Use a separate queue for each domain, plus a queue of domains.

Each processor should:

  1. Pick a domain from the head of the queue of domains.
  2. If that domain was not accessed recently, pick the top task from that domain's own queue.
  3. Put the domain back at the end of the queue of domains.
  4. If we have a task to execute, do it.
  5. Sleep until it is time to check the head of the queue of domains again, or until that queue is updated.

It may help to organize the queue of domains as a time-priority queue: store the domains ordered by the time at which each may next be accessed.
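The scheme above can be sketched in a single process as follows. This is only an illustration under assumptions: the class name `PoliteScheduler` and its methods are hypothetical, the per-domain queues live in memory rather than in SQS, and the caller passes in the current time. The queue of domains is a heap keyed by the earliest time each domain may be hit again.

```python
import heapq
from collections import defaultdict, deque


class PoliteScheduler:
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.tasks = defaultdict(deque)  # domain -> pending tasks (FIFO)
        self.heap = []                   # (next allowed time, domain)
        self.in_heap = set()             # domains currently in the heap
        self.next_allowed = {}           # domain -> earliest next access

    def add(self, domain, task, now):
        """Queue a task; schedule the domain if it is not already waiting."""
        self.tasks[domain].append(task)
        if domain not in self.in_heap:
            due = max(now, self.next_allowed.get(domain, now))
            heapq.heappush(self.heap, (due, domain))
            self.in_heap.add(domain)

    def seconds_until_ready(self, now):
        """How long to sleep before the head domain is due (None if idle)."""
        if not self.heap:
            return None
        return max(0, self.heap[0][0] - now)

    def next_task(self, now):
        """Return (domain, task) for a domain that is due, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, domain = heapq.heappop(self.heap)
        task = self.tasks[domain].popleft()
        self.next_allowed[domain] = now + self.min_interval
        if self.tasks[domain]:
            # More work for this domain: put it back with its new due time.
            heapq.heappush(self.heap, (self.next_allowed[domain], domain))
        else:
            self.in_heap.discard(domain)
            del self.tasks[domain]
        return domain, task
```

A worker loop would call `next_task(time.time())`, run the task if one is returned, and otherwise sleep for `seconds_until_ready(time.time())`. With several machines, the heap and the per-domain queues would have to live in shared storage.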

0
votes

I would recommend setting up a queue for each domain, and one processor per queue.

Most servers should have no problem with requests issued constantly in series, so long as you keep an eye on total transfer quantity (for example, you should avoid indexing files larger than a few hundred KB unless you have a real need for it).

I assume you're also obeying robots.txt rules.