2
votes

Situation

Users can upload documents; each upload places a message containing the document's ID on a queue. The Worker Role picks up the message, fetches the document, and indexes it completely with Lucene. Once indexing is complete, the Lucene IndexSearcher on the Web Role should be updated.

On the Web Role I keep a static Lucene IndexSearcher, because otherwise I would have to create a new IndexSearcher on every search request, which adds a lot of overhead.

What I want to do is send a notification from the Worker Role to the Web Role telling it to update its IndexSearcher.

Possible Solutions

  • Create a notification queue. The Web Role starts a long-running task that keeps checking this queue. When it finds a message, it updates the IndexSearcher.
  • Host a WCF service on the Worker Role and connect to it from the Web Role. The Worker Role then uses a callback through the service to tell the Web Role that it needs to update its IndexSearcher.
  • Just update the IndexSearcher at a regular interval.

Which of these would be the best solution, or is there another approach I should consider?

Many thanks!

2 Answers

2
votes

If your worker roles write each finished job's details to a table, using a PartitionKey such as (DateTime.MaxValue - DateTime.UtcNow).Ticks.ToString("d19"), the table stays sorted with the most recently processed jobs first. Have your web role poll the table like so:

// Newer jobs have smaller reverse-tick keys, so "< 0" selects jobs finished after LastIndexTime.
var q = ctx.CreateQuery<LatestJobs>("jobstable")
    .Where(j => j.PartitionKey.CompareTo(LastIndexTime.GetReverseTicks()) < 0)
    .Take(1)
    .AsTableServiceQuery();

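// Count() is generally not translated by the table service LINQ provider,
// so materialize the (at most one) row and check the list instead.
if (q.ToList().Count > 0)
{
    // New jobs exist since the last check: re-index.
}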

This works well for the worker roles that do the indexing, because they can write to the table independently without worrying about conflicts. As a bonus, you get an audit log of the jobs they process (assuming you store some details in each row).
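The query above calls GetReverseTicks(), which is not a framework method. A minimal sketch of what such a helper might look like follows; the class name is hypothetical, and it assumes the DateTime passed in is already in UTC.

```csharp
using System;

public static class ReverseTickExtensions
{
    // Hypothetical helper for the GetReverseTicks() call used in the query.
    // Assumes 'time' is already UTC. Later timestamps produce smaller values,
    // and the fixed 19-digit padding makes string order match numeric order,
    // so the newest job sorts first in the table.
    public static string GetReverseTicks(this DateTime time)
    {
        long reversed = DateTime.MaxValue.Ticks - time.Ticks;
        return reversed.ToString("d19");
    }
}
```

A worker would use DateTime.UtcNow.GetReverseTicks() as the key when it writes a finished job, and the web role would pass its LastIndexTime through the same helper when it polls.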

However, one problem remains. It sounds like you have one web role instance that updates the index. That instance can of course poll this table at whatever frequency you choose (just track LastIndexTime so you can use it in the next query). The issue is how to control concurrency if you have more than one web role instance. Does each instance maintain its own index, or is a single index stored somewhere for all of them? Sorry if that should be obvious; I am not a Lucene expert.

Anyhow, if you have multiple instances in your Web Role and a single index that all of them can see, you need to prevent several instances from updating the index over and over. You can do this by taking a lease on the index (if it is stored in blob storage).

Update based on comment:

If each Web Role instance has its own index, you don't have to worry about leasing. Leasing is only needed when the instances share a single blob resource. So this technique should work fine as-is. The only potential obstacle is that the polling intervals of the instances could be slightly out of sync, so search results may differ depending on which instance you hit until all of them have updated. If you poll the table every 30 seconds, that is also the maximum time the instances can be out of sync. Each instance simply needs to track the last time it updated and query incrementally from that point.
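The per-instance polling described above could be sketched roughly as follows. This is only an illustration: IndexRefreshPoller and its delegates are hypothetical names, the hasJobsSince delegate stands in for the table query, and reopenSearcher stands in for whatever code swaps in a fresh IndexSearcher.

```csharp
using System;
using System.Threading;

// Hypothetical per-instance poller: each web role instance owns one of these.
public sealed class IndexRefreshPoller : IDisposable
{
    private readonly Func<DateTime, bool> _hasJobsSince; // stand-in for the table query
    private readonly Action _reopenSearcher;             // stand-in for reopening the IndexSearcher
    private readonly object _gate = new object();
    private Timer _timer;

    public DateTime LastIndexTime { get; private set; }

    public IndexRefreshPoller(Func<DateTime, bool> hasJobsSince, Action reopenSearcher)
    {
        _hasJobsSince = hasJobsSince;
        _reopenSearcher = reopenSearcher;
        LastIndexTime = DateTime.UtcNow;
    }

    // Starts polling, for example every 30 seconds.
    public void Start(TimeSpan interval)
    {
        _timer = new Timer(_ => PollOnce(), null, interval, interval);
    }

    // One polling pass. Public so it can also be triggered manually.
    public void PollOnce()
    {
        // Skip this tick if a previous pass is still running on another thread.
        if (!Monitor.TryEnter(_gate)) return;
        try
        {
            // Capture the time before querying, so a job finished during the
            // query is picked up again next time rather than missed.
            DateTime checkedAt = DateTime.UtcNow;
            if (_hasJobsSince(LastIndexTime))
            {
                _reopenSearcher();
                LastIndexTime = checkedAt;
            }
        }
        finally
        {
            Monitor.Exit(_gate);
        }
    }

    public void Dispose()
    {
        if (_timer != null) _timer.Dispose();
    }
}
```

Because every instance keeps its own LastIndexTime, no coordination between instances is needed; the cost is that two instances may serve slightly different results for up to one polling interval.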

1
votes

Depending on upload frequency, you may find that queue messages cause unneeded updates. For instance, if you get a dozen uploads and process them close together in time, you end up with a dozen queue messages, each telling your web role to update. It would make more sense to keep a single signal (maybe a table row or a SQL Azure row). You could simply set a row value to 1 to signal that an update is needed. When your web role detects this change, it resets the value to 0 and starts the update. Note: if you use an Azure Table row, you need to poll for updates, and depending on traffic you could accumulate a large number of storage transactions. You could use the AppFabric Cache for this signal as well.
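To illustrate why a single signal avoids the burst problem, here is a small in-memory sketch. UpdateSignal is a hypothetical stand-in for the table row, SQL Azure row, or cache entry; a real implementation would read and write that shared store instead, and this version assumes a single consumer.

```csharp
using System.Threading;

// In-memory stand-in for the single signal row (hypothetical class).
public sealed class UpdateSignal
{
    private int _flag; // 0 = index is current, 1 = update needed

    // Worker side: setting an already-set flag changes nothing,
    // so a dozen uploads still leave exactly one pending signal.
    public void Raise()
    {
        Interlocked.Exchange(ref _flag, 1);
    }

    // Web role side: atomically reads and resets the flag.
    // Returns true if an update was pending.
    public bool TryConsume()
    {
        return Interlocked.Exchange(ref _flag, 0) == 1;
    }
}
```

Resetting the flag before starting the update, rather than after, means a signal raised while the update is running is not lost; it is simply picked up on the next check.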

You could use a WCF service on an internal endpoint on your Web Role. However, you still have the burst issue: if you get, say, a dozen uploads while the Web Role is updating, you don't want to then do another dozen updates.
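One way to handle that burst, whichever transport delivers the notifications, is to coalesce them on the Web Role. The sketch below is an assumption about how that might be done, not part of any Azure or WCF API: CoalescingUpdater is a hypothetical class, and the update delegate stands in for the code that refreshes the IndexSearcher.

```csharp
using System;

// Hypothetical helper: folds notifications that arrive during an update
// into a single follow-up update.
public sealed class CoalescingUpdater
{
    private readonly Action _update;
    private readonly object _gate = new object();
    private bool _running;
    private bool _pending;

    public CoalescingUpdater(Action update)
    {
        _update = update;
    }

    // Call this once per incoming notification (for example from the service operation).
    public void Notify()
    {
        lock (_gate)
        {
            if (_running)
            {
                _pending = true; // remember that one more run is needed
                return;
            }
            _running = true;
        }

        while (true)
        {
            try
            {
                _update(); // runs outside the lock so other callers are not blocked
            }
            catch
            {
                lock (_gate) { _running = false; }
                throw;
            }

            lock (_gate)
            {
                if (!_pending)
                {
                    _running = false;
                    return;
                }
                _pending = false; // one follow-up run covers all notifications so far
            }
        }
    }
}
```

With this in place, a dozen notifications arriving during one update result in exactly one additional update afterwards.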