
The scenario is this:

  • An S3 bucket full of CSV files, each with hundreds of formatted lines.
  • N Mule servers, clustered or not (both options are available).
  • One identical Mule flow installed on every Mule server.
  • The flow's behavior is simple: it polls S3 to lazily fetch the available files, retrieves each file's contents, transforms the CSV lines into SQL statements, and inserts them into the DB.

Problems:

  • The flows on all the different Mule servers successfully poll S3, retrieve the files, process them, and insert into the DB. As a result, files and the records in them are processed several times.

Wish List:

  • The load is balanced across all active servers.
  • The flows installed on the different Mule servers stay identical (we don't want to modify each flow to fetch different files).
  • Files, and the records inside them, are not processed twice.

Failed Approach:

  • We tried a processed/not-processed mechanism shared by all Mule servers, in clustered mode. We used Mule 3.5's Object Store to keep a list of the files that have already been processed, visible to all servers. The problem is that this doesn't balance anything: all the workload lands on one server while the rest sit idle almost all the time.

Questions:

  • Which architecture would be best if we want load balancing?
  • Do we perhaps need a dedicated Mule app that downloads the S3 files and divides the workload evenly among the Mule servers?

Here is a diagram of the scenario: [architecture diagram omitted]


1 Answer


Configure your S3 bucket to push events to an SQS queue (see here), and have your Mule servers pull events from that queue instead of polling S3. This way, each event will be pulled by exactly one worker.
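As a sketch of that wiring, assuming boto3 and hypothetical bucket/queue names (the queue's access policy must also allow S3 to send messages, which is not shown here):

```python
# Sketch: subscribe an SQS queue to a bucket's ObjectCreated events.
# The bucket name and queue ARN used below are hypothetical placeholders.
def notification_config(queue_arn):
    """Build the S3 notification configuration for new-object events."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],  # one event per uploaded CSV
            }
        ]
    }

def wire_bucket_to_queue(bucket, queue_arn):
    # boto3 imported here so the config builder stays testable without AWS.
    import boto3
    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=notification_config(queue_arn),
    )

# Example call (placeholders):
# wire_bucket_to_queue("csv-input-bucket",
#                      "arn:aws:sqs:us-east-1:123456789012:csv-events")
```

Note that the queue policy must grant `s3.amazonaws.com` permission to `sqs:SendMessage` on the queue, or the notifications will never arrive.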

It works as follows: in each worker, you repeatedly call ReceiveMessage() to get the next message in the queue. Once a worker receives a message, that message becomes invisible to the other workers for a certain amount of time (which you can control with setVisibilityTimeout()). After a worker finishes processing a message, it calls deleteMessage() to remove it from the queue for good. If the worker fails, deleteMessage() is never called, so once the visibility timeout expires another worker will pick the message up.
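The loop above can be sketched in Python with boto3; the queue URL and the `handle_file()` callback (which would do the CSV-to-SQL work) are assumptions, not part of the question:

```python
import json

def poll_queue(sqs, queue_url, handle_file):
    """Pull one batch of S3 event messages from SQS and process them.

    `sqs` is a boto3 SQS client (or any object with the same methods).
    Returns the number of messages successfully processed.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,     # long polling: fewer empty responses
        VisibilityTimeout=300,  # hide each message from other workers for 5 min
    )
    processed = 0
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        # An S3 event notification carries one or more "Records",
        # each naming the bucket and object key that triggered it.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            handle_file(bucket, key)  # download, transform CSV, insert into DB
        # Delete only after successful processing; on failure the message
        # becomes visible again after the timeout and another worker retries.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        processed += 1
    return processed
```

Each Mule server would run this loop independently; because a received message is invisible to the others until deleted or timed out, no file is handed to two workers at once.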

In other words, the Queue in SQS doesn't deal with distributing the work. The workers pull messages from the queue when they are ready, and this is what creates the load balancing.