
I have some questions about architecting enterprise applications using Azure cloud services.

Back Story

We have a system made up of about a dozen WCF Windows Services on a SQL backend. We currently have about 10 clients but expect that to grow to potentially a hundred, with perhaps a hundredfold increase in the throughput demands on the system. The current system is poorly engineered and is simply not capable of scaling, so now appears to be the appropriate juncture to re-engineer it on the Azure platform.

Process Flow

Let me briefly describe a simplified set of the services and the process flow, and then ask some questions about using Azure cloud services to build the new system.

Service A is logged on to an external system and downloads data continuously.

Service B is logged on to a second external system and downloads data continuously.

There can only ever be one logged in instance each of services A and B.

Both A and B hand off their data to Service C which reconciles the data from the two external sources.

Validated and reconciled data is then passed from C to Service D which performs some accounting functions and then passes the resulting data to Services E and F.

Service E is continually logged in to an external system and uploads data to it.

Service F generates reports and publishes them to clients via FTP etc.

The system is actually far more complex than this but the above illustrates the processes involved. The system runs 24 hours a day 6 days a week. Queues will be used to buffer messaging between all the services.

We could just build this system using Azure persistent VMs and utilise the service bus, queues etc., but that would tie us in to a vertical scaling strategy. How could we utilise cloud services to implement it, given the following questions?

Questions

  1. Given that Services A, B and E are permanently logged in to external systems, there can only ever be one active instance of each. If we implement these as single-instance worker roles, there is the issue of downtime during patching (which is unacceptable). If we created two instances of each, is there a standard way to implement active-passive load balancing with worker roles on Azure, or would we have to build our own load balancer? Is there another solution to this problem that I haven’t thought of?

  2. Services C and D are good candidates to scale using multiple worker role instances. However, each instance would have to process related data. For example, we could have 4 instances, each processing data for 5 individual clients. How can we get messages to be processed in groups (client-centric) by each instance? Also, how would we redistribute load from one instance to the remaining instances when patching takes place? For example, if instance 1, which processes data for 5 clients, goes down for OS patching, the data for its clients would then have to be processed by the remaining instances until it came back up again. Similarly, how could we redistribute the load if we decide to spin up additional worker roles?

Any insights or suggestions you are able to offer would be greatly appreciated.

Mat


2 Answers

0
votes

Question #1: You will have to implement your own load balancing. This shouldn't be terribly complex, as you could use the Blob storage lease functionality as a mutex: one instance holds a lease on some blob in storage while it keeps the connection to your external system active. Every X period of time it renews the lease, provided the connection is still active and successful. Every other worker in the role keeps checking whether that lease has expired. If it ever expires, the next worker jumps in, acquires the lease, and then opens the connection to the external source.
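To make the lease idea concrete, here is a minimal sketch in Python. It does not call the Azure SDK: `LeaseStore` is a hypothetical in-memory stand-in for a blob lease, `Worker` and `tick` are illustrative names, and the 60-second duration is an assumption. Time is passed in explicitly so the failover behaviour is easy to follow.

```python
class LeaseStore:
    """Hypothetical in-memory stand-in for a blob lease (not the Azure SDK)."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, owner, now):
        # The lease can be taken only if it is free or has expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = owner
            self.expires_at = now + self.duration
            return True
        return False

    def renew(self, owner, now):
        # Only the current holder may renew, and only before expiry.
        if self.holder == owner and now < self.expires_at:
            self.expires_at = now + self.duration
            return True
        return False


class Worker:
    """One role instance; at most one worker is active at a time."""

    def __init__(self, name, lease):
        self.name = name
        self.lease = lease
        self.active = False

    def tick(self, now, connection_ok=True):
        if self.active:
            # Renew only while the external connection is healthy.
            self.active = connection_ok and self.lease.renew(self.name, now)
        elif self.lease.try_acquire(self.name, now):
            # A real worker would open the external connection here.
            self.active = True
        return self.active


lease = LeaseStore(duration=60)
a, b = Worker("a", lease), Worker("b", lease)
a.tick(0)                        # "a" acquires the lease and becomes active
b.tick(1)                        # "b" stays passive
a.tick(30)                       # "a" renews
a.tick(80, connection_ok=False)  # "a" loses its connection and stops renewing
b.tick(95)                       # lease has expired, "b" takes over
```

Note that the passive worker can only take over once the lease has run out, so the lease duration sets a lower bound on failover time.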

Question #2: Look into Azure Service Bus. Its message sessions feature lets a single consumer process a group of related messages. More info here: http://geekswithblogs.net/asmith/archive/2012/04/02/149176.aspx Both storage queues and Service Bus queues (in peek-lock mode) work such that if a message gets picked up but is not processed within a configurable amount of time, it goes back onto the queue so that the next available instance can pick it up and process it.
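As a rough illustration of how session-style grouping keeps each client's messages on one instance and lets another instance take over, here is a small in-memory sketch. `SessionQueue` and its methods are hypothetical names, not the Service Bus API, and using the client id as the session id is an assumption.

```python
from collections import defaultdict, deque


class SessionQueue:
    """Hypothetical in-memory model of a session-aware queue."""

    def __init__(self):
        self.messages = defaultdict(deque)  # session id -> pending messages
        self.locks = {}                     # session id -> consumer holding it

    def send(self, session_id, body):
        self.messages[session_id].append(body)

    def accept_session(self, consumer):
        # Hand out the first session that has work and is not locked.
        for session_id, pending in self.messages.items():
            if pending and session_id not in self.locks:
                self.locks[session_id] = consumer
                return session_id
        return None

    def receive(self, consumer, session_id):
        # Only the consumer holding the session lock may read from it.
        if self.locks.get(session_id) != consumer:
            raise PermissionError("session is not locked by this consumer")
        pending = self.messages[session_id]
        return pending.popleft() if pending else None

    def release(self, consumer):
        # Called when an instance shuts down (for example, for patching).
        for session_id in [s for s, c in self.locks.items() if c == consumer]:
            del self.locks[session_id]


q = SessionQueue()
q.send("client-1", "a1")
q.send("client-2", "b1")
q.send("client-1", "a2")

s1 = q.accept_session("worker-1")   # worker-1 gets client-1
s2 = q.accept_session("worker-2")   # worker-2 gets client-2
first = q.receive("worker-1", s1)   # "a1"
q.release("worker-1")               # worker-1 goes down
s3 = q.accept_session("worker-2")   # worker-2 picks up client-1
second = q.receive("worker-2", s3)  # "a2", order within the client preserved
```

In this sketch a newly started instance needs no special handling: it simply accepts whichever sessions are unlocked at the time.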

You can use something like AzureWatch to monitor the depth of your queues (storage or Service Bus) and auto-scale the number of instances in your C and D roles to match. It can also monitor instance statuses for roles A, B and E to make sure there is always a healthy instance, and auto-scale if the number of ready instances drops to 0.
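The queue-depth rule can be sketched as a small function. This is not how AzureWatch is configured; `desired_instances`, the per-instance capacity, and the minimum and maximum bounds are all assumptions chosen for illustration.

```python
import math


def desired_instances(queue_depth, per_instance_capacity,
                      min_instances=2, max_instances=10):
    """Instance count needed to drain the queue, clamped to fixed bounds.

    per_instance_capacity is an assumed number of messages one instance
    can work through per scaling interval.
    """
    needed = math.ceil(queue_depth / per_instance_capacity)
    return max(min_instances, min(max_instances, needed))


idle = desired_instances(0, 100)      # stays at the minimum
busy = desired_instances(450, 100)    # rounds up to cover the backlog
spike = desired_instances(5000, 100)  # capped at the maximum
```

Keeping the minimum at two or more means one instance can be patched while another keeps draining the queue.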

HTH

0
votes

First, back up a step. One of the first things I do when looking at application architecture on Windows Azure is to qualify whether or not the app is a good candidate for migration to Windows Azure. I particularly look at how much integration is in the application — integration is always more difficult than expected, doubly so when doing it in the cloud. If most of your workload needs to be done through a single, always-on connection, then you are going to struggle to get the availability and scalability that we turn to the cloud for.

Without knowing the detail of your application, but by way of example, assume services A & B are feeds from a financial data provider. Providers of data feeds are really good at what they do, have high availability, and provide 'enterprise grade' service (whatever that means) for enterprise grade costs. Their architectures are also old-school and, in some cases, very rigid. So first off, consider asking your feed provider (that gives you a login/connection and expects you to pull data) to push data to you via a web service. Exposed web services are the solution to scaling and performance, and are used by everything from table storage on Azure to high-throughput database services like DynamoDB. (I'll challenge any enterprise data provider to explain how a service like Amazon S3 is mickey-mouse.) If your data supplier pushed data to a web service via an agreed API, you could perform all sorts of scaling and availability on the service for a low engineering cost.

Your alternative is, as you are discovering, to build a whole lot of stuff to make sure that your architecture fits in with the single-node model of your data supplier. While it can be done, you are going to spend a lot of engineering cash on hand-rolling a whole bunch of distributed computing principles. If you are going to have an active-passive architecture, you need to implement a leader election algorithm in order to determine when a passive node should become active. This is not as trivial as it sounds as an active node may look like it has disappeared, but is still processing — and you don't want to slot another one in its place. So then you will implement a heartbeat, or even a separate 'witness' node that does nothing other than keep an eye on which nodes are alive in order to do something about them. You mention that downtime and patching is unacceptable. So what is acceptable? A few minutes or a few seconds, or less than a second? Do you want the passive node to take over from where the other left off, or start again?
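For a sense of what the heartbeat and 'witness' idea involves, here is a deliberately simplified sketch. `Witness`, the timeout value, and the rule of promoting the first live node by name are assumptions for illustration; this is not a full leader election algorithm.

```python
class Witness:
    """Hypothetical witness that tracks heartbeats and names the active node."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # node name -> time of last heartbeat
        self.current = None   # node currently considered active

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def alive(self, now):
        # A node counts as alive if it reported within the timeout window.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def leader(self, now):
        live = self.alive(now)
        # Keep the current leader while it is alive; otherwise promote another.
        if self.current not in live:
            self.current = live[0] if live else None
        return self.current


w = Witness(timeout=10)
w.heartbeat("a", 0)
w.heartbeat("b", 0)
initial = w.leader(5)     # "a" is chosen first
w.heartbeat("b", 12)      # only "b" keeps reporting
failover = w.leader(15)   # "a" has timed out, "b" is promoted
w.heartbeat("a", 16)      # "a" comes back
sticky = w.leader(17)     # "b" stays active, no flapping back to "a"
```

The sketch also shows the weakness described above: a node that is merely slow to send heartbeats is treated as dead even though it may still be processing, so a real implementation needs some way to stop the old node before the new one connects.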

You will probably find that the cost of building and hosting a highly available physical server is lower than the development cost to implement all of this. Perhaps you can separate the loads and run the data feed services in a co-lo on a physical box, and have the heavy lifting of the processing done on Windows Azure. I wouldn't even look at Azure VMs, because although they don't recycle as much as roles, they are subject to occasional problems, at least more than enterprise-grade hardware. Start off with discussions with your supplier of the data feeds; they may have a solution, or one that can be cobbled together (e.g. two logins for the price of one, where the 'second' account/instance mostly throws away its data).

Be very careful of traditional enterprise integration. Enterprise providers ask for things that seem odd in today's cloud-oriented world. I've had a request that my calling service have a fixed IP address, for example. You may find that the money spent on code to work around someone else's architecture would be better spent buying physical servers. Push back on the data providers; it is time that they got out of the 90s.

[Disclaimer] 'Enterprises', particularly those that are in financial services, keep saying that their requirements are special: higher throughput, higher security, heavier regulation and higher availability. With the exception of a very few cases (e.g. high frequency trading), I tend to call 'bull' on most of this. They are influenced by large IT budgets and vendors of expensive kit taking them to fancy lunches, and are indoctrinated into their server-hugging beliefs. My individual view on the enterprise hardware/software/services business has influenced this answer. Your mileage may vary.