10
votes

I have been experimenting with JOliver's Event Store 3.0 as a potential component in a project and have been trying to measure the throughput of events through the Event Store.

I started using a simple harness which essentially iterated through a for loop creating a new stream and committing a very simple event comprising of a GUID id and a string property to a MSSQL2K8 R2 DB. The dispatcher was essentially a no-op.

This approach managed to achieve ~3K operations/second running on an 8 way HP G6 DL380 with the DB on a separate 32 way G7 DL580. The test machines were not resource bound, blocking looks to be the limit in my case.

Has anyone got any experience of measuring the throughput of the Event Store and what sort of figures have been achieved? I was hoping to get at least 1 order of magnitude more throughput in order to make it a viable option.

3

3 Answers

7
votes

I would agree that blocking IO is going to be the biggest bottleneck. One of the issues that I can see with the benchmark is that you're operating against a single stream. How many aggregate roots do you have in your domain with 3K+ events per second? The primary design of the EventStore is for multithreaded operations against multiple aggregates which reduces contention and locks for read-world applications.

Also, what serialization mechanism are you using? JSON.NET? I don't have a Protocol Buffers implementation (yet), but every benchmark shows that PB is significantly faster in terms of performance. It would be interesting to run a profiler against your application to see where the biggest bottlenecks are.

Another thing I noticed was that you're introducing a network hop into the equation which increases latency (and blocking time) against any single stream. If you were writing to a local SQL instance which uses solid state drives, I could see the numbers being much higher as compared to a remote SQL instance running magnetic drives and which have the data and log files on the same platter.

Lastly, did your benchmark application use System.Transactions or did it default to no transactions? (The EventStore is safe without use of System.Transactions or any kind of SQL transaction.)

Now, with all of that being said, I have no doubt that there are areas in the EventStore that could be dramatically optimized with a little bit of attention. As a matter of fact, I'm kicking around a few backward-compatible schema revisions for the 3.1 release to reduce the number writes performed within SQL Server (and RDBMS engines in general) during a single commit operation.

One of the biggest design questions I faced when starting on the 2.x rewrite that serves as the foundation for 3.x is the idea of async, non-blocking IO. We all know that node.js and other non-blocking web servers beat threaded web servers by an order of magnitude. However, the potential for complexity introduced on the caller is increased and is something that must be strongly considered because it is a fundamental shift in the way most programs and libraries operate. If and when we do move to an evented, non-blocking model, it would be more in a 4.x time frame.

Bottom line: publish your benchmarks so that we can see where the bottlenecks are.

6
votes

Excellent question Matt (+1), and I see Mr Oliver himself replied as the answer (+1)!

I wanted to throw in a slightly different approach that I myself am playing with to help with the 3,000 commits-per-second bottleneck you are seeing.

The CQRS Pattern, that most people who use JOliver's EventStore seem to be attempting to follow, allows for a number of "scale out" sub-patterns. The first one people usually queue off is the Event commits themselves, which you are seeing a bottleneck in. "Queue off" meaning offloaded from the actual commits and inserting them into some write-optimized, non-blocking I/O process, or "queue".

My loose interpretation is:

Command broadcast -> Command Handlers -> Event broadcast -> Event Handlers -> Event Store

There are actually two scale-out points here in these patterns: the Command Handlers and Event Handlers. As noted above, most start with scaling out the Event Handler portions, or the Commits in your case to the EventStore library, because this is usually the biggest bottleneck due to the need to persist it somewhere (e.g. Microsoft SQL Server database).

I myself am using a few different providers to test for the best performance to "queue up" these commits. CouchDB and .NET's AppFabric Cache (which has a great GetAndLock() feature). [OT]I really like AppFabric's durable-cache features that lets you create redundant cache servers that backup your regions across multiple machines - therefore, your cache stays alive as long as there is at least 1 server up and running.[/OT]

So, imagine your Event Handlers do not write the commits to the EventStore directly. Instead, you have a handler insert them into a "queue" system, such as Windows Azure Queue, CouchDB, Memcache, AppFabric Cache, etc. The point is to pick a system with little to no blocks to queue up the events, but something that is durable with redundancy built-in (Memcache being my least favorite for redundancy options). You must have that redundancy, in the case that if a server drops, you still have the event queued up.

To finally commit from this "Queued Event", there are several options. I like Windows Azure's Queue pattern for this, because of the many "workers" you can have constantly looking for work in the queue. But it doesn't have to be Windows Azure - I've mimicked Azure's Queue pattern in local code using a "Queue" and "Worker Roles" running in background threads. It scales really nicely.

Say you have 10 workers constantly looking into this "queue" for any User Updated events (I usually write a single worker role per Event type, makes scaling out easier as you get to monitor the stats of each type). Two events get inserted into the queue, the first two workers instantly pick up a message each, and insert them (Commit them) directly into your EventStore at the same time - multithreading, as Jonathan mentioned in his answer. Your bottleneck with that pattern would be whatever database/eventstore backing you select. Say your EventStore is using MSSQL and the bottleneck is still 3,000 RPS. That is fine, because the system is built to 'catch up' when those RPS drops down to, say 50 RPS after a 20,000 burst. This is the natural pattern CQRS allows for: "Eventual Consistency."

I said there was other scale-out patterns native to the CQRS patterns. Another, as I mentioned above, is the Command Handlers (or Command Events). This is one I have done as well, especially if you have a very rich domain domain as one of my clients does (dozens of processor-intensive validation checks on every Command). In that case, I'll actually queue off the Commands themselves, to be processed in the background by some worker roles. This gives you a nice scale out pattern as well, because now your entire backend, including the EvetnStore commits of the Events, can be threaded.

Obviously, the downside to that is that you loose some real-time validation checks. I solve that by usually segmenting validation into two categories when structuring my domain. One is Ajax or real-time "lightweight" validations in the domain (kind of like a Pre-Command check). And the others are hard-failure validation checks, that are only done in the domain but not available for realtime checking. You would then need to code-for-failure in Domain model. Meaning, always code for a way out if something fails, usually in the form of a notification email back to the user that something went wrong. Because the user is no longer blocked by this queued Command, they need to be notified if the command fails.

And your validation checks that need to go to the 'backend' is going to your Query or "read-only" database, riiiight? Don't go into the EventStore to check for, say, a unique Email address. You'd be doing your validation against your highly-available read-only datastore for the Queries of your front end. Heck, have a single CouchDB document be dedicated to only a list of all email addresses in the system as your Query portion of CQRS.

CQRS is just suggestions... If you really need realtime checking of a heavy validation method, then you can build a Query (read-only) store around that, and speed up the validation - on the PreCommand stage, before it gets inserted into the queue. Lots of flexibility. And I would even argue that validating things like empty Usernames and empty Emails is not even a domain concern, but a UI responsiblity (off-loading the need to do real-time validation in the domain). I've architected a few projects where I had very rich UI validation on my MVC/MVVM ViewModels. Of course my Domain had very strict validation, to ensure it is valid before processing. But moving the mediocre input-validation checks, or what I call "light-weight" validation, up into the ViewModel layers gives that near-instant feedback to the end-user, without reaching into my domain. (There are tricks to keep that in sync with your domain as well).

So in summary, possibly look into queuing off those Events before they are committed. This fits nicely with EventStore's multi-threading features as Jonathan mentions in his answer.

0
votes

We built a small boilerplate for massive concurrency using Erlang/Elixir, https://github.com/work-capital/elixir-cqrs-eventsourcing using Eventstore. We still have to optimize db connections, pooling, etc... but the idea of having one process per aggregate with multiple db connections is aligned with your needs.