6
votes

For the better part of the last year my company has been slicing up a monolith and building new products on the principles of (micro)service architecture. This is all fine: it gives us great flexibility, keeps UI and backend logic separate, and reduces the number of dependencies.

BUT!

There is an important part of our business that has a growing headache as a result of this, namely reporting.

Since we make sure there is no data replication (or business logic sharing) between services, each service knows only its own data, and if another service really needs to keep a reference to that data, it does so through IDs (entity linking, essentially). While this is great otherwise, it's not great for reporting.

Our business often needs to create ad-hoc reports about specific events involving our customers. In the 'old days' you wrote a simple SQL query that joined a couple of database tables and fetched whatever you needed, but that is no longer possible with decoupled services, and the business sees this as a problem.

I am personally not a fan of replicating data for reporting purposes in the back end, as that has a tendency to grow into a nightmare (which it already is, even in our legacy monoliths). So this problem is not really about legacy monoliths versus modern microservices, but about data dependencies in general.

Have you faced issues like this and if yes, then how did you solve it?

EDIT:

We have been discussing a few potential solutions in-house, but none of them is actually good, and I have not yet found an answer that solves the issue at scale.

  1. Good old replicate-everything-and-let-the-BI-people-figure-it-out, which is still used to this day. Back in the monolith days the BI/data-warehouse team made duplicates of all databases; the same practice is more inconvenient now, but is still applied to every microservice that uses a database. This is bad for various reasons and comes with the shared-sandbox cancer you can expect.

  2. Build a separate microservice (or a set of them) dedicated to producing specific reports. Each one connects to the microservices that carry the relevant data and builds the report as expected. However, this introduces tighter coupling and can be incredibly complicated and slow with large datasets.

  3. Build a separate microservice (or a set of them) whose databases are replicated from the other services' databases in the background. This is problematic because the teams' databases become coupled, data is directly replicated, and there is a strong dependency on the particular database technology being used.

  4. Have each service publish an event to RabbitMQ that BI services pick up on, fetching additional data if needed. This sounds by far the best to me, but also by far the most complex to implement, as all services need to start publishing the relevant data. It is what I would personally choose at present — from a very abstract level, that is.
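A minimal sketch of what option 4 could look like, with an in-memory bus standing in for RabbitMQ and a hypothetical `customer.logged_in` event (all names and event shapes here are illustrative assumptions, not an actual design):

```python
from collections import defaultdict

# In-memory stand-in for a RabbitMQ exchange; in production each service
# would publish to the broker and the BI service would consume from a queue.
class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

# BI-side consumer: builds its own reporting store from published events,
# so it never reaches into the other services' databases.
class ReportingStore:
    def __init__(self, bus):
        self.last_login = {}  # customer_id -> last login timestamp
        bus.subscribe("customer.logged_in", self.on_login)

    def on_login(self, event):
        self.last_login[event["customer_id"]] = event["at"]

    def customers_not_seen_since(self, cutoff):
        return [cid for cid, at in self.last_login.items() if at < cutoff]

bus = EventBus()
store = ReportingStore(bus)
bus.publish("customer.logged_in", {"customer_id": "a", "at": 10})
bus.publish("customer.logged_in", {"customer_id": "b", "at": 40})
print(store.customers_not_seen_since(30))  # ['a']
```

The point of the sketch is the direction of the dependency: services only emit events, and the reporting side owns the derived state it needs for queries like "customers not seen in a month".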

keeping UI and backend logic separate - this is not the reason you do SOA. – tom redfern
could you provide more info about your reporting needs? concrete examples please. – FuzzyAmi
It is incredibly ad-hoc. For example, the business finds that it needs information about all the customers who have not logged in within a month. There are ways to do that (such as keeping the login time in the customers service), but that is only the tip of the iceberg. Suddenly there is a requirement to get the customers who haven't logged in within a month but who were active users before that, meaning that data is required from multiple services. – kingmaple

2 Answers

2
votes

The solution is to aggregate data from the different services into a central reporting database. This is feasible if the collected data is versioned by time, i.e. you can go to the reporting database and get point-in-time data that is correct (for that time).

Getting the data into that database can be done via events published by the various services, periodic imports, "log" aggregation, or a combination of these.

I call this pattern aggregated reporting.

Note that in addition to that, you still need to get data from individual services for things that have to be up to date, as an aggregation solution has an inherent delay (reduced freshness).
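The time-versioned part can be sketched as an append-only store where each fact carries the time it became true, so reports can query the state "as of" any point. This is only an illustration of the idea; the key names and API are made up:

```python
import bisect
from collections import defaultdict

# Append-only, time-versioned reporting store: every change is recorded
# with its timestamp, so point-in-time queries stay correct forever.
class VersionedStore:
    def __init__(self):
        self.times = defaultdict(list)   # key -> sorted timestamps
        self.values = defaultdict(list)  # key -> values, parallel to times

    def record(self, key, ts, value):
        i = bisect.bisect_right(self.times[key], ts)
        self.times[key].insert(i, ts)
        self.values[key].insert(i, value)

    def as_of(self, key, ts):
        # Latest value recorded at or before ts, or None if nothing yet.
        i = bisect.bisect_right(self.times[key], ts)
        return self.values[key][i - 1] if i else None

store = VersionedStore()
store.record("customer:42:status", 100, "trial")
store.record("customer:42:status", 200, "paying")
print(store.as_of("customer:42:status", 150))  # trial
print(store.as_of("customer:42:status", 250))  # paying
```

Because nothing is overwritten, last month's report re-run today produces the same numbers, which is exactly what versioning by time buys you.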

Edit: Considering the edits and the comments you've made (ad-hoc queries), I'd say you need to treat this as a journey. That is, you want to get to option 4, so start by pulling data from the sources you have to answer your current ad-hoc needs, then convert to messages as you move forward with development and add more sources.

Also, you may want to think about the difference between services (which don't share internal data structures and have strict boundaries) and aspects (semi-independent parts of a service that can share the same data source).

PS: I also wrote the InfoQ piece on BI & SOA that Tom mentioned in the comments, which essentially talks about this idea. That article is from 2007, i.e. I've successfully applied this for more than a decade now (different technologies, moving from schema-on-write to schema-on-read, etc., but the same principles).

2
votes

So, I'm not sure this answers your needs, but I'll describe our overall approach to BI:

  1. Everything in our system generates an event: actions in the backend, actions in the mobile apps; everything we want to track produces an event with the relevant data (IDs, timestamp, name, etc.).
  2. All the events are sent to a common funnel for collection: a backend app that receives events, makes sure they're valid, and stores them.
  3. You can store the events in some NoSQL storage (like Elasticsearch) or in a cloud data warehouse (like Google's BigQuery).
  4. Once they're in, it's just a matter of querying and cross-referencing to get the overall picture you want. That's what our BI people do: they build a picture from the heaps of events we collect.
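The funnel described in the steps above can be sketched roughly like this; the required fields, event names, and query helper are assumptions for illustration, not the actual system:

```python
# Minimal sketch of the event funnel: validate incoming events, store them,
# then cross-reference event types at query time for ad-hoc reporting.
REQUIRED_FIELDS = {"name", "time", "ids"}

class Collector:
    def __init__(self):
        self.events = []

    def ingest(self, event):
        # Step 2: the funnel rejects malformed events before storing them.
        if not REQUIRED_FIELDS <= event.keys():
            raise ValueError(f"invalid event: {event}")
        self.events.append(event)

    def query(self, name, since=0):
        # Step 4: pull one event type, optionally bounded in time.
        return [e for e in self.events
                if e["name"] == name and e["time"] >= since]

c = Collector()
c.ingest({"name": "login", "time": 5, "ids": {"customer": "a"}})
c.ingest({"name": "purchase", "time": 7, "ids": {"customer": "a"}})

# Cross-reference: customers who purchased but have not logged in since t=6.
buyers = {e["ids"]["customer"] for e in c.query("purchase")}
recent = {e["ids"]["customer"] for e in c.query("login", since=6)}
print(sorted(buyers - recent))  # ['a']
```

In a real deployment the `query` step would be an Elasticsearch or BigQuery query over the stored events rather than an in-memory list comprehension, but the shape of the work (filter one event stream, join it against another on shared IDs) is the same.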