
I am working on building out analytics capability for an application that uses MySQL as its data store. We have a microservices-based architecture and also use Kafka. Our use case does not really demand "real-time" analytics yet, but that could be added at a later date.

For my use case, I want to use Tableau as the visualization platform, with the reports embedded directly into the web application.

For the volumes and use case, I don't see a need for a Hadoop-based system, but Kafka Connect, Spark and Flink are possibilities.

I intend to build a star-schema reporting database, separate from the main production database, consisting of dimension and fact tables, and let Tableau report off this.

My microservices will push events to the relevant topics using Avro and a schema registry, and a reporting microservice will consume these events and update the star schemas.

Now to my question: what is the best way to transform the events from Kafka into the relevant star schemas? I could write code to examine each event and then update the fact tables, but given that I will probably be using 15- to 30-minute windows for the buckets in my fact tables, is there a way to use Flink, Spark or Kafka Streams to do this? What I mean is, I was hoping there were some built-in APIs or libraries in Spark/Flink that make this hybrid streaming/DWH approach work well.
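To illustrate, here is a rough sketch of the kind of job I was imagining, using PySpark Structured Streaming. The topic, event schema, table names and connection details are all made up, and it reads JSON rather than Avro for brevity:

```python
# Rough sketch only - topic, schema, table and connection details are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reporting-etl").getOrCreate()

# Read order events from Kafka (JSON here for brevity; Avro plus a schema
# registry would need the appropriate deserializer instead of from_json).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "order-events")
          .load())

parsed = events.select(
    F.from_json(F.col("value").cast("string"),
                "order_id STRING, amount DOUBLE, event_time TIMESTAMP").alias("e")
).select("e.*")

# Aggregate events into 15-minute buckets for the fact table.
buckets = (parsed
           .withWatermark("event_time", "30 minutes")
           .groupBy(F.window("event_time", "15 minutes"))
           .agg(F.count("*").alias("order_count"),
                F.sum("amount").alias("total_amount")))

def write_to_mysql(batch_df, batch_id):
    # Write each micro-batch to the MySQL fact table over JDBC.
    # Note: update mode can re-emit a window, so a production job would
    # upsert/merge rather than blindly append.
    (batch_df
     .withColumn("window_start", F.col("window.start"))
     .withColumn("window_end", F.col("window.end"))
     .drop("window")
     .write
     .format("jdbc")
     .option("url", "jdbc:mysql://reporting-db:3306/dwh")
     .option("dbtable", "fact_orders_15min")
     .option("user", "etl")
     .option("password", "***")
     .mode("append")
     .save())

query = (buckets.writeStream
         .outputMode("update")
         .foreachBatch(write_to_mysql)
         .option("checkpointLocation", "/tmp/checkpoints/fact_orders")
         .start())
```

Is this the kind of approach people use, or is there a better-suited API for it?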

I need to ensure that the basic infrastructure I put in place can eventually migrate to a more big-data-style analytics infrastructure over time, but for now a simple DWH approach with Kafka is sufficient.

I have looked at Apache Kylin, but it does not seem to fit this use case.

Thanks.


1 Answer


You could consider using Spark JDBC to push data to your backend database (you mentioned MySQL).

Kafka -> Spark / Spark Streaming -> MySQL
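For the last hop, a minimal sketch of a plain Spark JDBC write might look like this (connection URL, table and credentials are placeholders; the MySQL JDBC driver jar needs to be on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-sink-example").getOrCreate()

# Placeholder for whatever aggregated output Spark produced upstream.
facts = spark.createDataFrame(
    [("2024-01-01 10:00:00", 42, 1234.50)],
    ["window_start", "order_count", "total_amount"],
)

# Push the rows into the MySQL reporting database over JDBC.
(facts.write
 .format("jdbc")
 .option("url", "jdbc:mysql://reporting-db:3306/dwh")
 .option("driver", "com.mysql.cj.jdbc.Driver")
 .option("dbtable", "fact_orders_15min")
 .option("user", "etl")
 .option("password", "***")
 .mode("append")
 .save())
```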

We use Oracle for our backend star schemas.

For the Oracle star schemas we have a staging area in Hadoop (a Spark job creates Parquet tables), then we use Apache Sqoop to push the data to an Oracle staging area. From that staging area we push the data into the Oracle star schemas (data marts), using Apache Airflow to coordinate those build jobs. Airflow kicks off some basic SQL and PL/SQL code to refresh the marts.
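To make the coordination part concrete, a rough Airflow sketch of that kind of pipeline might look like the following (adapted to MySQL rather than Oracle; the commands, paths and schedule are placeholders, not our actual setup):

```python
# Rough sketch of coordinating staging build, Sqoop export and mart refresh.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="refresh_star_schema",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/30 * * * *",  # every 30 minutes
    catchup=False,
) as dag:

    # 1. Spark job builds the staging Parquet tables in Hadoop.
    build_staging = BashOperator(
        task_id="build_staging_parquet",
        bash_command="spark-submit /jobs/build_staging.py",
    )

    # 2. Sqoop exports the staging data into the database staging area.
    sqoop_export = BashOperator(
        task_id="sqoop_export_to_staging",
        bash_command=(
            "sqoop export --connect jdbc:mysql://reporting-db:3306/dwh "
            "--table stg_orders --export-dir /staging/orders"
        ),
    )

    # 3. Plain SQL merges the staging rows into the star schema tables.
    refresh_marts = BashOperator(
        task_id="refresh_data_marts",
        bash_command="mysql dwh < /sql/refresh_fact_orders.sql",
    )

    build_staging >> sqoop_export >> refresh_marts
```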

If you swap Oracle for MySQL above, that could be one way to implement your pipeline.

But as I mentioned above, you could also push directly from Spark to your backend database of choice using the Spark JDBC connector.

We also use Apache Kudu in another project, as it supports streaming updates more natively (with Apache Impala on top; you can connect to Impala from Tableau using its ODBC connector).