I am working on building out analytics capability for an application that uses MySQL as its data store. We have a microservices-based architecture and also use Kafka. Our use case does not really demand "real-time" analytics yet, but that could be added at a later date.
For my use case, I want to use Tableau as the visualization platform, with the reports embedded directly into the web application.
Given the volumes and the use case, I don't see a need for a Hadoop-based system, but Kafka Connect, Spark and Flink are possibilities.
I intend to build a star-schema reporting database, separate from the main production database, consisting of dimension and fact tables, and let Tableau report off this.
My microservices will push events to the relevant topics using Avro and a schema registry, and a reporting microservice will consume these events and update the star schema tables.
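To make the current plan concrete, the reporting service as I imagine it today would be a plain consumer along these lines (the topic name, registry URL and the JDBC upsert are just placeholders), using Confluent's Avro deserializer:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReportingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "reporting-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Confluent's Avro deserializer fetches the writer schema from the registry
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    GenericRecord event = record.value();
                    // Look up / insert the dimension rows for this event, then
                    // upsert the fact row for its 15-30 minute bucket via JDBC.
                }
            }
        }
    }
}
```

This works, but it means hand-rolling all of the bucketing and aggregation logic myself, which is what prompts the question below.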
Now to my question: what is the best way to transform the events from Kafka into the relevant star schemas? I could write code that examines each event and updates the fact tables, but given that I will probably use 15 to 30 minute windows for the buckets in the fact tables, is there a way to use Flink, Spark or Kafka Streams to do this? What I mean is: do Spark/Flink/Kafka Streams have "built-in" APIs or libraries that make this hybrid streaming/DWH approach work well?
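For example, with Kafka Streams I imagine something roughly like the sketch below, where a 15-minute tumbling window drives the fact-table grain. The topic name, keys and String values are made up; the real values would be Avro records deserialized via the schema registry serde, and the foreach body would do the actual fact-table upsert:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrderFactAggregator {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reporting-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // "order-events" and the String values are placeholders for the Avro topics
        KStream<String, String> orders = builder.stream(
                "order-events", Consumed.with(Serdes.String(), Serdes.String()));

        orders.groupByKey()
              // 15-minute tumbling windows to match the fact-table bucket size
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(15)))
              .count()
              .toStream()
              .foreach((windowedKey, count) -> {
                  // Here I would upsert a fact row keyed by
                  // (dimension key, window start) via JDBC into MySQL.
                  System.out.printf("key=%s windowStart=%s count=%d%n",
                          windowedKey.key(), windowedKey.window().startTime(), count);
              });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

Is this kind of windowed aggregation feeding a relational star schema a reasonable pattern, or is there a more idiomatic way to do it in Streams/Spark/Flink?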
I need to ensure that the basic infrastructure I put in place here can eventually migrate to a more big-data-style analytics infrastructure over time, but for now a simple DWH approach with Kafka is sufficient.
I have looked at Apache Kylin, but it does not seem to fit this use case.
Thanks.