I have a kafka topic and a Hive Metastore. I want to join the incomming events from the kafka topic with records of the metastore. I saw the possibility with Flink to use a catalog to query Hive Metastore. So I see two ways to handle this:
- using the DataStream api to consume the kafka topic and query the Hive Catalog one way or another in a processFunction or something similar
- using the Table-Api, I would create a table from the kafka topic and join it with the Hive Catalog
My biggest concerns are storage related. In both cases, what is stored in memory and what is not ? Does the Hive catalog stores anything on the Flink's cluster side ? In the second case, how the table is handle ? Does flink create a copy ?
Which solution seems the best ? (maybe both or neither are good choices)