Flink Table and Hive Catalog storage

Question

I have a kafka topic and a Hive Metastore. I want to join the incomming events from the kafka topic with records of the metastore. I saw the possibility with Flink to use a catalog to query Hive Metastore. So I see two ways to handle this:

using the DataStream api to consume the kafka topic and query the Hive Catalog one way or another in a processFunction or something similar
using the Table-Api, I would create a table from the kafka topic and join it with the Hive Catalog

My biggest concerns are storage related. In both cases, what is stored in memory and what is not ? Does the Hive catalog stores anything on the Flink's cluster side ? In the second case, how the table is handle ? Does flink create a copy ?

Which solution seems the best ? (maybe both or neither are good choices)

liliwei liliwei · Accepted Answer · 2021-02-22T17:25:51

Different methods are suitable for different scenarios, sometimes depending on whether your hive table is a static table or a dynamic table.

If your hive is only a dimension table, you can try this chapter.

joins-in-continuous-queries

It will automatically associate the latest partition of hive, and it is suitable for scenarios where dimension data is slowly updated.

But you need to note that this feature is not supported by the Legacy planner.

Flink Table and Hive Catalog storage

1 Answers