1
votes

I have two (or more) Kafka topics that I need to join. From what I have read on blogs and Stack Overflow, there are two options:

1) Stream both topics, via the ClickHouse Kafka engine or Spark Streaming, into separate tables and then run a join, which is reportedly not recommended in ClickHouse?

2) Build one table with all the columns and use the ClickHouse Kafka engine or Spark Streaming to update the same entry?

Any advice?

2 Answers

1
votes

As always, it really depends on what kind of data you import and how you are going to use it, but I would say that in most cases it is better to import the two topics into a single table (option 2). From there you can quickly filter and aggregate the records. Depending on the queries you want to run, you should define the table with an appropriate ORDER BY key, which will make those queries much faster.
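As a rough sketch of what option 2 could look like: the table and column names below (events_combined, user_id, event_time, source, order_amount, page_url) are hypothetical, since the question does not give a schema. Columns that only one topic fills are made Nullable, and the ORDER BY key is chosen on the assumption that most queries filter by user_id.

```sql
-- Hypothetical single wide table that receives rows from both topics.
CREATE TABLE events_combined
(
    user_id      UInt64,
    event_time   DateTime,
    source       LowCardinality(String),  -- which topic the row came from
    order_amount Nullable(Float64),       -- only set by rows from topic 1
    page_url     Nullable(String)         -- only set by rows from topic 2
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Instead of a JOIN, combine both kinds of rows with an aggregation
-- over the shared key.
SELECT
    user_id,
    sum(order_amount)          AS total_amount,  -- NULLs are skipped
    countIf(source = 'topic2') AS page_views
FROM events_combined
WHERE user_id = 42
GROUP BY user_id;
```

Because user_id leads the sorting key, the filter in the query only has to read a small part of the table.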

If you give more details about the schema of the data you want to join, I can be more specific with the answer.

1
votes

The standard way to get data from Kafka into ClickHouse is to create a 'source' table with Engine=Kafka and a materialized view that copies the data to a final table with the ReplicatedMergeTree engine.

You can create multiple materialized views that write to the same target table, like this:
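For illustration, a filled-in 'source' table might look roughly like the following. The broker address, topic name, consumer group name, message format, and columns are all assumptions made for the example and need to be adapted to your setup.

```sql
-- Hypothetical Kafka 'source' table; all values below are placeholders.
CREATE TABLE kafka_orders_source
(
    user_id    UInt64,
    event_time DateTime,
    amount     Float64
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',            -- assumed broker address
    kafka_topic_list  = 'orders',                    -- assumed topic name
    kafka_group_name  = 'clickhouse_orders_reader',  -- assumed consumer group
    kafka_format      = 'JSONEachRow';               -- assumed message format
```

A table like this does not store anything itself; it only consumes messages, which is why a materialized view is needed to persist the rows.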


CREATE TABLE kafka_topic1 ( ... ) Engine=Kafka ...;

CREATE TABLE kafka_topic2 ( ... ) Engine=Kafka ...;

CREATE TABLE clickhouse_table ( ... ) Engine=MergeTree ...;

CREATE MATERIALIZED VIEW kafka_topic1_reader
  TO clickhouse_table
  AS SELECT * FROM kafka_topic1;

CREATE MATERIALIZED VIEW kafka_topic2_reader
  TO clickhouse_table
  AS SELECT * FROM kafka_topic2;
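If the two topics do not share the same columns, SELECT * is unlikely to line up with the target table. A possible variant of the views above, sketched under the assumption of hypothetical columns (user_id, event_time, amount, url in the sources; source, order_amount, page_url in the target), lists and renames the columns explicitly. As far as I know, target columns that a view does not select are filled with their default values, but verify this on your ClickHouse version.

```sql
-- Variant of kafka_topic1_reader with an explicit column mapping.
-- Column names are hypothetical; adapt them to the real schemas.
CREATE MATERIALIZED VIEW kafka_topic1_reader
  TO clickhouse_table
  AS SELECT
       user_id,
       event_time,
       'topic1' AS source,        -- tag rows with their origin
       amount   AS order_amount   -- rename to the target column
     FROM kafka_topic1;

-- Variant of kafka_topic2_reader writing to the same target table.
CREATE MATERIALIZED VIEW kafka_topic2_reader
  TO clickhouse_table
  AS SELECT
       user_id,
       event_time,
       'topic2' AS source,
       url      AS page_url
     FROM kafka_topic2;
```

Tagging each row with its source topic makes it easy to tell the two streams apart later when filtering or aggregating.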