We are doing a proof of concept (POC) with ksqlDB and have a few questions.
We have a Kafka topic named USERPROFILE,
which holds around 100 million unique records and has a 10-day retention policy. This topic continuously receives INSERT/UPDATE-type events in real time from its underlying RDBMS table.
The records in this topic have the following simple structure:
{"userid":1001,"firstname":"Hemant","lastname":"Garg","countrycode":"IND","rating":3.7}
1.) We have created a stream over this topic:
CREATE STREAM userprofile_stream (userid INT, firstname VARCHAR, lastname VARCHAR, countrycode VARCHAR, rating DOUBLE) WITH (VALUE_FORMAT = 'JSON', KAFKA_TOPIC = 'USERPROFILE');
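For illustration, an update event for an existing userid can be simulated by inserting into the stream (the values here are hypothetical; in ksqlDB, INSERT INTO a stream simply appends another record to the backing topic):

```sql
-- Hypothetical update for userid 1001: same key, new rating.
-- This appends a new message to the USERPROFILE topic; it does not
-- overwrite the earlier record in the topic itself.
INSERT INTO userprofile_stream (userid, firstname, lastname, countrycode, rating)
  VALUES (1001, 'Hemant', 'Garg', 'IND', 4.2);
```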
2.) Because there can be updates for a given userid and we want only the latest record per userid, we have also created a table over the same topic:
CREATE TABLE userprofile_table (userid INT PRIMARY KEY, firstname VARCHAR, lastname VARCHAR, countrycode VARCHAR, rating DOUBLE) WITH (KAFKA_TOPIC = 'USERPROFILE', VALUE_FORMAT = 'JSON');
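For context, this is roughly how we read the collapsed, latest-per-key view of the data (a push query sketch; userid 1001 is just an example key):

```sql
-- Push query: emits the current row for userid 1001 and any later updates.
-- The table retains only the latest value per primary key, so earlier
-- versions of the row are not returned.
SELECT userid, firstname, lastname, countrycode, rating
FROM userprofile_table
WHERE userid = 1001
EMIT CHANGES;
```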
Our questions are:
Does creating the table take extra space on disk? For example, if the Kafka topic has 100 million records, will the same records also be materialized for the table, or is the table just a virtual view over the underlying Kafka topic?
The same question for the stream: does creating the stream take extra disk space on the broker servers? For example, if the Kafka topic has 100 million records, will the same records also be stored for the stream, or is the stream just a virtual view over the underlying Kafka topic?
Say we receive a record with userid 1001 on 1st May. On 11th May, that record will no longer be available in the Kafka topic, but will it still be present in the stream / table? Is there a retention policy for streams and tables, as there is for the topic itself?
Answers would be highly appreciated.
-- Best, Aditya