2 votes

Spark version: 1.5.2

We are implementing streaming for the first time: doing CDC (change data capture) on incoming streams and storing the results in HDFS.

What is working

We started the POC with CDC on a single table, using file input streams. The base (history) table in Hive is 2.5 GB of Snappy-compressed Parquet. We join it against the input DStreams (~10,000 records each) on a 5-minute streaming interval. Since every input DStream has to be joined with the same base data over and over, we cache the base data, and as a result the joins run fast. See the sketch below.
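Here is a minimal sketch of that setup, assuming change records arrive as JSON files; the table name, landing/output paths, and the "id" join key are hypothetical placeholders, and the simple join stands in for the real CDC merge logic:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("cdc-poc")
    val ssc  = new StreamingContext(conf, Seconds(300)) // 5-minute batches
    val sqlContext = new HiveContext(ssc.sparkContext)

    // The base (history) table is read from Hive once and cached, so every
    // batch can reuse it without rereading the Parquet files.
    val base = sqlContext.table("history_db.base_table").cache()

    // Change records land as files; each batch picks up only the new ones.
    val changes = ssc.textFileStream("hdfs:///landing/table1")

    changes.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val incoming = sqlContext.read.json(rdd) // assuming JSON change records
        // Stand-in for the real CDC merge: join incoming changes to history.
        val merged = incoming.join(base, "id")
        merged.write.mode("append").parquet("hdfs:///output/table1")
      }
    }

    ssc.start()
    ssc.awaitTermination()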

This is working with the setup below:

--num-executors 8 --executor-cores 5 --driver-memory 1g --executor-memory 4g
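For reference, a full spark-submit invocation with these settings might look like the following; the master, class name, and jar name are assumptions:

    spark-submit \
      --master yarn-cluster \
      --num-executors 8 \
      --executor-cores 5 \
      --driver-memory 1g \
      --executor-memory 4g \
      --class com.example.CdcStreaming \
      cdc-streaming.jar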

What we need suggestions on

If we have to scale this solution to do CDC on many tables at once in production (different CDC operations on ~100 history tables), we know that caching all of that base data is not a good idea, because of the limited memory available.

Is there a better way of doing the joins in streaming without reading all the base data into memory? Would bucketing the Hive tables help in any way?


1 Answer

1 vote

I think this section of the docs should be helpful.

You can load your base data as an RDD and then join it with your DStream using the transform operation, along the lines of the sketch below.
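A minimal sketch of that pattern, assuming ssc is your StreamingContext, changeStream is a keyed DStream such as DStream[(String, String)], and the table name and "id" key column are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(ssc.sparkContext)

    // Build a pair RDD from the Hive base table; nothing is cached up front,
    // so each batch reads only what the join actually needs.
    def baseRdd() = sqlContext
      .table("history_db.base_table")        // hypothetical table name
      .rdd
      .keyBy(row => row.getAs[String]("id")) // hypothetical key column

    // transform exposes each batch's RDD, so ordinary RDD joins apply here.
    val joined = changeStream.transform { incomingRdd =>
      incomingRdd.join(baseRdd())
    }

    joined.print()

Because transform re-evaluates its body on every batch, the base RDD can also be refreshed periodically instead of being pinned in memory for the lifetime of the application.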