Summary of the problem:
I have a perticular usecase to write >10gb data per day to HDFS via spark streaming. We are currently in the design phase. We want to write the data to HDFS (constraint) using spark streaming. The data is columnar. We have 2 options(so far):
Naturally, I would like to use hive context to feed data to HDFS. The schema is defined and the data is feeded in batches or row wise.
There is another option. We can directly write data to HDFS thanks to spark streaming API. We are also considering this because we can query data from HDFS through hive then in this usecase. This will leave options open to use other technologies in future for the new usecases that may come.
What is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.
Thanks.
So far I have not found a discussion on the topic, my research may be short. If there is any article that you can suggest, I would be most happy to read it.