1 vote

Summary of the problem:

I have a particular use case: writing >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar. We have two options so far:

Naturally, I would like to use a Hive context to feed the data to HDFS. The schema is defined, and the data is fed in batches or row by row.
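For concreteness, here is a minimal sketch of how I picture option 1, using Spark Structured Streaming with Hive support. The Kafka source, column names, checkpoint path, and the `default.events` table (assumed to already exist with a matching schema) are all placeholders of mine, not settled parts of the design:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-to-hive")
      .enableHiveSupport()                 // connect to the Hive metastore
      .getOrCreate()

    // Hypothetical source; swap in whatever actually feeds the stream.
    val stream: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id",
                  "CAST(value AS STRING) AS payload",
                  "timestamp AS event_ts")

    // Hive owns the table layout on HDFS; each micro-batch is appended.
    def writeBatch(batch: DataFrame, batchId: Long): Unit =
      batch.write.mode("append").insertInto("default.events")

    stream.writeStream
      .foreachBatch(writeBatch _)
      .option("checkpointLocation", "hdfs:///checkpoints/events_hive")
      .start()
      .awaitTermination()
  }
}
```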

There is another option: we can write the data directly to HDFS through the Spark Streaming API. We are also considering this because we can still query the data from HDFS through Hive in this use case, and it leaves our options open to use other technologies in the future for new use cases that may come up.
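And a comparable sketch of option 2, writing Parquet files straight to an HDFS directory and exposing them to Hive through an external table; again the paths, column names, and the `events_ext` table name are my own placeholders:

```scala
import org.apache.spark.sql.SparkSession

object StreamToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stream-to-hdfs")
      .enableHiveSupport()                 // only needed for the DDL below
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id",
                  "CAST(value AS STRING) AS payload",
                  "timestamp AS event_ts")

    // Spark writes the Parquet files itself; Hive is not involved at write time.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events_hdfs")
      .start()

    // Hive (or Presto/Impala/Spark SQL) reads the same files later through an
    // external table pointed at that directory.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS default.events_ext (
        |  id STRING, payload STRING, event_ts TIMESTAMP
        |)
        |STORED AS PARQUET
        |LOCATION 'hdfs:///data/events'""".stripMargin)

    spark.streams.awaitAnyTermination()
  }
}
```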

Which is better?

Spark Streaming -> Hive -> HDFS -> Consumed by Hive.

VS

Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.

Thanks.

So far I have not found a discussion on this topic; my research may have fallen short. If there is any article you can suggest, I would be very happy to read it.


2 Answers

1 vote

I have a particular use case to write >10 GB of data per day, and the data is columnar

That means you are storing day-wise data. If that is the case, Hive can use the date as a partition column, so that you can query the data for each day easily. You can query the raw data from BI tools like Looker or Presto, or any other BI tool. If you are querying from Spark, you can use Hive features/properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query the data through the Hive metastore.
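A rough sketch of that day-wise layout, assuming a Structured Streaming DataFrame `events` with an `event_ts` timestamp column; the `dt` partition column, paths, and table name are illustrative only:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}

object DailyPartitionSketch {
  // `events` is a streaming DataFrame with columns id, payload, event_ts.
  def writeDayWise(spark: SparkSession, events: DataFrame): Unit = {
    events
      .withColumn("dt", to_date(col("event_ts")))   // one partition per day
      .writeStream
      .format("parquet")
      .partitionBy("dt")
      .option("path", "hdfs:///data/events_by_day")
      .option("checkpointLocation", "hdfs:///checkpoints/events_by_day")
      .start()

    // Partitioned external table; Hive/Presto/Impala see a new day once the
    // partition is registered in the metastore.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS default.events_by_day (
        |  id STRING, payload STRING, event_ts TIMESTAMP
        |)
        |PARTITIONED BY (dt DATE)
        |STORED AS PARQUET
        |LOCATION 'hdfs:///data/events_by_day'""".stripMargin)
    spark.sql("MSCK REPAIR TABLE default.events_by_day")
  }
}
```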

If your data is columnar, consider Parquet or ORC.

Regarding option 2: if Hive is available as an option, there is no need to feed the data into HDFS yourself and then create an external Hive table over it to access it.

Conclusion: I feel both are the same, but Hive is preferred, considering you can query the raw data directly with BI tools or Spark. From HDFS we can also query the data using Spark; if it is stored in formats like JSON, Parquet, or XML, there won't be any added advantage to option 2.

0 votes

It depends on your final use cases. Please consider the two scenarios below when making the decision:

If you have a real-time/near-real-time (RT/NRT) case and all your data is a full refresh, then I would suggest going with the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since there is one less layer in it.

If your data is incremental and also has multiple update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark does not let you update or delete data in HDFS. In that case, both of your approaches will be difficult to implement. Either you can go with a Hive managed table and do updates/deletes using HQL (ACID operations on transactional tables, e.g. as shipped in the Hortonworks Hive distribution), or you can go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily. From a programming perspective, it will also be easier compared to both of your approaches. If you dump the data into NoSQL, you can still put Hive over it for normal SQL or reporting purposes.
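To make the NoSQL route concrete, here is a rough sketch of upserting each micro-batch into HBase from foreachBatch, assuming an existing HBase table `events` with a column family `d`; the row key, column names, and table name are hypothetical, and the HBase client jars must be on the executor classpath:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, Row}

object HBaseUpsertSketch {
  // Call from: stream.writeStream.foreachBatch(upsertBatch _).start()
  def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
    batch.rdd.foreachPartition { rows: Iterator[Row] =>
      // One HBase connection per partition, reused for all rows in it.
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("events"))
      try {
        rows.foreach { row =>
          // Put is an upsert: the newest cell version wins on read.
          val put = new Put(Bytes.toBytes(row.getAs[String]("id")))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                        Bytes.toBytes(row.getAs[String]("payload")))
          table.put(put)
        }
      } finally {
        table.close()
        conn.close()
      }
    }
  }
}
```

Deletes work the same way with `org.apache.hadoop.hbase.client.Delete`, which is what makes this easier than rewriting Parquet files on HDFS.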

There are so many tools and approaches available; go with the one that fits all your cases. :)