We have a lot of JSON logs and want to build our Hive data warehouse. It's easy to get the JSON logs into a Spark SchemaRDD, and SchemaRDD has a saveAsTable method, but it only works for SchemaRDDs created from a HiveContext, not from a regular SQLContext. It throws an exception when I try saveAsTable on a SchemaRDD created from a JSON file. Is there a way to force it to 'bind' with a HiveContext and save it into Hive? I don't see any obvious reason this can't be done. I know there are options like saveAsParquetFile for data persistence, but we really want to take advantage of Hive.
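For reference, this is roughly the path that fails for us (a minimal sketch, assuming Spark 1.1; the /logs/events.json path is made up and sc is an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)                    // plain SQLContext, no Hive support
    val events = sqlContext.jsonFile("/logs/events.json")  // schema is inferred from the JSON
    events.saveAsTable("events")                           // throws: persisting a table needs a HiveContext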
2 Answers
So, you do have your data in a SchemaRDD? You can register the JSON RDD in the Hive context using

hc.registerRDDAsTable(rdd, "myjsontable")
"myjsontable" now only exists in the hive context, data is still not saved in there. then you can do something like
hc.sql("CREATE TABLE myhivejsontable AS SELECT * FROM myjsontable")
that will actually create your table in Hive. What format do you actually need to store it in? I'd recommend Parquet, as columnar storage will be more efficient for querying. If you want to store it as JSON you can use a Hive JSON SerDe (I wrote the one here: https://github.com/rcongiu/Hive-JSON-Serde).
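Both choices can be expressed as a CREATE TABLE AS SELECT. A sketch, assuming Hive 0.13+ for the native PARQUET format; the table names and jar path are illustrative:

    // columnar Parquet table (STORED AS PARQUET needs Hive 0.13 or later)
    hc.sql("CREATE TABLE myparquettable STORED AS PARQUET AS SELECT * FROM myjsontable")

    // JSON-backed table via the Hive-JSON-Serde; build the jar from the repo above first
    hc.sql("ADD JAR /path/to/json-serde-jar-with-dependencies.jar")
    hc.sql("CREATE TABLE myjsonbackedtable " +
           "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' " +
           "STORED AS TEXTFILE " +
           "AS SELECT * FROM myjsontable")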
I wrote a short article on creating nested data in Spark and having it loaded into Hive; it's for Parquet, not JSON, but it may help: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/