4 votes

I am using SparkSQL in Python. I have created a partitioned table (a few hundred partitions) and stored it as a Hive internal table using the hiveContext. The Hive warehouse is located in S3.

When I simply run df = hiveContext.table("mytable"), it takes over a minute to go through all the partitions the first time. I thought the metastore stored all the metadata. Why does Spark still need to go through each partition? Is it possible to avoid this step so my startup can be faster?
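For reference, here is a minimal sketch of the flow described above. Only the table name and the hiveContext.table call come from the question; the column names, S3 source path, and write options are assumptions for illustration:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partitioned-table-example")
hiveContext = HiveContext(sc)

# Hypothetical setup: write a DataFrame as a partitioned Hive internal table.
# "event_date" is an assumed partition column; the real table has a few
# hundred partitions under the S3-backed Hive warehouse.
source_df = hiveContext.read.parquet("s3://my-bucket/raw-events/")  # assumed source
source_df.write.partitionBy("event_date").saveAsTable("mytable")

# The slow step from the question: the first call spends over a minute
# discovering partition directories before any data is actually collected.
df = hiveContext.table("mytable")
df.printSchema()
```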

1
Well, if your datastore is on a remote machine and you are trying to access the data remotely from your cluster, it will need time to be copied to your cluster so you can use it! - eliasah
But I meant just the line df = hiveContext.table("mytable"). This is not collecting any data yet; it just gives a DataFrame with schema information. The schema information should have been stored in the metastore already. - ChromeHearts
and? Why did you give that comment? - eliasah
Sorry, accidentally submitted the comment. Please refresh. - ChromeHearts
How will it get the schema information if the data are not loaded on the cluster in the first place? - eliasah

1 Answer

1 vote

The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore, whereas for Hive partitioned tables the partition information has to be stored in the metastore. How the table was created dictates how this behaves. From the information provided, it sounds like you created a SparkSQL table.
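One way to check which kind of table you ended up with is to look at how it is registered in the metastore. This is only a rough diagnostic sketch; the exact output depends on your Spark and Hive versions:

```python
# Inspect how "mytable" is registered in the metastore. For a table written
# through the DataFrame API, the storage/provider details typically identify
# it as a Spark SQL data source table rather than a plain Hive partitioned
# table. Exact fields vary by Spark and Hive version.
for row in hiveContext.sql("DESCRIBE FORMATTED mytable").collect():
    print(row)
```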

SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.
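For contrast, here is a hedged sketch of the Hive-style path mentioned above, where partitions are registered in the metastore explicitly. The table and column names, file format, and S3 location are assumptions, and the tradeoff is that each new partition has to be added by hand (or via a repair) as data arrives:

```python
# Alternative sketch: define the table through HiveQL so partition metadata
# lives in the metastore instead of being discovered on S3 at query time.
hiveContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mytable_hive (
        user_id STRING,
        value   DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/warehouse/mytable_hive/'
""")

# Partitions then have to be registered explicitly, either one at a time...
hiveContext.sql(
    "ALTER TABLE mytable_hive ADD IF NOT EXISTS "
    "PARTITION (event_date='2016-01-01')"
)
# ...or by scanning the table location once:
# hiveContext.sql("MSCK REPAIR TABLE mytable_hive")

# Reads can then take the partition list from the metastore rather than
# listing every directory on S3.
df = hiveContext.table("mytable_hive")
```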