2 votes

I am using Spark 1.6.1 and I am trying to save a DataFrame in ORC format.

The problem I am facing is that the save method is very slow: it takes about 6 minutes to write a 50M ORC file on each executor. This is how I am saving the DataFrame:

dt.write.format("orc").mode("append").partitionBy("dt").save(path)

I tried using saveAsTable to a Hive table that also uses the ORC format, and that seems to be about 20% to 50% faster, but this method has its own problems - it seems that when a task fails, retries will always fail because the file already exists. This is how I am saving the DataFrame:

dt.write.format("orc").mode("append").partitionBy("dt").saveAsTable(tableName)

Is there a reason the save method is so slow? Am I doing something wrong?

6 minutes is not that slow for writing 50M files. Sounds like a lot of files! How big is each one? How many executors? If it's one file per row then that's way too many. If they are sized appropriately for your storage system and for the number of nodes/executors used in a typical query, then maybe 50M files is fine, but I doubt it. If each of those 50M files is 1 GB then that's ~47 petabytes, so I doubt that. If each is 1 MB then it's ~47 terabytes, and I'd suggest the file size is too small to efficiently query the table. What's the total data volume? - Davos
It is actually a 50 MB file. - user1960555
Like, it's just one 50 MB file? If it's just one small file then there's not much point partitioning it. It's possible that your dt field has way too high cardinality and it ends up creating a partition for each row. E.g. if it's a timestamp/datetime like "2017-01-01 14:52:22" then the partitioning will happen for each second, which would then write an ORC file for each partition. 50 MB might be a small file but it could be a lot of rows with different timestamps. E.g. if each row is ~8 KB, then that's ~6400 rows, which is a lot of file I/O. - Davos
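
If cardinality is the issue, one option (a minimal Scala sketch, assuming dt in the question is a DataFrame with a timestamp column also named dt; dt_day is an illustrative name) is to partition on a coarser derived column instead:

import org.apache.spark.sql.functions.to_date

// Truncate the timestamp to a date so the partition column has day-level
// cardinality rather than one value per second.
val byDay = dt.withColumn("dt_day", to_date(dt("dt")))
byDay.write.format("orc").mode("append").partitionBy("dt_day").save(path)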

2 Answers

1 vote

The problem is due to the partitionBy method. partitionBy reads the values of the specified column and then segregates the data for every value of the partition column. Try saving without partitionBy; there should be a significant performance difference.
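
A minimal sketch of that (same DataFrame and path as in the question):

dt.write.format("orc").mode("append").save(path)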

1 vote

See my previous comments above regarding cardinality and partitionBy.

If you really want to partition it, and it's just one 50MB file, then use something like

dt.write.format("orc").mode("append").repartition(4).saveAsTable(tableName)

repartition will create 4 roughly even partitions, rather than partitioning on the dt column, which could end up writing a lot of ORC files.

The choice of 4 partitions is a bit arbitrary. You're not going to get much performance/parallelizing benefit from partitioning tiny files like that. The overhead of reading more files is not worth it.
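
If you want to sanity-check how many files a write will produce, the DataFrame's partition count is a rough guide, since each partition is written as at least one file (a minimal sketch; dt is the DataFrame from the question):

// ~4 partitions, so roughly 4 output files when written without partitionBy
println(dt.repartition(4).rdd.partitions.length)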