Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas DataFrame, which is written to an Azure SQL Database table.
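For context, Approach 1 roughly looks like this (a minimal sketch; the connection string, table name, and columns are placeholders, and I'm assuming SQLAlchemy + pyodbc as the driver here, not describing my exact setup):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder Azure SQL connection via SQLAlchemy + pyodbc
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net:1433/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# df is the preprocessed output of the JSON files (toy data here)
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Append the DataFrame to the Azure SQL table
df.to_sql("my_table", engine, if_exists="append", index=False)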

Approach 2: I implemented Delta Lake, where the output pandas DataFrame is converted to a Spark DataFrame and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert the pandas DataFrame to a Spark DataFrame is only milliseconds. But the performance compared to Approach 1 is bad: with Approach 1, I am able to finish in less than half the time required by Approach 2.
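A minimal sketch of Approach 2 (table name, columns, and the partition column are illustrative, not my actual schema):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Preprocessed pandas output (toy data here)
pdf = pd.DataFrame({"id": [1, 2], "value": ["a", "b"], "date": ["2021-01-01", "2021-01-02"]})

# pandas -> Spark conversion takes only milliseconds
sdf = spark.createDataFrame(pdf)

# Append into a partitioned Delta table
(sdf.write
    .format("delta")
    .mode("append")
    .partitionBy("date")
    .saveAsTable("my_delta_table"))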

I tried different optimization techniques like ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none of them really improved the performance.
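Roughly what I tried (run on Databricks; table and column names are illustrative):

# Bin-packing compaction plus Z-ordering on the column I filter by
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")

# Writing with insertInto instead of saveAsTable (table must already exist)
sdf.write.mode("append").insertInto("my_delta_table")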

Please let me know if I have missed any performance tuning methods. If there are none, I am curious why Delta Lake did not perform better than the pandas + database approach. I am also happy to hear about other, better approaches; for example, I came across Dask.

Many thanks in advance for your answers.

Regards, Chaitanya

1 Answer


You don't give enough information to answer your question. What exactly is not performant: the whole process of data ingestion?

Z-ordering doesn't give you an advantage while you are writing the data into the Delta lake; it will more likely slow you down. It gives you an advantage when you read the data afterwards. Z-ordering by, for example, an ID column tries to store rows with the same ID in the same file(s), which enables Spark to use data skipping and avoid reading unnecessary data.
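For illustration, the benefit shows up on reads like this, where file-level min/max statistics let Spark skip files that cannot contain the requested ID (table and column names are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A selective filter on the Z-ordered column lets Spark skip files whose
# statistics exclude the requested value
df = spark.table("my_delta_table").where("id = 42")
df.show()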

Also, how big is your data actually? If we are talking about a few GBs of data in the end, pandas and a traditional database will perform faster.

I can give you an example:

Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing those 4 GB and storing them somewhere, Spark will not necessarily perform faster, as I already mentioned.

But now consider that job running for a year, which gives you roughly 1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.

As a side note, you say you are reading in a bunch of JSON files, converting them to pandas, and then to Delta Lake. If there is no specific reason to do so in Approach 2, I would just use:

spark.read.json("path")

That avoids the whole step of converting from pandas to Spark DataFrames.
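Put together, the idea would be something like the sketch below (paths, partition column, and table name are placeholders, and it only applies if your preprocessing can be expressed as Spark transformations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw JSON files directly into a Spark DataFrame
df = spark.read.json("/mnt/raw/events/*.json")

# ... do the preprocessing with Spark transformations instead of pandas ...

# Write straight into the partitioned Delta table
(df.write
   .format("delta")
   .mode("append")
   .partitionBy("date")
   .saveAsTable("my_delta_table"))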