I am comparing different ways of loading a stream of JSON files into Data Lake Gen 2 as Parquet files, but in each tested scenario the blob storage costs are excessive, projected at thousands of $ per month due to “hot write operations” (itemised in the blob billing).
Daily load scenario:
- 150 multiline JSON files, each with 1K messages
- 5-digit pre-set partition key for vertical partitioning.
The results are similar regardless of the method (Data Factory Mapping Flows, Synapse, Databricks): blob operations cost 3-5 times more than the compute itself. Even when run as a single batch, multiple Parquet files are generated per vertical partition key. Over time these would of course need to be compacted to optimise read performance, but it is already the immediate write costs that call the approach into question.
Here is the example Databricks code:
# Read all multiline JSON files from the landing folder
file_location = "abfss://files@<storageaccountname>.dfs.core.windows.net/<foldername>/*.json"
df = spark.read.option("multiline", "true").json(file_location)

# Repartition by the partition key (result must be assigned), then write partitioned Parquet
df = df.repartition("PartitionKey")
df.write.partitionBy("PartitionKey").parquet("abfss://files@<storageaccountname>.dfs.core.windows.net/Results")
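
To illustrate the “multiple Parquet files per partition key” observation, the files per partition directory can be counted after a batch run. This is a minimal sketch, assuming a Databricks notebook (dbutils available) and that the Results path matches the write above:

# Sketch: count Parquet files per PartitionKey directory after the batch write,
# to see how many small files each partition actually receives.
from collections import Counter

results_path = "abfss://files@<storageaccountname>.dfs.core.windows.net/Results"

file_counts = Counter()
for entry in dbutils.fs.ls(results_path):
    if entry.name.startswith("PartitionKey="):
        parquet_files = [f for f in dbutils.fs.ls(entry.path) if f.name.endswith(".parquet")]
        file_counts[entry.name.rstrip("/")] = len(parquet_files)

print(f"Partitions written: {len(file_counts)}, total files: {sum(file_counts.values())}")
print(file_counts.most_common(10))  # partitions with the most files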
The Synapse notebook is almost the same as above. In Data Factory Mapping Flow it is a simple JSON-to-Parquet transformation with no other steps; I had to use a Mapping Flow because the standard Copy activity does not support partitioned output.
The test case processes the 150 files in one go, but in real life the load will average about 7 files per hour, which makes the solution even more prone to generating small files throughout the day.
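
A rough back-of-envelope for the incremental pattern is sketched below. The figures other than the number of hourly batches are assumptions for illustration only, since each Parquet file write involves several billable blob operations (block puts plus a commit), not just one:

# Back-of-envelope sketch of file/operation growth for hourly micro-batches.
batches_per_day = 24              # hourly loads of ~7 JSON files each
keys_touched_per_batch = 200      # hypothetical: distinct PartitionKey values per hourly batch
files_per_key_per_batch = 1       # best case: one Parquet file per key per batch
ops_per_file = 10                 # assumed blob write operations per Parquet file (blocks + commit)

files_per_day = batches_per_day * keys_touched_per_batch * files_per_key_per_batch
ops_per_day = files_per_day * ops_per_file
print(f"~{files_per_day} new Parquet files/day -> ~{ops_per_day} billable write operations/day")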
How can the blob write costs be reduced in any of these approaches, or what are the alternatives? We have already validated that read performance for the applications is acceptable even if the files are not compacted frequently; the issue is purely write costs.