0
votes

I am comparing different ways of loading a stream of JSON files into Data Lake Gen2 as parquet files, but in every scenario I tested the blob storage costs are excessive, projected at thousands of dollars per month due to “hot write operations” (itemised in the blob billing).

Daily load scenario:

  • 150 multiline JSON files, each with 1K messages
  • 5-digit pre-set partition key for vertical partitioning.

The results are similar regardless of the method: Data Factory Mapping Flows, Synapse and Databricks all produce blob operation costs 3-5 times higher than the compute itself. Even when run as a single batch, multiple parquet files are generated per vertical partition key. Over time these would of course need to be compacted to optimise read performance, but it is the immediate write costs that already raise questions about the approach.

Here is the example Databricks code:

file_location = "abfss://files@<storageaccountname>.dfs.core.windows.net/<foldername>/*.json"
df = spark.read.option("multiline", "true").json(file_location)
# repartition returns a new DataFrame, so reassign it for the shuffle to take effect
df = df.repartition('PartitionKey')
df.write.partitionBy('PartitionKey').parquet('abfss://files@<storageaccountname>.dfs.core.windows.net/Results')

The Synapse notebook is almost the same as the above. In the Data Factory Mapping Flow it is a simple transformation from JSON to Parquet with no other steps. I had to use a Mapping Flow because the standard copy activity does not support partitions.

The test case processes all 150 files in one go, but in real life the load will average about 7 files per hour, which makes the solution even more prone to generating small files throughout the day.

How can the blob write costs be reduced in any of these approaches, or what are the alternatives? We have already validated that read performance for the applications is acceptable even if the files are not compacted frequently. The issue is purely write costs.

2
How frequently are you running this? Are you saying that your repartition operation generates 150 files per run? – Jason Horner
Hi @Jason Horner, the partitioning was generating 80k partitions (from the 5-digit partition key) out of 150k total messages, and that was generating the excessive hot write blob operations behind the high transaction costs. After changing the partitioning to year/month/day, the hot write blob operations dropped by a factor of more than 600. Read performance tests were also better even before compaction, simply because there were far fewer files overall. Thank you for the suggestion. – Rich750

2 Answers

0
votes

I think a better solution would be to ingest the data in its native JSON format from the source and then compact it in a separate pipeline only at certain intervals, such as end of day, depending on your requirements.
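A minimal PySpark sketch of such an end-of-day compaction job, assuming the raw JSON lands under a date folder and the compacted output is partitioned by day rather than by the 5-digit key (the paths and the ingest_date column below are hypothetical placeholders, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical placeholder paths: raw JSON landed during the day, and the compacted output.
raw_path = "abfss://files@<storageaccountname>.dfs.core.windows.net/raw/<yyyy>/<MM>/<dd>/*.json"
compacted_path = "abfss://files@<storageaccountname>.dfs.core.windows.net/compacted"

df = spark.read.option("multiline", "true").json(raw_path)

# Tag each row with the processing date so the output is partitioned by day,
# not by a high-cardinality business key.
df = (df
      .withColumn("ingest_date", F.current_date())
      .withColumn("year", F.year("ingest_date"))
      .withColumn("month", F.month("ingest_date"))
      .withColumn("day", F.dayofmonth("ingest_date")))

# coalesce keeps the number of output files, and therefore the number of
# blob write operations, small for the whole day's data.
(df.coalesce(4)
   .write
   .mode("append")
   .partitionBy("year", "month", "day")
   .parquet(compacted_path))

Running this once per day keeps the hourly ingest cheap (plain JSON writes) and concentrates the parquet write operations into a single, small batch.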

0
votes

The high storage operation cost was caused by an excessively high number of partitions. Changing the partitioning strategy to year/month/day resulted in a dramatic reduction in IO operations and costs.
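For reference, a sketch of a write partitioned by date instead of the 5-digit key (EventTimestamp is an assumed column name; substitute whatever timestamp field the JSON messages actually carry):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

file_location = "abfss://files@<storageaccountname>.dfs.core.windows.net/<foldername>/*.json"
df = spark.read.option("multiline", "true").json(file_location)

# Derive coarse date partitions from an assumed timestamp column.
df = (df
      .withColumn("year", F.year("EventTimestamp"))
      .withColumn("month", F.month("EventTimestamp"))
      .withColumn("day", F.dayofmonth("EventTimestamp")))

# A handful of date partitions per run instead of ~80k key values means far
# fewer parquet files and far fewer hot write operations.
(df.repartition("year", "month", "day")
   .write
   .mode("append")
   .partitionBy("year", "month", "day")
   .parquet("abfss://files@<storageaccountname>.dfs.core.windows.net/Results"))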