We have JSON files totaling ~100 GB in Azure Data Lake Store. We need to convert them to CSV files and save them to a different folder in the same Azure Data Lake Store. What options are available?
2 Answers
You have a couple of choices for this. It is typically a simple two-step process: extract and output.
A. You can run an Azure Data Lake Analytics (ADLA) U-SQL job to do this. Here is an example of a JSON extractor in U-SQL: https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats (a minimal U-SQL sketch is shown after this list).
B. Another choice is to create an HDInsight cluster to transform the data, using whichever application you prefer. Here is an example of someone doing this in Pig: https://acadgild.com/blog/converting-json-into-csv-using-pig/
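For option A, a minimal U-SQL sketch might look like the following. It assumes the sample formats assembly (Microsoft.Analytics.Samples.Formats) and Newtonsoft.Json have already been registered in your ADLA database, and the field names (id, name, value) and paths are placeholders for your actual schema and folders:

```
// Reference the sample JSON formats assembly and Newtonsoft.Json
// (both must be registered in the ADLA database beforehand).
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Extract fields from the JSON input; the field names and types here
// (id, name, value) are placeholders for your real schema.
@json =
    EXTRACT id string,
            name string,
            value double
    FROM "/input/data.json"
    USING new JsonExtractor();

// Write the rowset out as CSV to a different folder in the same store.
OUTPUT @json
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
```

If the ~100 GB is spread across many files, the EXTRACT can point at a file-set pattern (for example "/input/{*}.json") instead of a single path.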
I have tried this with Azure Data Factory and it's straightforward, with zero coding. The source and sink were both ADLS. There was nothing to change in the pipeline beyond a simple one-to-one mapping. We were not concerned with performance since it's a batch job for us; below are quick stats on the performance.
> Data Read: 42.68 GB
> Data Written: 12.97 GB
> Data volume: 42.68 GB
> Rows: 54,520,950
> Throughput: 3.97 MB/s
> Billed duration for data movement: 03:03:41