I've been working on converting large, deeply nested XML files into CSV using NiFi.
The requirement is to create many small tables (each with a different number of columns) from one big XML file, all of which are then merged, or concatenated together with a special character (such as a hyphen), to produce a single CSV at the end.
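To make the requirement concrete, here is a toy illustration (the table names, columns, and the hyphen separator line are placeholders, not my real schema): each small table is rendered as its own CSV block, and the blocks are concatenated with a separator line into one output file.

```python
# Toy illustration of the target output (names and separator are placeholders).
table1_csv = "col_a,col_b\n1,2\n3,4"        # 2 columns, 2 records
table2_csv = "col_x,col_y,col_z\n5,6,7"     # 3 columns, 1 record

# The small tables are concatenated with a special-character separator line.
final_csv = "\n-\n".join([table1_csv, table2_csv])
print(final_csv)
# col_a,col_b
# 1,2
# 3,4
# -
# col_x,col_y,col_z
# 5,6,7
```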
However, I am not sure whether my approach is optimal.
My NiFi pipeline is as follows:
- GetFile
- ExecuteStreamCommand (python script)
- SplitJson
- ConvertRecord (Json to CSV)
- MergeContent (with strategy of fragment.identifier)
- UpdateAttribute (appending csv extension to filename)
- PutFile
My approach is to have the script create JSON from the XML, shaped like the example below (a rough sketch of the script follows the example), and then use a record controller service to convert JSON to CSV after splitting the JSON into one flowfile per table. Rather than rewriting XML from scratch, simply building {column: value} dictionaries (i.e. JSON) was much faster.
{
  "table1": [{"column1": "value1", ..., "column_n": "value_n"}, {}, {}],
  "table2": [{"column1": "value1", ..., "column_n": "value_n"}, {}, {}, {}, {}]
}
*The length of the list in each table's value represents the number of records in that table's CSV.
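For reference, the ExecuteStreamCommand script is roughly like the sketch below (the element names `recordA`/`recordB` and the table mapping are placeholders, not my real schema). It walks the XML once with `xml.etree.ElementTree` and writes the {table: [rows]} JSON to stdout, which NiFi picks up as the outgoing flowfile content:

```python
# Rough sketch of the ExecuteStreamCommand script (real table/element names differ).
# Reads the XML flowfile from stdin, emits {table: [{column: value}, ...]} JSON on stdout.
import sys
import json
import xml.etree.ElementTree as ET

def rows_for(parent, record_tag):
    """Flatten each <record_tag> element under parent into a {column: value} dict."""
    rows = []
    for rec in parent.iter(record_tag):
        rows.append({child.tag: (child.text or "").strip() for child in rec})
    return rows

def main():
    root = ET.parse(sys.stdin).getroot()
    # Hypothetical mapping of output tables to the XML elements they come from.
    tables = {
        "table1": rows_for(root, "recordA"),
        "table2": rows_for(root, "recordB"),
    }
    json.dump(tables, sys.stdout)

if __name__ == "__main__":
    main()
```

Downstream, SplitJson splits on a JsonPath (something like `$.*`) so that each table's array of rows becomes its own flowfile before ConvertRecord turns it into CSV.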
When I tried the above pipeline in my local environment, it processed 250 XML files in roughly 60 seconds, about 0.25 seconds per file. However, when I replaced ExecuteStreamCommand with ExecuteScript (Jython), instead of the faster performance I expected, NiFi went down with an out-of-memory error. Processing speed also dropped to more than 30 seconds for just one file.
Why is ExecuteScript (Jython) so poor in terms of performance? Should I use Groovy if I have to go with ExecuteScript, or is there a better approach to the CSV conversion?