
I've been working on converting large, deeply nested XML files into CSV using NiFi.

The requirement is to create many small tables (each with a different number of columns) from one big XML file, all of which are then merged, or concatenated together with a special character (like a hyphen), to output a single CSV at the end.

But I am not quite sure whether my approach is optimal.

My NiFi pipeline is as follows:

  1. GetFile
  2. ExecuteStreamCommand (Python script)
  3. SplitJson
  4. ConvertRecord (JSON to CSV)
  5. MergeContent (merging on fragment.identifier)
  6. UpdateAttribute (appending the .csv extension to the filename)
  7. PutFile

My approach is to create JSON from the XML, like the example below, and then use a controller service to convert the JSON to CSV after splitting the JSON into one flow file per table. Rather than rewriting XML from scratch, simply building a {column: value} dictionary, i.e. JSON, was much faster.

{
  "table1": [{"column1": "value1", ..., "column_n": "value_n"}, {}, {}],
  "table2": [{"column1": "value1", ..., "column_n": "value_n"}, {}, {}, {}, {}]
}
*The length of the list in each table's value represents the number of records in that table's CSV.
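For illustration only, a minimal sketch of that conversion step in plain Python, assuming a hypothetical schema where each child of the root is a table and each of its children is a record (the real script depends on the actual XML structure):

    import xml.etree.ElementTree as ET

    def xml_to_tables(xml_text):
        """Flatten nested XML into {table_name: [{column: value}, ...]}."""
        root = ET.fromstring(xml_text)
        tables = {}
        for section in root:              # assumed: one child element per table
            rows = []
            for record in section:        # assumed: one child element per record
                # leaf elements under the record become {column: value} pairs
                row = {leaf.tag: (leaf.text or "")
                       for leaf in record.iter() if len(leaf) == 0}
                rows.append(row)
            tables[section.tag] = rows
        return tables

The resulting dictionary is then serialized with json.dumps so that SplitJson and ConvertRecord can handle one table at a time downstream.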

When I tried the above pipeline in a local environment, it processed 250 XML files in roughly 60 seconds, about 0.25 seconds per file. However, when I replaced ExecuteStreamCommand with ExecuteScript (Jython), instead of the faster performance I expected, NiFi went down with an out-of-memory error. Processing also took more than 30 seconds for just one file.

Why is ExecuteScript (Jython) so poor in terms of performance? Should I use Groovy if I have to go with ExecuteScript, or is there a better approach to the CSV conversion?

Have you tried the ConvertRecord processor with an XML Reader and a CSV Writer? - daggett
About the performance question: performance depends on the code, and a memory error is not about performance... - daggett
Yes, I tried the XML Reader and CSV Writer, but I can only apply the XML Reader to flattened XML, not nested XML. So rewriting the XML takes much more time than just producing a list of the necessary records as JSON data. - Micro_Andy
Basically, the code used in ExecuteScript is the same as the one in ExecuteStreamCommand; it just subclasses StreamCallback from the Java module and overrides its process method. Is it because the Jython environment is set up when executing the script, which slows down performance? Or will it be much faster when enough memory and CPU are available? - Micro_Andy
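(For context, an ExecuteScript (Jython) version like the one described in this comment typically follows the standard StreamCallback pattern, roughly as sketched below; convert() stands in for the same XML-to-JSON logic used by the external script and is not shown.)

    from org.apache.nifi.processor.io import StreamCallback
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets

    class ConvertCallback(StreamCallback):
        def process(self, inputStream, outputStream):
            xml_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            json_text = convert(xml_text)  # hypothetical: same XML-to-JSON logic
            outputStream.write(json_text.encode('utf-8'))

    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, ConvertCallback())
        session.transfer(flowFile, REL_SUCCESS)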

1 Answer


The documentation explains that ExecuteScript is experimental.

ExecuteStreamCommand is better suited to your goals:

Executes an external command on the contents of a flow file, and creates a new flow file with the results of the command.

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.5.0/org.apache.nifi.processors.script.ExecuteScript/index.html
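For what it's worth, the same conversion logic can be kept as a plain Python script under ExecuteStreamCommand, since the processor pipes the flow file content to the script's stdin and takes stdout as the new content. A minimal sketch, where xml_to_tables is the hypothetical conversion function from the question:

    #!/usr/bin/env python
    import json
    import sys

    from convert import xml_to_tables  # hypothetical module holding the conversion logic

    # Flow file content arrives on stdin; whatever is printed to stdout
    # becomes the content of the outgoing flow file.
    tables = xml_to_tables(sys.stdin.read())
    sys.stdout.write(json.dumps(tables))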