2 votes

I have a CSV with 70 columns. The 60th column contains a value which decides whether the record is valid or invalid. If the 60th column has 0, 1, 6 or 7 the record is valid; if it contains any other value it is invalid.

I realised that this functionality wasn't possible by relying solely on changing the properties of processors in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this Python code as the script body.

import csv

valid = 0
invalid = 0
total = 0

with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv') as f, \
        open("valid.csv", "w") as file1, \
        open("invalid.csv", "w") as file2:
    reader = csv.reader(f)
    valid_writer = csv.writer(file1)
    invalid_writer = csv.writer(file2)
    for row in reader:  # iterate the parsed rows, not the raw file lines
        total += 1
        # column 60 (index 59) decides whether the record is valid
        if row[59] in ("0", "1", "6", "7"):
            valid += 1
            valid_writer.writerow(row)
        else:
            invalid += 1
            invalid_writer.writerow(row)

print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))

I have no idea how to use a session and write code within the ExecuteScript processor as shown in this question, so I just wrote a simple Python script and directed the valid and invalid data to different files. The approach I have used has many limitations:

  1. I want to be able to dynamically process CSVs with different filenames.
  2. The CSV which the invalid data is sent to must also have the same filename as the input CSV.
  3. There would be around 20 CSVs in my redder folder. All of them must be processed in one go.

I hope you can suggest a method to do the above. Feel free to provide a solution by editing the Python code I have used, or by using a completely different set of processors and excluding the ExecuteScript processor altogether.

You can look into the QueryRecord processor instead of doing this in a Jython script. With that processor, you can simply add a new relationship that says "select * from FLOWFILE where column60 in (0,1,6,7)". - Pushkr
@Pushkr Can you clarify which property/configuration must be changed to "select * from FLOWFILE where column60 in (0,1,6,7)" within the QueryRecord processor? - Himsara Gallege

2 Answers

2 votes

Here are complete step-by-step instructions on how to use the QueryRecord processor.

Basically, you need to set up the highlighted properties:

[screenshot of the QueryRecord processor configuration]
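
Since the screenshot does not come through here, a rough sketch of the properties to set (the column name column60 is taken from the comment above and depends on how your record reader names the 60th column):

    Record Reader    CSVReader
    Record Writer    CSVRecordSetWriter
    valid            select * from FLOWFILE where column60 in (0,1,6,7)
    invalid          select * from FLOWFILE where column60 not in (0,1,6,7)

Each user-defined property becomes an outgoing relationship of the processor, so valid and invalid records leave QueryRecord on separate relationships.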

1 vote

You want to route records based on values from one column. There are various ways to make this happen in NiFi; one of them is the PartitionRecord processor.

I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case: I want to distinguish cities in Europe from cities elsewhere. The following data is given:

id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany

Flow:

[screenshot of the flow: GenerateFlowFile -> PartitionRecord -> RouteOnAttribute]

GenerateFlowFile:

[screenshot of the GenerateFlowFile configuration]
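
GenerateFlowFile is only used here as a test data source; roughly, the example CSV from above is placed in its Custom Text property (in a real flow you would pick up the files with ListFile/FetchFile instead):

    Custom Text:
        id,city,country
        1,Berlin,Germany
        2,Paris,France
        3,New York,USA
        4,Frankfurt,Germany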

PartitionRecord:

[screenshot of the PartitionRecord configuration]
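
The essential PartitionRecord configuration is a record reader, a record writer, and one user-defined property whose value is a RecordPath; a sketch of what the screenshot most likely shows:

    Record Reader    CSVReader
    Record Writer    CSVRecordSetWriter
    country          /country

The name of the user-defined property (country) becomes the attribute written on each outgoing flowfile, and the RecordPath /country selects the column to partition on.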

CSVReader should be set up to infer the schema and CSVRecordSetWriter to inherit the schema. PartitionRecord will group the records by country and pass each group on together with a country attribute that holds the country value. You will see the following groups of records:

id,city,country
1,Berlin,Germany
4,Frankfurt,Germany

id,city,country
2,Paris,France

id,city,country
3,New York,USA

Each group is a flowfile and will have the country attribute, which you will use to route the groups.

RouteOnAttribute:

[screenshot of the RouteOnAttribute configuration]
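
RouteOnAttribute then needs a single user-defined property that evaluates the country attribute with expression language; the relationship name and the country list below are just assumptions for this example:

    Routing Strategy    Route to Property name
    is_europe           ${country:equals('Germany'):or(${country:equals('France')})}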

All countries from Europe will be routed to the is_europe relationship. Now you can apply the same strategy to your use case.
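
Translated to your CSV, the same pattern would look roughly like this (the column name is a guess and depends on the schema your reader infers for the 60th column, and the in() expression language function requires a reasonably recent NiFi version):

    PartitionRecord:   valid_flag    /column60
    RouteOnAttribute:  valid         ${valid_flag:in('0','1','6','7')}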