CSV parsing with groovy script using apache-nifi

Question

I have a csv file with more than 70 columns and millions of records. But for processing I need around 20 columns. To extract them I have used modified groovy script in Split fields and NiFi API with Groovy.

My script :

import java.nio.charset.StandardCharsets

def flowFile = session.get()
if(!flowFile) return

flowFile = session.write(flowFile, {inputStream, outputStream ->
   inputStream.eachLine { line ->
   a = line.tokenize(',')
   outputStream.write("${a[0]},${a[1]},${a[2]},${a[3]},${a[4]}\n".toString().getBytes(StandardCharsets.UTF_8))
   }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

Original csv :

date,id,name,age,addr,lang
19/12/2019,1,sachith,29,sac@email,go
,2,nalaka,29,nalaka@email,
,,muhandiram,,c

When this is being processed by ExecuteScript processor, output looks like :

date,id,name,age
19/12/2019,1,sachith,29
2,2,nalaka,29
null,null,muhandiram,null

If data field is empty, its replaced with id and other empty values are replaced with null.

Why script misbehaves? My knowledge about groovy is almost nothing.

expected output :

date,id,name,age
19/12/2019,1,sachith,29
,2,nalaka,29
,,muhandiram,

daggett daggett · Accepted Answer · 2019-12-23T15:03:43

actually your code is Ok, except tokenize

the tokenize method does not return empty tokens.

use split instead of tokenize to have all values between delimiters.

for example following code:

println ",,2,3,,".tokenize(',')
println ",,2,3,,".split(',')
println ",,2,3,,".split(',',-1)

prints:

[2, 3]
[, , 2, 3]
[, , 2, 3, , ]

Note, that split without -1 trims last empty tokens...

So, your code bit modified:

def flowFile = session.get()
if(!flowFile) return

flowFile = session.write(flowFile, {inputStream, outputStream ->
    outputStream.withWriter("UTF-8"){ w ->
        inputStream.eachLine("UTF-8"){ line ->
            def row = line.split(',',-1)
            w << row[0..4].join(',') << '\n'
        }
    }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

Edit : For someone who needs to take various random columns. Have to modify :

w << row[col0,col3,col5,col10,col21].join(',')

CSV parsing with groovy script using apache-nifi

1 Answers