0
votes

I have a csv file with more than 70 columns and millions of records. But for processing I need around 20 columns. To extract them I have used modified groovy script in Split fields and NiFi API with Groovy.

My script :

import java.nio.charset.StandardCharsets

def flowFile = session.get()
if(!flowFile) return

flowFile = session.write(flowFile, {inputStream, outputStream ->
   inputStream.eachLine { line ->
   a = line.tokenize(',')
   outputStream.write("${a[0]},${a[1]},${a[2]},${a[3]},${a[4]}\n".toString().getBytes(StandardCharsets.UTF_8))
   }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

Original csv :

date,id,name,age,addr,lang
19/12/2019,1,sachith,29,sac@email,go
,2,nalaka,29,nalaka@email,
,,muhandiram,,c

When this is being processed by ExecuteScript processor, output looks like :

date,id,name,age
19/12/2019,1,sachith,29
2,2,nalaka,29
null,null,muhandiram,null

If data field is empty, its replaced with id and other empty values are replaced with null.

Why script misbehaves? My knowledge about groovy is almost nothing.

expected output :

date,id,name,age
19/12/2019,1,sachith,29
,2,nalaka,29
,,muhandiram,
1

1 Answers

2
votes

actually your code is Ok, except tokenize

the tokenize method does not return empty tokens.

use split instead of tokenize to have all values between delimiters.

for example following code:

println ",,2,3,,".tokenize(',')
println ",,2,3,,".split(',')
println ",,2,3,,".split(',',-1)

prints:

[2, 3]
[, , 2, 3]
[, , 2, 3, , ]

Note, that split without -1 trims last empty tokens...

So, your code bit modified:

def flowFile = session.get()
if(!flowFile) return

flowFile = session.write(flowFile, {inputStream, outputStream ->
    outputStream.withWriter("UTF-8"){ w ->
        inputStream.eachLine("UTF-8"){ line ->
            def row = line.split(',',-1)
            w << row[0..4].join(',') << '\n'
        }
    }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

Edit : For someone who needs to take various random columns. Have to modify :

w << row[col0,col3,col5,col10,col21].join(',')