Pentaho Kettle split CSV into multiple records

Question

I'm new to Kettle, but getting on well with it so far. However I can't figure out how to do this.

I have a csv which looks something like this

a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3

The first column starts with a key (a,b,c), and then the rest of the columns follow. What I want to do is read in the csv (got that covered) and then split the csv based on key, so I have 3 chunks/ groups of data and then convert each of those chunks of data into a separate json file, which I think I can get.

What I can't get my head around is the grouping the data and then performing a separate action (convert to json) on each of those separate groups. Its not the creating json I have an issue with.

The data is from a sensor network of many environmental sensors so there are many keys, hundreds, and new ones get added. I've used map reduce to process this data before as the concept of partitioning is what I'm trying to replicate here, without using the hadoop elements of kettle as the deployment is different. Once I've partitioned the data it needs to be loaded into different places as seperate records. The key is a unique ID (serial number) of a sensor.

Any ideas please?

Thanks

Have you looked at the Switch/Case statement? If not we'll need a little more info about your data and what you're trying to do. For example, how many distinct key values do and why you need to split based on the key. — Brian.D.Myers
In all seriousness, why are you taking this from a map reduce to kettle? My experience has been to use kettle for prototypes and then convert that into something else when it has to scale up for production. If you already have a map reduce that does this, why change to kettle? — G Gordon Worley III

bsecker bsecker · Accepted Answer · 2013-07-30T10:37:07

I guess create a javascript to output the fields of a row in a JSON like string added to the row:

{"id":"a","col1":"1","col2":"2","col3":"3"}

Next you could use the group step and set the base field to the 'id' field and have as aggregate the javascript value in type 'Concatenate strings separated by ,'

{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"}, {"id":"a","col1":"7","col2":"8","col3":"9"}

Add some tags around it and you have valid json. Next you could assemble a file name using javascript step:

var file_name="C:\\dir\\"+ id + ".txt";

Use the text file output and set the file name field to 'file_name'. Remove separator / enclosure options to have none extra formatting and you are done.

Pentaho Kettle split CSV into multiple records

2 Answers