1
votes

I'm new to Kettle, but getting on well with it so far. However I can't figure out how to do this.

I have a csv which looks something like this

a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3

The first column starts with a key (a,b,c), and then the rest of the columns follow. What I want to do is read in the csv (got that covered) and then split the csv based on key, so I have 3 chunks/ groups of data and then convert each of those chunks of data into a separate json file, which I think I can get.

What I can't get my head around is the grouping the data and then performing a separate action (convert to json) on each of those separate groups. Its not the creating json I have an issue with.

The data is from a sensor network of many environmental sensors so there are many keys, hundreds, and new ones get added. I've used map reduce to process this data before as the concept of partitioning is what I'm trying to replicate here, without using the hadoop elements of kettle as the deployment is different. Once I've partitioned the data it needs to be loaded into different places as seperate records. The key is a unique ID (serial number) of a sensor.

Any ideas please?

Thanks

2
Have you looked at the Switch/Case statement? If not we'll need a little more info about your data and what you're trying to do. For example, how many distinct key values do and why you need to split based on the key.Brian.D.Myers
Please see edits to question.David
How many distinct partitions?Brian.D.Myers
In all seriousness, why are you taking this from a map reduce to kettle? My experience has been to use kettle for prototypes and then convert that into something else when it has to scale up for production. If you already have a map reduce that does this, why change to kettle?G Gordon Worley III

2 Answers

1
votes

I guess create a javascript to output the fields of a row in a JSON like string added to the row:

{"id":"a","col1":"1","col2":"2","col3":"3"}

Next you could use the group step and set the base field to the 'id' field and have as aggregate the javascript value in type 'Concatenate strings separated by ,'

{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"}, {"id":"a","col1":"7","col2":"8","col3":"9"}

Add some tags around it and you have valid json. Next you could assemble a file name using javascript step:

var file_name="C:\\dir\\"+ id + ".txt";

Use the text file output and set the file name field to 'file_name'. Remove separator / enclosure options to have none extra formatting and you are done.

1
votes

If i have understood your question correctly, you can use "GROUP BY" step to group the columns (i.e. the first header in your data set) and then store these into memory.

enter image description here

Once this is done.. use parameter looping to "get the variables" and dynamically generate multiple JSON output. Check the image below:

enter image description here

In the JSON output step, use variables like header1 to generate multiple files. Highlighted below the changes i made in the JSON Output.

enter image description here

In case you find in confusing, i have uploaded a sample code in here.

Hope it helps :)