0
votes

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.

Eg

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W

I need to split and group this as such:
(each block starts when the first column is equal to '1')

Block a)

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M

Block b)

1|N|O|P
4|Q|R|S|T
5|U|V|W

Eg

 a |1|A|B|C
 a |2|D|E|F
 a |8|G|H|I|J|K
 a |4|L|M

 b |1|N|O|P
 b |4|Q|R|S|T
 b |5|U|V|W

How can this be achieved using Penatho? Thanks.

I found a similar question but answers don't really help my case Pentaho Kettle split CSV into multiple records

1
This is irregular CSV data. First off it's ragged-right; second it's positional (meaning rows need to be aware of their position relative to other rows). You're probably best off just processing it with JavaScript. Can you say more about what you're going to do with each block? Perhaps there's another approach.Brian.D.Myers
Do you need to save the different blocks into different files?bolav
@Brian.D.Myers I need to process the id(concatenated pipe delimited values from a row within each block) and use it to extract data from a database. I will then need the block of data to generate a report based on the data from database.John Sylvester
Looks like the original recommendation stands. Can you use a Modified JavaScript Value step to process these data into variables and/or relational data, emit the keys to the database and then get your database values with a Database lookup or Database join step?Brian.D.Myers

1 Answers

4
votes

I think I got the answer.

I created the transformation in this zip that can transform your "csv" file in rows almost like you described but I don't know what you intend to do next, so maybe you can give us more details. =)

I'll explain what I did:

1) First, we grab the row full text with a Text input step

Transformation

When you look at configurations of Text Input step, you'll see I used a ';' has separator, when your input file uses '|' so I'm not spliting columns with the '|' but loading the whole line in one column. Grabbing the row's full text, nothing else.

2) Next we apply a regex eval to separate the ID from the rest of our string.

^(\d+)\|(.*)

Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.

That gives you this output: (blue is the first capture group, red is the second) regex

3) Now what you need is to add a 'sequence' that only goes up if there is a row_id = 1. Which I did in the Mod JS Value with the following code:

var sequence

//if it's the first row, set sequence to 1
if(sequence == null){
    sequence = 1;
}else{
//if it's not the first row, check if the row_id is equal to 1 (string)
    if(row_id == '1'){
        // increment the sequence
        sequence++;
    }else{
        //nothing
    }
}

And that will give you this output that seem to be what you expected: (green, the group/sequence done)

sequence

Hope it helps =)