Hi Stack Overflow community,

I'm new to Pig. I have a CSV file that contains 5 columns with a header row. It looks like this:

column1 | column2 | column3 | column4 | column5

test1012 | test2045 | test3250 | test4865 | test5110
test1245 | test2047 | test3456 | test4234 | test5221 ..........

I want to sort by columns 1, 3 and 4 only, but I don't know how to filter by column header.

If you could please point me to the right functions to accomplish this, that would be great. Thanks!

1 Answer

Assuming you load the file something like below (with a comma as the delimiter), you can just use the ORDER ... BY operator.

myInput = LOAD 'myFile.csv' USING PigStorage(',') AS
     (c1:chararray,c2:chararray,c3:chararray,c4:chararray,c5:chararray);
mySortedInput = ORDER myInput BY c1 ASC, c3, c4 ASC;
DUMP mySortedInput;

If you want to keep only those columns, do the following after the LOAD.

myInputWithLessCols = FOREACH myInput GENERATE
     c1, c3, c4;

If I totally misunderstood and all you want to do is filter out the header row, you could do the following after the LOAD statement.

myInputWithoutHeaders = FILTER myInput BY c1 != 'column1'
    AND c2 != 'column2' AND c3 != 'column3' 
    AND c4 != 'column4' AND c5 != 'column5';
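
If you actually need all three steps at once, they can be chained. This is only a minimal sketch under the same assumptions as above: the file is 'myFile.csv', it is comma-delimited, and the header row literally contains column1 in the first field. The relation names noHeader, threeCols and sortedCols are made up for illustration, and the header check is shortened to the first column only.

```pig
-- Load the CSV with every field as chararray (file name and delimiter are assumptions)
myInput = LOAD 'myFile.csv' USING PigStorage(',') AS
     (c1:chararray, c2:chararray, c3:chararray, c4:chararray, c5:chararray);

-- Drop the header row; checks only the first field, assuming no data row has 'column1' there
noHeader = FILTER myInput BY c1 != 'column1';

-- Keep only columns 1, 3 and 4
threeCols = FOREACH noHeader GENERATE c1, c3, c4;

-- Sort by c1, then c3, then c4 (ASC is the default, spelled out here for clarity)
sortedCols = ORDER threeCols BY c1 ASC, c3 ASC, c4 ASC;

DUMP sortedCols;
```

Because the fields are loaded as chararray, the sort is a string comparison rather than a numeric one. Filtering and projecting before the ORDER also means fewer rows and columns go into the sort.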