In Pig, when we load a CSV file with a LOAD statement without specifying a schema and with the default PigStorage delimiter ('\t'), what happens? Will the load work, and can we dump the data? Or will it throw an error, since the file is delimited by ',' and PigStorage expects '\t'? Please advise.

1) Show us your load statement, and 2) why don't you just try it for yourself? - Andrew

1 Answer

When you load a CSV file without defining a schema using PigStorage('\t'), there are no tabs in any line of the input file, so each whole line is treated as a single field in a one-field tuple. The load and dump work without error, but you will not be able to access the individual comma-separated values in the line.

Example input file:

john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ

a = LOAD 'input' USING PigStorage('\t');
dump a;

OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)

b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)

Ideally, b should have been:

(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)

if the delimiter were a comma. But since the delimiter was a tab and no tab exists in the input records, the whole line was treated as one field ($0), so $1 and $2 are null. Pig does not complain if a field is null; it just outputs nothing for it. Hence you see only the trailing commas when you dump b.
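
For comparison, here is a minimal sketch of the same load with a comma delimiter. It assumes the same 'input' file as above. The field names in the AS clause (first, last, school, state) are hypothetical and are only there to show that a schema can be attached, so the fields can be referenced by name instead of by position.

```pig
-- Assumes the same 'input' file as in the example above.
-- The field names below are hypothetical, chosen only for illustration.
a = LOAD 'input' USING PigStorage(',')
    AS (first:chararray, last:chararray, school:chararray, state:chararray);

-- Each comma-separated value is now its own field,
-- so first, last, school correspond to $0, $1, $2.
b = FOREACH a GENERATE first, last, school;
DUMP b;
```

With the comma delimiter, dump b should give the three-field tuples listed above as the ideal output.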

Hope that was useful.