1
votes

I am trying to load data in Pig that is delimited by double pipes (||). This is my command:

L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');

I am getting the following error:

2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'

My sample input file has exactly 5 lines, as follows:

POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99

I tried several options, such as putting a backslash before the delimiter (\||, \|\|), but they all failed. I also tried specifying a schema but got the same error. I am using Hortonworks (HDP 2.2.4) and Pig 0.14.0.

Any help is appreciated. Please let me know if you need any further details.

2 Answers

1
votes

I have run into this as well. From reading the PigStorage source code, I believe the PigStorage argument must resolve to a single character.

So we can use this code instead:

L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;

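This works if you know how many columns you have: splitting on a single | leaves an empty field between every pair of real values, so the real values sit at the even positions. It should not hurt performance because the projection runs on the map side.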
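To see why the even-numbered positions are the ones to keep, here is a small sketch in Python (used only for illustration; it is not Pig code). It mimics splitting one of the sample lines on a single '|'. The function name split_single_pipe is made up for this example.

```python
def split_single_pipe(line):
    """Split a line on a single '|', as a one-character delimiter would."""
    return line.split("|")


sample = "POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0"
fields = split_single_pipe(sample)

# Each '||' yields an empty field between two real values,
# so the real values sit at even positions 0, 2, ..., 16.
real_values = fields[0::2]
empty_fields = fields[1::2]

print(len(fields))
print(real_values)
```

With 9 real values there are 17 fields in total, which matches the $0 through $16 projection in the FOREACH above.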

0
votes

PigStorage expects a single character as the delimiter. If you still want to split on a multi-character delimiter, you can use MyRegExLoader from piggybank:

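REGISTER '/path/to/piggybank.jar';
-- MyRegExLoader takes a regular expression; each capture group becomes one field.
-- Adjust the groups and the schema to match your data.
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)')
      as (movieid:int, title:chararray, genre:chararray);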
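A regex-based loader pulls fields out through capture groups rather than by naming a delimiter. As a rough illustration of that idea, here is a Python sketch (not Pig code), assuming MyRegExLoader maps each capture group to one field. The pattern and the function name first_three_fields are hypothetical and only cover the first three fields of the sample data from the question.

```python
import re

# One capture group per field, separated by a literal '||'.
# Hypothetical pattern covering only the first three fields.
PATTERN = re.compile(r"([^|]*)\|\|([^|]*)\|\|([^|]*)")


def first_three_fields(line):
    """Return the first three '||'-separated fields, or None if no match."""
    match = PATTERN.search(line)
    return match.groups() if match else None


sample = "POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0"
print(first_three_fields(sample))
```

For the full sample data the pattern would need nine groups, one per column, with a matching nine-field schema on the Pig side.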