
I have files named in the format test_YYYYMM.txt. I am using '-tagFile' and SUBSTRING() to extract the year and month for use in my Pig script.

The file name gets added as a pseudo-column at the beginning of the tuple.

Before I do a DUMP, I would like to remove that column. Doing a FOREACH ... GENERATE with only the columns I need does not work; the output still retains the pseudo-column.

Is there a way to remove this column?

My sample script is as follows:

raw_data = LOAD 'test_201501.txt' using PigStorage('|', '-tagFile') as
              (col1: chararray, col2: chararray);

data_with_yearmonth = FOREACH raw_data GENERATE 
                      SUBSTRING($0,5,11) as yearmonth,
                      'TEST_DATA' as test,
                      col1,
                      col2;

DUMP data_with_yearmonth;

Expected Output: 201501, TEST_DATA, col1, col2

Current Output: 201501, TEST_DATA, test_YYYYMM.txt, col1, col2


1 Answer


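First of all, if col1 and col2 are strings, you should declare them as CHARARRAY in Pig. Also, I suspect your current output is actually: 201501, TEST_DATA, test_YYYYMM.txt, col1. Correct me if I'm wrong, but because you used '-tagFile', the first column is the file name, which is why you access it with $0 in your SUBSTRING. Your schema names only two fields, so col1 presumably ends up referring to the file name rather than to the first data column. Declaring the file name as its own field in the schema should avoid this.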

You can try this code:

raw_data = LOAD 'text_201505.txt' 
           USING PigStorage('|', '-tagFile') 
           AS (title: CHARARRAY, col1: CHARARRAY, col2: CHARARRAY); 

data_with_yearmonth = FOREACH raw_data 
                         GENERATE 
                             SUBSTRING($0,5,11) AS yearmonth,
                             'TEST_DATA' AS test,
                             col1,
                             col2;

DUMP data_with_yearmonth;
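
If you want to confirm that the file name column is really gone before you DUMP, here is a minimal sketch of the same idea. It assumes the three-field schema and the question's input file; the field name filename is just my choice for illustration, and I have not run this against your data. It refers to the tag column by name instead of $0 and uses DESCRIBE to print the schema of the result:

```pig
-- Name the tag column explicitly so col1 and col2 line up with the data columns.
-- 'filename' is a hypothetical field name; any name works.
raw_data = LOAD 'test_201501.txt'
           USING PigStorage('|', '-tagFile')
           AS (filename: chararray, col1: chararray, col2: chararray);

-- filename is used to build yearmonth but is not projected itself.
data_with_yearmonth = FOREACH raw_data GENERATE
                          SUBSTRING(filename, 5, 11) AS yearmonth,
                          'TEST_DATA' AS test,
                          col1,
                          col2;

-- Print the schema of the relation; it should list only the four projected fields.
DESCRIBE data_with_yearmonth;

DUMP data_with_yearmonth;
```

Referring to the field by name rather than by position makes it harder to mix up the tag column with the data columns if the schema changes later.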