I'm using Pig on Hadoop to analyse logs in CSV format. From time to time, my data provider adds new fields to the logs. New fields are always appended at the end of each line.

I would like to know how to properly load these CSV files when the globbing pattern matches both "old format" and "new format" files, while still being able to access the new fields in the most recent files.

Let's take a practical example:

2014/12/20/log_2014-12-20.csv:
  f1, f2, f3

2014/12/21/log_2014-12-21.csv:
  f1, f2, f3

2014/12/22/log_2014-12-22.csv:
  f1, f2, f3

2014/12/23/log_2014-12-23.csv:
  f1, f2, f3, f4, f5

2014/12/24/log_2014-12-24.csv:
  f1, f2, f3, f4, f5

Notice that two new fields, f4 and f5, appear starting on 23 December 2014.

With the following Pig statement, data from files before 2014-12-23 does not load, so only data from 2014-12-23 onwards is available in the Pig alias MYDATA:

MYDATA = LOAD 's3://mybucket/logs/2014/12' using PigStorage(',') as (
  f1: int,
  f2: int,
  f3: int,
  f4: int,
  f5: int
);

If I want to load data for the whole time range, I need to omit the new fields:

MYDATA = LOAD 's3://mybucket/logs/2014/12' using PigStorage(',') as (
  f1: int,
  f2: int,
  f3: int
);

But then I can't take advantage of the new fields in the most recent data. In my real-world use case, the above statements are stored in a Pig macro so that the log data can be used from multiple scripts. Adding the new fields to the macro breaks the scripts that load not-so-recent data.

What are your suggestions for handling such a change in data schema?

Thanks for your help.

1 Answer

I have had good experiences with Parquet (http://parquet.incubator.apache.org/). It also provides a Pig loader and storer. The loader lets you specify the schema you want to read the data with, and any fields that are not present in the data are filled with NULL (a simple form of schema evolution). In your case you would need to convert the data to Parquet format first, but after that it should work just as you expect.
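
As a rough sketch of what that could look like (not tested against your data): convert the old and new CSV files separately, each with its own schema, then read everything back through the Parquet loader with the full schema. The jar path, the logs-parquet output location and the old/new directory names are made up for illustration. I'm also assuming the loader and storer live in the parquet.pig package, which may differ depending on your Parquet version (newer releases use org.apache.parquet.pig).

```pig
-- Hypothetical jar location; point this at your parquet-pig bundle.
REGISTER /path/to/parquet-pig-bundle.jar;

-- One-off conversion of the old-format files (3 fields).
OLD_CSV = LOAD 's3://mybucket/logs/2014/12/2{0,1,2}' USING PigStorage(',') AS (
  f1: int,
  f2: int,
  f3: int
);
STORE OLD_CSV INTO 's3://mybucket/logs-parquet/2014/12/old' USING parquet.pig.ParquetStorer;

-- One-off conversion of the new-format files (5 fields).
NEW_CSV = LOAD 's3://mybucket/logs/2014/12/2{3,4}' USING PigStorage(',') AS (
  f1: int,
  f2: int,
  f3: int,
  f4: int,
  f5: int
);
STORE NEW_CSV INTO 's3://mybucket/logs-parquet/2014/12/new' USING parquet.pig.ParquetStorer;

-- Read both sets with the full schema; f4 and f5 should come back
-- as NULL for rows that were written from the old-format files.
MYDATA = LOAD 's3://mybucket/logs-parquet/2014/12/*'
  USING parquet.pig.ParquetLoader('f1:int, f2:int, f3:int, f4:int, f5:int');
```

The final LOAD is the only statement that would need to live in your macro, so scripts that only use f1 to f3 should keep working when further fields are added later.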