I need some guidance on a simple task: creating a schema in Apache Pig for my data file. Two files are involved. The first is a data file with no column header, and the second contains the column headers for that data file. So the column_header file is effectively the schema for the data file. How do I express this in a Pig script? Here's what I have so far:
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as column_header;
schema = foreach data generate column_header;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
This is the output of
DUMP column_header
(accept_language,browser,browser_height,browser_width)
When I run
DUMP data;
only the first column of the data is output, which is wrong:
en-US
en-US
en-US
en-US
Instead, it should be:
en-US 638 755 1600
en-US 638 655 1342
en-US 638 723 1612
en-US 638 231 1234
How can I get Pig to treat the contents of "column_header" as the schema string in the AS clause of the PigStorage load on the second line of code?
Edit: The code below works, but instead of hard-coding the column names I would like the Pig script to read them from the column_header file.
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as (accept_language,browser,browser_height,browser_width);
schema = foreach data generate accept_language,browser,browser_height,browser_width;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
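One direction that might work, sketched under assumptions: Pig substitutes `$NAME` parameters passed with `-param` before it parses the script, so the header line could be turned into a comma-separated field list outside Pig and passed in. This assumes the header file is readable on the local filesystem (not only on HDFS), that its first line holds all the tab-separated column names, and that the load statement in a hypothetical script `load_with_schema.pig` ends with `as ($SCHEMA);`. The names `build_schema`, `HEADER_FILE`, and `load_with_schema.pig` are illustrative, not part of Pig.

```shell
#!/bin/sh
# Turn the first line of a tab-separated header file into a
# comma-separated field list, e.g. "accept_language,browser,...".
build_schema() {
    head -n 1 "$1" | tr -d '\r' | tr '\t' ','
}

# Assumption: a local copy of the header file exists at this path.
HEADER_FILE="${HEADER_FILE:-sitecatalyst/coulmn_headers.tsv}"

# Only invoke Pig when both the header file and the pig command are present.
if [ -f "$HEADER_FILE" ] && command -v pig >/dev/null 2>&1; then
    # load_with_schema.pig is hypothetical; its load line would read:
    #   data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as ($SCHEMA);
    pig -param SCHEMA="$(build_schema "$HEADER_FILE")" load_with_schema.pig
fi
```

With this approach the fields would be untyped (bytearray by default), the same as in the hard-coded version above.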