I need some guidance with a simple task: creating a schema in Apache Pig for my data file. Two files are involved. The first is a data file with no column header; the second contains the column headers for that data file, so the column_header file is effectively the schema for the data file. How do I express this in a Pig script? Here's what I have so far.

column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as column_header;
schema = foreach data generate column_header;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;

This is the output for

DUMP column_header

(accept_language,browser,browser_height,browser_width)

When I run

DUMP data;

only the first column of the data is output, which is wrong:

en-US

en-US

en-US

en-US

Instead, it should be:

en-US 638 755 1600

en-US 638 655 1342

en-US 638 723 1612

en-US 638 231 1234

How can I get Pig to treat "column_header" as a string that can be used in the AS clause of the PigStorage load statement on the second line of code?

Edit: This code works, but instead of hard-coding the column headers I would like the Pig script to read them from the column_header file.

column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as (accept_language,browser,browser_height,browser_width);
schema = foreach data generate accept_language,browser,browser_height,browser_width;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
The question is not clear. What do you need? - sandeep rawat
Hi Sandeep, which part of the question is not clear? I just need to create a schema for my data file, and I have a file that contains the column headers for the data file. How do I achieve this? - Benedict Lee
What are you looking for? Do you want to set a header for the output file, or are you doing something else? - sandeep rawat
I do not want to set a header. I just need the schema (the documentation of the design) for the data file, which the code writes to an output file. - Benedict Lee
I feel this is not the correct way to create a design document. - sandeep rawat

1 Answer

You cannot achieve this kind of parameterization from within the Pig script directly, but you can get the same result with parameter substitution:

data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as $column_header;
-- after substitution there is no field named column_header, so project every field
schema = foreach data generate *;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;

Then run the Pig script with: pig -param_file (location of the parameter file) (script name)

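The parameter file should contain a line of the form column_header = complete schema, for example column_header = (accept_language,browser,browser_height,browser_width)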

https://blogs.msdn.microsoft.com/bigdatasupport/2014/08/12/how-to-use-parameter-substitution-with-pig-latin-and-powershell/
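
Since the goal is to avoid hard-coding the headers, the parameter file itself could be generated from the header file before Pig runs. Below is a minimal Python sketch of that idea, not a tested recipe. The function names, the output file name schema.param, and the script name script.pig are all hypothetical, and it assumes the header file holds a single tab-separated row of column names. Whether Pig needs the value quoted in the parameter file is also not verified here.

```python
import csv
from pathlib import Path


def build_schema(header_names):
    """Join column names into an untyped Pig schema string such as (a,b,c)."""
    # Drop surrounding whitespace and skip empty cells.
    names = [name.strip() for name in header_names if name.strip()]
    return "(" + ",".join(names) + ")"


def write_param_file(header_path, param_path, param_name="column_header"):
    """Read the first row of a tab-separated header file and write a
    one-line Pig parameter file of the form: name = (col1,col2,...)."""
    with open(header_path, newline="") as handle:
        first_row = next(csv.reader(handle, delimiter="\t"))
    schema = build_schema(first_row)
    Path(param_path).write_text(f"{param_name} = {schema}\n")
    return schema


# Hypothetical usage (file names other than the header path are assumptions):
#   write_param_file("sitecatalyst/coulmn_headers.tsv", "schema.param")
# followed on the command line by:
#   pig -param_file schema.param script.pig
```

The schema produced this way carries no types, so every field defaults to bytearray, just as in the hard-coded version in the question.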