Hey guys, I have one more question: I am not able to understand this behavior of Pig.

I am loading data into Pig and, after some transformations, storing it with PigStorage() on HDFS (/user/sga/transformeddata).

But when I load the data back from /user/sga/transformeddata and run

temp = load '/user/sga/transformeddata' using PigStorage();

gen = foreach temp generate page_type;

dump gen;

I get the following error:

databytearray can not be cast to java.lang.String

But if I do

gen = foreach temp generate *;

dump gen;

it works fine.

Any help in understanding this is much appreciated.

As requested, here is the code:

STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');

union_of_all_records is an alias in Pig.

Another script then consumes this data:

lookup_data =
        LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);

load_denorm_clickstream_record =
        LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');

and joins these two aliases:

denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
    JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;

step x :    final_output =
        FOREACH join_with_lookup
                GENERATE denorm_clickstream_record::page_type as page_type;

At step x I get the above error.

1 Answer

I think you have two options:

1) Tell Pig the schema of the data. For example:

temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);

2) When you first store the data, tell PigStorage to store the schema information as well: PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema from that stored schema information.
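
To make the two options concrete, here is a rough sketch rather than a tested fix. It reuses the paths and aliases from the question. The DESCRIBE statements are only there to check which types Pig actually sees, and the assumption that page_type is the first column of /user/sga/transformeddata comes from the example above, not from the real data.

```pig
-- Option 1: declare the schema explicitly when loading.
-- Assumption: page_type is the first field in the stored files;
-- list the remaining fields in order if there are more.
temp = LOAD '/user/sga/transformeddata' USING PigStorage()
       AS (page_type:chararray);
DESCRIBE temp;   -- shows the schema Pig will use for temp

-- Option 2: store with '-schema' so PigStorage writes a .pig_schema
-- file next to the data (writer script from the question).
STORE union_of_all_records
      INTO '/staged/google/data_after_denormalization'
      USING PigStorage('\t', '-schema');

-- Reader script: load the same location and inspect the schema
-- that PigStorage picked up from the stored .pig_schema file.
load_denorm_clickstream_record =
      LOAD '/staged/google/data_after_denormalization'
      USING PigStorage('\t', '-schema');
DESCRIBE load_denorm_clickstream_record;
```

If DESCRIBE shows page_type as bytearray, or shows no schema at all, the stored schema is probably not being picked up, and declaring the type in the LOAD (option 1) is the more direct way to rule that out.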