I am working with pig-0.16.0 and trying to join two tab-delimited files (.tsv) with a Pig script. Some of the columns hold integers, so I load them as int. However, every column I declare as int comes back empty. My join produced no output, so I took a step back and found that the problem occurs at the load step. Here is part of my Pig script:

REGISTER /usr/local/pig/lib/piggybank.jar;
-- $file1 = streaminputs/forum_node.tsv
-- $file2 = streaminputs/forum_users.tsv
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray);

LUFN = LIMIT u_f_n 10;

STORE LUFN INTO 'pigout/LN';

u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray);

LUFUU = LIMIT u_f_u 10;

STORE LUFUU INTO 'pigout/LU';

I also tried long, with the same result; only chararray seems to work. What could be the problem?

Snippets from two input .tsv files:

forum_nodes.tsv:

"id"    "title" "tagnames"  "author_id" "body"  "node_type" "parent_id" "abs_parent_id" "added_at"  "score" "state_string"  "last_edited_id"    "last_activity_by_id"   "last_activity_at"  "active_revision_id"    "extra" "extra_ref_id"  "extra_count"   "marked"
"5339"  "Whether pdf of Unit and Homework is available?"    "cs101 pdf" "100000458" ""  "question"  "\N"    "\N"    "2012-02-25 08:09:06.787181+00" "1" ""  "\N"    "100000921" "2012-02-25 08:11:01.623548+00" "6922"  "\N"    "\N"    "204"   "f"

forum_users.tsv:

"user_ptr_id"   "reputation"    "gold"  "silver"    "bronze"
"100006402" "18"    "0" "0" "0"
"100022094" "6354"  "4" "12"    "50"
"100018705" "76"    "0" "3" "4"
"100021176" "213"   "0" "1" "5"
"100045508" "505"   "0" "1" "5"
I would suggest editing your question to add a short portion of the input files, so that other users can try to reproduce the problem (see also MCVE). - lfurini
Looking at the data shared in the question, the values are strings because they are quoted, i.e. "18" is a chararray. - sandeep rawat

1 Answer

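Your values are wrapped in double quotes, so Pig cannot cast them to int and the fields come back blank (null). Either strip the quotes before casting, or use CSVLoader or CSVExcelStorage from piggybank, which handle quoted fields. See my tests: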

Sample File:

"1","test"

Test 1 - Using CSVLoader:

CSVLoader and CSVExcelStorage both remove the enclosing quotes while parsing, so the int cast succeeds.

Pig commands:

register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray);

Output:

(1,test)
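
For the tab-delimited files in the question, the same idea can be sketched with CSVExcelStorage, which accepts a delimiter argument. This is an untested sketch under a few assumptions: the fields are tab-separated and double-quoted, the first line is a header row, and the options 'NO_MULTILINE', 'NOCHANGE' and 'SKIP_INPUT_HEADER' suit this data. The piggybank path and the $file2 parameter are taken from the question; the alias users is made up for illustration.

```pig
REGISTER /usr/local/pig/lib/piggybank.jar;

-- Assumption: tab-separated, double-quoted fields with a header line.
-- CSVExcelStorage removes the enclosing quotes before the AS-clause
-- casts are applied, so the numeric columns can be typed directly.
users = LOAD '$file2'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(
        '\t', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (author_id:long, reputation:int, gold:int, silver:int, bronze:int);

DUMP users;
```

Values that are not numeric once unquoted, such as the \N markers in forum_nodes.tsv, should still come out as null in an int or long column.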

Test 2 - Replacing double quotes:

Pig commands:

file1 = load 'file1.txt' using PigStorage(',');
data  = foreach file1 generate REPLACE($0,'\\"','') as (f1:int) ,$1 as (f2:chararray);

Output:

(1,"test")
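
Applied to forum_users.tsv from the question, the quote-stripping approach might look like the sketch below. It has not been run against the real file. It assumes every column is wrapped in double quotes and that the first line is a header; the aliases users_raw, users_typed and users are hypothetical, and it uses explicit casts rather than typing through the AS clause.

```pig
-- Load every column as chararray so the quoted text survives the load.
users_raw = LOAD '$file2' USING PigStorage('\t')
    AS (author_id:chararray, reputation:chararray, gold:chararray,
        silver:chararray, bronze:chararray);

-- Remove the double quotes, then cast each value explicitly.
users_typed = FOREACH users_raw GENERATE
    (long)REPLACE(author_id, '"', '')  AS author_id,
    (int)REPLACE(reputation, '"', '')  AS reputation,
    (int)REPLACE(gold, '"', '')        AS gold,
    (int)REPLACE(silver, '"', '')      AS silver,
    (int)REPLACE(bronze, '"', '')      AS bronze;

-- The header row ("user_ptr_id", ...) is not numeric, so its cast
-- yields null; filter it out before joining.
users = FILTER users_typed BY author_id IS NOT NULL;
```

The same pattern would extend to the long and int columns of forum_nodes.tsv, where \N entries would likewise become null after the cast.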

Test 3 - using data as it is:

Pig commands:

file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray);

Output:

(,"test")