
I'm trying to append one dataset to another in Apache Pig. I found several examples, but I think they address a different problem than mine.

Here is my pig script:

line1 = load 'line1/points' using Table();

line20 = load 'line20/points' using Table();

DESCRIBE line1;

DUMP line1;

DESCRIBE line20;

DUMP line20;

X = UNION line1, line20;

DESCRIBE X;

DUMP X;

I get this:

line1: {key: bytearray,y: (name: chararray,value: long),x: (name: chararray,value: long),columns: {(name: chararray,value: bytearray)}}

(ab48a8567d58cfea52905db0e94d88d3,(y,3),(x,3))

(ab48a8567d58cfea52905db0e94d88d3,(y,1),(x,1))

(ab48a8567d58cfea52905db0e94d88d3,(y,2),(x,2))

line20: {key: bytearray,y: (name: chararray,value: long),x: (name: chararray,value: long),columns: {(name: chararray,value: bytearray)}}

(203146881b7ef0d26902ea440e734b79,(y,20),(x,20))

(203146881b7ef0d26902ea440e734b79,(y,21),(x,21))

(203146881b7ef0d26902ea440e734b79,(y,22),(x,22))

X: {key: bytearray,y: (name: chararray,value: long),x: (name: chararray,value: long),columns: {(name: chararray,value: bytearray)}}

(203146881b7ef0d26902ea440e734b79,(y,21),(x,21))

(203146881b7ef0d26902ea440e734b79,(y,22),(x,22))

(203146881b7ef0d26902ea440e734b79,(y,20),(x,20))

(203146881b7ef0d26902ea440e734b79,(y,20),(x,20))

(203146881b7ef0d26902ea440e734b79,(y,21),(x,21))

(203146881b7ef0d26902ea440e734b79,(y,22),(x,22))

The result is just a double copy of the 'line20' dataset. Why?

I would like to have values from 'line1' and then values from 'line20'.

BTW: ... using Table(); is just my own implementation of CassandraStorage, in which I automatically provide the types for the columns.

Thanks for your help!

Solution

The Configuration object is shared. I had forgotten about that, and I was initializing both Table() instances with the same ID.

If you could reduce your sample data to just a few lines/columns, that would help. - Ruslan
Done. I hope it helps. - ahypki
Thanks. This really looks weird. I would try the same thing with text files as inputs, loading them with the normal PigStorage, just as a sanity check. If that works, I would conclude that the problem is in Table(). Are you sure that the two invocations don't overlap with each other? - Ruslan
Yes, UNION on two files loaded with ... PigStorage(','); works fine. I have just checked. No, these two Table() invocations do not overlap with each other. However, I will look for the problem in my Table() class. Thanks. - ahypki
I was wrong. Sorry. The two Table() instances were indeed overlapping. @Pradeep Gollakota pointed out that the Configuration object is shared. That was my mistake. Thank you for your help. - ahypki

1 Answer


I ran into a similar problem when working with Apache Accumulo. Pig was trying to do a map-side join on two Accumulo tables. However, because the Configuration object is reused, the API did not support reading from multiple tables simultaneously, so this could not be done. HBase does not have this problem because, even though the Configuration object is shared, the configurations for multiple tables are stored under different keys. I have not worked with Cassandra, so I can't be sure, but I would guess that the problem is in the Table() LoadFunc. Check that the LoadFunc isn't clobbering the configuration when it is invoked more than once.

A quick way to test it is to switch the order of the relations in the union. I'd be willing to bet that if you wrote UNION line20, line1; you would see two copies of line1.
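
The clobbering described above can be illustrated with a minimal, self-contained Java sketch. It is not the asker's Table() class and does not use the real Pig or Hadoop APIs: a plain Map stands in for the shared Configuration object, and the class names, the "table.name" key and the "id1"/"id2" identifiers are all hypothetical. In a real LoadFunc, the per-instance identifier would probably come from the signature Pig passes to setUDFContextSignature, but that wiring is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedConfigDemo {

    /** Hypothetical loader that always writes its table under one fixed key. */
    static class ClobberingLoader {
        private static final String KEY = "table.name";
        private final Map<String, String> conf;

        ClobberingLoader(Map<String, String> conf, String table) {
            this.conf = conf;
            // Every instance overwrites the same entry in the shared map.
            conf.put(KEY, table);
        }

        String tableToRead() {
            return conf.get(KEY);
        }
    }

    /** Hypothetical loader that namespaces its key with a per-instance id. */
    static class KeyedLoader {
        private final Map<String, String> conf;
        private final String key;

        KeyedLoader(Map<String, String> conf, String id, String table) {
            this.conf = conf;
            this.key = "table.name." + id;
            // Each instance writes to its own entry, so nothing is overwritten.
            conf.put(key, table);
        }

        String tableToRead() {
            return conf.get(key);
        }
    }

    public static void main(String[] args) {
        // A plain map standing in for the one Configuration shared by the job.
        Map<String, String> shared = new HashMap<>();

        ClobberingLoader a = new ClobberingLoader(shared, "line1/points");
        ClobberingLoader b = new ClobberingLoader(shared, "line20/points");
        // Both loaders now resolve to the table written last.
        System.out.println(a.tableToRead() + " | " + b.tableToRead());
        // prints: line20/points | line20/points

        KeyedLoader c = new KeyedLoader(shared, "id1", "line1/points");
        KeyedLoader d = new KeyedLoader(shared, "id2", "line20/points");
        // Distinct keys keep the two settings apart.
        System.out.println(c.tableToRead() + " | " + d.tableToRead());
        // prints: line1/points | line20/points
    }
}
```

With the fixed key, whichever loader is constructed last wins, which matches the symptom of the union showing two copies of one relation. Giving each instance its own key (or its own ID, as in the asker's fix) keeps the two settings separate.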