I'm trying to load data from a MongoDB BSON file into Pig using com.mongodb.hadoop.pig.BSONLoader (https://github.com/mongodb/mongo-hadoop/blob/master/pig/README.md), but I'm stuck. The data in MongoDB includes variable-size arrays, and I'm not sure how to load them into Pig (as a tuple?). Here's a sample record from MongoDB:

{"_id": {"$oid": "52fbbca6e4b029a79cd17ff7"},
 "field": "value",
 "variableSizeArray": [
    "value1",
    "value2",
    "valueN"
 ]
}

I've tried the following schemas, and neither of them works:

raw = LOAD 'file:///tmp/teststreams.bson' using com.mongodb.hadoop.pig.BSONLoader('','field:chararray,variableSizeArray:()');
raw = LOAD 'file:///tmp/teststreams.bson' using com.mongodb.hadoop.pig.BSONLoader('','field:chararray,variableSizeArray:{T:(h:chararray)}');

Thanks for any help on this.

1 Answer

Finally figured it out. The way to do this is to leave the data types out of the schema and list only the field names. This works:

raw = LOAD 'file:///tmp/teststreams.bson' using com.mongodb.hadoop.pig.BSONLoader('','field,variableSizeArray');
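
To see what the untyped load actually produced, it can help to inspect the relation before building on it. The following is only a sketch: the jar paths are placeholders rather than real locations, the set of jars to register is an assumption about a typical mongo-hadoop setup, and `first_rows` is an alias made up for illustration.

```pig
-- Placeholder jar paths (assumption): point these at the jars on your machine.
REGISTER /path/to/mongo-java-driver.jar;
REGISTER /path/to/mongo-hadoop-core.jar;
REGISTER /path/to/mongo-hadoop-pig.jar;

-- Same untyped load as above: field names only, no data types.
raw = LOAD 'file:///tmp/teststreams.bson'
      USING com.mongodb.hadoop.pig.BSONLoader('', 'field,variableSizeArray');

-- Print the schema Pig assigned to the relation.
DESCRIBE raw;

-- Dump a handful of records to see how the array field was mapped.
first_rows = LIMIT raw 5;
DUMP first_rows;
```

Looking at the dumped records should show how the loader represented `variableSizeArray`, which is worth knowing before writing any further transformations on that field.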