1 vote

We are working on a Proof of Concept with MongoDB and Amazon EMR. We have been able to get a simple end to end solution working where it can read data from one collection in mongo, perform map/reduce functions and then write the output to another collection in Mongo.

My question is - is it possible to read in additional collections from Mongo that would be used for lookup purposes. i.e. all data in collection1 would have the map/reduce functions performed on it but the map/reduce functions would use data from collection2 and collection3 for lookup purposes.

If this is not possible - then what is the best way to get the lookup data into hadoop so it can be used for lookup purposes?

1
Can you not just pull in multiple collection data in the map process of map/reduce? That would seem logical. – Neil Lunn

1 Answer

0 votes

It's possible to look up external resources within a Map-Reduce process. But ...

  • ... you cannot use the MongoDB Hadoop Connector for lookups, because it only provides the input for Map-Reduce jobs. Instead, use the Java MongoDB driver to query your collections, just as you would in any standard Java application.
  • ... it could become a performance problem if many mappers query the same collection in parallel. (This could be mitigated by adding more servers to the MongoDB cluster, but that may exceed your budget.)
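To reduce the parallel-query load described above, each mapper can cache lookup results so MongoDB is hit at most once per distinct key. A minimal sketch (the class name and the loader function are hypothetical; in a real job the loader would wrap a Java-driver query such as `collection.find(Filters.eq("_id", key)).first()`):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-mapper cache: the loader is called at most once
// per distinct key, so repeated lookups don't re-query MongoDB.
public class LookupCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loader; // e.g. wraps a MongoDB query

    public LookupCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached value, querying the backing store only on a miss.
    public String get(String key) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

An instance would typically be created in the mapper's `setup()` method and reused for every call to `map()`.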

An alternative could be to use Hadoop's caching mechanism (the distributed cache). To do this, export the lookup data into a file on the Hadoop cluster (hdfs://...). The file is read only once per job and copied to the slave nodes. Whether this is a good alternative for you depends on the size of the file and on how up-to-date the data needs to be.
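With that approach, the driver registers the exported file via `job.addCacheFile(...)` and each mapper parses its local copy once in `setup()`. A minimal sketch of the setup-time parsing, assuming the collection was exported as tab-separated key/value pairs (the file format and class name are assumptions):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LookupFile {
    // Parse a "key<TAB>value" export into an in-memory lookup table.
    // Lines without a tab separator are skipped.
    public static Map<String, String> load(BufferedReader reader) throws IOException {
        Map<String, String> table = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                table.put(parts[0], parts[1]);
            }
        }
        return table;
    }
}
```

The resulting map then serves all lookups during `map()` without any network round trips.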