I would like to perform an Apache Spark map-reduce on 5 files and output the results to MongoDB. I would prefer not to use HDFS, since the NameNode is a single point of failure (http://wiki.apache.org/hadoop/NameNode).
A. Is it possible to read multiple files into an RDD, perform a map-reduce on a key across all the files, and use the Casbah toolkit to output the results to MongoDB? (A sketch of what I mean is below, after this list.)
B. Is it possible to use the client to read from MongoDB into an RDD, perform a map-reduce, and write the output back to MongoDB using the Casbah toolkit?
C. Is it possible to read multiple files into an RDD, map them with keys that exist in MongoDB, reduce them to a single document, and insert that back into MongoDB?
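For A, something along these lines is roughly what I have in mind; the file paths, database, and collection names are placeholders, a word count stands in for the actual reduction on my key, and I have not verified this end to end:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.casbah.Imports._

object FilesToMongo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("files-to-mongo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile accepts a comma-separated list of paths, so no HDFS is needed
    val lines = sc.textFile("file1.txt,file2.txt,file3.txt,file4.txt,file5.txt")

    // classic map/reduce on a key (word count as a stand-in)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // open one Casbah client per partition: the client is created inside the
    // closure because it cannot be serialized on the driver and shipped out
    counts.foreachPartition { iter =>
      val client = MongoClient("localhost", 27017)
      val coll = client("mydb")("wordcounts")
      iter.foreach { case (word, count) =>
        coll.insert(MongoDBObject("word" -> word, "count" -> count))
      }
      client.close()
    }

    sc.stop()
  }
}
```

My understanding is that foreachPartition (rather than a plain foreach) keeps the number of Mongo connections down to one per partition, but I may be missing something.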
I know all of this is possible using the mongo-hadoop connector. I just don't like the idea of using HDFS, since the NameNode is a single point of failure and backup NameNodes are not implemented yet.
I've read some things online, but they are not clear. For example, this question: MongoDBObject not being added to inside of an rrd foreach loop casbah scala apache spark
I'm not sure what's going on there; the JSON does not even appear to be valid...
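If I had to guess, the problem in that question is that an RDD foreach runs on the executors, so a MongoDBObject builder created on the driver is only mutated on copies (or the task fails to serialize at all) and nothing shows up on the driver afterwards. This is only my guess, with made-up names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.casbah.Imports._

object DriverSideMutationGuess {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("driver-side-mutation").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2))

    // the pattern from the linked question, as I understand it: the closure is
    // serialized to the executors, so any += happens on executor-side copies
    val builder = MongoDBObject.newBuilder
    rdd.foreach { case (k, v) => builder += (k -> v) }
    println(builder.result()) // my expectation: nothing added here, if it gets this far

    sc.stop()
  }
}
```

If that guess is right, the insert itself would have to happen inside the closure, as in the sketch above.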
resources: