
I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.

How can I periodically sync data from MongoDB to the HDInsight service? I'm trying to avoid uploading all the data every time a new query is submitted, which can happen at any time; instead I'd like the data to be pre-warmed somehow.

Is that possible on HDInsight? Is it even possible with Hadoop? Thanks.

I should mention that 10gen maintains a MongoDB connector for Hadoop, but I can't install it in HDInsight (it's a hosted service). – alexandru.topliceanu

1 Answer


It is certainly possible to have that data pushed from Mongo into Hadoop.

Unfortunately, HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (which Mongo uses for replication) and then writes those changes out to HBase.
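
Since HDInsight doesn't expose HBase, the following is only illustrative of what a ZeroWing-style pipeline does: tail the capped `local.oplog.rs` collection with a tailable cursor and mirror each insert into an HBase row. It's a minimal sketch assuming the MongoDB Java driver and the HBase client are on the classpath; the connection strings, the `mongo_mirror` table and the `d` column family are placeholders, not anything from ZeroWing itself.

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.bson.BsonTimestamp;
import org.bson.Document;

public class OplogToHBase {
    public static void main(String[] args) throws Exception {
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("mongo_mirror"))) {

            // The oplog lives in the "local" database on a replica set member.
            MongoCollection<Document> oplog =
                    mongo.getDatabase("local").getCollection("oplog.rs");

            // Start from "now" and tail the capped collection for new operations.
            BsonTimestamp start = new BsonTimestamp((int) (System.currentTimeMillis() / 1000), 0);
            for (Document op : oplog.find(Filters.gt("ts", start))
                                    .cursorType(CursorType.TailableAwait)) {
                // Only mirror plain inserts in this sketch ("i" = insert).
                if (!"i".equals(op.getString("op"))) {
                    continue;
                }
                Document doc = op.get("o", Document.class);
                Put put = new Put(Bytes.toBytes(doc.get("_id").toString()));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("json"),
                              Bytes.toBytes(doc.toJson()));
                table.put(put);
            }
        }
    }
}
```

A real pipeline would also handle updates and deletes and persist the last timestamp it processed so it can resume; this only shows the tail-and-mirror idea.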

Another solution might be to write documents out from your Mongo database to Azure Blob storage. This means you wouldn't have to keep the cluster up all the time, but you would still be able to use it to run periodic map-reduce analytics against the files in the storage vault.
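
A rough sketch of such a periodic export, assuming the MongoDB Java driver and the Azure Storage client library: pull recently changed documents and upload them as newline-delimited JSON. The `lastModified` field, the database/collection names, the container name and the blob naming scheme are all assumptions for the example.

```java
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.Date;

public class MongoToBlobExport {
    public static void main(String[] args) throws Exception {
        Date since = new Date(System.currentTimeMillis() - 60 * 60 * 1000L); // last hour

        // Collect recently modified documents as newline-delimited JSON.
        StringBuilder batch = new StringBuilder();
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    mongo.getDatabase("mydb").getCollection("orders");
            // Assumes each document carries a "lastModified" date we can filter on.
            for (Document doc : orders.find(Filters.gt("lastModified", since))) {
                batch.append(doc.toJson()).append('\n');
            }
        }

        // Upload the batch as one blob; HDInsight jobs can then read it from the
        // storage account attached to the cluster.
        CloudStorageAccount account =
                CloudStorageAccount.parse(System.getenv("AZURE_STORAGE_CONNECTION_STRING"));
        CloudBlobClient blobClient = account.createCloudBlobClient();
        CloudBlobContainer container = blobClient.getContainerReference("mongo-export");
        container.createIfNotExists();
        CloudBlockBlob blob =
                container.getBlockBlobReference("orders/" + System.currentTimeMillis() + ".json");
        blob.uploadText(batch.toString());
    }
}
```

Run on a schedule (cron, Windows Task Scheduler, or an Azure job), this keeps the blob storage reasonably fresh without the cluster being involved in the sync at all.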

Your best bet is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.
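
Once the connector jars are available on the cluster, a job can read straight from MongoDB via the connector's `MongoInputFormat`. The sketch below is a minimal example counting documents per field value; the Mongo URI, the `status` field and the output path are assumptions, and the exact install steps are what the blog post covers.

```java
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.bson.BSONObject;

import java.io.IOException;

public class MongoStatusCount {

    // MongoInputFormat hands each document to the mapper as a BSONObject value.
    public static class StatusMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            Object status = value.get("status");
            if (status != null) {
                context.write(new Text(status.toString()), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the connector which collection to read as the job's input.
        MongoConfigUtil.setInputURI(conf, "mongodb://mongo-host:27017/mydb.orders");

        Job job = Job.getInstance(conf, "mongo status count");
        job.setJarByClass(MongoStatusCount.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setMapperClass(StatusMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // On HDInsight the default file system is backed by the cluster's blob storage account.
        FileOutputFormat.setOutputPath(job, new Path("/output/status-counts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```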