MapReduce using Hive on a Cassandra cluster

Question

Hi, I am using DataStax Enterprise for Hadoop and Cassandra integration. I have configured 3 Cassandra nodes and 2 analytics nodes (on which Hive will run).

I am confused about what happens to data that is present on the Cassandra nodes but not on the Hive (analytics) nodes. Will that data be skipped during MapReduce, or will MapReduce pull it from the Cassandra nodes and then process it? Please help.

So I have 4 machines (replication factor 3)

machine 1) cassandra node | token value = 0          | data owned = 25%
machine 2) cassandra node | token value = 2^127*.5   | data owned = 33%
machine 3) analytics node | token value = 2^127*.25  | data owned = 33%
machine 4) analytics node | token value = 2^127*.75  | data owned = 8%

Shouldn't they each own 25%? Also, I now think the data will be replicated to all nodes, not just 3 of them.
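To make the 25% expectation concrete, here is a minimal sketch of how primary-range ownership follows from the token values listed above. It assumes RandomPartitioner (token space 0 to 2^127) and that each node owns the range from the previous token on the ring up to its own token. The function name `primary_ownership` and the `tokens` dictionary are just for illustration. It only covers primary ranges, so it does not account for replication or per-datacenter placement, and it does not explain the 25/33/33/8 figures reported above.

```python
from fractions import Fraction

# Assumption: RandomPartitioner, whose tokens live on a ring of size 2**127.
RING_SIZE = 2**127


def primary_ownership(tokens):
    """Return {node: fraction of the ring} covered by each node's primary range.

    A node's primary range runs from the previous token on the ring
    (exclusive) up to its own token (inclusive), wrapping around at the end.
    """
    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    ownership = {}
    for i, (name, token) in enumerate(ordered):
        # Index -1 wraps to the highest token, which precedes the lowest one.
        prev_token = ordered[i - 1][1]
        # A single node owns the whole ring, hence the "or RING_SIZE".
        span = (token - prev_token) % RING_SIZE or RING_SIZE
        ownership[name] = Fraction(span, RING_SIZE)
    return ownership


# Token values copied from the list above.
tokens = {
    "machine 1": 0,
    "machine 2": 2**127 // 2,
    "machine 3": 2**127 // 4,
    "machine 4": 3 * 2**127 // 4,
}

for name, share in sorted(primary_ownership(tokens).items()):
    print(f"{name}: {float(share):.0%}")
```

With these evenly spaced tokens, every machine comes out at 25% of the ring, which matches the expectation in the question as far as primary ranges go.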

jbellis · Accepted Answer · 2013-02-24T05:13:28

DSE will make sure a full copy of your dataset is replicated to whichever set of nodes you designate as analytics. So it's generally a non-issue. If enough analytics nodes fail, it may have to go to a non-analytics node to fetch the data ... but you'd be better advised to bring the analytics nodes back online.
