1
votes

I am working on BI process that will read data from cassandra, create summaries using Map Reduce and write back to a different keyspace.

Starting with a single node, everything worked as i expected, but when moving to a multi-node, i am not sure I fully understand the topology and configuration.

I have a setup with 3 nodes. Each has a Cassandra node (version 1.1.9), data node and task tracker (version 0.20.2+923.421- CDH3U5) . The NameNode and job tracker are on a different server. At this point i am trying to run Pig script from the DataNode server.

The thing i am not sure of is the pig argument PIG_INITIAL_ADDRESS. I assumed the query would run on all Cassandra nodes, each task tracker would only query the local Cassandra node, and the reducer would handle any duplicates. Based on that assumption i thought the PIG_INITIAL_ADDRESS should be localhost. But when running the pig script it fails:

java.io.IOException: Unable to connect to server localhost:9160

My questions are- should the initial address be any one of the Cassandra nodes, and Splitting the map on the cluster is done from Cassandra keys partitions (will i get the distribution i need)? IF I where to use java map reduce, will i still need to supply the initial address? Is the current implementation assumes pig is running from a Cassandra node?

1

1 Answers

1
votes

The PIG_INITIAL_ADDRESS is the address of one of the Cassandra nodes in your ring. In order to have the Hadoop job read data from or write data to Cassandra, it just needs to have some properties set. Those properties are also available to set in the job properties or in the default Hadoop configuration on the server that you're running the job from. Other than that, it's just like submitting a job to a job tracker.

For more information, I would look at the readme that's in the cassandra source download under examples/pig. There is a lot of explanation in there as well.