I am trying to store data into Accumulo using PySpark (Python + Spark). Right now I am writing to Accumulo with the pyaccumulo library, passing the pyaccumulo egg file to the SparkContext via the pyFiles argument, and I was wondering if there is a better way to do this.

I have seen the examples for the Cassandra and HBase output formats and was wondering whether something similar could be done for Accumulo. Cassandra and HBase seem to use the saveAsNewAPIHadoopDataset(conf, keyConv, valueConv) function, passing a config dict, a key converter and a value converter (I've pasted the rough shape of the HBase example below). Does anyone know what the corresponding values to pass to saveAsNewAPIHadoopDataset() for Accumulo would be?
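For reference, the HBase pattern looks roughly like this (paraphrased from memory of Spark's hbase_outputformat.py example; the ZooKeeper host and table name here are placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="HBaseOutputFormatExample")

# Hadoop job configuration passed straight through to the output format.
conf = {
    "hbase.zookeeper.quorum": "zk-host",       # placeholder
    "hbase.mapred.outputtable": "test_table",  # placeholder
    "mapreduce.outputformat.class":
        "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class":
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}

# Converters (shipped with the Spark examples) that turn Python objects
# into the Writable types named above.
keyConv = ("org.apache.spark.examples.pythonconverters."
           "StringToImmutableBytesWritableConverter")
valueConv = ("org.apache.spark.examples.pythonconverters."
             "StringListToPutConverter")

# Each record is (row_key, [row_key, column_family, qualifier, value]).
sc.parallelize([("row1", ["row1", "f1", "q1", "value1"])]) \
  .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv,
                             valueConverter=valueConv)
```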
1 Answer
Taking a guess, as I have no idea how it's supposed to work, you'd need something like the following (there's a rough sketch of the call after this list):
- AccumuloOutputFormat.ConnectorInfo.principal
- AccumuloOutputFormat.ConnectorInfo.token
- AccumuloOutputFormat.InstanceOpts.zooKeepers
- AccumuloOutputFormat.InstanceOpts.name
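An untested sketch of how that might translate to PySpark, following the same pattern as the HBase example in the question. The AccumuloOutputFormat.* property keys are the guesses from the list above, the output format class name is from Accumulo 1.7, and the two converter classes are pure placeholders: Spark doesn't ship converters that produce Text/Mutation objects, so you'd have to write your own in Scala or Java.

```python
from pyspark import SparkContext

sc = SparkContext(appName="AccumuloOutputFormatSketch")

# Untested guesses: the AccumuloOutputFormat.* keys mirror the list above
# and may not be the exact strings the output format reads from the job conf.
conf = {
    "mapreduce.outputformat.class":
        "org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.io.Text",
    "mapreduce.job.output.value.class":
        "org.apache.accumulo.core.data.Mutation",
    "AccumuloOutputFormat.ConnectorInfo.principal": "user",        # placeholder
    "AccumuloOutputFormat.ConnectorInfo.token": "password-token",  # placeholder
    "AccumuloOutputFormat.InstanceOpts.zooKeepers": "zk1:2181",    # placeholder
    "AccumuloOutputFormat.InstanceOpts.name": "my-instance",       # placeholder
}

# Hypothetical converter classes -- these do NOT exist in Spark; you'd need
# to implement them to map Python pairs onto Text keys and Mutation values.
sc.parallelize([("row1", ("cf", "cq", "value1"))]) \
  .saveAsNewAPIHadoopDataset(conf=conf,
                             keyConverter="com.example.TextConverter",
                             valueConverter="com.example.MutationConverter")
```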
To get a full list of properties, I'd run a normal MapReduce example (http://accumulo.apache.org/1.7/examples/mapred.html) and look at the configuration values.