0
votes

I am trying to store data into Accumulo using PySpark (Python + Spark). Right now I am using the pyaccumulo library to write data into Accumulo by passing the pyaccumulo egg file to the SparkContext via the pyFiles argument. I was wondering if there is a better way to do this. I have seen the examples for the Cassandra and HBase output formats and was wondering if something similar could be done for Accumulo. Cassandra and HBase seem to use the saveAsNewAPIHadoopDataset(conf, keyConv, valueConv) function, passing a config dict, a key converter and a value converter. Does anyone know what the corresponding values to pass to saveAsNewAPIHadoopDataset() would be for Accumulo?
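For reference, this is roughly the pattern the bundled HBase example (examples/src/main/python/hbase_outputformat.py in the Spark distribution) uses; the hostnames and table name below are placeholders, and an Accumulo equivalent would need analogous values:

```python
# Pattern from Spark's HBase output example, shown for comparison.
# "zk-host" and "test_table" are placeholder values.
conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapred.outputtable": "test_table",
    "mapreduce.outputformat.class":
        "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class":
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class":
        "org.apache.hadoop.io.Writable",
}
# Converter classes that ship with the Spark examples jar:
keyConv = ("org.apache.spark.examples.pythonconverters."
           "StringToImmutableBytesWritableConverter")
valueConv = ("org.apache.spark.examples.pythonconverters."
             "StringListToPutConverter")

# On a live cluster, called on an RDD of (rowkey, [row, cf, qualifier, value]):
# rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConv=keyConv, valueConv=valueConv)
```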

1
What are the values passed to saveAsNewAPIHadoopDataset supposed to be? Configuration for an OutputFormat? - elserj
Yes. A configuration dict containing certain configuration parameters, a key converter and a value converter. - thisisshantzz

1 Answer

0
votes

Taking a guess, as I have no idea how it's supposed to work, you'd need something like

  • AccumuloOutputFormat.ConnectorInfo.principal
  • AccumuloOutputFormat.ConnectorInfo.token
  • AccumuloOutputFormat.InstanceOpts.zooKeepers
  • AccumuloOutputFormat.InstanceOpts.name

To get a full list of properties, I'd run a normal MapReduce example (http://accumulo.apache.org/1.7/examples/mapred.html) and look at the configuration values.
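Putting that guess into the same shape as the Cassandra/HBase examples, a conf dict might look like the sketch below. The property keys mirror the guessed list above, the OutputFormat class name comes from Accumulo's MapReduce API, and everything else (credentials, hosts, converter classes) is a placeholder; verify the actual keys against the configuration of a working MapReduce job:

```python
# Hypothetical configuration for saveAsNewAPIHadoopDataset with Accumulo.
# Property names mirror the guessed list above; values are placeholders.
conf = {
    "mapreduce.job.outputformat.class":
        "org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat",
    "AccumuloOutputFormat.ConnectorInfo.principal": "root",          # Accumulo user
    "AccumuloOutputFormat.ConnectorInfo.token": "serialized-token",  # auth token
    "AccumuloOutputFormat.InstanceOpts.zooKeepers": "zk-host:2181",
    "AccumuloOutputFormat.InstanceOpts.name": "accumulo-instance",
}

# A key/value converter pair would also be needed to turn Python objects
# into Text/Mutation; Spark does not ship converters for Accumulo, so
# these class names are purely hypothetical and would have to be written:
keyConv = "com.example.converters.StringToTextConverter"        # hypothetical
valueConv = "com.example.converters.StringToMutationConverter"  # hypothetical

# rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConv=keyConv, valueConv=valueConv)
```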