What cassandra client to use for haoop integration?

Question

I am trying to build a data services layer using cassandra as the backend store. I am new to Cassandra and not sure what client to use for cassandra - thrift or cql 3? We have a lot of mapreduce jobs using Amazon elastic mapreduce (EMR) that will be reading/ writing the data from cassandra at high volume. The total data volume will be > 100 TB with billions of rows in Cassandra. The mapreduce jobs may be read or write heavy with high qps (>1000 qps). The requirements are as follows:

Simplicity of client code. It seems thrift has in-built integration with Hadoop for bulk data loading using sstableloader (http://www.datastax.com/dev/blog/bulk-loading).
Ability to define new columns at run time. We may need to add more columns depending on application requirements. It seems cql3 does not allow definition of columns dynamically at runtime.
Performance of bulk read/ write. Not sure which client is better. However, I found this post that claims thrift client has better performance for high data volume: http://jira.pentaho.com/browse/PDI-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

I could not find any authoritative source of information that answers this question. Appreciate if you could help with this since I am sure this is a common problem for most folks and would benefit the overall community.

Many thanks in advance.

-Prateek

first of all forget thrift, its the base API of cassandra, try some wrapper API's, simple to code. (native CQL Driver, Astyanax, Hector, Pelops). All are java based — abhi

Lyuben Todorov Lyuben Todorov · Accepted Answer · 2013-05-02T09:22:00

Hadoop and Cassandra are both written in Java so definitely pick a java based driver. As far as simplicity of code goes I'd go for Astyanax, their wiki page is really good and documentation is solid all round. And yes atyanax does allow you to define columns at runtime as you please but be aware that thrift based APIs are being superseded by cql apis.

If however you want to go down the pure cql3 route, datastax's driver is what I'd advise you to use. It allows for asynchronous connections and is continuously updated (view the logs). The code is also very clean although documentation isn't quite there yet, but there are tests in the source that you can look at.

But to be honest, there are so many questions about the APIs that you should read though them and form an opinion for yourself:

Also for performance here some benchmarks (they are however outdated!) showing that cql is catching up (and somewhat overtaking when it comes to prepared statements) thrift:

What cassandra client to use for haoop integration?

1 Answers