49
votes

I have recently started working with Cassandra Database. Now I am in the process of evaluating which Cassandra client we should go forward with.

I have seen various posts on Stack Overflow about which client to use for Cassandra, but none has a very definitive answer.

My team has asked me to do some research on this and come up with pros and cons for each of the Cassandra client APIs in Java.

As I mentioned, I only recently got involved with Cassandra, so I don't have much idea why some people choose the Pelops client, why others go with Astyanax, and why others pick some other client.

I know the basics of each of the Cassandra clients, by which I mean I am able to get them working and start reading from and writing to a Cassandra database.

Below is the information I have so far.

CASSANDRA APIS

  • Hector (Production-Ready)
    The most stable of the Java APIs, ready for prime-time.

  • Astyanax (The Up and Comer)
    A clean Java API from Netflix. It isn't as widely used as Hector, but it is solid.

  • Kundera (The NoSQL ORM)
    JPA compliant, this is handy when you want to interact with Cassandra via objects.
    This constrains you somewhat in that you won't be able to have a dynamic number of columns/names, etc. But it does allow you to port over ORMs, or centralize storage onto Cassandra for more traditional uses.

  • Pelops
    I've only used Pelops briefly. It was a straightforward API, but didn't seem to have the momentum behind it.

  • PlayORM (ORM without the constraints?)
    I just heard about this. It looks like it is trying to solve the impedance mismatch between traditional JPA-based ORMs and NoSQL by introducing JQL. It looks promising.

  • Thrift (Avoid Me!)
    This is the "low-level" API.

Below are our priorities in choosing a Cassandra client:

  • First priorities are: low latency overhead, an async API, and reliability/stability for a production environment.
    (A more user-friendly API can be provided in the DAL that wraps the client.)
  • Connection pooling and partition awareness are other good features to have.
  • The ability to detect new nodes that have been added.
  • Good support as well (as pointed out by Dean below).

Can anyone provide some thoughts on this? Any pros and cons for each Cassandra client, and which client can fulfill my requirements, would be of great help as well.

Based on my research so far, I believe I will mainly be choosing between the Astyanax client and the new Datastax client that uses the binary protocol. But I don't have solid information to back up my research and present it to my team.

Any comparison between the Astyanax client and the new Datastax client (which uses the new binary protocol) would be of great help.

It will help me in my research, and I will learn a lot from people who have used different clients in the past.

5
You can also add cassandra-jdbc to your list: code.google.com/a/apache-extras.org/p/cassandra-jdbc
- phatfingers
Good point, phatfingers. Got to know one more thing. Cool.
- arsenal
I chose Astyanax at some point and I can say for sure it's easy to use and very stable. A few pointers: the Datastax driver is beta for now; Astyanax on native protocol
- Ivan Velykorodnyy

5 Answers

23
votes

Thrift is becoming more of a legacy API:

First, you should be aware that the Thrift API is not going to be getting new features; it's there for backwards compatibility, and not recommended for new projects.
- the paul

So I'd avoid Thrift-based APIs (Thrift is only kept for backwards compatibility).

That said, if you do need to use a Thrift-based API, I'd go for Astyanax. Astyanax is very easy to use compared to other Thrift APIs, but in my personal experience Datastax's driver is even easier.

So you should have a look at Datastax's API (and GitHub repo). I'm not sure if there are any compiled versions of the API for download, but you can easily build it with Maven. Also, if you take a look at the GitHub repo's commit logs, you'll see it undergoes very frequent updates.

The driver works exclusively with CQL3 and is asynchronous, but be warned that Cassandra 1.2 is the earliest supported version.

Performance
Astyanax is Thrift-based and Datastax's driver uses the binary protocol. Here are the latest benchmarks I could find between Thrift and CQL (note these are definitely out of date). But in fairness, the small difference in performance shown in these benchmarks will rarely matter.

Asynch support
Datastax's asynch support is a definite advantage over Astyanax (Netflix tried implementing it but decided not to).
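
To illustrate why an async API matters, here is a minimal sketch using only the JDK. The class AsyncClientSketch and its methods are hypothetical stand-ins, not the Datastax driver's API. It only shows the general pattern: fire several queries without blocking, then wait once for all of the results.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical stand-in for an asynchronous client; NOT the Datastax API.
class AsyncClientSketch {
    private final ExecutorService io = Executors.newFixedThreadPool(4);

    // Returns immediately; the future completes when the simulated query finishes.
    CompletableFuture<String> executeAsync(String cql) {
        return CompletableFuture.supplyAsync(() -> "result of: " + cql, io);
    }

    // Submits every query first, then waits for the results in submission order.
    List<String> executeAll(List<String> queries) {
        List<CompletableFuture<String>> pending = queries.stream()
                .map(this::executeAsync)
                .collect(Collectors.toList());
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    // Releases the worker threads so the JVM can exit.
    void shutdown() {
        io.shutdown();
    }
}
```

With a synchronous client the total time is roughly the sum of the query latencies; with this pattern it is closer to the slowest single query, as long as the pool has enough threads.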

Documentation
I can't really argue against Netflix's wiki. The documentation is excellent and it's updated fairly frequently. Their wiki includes code examples, and you can find tests in the source code if you need to see the code at work. I struggled to find any documentation for the Datastax driver; however, tests are provided in the GitHub repository, so that is a starting point.

Also have a look at this answer (well, not my one anyway). It looks into some advantages and disadvantages of Thrift and CQL.

8
votes

I would recommend the Datastax Java driver for Cassandra: http://www.datastax.com

For JPA-like support, try my mapping tool: http://valchkou.com/cassandra-driver-mapping.html

Annotation-driven. No mapping files, no scripts, no configuration files. No need for DDL scripts. The schema is automatically synchronized with the entity definition.

Usage sample:

   Entity entity = new Entity();
   mappingSession.save(entity);
   entity = mappingSession.get(Entity.class, id);
   mappingSession.delete(entity); 

Available on Maven Central:

   <dependency>
      <groupId>com.valchkou.datastax</groupId>
      <artifactId>cassandra-driver-mapping</artifactId>
   </dependency>

3
votes

I would also add decent support to the list. We post answers about playORM all the time on Stack Overflow ;). It is also about to start supporting MongoDB (work is nearly finished), so any client can run on MongoDB or Cassandra. It has its own query language, so this port works just fine. You always have access to the raw Astyanax interface too when you really need the speed.

Also, regarding your note on async: Thrift previously did not support async, so no clients did either, as they were built on the generated Thrift code. Since that has changed, I don't know of a client that has added the async stuff in.

I know HBase has an async client though. Anyway, just thought I would add my 2 cents in case it helps a little.

EDIT: I was recently in the cassandra-thrift generated source code, and it is not a very good API for async development: there is a send method and a recv() method, but you don't know when to call the recv method. Aaron Morton on the Cassandra user list has a blog post on how you can really do it, but it is not clean at all. You have to grab the selector from deep down in Thrift and do some stuff so you know when to call the recv method. Pretty nasty stuff.

later, Dean

2
votes

I've used Hector, Astyanax and Thrift directly. I've also used the Python client PyCassa.

The features that I found important and differentiating were:

  • Ease of use of the API
  • Composite column support
  • Connection pooling
  • Latency
  • Documentation

One of the major issues is getting the types correct. You want to be able to pass in longs, Strings, byte[], etc. Both Hector and Astyanax solve this by using Serializer objects. In Astyanax you specify them higher up the chain, so you have to specify them less often. In Hector the syntax is often very clunky and hard to adapt if you change your schema.
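
To make the serializer idea concrete, here is a rough sketch using only the JDK. The names ValueSerializer, LongValueSerializer, StringValueSerializer and TypedColumnFamily are made up for illustration and are not the actual Hector or Astyanax classes. The point is that binding the serializers once, on an object that represents the column family, means individual reads and writes don't have to repeat them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical interface modelled on the serializer idea; NOT a Hector/Astyanax type.
interface ValueSerializer<T> {
    ByteBuffer toBytes(T value);
    T fromBytes(ByteBuffer bytes);
}

class LongValueSerializer implements ValueSerializer<Long> {
    @Override
    public ByteBuffer toBytes(Long value) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(value);
        buf.flip(); // make the 8 written bytes readable
        return buf;
    }

    @Override
    public Long fromBytes(ByteBuffer bytes) {
        return bytes.duplicate().getLong(); // duplicate() leaves the caller's position untouched
    }
}

class StringValueSerializer implements ValueSerializer<String> {
    @Override
    public ByteBuffer toBytes(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String fromBytes(ByteBuffer bytes) {
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }
}

// Serializers are declared once here, "higher up the chain", instead of per call.
class TypedColumnFamily<K, V> {
    private final String name;
    private final ValueSerializer<K> keySerializer;
    private final ValueSerializer<V> valueSerializer;

    TypedColumnFamily(String name, ValueSerializer<K> keySerializer, ValueSerializer<V> valueSerializer) {
        this.name = name;
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
    }

    String name() { return name; }
    ByteBuffer encodeKey(K key) { return keySerializer.toBytes(key); }
    ByteBuffer encodeValue(V value) { return valueSerializer.toBytes(value); }
    V decodeValue(ByteBuffer raw) { return valueSerializer.fromBytes(raw); }
}
```

If the schema changes (say the key becomes a String), only the one TypedColumnFamily declaration has to change, and the compiler flags every call site that still passes a long.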

Since Python has dynamic types, it is much easier to deal with this in PyCassa. Since it's not an option for you I won't say much about it, but I found it easiest to use (by far) but also quite slow.

Composite column support is very confusing in Hector. Astyanax has annotations to greatly simplify this.

As far as I know, the connection pooling is the same for Hector and Astyanax. Both will avoid downed hosts and discover new ones added to the ring. Both of these features are crucial to reliability and maintainability. Pelops appears to have these features but I've never tried it.
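
As a rough idea of what "avoid downed hosts and discover new ones" means for host selection, here is a small JDK-only sketch. HostPoolSketch is a hypothetical class, not how Hector or Astyanax implement their pools; real clients also manage the actual connections, retries and ring discovery.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin host selector; NOT the Hector or Astyanax implementation.
class HostPoolSketch {
    private final CopyOnWriteArrayList<String> hosts = new CopyOnWriteArrayList<>();
    private final Set<String> down = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    // Called at startup and whenever discovery reports a node added to the ring.
    void addHost(String host) {
        hosts.addIfAbsent(host);
    }

    // Called when a request to the host fails or a health check times out.
    void markDown(String host) {
        down.add(host);
    }

    // Called when the host answers again.
    void markUp(String host) {
        down.remove(host);
    }

    // Round-robin over known hosts, skipping any that are marked down.
    String nextHost() {
        int size = hosts.size();
        for (int i = 0; i < size; i++) {
            String candidate = hosts.get(Math.floorMod(next.getAndIncrement(), size));
            if (!down.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no live hosts");
    }
}
```

The application only ever calls nextHost(), so a node failing or a new node joining doesn't require a restart or a config change on the client side.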

A key difference between Astyanax and Hector is the latency optimizations. Astyanax has the ability to route read and write requests to a replica node, potentially avoiding an extra networking hop. This can reduce the latency by a few milliseconds.
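
The routing idea can be sketched like this, using only the JDK. TokenAwareRouterSketch is a hypothetical class, and the CRC32 token function is a toy stand-in (Cassandra's partitioners use different hashes); it is not Astyanax code. The client keeps a copy of the token ring and sends each request straight to the node that owns the row key's token, instead of to an arbitrary node that would then have to forward it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical token-aware router; NOT the Astyanax implementation.
class TokenAwareRouterSketch {
    // Each node owns the token range that ends at its own token.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String host) {
        ring.put(token, host);
    }

    // Toy token function for illustration only; real partitioners hash differently.
    static long tokenFor(String rowKey) {
        CRC32 crc = new CRC32();
        crc.update(rowKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // The owner is the first node whose token is >= the key's token, wrapping around the ring.
    String ownerOfToken(long token) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring is empty");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Host the client should contact directly for this row key.
    String hostFor(String rowKey) {
        return ownerOfToken(tokenFor(rowKey));
    }
}
```

Without this, the request goes to whichever node the pool hands out, and that node acts as coordinator and forwards to a replica. That forwarding is the extra hop being saved.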

At last look, Astyanax had poor documentation, but it seems much improved now.

The only advantage of Hector I can see today is that it is much more widely used so probably less buggy. But Astyanax has a better feature set.

1
votes

I have a similar recommendation to Valchkou's. The DataStax Java CQL driver is very good. I tried Astyanax, Kundera and buffalosw's playORM. Astyanax is very low-level and somewhat complex. Kundera and playORM are generic ORMs for NoSQL databases, and are complex to set up and get started with.

The Datastax API is much like a JDBC driver: you have to embed CQL statements in your DAO and write several lines of code to load and save your entities. To solve this problem, I wrote a Java object mapper called cassandra-jom, built around the Datastax CQL driver. Cassandra-jom annotations are very similar to JPA/Hibernate annotations, and it can even create/update your column family schema from your object model. It is easy to use and reliable, and it is used in my other live web applications. Check it out at its GitHub page.

https://github.com/w3cloud/cassandra-jom
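
For readers wondering how an annotation-driven mapper can create the schema from the object model, here is a minimal JDK-only sketch of the general technique. The annotations @SketchTable and @SketchId and the class SchemaSketch are invented for this example; they are not the cassandra-jom annotations (see the GitHub page for the real ones), and the type mapping covers only a few types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Hypothetical annotations for this sketch; NOT the cassandra-jom or JPA ones.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SketchTable { String name(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchId { }

@SketchTable(name = "users")
class UserEntity {
    @SketchId UUID id;
    String name;
    long visits;
}

class SchemaSketch {
    // Maps a small, illustrative subset of Java types to CQL3 types.
    static String cqlType(Class<?> type) {
        if (type == String.class) return "text";
        if (type == long.class || type == Long.class) return "bigint";
        if (type == int.class || type == Integer.class) return "int";
        if (type == UUID.class) return "uuid";
        throw new IllegalArgumentException("unmapped type: " + type);
    }

    // Builds a CREATE TABLE statement from the entity's annotations and fields.
    static String createTableCql(Class<?> entity) {
        SketchTable table = entity.getAnnotation(SketchTable.class);
        if (table == null) {
            throw new IllegalArgumentException("missing @SketchTable on " + entity.getName());
        }
        List<Field> fields = new ArrayList<>();
        for (Field f : entity.getDeclaredFields()) {
            if (!f.isSynthetic() && !Modifier.isStatic(f.getModifiers())) fields.add(f);
        }
        // Reflection does not guarantee field order, so sort for a stable result.
        fields.sort(Comparator.comparing(Field::getName));

        List<String> columns = new ArrayList<>();
        String primaryKey = null;
        for (Field f : fields) {
            columns.add(f.getName() + " " + cqlType(f.getType()));
            if (f.isAnnotationPresent(SketchId.class)) primaryKey = f.getName();
        }
        if (primaryKey == null) {
            throw new IllegalArgumentException("no @SketchId field on " + entity.getName());
        }
        return "CREATE TABLE " + table.name() + " (" + String.join(", ", columns)
                + ", PRIMARY KEY (" + primaryKey + "))";
    }
}
```

Calling SchemaSketch.createTableCql(UserEntity.class) yields a CQL3 statement that a mapper could run through the driver before the first save. A real mapper would also compare against the existing schema and issue ALTER statements rather than only CREATE.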