13
votes

I'm trying to understand the connection pooling in Datastax Cassandra Driver, so I can better use it in my web service.

I have version 1.0 of the documentation. It says:

The Java driver uses connections asynchronously, so multiple requests can be submitted on the same connection at the same time.

What do they mean by a connection? When connecting to a cluster, we have a Builder, a Cluster and a Session. Which one of them is the connection?

For example, there is this parameter:

maxSimultaneousRequestsPerConnection - number of simultaneous requests on all connections to a host after which more connections are created.

So, these connections are automatically created, in the case of connection pooling (which is what I would expect). But what exactly are the connections? Cluster objects? Sessions?

I'm trying to decide what to keep 'static' in my web service. For the moment, I decided to keep the Builder static, so for every call I create a new Cluster and a new Session. Is this ok? If the Cluster is the Connection, then it should be ok. But is it? Now, the logger says, for every call:

2013:12:06 12:05:50 DEBUG Cluster:742 - Starting new cluster with contact points

2013:12:06 12:05:50 DEBUG ControlConnection:216 - [Control connection] Refreshing node list and token map

2013:12:06 12:05:50 DEBUG ControlConnection:219 - [Control connection] Refreshing schema

2013:12:06 12:05:50 DEBUG ControlConnection:147 - [Control connection] Successfully connected to...

So, it connects to the Cluster every time? That's not what I want; I want to reuse connections.

So, the connection is actually the Session? If this is the case, I should keep the Cluster static, not the Builder.

What method should I call, to be sure I reuse connections, whenever possible?


3 Answers

8
votes

You are right, the connection is actually in the Session, and the Session is the object you should give to your DAOs to write into Cassandra.

As long as you use the same Session object, you should be reusing connections (you can see the Session as being your connection pool).
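To make "use the same Session object" concrete in a web service, here is a minimal sketch of a thread-safe lazy holder that creates one instance and hands that same instance to every caller. `SharedHolder` is a hypothetical helper, not part of the driver, and it is generic so that the example does not depend on the driver being on the classpath.

```java
import java.util.function.Supplier;

/** Lazily creates a single instance and returns that same instance to every caller. */
final class SharedHolder<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // Runs at most once, even when many request threads race here.
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

In a service, the factory would presumably be something like `() -> cluster.connect("my_keyspace")`, with the Cluster itself built only once (`my_keyspace` is a placeholder). Each request handler then calls `get()` instead of building a new Cluster and Session per call.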

Edit (2017/4/10): I made this answer more precise following @William Price's answer. Please be aware that this answer is 4 years old, and Cassandra has changed a fair bit in the meantime!

15
votes

The accepted answer (at the time of this writing) is giving the correct advice:

As long as you use the same Session object, you [will] be reusing connections.

However, some parts were originally oversimplified. I hope the following provides insight into the scope of each object type and their respective purposes.

Builder ≠ Cluster ≠ Session ≠ Connection ≠ Statement

A Cluster.Builder is used to configure and create a Cluster

A Cluster represents the entire Cassandra ring

A ring consists of multiple nodes (hosts), and the ring can support one or more keyspaces. You can query a Cluster object about cluster- (ring)-level properties.

I also think of it as the object that represents the calling application to the ring. You communicated your application's needs (e.g. encryption, compression, etc.) to the builder, but it is this object that first implements/communicates with the actual C* ring. If your application uses more than one authentication credential for different users/purposes, you likely have different Cluster objects even if they connect to the same ring.

A Session itself is not a connection, but it manages them

A session may need to talk to all nodes in the ring, which cannot be done with a single TCP connection except in the special case of rings that contain exactly one (1) node. The Session manages a connection pool, and that pool will generally have at least one connection for each node in the ring. This is why you should re-use Session objects as much as possible. An application does not directly manage or access connections.
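As a rough illustration of what "manages a connection pool" can mean for the `maxSimultaneousRequestsPerConnection` setting quoted in the question, here is a toy model of a per-host pool. It is not the driver's implementation; `ToyHostPool` and its methods are hypothetical, and the growth rule is a simplified reading of the documented description: once every connection to a host is at the threshold, open another one, up to a maximum.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy per-host pool: tracks in-flight request counts per connection. */
final class ToyHostPool {
    private final int threshold;       // stand-in for maxSimultaneousRequestsPerConnection
    private final int maxConnections;  // upper bound on connections to this host
    private final List<Integer> inFlight = new ArrayList<>();

    ToyHostPool(int threshold, int maxConnections) {
        this.threshold = threshold;
        this.maxConnections = maxConnections;
        inFlight.add(0); // start with a single connection
    }

    /** Returns the index of the connection that will carry the next request. */
    synchronized int borrow() {
        // Find the least busy connection.
        int best = 0;
        for (int i = 1; i < inFlight.size(); i++) {
            if (inFlight.get(i) < inFlight.get(best)) {
                best = i;
            }
        }
        // If even the least busy one is at the threshold, grow the pool if allowed.
        if (inFlight.get(best) >= threshold && inFlight.size() < maxConnections) {
            inFlight.add(0);
            best = inFlight.size() - 1;
        }
        inFlight.set(best, inFlight.get(best) + 1);
        return best;
    }

    /** Marks one request on the given connection as finished. */
    synchronized void release(int index) {
        inFlight.set(index, inFlight.get(index) - 1);
    }

    synchronized int connectionCount() {
        return inFlight.size();
    }
}
```

The point of the model is that the application only ever asks to send a request; deciding which connection carries it, and when to open another, stays inside the pool.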

A Session is accessed from the Cluster object; it is usually "bound" to a single keyspace at a time, which becomes the default keyspace for the statements executed from that session. A statement can use a fully-qualified table name (e.g. keyspacename.tablename) to access tables in other keyspaces, so it's not required to use multiple sessions to access data across keyspaces. Using multiple sessions to talk to the same ring increases the total number of TCP connections required.

A Statement executes within a Session

Statements can be prepared or not, and each one either mutates data or queries it (and in some cases, both). The fastest, most efficient statements need to communicate with at most one node, and a Session from a topology-aware Cluster should contact only that node (or one of its peers) on a single TCP connection. The least efficient statements must touch all replicas (a majority of nodes), but that will be handled by the coordinator node on the ring itself, so even for these statements the Session will only use a single connection from the application.

Also, versions 2 and 3 of the Cassandra binary protocol used by the driver use multiplexing on the connections. So while a single statement requires at least one TCP connection, that single connection can potentially service up to 128 or 32k+ asynchronous requests simultaneously, depending on the protocol version (respectively).
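The multiplexing described above can be pictured as each connection owning a fixed set of stream IDs, with every in-flight request holding one until its response arrives. The following is only a sketch of that idea, not the driver's code; `MultiplexedConnection` is a hypothetical name, and the limit of 128 matches the protocol version 2 figure given above.

```java
import java.util.BitSet;

/** Sketch of one connection that tags each in-flight request with a stream ID. */
final class MultiplexedConnection {
    private final int maxStreams;
    private final BitSet inUse;

    MultiplexedConnection(int maxStreams) {
        this.maxStreams = maxStreams;
        this.inUse = new BitSet(maxStreams);
    }

    /** Returns a free stream ID, or -1 if every ID is held by an in-flight request. */
    synchronized int acquireStreamId() {
        int id = inUse.nextClearBit(0);
        if (id >= maxStreams) {
            return -1;
        }
        inUse.set(id);
        return id;
    }

    /** Frees the ID once the response for that request has arrived. */
    synchronized void releaseStreamId(int id) {
        inUse.clear(id);
    }

    /** Number of requests currently in flight on this connection. */
    synchronized int inFlight() {
        return inUse.cardinality();
    }
}
```

With `new MultiplexedConnection(128)`, a single connection can carry 128 requests at once before a caller has to wait or use another connection, which is why a small pool per host is usually enough.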

4
votes

Just an update for the community: you can configure the connection pool as follows.

private static Cluster cluster; // must be built (e.g. via Cluster.builder()) before the call below

cluster.getConfiguration().getPoolingOptions().setMaxConnectionsPerHost(HostDistance.LOCAL, 100);