
Cassandra is a column-family datastore, which means that each column has its own timestamp/version. It is therefore possible to update a single column of a Cassandra row, which is often referred to as a partial update.
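To make the per-column versioning concrete, here is a minimal toy model in Python. It is not the Cassandra driver API or storage engine; `apply_update`, the `Cell` type, and the sample data are hypothetical names chosen for illustration, and tie-breaking on equal timestamps is simplified.

```python
# Toy model only: not the Cassandra driver API or storage engine.
from typing import Any, Dict, Tuple

Cell = Tuple[Any, int]  # (value, write timestamp)


def apply_update(row: Dict[str, Cell], columns: Dict[str, Any], timestamp: int) -> None:
    """Merge a partial update into a row, keeping the newest cell per column."""
    for name, value in columns.items():
        current = row.get(name)
        # Last write wins, decided column by column (ties simplified here).
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)


row: Dict[str, Cell] = {}
apply_update(row, {"name": "Ada", "city": "London"}, timestamp=100)
apply_update(row, {"city": "Paris"}, timestamp=200)    # partial update of one column
apply_update(row, {"name": "Old name"}, timestamp=50)  # stale write, ignored
```

Each column keeps its own timestamp, so the update to `city` leaves `name` untouched, and the late-arriving stale write to `name` does not overwrite the newer value.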

I am trying to implement a pipeline that makes the data in a Cassandra column family searchable in a search engine such as Solr or Elasticsearch.

I know that DataStax Enterprise provides this Cassandra-Solr integration out of the box.

Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.

How are partial updates made in Cassandra written to Solr?

In other words, do partial updates made in Cassandra get written to Solr without stepping on each other?


1 Answer


I can see where you might be coming from here, but it's also important for anyone reading this to know that the following statement is not correct:

"Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra."

To add some colour to this, let me try to explain. When an update is written to Cassandra, regardless of its content, the new mutation goes through the write path as outlined here:

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataWritten.html

DSE Search uses a "secondary index hook" on the table: incoming writes are pushed into an indexing queue, turned into documents, and stored in the Lucene index. The architecture page gives a high-level overview here:

https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/searchArchitecture.html

This blog post is a bit old now, but it still outlines the concepts:

http://www.datastax.com/dev/blog/datastax-enterprise-cassandra-with-solr-integration-details

So any update, regardless of whether it touches a single column or an entire row, is indexed as part of that same write.
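As a rough illustration of that flow, here is a toy sketch in Python. It is not DSE code: `write`, `drain_index_queue`, and the in-memory `rows`, `index`, and `index_queue` structures are hypothetical stand-ins for the table, the Lucene index, and the indexing queue. It also assumes that the indexer builds each document from the full current row rather than from the changed column alone.

```python
from collections import deque
from typing import Any, Dict

rows: Dict[str, Dict[str, Any]] = {}   # stands in for the Cassandra table
index: Dict[str, Dict[str, Any]] = {}  # stands in for the Lucene index
index_queue: deque = deque()           # stands in for the indexing queue


def write(key: str, columns: Dict[str, Any]) -> None:
    """Apply a (possibly partial) write, then notify the index hook."""
    rows.setdefault(key, {}).update(columns)
    index_queue.append(key)  # hypothetical hook: enqueue the row key


def drain_index_queue() -> None:
    """Rebuild one whole document per queued row key."""
    while index_queue:
        key = index_queue.popleft()
        # Assumption: the document is built from the full current row.
        index[key] = dict(rows[key])


write("user1", {"name": "Ada", "city": "London"})
write("user1", {"city": "Paris"})  # partial update of one column
drain_index_queue()
```

Under that assumption, the indexed document for `user1` ends up with both `name` and the updated `city`, because each pass replaces the whole document from the row's current state, so a single-column update does not clobber the other fields.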