I have data that looks like this

  {super-row-key1 [{ts1 {version-ts1 value, version-ts2 value}}
                   {ts2 {version-ts1 value}}]
   super-row-key2 ...}

These keys and values look something like

{"4447c9a6-9912-44d7-a6b5-cef40735f92c:2011-06"
 [{1291180500000 {1351709255098 -0.008084167000000001}}
  {1291184100000 {1351709255098 -0.004395833}}
  {1291185000000 {1351709255098 -0.003075}}]
 ...}

So I am trying to figure out whether the ClojureWerkz Cassandra Cascading tap already supports operations across all of the rows. As you can see, the super-row keys, the super-rows, and the super-columns are all generated (UUIDs, dates, timestamps, etc.). The examples and the code I have seen lead me to believe that fixed column names, column field names, key column names, and field mappings must be specified in advance.

At the Hadoop level, Cassandra's MapReduce support does appear to allow fetching all rows of data from a given column family. From the documentation:

"Cassandra rows or row fragments (that is, pairs of key + SortedMap of columns) are input to Map tasks for processing by your job, as specified by a SlicePredicate that describes which columns to fetch from each row."

So it appears that it is definitely possible at a low level, but it is unclear how to accomplish what I'm trying to do at the Cascading level.

Does this require adapting or creating a variant of the existing tap, or can it be done somehow with the existing one?

1 Answer

I assume that Robert refers to: https://github.com/ifesdjeen/cascading-cassandra

I tried to get pingles/cascading.cassandra to work with Cascalog, but had no success: all the dependencies, and therefore all the interfaces, had to be changed. So I decided to write my own thing (not always the best idea).

Now, to the answer:

It took me a little longer than I expected to work out exactly how to answer you, but I'm bringing good news :)

First off, I did not plan to include wide row support in the tap, but it turns out that it works even in the current version. Unfortunately, I can't push examples just yet: Cassaforte (https://github.com/clojurewerkz/cassaforte), the Cassandra driver we're using, relies on Clojure 1.4 because of a bug with primitive type hints (http://dev.clojure.org/jira/browse/CLJ-852, if I'm not mistaken), and Midje has a hard-coded Clojure version and doesn't support 1.4, so I'm forced to use an outdated version of our own driver.

The reason for not including wide rows was that the Cassandra team themselves discourage using them and recommend composite columns instead, because those can be read more efficiently and there's no need to fetch an entire supercolumn to get partial data. I realize that switching is not always easy, though, especially for an app written a long time ago.

Next up,

You're right that, as things stand, you have to specify names. I somehow didn't foresee generated column names.

In order to fetch all the columns, you have to use a SlicePredicate and pass it a SliceRange whose slice start and slice finish are both empty byte buffers. So you can set a SliceRange (.setSlice_range) instead of column names (.setColumn_names), and the rest works the same way. If you decide to stick with our tap, you can make that change in CassandraScheme.java: https://github.com/ifesdjeen/cascading-cassandra/blob/master/src/main/java/com/clojurewerkz/cascading/cassandra/CassandraScheme.java#L247. What I'd do is fetch all the columns whenever no column names are specified.
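To make the "empty start and finish" idea concrete, here is a small stand-in model of how a slice range selects columns. It is only a sketch and does not use the Cassandra Thrift classes: `SliceRangeModel`, `slice`, and `bound` are hypothetical names, and the row is simplified to a sorted map from a long column name (a timestamp) to a double value.

```java
import java.nio.ByteBuffer;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of slice-range semantics; NOT the Cassandra Thrift API. */
final class SliceRangeModel {
    private SliceRangeModel() {}

    /**
     * Returns the columns whose names lie in [start, finish], both inclusive.
     * A buffer with no remaining bytes means "unbounded" on that side, so
     * passing two empty buffers selects every column in the row.
     * Assumes start <= finish when both bounds are given.
     */
    static NavigableMap<Long, Double> slice(NavigableMap<Long, Double> row,
                                            ByteBuffer start,
                                            ByteBuffer finish) {
        // duplicate() keeps the caller's buffer position untouched.
        Long lo = start.hasRemaining() ? start.duplicate().getLong() : null;
        Long hi = finish.hasRemaining() ? finish.duplicate().getLong() : null;

        if (lo == null && hi == null) {
            return new TreeMap<>(row);                      // all columns
        }
        if (lo == null) {
            return new TreeMap<>(row.headMap(hi, true));    // open start
        }
        if (hi == null) {
            return new TreeMap<>(row.tailMap(lo, true));    // open finish
        }
        return new TreeMap<>(row.subMap(lo, true, hi, true));
    }

    /** Encodes a long column name as an 8-byte buffer ready for reading. */
    static ByteBuffer bound(long columnName) {
        ByteBuffer b = ByteBuffer.allocate(Long.BYTES);
        b.putLong(columnName);
        b.flip();
        return b;
    }
}
```

Calling `slice(row, ByteBuffer.allocate(0), ByteBuffer.allocate(0))` returns a copy of the whole row, which mirrors what the tap would need when no column names are given.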

Another change that's going to be required is the deserialization of values. You probably have a better feel than I do for how to deal with wide rows here. In essence, you get a response like:

Key / {java.nio.HeapByteBuffer[pos=65 lim=70 cap=93]=org.apache.cassandra.db.Column@478bb374}

So the format would be pretty much the same. Here, you only have to deserialize the key and convert each column into a tuple. If the number of key-value pairs within a column varies, you'll (probably) have to pad with nulls; otherwise the output could be hard to understand and debug.
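The padding idea could look roughly like the sketch below. It is plain Java with no Cascading or Cassandra types: `WideRowTuples` and `toTuple` are hypothetical names, the columns are assumed to be already deserialized into a map of long names to double values, and `width` is an assumed fixed number of column slots per tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: flatten one row into a fixed-width tuple, padding with nulls. */
final class WideRowTuples {
    private WideRowTuples() {}

    /**
     * Builds [rowKey, name1, value1, ..., nameN, valueN] with exactly
     * `width` name/value slots. Rows with fewer columns are padded with
     * nulls so every tuple has the same arity; rows with more columns
     * are rejected rather than silently truncated.
     */
    static List<Object> toTuple(String rowKey,
                                Map<Long, Double> columns,
                                int width) {
        if (columns.size() > width) {
            throw new IllegalArgumentException(
                "row " + rowKey + " has " + columns.size()
                + " columns, more than the tuple width " + width);
        }
        int arity = 1 + 2 * width;
        List<Object> tuple = new ArrayList<>(arity);
        tuple.add(rowKey);
        for (Map.Entry<Long, Double> column : columns.entrySet()) {
            tuple.add(column.getKey());
            tuple.add(column.getValue());
        }
        while (tuple.size() < arity) {
            tuple.add(null);   // pad the unused slots
        }
        return tuple;
    }
}
```

Keeping the arity fixed means downstream field declarations do not change from row to row, at the cost of choosing `width` up front.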

Once again, if you decide to go with our tap, you'd have to upgrade to the Cassaforte beta10 snapshot and, at least for initial tests, remove Midje from project.clj and comment out everything related to it.

If you like, you can use the Cassaforte code to populate a smaller dataset (I usually go with a couple of records): https://github.com/clojurewerkz/cassaforte/blob/master/test/clojurewerkz/cassaforte/thrift/core_test.clj#L26