1 vote

I am designing an application which will accept data/events from customer-facing systems, persist them for audit, and act as a source to replay messages in case downstream systems need a correction in any data feed.

I don't plan to do much analytics on this data (that will be done in a downstream system), but I am expected to persist this data and let ad hoc queries run against it.

A few characteristics of my system:

(1) 99% write, 1% read
(2) High write throughput (roughly 30,000 events a second, each event having roughly 100 attributes in it)
(3) Dynamic nature of data; it can't conform to a fixed schema

These characteristics make me think of Apache Cassandra as an option, either with the wide-row feature or a map to store my attributes.

I did a few samples with a single node and the Kundera ORM to write events to a map, and got a maximum write throughput of 1,500 events a second per thread. I can scale it out with more threads and Cassandra nodes.

But is that close to what I should be getting, in your experience? A few of the benchmarks available on the net look confusing. (I am on Cassandra 2.0, with Kundera ORM 2.13.)

2
It is very difficult to provide an answer, as your question is very vague (and unclear), and we have no idea what the data model looks like. - Cedric H.
Thanks for the response, Cedric. I am looking to see whether I am heading in the correct direction (does 1,500 writes/thread/node look realistic?). - Biju V
I'm not an expert so I'll let someone else post a real answer, but are you sure the 1,500 limit comes from Cassandra and not from your ORM/app? - Cedric H.
Thanks for the response, Cedric. I am looking to see whether I am heading in the correct direction (does 1,500 writes/thread/node look realistic? I was expecting much more). The data model is a simple flat table with a few columns, the rest being a map of attributes (I write around 100 attributes to this map). - Biju V
CREATE TABLE user_events ( event_time timeuuid PRIMARY KEY, attributes map<text, text>, session_token text, state text, system text, user text ) - Biju V

2 Answers

0
votes

It seems that your Cassandra data model is "overusing" the map collection type. If that was your answer to the concern "Dynamic nature of data. Can't conform to a fixed schema.", there are other ways.

CREATE TABLE user_events ( event_time timeuuid PRIMARY KEY, attributes map<text, text>, session_token text, state text, system text, user text )

It looks like the key-value pairs stored in the attributes column are the actual payload of your event. Therefore, they should be stored as rows within the partition, using the keys of your map as the clustering key.

CREATE TABLE user_events(
     event_time TIMEUUID,
     session_token TEXT STATIC,
     state TEXT STATIC,
     system TEXT STATIC,
     user TEXT STATIC,
     attribute TEXT,
     value TEXT,
     PRIMARY KEY(event_time, attribute)
);

This makes event_time and attribute part of the primary key: event_time is the partition key and attribute is the clustering key.

The STATIC modifier makes these columns "properties" of the event, stored only once per partition.
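To make the layout concrete, here is a sketch of how writes and reads would look against this table. The timeuuid literal and column values are placeholders, not real data; in practice the same event_time would come from a single now() call or a client-generated timeuuid:

```cql
-- Each attribute of one event becomes its own row in the partition.
-- The static columns are written once and shared by every row of the partition.
INSERT INTO user_events (event_time, session_token, state, system, user, attribute, value)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, 'sess-1', 'active', 'web', 'user-1', 'browser', 'firefox');

-- Subsequent attributes of the same event only need the key columns:
INSERT INTO user_events (event_time, attribute, value)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, 'ip_address', '10.0.0.1');

-- Reading the whole event back returns one row per attribute,
-- each carrying the shared static columns:
SELECT * FROM user_events WHERE event_time = 50554d6e-29bb-11e5-b345-feff819cdc9f;
```

A side benefit of this model over the map is that individual attributes can be read or updated without deserializing the whole collection.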

-1
votes

Have you tried going through cassandra.yaml and cassandra-env.sh? Tuning the cluster nodes is very important for optimizing performance. You might also want to take a look at the OS parameters, and you need to make sure swap is disabled. That helped me increase my cluster's performance.
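As a rough sketch of what that tuning can look like, these are cassandra.yaml settings from the 2.0 era that are commonly adjusted for write-heavy workloads. The values shown are illustrative, not recommendations; the right numbers depend on your hardware:

```yaml
# cassandra.yaml (Cassandra 2.0) - illustrative values only
concurrent_writes: 64              # often sized relative to CPU core count
memtable_flush_writers: 2          # a common rule of thumb is one per data directory
commitlog_sync: periodic           # periodic sync trades durability window for throughput
commitlog_sync_period_in_ms: 10000
```

On the OS side, swap is typically disabled with `swapoff -a` and by removing swap entries from /etc/fstab, so the JVM heap is never paged out.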