Cassandra Data Model For Correlation Application

Question

I've been mulling over Cassandra documentation and videos on YouTube for a couple of weeks now.

I'm implementing a log storage and correlation system, and I'd like to use Cassandra for this. I can't seem to wrap my head around the Cassandra data model for this type of application, though, and I wasn't able to find many in-depth examples of a good data model for it.

The logging system revolves around HTTP web traffic. My log sources will expand in time, but for now they will include Proxy logs, application logs, and some other system logs that include the hostname/IP and event data.

My correlations will revolve around source and destination IP addresses and hostnames, domain names, GeoIP location, HTTP method (GET/PUT/CONNECT), and possibly the file type being requested (for example .jar, .exe, .pdf). Correlation around time will be important in all of these cases as well.

I've read in many places that data modeling for Cassandra starts off by thinking about the queries you will be running. So I've specified a few examples here. There are more examples but the following would be a good start, and any queries will follow similar correlation patterns.

Example Query 1: Show me where IP 10.0.0.1 has been seen in the logs with .jar extension in the URL within the past 24 hours or the past week

Example Query 2: Show me all PUT requests going to domain xyz.com for the past 24 hours

Example Query 3: Show me all log events for source host 192.168.1.1 from time01 through time02

Example Query 4: Show me all communication between 10.0.0.1 and 192.168.1.1 that occurred yesterday.

Example Query 5: Compare all new events against an existing list of domains and IPs and show me any new events containing these IPs and domains.

I can provide more details if needed. Any guidance will be useful.

Thanks!

I'm just looking for pointers and guidance, as I stated in my last sentence. Even just a link to a similar project or implementation would be a great help. — user3324184
For each query, find the single entity you're asking about: that is your partition key (the first column in the primary key). Then find which columns your query will slice over (i.e. read adjacent rows): these are your "between X and Y" or "within the last X" conditions. Those columns have to be your clustering keys (columns 2+ of the primary key). Lastly, you will need to duplicate your data in as many tables as it takes to support all your queries. — Daniel S.
As far as correlation and other complex analysis go, you will likely need MapReduce for that; Cassandra alone won't be able to solve this efficiently. — Daniel S.

Navid Navid · Accepted Answer · 2014-02-19T00:29:55

For query 1, you can use the IP address and file type as a composite partition key and the timestamp as a clustering column. This way you can query for an IP and file type over a timestamp range.
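To make the query 1 layout concrete, here is a minimal sketch. The table name, column names, and the `EventsByIpExt` class are all hypothetical and only illustrate the idea: the CQL string shows one way the table could be declared, and the small in-memory class mimics how a partition keyed by (IP, file type) keeps its rows sorted by timestamp so a time range is a contiguous slice.

```python
import bisect
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical CQL for the table described above (all names are illustrative).
CREATE_EVENTS_BY_IP_EXT = """
CREATE TABLE events_by_ip_ext (
    ip        text,
    file_ext  text,
    ts        timestamp,
    url       text,
    PRIMARY KEY ((ip, file_ext), ts)
);
"""
# A matching read would look roughly like:
#   SELECT * FROM events_by_ip_ext
#   WHERE ip = ? AND file_ext = ? AND ts >= ? AND ts <= ?;


class EventsByIpExt:
    """Toy in-memory stand-in: one timestamp-sorted list per (ip, file_ext)."""

    def __init__(self):
        # (ip, file_ext) -> sorted list of (ts, url)
        self._partitions = defaultdict(list)

    def insert(self, ip, file_ext, ts, url):
        # Keep each partition ordered by the clustering column (ts).
        bisect.insort(self._partitions[(ip, file_ext)], (ts, url))

    def query(self, ip, file_ext, start, end):
        """Return rows with start <= ts <= end from a single partition."""
        rows = self._partitions.get((ip, file_ext), [])
        keys = [ts for ts, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return rows[lo:hi]
```

"Past 24 hours" and "past week" then differ only in the `start` value passed to `query`; the partition that is read stays the same.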

For query 2, use domain and method (PUT, GET, ...) as a composite partition key and the timestamp as a clustering column. You might need to add a UUID or request ID as an extra clustering column to make the compound primary key unique.
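The reason for the extra UUID column is that Cassandra writes are upserts: two rows with the same full primary key collapse into one. The sketch below models a table as a plain dict keyed by primary key to show the effect; the `upsert` helper and the sample values are made up for illustration.

```python
import uuid
from datetime import datetime


def upsert(table, primary_key, row):
    """Model of an upsert: a row with the same primary key is overwritten."""
    table[primary_key] = row


ts = datetime(2014, 2, 19, 0, 29, 55)

# Primary key = (domain, method, ts): two PUTs in the same instant collide.
without_id = {}
upsert(without_id, ("xyz.com", "PUT", ts), {"url": "/upload/a"})
upsert(without_id, ("xyz.com", "PUT", ts), {"url": "/upload/b"})

# Primary key = (domain, method, ts, event_id): both events are kept.
with_id = {}
for url in ("/upload/a", "/upload/b"):
    upsert(with_id, ("xyz.com", "PUT", ts, uuid.uuid4()), {"url": url})

print(len(without_id), len(with_id))
```

With busy proxies, several requests to the same domain can easily share a timestamp, so the tie-breaker column is usually worth having.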

For query 3, use the IP as the partition key and the timestamp as a clustering column, plus a UUID if needed.

For query 4, use IPA and IPB as a composite partition key and the timestamp as a clustering column. In this case, if the communication is bidirectional, you need to store the event under (IPB, IPA) as well.
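A rough sketch of the "store it both ways" idea for query 4, again as an in-memory model rather than real driver code. The `record` and `between` helpers and the row layout are hypothetical; the point is only that each event is written under both orderings of the IP pair, so a lookup by either ordering returns the whole conversation.

```python
from collections import defaultdict
from datetime import datetime

# (ip_a, ip_b) -> list of (ts, src, dst, detail), kept sorted by ts
conversations = defaultdict(list)


def record(src, dst, ts, detail):
    """Write the event under (src, dst) and (dst, src)."""
    row = (ts, src, dst, detail)
    # A set avoids a double write when src == dst.
    for key in {(src, dst), (dst, src)}:
        conversations[key].append(row)
        conversations[key].sort()


def between(ip_a, ip_b, start, end):
    """All events for the pair with start <= ts < end, in time order."""
    rows = conversations.get((ip_a, ip_b), [])
    return [r for r in rows if start <= r[0] < end]
```

"Yesterday" is then just `start` at midnight of the previous day and `end` at midnight today. The cost of this layout is that every event between two hosts is written twice.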

For query 5, you have to do this in the client program.
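One possible shape for that client-side check is sketched below. The event field names (`src_ip`, `dst_ip`, `host`) and both helper functions are assumptions, and treating subdomains of a watched domain as matches is a design choice of this sketch, not something the question specifies.

```python
def domain_matches(host, domains):
    """True if host is a watched domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in domains)


def flag_events(events, watch_ips, watch_domains):
    """Yield incoming events that touch a watched IP or domain."""
    for event in events:
        if (
            event["src_ip"] in watch_ips
            or event["dst_ip"] in watch_ips
            or domain_matches(event["host"], watch_domains)
        ):
            yield event
```

Keeping the watchlists as in-memory sets makes each IP check a constant-time lookup, so this can run on events as they arrive, before or alongside the write to Cassandra.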
