I've been working through the Cassandra documentation and videos on YouTube for a couple of weeks now.
I'm implementing a log storage and correlation system, and I'd like to use Cassandra for it. I can't seem to wrap my head around the Cassandra data model for this type of application, though, and I haven't been able to find many in-depth examples of a good data model for it.
The logging system revolves around HTTP web traffic. My log sources will expand over time, but for now they include proxy logs, application logs, and some other system logs that carry the hostname/IP and event data.
My correlations will revolve around source and destination IP addresses and hostnames, domain names, GeoIP location, HTTP method (GET/PUT/CONNECT), and possibly the file type being requested (for example .jar, .exe, .pdf). Correlation by time will be important in all of these cases as well.
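To make those fields concrete, here is roughly how I picture a single normalized event before it is written anywhere. This is only a sketch: the class name `LogEvent` and every field name are placeholders I made up, not an existing schema, and the file type is derived from the URL path on the assumption that the extension is what I want to correlate on.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class LogEvent:
    """One normalized log event (hypothetical field names)."""
    ts: datetime                       # event time, timezone-aware
    log_source: str                    # e.g. "proxy", "application", "system"
    src_ip: str
    dst_ip: Optional[str] = None
    dst_host: Optional[str] = None
    domain: Optional[str] = None
    http_method: Optional[str] = None  # GET / PUT / CONNECT ...
    url: Optional[str] = None
    geo_country: Optional[str] = None  # result of a GeoIP lookup, if any

    @property
    def file_ext(self) -> Optional[str]:
        """Lower-cased extension of the URL path (".jar"), or None."""
        if not self.url:
            return None
        suffix = PurePosixPath(urlparse(self.url).path).suffix
        return suffix.lower() or None
```

Not every source will fill every field (a system log has no URL, for instance), which is why most of them are optional.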
I've read in many places that data modeling for Cassandra starts with the queries you will be running, so I've listed a few examples here. There are more, but these are a good start, and any other queries will follow similar correlation patterns.
Example Query 1: Show me where IP 10.0.0.1 has been seen in the logs with a .jar extension in the URL within the past 24 hours or the past week.
Example Query 2: Show me all PUT requests going to domain xyz.com for the past 24 hours.
Example Query 3: Show me all log events for source host 192.168.1.1 from time01 through time02.
Example Query 4: Show me all communication between 10.0.0.1 and 192.168.1.1 that occurred yesterday.
Example Query 5: Compare all new events against an existing list of domains and IPs and show me any new events containing these IPs and domains.
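In case it helps show where my thinking is, here is a small in-memory sketch of how I currently imagine Query 3 and Query 5 working. It does not talk to Cassandra at all: the dictionary stands in for one table per query, and I am assuming (possibly wrongly) that the partition key would be the source IP plus a one-day time bucket, with the timestamp as a clustering column. All function and variable names are my own.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Stand-in for a table keyed by (src_ip, day); each value is that
# partition's rows as (timestamp, event) pairs.
events_by_src_ip_day = defaultdict(list)


def day_bucket(ts):
    """Truncate a timezone-aware timestamp to its UTC day, e.g. '2014-03-01'."""
    return ts.astimezone(timezone.utc).date().isoformat()


def insert_event(src_ip, ts, event):
    """Write one event into the partition for its source IP and day."""
    events_by_src_ip_day[(src_ip, day_bucket(ts))].append((ts, event))


def buckets_between(start, end):
    """Every day bucket touched by the inclusive range [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets


def events_for_src_ip(src_ip, start, end):
    """Query 3: read one partition per day, then trim to the exact range."""
    rows = []
    for bucket in buckets_between(start, end):
        for ts, event in events_by_src_ip_day.get((src_ip, bucket), []):
            if start <= ts <= end:
                rows.append((ts, event))
    return sorted(rows, key=lambda row: row[0])


def matches_watchlist(event, watch_ips, watch_domains):
    """Query 5: does a new event mention any watched IP or domain?"""
    return (
        event.get("src_ip") in watch_ips
        or event.get("dst_ip") in watch_ips
        or event.get("domain") in watch_domains
    )
```

With this layout a 24-hour query touches at most two partitions per IP and a one-week query at most eight, but I would need a separate structure keyed by domain, by destination IP, and so on for the other queries. Whether that duplication is the right approach is exactly what I am unsure about.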
I can provide more details if needed. Any guidance will be useful.
Thanks!