0 votes

I have a use case where I need to constantly listen to a Kafka topic and, from a Spark Streaming app, write to 2000 column families (15 columns each, time-series data) based on a column value. I have a local Cassandra installation set up; creating these column families takes around 1.5 hours on a CentOS VM with 3 cores and 12 GB of RAM. In my Spark Streaming app I do some preprocessing before storing these stream events to Cassandra, and I'm running into issues with how long the app takes to complete this.
When I tried to save 300 events to multiple column families (roughly 200-250, chosen by key), my app took around 10 minutes to save them. This seems strange, as printing these events to the screen grouped by key takes less than a minute; it is only when saving them to Cassandra that it takes so long. I have had no issues saving on the order of 3 million records to Cassandra, and that took less than 3 minutes, but it was to a single column family.

My requirement is to be as close to real-time as possible, and this is nowhere near it. The production environment would see roughly 400 events every 3 seconds.

Is there any tuning I need to do in the Cassandra YAML file, or any changes to the cassandra-connector itself?
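In outline, the write path looks like this (a simplified sketch with placeholder names, i.e. Event, keyspace ks, the events_ table prefix, rather than my actual code):

import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

case class Event(key: String, ts: Long, value: Double)

def writePerKey(events: DStream[Event]): Unit =
  events.foreachRDD { rdd =>
    // one table per key value, so one saveToCassandra call per distinct key;
    // the RDD is rescanned for every key, which is where the time goes
    rdd.map(_.key).distinct().collect().foreach { k =>
      rdd.filter(_.key == k).saveToCassandra("ks", s"events_$k")
    }
  }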

INFO  05:25:14 system_traces.events                      0,0
WARN  05:25:14 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:14 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:16 ParNew GC in 340ms.  CMS Old Gen: 1308020680 -> 1454559048; Par Eden Space: 251658240 -> 0; 
WARN  05:25:16 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:16 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:17 ParNew GC in 370ms.  CMS Old Gen: 1498825040 -> 1669094840; Par Eden Space: 251658240 -> 0; 
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:19 ParNew GC in 382ms.  CMS Old Gen: 1714792864 -> 1875460032; Par Eden Space: 251658240 -> 0; 
Could you give more details about the number of column families? 133 records a second should be trivial to save. – RussS
@RussS The 133 records per second end up in roughly 100 different column families. I see a lot of ParNew GC activity in the Cassandra logs, as well as tombstone threshold warnings; I have attached some console messages from C*. – sainath reddy
This looks like you are having issues due to the volume of columns and column families in your cluster. I would suggest bringing this up on the C* user mailing list. – RussS
@RussS I have posted this to the mailing list but haven't gotten any response yet. Is there a better platform for raising this issue? cassandra-user-incubator-apache-org.3065146.n2.nabble.com/… – sainath reddy

2 Answers

1 vote

I suspect you're hitting edge cases in Cassandra related to the large number of column families/columns defined in the schema. Typically, tombstone warnings mean you've messed up the data model. Here, however, they are in the system tables, so you've clearly done something to those tables that the authors didn't expect (lots and lots of tables, probably dropped and recreated frequently).

Those warnings were added because scanning past tombstones looking for live columns causes memory pressure, which causes GC, which causes pauses, which causes slowness.

Can you squish the data into significantly fewer column families? You may also want to try clearing out the tombstones: drop gc_grace_seconds for the table to zero, run a major compaction (on system, if that's allowed?), then raise it back to the default.
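As a rough illustration of the consolidation idea, here is a sketch against the spark-cassandra-connector; every name in it (keyspace ks, table events_by_key, the Event case class) is a placeholder, not something from the question:

import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Assumed schema, with the old per-table key folded into the partition key:
//   CREATE TABLE ks.events_by_key (key text, ts timestamp, value double,
//                                  PRIMARY KEY (key, ts));
case class Event(key: String, ts: Long, value: Double)

def writeConsolidated(events: DStream[Event]): Unit =
  events.foreachRDD { rdd =>
    rdd.saveToCassandra("ks", "events_by_key") // one write path instead of ~250
  }

One wide table keyed by the value that used to choose the table turns hundreds of per-table saves into a single call, and keeping the schema small means the system-table tombstone scans above should go away.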

0 votes

You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You could also try another open-source product, SnappyData, a database built on Spark, which should give you very high performance for your use case.
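For example, a few of the connector's write-tuning properties (these setting names are real spark-cassandra-connector options; the values shown are placeholders to experiment with, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")   // local node
  .set("spark.cassandra.output.concurrent.writes", "8")  // async writes in flight per task
  .set("spark.cassandra.output.batch.size.rows", "64")   // rows per unlogged batch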