In my Spark job I am reading data from Cassandra using the Java Cassandra util. My query looks like this:

JavaRDD<CassandraRow> cassandraRDD = functions.cassandraTable("keyspace", "column_family")
    .select("timeline_id", "shopper_id", "product_id")
    .where("action=?", "Viewed");

The row key (partition key) is the action column. When I run the Spark job with this filter, it causes very high CPU utilisation, but when I remove the filter on the action column it works fine.

Below is the CREATE TABLE script for the column family:

CREATE TABLE keyspace.column_family (
    action text,
    timeline_id timeuuid,
    shopper_id text,
    product_id text,
    publisher_id text,
    referer text,
    remote_ip text,
    seed_product text,
    strategy text,
    user_agent text,
    PRIMARY KEY (action, timeline_id, shopper_id)
) WITH CLUSTERING ORDER BY (timeline_id DESC, shopper_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

My suspicion is that, because action is the row key, all the data is being served from a single node (a hot spot), and that is why that node's CPU shoots up. Also, the read creates only a single RDD partition in the Spark job. Any help will be appreciated.

Can you post the CREATE TABLE script of your column_family here? - doanduyhai
Added it to the question. - Y0gesh Gupta

1 Answer

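OK, you have a data model issue here. action is the partition key, so all rows with the same action are stored in a single partition, which lives on one node plus its replicas. Filtering on action = 'Viewed' therefore reads from that one partition only, which would also explain the single RDD partition you see.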

How many distinct actions do you have in total? Your suspicion of a hot spot is justified.

You probably need a different partition key, or an extra column in the partition key, so that Cassandra can distribute the data evenly across the cluster.
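
As a rough sketch of the "extra column in the partition key" idea, assuming a hypothetical int column named bucket and a hypothetical table name column_family_bucketed (neither exists in your schema), you could derive a bucket from shopper_id at write time and then read one (action, bucket) partition per query. The bucket count of 16 is also just an assumption and would need tuning for your cluster and data volume.

```java
import java.util.ArrayList;
import java.util.List;

class ActionBucketing {
    // Hypothetical bucket count (an assumption, not taken from the question).
    static final int BUCKETS = 16;

    // Hypothetical schema: (action, bucket) together form the partition key,
    // so rows for one action are spread over BUCKETS partitions.
    static final String CREATE_TABLE =
        "CREATE TABLE keyspace.column_family_bucketed ("
        + " action text, bucket int, timeline_id timeuuid,"
        + " shopper_id text, product_id text,"
        + " PRIMARY KEY ((action, bucket), timeline_id, shopper_id))";

    // Derive a stable bucket in [0, BUCKETS) from the shopper id at write time.
    static int bucketFor(String shopperId) {
        return Math.floorMod(shopperId.hashCode(), BUCKETS);
    }

    // Build one WHERE clause per bucket; reading a whole action now means
    // querying every bucket instead of a single large partition.
    static List<String> whereClauses(String action) {
        List<String> clauses = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            clauses.add("action = '" + action + "' AND bucket = " + b);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
        System.out.println("bucket for shopper-42: " + bucketFor("shopper-42"));
        whereClauses("Viewed").forEach(System.out::println);
    }
}
```

The trade-off is that a read for one action becomes BUCKETS smaller reads that can be served by different nodes, at the cost of having to query every bucket and merge the results yourself.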

Read this blog post: http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/