I have a Cassandra table XYX with columns (id uuid, insert timestamp, header text),
where id and insert form the composite primary key.
I'm using DataFrames, and in my Spark shell I fetch the id and header columns. I want distinct rows based on the id and header columns.
I'm seeing a lot of shuffles, which should not be the case, since the Spark Cassandra connector ensures that all rows for a given Cassandra partition are in the same Spark partition.
After fetching, I use dropDuplicates to get the distinct records.
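
To make the expectation concrete, here is a minimal plain-Python sketch (not Spark or connector code). It assumes id is the partition key and insert the clustering column, so all rows sharing an id sit in one partition. The sample rows, the helper names partition_by_id and local_distinct, and the partition count are made up for illustration. Under that assumption, deduplicating on (id, header) inside each partition gives the same result as a global distinct, with no data moved between partitions.

```python
from collections import defaultdict

# Hypothetical rows as (id, insert, header); values are invented for illustration.
rows = [
    ("a", 1, "h1"),
    ("a", 2, "h1"),
    ("a", 3, "h2"),
    ("b", 1, "h1"),
    ("b", 2, "h1"),
    ("c", 1, "h3"),
]

def partition_by_id(rows, num_partitions):
    """Place every row with the same id in the same partition.

    Assumption: id alone is the partition key.
    """
    ids = sorted({r[0] for r in rows})
    slot = {key: i % num_partitions for i, key in enumerate(ids)}
    parts = defaultdict(list)
    for r in rows:
        parts[slot[r[0]]].append(r)
    return list(parts.values())

def local_distinct(partition):
    """Deduplicate on (id, header) using only the rows of one partition."""
    return {(rid, header) for rid, _insert, header in partition}

partitions = partition_by_id(rows, num_partitions=2)
per_partition = [local_distinct(p) for p in partitions]

# Union of the per-partition results, computed without moving rows around.
local_result = set().union(*per_partition)

# Reference answer: distinct (id, header) over the whole data set.
global_result = {(rid, header) for rid, _insert, header in rows}
```

The sketch only models data placement. It says nothing about how Spark actually plans dropDuplicates.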