1
votes

My Cassandra database is not returning the row count I expect. Please see below the details of my keyspace creation and the COUNT(*) query.

Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.

cqlsh> CREATE KEYSPACE key1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

cqlsh> CREATE TABLE key1.transcation_completemall (i text, i1 text, i2 bigint, i3 int static, i4 decimal static, i5 bigint static, i6 decimal static, i7 decimal static, PRIMARY KEY ((i), i1));


cqlsh> COPY key1.transcation_completemall (i, i1, i2, i3, i4, i5, i6, i7) FROM '/home/gpadmin/all.csv' WITH HEADER = TRUE;
Using 16 child processes

Starting copy of key1.transcation_completemall with columns [i, i1, i2, i3, i4, i5, i6, i7].
Processed: 25461792 rows; Rate: 15162 rows/s; Avg. rate: 54681 rows/s
25461792 rows imported from 1 files in 7 minutes and 45.642 seconds (0 skipped).

cqlsh> select count(*) from key1.transcation_completemall;
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
cqlsh> exit


[gpadmin@hmaster ~]$ cqlsh --request-timeout=3600
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.


cqlsh> select count(*) from key1.transcation_completemall;

 count
---------
 2865767

(1 rows)

Warnings :
Aggregation query used without partition key

cqlsh>
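As a side note, the timeout that the error message points at (Session.execute's timeout) can apparently also be raised per request from the DataStax Python driver instead of via cqlsh --request-timeout; something like this should work:

```python
# Rough sketch: run the same COUNT(*) from the Python driver with a long
# per-request timeout (3600 s, mirroring the cqlsh --request-timeout setting).
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

row = session.execute(
    "SELECT COUNT(*) FROM key1.transcation_completemall;",
    timeout=3600,  # seconds; overrides the default request timeout for this call
).one()
print(row[0])  # the single "count" column

cluster.shutdown()
```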

I got only 2865767 rows, but the COPY command reported that Cassandra accepted 25461792 rows. The all.csv file is about 2.5 GB. To check, I exported the table to another file, test.csv, and was surprised to see that it is only 252 MB.

My question is: will Cassandra automatically remove duplicates from a table? If yes, how does it decide what counts as a duplicate: repetition of the full primary key, of the partition key only, or an exact match of every field?

Or, what other possibility is there for the data to get lost?

I would appreciate your suggestions. Thanks in advance.

Comments:
(i,i1,i2,i3,i4,i5,i6,i7) Worst. Column. Names. Ever. - Aaron
They are just example field names, @Aaron. - StratQuest

2 Answers

4
votes

Cassandra will overwrite data that has the same primary key. (In general, no database allows duplicate values for a primary key; some throw a constraint error, while others, like Cassandra, overwrite the existing data.)

Example:

CREATE TABLE test(id int, id1 int, name text, PRIMARY KEY(id, id1));

INSERT INTO test(id,id1,name) VALUES(1,2,'test');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');
INSERT INTO test(id,id1,name) VALUES(1,2,'test2');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');

SELECT * FROM test;

 id | id1 | name
----+-----+-------
  1 |   1 | test1
  1 |   2 | test2

(2 rows)

The above statements leave only two records in the table: one with primary key (1, 1) and the other with primary key (1, 2).

So in your case, wherever rows share the same values of i and i1, the later rows overwrite the earlier ones.
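If you want to confirm this against your source file, one rough check is to count the distinct (i, i1) pairs in all.csv and compare that with the COUNT(*) result; a minimal sketch (the path and header names come from your COPY command, everything else is assumed):

```python
# Minimal sketch, assuming /home/gpadmin/all.csv has a header row with the column
# names used in the COPY command (i, i1, ..., i7): count how many distinct (i, i1)
# primary keys the file actually contains. If that number is close to 2865767,
# the "missing" rows were primary-key overwrites rather than lost data.
import csv

total_rows = 0
distinct_keys = set()

with open('/home/gpadmin/all.csv', newline='') as f:
    for row in csv.DictReader(f):
        total_rows += 1
        # full primary key = partition key i plus clustering key i1
        distinct_keys.add((row['i'], row['i1']))

print('rows in CSV:          ', total_rows)
print('distinct (i, i1) keys:', len(distinct_keys))
```

Note that this holds every distinct key in memory, so for a 2.5 GB file it needs a few GB of RAM; it is only meant as a rough verification.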

0
votes

Maybe check the LIMIT option on the SELECT statement; see the reference doc here.

Ref doc says:

Specifying rows returned using LIMIT

Using the LIMIT option, you can specify that the query return a limited number of rows.

SELECT COUNT(*) FROM big_table LIMIT 50000;
SELECT COUNT(*) FROM big_table LIMIT 200000;

The output of these statements, if you had 105,291 rows in the database, would be 50,000 and 105,291 respectively. The cqlsh shell has a default row limit of 10,000. The Cassandra server and native protocol do not limit the number of rows that can be returned, although a timeout stops running queries to protect against malformed queries that would cause system instability.
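For illustration, the documented behaviour above can presumably be reproduced from the DataStax Python driver as well (big_table and the 105,291-row figure are the documentation's hypothetical example, not the table from the question):

```python
# Sketch of the documented LIMIT behaviour: COUNT(*) only aggregates over the rows
# LIMIT allows, so the first query would print 50000 and the second 105291,
# assuming big_table holds 105,291 rows as in the doc excerpt.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # hypothetical keyspace holding big_table

print(session.execute("SELECT COUNT(*) FROM big_table LIMIT 50000;").one()[0])
print(session.execute("SELECT COUNT(*) FROM big_table LIMIT 200000;").one()[0])

cluster.shutdown()
```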