new to cassandra, still learning.
create table url (
id_website int,
url varchar,
data varchar,
primary key(url, id_website)
);
Hi I have a table of url for a website.
I don't want all the url being on the same node, that's why primary key is url first, so it will be the partition key.
most of the time I'm going to retrieve the data for a specific url, eg : "url = ? and id_website = ?"
However what about the performance when I want to retrieve a part/all the urls of a website:
select * from url where id_website = 1 allow filtering limit XX;
I think this query is going to be dispatched on all the nodes, then, table scanning for id_website= 1 until limit is reach then merged and sent back to my client.
But is this scanning going to use an index and be effective or read the values of the column id_website one by one and compare (ineffective so) ? I did set id_website part of the primary key so I expect it to be indexed, but I really don't know.
Do we have some tools on cassandra like the EXPLAIN of mysql to check if a query is using index or not.
Thanks.
--
EDIT
Create a second table with id_website as partition key (and write/delete in batch)
I don't want to use this solution because I may have one or two website which are really huge and have millions of urls (and millions of others website with little of urls).
If I have a partition key on id_website, and that this two or three website stay on the same node it may cause storage problem or the node handling these websites maybe too much solicited while the other get nothing. I want to spread the data over all the nodes. That's why I insisted to partition on the url.
You create a secondary index on id_website (which creates a table for you)
What about this solution ? If I understand, each node would have a table indexing the rows it stores based on id_website (so not the rows of others nodes). So I can spread my urls across many nodes, I won't have one node handling a big indexing containing all the urls of a specific website.
Now when I use my query
select * from url where id_website = 1 allow filtering limit XX;
Each node receive the query, but they don't have to loop through the partition (url column) this time, they can directly lookup up in the index the urls belonging to id_website, and return the rows (or nothing). Right ?
The contra of this solution is everytime the request is done, it's going to hit each node, however, it should be fast thanks to the new index ?