4
votes

new to cassandra, still learning.

create table url (
  id_website int,
  url varchar,
  data varchar,
  primary key(url, id_website)
);

Hi I have a table of url for a website.

I don't want all the url being on the same node, that's why primary key is url first, so it will be the partition key.

most of the time I'm going to retrieve the data for a specific url, eg : "url = ? and id_website = ?"

However what about the performance when I want to retrieve a part/all the urls of a website:

select * from url where id_website = 1 allow filtering limit XX;

I think this query is going to be dispatched on all the nodes, then, table scanning for id_website= 1 until limit is reach then merged and sent back to my client.

But is this scanning going to use an index and be effective or read the values of the column id_website one by one and compare (ineffective so) ? I did set id_website part of the primary key so I expect it to be indexed, but I really don't know.

Do we have some tools on cassandra like the EXPLAIN of mysql to check if a query is using index or not.

Thanks.

--

EDIT

Create a second table with id_website as partition key (and write/delete in batch)

I don't want to use this solution because I may have one or two website which are really huge and have millions of urls (and millions of others website with little of urls).

If I have a partition key on id_website, and that this two or three website stay on the same node it may cause storage problem or the node handling these websites maybe too much solicited while the other get nothing. I want to spread the data over all the nodes. That's why I insisted to partition on the url.

You create a secondary index on id_website (which creates a table for you)

What about this solution ? If I understand, each node would have a table indexing the rows it stores based on id_website (so not the rows of others nodes). So I can spread my urls across many nodes, I won't have one node handling a big indexing containing all the urls of a specific website.

Now when I use my query

select * from url where id_website = 1 allow filtering limit XX;

Each node receive the query, but they don't have to loop through the partition (url column) this time, they can directly lookup up in the index the urls belonging to id_website, and return the rows (or nothing). Right ?

The contra of this solution is everytime the request is done, it's going to hit each node, however, it should be fast thanks to the new index ?

1

1 Answers

5
votes

You're on the right way. Using allow filtering you're just asking cassandra to scan all nodes: very ineffective. id_website is indexed within each partition but since you are not telling Cassandra where to go he must hit all partitions (all nodes) even those who doesn't contain information for the selected id_website -- once Cassandra hit a partition knows how to look for this information and does not need to scan the whole partition to get data back.

To solve this problem in Cassandra you have to denormalize and in this situation you can do it in two possible ways:

  1. Create a second table with id_website as partition key (and write/delete in batch)
  2. You create a secondary index on id_website (which creates a table for you)

**EDIT DUE TO QUESTION EDIT**

What you said is right: secondary indexes are handled as "local indexes" -- each node creates a local index table only for the data it owns. The following is a good reading about secondary indexes (that you already understood)

Once you created the index you have to remove ALLOW FILTERING from the query.

HTH, Carlo