0
votes

I have been trying out Cassandra and need some help in understanding a few issues. I am new to cassandra and I am not sure of translating a MySQL DB to Cassandra would lead me to pitfalls which due to say inexperience or limited knowledge of cassandra. So I hope I can get the useful information from experienced cassandra users/developers.

Below are sample keyspaces I have created. I would like to know any sort of drawback in the design if someone from their experience can point out.

create keyspace Students with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use Students;
create column family StudentID with column_type = 'Super' and comparator = 'UTF8Type' and subcomparator = 'UTF8Type' and default_validation_class = 'UTF8Type' and column_metadata = 
[{column_name : 'First Name', validation_class : UTF8Type}, 
{column_name : 'Last Name', validation_class : UTF8Type}, 
{column_name : 'Subjects', validation_class : UTF8Type}, 
{column_name : 'Class', validation_class : UTF8Type}];


 set StudentID[utf8('1968')]['00001']['First Name'] = 'Mark';
 set StudentID[utf8('1968')]['00001']['Last Name'] = 'Myers';
 set StudentID[utf8('1968')]['00001']['Subjects'] = 'Maths, Chemistry';
 set StudentID[utf8('1968')]['00001']['Class'] = '10th grade';


create keyspace Teachers with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use Teachers;
create column family TeacherID with column_type = 'Super' and comparator = 'UTF8Type' and subcomparator = 'UTF8Type' and default_validation_class = 'UTF8Type' and column_metadata = 
[{column_name : 'First Name', validation_class : UTF8Type}, 
{column_name : 'Last Name', validation_class : UTF8Type}, 
{column_name : 'Subjects', validation_class : UTF8Type}, 
{column_name : 'Class', validation_class : UTF8Type}];

set TeacherID[utf8('777')]['234-333']['First Name'] = 'Mark';
set TeacherID[utf8('777')]['234-333']['Last Name'] = 'Myers';
set TeacherID[utf8('777')]['234-333']['Subjects'] = 'Maths, Chemistry,physics';
set TeacherID[utf8('777')]['234-333']['Class'] = '10th grade, 11th grade, 9th grade';



create keyspace Subjects with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use Subjects;
create column family SubjectNames with default_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata = 
[{column_name : 'Names1', validation_class : UTF8Type}];


set SubjectNames[utf8('Current')]['Name1']= 'maths';
set SubjectNames[utf8('Current')]['Name2']= 'physics';
set SubjectNames[utf8('Current')]['Name3']= 'chemistry';
set SubjectNames[utf8('Current')]['Name4']= 'CS';

Three keyspaces - Students, Teachers and Subjects. I would definitely need some relationship amongst these keyspaces and would also require querying data. e.g.

  • I would query for students with a certain subject and/or class
  • A teacher with certain class
  • List all subjects taken up by a certain student from a certain class.

From what I know, I would definitely need to create secondary indices to make the queries work. That is, retrieving data on certain clauses.

What I know if I am correct

  • We donot have a 'like' clause in cassandra
  • For each value for a column (the very last key-value pair), the value must be broken up. That is to individual words. Say, I want to get a list of subjects so each subject must reside in a distinct column associated to it. I cannot query column values that are like "subjectA,subjectB" instead I would break it up to SubjectA and SubjectB and put them in different columns.

Below are the keyspaces.

studentssubjectteachers

1

1 Answers

1
votes

First off, is Cassandra the right tool for the job you're trying to do? Cassandra does an excellent job dealing with distributed, loosely coupled, data sets that need high speed reading and writing capability but it starts getting clunky when you want to enforce a relational model on top of it, hence my question. If you have a highly relational dataset, like the example you showed here, where the emphasis rests on determining relationships between information, then MySQL would be a better tool than Cassandra.

I think you are confusing keyspaces as a 1-1 mapping with MySQL tables. A keyspace would more directly correspond to a database instead of a table in the database. Firstly, you might want to redesign your keyspace setup to put everything together, something like this:

keyspace: School
Column Family: Student ; Row Key: StudentID ; Col1 = First Name, Col2 = Last Name, Col3 = subjects, Col4 = class.

Repeat for your other two column families--not sure if you need supers or not.

To do cross cutting retrievals, you would need to make a column family such as:

Column Family: Class ; RowKey: ClassId (ie 10th Grade) ; col1= (TeacherId:TeacherId), Col2 = (StudentId:StudentId) 

to build a relationship column family between a specific class and all the people who belong to it.

Breaking Up
Yes, you would need to break them up by subject and place them into their own column families. Caveat to this being you can use secondary indices (as of Cassandra .7) that allow you to perform more equality type queries such as:

get users where birth_date = 1973;

Also refer to this document regarding the use of Secondary Indices. Relevant quote being,

Cassandra’s built-in secondary indexes are best for cases when many rows contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index. For example, suppose you had a user table with a billion users and wanted to look up users by the state they lived in. Many users will share the same column value for state (such as CA, NY, TX, etc.). This would be a good candidate for a secondary index. On the other hand, if you wanted to look up users by their email address (a value that is typically unique for each user), it may be more efficient to manually maintain a dynamic column family as a form of an “index”. Even for columns containing unique data, it is often fine performance-wise to use secondary indexes for convenience, as long as the query volume to the indexed column family is moderate and not under constant load.

If you haven't already seen it, the DataStax website will answer lots of your Cassandra questions, I highly recommend browsing through it if you're going to use Cassandra extensively.

In short, your two options are to decouple the items and create column families for each relationship you want to maintain OR to possibly use secondary indices depending on how you separate your data. I personally prefer the former method--despite the boilerplate--because I think it scales out better.