2 votes

I have a large number of documents (50 million), each of which has between zero and 50 sub parts. Each part has metadata. Can I use a dynamic column family to store the metadata?

For example, consider a document that has 4 parts (E0, E1, E2, E3).

Here I want to store metadata, say size, name, count, etc., for each part E0, E1, E2, E3. But I need to retrieve all parts (E3, E2, E1, E0) with the same document ID at once. Some documents might have 50 sub parts (E50, E49, and so on).


Columns of the Cassandra column family:

E_0_COL_ID
E_0_COL_Name
E_0_COL_Size
...
E_1_COL_ID
E_1_COL_Name
E_1_COL_Size
...
E_n_COL_ID
E_n_COL_Name
E_n_COL_Size

Is this a correct use case for Cassandra dynamic columns?

CQL SCHEMA

First, create the table:


CREATE TABLE DOCUMENT_TABLE (
    E_0_COL_ID   bigint,
    E_0_COL_Name text,
    E_0_COL_Size bigint,
    PRIMARY KEY (E_0_COL_ID)
);

Then add dynamic columns according to the number of parts:

ALTER TABLE test.DOCUMENT_TABLE ADD E_1_COL_ID bigint;
ALTER TABLE test.DOCUMENT_TABLE ADD E_1_COL_Name text;
ALTER TABLE test.DOCUMENT_TABLE ADD E_1_COL_Size bigint;

ALTER TABLE test.DOCUMENT_TABLE ADD E_2_COL_ID bigint;
ALTER TABLE test.DOCUMENT_TABLE ADD E_2_COL_Name text;
ALTER TABLE test.DOCUMENT_TABLE ADD E_2_COL_Size bigint;

...

ALTER TABLE test.DOCUMENT_TABLE ADD E_n_COL_ID bigint;
ALTER TABLE test.DOCUMENT_TABLE ADD E_n_COL_Name text;
ALTER TABLE test.DOCUMENT_TABLE ADD E_n_COL_Size bigint;
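
For illustration, a document with two parts would then be inserted like this (a sketch; the IDs, names, and sizes are made up):

INSERT INTO test.DOCUMENT_TABLE (E_0_COL_ID, E_0_COL_Name, E_0_COL_Size,
                                 E_1_COL_ID, E_1_COL_Name, E_1_COL_Size)
VALUES (123, 'Main doc', 456, 1, 'Part 1', 789);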

Is this a correct use of a dynamic column family in Cassandra?

Should I use a custom type for each document and use a CQL collection? – NILESH SALPE

1 Answer

0 votes

The downside of that approach is that you are adding those columns for all documents in the table, even if they don't have a particular sub part. Many rows will then return those unused columns as null, which may be an annoyance to deal with in your application, not to mention having so many hard-coded column names.
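
For example (a sketch using the column names from your schema; the values are made up), a document with only two parts would return null for every higher-numbered column:

SELECT E_0_COL_Name, E_1_COL_Name, E_2_COL_Name FROM test.DOCUMENT_TABLE WHERE E_0_COL_ID = 123;

//  E_0_COL_Name | E_1_COL_Name | E_2_COL_Name
// --------------+--------------+--------------
//  Main doc     | Part 1       |         null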

It might be more flexible and easier to work with a fixed schema rather than dynamically adding columns like that. You could store the sub parts as separate rows, with the main document id as the partition key and the sub part id as a clustering column, like this:

CREATE TABLE docs (
    id       bigint,
    sub_id   bigint,
    sub_name text,
    sub_size bigint,
    PRIMARY KEY (id, sub_id)
);

// insert the main document with sub_id set to -1 to indicate it is the main document name
INSERT INTO docs (id, sub_id, sub_name, sub_size) VALUES (123, -1, 'Main doc name', 456);

// insert a sub part of the main document
INSERT INTO docs (id, sub_id, sub_name, sub_size) VALUES (123, 1, 'Sub part 1 name', 789);

To get the document and all its parts:

SELECT * FROM docs WHERE id=123;
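
With the two sample inserts above, that query would return something like this (a sketch of the output):

//  id  | sub_id | sub_name        | sub_size
// -----+--------+-----------------+----------
//  123 |     -1 | Main doc name   |      456
//  123 |      1 | Sub part 1 name |      789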

This will return the main document name as the first row (since the clustered data is sorted by ascending sub_id) and the remaining rows will be the sub parts. This would also allow you to do range queries on the sub_id such as:

SELECT * FROM docs WHERE id=123 and sub_id > 20 and sub_id < 50; 

Another possibility would be to use one of the collection types, such as the list type. You could create a user-defined type for the sub_id, sub_name, sub_size tuple, and then in your main table have a list of sub parts (note that a UDT inside a collection must be frozen):

CREATE TYPE sub_doc (
    id   bigint,
    name text,
    size bigint
);

CREATE TABLE docs (
    id      bigint,
    name    text,
    size    bigint,
    subdocs list<frozen<sub_doc>>,
    PRIMARY KEY (id)
);
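
For example, inserting a document and later appending a sub part to its list would look something like this (a sketch; the values are made up):

// insert a document with one sub part in the list
INSERT INTO docs (id, name, size, subdocs)
VALUES (123, 'Main doc name', 456, [{id: 1, name: 'Sub part 1 name', size: 789}]);

// append another sub part to the existing list
UPDATE docs SET subdocs = subdocs + [{id: 2, name: 'Sub part 2 name', size: 321}] WHERE id = 123;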

Which to choose depends on the access and modification patterns you expect in your application. The list would be good if for some reason you wanted all the sub docs returned in a single row. The clustering approach makes it easier to add and remove sub parts on the fly and to access individual sub parts as separate rows.
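
For example, with the clustering table, adding or removing a single sub part is just an ordinary insert or delete (a sketch with made-up values):

// add a sub part to document 123
INSERT INTO docs (id, sub_id, sub_name, sub_size) VALUES (123, 2, 'Sub part 2 name', 321);

// remove that sub part again
DELETE FROM docs WHERE id = 123 AND sub_id = 2;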