0
votes

I'm trying to make a meeting/attendee program while learning HBase.

I'm debating about whether to denormalize a relationalship from 3rd Normal Form to either 1st Normal Form (atomocity/no sets of values within a column) to 0 Normal Form (not atomic/sets of values exist within one column).

In essence I'm trying to determine how to convert the following relational schema into HBase:

CREATE TABLE customer (
    customer_id INT PRIMARY KEY
    ,capacity INT
);
CREATE TABLE attendee (
    attendee_id INT PRIMARY KEY
    customer_id INT REFERENCES customer (customer_id)
);
CREATE TABLE customer_dedicated_hosts (
    customer_id INT REFERENCS customer (customer_id)
    ,dedicated_host_attendee_id INT REFERENCES attendee (attendee_id)
);
CREATE TABLE meeting (
    meeting_id INT PRIMARY KEY
    ,host_attendee_id INT REFERENCES attendee (attendee_id)
);
CREATE TABLE meeting_attendee (
    meeting_id INT
    ,attendee_id INT
    ,CONSTRAINT ... PRIAMRY KEY (meeting_id, attendee_id)
);

A "Customer" has 1:M Attendees.

An Attendee has M:N Meetings.

A Meeting is hosted by an attendee and thus links to Attendee via the host_attendee_id FK.

A Customer has a number of attendees who are allowed to host meetings -- listed in CustomerDedicatedHosts. If an attendee of a customer hosts meetings who is not a dedicated host, he should be fined.

Each meeting has a capacity of attendees for a given customer. If a customer goes over his capacity for a single meeting, he should be fined.

I'm curious if this should all be done in one or two tables with one column family -- denormalized table with tons of repetition. The equivalent of

CREATE TABLE hostapp (
    customer_id INT
    ,capacity
    ,dedicated_host_attendee_id
    --ROWKEY == customer_id, dedicated_host_attendee_id
);
CREATE TABLE meetingapp (
    customer_id INT
    ,attendee_id INT
    ,meeting_id INT
    ,host_attendee_id INT
    --ROWKEY == customer_id, meeting_id, attendee_id
);

I can't quite wrap my head around denormalization in this scenario. Why not split the "hostapp" into two tables, one with two columns (customer_id, capacity), and one with two columns (customer_id, dedicated_host_attendee_id). I guess I could live with the repeating host_attendee_id in the meetingapp table, but why not split the meeting app into two tables (customer_id, meeting_id, host_attendee_id) and (meeting_id, attendee_id)?

Is that the proper way to design this schema, or should it be done differently?

I'm also curious as to what extend I can abuse the columns in a column family, to use them like a nested table in oracle.

CREATE TABLE meetingapp (
    customer_id INT
    meeting_id INT
    host_attendee_id INT
    attendees VARRAY(<INT>)
);

In hbase terminology, one column family would always have the following three columns: customer_id, meeting_id, host_attendee_id. Either the same or another column family would have these columns: attendee1, attendee2, ... attendeeN; in other words, variable number of columns depending on the number of column families, similar to a nested table or varray in Oracle.

How is this best approached?

1

1 Answers

1
votes

There's a lot of flexibility in HBase and you can do all you describe and more (like putting actual data in the key etc.). To design the right schema you need to think about your access patterns into the data (both reads and writes)

For instance when you have to update a lot of columns together you may want to keep them in the same row (to get atomicity), if you need to have access via Hive (or other SQL frontends) you need to be more conservative in how you use columns, keys etc. if you access the data on some dimensions more frequently you promote that into the key or promote some data as well etc. etc.

So essentially - If you want advice about the right design you need to provide more context one what it is you are trying to do with the data