0
votes

I'm using HBase coupled with Phoenix for interactive analytics, and I'm trying to design my HBase row key for an IoT project, but I'm not sure I'm doing it right.

My database can be represented as something like this:

Client ---> Project ---> Cluster1 ---> Cluster2 ---> Sensor1
Client ---> Project ---> Building ---> Sensor2
Client ---> Project ---> Cluster1 ---> Building ---> Sensor3

What I have done is define a composite primary key of (Client_ID, Project_ID, Cluster_ID, Building_ID, Sensor_ID):

(1,1,1#2,0,1)
(1,1,0,1,2)
(1,1,1,1,3)

We can specify multiple clusters or buildings with a # separator (1#2#454, etc.), and if a node is missing we insert 0.

The column family will hold the sensor value and several metadata fields.

My question: is this HBase row key design valid for a request such as "get all sensors for the cluster with ID 1"?

I also thought of putting just Sensor_ID and TimeStamp in the key and storing all the routing in the column family, but I'm not sure that design is a good fit for my requests.

My third idea for this project is to combine Neo4j for the routing and HBase for the data.

Does anyone have experience with similar problems who can guide me toward the best approach for designing this database?

1
Do you have an idea of the maximum number of projects/clusters/sensors that a given client might have? - Marsellus Wallace
How many data points does each sensor generate? - Marsellus Wallace
@Gevorg No, I don't have any maximum number in mind. It's a top10 and top60 sensors, so it may generate around 1440 data points a day per sensor. Lately I've been looking at time series databases that fit well in the Hadoop ecosystem, like OpenTSDB. Any suggestions? - azelix
I think that you are on the right track. Make sure to deeply understand how data is stored in HBase and how OpenTSDB defines its schema to address the time series data domain. It is worth reading the documentation/manual of both technologies. - Marsellus Wallace

1 Answer

1
votes

It seems that you are dealing with time series data. One of the main risks of using HBase with time series data (or other forms of monotonically increasing keys) is hotspotting: all new writes land in the same region, so your cluster ends up behaving like a single machine.
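To make the hotspotting risk concrete, here is a minimal Python sketch that simulates how keys map to regions. It does not talk to HBase; the split points, the 1000 sensors, and the two key layouts are all assumptions chosen for illustration. The only HBase property it relies on is that a region holds a contiguous, sorted range of row keys.

```python
import bisect
import struct
from collections import Counter

def region_for(key, split_points):
    """Index of the region whose sorted key range contains `key`."""
    return bisect.bisect_right(split_points, key)

def writes_per_region(keys, split_points):
    """Count how many of the given writes land in each region."""
    return Counter(region_for(k, split_points) for k in keys)

NOW = 1_700_000_000  # hypothetical "current" Unix time
# One reading per (hypothetical) sensor within the current minute.
readings = [(sensor, NOW + sensor % 60) for sensor in range(1000)]

# Layout A: timestamp first. Regions were split on earlier timestamps,
# so every new key sorts after the last split point.
ts_splits = [struct.pack(">I", NOW - days * 86400) for days in (3, 2, 1)]
ts_first = writes_per_region(
    [struct.pack(">IH", ts, sensor) for sensor, ts in readings], ts_splits
)

# Layout B: sensor id first. Regions are split on sensor id ranges,
# so concurrent writes spread across all regions.
id_splits = [struct.pack(">H", sensor) for sensor in (250, 500, 750)]
id_first = writes_per_region(
    [struct.pack(">HI", sensor, ts) for sensor, ts in readings], id_splits
)

print("timestamp-first:", dict(ts_first))
print("sensor-first:   ", dict(sorted(id_first.items())))
```

With the timestamp leading, all 1000 writes hit the last region; with the sensor id leading, each of the four regions receives 250 writes.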

You should consider OpenTSDB on top of HBase, as it approaches the problem quite nicely. The single most important thing to understand is how it engineers the HBase schema and row key. Note that the timestamp is not the leading part of the key, and that the design assumes the number of distinct metric_uid values is much greater than the number of slave nodes and region servers (this is essential for a balanced cluster).

An OpenTSDB key has the following structure:

<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
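As a rough illustration of that layout, the following Python sketch assembles such a key by hand. The widths (3-byte UIDs, a 4-byte timestamp rounded down to the hour) and the sorting of tag pairs are my assumptions about OpenTSDB's defaults, and the numeric UIDs are made up; check the OpenTSDB documentation before relying on any of them.

```python
import struct

UID_WIDTH = 3          # assumed default UID width in bytes
ROW_SPAN_SECONDS = 3600  # assumed: one row per metric/tag set per hour

def uid(n):
    """Encode a numeric UID as a fixed-width big-endian byte string."""
    return n.to_bytes(UID_WIDTH, "big")

def row_key(metric_uid, timestamp, tags):
    """Build <metric_uid><timestamp><tagk1><tagv1>...<tagkN><tagvN>.

    `tags` maps tag-key UID -> tag-value UID. Pairs are sorted by
    tag-key UID so the same tag set always yields the same key.
    """
    base_time = timestamp - timestamp % ROW_SPAN_SECONDS
    key = uid(metric_uid) + struct.pack(">I", base_time)
    for tagk, tagv in sorted(tags.items()):
        key += uid(tagk) + uid(tagv)
    return key

# Hypothetical UIDs: metric 1, tag key 1 -> value 5, tag key 2 -> value 7.
key = row_key(1, 1_700_000_123, {2: 7, 1: 5})
print(key.hex())
```

Because the metric UID leads and the timestamp is rounded, readings from different metrics go to different parts of the key space, while readings from the same series in the same hour share one row.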

Depending on your specific use case you should engineer your metric_uid appropriately (maybe a compound key unique to a sensor reading) as well as the tags. Tags will play a fundamental role in data aggregation.
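One possible way to map the hierarchy from the question onto a metric plus tags is sketched below in plain Python, with no OpenTSDB client involved. The metric name, the tag names, the sample values, and sensor 4 are all hypothetical, and nested clusters (a cluster inside a cluster) are not handled here.

```python
from statistics import mean

# Each data point carries a metric name, a value, and a set of tags.
points = [
    {"metric": "sensor.temperature", "value": 21.5,
     "tags": {"client": "1", "project": "1", "cluster": "1",
              "building": "1", "sensor": "3"}},
    {"metric": "sensor.temperature", "value": 22.5,
     "tags": {"client": "1", "project": "1", "cluster": "1",
              "sensor": "4"}},
    # Sensor 2 sits directly under a building, so it has no cluster tag.
    {"metric": "sensor.temperature", "value": 19.0,
     "tags": {"client": "1", "project": "1", "building": "1",
              "sensor": "2"}},
]

def select(points, metric, **tag_filter):
    """Return the points of `metric` whose tags match every filter."""
    return [
        p for p in points
        if p["metric"] == metric
        and all(p["tags"].get(k) == v for k, v in tag_filter.items())
    ]

# "All sensors for the cluster with ID 1", then an aggregate over them.
in_cluster_1 = select(points, "sensor.temperature", cluster="1")
sensors = {p["tags"]["sensor"] for p in in_cluster_1}
average = mean(p["value"] for p in in_cluster_1)
print(sorted(sensors), average)
```

The point of the sketch is that the hierarchy level becomes a filter on a tag rather than a position in the row key, so missing levels are simply absent tags instead of placeholder zeros.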

NOTE: As of v2.0, OpenTSDB introduced the concept of Trees, which could be very helpful to 'navigate' your sensor readings and facilitate aggregations. I'm not too familiar with them, but I assume that you could create a hierarchical structure that helps determine which sensors are associated with which client, project, cluster, building, and so on.

P.S. I don't think that there is room for Neo4j in this project.