4
votes

I am working on a Big Data solution for sensor data and predictive analytics. I am new to Big Data, and have read about the lambda-architecture. I thought about using Cassandra Database together with Hadoop. Cassandra is a high available and Partition tolerance database and Hadoop hdfs a file system for large analytics jobs.

If I receive the data from a Internet of Things Device, should the data be saved first in Hadoop and then to Cassandra? The lambda architecture has Hadoop in batch layer, receiving the data and sending it to the serving layer to a nosql database.

Why should the data be first in Hadoop? and what kind of data is stored in Cassandra if Hadoop contains the raw data?

The stream layer is out of Focus at the moment. I just want to understand the usage of Cassandra and Hadoop together.

The data in Hadoop is for large analytics and in cassandra there should be the result from my Hadoop jobs.

Does that mean i can store my raw data in both? i can store my raw data in Cassandra and in Hadoop if not only the large analytics jobs are useful for my application?

Example

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:02:00′,’73F’);

if this is my insert and i have thousands of them in one single minute. I want to do some large jobs i use Hadoop ?

But also i need every single Data Row for my application without analytics. Cassandra is storing it too?

3
The data that you want to show real time to user needs to be in cassandra.Ashraful Islam
so i can also safe my data+timestamp on both? And additionally i have the possibility to do analytics on large datasets and pass the result also to cassandra in a Different table?Khan
how is the data passed from hadoop to cassandra if i dont want analytics on the raw data? or should the data be passed to both?Khan
if you don't want analysis you can just store data only to cassandra. Later you can copy data to hadoop if neededAshraful Islam
yes, later you can integrate with hadoop. If your data will grow then you data model must be good enough to support these link can help you academy.datastax.com/resources/… datastax.com/dev/blog/advanced-time-series-data-modelling stackoverflow.com/questions/30091144/… ebaytechblog.com/2012/07/16/…Ashraful Islam

3 Answers

1
votes

The trade off is between the latency and throughput. Hadoop is supposed to provide the high throughput but the latency is quite high. So hadoop is used for batch processing in lambda architecture. But there may be requirement when you would like to pass on the pre-computed data ( Or summarized data) to another layer like visualization layer .These precomputed data is basically stored in cassandra or hbase to have low latency.

1
votes

As you receive the data from a IoT Device, you need to save this data as quickly as you can. That's exactly what Cassandra is great for.
Than you need to process this data, and as the data amount is large, in the realistic case you do not want to have on-the-fly data processing, but to have batch(nightly, for example) processing instead.
And it's turn of Hadoop here.
So you have to extract the data from Cassandra, then put into Hadoop's file system (hdfs) and then do some processing (via Hive or Spark).
You could also think of having Cassandra-Spark direct streaming job, but I'd suggest to copy data from Cassandra first, as this allows to use this data as sandbox (to debug jobs, testing new algorithms, etc.) without any impact on Casandra cluster performance.

1
votes

You can read about Cassandra and big data here.
Disclaimer: I am the author of this post.