Big data lambda architecture with Cassandra and Hadoop

Question

I am working on a Big Data solution for sensor data and predictive analytics. I am new to Big Data and have read about the lambda architecture. I am thinking about using Cassandra together with Hadoop. Cassandra is a highly available, partition-tolerant database, and Hadoop's HDFS is a file system suited to large analytics jobs.

If I receive data from an Internet of Things device, should the data be saved first in Hadoop and then in Cassandra? The lambda architecture puts Hadoop in the batch layer, where it receives the data and sends results to a NoSQL database in the serving layer.

Why should the data go to Hadoop first? And what kind of data is stored in Cassandra if Hadoop contains the raw data?

The stream layer is out of scope at the moment. I just want to understand how Cassandra and Hadoop are used together.

As I understand it, the data in Hadoop is for large analytics jobs, and Cassandra should hold the results of my Hadoop jobs.

Does that mean I can store my raw data in both? Can I store my raw data in Cassandra as well as in Hadoop if my application needs more than just the large analytics jobs?

Example

INSERT INTO temperature (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:02:00', '73F');

If this is my insert and I have thousands of them in a single minute, do I use Hadoop when I want to run large jobs over them?

But my application also needs every single data row, without analytics. Does Cassandra store those too?

The data that you want to show to users in real time needs to be in Cassandra. — Ashraful Islam
So I can also save my data plus timestamp in both? And additionally I can run analytics on large datasets and pass the results to Cassandra in a different table? — Khan
How is the data passed from Hadoop to Cassandra if I don't want analytics on the raw data? Or should the data be written to both? — Khan
If you don't want analysis, you can store the data only in Cassandra. Later you can copy the data to Hadoop if needed. — Ashraful Islam
Yes, later you can integrate with Hadoop. If your data will grow, your data model must be good enough to support that. These links can help: academy.datastax.com/resources/… datastax.com/dev/blog/advanced-time-series-data-modelling stackoverflow.com/questions/30091144/… ebaytechblog.com/2012/07/16/… — Ashraful Islam

kanishka vatsa · Accepted Answer · 2016-11-22T06:53:42

The trade-off is between latency and throughput. Hadoop provides high throughput, but its latency is quite high, so Hadoop is used for batch processing in the lambda architecture. However, you may need to pass precomputed (or summarized) data on to another layer, such as a visualization layer. This precomputed data is typically stored in Cassandra or HBase so that it can be read with low latency.
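
To make the split concrete, here is a minimal Python sketch of the idea, not a real Hadoop or Cassandra integration. It assumes the raw readings are kept as an append-only list (standing in for files in HDFS), that a batch job summarizes them into hourly averages, and that a plain dictionary stands in for the low-latency serving table. The names RAW_EVENTS, build_batch_view and average_temperature are hypothetical, and the temperatures are stored as plain numbers rather than strings like '73F' to keep the arithmetic simple.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Raw, append-only readings. This list stands in for the master dataset
# that the batch layer would keep in HDFS (hypothetical sample data).
RAW_EVENTS = [
    ("1234ABCD", "2013-04-03 07:01:00", 72.0),
    ("1234ABCD", "2013-04-03 07:02:00", 73.0),
    ("1234ABCD", "2013-04-03 08:15:00", 75.0),
    ("5678EFGH", "2013-04-03 07:30:00", 60.0),
]


def hour_bucket(event_time: str) -> str:
    """Truncate a 'YYYY-MM-DD HH:MM:SS' timestamp to its hour."""
    ts = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%Y-%m-%d %H:00")


def build_batch_view(events):
    """Batch step: scan all raw events and precompute hourly averages.

    This is the high-throughput, high-latency part that a Hadoop job
    would play in the lambda architecture.
    """
    grouped = defaultdict(list)
    for station_id, event_time, temperature in events:
        grouped[(station_id, hour_bucket(event_time))].append(temperature)
    return {key: mean(values) for key, values in grouped.items()}


# The dictionary stands in for a summary table in Cassandra or HBase:
# one key lookup per query, no scan over the raw data.
serving_view = build_batch_view(RAW_EVENTS)


def average_temperature(station_id: str, hour: str):
    """Serving step: low-latency read of a precomputed result."""
    return serving_view.get((station_id, hour))


if __name__ == "__main__":
    print(average_temperature("1234ABCD", "2013-04-03 07:00"))
```

The batch function is the only place that touches every raw row; the read path only ever looks up a precomputed key, which is why the summary data belongs in the low-latency store.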
