
I hope someone experienced with Apache Ignite can help guide my team towards the right answer for a new setup.

Overall Setup

Data is continuously generated by many distributed sensors and streamed into our database. Each sensor may deliver several updates per second, but generally produces fewer than 10 updates/sec.

The daily data volume is approximately 50 million records per site.

Data Description

Each record consists of the following values:

  1. Sensor ID
  2. Point ID
  3. Timestamp
  4. Proximity

where 1 is our ID of the sensor, 2 is an ID of some point on the site, 3 is the time of the measurement, and 4 is the measured proximity from the sensor to the point. Approximately 1,000 such new records arrive each second. A record is never updated.
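For concreteness, a minimal Java sketch of such a record; the class and field names are my own, not an established schema:

```java
import java.io.Serializable;

/** Immutable sensor reading; one instance per record. */
public class SensorRecord implements Serializable {
    private final int sensorId;     // 1. our ID of the sensor
    private final int pointId;      // 2. ID of a point on the site
    private final long timestamp;   // 3. time of the measurement (epoch millis)
    private final double proximity; // 4. measured proximity from sensor to point

    public SensorRecord(int sensorId, int pointId, long timestamp, double proximity) {
        this.sensorId = sensorId;
        this.pointId = pointId;
        this.timestamp = timestamp;
        this.proximity = proximity;
    }

    public int getSensorId()     { return sensorId; }
    public int getPointId()      { return pointId; }
    public long getTimestamp()   { return timestamp; }
    public double getProximity() { return proximity; }
}
```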

Query Workload

Queries are fairly complex, with significant (and dynamic) look-back in time. A query may require data from several sensors on one site, and the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but it is frequently necessary to query over many days.

Generally, we therefore have a write-once, query-many scenario.
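I do not have Ignite code yet, but I imagine a typical look-back query over such records would boil down to something like the following ScanQuery sketch (the Long key and the SensorRecord class from above are assumptions for illustration):

```java
import java.util.Set;

import javax.cache.Cache;

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class LookBackQuery {
    /** All records of the given sensors with timestamp >= fromTs. */
    public static QueryCursor<Cache.Entry<Long, SensorRecord>> lookBack(
            IgniteCache<Long, SensorRecord> cache, Set<Integer> sensorIds, long fromTs) {
        // The filter is applied server-side, so only matching entries are returned.
        return cache.query(new ScanQuery<Long, SensorRecord>((key, rec) ->
                sensorIds.contains(rec.getSensorId()) && rec.getTimestamp() >= fromTs));
    }
}
```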

Initial Strategy

If we load the data into primitive integer arrays in, e.g., Java, the space consumption for a week approaches 5 GB. Because that is "peanuts" on today's platforms, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
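The intended setup would be something along these lines; the cache name is a placeholder, and SensorRecord is the sketch from above:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class ReplicatedSetup {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // REPLICATED: every node in the cluster keeps a full copy of the cache.
            CacheConfiguration<Long, SensorRecord> cfg =
                    new CacheConfiguration<>("sensorData");
            cfg.setCacheMode(CacheMode.REPLICATED);
            IgniteCache<Long, SensorRecord> cache = ignite.getOrCreateCache(cfg);
            cache.put(1L, new SensorRecord(42, 7, System.currentTimeMillis(), 1.5));
        }
    }
}
```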

However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data would need to be transferred across the network every second.

Creating chunks for, say, each minute or hour is not necessarily going to work (well) either, as a sensor can be temporarily offline and then deliver stale data at some later point in time.

My question is therefore how to efficiently handle this stream of updates while maintaining a consistent view of the data for the last 7-10 days.

My current, local implementation chunks the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine, but it is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation yet, so I have not been able to test this.
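For illustration, the chunk keying in my local implementation amounts to something like this (the names are illustrative, not my actual code):

```java
import java.io.Serializable;
import java.util.Objects;

/** Composite key identifying one sensor's 1-hour chunk of records. */
public class ChunkKey implements Serializable {
    private final int sensorId;
    private final long hourBucket; // timestamp truncated to the hour

    public ChunkKey(int sensorId, long timestampMillis) {
        this.sensorId = sensorId;
        this.hourBucket = timestampMillis / 3_600_000L; // ms per hour
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ChunkKey)) return false;
        ChunkKey k = (ChunkKey) o;
        return sensorId == k.sensorId && hourBucket == k.hourBucket;
    }

    @Override
    public int hashCode() {
        return Objects.hash(sensorId, hourBucket);
    }
}
```

Each ChunkKey maps to an array of records; when a late record arrives, the whole array for that key is rebuilt and put back, which is exactly the replacement cost I worry about in a cluster.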

Ideally, each node in the Ignite cluster would maintain its own copy of all data from the last X days and apply the small update workload continuously.

So my question is: how would fellow Igniters approach this problem?


1 Answer


It sounds like you want to scale the load across multiple servers, but that is not possible with replicated caches: each update will always go to every node, so the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and add nodes until the system is capable of handling the load.
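A minimal sketch of what that could look like; the cache name, Long key, and the SensorRecord class from the question are assumptions on my part. IgniteDataStreamer is the API intended for this kind of continuous high-rate ingestion:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class PartitionedIngest {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // PARTITIONED: each node holds only its share of the data,
            // so adding nodes spreads both storage and update traffic.
            CacheConfiguration<Long, SensorRecord> cfg =
                    new CacheConfiguration<>("sensorData");
            cfg.setCacheMode(CacheMode.PARTITIONED);
            cfg.setBackups(1); // one backup copy per partition for fault tolerance
            ignite.getOrCreateCache(cfg);

            // IgniteDataStreamer batches updates and routes each entry to the
            // node that owns its key, which keeps per-update network cost low.
            try (IgniteDataStreamer<Long, SensorRecord> streamer =
                         ignite.dataStreamer("sensorData")) {
                streamer.addData(1L, new SensorRecord(42, 7,
                        System.currentTimeMillis(), 1.5));
            }
        }
    }
}
```

If most queries touch all records of one sensor, using the sensor ID in the key (or as an affinity key) would additionally collocate each sensor's data on a single node.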