I hope someone experienced with Apache Ignite can help guide my team towards the answer regarding a new setup with Apache Ignite.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
Daily the magnitude of the data is approx. 50 million records, per site.
Data Description
Each record consists of the following values
- Sensor ID
- Point ID
- Timestamp
- Proximity
where 1, is our ID of the sensor, 2 is an ID of some point on the site, and 3 is a proximity measurement from the sensor to the point. Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load data into primitive integer arrays in, e.g., java, the space consumption for a week approaches 5 GB. Because that is "peanuts" in the platforms of today, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I image quite substantial amounts of data needs to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local, implementation is chunking the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation, yet, so I have not been able to test this.
Ideally, each node in the ignite cluster would maintain its own copy of all data within the last X days, and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?