I have two files in cloud storage. File1 is in Avro format and contains data from a temperature sensor:
time_stamp | Temperature
1000 | T1
2000 | T2
3000 | T3
4000 | T3
5000 | T4
6000 | T5
File2 is in Avro format and contains data from a wind sensor:
time_stamp | wind_speed
500 | w1
1200 | w2
1500 | w3
2200 | w4
2500 | w5
3000 | w6
I want to combine them into output like the following:
time_stamp | Temperature | wind_speed
1000 | T1 | w1 (most recent wind sensor reading, at 500)
2000 | T2 | w3 (most recent wind sensor reading, at 1500)
3000 | T3 | w6 (wind sensor reading at 3000)
4000 | T3 | w6 (most recent wind sensor reading, at 3000)
5000 | T4 | w6 (most recent wind sensor reading, at 3000)
6000 | T5 | w6 (most recent wind sensor reading, at 3000)
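To pin down the rule behind this table: each temperature row takes the wind reading with the greatest timestamp that is less than or equal to its own. Below is a minimal plain-Java sketch of that lookup, outside Beam, assuming both inputs fit in memory. The class and method names (AsOfJoinSketch, join, sampleTemps, sampleWinds) and the "none" placeholder are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AsOfJoinSketch {
    // Sample temperature readings from the question, keyed by timestamp.
    static TreeMap<Long, String> sampleTemps() {
        TreeMap<Long, String> m = new TreeMap<>();
        m.put(1000L, "T1"); m.put(2000L, "T2"); m.put(3000L, "T3");
        m.put(4000L, "T3"); m.put(5000L, "T4"); m.put(6000L, "T5");
        return m;
    }

    // Sample wind readings from the question, keyed by timestamp.
    static TreeMap<Long, String> sampleWinds() {
        TreeMap<Long, String> m = new TreeMap<>();
        m.put(500L, "w1"); m.put(1200L, "w2"); m.put(1500L, "w3");
        m.put(2200L, "w4"); m.put(2500L, "w5"); m.put(3000L, "w6");
        return m;
    }

    // For each temperature reading, pick the wind reading with the greatest
    // timestamp that is <= the temperature timestamp (or "none" if there is none).
    static List<String> join(TreeMap<Long, String> temps, TreeMap<Long, String> winds) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Long, String> t : temps.entrySet()) {
            Map.Entry<Long, String> w = winds.floorEntry(t.getKey());
            String wind = (w == null) ? "none" : w.getValue();
            rows.add(t.getKey() + "|" + t.getValue() + "|" + wind);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints one joined row per temperature reading.
        join(sampleTemps(), sampleWinds()).forEach(System.out::println);
    }
}
```

With the sample data this reproduces the six rows above. It only illustrates the matching rule; it is not a Beam transform and does not address unbounded input such as Pub/Sub.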
I am looking for a solution in Apache Beam to combine the files above. Right now the data is read from files, but in the future it may arrive via Pub/Sub. I want to find a custom way of combining the two PCollections to create another PCollection, tempDataWithWindSpeed.
PCollection<Temperature> tempData = p.apply(AvroIO
    .read(AvroAutoGenClass.class)
    .from("gs://my_bucket/path/to/temp-sensor-data.avro"));
PCollection<WindSpeed> windData = p.apply(AvroIO
    .read(AvroAutoGenClass.class)
    .from("gs://my_bucket/path/to/wind-sensor-data.avro"));
PCollection<WindSpeed> tempDataWithWindSpeed = ?
KV<Long, WindData> and KV<Long, TempeData>, where the key in both cases is the timestamp bin that the timestamp belongs to (e.g. 2200 belongs to key 2000, so you have to round down to the nearest thousand). Once you have created the groups, you can select the min, the max, or whichever sensor value you need. Hope this helps :) - jszule
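The binning idea in this comment can be sketched in plain Java as follows. It is a rough illustration only, assuming a bin width of 1000 and in-memory data; BinningSketch, binKey, and latestPerBin are hypothetical names. In a pipeline, the same key function would presumably produce the Long key of each KV.

```java
import java.util.Map;
import java.util.TreeMap;

class BinningSketch {
    // Assumed bin width: timestamps are rounded down to the nearest thousand.
    static final long BIN_SIZE = 1000L;

    // Rounds a timestamp down to the start of its bin, e.g. 2200 -> 2000.
    static long binKey(long timestamp) {
        return Math.floorDiv(timestamp, BIN_SIZE) * BIN_SIZE;
    }

    // Groups readings by bin and keeps, for each bin, the reading with the
    // largest timestamp (the "max" choice mentioned in the comment).
    static TreeMap<Long, String> latestPerBin(TreeMap<Long, String> readings) {
        TreeMap<Long, String> perBin = new TreeMap<>();
        // TreeMap iterates in ascending timestamp order, so a later reading
        // in the same bin overwrites an earlier one.
        for (Map.Entry<Long, String> e : readings.entrySet()) {
            perBin.put(binKey(e.getKey()), e.getValue());
        }
        return perBin;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> winds = new TreeMap<>();
        winds.put(500L, "w1"); winds.put(1200L, "w2"); winds.put(1500L, "w3");
        winds.put(2200L, "w4"); winds.put(2500L, "w5"); winds.put(3000L, "w6");
        // Prints the latest wind reading per 1000-wide bin.
        System.out.println(latestPerBin(winds));
    }
}
```

Note that binning is not the same as the "most recent reading" rule in the question: with the sample data, the temperature at 1000 shares its bin with the wind readings at 1200 and 1500, which come after it, and the bins for 4000 to 6000 contain no wind reading at all. Whether that is acceptable depends on the requirements.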