I am looking for a way to do a SQL-like lead/lag function in Google Dataflow/Beam. In my case, done in SQL, it would be something like:
lead(balance, 1) over(partition by orderId order by order_Date)
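To make the desired semantics concrete, here is a plain-Java illustration (names are illustrative, not Beam API) of what lead(balance, 1) produces within a single partition that is already sorted by order_Date: each row is paired with the next row's balance, and the last row gets null.

```java
import java.util.ArrayList;
import java.util.List;

public class LeadDemo {
    // lead(values, 1): each element is paired with the next one in order;
    // the last element gets null (no following row in the partition).
    static List<Double> lead(List<Double> balances) {
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < balances.size(); i++) {
            out.add(i + 1 < balances.size() ? balances.get(i + 1) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // one partition (a single orderId), already sorted by order_Date
        List<Double> balances = List.of(100.0, 250.0, 75.0);
        System.out.println(lead(balances)); // [250.0, 75.0, null]
    }
}
```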
In Beam, we parse the input text file and create a class Client_Orders to hold the data. For simplicity, let's say we have orderId, order_Date and balance members in this class. We then partition by orderId by constructing KV pairs in a PCollection:
PCollection<KV<String, Iterable<Client_Orders>>> mainCollection = pipeline
    .apply(TextIO.Read.named("Reading input file")
        .from(options.getInputFilePath()))
    .apply(ParDo.named("Extracting client order terms from file")
        .of(new DoFn<String, KV<String, Client_Orders>>() { ... })) // produces KV<orderId, Client_Orders>
    .apply(GroupByKey.<String, Client_Orders>create());
I know Beam supports windowing, but that generally requires setting a window size as a duration, e.g. Window.into(FixedWindows.of(Duration.standardDays(n))), which doesn't seem to help in this case. Should I instead sort each group by order_Date and iterate through the values?
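If sorting inside a DoFn after the GroupByKey is acceptable (i.e. each orderId group fits in memory, since Beam gives no ordering guarantee for grouped values), the per-key logic I have in mind could look like this sketch. ClientOrder here is a simplified stand-in for the Client_Orders class above, and leadBalances is a hypothetical helper, not Beam API:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LeadPerKey {
    // Simplified stand-in for the Client_Orders class described above.
    static class ClientOrder {
        final LocalDate orderDate;
        final double balance;
        ClientOrder(LocalDate orderDate, double balance) {
            this.orderDate = orderDate;
            this.balance = balance;
        }
    }

    // What a per-key DoFn body could do with the grouped values:
    // sort by order_Date, then pair each row with the next row's balance,
    // returning the lead(balance, 1) value per row (null for the last row).
    static List<Double> leadBalances(Iterable<ClientOrder> grouped) {
        List<ClientOrder> sorted = new ArrayList<>();
        grouped.forEach(sorted::add);
        sorted.sort(Comparator.comparing(o -> o.orderDate));
        List<Double> leads = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            leads.add(i + 1 < sorted.size() ? sorted.get(i + 1).balance : null);
        }
        return leads;
    }

    public static void main(String[] args) {
        // unsorted group, as it might arrive from GroupByKey
        List<ClientOrder> group = Arrays.asList(
            new ClientOrder(LocalDate.of(2016, 3, 1), 250.0),
            new ClientOrder(LocalDate.of(2016, 1, 1), 100.0),
            new ClientOrder(LocalDate.of(2016, 2, 1), 175.0));
        System.out.println(leadBalances(group)); // [175.0, 250.0, null]
    }
}
```

Is wrapping this kind of in-memory sort in a DoFn the idiomatic way to do this in Beam, or is there a better primitive for it?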