I am new to the Dataflow programming model and I have a few questions about the way Dataflow stores intermediate state in a windowed streaming process. Let's say I am windowing by day and then performing an aggregation. When a new event comes in, the aggregation needs access to all of the data already accumulated for that window and group.
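For concreteness, this is roughly the kind of pipeline I have in mind (a minimal sketch using the Apache Beam Python SDK; the source, keys, and values are just placeholders, not my real job):

```python
import time
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (p
     # Placeholder in-memory source standing in for a real streaming
     # source such as Pub/Sub; the keys and values are made up.
     | "CreateEvents" >> beam.Create([("user1", 5), ("user2", 3), ("user1", 7)])
     # Attach an event timestamp so the elements can be windowed.
     | "AddTimestamps" >> beam.Map(lambda kv: TimestampedValue(kv, time.time()))
     # Window by day (FixedWindows takes the window size in seconds).
     | "DailyWindows" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
     # Aggregation that, as I understand it, has to see every element
     # buffered for a given key in the current window.
     | "GroupPerKey" >> beam.GroupByKey()
     | "Print" >> beam.Map(print))
```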
Is this data stored in memory, on disk, in GCS, or somewhere completely different?
Does the volume of intermediate data affect the number of machines necessary for a job?
What happens to the data when the window is closed?
If I am performing an operation such as summing, which does not require all of the elements to be kept in intermediate state, is there a way to tell Dataflow to store only the running result of the latest update rather than every element?
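Concretely, is the difference between these two formulations what determines how much intermediate state is kept? (Again a sketch with the Beam Python SDK; my assumption about the buffering behavior is in the comments.)

```python
import apache_beam as beam

with beam.Pipeline() as p:
    events = p | "CreateEvents" >> beam.Create([("user1", 5), ("user2", 3), ("user1", 7)])

    # Variant 1: group then sum. My understanding is that this forces the
    # runner to buffer every element for the key until the window fires.
    grouped = (events
               | "GroupAll" >> beam.GroupByKey()
               | "SumGroups" >> beam.Map(lambda kv: (kv[0], sum(kv[1]))))

    # Variant 2: express the sum as a combiner, which I assume lets the
    # runner keep just a running total per key instead of all elements.
    combined = events | "SumPerKey" >> beam.CombinePerKey(sum)

    grouped | "PrintGrouped" >> beam.Map(print)
    combined | "PrintCombined" >> beam.Map(print)
```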