I have two sources that I need to join together. Say each of these sources holds about 100M rows of data, and I want to join the results of two queries that run against them. Conceptually, the join query looks like this:
SELECT *
FROM
(query1 against ElasticSearch source -- results are of unknown size)
JOIN
(query2 against BigQuery source -- results are of unknown size)
ON query1.joinkey = query2.joinkey
In other words, the results of query1 could be anywhere from 0 bytes/rows to 10GB/100M rows. Same with the results of query2.
How does Apache Beam or Cloud Dataflow deal with joins of 'unknown' size, for example when both queries are defined at run time? And for the case above, is Apache Beam a good fit, or might there be better options?
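To make concrete what I mean by the join semantics: my understanding is that Beam expresses joins via a shuffle-by-key (CoGroupByKey), which shouldn't need to know either side's size up front. Here's a plain-Python sketch of those semantics, just to illustrate what I'm asking about; this is not actual Beam code, and the sample rows are made up:

```python
from collections import defaultdict

def cogroup_join(left, right):
    """Simulate a CoGroupByKey-style inner join: group both inputs by key,
    then emit the cross product of matching rows for each key."""
    groups = defaultdict(lambda: ([], []))
    for key, row in left:           # rows from query1 (Elasticsearch side)
        groups[key][0].append(row)
    for key, row in right:          # rows from query2 (BigQuery side)
        groups[key][1].append(row)
    joined = []
    for key, (l_rows, r_rows) in groups.items():
        for l in l_rows:
            for r in r_rows:        # keys present on only one side produce nothing
                joined.append((key, l, r))
    return joined

# Hypothetical rows, already keyed by joinkey
query1_results = [("k1", {"es_field": 1}), ("k2", {"es_field": 2})]
query2_results = [("k1", {"bq_field": "a"}), ("k3", {"bq_field": "b"})]

print(cogroup_join(query1_results, query2_results))
# → [('k1', {'es_field': 1}, {'bq_field': 'a'})]
```

If this is roughly what Beam does under the hood, then presumably the runner shuffles both PCollections by key regardless of their size, which is what makes me think it can handle the unknown-size case.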
I suppose in the case where both sides may be arbitrarily large, it might work best to do the join (conceptually at least) as:
WITH query1 AS (
es_query -> materialize to BigQuery
)
SELECT *
FROM
query1 JOIN query2 USING (joinkey)
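The alternative I'm aware of, if one side happened to be small, is a map-side ("side input") join: materialize the small side into an in-memory lookup and stream the large side past it. Another plain-Python sketch, again illustrative rather than Beam API, with made-up row values:

```python
def side_input_join(small_side, large_side):
    """Map-side join: build a lookup table from the (hopefully small) side,
    then stream the large side past it. Only viable if small_side fits in memory."""
    lookup = {}
    for key, row in small_side:
        lookup.setdefault(key, []).append(row)
    for key, row in large_side:
        for match in lookup.get(key, []):
            yield (key, match, row)

small = [("k1", "es_row_1")]
large = [("k1", "bq_row_1"), ("k2", "bq_row_2")]
print(list(side_input_join(small, large)))
# → [('k1', 'es_row_1', 'bq_row_1')]
```

The problem is that with run-time queries I can't know in advance which (if either) side is small, so I can't choose this strategy up front; that's part of why I'm asking how Beam/Dataflow handles the general case.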