
I have two sources that I need to join together. Say each source holds about 100M rows of data, and I want to join the results of two queries that run against these sources. Conceptually, the join query looks like this:

SELECT *
FROM
    (query1 against ElasticSearch source -- results are of unknown size)
    JOIN
    (query2 against BigQuery source -- results are of unknown size)
    ON query1.joinkey = query2.joinkey

In other words, the results of query1 could be anywhere from 0 bytes/rows to 10GB/100M rows. Same with the results of query2.

How does Apache Beam or Cloud Dataflow handle joins of 'unknown' size, for example when the two inputs are defined by run-time queries? And for the case above, is Apache Beam a good fit, or are there better options?
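To make the semantics I'm after concrete, here is a pure-Python model of a co-group join, which is what I understand Beam's CoGroupByKey to do after shuffling both sides by key (the function and field names below are my own, not from either SDK):

```python
from collections import defaultdict

def cogroup_join(left_rows, right_rows, key):
    """Model of a shuffle-based co-group join: group both inputs by key,
    then emit the cross product of matching rows per key. Neither
    input's size needs to be known in advance."""
    groups = defaultdict(lambda: ([], []))
    for row in left_rows:
        groups[row[key]][0].append(row)
    for row in right_rows:
        groups[row[key]][1].append(row)
    for _, (lefts, rights) in groups.items():
        for l in lefts:
            for r in rights:
                yield {**l, **r}

# Tiny stand-ins for the two query results (hypothetical rows):
query1 = [{"joinkey": 1, "es_field": "a"}, {"joinkey": 2, "es_field": "b"}]
query2 = [{"joinkey": 1, "bq_field": "x"}]

print(list(cogroup_join(query1, query2, "joinkey")))
# -> [{'joinkey': 1, 'es_field': 'a', 'bq_field': 'x'}]
```

This is only the per-key grouping logic; in Dataflow the grouping would happen as a distributed shuffle rather than an in-memory dict.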

I suppose that when both inputs may be arbitrarily large, it might work best to do the join (conceptually at least) as:

WITH query1 AS (
    es_query -> materialize to BigQuery
)
SELECT *
FROM
     query1 JOIN query2 USING (joinkey)
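To sketch why materializing one side first might help (again just a pure-Python model, not real Beam or BigQuery code), once query1's results have been written out the join can proceed as a hash join: build a map over the materialized side, then stream the other side through it:

```python
from collections import defaultdict

def hash_join(materialized_rows, streamed_rows, key):
    """Model of materialize-then-join: index the side that has been
    written out (query1 above), then probe that index while streaming
    the other side (query2) through."""
    index = defaultdict(list)
    for row in materialized_rows:
        index[row[key]].append(row)
    for row in streamed_rows:
        for match in index.get(row[key], []):
            yield {**match, **row}

# Hypothetical rows standing in for the materialized ES results
# and the streamed BigQuery results:
query1 = [{"joinkey": 1, "es_field": "a"}]
query2 = [{"joinkey": 1, "bq_field": "x"}, {"joinkey": 3, "bq_field": "y"}]

print(list(hash_join(query1, query2, "joinkey")))
# -> [{'joinkey': 1, 'es_field': 'a', 'bq_field': 'x'}]
```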