2
votes

When you need to read all the data from one or more BigQuery tables in a Dataflow job, I would say there are two approaches. The first is to use BigQueryIO with from, which reads the table directly; the second is to use fromQuery with a query that reads all the data from the same table. So my question is:

  • Is there any cost or performance benefit to using one over the other?

I haven't found anything in the docs about this, but I would really like to know. I imagine that a direct read with from might be faster, since you don't need to run a query that scans the data, which makes it more similar to the preview functionality in the BigQuery UI. If that is true it might also be much cheaper, but it would also make sense if they both cost the same.

So in short, what is the difference between:

BigQueryIO.read(...).from(tableName)

And

BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)

1 Answer

8
votes

from is both cheaper and faster than fromQuery(SELECT * FROM ...).

  • from directly exports the table and exporting data is free for BigQuery.
  • fromQuery(SELECT * FROM ...) first runs a query that scans the entire table ($5/TB) and then exports the result.
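
To put rough numbers on the two bullets above, here is a minimal sketch of the cost difference. It assumes the $5/TB query price quoted in this answer and treats the export behind from as free; the class and method names are made up for illustration and are not part of BigQueryIO. It only covers the BigQuery side, not Dataflow worker time or storage for temporary files.

```java
public class ReadCostEstimate {
    // Assumption: on-demand query price taken from the answer above ($5 per TB scanned).
    static final double QUERY_PRICE_USD_PER_TB = 5.0;

    // fromQuery("SELECT * FROM table") scans the whole table, so the whole
    // table size is billed at the query price before the result is exported.
    static double fromQueryScanCostUsd(double tableSizeTb) {
        if (tableSizeTb < 0) {
            throw new IllegalArgumentException("table size must be non-negative");
        }
        return tableSizeTb * QUERY_PRICE_USD_PER_TB;
    }

    // from(table) exports the table directly; the answer states that
    // exporting is free, so no per-TB charge is modelled here.
    static double fromExportCostUsd(double tableSizeTb) {
        if (tableSizeTb < 0) {
            throw new IllegalArgumentException("table size must be non-negative");
        }
        return 0.0;
    }

    public static void main(String[] args) {
        double tableSizeTb = 2.0; // hypothetical table size
        System.out.printf("from:      $%.2f%n", fromExportCostUsd(tableSizeTb));
        System.out.printf("fromQuery: $%.2f%n", fromQueryScanCostUsd(tableSizeTb));
    }
}
```

Under these assumptions the gap grows linearly with table size, and it is paid on every pipeline run that uses fromQuery for a full-table read.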