
I have two ways of reading from BigQuery and filtering data:

  1. Read the whole table from BigQuery in Dataflow (using BigQueryIO.readTableRows().from(ValueProvider)) and then filter on a condition such as the max date.

  2. Read from BigQuery in Dataflow using a NestedValueProvider to build a query that fetches only the required data.

Reading the whole table is a problem because the table is in append mode, so the reading time will keep increasing as days pass.

If I read only a particular date's data instead, my pipeline's reading time stays consistent.

But for just 200 records, the NestedValueProvider approach takes much more time than reading the whole table with BigQueryIO.readTableRows().from(ValueProvider).

What am I missing? Can anyone help?

My snippet is below, please take a look.

Snippet:

PCollection<TableRow> targetTable = input.apply("Read TRUSTED_LAYER_TABLE_DESCRIPTION",
        BigQueryIO.readTableRows()
                .withoutValidation()
                .withTemplateCompatibility()
                .fromQuery(NestedValueProvider.of(
                        options.get(Constants.TABLE_DESCRIPTION.toString()),
                        new QueryTranslator(options.get(Constants.ETL_BATCH_ID.toString()))))
                .usingStandardSql());

NestedValueProvider translator class snippet:

import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

public class QueryTranslator implements SerializableFunction<String, String> {

    /**
     * Builds the query string at runtime: reads the whole table when
     * etlbatchid is -1, otherwise reads only the rows for that batch.
     */

    private static final long serialVersionUID = -2754362391392873056L;

    private final ValueProvider<String> etlbatchid;

    public QueryTranslator(ValueProvider<String> etlbatchid) {
        this.etlbatchid = etlbatchid;
    }

    @Override
    public String apply(String input) {
        String batchId = this.etlbatchid.get();
        if (batchId.equals("-1")) {
            // No batch id supplied: read the whole table.
            return String.format("SELECT * FROM `%s`", input);
        } else {
            // Restrict the read to a single batch.
            return String.format("SELECT * FROM `%s` WHERE etlbatchid = %s", input, batchId);
        }
    }
}
Not sure what you are asking. Can you try and clarify? - Graham Polley
@GrahamPolley I have updated the question again, which might help you. - BackBenChers

1 Answer


Depending on your use case, either of the two ways can be employed; consider the pros and cons of each before you choose.

The first one (reading the whole table) will be very fast, since Dataflow can easily split the workload into multiple shards and process them in parallel. The downside is that the cost is likely to be higher due to intensive CPU use.

The second option is expected to be slower because of the extra operations BigQuery performs: with fromQuery, a query job runs first and its result is materialized into a temporary table before Dataflow reads it, and that fixed overhead dominates for a small result set like your 200 records. It will be more cost effective, though. The downside of this option is that you will probably hit one or several BigQuery quotas and limits, which will require elaborate coding to work around.

You can also check whether you can implement these examples: reading the whole table, using a string query, and using a filter method (inspired by this StackOverflow thread).
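
For the filter method, a minimal sketch is below (untested, in the spirit of your snippet). It reuses the options.get(...) helper and Constants keys from your question and assumes they return ValueProvider<String>, as they do when passed to NestedValueProvider.of; the etlbatchid column name comes from your QueryTranslator.

// Sketch of the "read the whole table, then filter in Dataflow" approach.
// Assumes options.get(...) returns ValueProvider<String> as in the question.
ValueProvider<String> batchId = options.get(Constants.ETL_BATCH_ID.toString());

PCollection<TableRow> filtered = input
        .apply("Read whole table", BigQueryIO.readTableRows()
                .withoutValidation()
                .withTemplateCompatibility()
                .from(options.get(Constants.TABLE_DESCRIPTION.toString())))
        .apply("Filter by etlbatchid", Filter.by(
                (SerializableFunction<TableRow, Boolean>) row ->
                        // TableRow.get returns Object, so compare as strings.
                        batchId.get().equals(String.valueOf(row.get("etlbatchid")))));

Filter.by comes from org.apache.beam.sdk.transforms.Filter. Because ValueProvider.get() is only called inside the predicate at execution time, this stays compatible with templates.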