Reading from BigQuery and Filtering Data, I have 2 Way's
Read From BigQuery in Dataflow(Using BigqueryIO.readTableRow.from(ValueProvider)) Whole Data and Then Filter on Basis of Condition like max Date
Reading From BigQuery in Dataflow Using NestedValueProvider by making a Query Which will Only Fetch the Required Data is Much Slower.
Since there would be an Issue if I read whole Data and my Table is in Append Mode which will Increase the Time of Reading Data as Day Pass.
But if I read only Particular Date Data Which will Make my Pipeline Reading Time Consistent.
But for 200 Records Nested Value Provider is taking Much more time than Reading Whole Data Using BigqueryIO.readTableRow.from(ValueProvider).
What am I Missing anyone can help?
My Snippet is Below Please Find.
Snippet:
PCollection<TableRow> targetTable = input.apply("Read TRUSTED_LAYER_TABLE_DESCRIPTION", BigQueryIO
.readTableRows()
.withoutValidation()
.withTemplateCompatibility()
.fromQuery(NestedValueProvider.of(options.get(Constants.TABLE_DESCRIPTION.toString())
, new QueryTranslator(options.get(Constants.ETL_BATCH_ID.toString())))).usingStandardSql());
Nested Value Provider Class Snippet:
public class QueryTranslator implements SerializableFunction{
/**
* Read data with max etlbatchid from query
*/
ValueProvider<String> etlbatchid;
public QueryTranslator(ValueProvider<String> etlbatchid){
this.etlbatchid = etlbatchid;
}
private static final long serialVersionUID = -2754362391392873056L;
@Override
public String apply(String input) {
String batchId = this.etlbatchid.get();
if(batchId.equals("-1"))
return String.format("SELECT * from `%s`", input);
else
return String.format("SELECT * from `%s` where etlbatchid = %s;", input,batchId);
}
}