2 votes

I am getting "Transport error during HTTP request" on google dataflow on using BigQueryIO.Read.

Code:

PipelineOptionsFactory.register(UserClickOptions.class);
DataflowPipelineOptions options = 
  PipelineOptionsFactory.fromArgs(args).as(UserClickOptions.class);
Pipeline p = Pipeline.create(options);

......

final PCollection<TableRow> result =
  p.apply(BigQueryIO.Read.named("lookup")
      .fromQuery(Query.getLookupQuery()));

My query is like:

SELECT ROW_NUMBER() OVER() as Id, prop11 as X FROM (
  SELECT prop11 FROM
    (TABLE_DATE_RANGE([Mapping.network_v2_],
     DATE_ADD(CURRENT_TIMESTAMP(), -1, 'YEAR'), CURRENT_TIMESTAMP()))
  WHERE (REGEXP_MATCH(pagename, "^[1-9][^404]"))
  AND (prop11 IS NOT NULL AND prop11 != "")
  GROUP EACH BY prop11);

If I run the query in the BigQuery console it works, and the pipeline succeeds for 3 months of data. A full year is around 12 million rows.

Exception trace:

(9e64c4000646eed): Workflow failed. Causes: (d785ff42d5d40091): 
S06:CPNlookup+ParDo(Cpn2Csv)+ParDo(Row2KV)
+cpn2File/FileBasedSink.ReshardForWrite/Window.Into()
+cpn2File/FileBasedSink.ReshardForWrite/RandomKey
+cpn2File/FileBasedSink.ReshardForWrite/GroupByKey/Reify
+cpn2File/FileBasedSink.ReshardForWrite/GroupByKey/Write failed.,
(eec2b6fa988abf05): BigQuery: query execution in project "cip-newsuk-data" failed.,
(eec2b6fa988abbce): BigQuery execution failed.,
(eec2b6fa988ab897): Transport error during HTTP request.

Update 12/Apr/2016 :: I am also getting these failures for a single day's worth of data, but much less frequently than for 3 months of data.

1
Could you add the exceptions you saw indicating the transport error? Also, did the pipeline eventually succeed/fail (exceptions are automatically retried, so it is possible to have exceptions and still succeed). - Ben Chambers
@BenChambers exception added - gana
Thanks, we're taking a look. - Sam McVeety
When you run the query from the BigQuery console, how long is it reported as taking? - Ben Chambers
For a day query takes about 3-5 sec, 3 month takes about 20 sec, 6 months takes about 40-50 sec. - gana

1 Answer

3 votes

We think we have tracked this down and hope it will be fixed in the near future. In the meantime, there is a workaround that may unblock you for now.

The error you are seeing might be avoided if you try using TABLE_DATE_RANGE_STRICT instead of TABLE_DATE_RANGE. This will require that all tables expected by the date range are present, which may or may not be the case for you.

As an example, if you change your query to:

SELECT ROW_NUMBER() OVER() as Id, prop11 as X FROM (
  SELECT prop11 FROM
    (TABLE_DATE_RANGE_STRICT([Mapping.network_v2_],
     DATE_ADD(CURRENT_TIMESTAMP(), -1, 'YEAR'), CURRENT_TIMESTAMP()))
  WHERE (REGEXP_MATCH(pagename, "^[1-9][^404]"))
  AND (prop11 IS NOT NULL AND prop11 != "")
  GROUP EACH BY prop11);

See the documentation on table wildcard functions for more details on TABLE_DATE_RANGE_STRICT.
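Since the workaround is purely a matter of swapping one wildcard function name for the other, you can apply it mechanically to an existing query string before handing it to `BigQueryIO.Read.fromQuery`. Below is a minimal sketch of that idea; `toStrict` is a hypothetical helper of my own, not part of the Dataflow SDK or BigQuery API:

```java
public class StrictQuery {

    // Hypothetical helper (not part of any SDK): rewrite a legacy-SQL
    // query to use the strict table wildcard function. The search string
    // includes the opening parenthesis, so an already-strict query is
    // left untouched ("TABLE_DATE_RANGE_STRICT(" does not contain
    // "TABLE_DATE_RANGE(" as a substring).
    static String toStrict(String query) {
        return query.replace("TABLE_DATE_RANGE(", "TABLE_DATE_RANGE_STRICT(");
    }

    public static void main(String[] args) {
        String query =
            "SELECT prop11 FROM (TABLE_DATE_RANGE([Mapping.network_v2_], "
          + "DATE_ADD(CURRENT_TIMESTAMP(), -1, 'YEAR'), CURRENT_TIMESTAMP()))";
        // Prints the same query with TABLE_DATE_RANGE_STRICT substituted.
        System.out.println(toStrict(query));
    }
}
```

Remember that the strict variant fails the query outright if any table in the date range is missing, so only apply this if every daily table exists.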