Workflow/Pipeline failing on write for specific step

Question

We've created a pipeline, which is performing a transformation from 3 streams located in GCS ('Clicks', 'Impressions', 'ActiveViews'). We have the requirement that we need to write the individual streams back out to GCS, but to separate files (to be later loaded into BigQuery), because they all have slightly a different schema.

One of the writes has failed twice in succession with different errors each time, which turn causes the pipeline to fail.

These are the last 2 workflow/pipeline represented visually from the GDC, which show the failure:

Write failing

The 1st error:

Feb 21, 2015, 12:55:14 PM (b0cbc05dfc56dbd9): Workflow failed. Causes: (f98c177c56055863): Map task completion for Step "ActiveViews-GSC-write" failed. Causes: (2d838e694976dc6): Expansion failed for filepattern: gs://cdf/binaries/tmp-38156614004ed90e-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].avro.

The 2nd error:

Feb 21, 2015, 1:20:15 PM (19dcdcf1fe125eeb): Workflow failed. Causes: (2a27345ef73673d3): Map task completion for Step "ActiveViews-GSC-write" failed. Causes: (8f79a20dfa5c4d2b): Unable to view metadata for file: gs://cdf/binaries/tmp-2a27345ef7367fe6-00001-of-00015.avro.

It's only happening on the "ActiveViews-GCS-Write" step.

Any idea what we're doing wrong?

Does loading avro files to bigquery works for you? From what I see only CSV and JSON are supported. — G B
We're only using CSV files. I don't know why the error message says avro. — Graham Polley
polleyg@ Have you had a chance to check whether your original code is now working? — Jeremy Lewi

Graham Polley Graham Polley · Accepted Answer · 2015-02-24T00:27:06

We've found a workaround. The problem seems to be when more than one input source is specified and a flatten is used to merge them.

Using a flatten for the 2 input sources (e.g. all our files for 1st-2nd Feb) doesn't work (or we've done it wrong):

PCollection<String> pc1 = pipeline.apply(TextIO.Read.from("gs://<bucket_name>/NetworkImpressions_20150201*"); //1st Feb
PCollection<String> pc2 = pipeline.apply(TextIO.Read.from("gs://<bucket_name>/NetworkImpressions_20150202*"); //2nd Feb
PCollectionList<String> all = PCollectionList.of(pc1).and(pc2);
PCollection<String> flattened = all.apply(Flatten.<String>pCollections());

Instead, we just use GLOB (without a flatten) and it works every time:

pipeline.apply(TextIO.Read.from("gs://<bucket_name>/Files_2015020[12]*");

Workflow/Pipeline failing on write for specific step

2 Answers