1 vote

I want to skip leading rows when reading files in Google Dataflow. Is that feature available in the latest version? The files are kept in Google Cloud Storage, and I will be writing them to BigQuery.

The bq load command has the option --skip_leading_rows, which skips the leading rows when reading from the files.
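For reference, this is roughly how that is used (the dataset, table, bucket, and schema file below are placeholders):

bq load --source_format=CSV --skip_leading_rows=1 \
    mydataset.mytable gs://my-bucket/input.csv ./schema.json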

I want a similar feature in Google Dataflow. My input is in the format shown below.

I want Google Dataflow to ignore the first line and write only the rest of the lines to BigQuery.

[Screenshot of the input file: a header row followed by data rows.]

In general, the built-in TextIO transform does not support this, but let's try to find something that works. Could you edit the question and give a short example snippet of the format of the input you want to read? – jkff
Hey, but that question was answered almost 1.5 years ago, so new features may have been added to Dataflow since then. – abhishek jha
We can use Filter.byPredicate() to filter out the leading header in the input files, but my fear is that it will increase the execution time of the code, since the filter check is applied to every row in the input files. Skipping the header row is available in other similar technologies like Spark without impacting performance. – abhishek jha
@abhishekjha Filter.byPredicate() is the optimal thing to do here. I would not worry about performance in this case; telling the header row apart from the other rows in the format you showed in the example seems very CPU-cheap. – jkff

1 Answer

2 votes

This feature is not supported directly in Dataflow / ParDo.

You need to use Filter.byPredicate() (named Filter.by() in the Apache Beam SDK) to achieve this.

e.g.

PCollection<X> rows = ...;
PCollection<X> nonHeaders =
    rows.apply(Filter.by(new MatchIfNonHeader()));
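For completeness, here is a minimal end-to-end sketch of that approach with the Apache Beam Java SDK (the successor to the Dataflow SDK). Because TextIO reads files in parallel splits, the header cannot be skipped by position; it has to be recognizable by its content. The bucket path, the "id,name,value" header prefix, and the MatchIfNonHeader implementation below are assumptions for illustration only.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

public class SkipHeaderPipeline {

  // Returns true for every line that is NOT the header row.
  // "id,name,value" is a placeholder for the actual header line.
  static class MatchIfNonHeader implements SerializableFunction<String, Boolean> {
    @Override
    public Boolean apply(String line) {
      return !line.startsWith("id,name,value");
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read every line of the input files from Cloud Storage.
    PCollection<String> rows =
        p.apply(TextIO.read().from("gs://my-bucket/input/*.csv"));

    // Drop the header line(s); all other lines pass through unchanged.
    PCollection<String> nonHeaders =
        rows.apply(Filter.by(new MatchIfNonHeader()));

    // ... parse nonHeaders into TableRows and write them with BigQueryIO ...

    p.run();
  }
}

As noted in the comments, the per-element check is just a cheap string comparison, so it should have negligible impact on pipeline performance.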