
I'm trying to use the Google Dataflow Java SDK, but for my use case my input files are .parquet files.

I couldn't find any out-of-the-box functionality to read Parquet into a Dataflow pipeline as a bounded data source. As I understand it, I could create a coder and/or a custom source, a bit like AvroIO, based on the Parquet reader.

Can anyone advise on the best way to implement this, or point me to a reference with how-tos / examples?

Appreciate your help!

--A


1 Answer


You can find progress towards ParquetIO (the out-of-the-box functionality you're looking for) at https://issues.apache.org/jira/browse/BEAM-214.

In the meantime, it should be possible to read Parquet files using a Hadoop FileInputFormat in both the Beam and Dataflow SDKs.
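For example, here is a minimal sketch of that approach using Beam's HadoopFormatIO (called HadoopInputFormatIO in early Beam releases, from the beam-sdks-java-io-hadoop-format artifact) together with AvroParquetInputFormat from parquet-avro. The input path, the class name, and the record-to-string translation are all placeholders for illustration, not a definitive implementation:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquetViaHadoop {  // hypothetical class name for illustration
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Hadoop configuration telling HadoopFormatIO which InputFormat to use,
    // where to read from, and what key/value types the format produces.
    Configuration conf = new Configuration(false);
    conf.set("mapreduce.input.fileinputformat.inputdir",
        "/path/to/input/*.parquet");  // placeholder input path
    conf.setClass("mapreduce.job.inputformat.class",
        AvroParquetInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Void.class, Object.class);            // Parquet emits Void keys
    conf.setClass("value.class", GenericRecord.class, Object.class); // ...and Avro GenericRecord values

    // Translate each GenericRecord to its JSON string form so Beam can
    // pick a coder for the PCollection (GenericRecord has no default coder).
    PCollection<KV<Void, String>> records = p.apply(
        HadoopFormatIO.<Void, String>read()
            .withConfiguration(conf)
            .withValueTranslation(new SimpleFunction<GenericRecord, String>() {
              @Override
              public String apply(GenericRecord record) {
                return record.toString(); // Avro renders the record as JSON
              }
            }));

    // 'records' can now feed further transforms before running the pipeline.
    p.run().waitUntilFinish();
  }
}
```

Translating the value to a String is just one way to sidestep the lack of a coder for schema-less GenericRecords; if you know the Avro schema up front, you could instead keep the records and set an AvroCoder on the resulting PCollection.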