0 votes

Looking for advice on how to read parquet files from an HDFS cluster using Apache NiFi. There are multiple files present under a single directory in the cluster, and I want to read them all in one flow. Does NiFi provide a built-in component to read the files in an HDFS directory (parquet in this case)?

Example: 3 files present in the directory:

hdfs://app/data/customer/file1.parquet

hdfs://app/data/customer/file2.parquet

hdfs://app/data/customer/file3.parquet

Thanks!

2 Answers

1 vote

You can use the FetchParquet processor in combination with the ListHDFS/GetHDFS, etc. processors.

This processor was added in NiFi 1.2, and Jira NiFi-3724 addresses this improvement.

  • ListHDFS // stores state and runs incrementally.
  • GetHDFS // doesn't store state; gets all the files from the configured directory on every run (set the Keep Source File property to true in case you don't want the source files deleted).
  • You can also use other ways (UpdateAttribute, etc.) to add the fully qualified filename as an attribute to the flowfile, then feed the connection to the FetchParquet processor, which fetches those parquet files (see the sketch after this list).
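For example, a minimal sketch of that attribute-based approach, assuming a hypothetical attribute named absolute.hdfs.path; FetchParquet's Filename property just needs an expression that resolves to the full HDFS path:

  UpdateAttribute
    absolute.hdfs.path : /app/data/customer/file1.parquet   // hypothetical attribute holding the full path

  FetchParquet
    Filename : ${absolute.hdfs.path}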

FetchParquet reads the parquet files and writes them out in the format specified by the configured RecordWriter.

Flow:

ListHDFS/GetHDFS -> FetchParquet -> other processors
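A minimal property sketch for this flow, assuming the directory from the question; the Hadoop configuration file paths and the CSVRecordSetWriter controller service are placeholders for whatever your environment uses:

  ListHDFS
    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory                      : /app/data/customer

  FetchParquet
    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Filename                       : ${path}/${filename}   // default; built from ListHDFS's path/filename attributes
    Record Writer                  : CSVRecordSetWriter    // assumed; any configured RecordSetWriter works

Since ListHDFS emits one flowfile per file with path and filename attributes, the default Filename expression picks up file1.parquet, file2.parquet, and file3.parquet without extra configuration.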
0 votes

If your requirement is to read files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can take either of two approaches:

  • A combination of ListHDFS and FetchHDFS
  • GetHDFS

The difference between the two approaches is that GetHDFS re-lists the contents of the configured directory on every run, so it will produce duplicates. The former approach, however, keeps track of state, so only new additions and/or modifications are returned on each subsequent run.
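A minimal sketch of both approaches, assuming the directory from the question (FetchHDFS's default HDFS Filename expression shown explicitly):

  ListHDFS + FetchHDFS   // stateful; no duplicates
    ListHDFS
      Directory     : /app/data/customer
    FetchHDFS
      HDFS Filename : ${path}/${filename}

  GetHDFS                // stateless; re-lists the directory every run
    GetHDFS
      Directory        : /app/data/customer
      Keep Source File : false   // default: files are removed from HDFS after pickup

Note that GetHDFS only produces duplicates when Keep Source File is true; with the default of false it deletes each file from HDFS after fetching it, which avoids duplicates but is destructive.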