
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):

pyarrow.parquet.write_table(table, "example.parquet")

Now I want to read these files (and preferably get an Arrow Table) using a Java program.

In Python, I can simply use the following to get an Arrow Table from my Parquet file:

table = pyarrow.parquet.read_table("example.parquet")

Is there an equivalent and easy solution in Java?

I couldn't really find any good, working examples or any useful documentation for Java (only for Python). Some examples also don't provide all the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.

Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.

Also, if your solution uses Maven, please include the Maven dependencies.


I am on Windows and using Eclipse.


Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.

The PyArrow Table object is not part of the Apache Arrow specification and was not implemented in Java. I am trying to find a solution too. I have already implemented this with Spark 3.0.1 using Parquet, but I keep looking for a framework-independent solution. – João Paraná
Perhaps Dremio (github.com/dremio/dremio-oss) can provide a solution. – João Paraná

1 Answer
