
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):

pyarrow.parquet.write_table(table, "example.parquet")

Now I want to read these files (and preferably get an Arrow Table) using a Java program.

In Python, I can simply use the following to get an Arrow Table from my Parquet file:

table = pyarrow.parquet.read_table("example.parquet")

Is there an equivalent and easy solution in Java?

I couldn't really find any good, working examples or any useful documentation for Java (only for Python). Some examples also don't provide all the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.

Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.

Also, if your solution uses Maven, please include the Maven dependencies.


I am on Windows and using Eclipse.


Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.

The PyArrow Table object is not part of the Apache Arrow specification and was not implemented in Java. I am trying to find a solution too. I have already implemented this with Spark 3.0.1 using Parquet, but I keep looking for a framework-independent solution. – João Paraná
Perhaps Dremio (github.com/dremio/dremio-oss) can provide a solution. – João Paraná

1 Answer
