0 votes

I am running Drill 1.15 in distributed mode on datanodes only (3 nodes with 32 GB of memory each). I am trying to read a Parquet file in HDFS that was generated by a Spark job.

The generated file reads just fine in Spark, but when reading it in Drill, all but a few columns fail with the error below.

org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Exception occurred while reading from disk. File: [file_name].parquet Column: Line Row Group Start: 111831 File: [file_name].parquet Column: Line Row Group Start: 111831 Fragment 0:0 [Error Id: [Error_id] on [host]:31010]

In the Drill config for dfs, I have the default config for the parquet format.
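For reference, this is roughly what the stock parquet entry in the dfs storage plugin looks like (a sketch of the default configuration; the full plugin JSON contains other formats and workspaces as well):

```json
{
  "formats": {
    "parquet": {
      "type": "parquet"
    }
  }
}
```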

I am trying to run a simple query:

select * from dfs.`/hdfs/path/to/parquet/file.parquet`
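Since the error names the Line column, one way to narrow the failure down is to select columns individually instead of using *, for example (a hypothetical probe, using the column from the error message):

```sql
-- probe the column named in the DATA_READ ERROR on its own
SELECT `Line`
FROM dfs.`/hdfs/path/to/parquet/file.parquet`
LIMIT 10;
```

Repeating this per column would show whether the failure is limited to specific columns or their encodings.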

The file size is also only in the tens of MBs, not a lot.

I am using Spark 2.3 to generate the parquet file and Drill 1.15 to read it.

Is there any config I am missing, or some other point?

1
That's an interesting question, but not that valuable unless you can provide a minimal reproducible example. – user10465355
@user10465355 I have added a query sample and node information. Is there any other specific detail that you are looking for? I can definitely provide it. – Avik Aggarwal

1 Answer

1 vote

Looks like a bug.
Please create a Jira ticket and provide file.parquet and the log files.
Thanks