Presto failing to query hive table

Question

On EMR I created a Parquet dataset with Spark and stored it on S3. I can create an external table over it and query it with Hive, but when I run the same query with Presto I get the error below (the part file it refers to changes on every run).

2016-11-13T13:11:15.165Z        ERROR   remote-task-callback-36 com.facebook.presto.execution.StageStateMachine Stage 20161113_131114_00004_yp8y5.1 failed
com.facebook.presto.spi.PrestoException: Error opening Hive split s3://my_bucket/my_table/part-r-00013-b17b4495-f407-49e0-9d15-41bb0b68c605.snappy.parquet (offset=1100508800, length=68781800): null
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:475)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:247)
    at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createHiveRecordCursor(ParquetRecordCursorProvider.java:96)
    at com.facebook.presto.hive.HivePageSourceProvider.getHiveRecordCursor(HivePageSourceProvider.java:129)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:107)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:48)
    at com.facebook.presto.operator.TableScanOperator.createSourceIfNecessary(TableScanOperator.java:268)
    at com.facebook.presto.operator.TableScanOperator.isFinished(TableScanOperator.java:210)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:375)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
    at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
    at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:420)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:416)
    at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:416)
    ... 16 more

The Parquet location consists of 128 part files. The data is stored on S3 and encrypted using client-side encryption with KMS. Presto uses a custom encryption-materials provider (specified using presto.s3.encryption-materials-provider) that simply returns a KMSEncryptionMaterials object initialized with my master key. I am using EMR 5.1.0 (Hive 2.1.0, Spark 2.0.1, Presto 0.152.3).

stevel · Accepted Answer · 2016-11-15T10:48:20

Does this surface when encryption is turned off?

There was a bug report against the ASF s3a client (not the EMR one), where things broke when the length the filesystem listed != the actual file length. That is, because of the encryption, the file length in a listing was greater than the length available in a read.

We couldn't repro this in our tests, and our conclusion anyway was "filesystems must not do that" (indeed, it's a fundamental requirement of the Hadoop FS spec: the listed length must equal the actual length). If the EMR code is getting this wrong, then it's something in their driver which the downstream code cannot be expected to handle.
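
To illustrate why a length mismatch would end in an EOFException inside readFooter, here is a minimal in-memory sketch. It is not Presto's or parquet-mr's code. It only assumes the documented Parquet tail layout (footer, then a 4-byte little-endian footer length, then the magic bytes PAR1) and that a reader locates the footer by seeking relative to the length the filesystem reports. The helper names, the fake file contents, and the 16 extra bytes (standing in for encryption overhead) are all hypothetical.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files start and end with these 4 bytes


def build_fake_parquet(footer: bytes) -> bytes:
    """Hypothetical helper: bytes laid out like a Parquet file tail.

    Layout: magic, data, footer, 4-byte little-endian footer length, magic.
    """
    return MAGIC + b"row-group-bytes" + footer + struct.pack("<I", len(footer)) + MAGIC


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError, similar in spirit to readFully."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_footer(stream, reported_length: int) -> bytes:
    """Locate the footer by trusting the length the filesystem reported."""
    # The last 8 bytes are the footer length followed by the magic.
    stream.seek(reported_length - 8)
    footer_len = struct.unpack("<I", read_exact(stream, 4))[0]
    if read_exact(stream, 4) != MAGIC:
        raise ValueError("tail does not end with the Parquet magic")
    stream.seek(reported_length - 8 - footer_len)
    return read_exact(stream, footer_len)


if __name__ == "__main__":
    data = build_fake_parquet(b"schema-and-row-group-metadata")

    # Listed length equals readable length: the footer is found.
    print(read_footer(io.BytesIO(data), len(data)))

    # Listed length is larger than what can be read (assumed 16 bytes of
    # overhead): the seek lands past the end of the stream and the read fails.
    try:
        read_footer(io.BytesIO(data), len(data) + 16)
    except EOFError as exc:
        print("EOFError:", exc)
```

If this is what is happening, comparing the length returned by a listing with the number of bytes you can actually read from one of the failing part files should show the discrepancy.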
