
I'm using a Dropwizard web service to access a number of Parquet files. I need to use "real" SQL (strings), not the Spark DDL API (which I've tried and which does work, but doesn't meet my needs). I'm launching the service from Eclipse with Spark in standalone mode. The Spark version is 1.4.1.

The problem is that Spark doesn't resolve Parquet references within plain SQL, called like this (I have a test copy at ./bro/conn.parquet, relative to the folder I'm launching the web service from):

DataFrame df = sqlContext
        .sql(sql)
        .limit(5);

For example

http://localhost:8080/table/query?sql=select%20ts%20from%20%20parquet.`./bro/conn.parquet` 

fails with the error shown below. I've tried every permutation of that SQL statement I can think of (omitting the ./, absolute paths, omitting the backticks, etc.), but no joy.

Does Parquet access via SQL work at all, or only via the DDL API (which I can't use for this use case)? Alternatively, is there a way to load the DataFrame with the DDL API (DataFrame df = sqlContext.read().parquet(path)) and then apply full SQL commands (minus the FROM clause) to the result?
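For what it's worth, one workaround along those lines that exists in the Spark 1.x API is to load the file with the DataFrame reader and register it as a temp table, so that plain SQL can reference it by name. A minimal sketch (the table name "conn", the app name, and the presence of a "ts" column in ./bro/conn.parquet are assumptions):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParquetTempTable {
    public static void main(String[] args) {
        // Local standalone context, matching the question's setup
        SparkConf conf = new SparkConf()
                .setAppName("ParquetTempTable")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load via the DataFrame (DDL) API, then expose it to plain SQL
        DataFrame df = sqlContext.read().parquet("./bro/conn.parquet");
        df.registerTempTable("conn"); // table name "conn" is an assumption

        // Full SQL now works against the registered name
        DataFrame result = sqlContext.sql("SELECT ts FROM conn").limit(5);
        result.show();

        sc.stop();
    }
}
```

The SQL string still needs a FROM clause, but it refers to the registered name rather than a file path, which sidesteps the path-resolution problem entirely.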

0:0:0:0:0:0:0:1 - - [02/Jun/2016:12:47:06 +0000] "GET /table/query?sql=select%20ts%20from%20%20parquet.`./bro/conn.parquet` HTTP/1.1" 500 1483 74 74
ERROR [2016-06-02 13:02:29,810] com.yammer.dropwizard.jersey.LoggingExceptionMapper: Error handling a request: fc462d5554bce965
! java.lang.RuntimeException: Table Not Found: parquet../bro/conn.parquet
! at scala.sys.package$.error(package.scala:27)
! at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
! at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
! at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
! at scala.collection.AbstractMap.getOrElse(Map.scala:58)
! at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115)
! at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:222)
! at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
! at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
! at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
! at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
! at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
! at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
! at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
...
0:0:0:0:0:0:0:1 - - [02/Jun/2016:13:02:29 +0000] "GET /table/query?sql=select%20ts%20from%20%20parquet.`./bro/conn.parquet` HTTP/1.1" 500 1483 20 20

1 Answer


This is caused by spark-sql 1.4.1 not supporting this syntax: querying files directly from SQL (FROM parquet.`path`) only arrived in the 1.6 line. Upgrading to 1.6.1 fixed it.
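For reference, after the upgrade the original query should resolve as written. A minimal sketch (assuming ./bro/conn.parquet exists and contains a "ts" column):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DirectParquetQuery {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("DirectParquetQuery")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // On Spark 1.6+, a backticked path after the data-source name
        // is resolved directly; no registered table is needed.
        DataFrame df = sqlContext
                .sql("SELECT ts FROM parquet.`./bro/conn.parquet`")
                .limit(5);
        df.show();

        sc.stop();
    }
}
```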