
Is there a way in Apache Spark to create a Spark SQL proxy table that simply proxies to an underlying (custom) data source?

I have a custom data source that supports predicate pushdown by implementing org.apache.spark.sql.sources.PrunedFilteredScan and now I would like to use Spark SQL against that data source where filter predicates are passed through (pushed down) to the data source. Registering the data source as an ordinary temporary table (using sqlContext.read.format("mydatasource").load().createOrReplaceTempView("myTable")) is not an option as this will ultimately pull all data into Spark.
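For reference, a relation with pushdown support looks roughly like the following sketch (class and method bodies here are placeholders, not the actual data source; only the `PrunedFilteredScan` contract is real):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation: the schema and the backing fetch are placeholders.
class MyRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", StringType)))

  // Spark passes in the required columns and the filters it can push down;
  // the relation only needs to produce rows matching those filters.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    fetchFromBackend(requiredColumns, filters) // placeholder for the custom source

  private def fetchFromBackend(cols: Array[String],
                               filters: Array[Filter]): RDD[Row] = ???
}
```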

"as this will ultimately pull all data into Spark." can you elaborate how you checked that out and/or what made you think that it happens? - Jacek Laskowski

1 Answer


Neither temporary views (Dataset.createTempView and Dataset.createOrReplaceTempView) nor external tables (Catalog.createExternalTable before 2.2, Catalog.createTable since 2.2) should pull all data into Spark, and all these options support prtedicate pushdowns to the same extent as the underlying source.