2 votes

I have a simple Spark SQL query:

SELECT x, y
FROM t1 INNER JOIN t2 ON t1.key = t2.key
WHERE expensiveFunction(t1.key)

where expensiveFunction is a Spark UDF (user-defined function).
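(For reference, a minimal sketch of how such a UDF might be registered; the key type and the function body here are assumptions:)

// Hypothetical registration; assumes the key column is a String.
spark.udf.register("expensiveFunction", (key: String) => {
  // ...some costly check; returns a Boolean used by the WHERE clause
  key != null
})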

When I look at the query plan generated by Spark, I see that it has two filter operations instead of just one: it checks not only expensiveFunction(t1.key), but also expensiveFunction(t2.key).


In general, this optimization is not a bad thing, because it reduces the number of records to join, and joining is an expensive operation. But in my case, expensiveFunction(t2.key) always returns true, so I would like to remove it.

Is there a way to change the query plan before executing a query? Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?


3 Answers

2 votes

Is there a way to change the query plan before executing a query?

In general, yes. There are a few extension points in the Spark SQL query planner and optimizer that make this doable.

Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?

That's nearly impossible unless the optimization rule allows for it. In other words, you'd have to find out whether the rule has an option to turn it off, e.g. CostBasedJoinReorder with the spark.sql.cbo.enabled or spark.sql.cbo.joinReorder.enabled configuration properties (when either is off, CostBasedJoinReorder does nothing).
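For example, a minimal sketch of turning that particular rule off at runtime:

// With cost-based optimization disabled, CostBasedJoinReorder does nothing.
spark.conf.set("spark.sql.cbo.enabled", "false")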

You could write a custom logical operator that would make the optimization void (since the rule would not match an unknown logical operator), and then remove the operator again at the optimization phase.

Use extendedOperatorOptimizationRules to register custom optimizations.
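A minimal sketch of that wiring, assuming a placeholder rule named MyRule (the name and the no-op body are mine, not the asker's); SparkSessionExtensions.injectOptimizerRule registers the rule into extendedOperatorOptimizationRules:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: returns the plan unchanged; a real rule would
// transform the plan here (e.g. strip the unwanted Filter).
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(_ => MyRule))
  .getOrCreate()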

2 votes

This is happening because of the optimizer rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints. The code comment is as follows (GitHub):

  /**
   * Infers an additional set of constraints from a given set of equality constraints.
   * For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an
   * additional constraint of the form `b = 5`.
   */
  def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression] 

In the question's query, the join condition t1.key = t2.key together with the filter expensiveFunction(t1.key) lets the rule infer expensiveFunction(t2.key), which is the second filter you see in the plan. You could disable this optimizer rule using spark.sql.optimizer.excludedRules:

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
  .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
    "specified by their rule names and separated by comma. It is not guaranteed that all the " +
    "rules in this configuration will eventually be excluded, as some rules are necessary " +
    "for correctness. The optimizer will log the rules that have indeed been excluded.")
  .stringConf
  .createOptional

That way the filter will not get propagated to both sides of the join.
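A minimal sketch of excluding the rule at runtime (the property is available since Spark 2.4):

spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints")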

1 vote

You can rewrite this query as below to avoid the extra function call.

SELECT x, y
FROM (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) t0
INNER JOIN t2 ON t0.key = t2.key

To be extra sure, you can persist the result of this query (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) as a separate DataFrame, and then join table t2 with that DataFrame.

For example, let's say we have DataFrames df1 and df2 for tables t1 and t2 respectively. We can do something like the following to avoid calling expensiveFunction twice.

val df3 = df1.filter("expensiveFunction(key)") // assumes expensiveFunction is registered as a UDF
df3.persist() // marks df3 for caching; persist alone is lazy
df3.count()   // an action forces evaluation, materializing the filtered result
df3.createOrReplaceTempView("t1") // shadows the original t1 with the pre-filtered result
spark.sql("""SELECT t1.col1, t2.col2
FROM t1 INNER JOIN t2 ON t1.col2 = t2.col1""") // this query now has no reference to expensiveFunction