0
votes

After reading some articles about Whole-Stage Code Generation, I understand that Spark does bytecode optimizations to convert a query plan into an optimized execution plan.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
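For example (this is just how I have been poking at it locally with the plain DataFrame API; the query itself is arbitrary), the operators covered by whole-stage codegen show up with a `*` prefix in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

// A throwaway local session, only for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("wscg-demo")
  .getOrCreate()
import spark.implicits._

val q = spark.range(0, 1000000)
  .filter($"id" % 2 === 0)
  .selectExpr("sum(id)")

// Operators that participate in whole-stage codegen are prefixed with '*'
// in the physical plan printed by explain().
q.explain()
```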

Now my next question is: even after these bytecode-level optimizations, the conversion of those bytecode instructions into machine code instructions could still be a bottleneck, because that conversion is done by the JIT compiler alone at runtime, and for it to take place the JIT needs enough runs of the code.

So does Spark do anything related to dynamic/runtime conversion of the optimized bytecode (the outcome of whole-stage codegen) into machine code, or does it rely on the JIT to convert those bytecode instructions into machine code instructions? Because if it relies on the JIT, then certain uncertainties are involved.

Also, one more thing I am somewhat interested in knowing: if Spark does not do this optimization, then it might be slow in some cases compared to regular query engines that do not generate bytecode per query. The reason is that they use the same code segment over and over again, which eventually gets JIT-compiled (when on the hot path), which might never be the case with Spark, because we generate different, optimized bytecode for each type of query. – mridul_verma
As I'm the author of the linked article, I'd be interested in the "articles about Whole-Stage Code Generation" you've read. I'd like to read them to explore this area better. – Jacek Laskowski
No issues, but again, to my point: it seems that if Spark does not do machine code generation itself, then the JIT might not actually kick in, which could make it slow in some cases. Does that make sense? – mridul_verma

1 Answer

5
votes

Spark does bytecode optimizations to convert a query plan into an optimized execution plan.

Spark SQL does not do bytecode optimizations.

Spark SQL simply uses the CollapseCodegenStages physical preparation rule and eventually converts a query plan into single-method Java source code (which Janino compiles into bytecode).
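If you want to look at that single-method Java source yourself, a quick way (a sketch, assuming a Spark 2.x+ SparkSession available as `spark`) is the debug codegen facility:

```scala
// Prints the Java source generated for every WholeStageCodegen subtree of
// the physical plan, i.e. the code that Janino compiles to bytecode.
import org.apache.spark.sql.execution.debug._

val q = spark.range(0, 1000000).selectExpr("sum(id)")
q.debugCodegen()

// SQL alternative:
spark.sql("EXPLAIN CODEGEN SELECT sum(id) FROM range(1000000)").show(false)
```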

So does Spark do anything related to dynamic/runtime conversion of optimized bytecode

No.


Speaking of JIT, WholeStageCodegenExec checks whether the whole-stage codegen produces "too long generated codes", i.e. code whose compiled method size exceeds the spark.sql.codegen.hugeMethodLimit Spark SQL internal property (8000 by default, which is the value of HugeMethodLimit in the OpenJDK JVM settings).

The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the current query plan. The default value is 8000 and this is a limit in the OpenJDK JVM implementation.
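If you suspect a generated method is so big that HotSpot refuses to JIT-compile it, you can adjust that threshold at runtime (a sketch; the value shown simply mirrors the default quoted above):

```scala
// Subtrees whose compiled method would exceed this bytecode size fall back
// to the non-whole-stage-codegen path, so they are executed by small,
// JIT-friendly methods instead of one huge method the JIT refuses to compile.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 8000)

// Verify the effective value.
println(spark.conf.get("spark.sql.codegen.hugeMethodLimit"))
```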


There are not that many physical operators that support CodegenSupport, so reviewing their doConsume and doProduce methods should reveal whether, if at all, the JIT might not kick in.
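A quick way to see which operators of a concrete plan actually take part in whole-stage codegen (a sketch that relies on internal APIs, so it may differ between Spark versions):

```scala
// Collect the physical operators in the executed plan that implement
// CodegenSupport, i.e. that can contribute doProduce/doConsume code to a
// whole-stage-generated method. Internal API, subject to change.
import org.apache.spark.sql.execution.CodegenSupport

val q = spark.range(0, 1000000).filter("id % 2 = 0").selectExpr("sum(id)")

val codegenOperators = q.queryExecution.executedPlan.collect {
  case op: CodegenSupport if op.supportCodegen => op.nodeName
}
codegenOperators.foreach(println)
```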