We are running in AWS Glue with Python 3 and Spark 2.4, using the pyspark SparkSQL module. I am trying to figure out whether it is faster to run an operation using the native, built-in Spark API transformations (map, filter, etc.) than to run the same operation as a raw SparkSQL query.
For example, if I want to do a simple filter operation (say, grabbing only the rows where the year is 2020), are the following two approaches equivalent in performance?
# Option 1: built-in DataFrame API
data = spark.read.csv(filepath)
data_filtered = data.filter(data.year == 2020)
# Option 2: raw SparkSQL query against a temp view
data = spark.read.csv(filepath)
data.createOrReplaceTempView('data')
data_filtered = spark.sql("SELECT * FROM data WHERE year = 2020")
This is obviously a very simple example, but the same question applies to other operations as well - joins, unions, mapping, etc. What if operations are being chained together? Would a sequence of transformations and actions written as a long, complicated SQL query be as optimized as the same sequence written using the built-in operations?
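To illustrate what I mean by chaining, here is a rough sketch of the kind of pipeline I have in mind, written both ways (the table and column names here - other, id, category - are made up purely for illustration):

# DataFrame API version: filter, join, and aggregate chained together
result_api = (
    data.filter(data.year == 2020)
        .join(other, on='id', how='inner')
        .groupBy('category')
        .count()
)

# The same pipeline expressed as one raw SparkSQL query
data.createOrReplaceTempView('data')
other.createOrReplaceTempView('other')
result_sql = spark.sql("""
    SELECT d.category, COUNT(*) AS count
    FROM data d
    JOIN other o ON d.id = o.id
    WHERE d.year = 2020
    GROUP BY d.category
""")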
Part of my confusion stems from the fact that I'm unsure how Spark is optimizing the execution plan for raw queries, as opposed to built-in operations. I know that there are transformations and actions, which are distinct in the API. Does the Spark backend categorize operations similarly for queries?
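For what it's worth, I assume I could compare the plans Catalyst produces for each approach with explain(), along the lines of the sketch below, but I'm not sure whether identical plans would actually guarantee identical runtime behavior at our scale:

# Print parsed, analyzed, optimized, and physical plans for both approaches
data_filtered_api = data.filter(data.year == 2020)
data_filtered_api.explain(True)

data.createOrReplaceTempView('data')
data_filtered_sql = spark.sql("SELECT * FROM data WHERE year = 2020")
data_filtered_sql.explain(True)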
Normally, I would just run a few experiments, but we are working with very large data, AWS Glue jobs are billed by time and compute resources, and fully answering this question would require testing many different chained operations, so I would much prefer to have the question answered here.