
We are executing within AWS Glue, using Python 3 and Spark 2.4 with the pyspark SparkSQL module. I am trying to figure out if it's faster to run an operation using the native, built-in Spark API transformations (map, filter, etc.) than it is to run the same operation by executing a raw SparkSQL query.

For example, if I want to do a simple filter operation (let's say, grabbing only the rows of data where the year is 2020), are the following equivalent?

# DataFrame API version
data = spark.read.csv(filepath)
data_filtered = data.filter(data.year == 2020)

# SparkSQL version
data = spark.read.csv(filepath)
data.createOrReplaceTempView('data')
data_filtered = spark.sql("SELECT * FROM data WHERE year = 2020")

This is obviously a very simple example, but the same question applies to other operations as well - joins, unions, mapping, etc. What if operations are being chained together? Would a sequence of transformations and actions written as a long, complicated SQL query be as optimized as the same sequence written using the built-in operations?
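To make the chaining case concrete, here is a rough sketch of the kind of pipeline I mean, written both ways (the country and sales columns are just made up for illustration):

from pyspark.sql import functions as F

# Chained DataFrame API version
result_api = (data
    .filter(data.year == 2020)
    .groupBy('country')
    .agg(F.sum('sales').alias('total_sales'))
    .orderBy(F.col('total_sales').desc()))

# Same logic expressed as a single SQL query against a registered view
data.createOrReplaceTempView('data')
result_sql = spark.sql("""
    SELECT country, SUM(sales) AS total_sales
    FROM data
    WHERE year = 2020
    GROUP BY country
    ORDER BY total_sales DESC
""")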

Part of my confusion stems from the fact that I'm unsure how Spark is optimizing the execution plan for raw queries, as opposed to built-in operations. I know that there are transformations and actions, which are distinct in the API. Does the Spark backend categorize operations similarly for queries?

Normally I would just run a few experiments, but we're working with very large data, AWS Glue jobs are billed by time and compute resources, and fully answering this question would require testing many different operations chained together, so I would much prefer to have it answered here.

1 Answer


According to the documentation, performance is the same whichever API you use: DataFrame, Dataset, or Spark SQL. Both go through the same optimizer, and internally everything is converted to RDDs (resilient distributed datasets), the default abstraction that Spark provides. You can compare the explain plan for the Spark SQL query and the equivalent DataFrame code, and it will be the same. If you perform multiple actions such as count, take, or collect, each one materializes the plan, and that can produce timing differences, but in most cases you will get the same performance from Spark SQL and the DataFrame API when you run the same operations.
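As a minimal sketch of how you could verify this without a long Glue run (reusing the data DataFrame and year column from the question), print the physical plan for both versions; the two plans should come out the same:

# DataFrame API version of the filter
api_df = data.filter(data.year == 2020)
api_df.explain()

# Spark SQL version against a registered temp view
data.createOrReplaceTempView('data')
sql_df = spark.sql("SELECT * FROM data WHERE year = 2020")
sql_df.explain()

Running explain() is cheap because it only builds and prints the plan; nothing is executed against the data.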