
Is it possible to approximate the size of a derived table (in KB/MB/GB, etc.) in a Spark SQL query? I don't need the exact size; an approximate value would do. It would let me plan my queries better by determining whether a table could be broadcast in a join, or whether using a filtered subquery in a join would be better than using the entire table.

For example, in the following query, is it possible to approximate the size (in MB) of the derived table named b? This would help me figure out whether it is better to use the derived table in the join or to use the entire table with the filter applied outside:

select
    a.id, b.name, b.cust
from a
left join (select id, name, cust
           from tbl
           where size > 100) b
    on a.id = b.id

We are using Spark SQL 2.4. Any suggestions are appreciated.

1 Answer


I have had to do something similar before (to work out how many partitions to split into when writing).

What we ended up doing was working out an average row size, doing a count on the DataFrame, and then multiplying the two together.
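
A minimal sketch of that approach in Scala on Spark 2.4, assuming approxSizeInBytes is our own helper (not anything Spark provides) and that the byte length of each row's string form, taken from a small sample, is an acceptable stand-in for its real serialized size:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Rough estimate: average bytes per row (from a sample) * total row count.
def approxSizeInBytes(df: DataFrame, sampleFraction: Double = 0.01): Long = {
  val sampled = df.sample(withReplacement = false, sampleFraction)
  // Fall back to a bounded slice of the DataFrame if the sample comes back empty.
  val sample = if (sampled.head(1).isEmpty) df.limit(1000) else sampled

  // Length of the comma-joined string form is a crude proxy for row size.
  val avgRowBytes = sample.rdd
    .map(_.mkString(",").getBytes("UTF-8").length.toDouble)
    .mean()

  (avgRowBytes * df.count()).toLong
}

// For example, to size the derived table b from the question:
val spark = SparkSession.builder.getOrCreate()
val b = spark.sql("select id, name, cust from tbl where size > 100")
println(f"b is roughly ${approxSizeInBytes(b) / 1024.0 / 1024.0}%.1f MB")

An estimate like this is only in the right ballpark (string length is not the same as the in-memory or serialized size), but it was good enough for decisions like picking a partition count or judging whether a table is small enough to broadcast.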