PySpark - Broadcast spark dataframe

Question

I am trying to broadcast a Spark DataFrame. I have tried a couple of approaches but have not been able to broadcast it. I want to loop over the columns of another DataFrame for some processing, restricted to the columns whose Result is 1 in SchemaWithHeader. For example, the loop should cover the columns Name, Age and salary.

Approach 1

SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]

rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))

This fails with the following error:

 SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
AttributeError: 'Broadcast' object has no attribute 'map'

A DataFrame doesn't have a broadcast method. I am not using a SQL query to join the two DataFrames; instead I use a loop to access the SchemaWithHeader DataFrame.

Approach 2

SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]

rdd = spark.sparkContext.parallelize(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))

SchemaDF = spark.createDataFrame(SchemaWithHeader)
spark.sparkContext.broadcast(SchemaDF)
SchemaDF.registerTempTable("DFSchema")

This fails with the following error:

py4j.Py4JException: Method __getstate__([]) does not exist

Harjeet Kumar · Accepted Answer · 2018-12-26T05:59:24

The error says it all. In this line of your code:

rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)

rdd is a broadcast variable, not an RDD, so it has no map method. Read its contents through rdd.value, which here is the original Python list, and build the rows from that:

SchemaWithHeader = [Row(ColName=x[0], Result=bool(x[1])) for x in rdd.value]

Hope This helps... Keep Sharing with Community :)

Edit 1: Since you are broadcasting a list, rdd.value gives you a plain Python list. A Python list has no map method, so calling rdd.value.map(...) raises the error mentioned in the comments; use a list comprehension or the built-in map() instead. Moreover, if you try to broadcast an RDD you will get the following error: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations;"
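To make the point about lists concrete, here is a minimal sketch that runs without a Spark session. It assumes a collections.namedtuple as a stand-in for pyspark.sql.Row, and schema_list is a hypothetical name for what rdd.value would return for the list in the question.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the snippet runs without Spark (assumption).
Row = namedtuple("Row", ["ColName", "Result"])

# What rdd.value would hold after broadcasting the list from the question.
schema_list = [("Name", 1), ("Age", 1), ("gender", 0), ("dept", 0), ("salary", 1)]

# A list has no .map method, so use a list comprehension ...
rows = [Row(ColName=name, Result=bool(flag)) for name, flag in schema_list]

# ... or the built-in map(), which returns a lazy iterator in Python 3.
rows_via_map = list(map(lambda x: Row(ColName=x[0], Result=bool(x[1])), schema_list))
```

Both forms produce the same rows; the comprehension is usually the more readable of the two.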

Basically, you cannot broadcast an RDD because it is already a distributed data structure: its partitions already sit on multiple machines.
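As a rough sketch of what does work, the example below broadcasts the plain Python list and, for a small DataFrame, uses the pyspark.sql.functions.broadcast join hint instead of SparkContext.broadcast. The function names flagged_columns and demo and the single sample row are hypothetical, and demo assumes you pass in an existing SparkSession.

```python
def flagged_columns(schema_pairs):
    """Return the column names whose flag is truthy, preserving order."""
    return [name for name, flag in schema_pairs if flag]


def demo(spark):
    """Run the sketch against an existing SparkSession (requires PySpark)."""
    from pyspark.sql.functions import broadcast  # join hint for small DataFrames

    schema_pairs = [("Name", 1), ("Age", 1), ("gender", 0), ("dept", 0), ("salary", 1)]

    # Broadcast the plain Python list, not an RDD or a DataFrame.
    bc = spark.sparkContext.broadcast(schema_pairs)
    wanted = flagged_columns(bc.value)

    # Hypothetical one-row DataFrame whose flagged columns are processed in a loop.
    df = spark.createDataFrame(
        [("Ann", 30, "F", "HR", 1000.0)],
        ["Name", "Age", "gender", "dept", "salary"],
    )
    for col_name in wanted:
        df.select(col_name).show()

    # The broadcast() function is only a join hint: it marks the small side of a join.
    schema_df = spark.createDataFrame(schema_pairs, ["ColName", "Result"])
    cols_df = spark.createDataFrame([(c,) for c in df.columns], ["ColName"])
    return cols_df.join(broadcast(schema_df), "ColName").collect()
```

Note that a broadcast variable is meant to be read inside tasks running on the executors. For a loop that runs on the driver, as here, the plain list would do just as well without broadcasting.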

Note: I hope the code you posted was only meant to demonstrate the issue, as I could not follow the reasoning behind it. The answer is still valid, though. I recommend understanding the concept of broadcast variables before using them in your project.

Cheers!

Harjeet
