I am trying to broadcast a Spark DataFrame. I have tried a couple of approaches, but neither of them works. I want to loop over the columns of another DataFrame and process only those columns whose ColName has Result equal to 1 in SchemaWithHeader. For example, with the data below the loop is required for the columns Name, Age and salary.
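To make the goal concrete, here is a minimal plain-Python sketch (no Spark involved) of the column selection I am after. The helper name `flagged_columns` is made up for illustration, and the pairs are the same sample data used in this question.

```python
# Sample (column name, flag) pairs from the question; flag 1 means "process this column".
schema_flags = [('Name', 1), ('Age', 1), ('gender', 0), ('dept', 0), ('salary', 1)]


def flagged_columns(pairs):
    """Return the column names whose flag equals 1, keeping the original order.

    Hypothetical helper, only meant to illustrate the intended selection.
    """
    return [name for name, flag in pairs if flag == 1]


# Loop over only the flagged columns; the real processing would go here.
for col_name in flagged_columns(schema_flags):
    print("processing column:", col_name)
```

With the sample data this visits Name, Age and salary and skips gender and dept.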
- Approach 1
    SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
    rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)
    SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
This fails with the following error:
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
AttributeError: 'Broadcast' object has no attribute 'map'
A DataFrame does not have a broadcast method of its own. I am not joining the two DataFrames with a SQL query; instead, I access the SchemaWithHeader DataFrame from inside a loop.
- Approach 2
    SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
    rdd = spark.sparkContext.parallelize(SchemaDFWithoutHeader)
    SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
    SchemaDF = spark.createDataFrame(SchemaWithHeader)
    spark.sparkContext.broadcast(SchemaDF)
    SchemaDF.registerTempTable("DFSchema")
This fails with the following error:
py4j.Py4JException: Method __getstate__([]) does not exist