I am little stuck as how to get this value correct. Below is my sample data:
col_name,Category,SegmentID,total_cnt,PercentDistribution
city,ANTIOCH,1,1,15
city,ARROYO GRANDE,1,1,15
state,CA,1,3,15
state,NZ,1,4,15
I am trying to get the output dataframe as :
I could arrive till this. Need your help here.
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
import json
join_df=spark.read.csv("/tmp/testreduce.csv",inferSchema=True, header=True)
jsonSchema = StructType([StructField("Name", StringType())
, StructField("Value", IntegerType())
, StructField("CatColName", StringType())
, StructField("CatColVal", StringType())
])
def reduceKeys(row1, row2):
row1[0].update(row2[0])
return row1
res_df=join_df.rdd.map(lambda row: ("Segment " + str(row[2]), ({row[1]: row[3]},row[0],row[4])))\
.reduceByKey(lambda x, y: reduceKeys(x, y))\
.map(lambda row: (row[0], row[1][2],row[1][1], json.dumps(row[1][0]))).toDF(jsonSchema)
My Current code output:
It is not grouping the data correctly based on segment id and CatColName.