0
votes

I am doing a POC for a Spark application in Scala, running in LOCAL mode. I need to process a JSON dataset with 300 columns but relatively few records. We are using Spark SQL, and the program runs perfectly fine when the dataset has 30-40 columns. We do inner and outer joins with Spark SQL, plus other conditions in the WHERE clause. The problem is that the SQL never finishes for the 300-column join; it just hangs, and I am not sure how to analyze the query. Is there a solution to this problem without having to run it in distributed mode? Would doing an inner join on the DataFrames directly alleviate the problem? Something like this: df1.join(df2, col("id1") === col("id2"), "inner").
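To illustrate, here is roughly what I mean by the DataFrame join (just a sketch of my setup; the file paths and the id1/id2 column names are placeholders for my actual dataset):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // LOCAL mode, as in the POC
    val spark = SparkSession.builder()
      .appName("JoinPoc")
      .master("local[*]")
      .getOrCreate()

    // placeholder paths for the two JSON datasets
    val df1 = spark.read.json("data/left.json")
    val df2 = spark.read.json("data/right.json")

    // note the triple equals: === builds a Column expression for the join condition
    val joined = df1.join(df2, col("id1") === col("id2"), "inner")

    // explain() prints the physical plan, which may show where the query gets stuck
    joined.explain()
    joined.show(false)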

Thanks

1
Take a look at this answer, if it helps. - Gsquare

1 Answer

0
votes

Can you please provide some sample code, show what the JSON looks like, and explain how you know the Spark app is 'stuck'?

Without knowing how deeply nested your JSON is: in general, you can create a hash such as SHA-256 over the concatenation of all 300 columns (accounting for nulls) and then join on that hash value, as in the sketch below.
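Something along these lines (only a sketch; df1/df2 and the helper name withRowHash are placeholders, and the coalesce sentinel is my own choice so that nulls don't all collapse into the same string before hashing):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{coalesce, concat_ws, lit, sha2}

    // Add a SHA-256 hash over all columns of a DataFrame.
    // coalesce replaces nulls with a sentinel so a null column and an empty column hash differently.
    def withRowHash(df: DataFrame, hashCol: String): DataFrame = {
      val cols = df.columns.map(c => coalesce(df(c).cast("string"), lit("<null>")))
      df.withColumn(hashCol, sha2(concat_ws("\u0001", cols: _*), 256))
    }

    val h1 = withRowHash(df1, "row_hash")
    val h2 = withRowHash(df2, "row_hash")

    // Join on the single hash column instead of 300 individual columns
    val joined = h1.join(h2, Seq("row_hash"), "inner")

That way the join condition is a single equality on one column rather than an expression over 300 columns.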