If I have a wide DataFrame (200m cols) that contains only IP addresses, and I want to drop the columns that contain null values or poorly formatted IP addresses, what would be the most efficient way to do this in Spark? My understanding is that Spark performs row-based processing in parallel, not column-based, so if I apply transformations column by column there would be a lot of shuffling. Would transposing the DataFrame first, applying filters to drop the (former column) rows, and then transposing back be a good way to take advantage of Spark's parallelism?
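For concreteness, here is a toy version of what I'm starting from; the column names and values are made up, and the real DataFrame just has far more columns:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ip-column-check").getOrCreate()
import spark.implicits._

// Illustrative data only: ip_a is clean, ip_b has malformed values, ip_c is all null.
val df = Seq(
  ("10.0.0.1", "not-an-ip", Option.empty[String]),
  ("10.0.0.2", "10.0.0",    Option.empty[String])
).toDF("ip_a", "ip_b", "ip_c")
// Goal: keep ip_a, drop ip_b (malformed values) and ip_c (all nulls).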
val inputDF = spark.sql("select 'AAA' as col1, 'AAAA' as col2")
// One aggregate expression per column, all evaluated in a single pass over the rows.
val commandStatement = Array(
  "sum(if(length(col1) > 0, 1, 0)) as col1_check",
  "sum(if(length(col2) > 0, 1, 0)) as col2_check")
val outputDF = inputDF.selectExpr(commandStatement: _*)
// ### DO SOME CHECK LOGIC on outputDF, e.g. compare the *_check counts to the row count ###
If you want, I can go into more detail in an answer. – afeldman
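Picking up the comment's single-pass idea, here is a sketch of how it could be extended to also validate the IP format and then drop the failing columns. The dotted-quad regex (shape only, no 0-255 range check), the zero-tolerance drop rule, and the helper name dropBadIpColumns are assumptions for illustration, not something the comment specifies:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, when}

// Assumption: a value is "bad" if it is null or does not look like a dotted quad.
val ipPattern = "^\\d{1,3}(\\.\\d{1,3}){3}$"

// One aggregate column per input column: count of bad values.
def badCount(c: String) =
  sum(when(col(c).isNull || !col(c).rlike(ipPattern), 1).otherwise(0)).alias(s"${c}_bad")

def dropBadIpColumns(df: DataFrame): DataFrame = {
  val checks = df.columns.map(badCount)
  // Single job: all per-column counts come back in one row (assumes a non-empty DataFrame).
  val counts = df.agg(checks.head, checks.tail: _*).first()
  // Keep only columns with zero bad values (assumed policy; adjust the threshold as needed).
  val keep = df.columns.zipWithIndex.collect { case (c, i) if counts.getLong(i) == 0L => c }
  df.select(keep.map(col): _*)
}

val cleaned = dropBadIpColumns(df)   // e.g. the toy df above; keeps only ip_a

Because all the per-column sums are computed in one df.agg(...) call, Spark evaluates them in a single pass over the rows; no transpose is needed, and the only shuffle is the small exchange that combines the per-partition partial aggregates.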