
If I have a wide dataframe (200m cols) that contains only IP addresses and I want to drop the columns that contain null values or poorly formatted IP addresses, what would be the most efficient way to do this in Spark? My understanding is that Spark performs row-based processing in parallel, not column-based, so if I attempt to apply transformations column by column there would be a lot of shuffling. Would transposing the dataframe first, applying filters to drop the rows, and then re-transposing be a good way to take advantage of Spark's parallelism?

200M columns is a lot. Your best option is to sample your dataframe (take a 10% sample so your process has less data to deal with), run your error logic on all columns at the same time, drop the bad columns, and loop, reducing the number of columns with each iteration. Try something like this:

    val inputDF = spark.sql("select 'AAA' as col1, 'AAAA' as col2")
    val commandStatement = Array(
      "sum(if(col1 is null or length(col1) = 0, 1, 0)) as col1_check",
      "sum(if(col2 is null or length(col2) = 0, 1, 0)) as col2_check")
    val outputDF = inputDF.selectExpr(commandStatement: _*)
    // do some check logic here

If you want I can go into more detail in an answer. - afeldman
By doing the boolean check followed by a sum you get the total number of 'bad' rows for each column. Do some percentage checking and you can drop a column if it is above your percent-fail criterion. - afeldman
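
Expanding the comment's approach to an arbitrarily wide dataframe might look roughly like the sketch below. It assumes a spark-shell or notebook session where spark is already defined (as in the snippet above); the toy input dataframe, the IPv4 regex, and the 10% bad-row cutoff are placeholder assumptions, not taken from the thread.

    // Toy stand-in for the wide IP-address dataframe.
    val inputDF = spark.sql(
      "select '10.0.0.1' as col1, 'not-an-ip' as col2, cast(null as string) as col3")

    // Validate on a sample so the check pass reads less data.
    // sample(withReplacement, fraction, seed); fraction = 1.0 only because the toy
    // dataframe has one row -- use something like 0.10 on real data.
    val sampled = inputDF.sample(false, 1.0, 42L)

    // Rough IPv4 shape check -- a placeholder for whatever "poorly formatted" means here.
    val ipPattern = "^((25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])[.]){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])$"

    // One aggregate per column: count rows that are null or fail the pattern,
    // plus a total row count, all computed in a single pass over the sample.
    val checks = sampled.columns.map { c =>
      s"sum(if(`$c` is null or not (`$c` rlike '$ipPattern'), 1, 0)) as `${c}_bad`"
    }
    val counts = sampled.selectExpr((checks :+ "count(*) as total_rows"): _*).collect()(0)
    val total = counts.getAs[Long]("total_rows").max(1L)

    // Keep a column only if its bad-row fraction is under the cutoff (10% is arbitrary).
    val keep = inputDF.columns.filter { c =>
      counts.getAs[Long](s"${c}_bad").toDouble / total < 0.10
    }
    val cleaned = inputDF.select(keep.map(inputDF.col): _*)
    cleaned.show()

Because collect() brings back a single row holding one aggregate per column, the per-column validation happens in one parallel pass over the sample and no transpose is needed.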

1 Answer


You can store a matrix in CSC (compressed sparse column) format using the structure org.apache.spark.ml.linalg.SparseMatrix.

If you can get away with filtering on this datatype and then converting back to a dataframe, that would be your best bet.
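
For reference, a small hand-built example of the CSC layout this class uses is sketched below; the 3 x 3 matrix is made up, and since SparseMatrix stores doubles, IP strings would need some numeric encoding before they could be stored this way.

    import org.apache.spark.ml.linalg.SparseMatrix

    // 3 x 3 matrix stored column-by-column (CSC):
    //   1.0  0.0  4.0
    //   0.0  3.0  5.0
    //   2.0  0.0  6.0
    // colPtrs(j) .. colPtrs(j + 1) is the slice of rowIndices/values belonging to column j.
    val csc = new SparseMatrix(
      3,                                     // numRows
      3,                                     // numCols
      Array(0, 2, 3, 6),                     // colPtrs
      Array(0, 2, 1, 0, 1, 2),               // rowIndices
      Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))   // values

    println(csc.toDense)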