I have two tables: table 1 has 5 million rows and table 2 has 3 million. When I do `table1.join(table2, ..., 'left_outer')`, the columns from table 2 come back null for rows of table 1 that have no match. The result looks like the following (var3 and var4 from table 2 are arrays of variable-length strings):
t1.id  var1  var2  table2.id  table2.var3        table2.var4
1      1.3   4     1          ['a','b','d']      ['x','y','z']
2      3.0   5     2          ['a','c','m','n']  ['x','z']
3      2.3   5     null       null               null
I plan to run CountVectorizer after the join, and it can't handle null values, so I want to replace the nulls with empty arrays of string type.
It's a similar issue to the one discussed in PySpark replace Null with Array.
But I have over 10 such variables from table 2, each with a different dimension. Any suggestions on what I can do? Could I run CountVectorizer before the join instead?