
I have two tables: table 1 has 5 million rows and table 2 has 3 million. When I do table1.join(table2, ..., 'left_outer'), some rows end up with null values in all the columns from table 2. The result looks like the following (var3 and var4 from table 2 are string arrays of varying length):

t1.id  var1  var2   table2.id  table2.var3        table2.var4
1      1.3   4      1          ['a','b','d']      ['x','y','z']
2      3.0   5      2          ['a','c','m','n']  ['x','z']
3      2.3   5
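
For reference, here's a runnable sketch of my setup (schemas reconstructed from the example above, so treat the names as placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-ins for the two tables described above.
    table1 = spark.createDataFrame(
        [(1, 1.3, 4), (2, 3.0, 5), (3, 2.3, 5)],
        ['id', 'var1', 'var2'])
    table2 = spark.createDataFrame(
        [(1, ['a', 'b', 'd'], ['x', 'y', 'z']),
         (2, ['a', 'c', 'm', 'n'], ['x', 'z'])],
        ['id', 'var3', 'var4'])

    # Rows of table1 with no match in table2 (here id=3) come back
    # with null in var3 and var4.
    joined = table1.join(table2, 'id', 'left_outer')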

I plan to use CountVectorizer after the join, but it can't handle null values, so I want to replace the nulls with empty arrays of string type.

It's a similar issue to the one discussed in PySpark replace Null with Array.

But I have over 10 of these variables from table 2, and each has a different dimension.
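
For concreteness, the kind of per-column replacement I'm after looks like this, adapted from the linked question. This is only a sketch: I'm assuming every table 2 array column is array<string>, and the schema lookup is just one way to avoid listing 10+ column names by hand.

    from pyspark.sql import functions as F

    # Find the array-typed columns that came from table2.
    array_cols = [f.name for f in table2.schema.fields
                  if f.dataType.simpleString() == 'array<string>']

    # coalesce() keeps the joined value when present and substitutes a
    # typed empty array where the left outer join produced a null.
    empty = F.array().cast('array<string>')
    for c in array_cols:
        joined = joined.withColumn(c, F.coalesce(F.col(c), empty))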

Any suggestions on what I can do? Could I run CountVectorizer before the join instead?

Technically speaking, if ALL of the resulting rows are null after the left outer join, then there was nothing to join on. Are you sure the join is working correctly? If only SOME of the results are null, then you can get rid of them by changing the left_outer join to an inner join. – Petras Purlys
Only some rows are null, and I need to keep these rows. – TL16
This problem is solved in stackoverflow.com/questions/40352675/… – TL16

1 Answer


DataFrames have a .na.fill() method (also available as df.fillna()). It returns a new DataFrame, so assign the result:

    # Map every column to an empty-string fill value.
    replace_cols = {c: '' for c in df.columns}
    df = df.na.fill(replace_cols)
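
One caveat: na.fill() only applies a fill value to columns whose type matches it and silently skips the rest, so a string fill like '' leaves the array<string> columns from table 2 untouched (still null). For those, the coalesce-with-empty-array approach sketched in the question (and in the linked stackoverflow.com/questions/40352675/… thread) is still needed.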