
I'm pulling my hair out trying to solve what feels like an extremely simple problem, but I'm not sure if there's some Spark voodoo involved as well.

I have two tables, which are both very small. Table A has about 90K rows and Table B has about 2K rows.

Table A

A  B  C  D
===========
a1 b1 c1 d1
a1 b1 c2 d2
a1 b1 c3 d3
a2 b2 c1 d1
a2 b2 c2 d2
.
.
.

Table B

A  B  E  F
===========
a1 b1 e1 f1
a2 b2 e2 f2

I want a table that looks like

Result Table

A  B  C  D  E  F
=================
a1 b1 c1 d1 e1 f1
a1 b1 c2 d2 e1 f1
a2 b2 c1 d1 e2 f2
.
.
.

I was a little loose with the example, but the idea is that I want to join the table with fewer rows onto the table with more rows, and it's fine for one row of Table B to match multiple rows of Table A.

This should be really simple:

table_a.join(table_b, (table_a.a == table_b.a) & (table_a.b == table_b.b)).select(..stuff..)

HOWEVER, in almost all of the rows of the Result Table (which should have about 90K rows, since Table A has about 90K rows), columns E and F are null.

When I save just Table B, I see all the columns and values. When I save just Table A, I see all the columns and values. (i.e., I could do the join by hand with pencil and paper.)

The weird thing is that even though ~89K rows have null values in columns E and F in the Result Table, there are a few values that do randomly join.

Does anyone know what's going on or how I can diagnose this?

Sorry, I may be misreading your question. Are you saying that the E and F columns are mostly empty? If so, why are you surprised that they're also mostly empty in a join? Your code looks correct, so I'm wondering if your understanding of the data is where the problem lies. To start, you may want to figure out how many rows from B are even joining onto A (perhaps by doing a join in base Spark). You may also want to consider whether you're expecting an outer join or an inner join and performing the opposite (see the "how" argument in the join docstring). – Galen Long
@nightingalen Sorry for the confusion. I'm saying that Table A and Table B are fully populated. Then, when I join them, the columns I expect to have values (namely E and F) are null. I want to associate with each row of Table A a row from the smaller, fully populated Table B. When I do a single unit test on some mock dataframes, it works, but when I run in production, I get null values. – confused
Where is your data coming from? Are you generating columns E and F on the fly, or in some way differently than the other columns? – David Griffin
@DavidGriffin I generate E and F on the fly. So the flow is: table_a_df = build_table_a(...); table_b_df = build_table_b(...) <- E and F are created from existing data; result_df = table_a_df.join(table_b_df, ....) – confused
Any time you get different results from a small vs. a large dataset, it's because your code is not serializing correctly – that's how I knew to ask whether those columns were dynamic. Something in the code that creates those columns is not serializing. It has nothing to do with the join. – David Griffin

1 Answer


Have you tried the null-safe equality operator <=> instead of == in your join? Ordinary == never matches when either side is NULL, so if your generated join keys contain nulls, those rows silently fail to join.