I'm pulling my hair out trying to solve what feels like an extremely simple problem, but I'm not sure if there's some Spark voodoo occurring as well.
I have two tables, which are both very small: Table A has about 90K rows and Table B has about 2K rows.
Table A
A B C D
===========
a1 b1 c1 d1
a1 b1 c2 d2
a1 b1 c3 d3
a2 b2 c1 d1
a2 b2 c2 d2
.
.
.
Table B
A B E F
===========
a1 b1 e1 f1
a2 b2 e2 f2
I want a table that looks like
Result Table
A B C D E F
=================
a1 b1 c1 d1 e1 f1
a1 b1 c2 d2 e1 f1
a2 b2 c1 d1 e2 f2
.
.
.
I was a little loose with the example, but the idea is that I want to join the table with fewer rows onto the table with more rows, and it's okay for one Table B row to be associated with multiple rows in the final table.
This should be really simple:
table_a.join(table_b, (table_a.a == table_b.a) & (table_a.b == table_b.b)).select(..stuff..)
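As an aside, since the key columns share names in both tables, the same join can be written with a list of column names, which also keeps the result from carrying duplicate key columns (a minimal sketch, assuming both DataFrames name their keys a and b as in the snippet above):

# Same equi-join expressed with shared column names; Spark keeps a single
# copy of the key columns a and b in the result.
result = table_a.join(table_b, on=["a", "b"], how="inner")
result.select("a", "b", "c", "d", "e", "f").show()

Note that how="inner" (the default) drops unmatched rows rather than null-filling them, while how="left" keeps every Table A row and fills the misses with null.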
HOWEVER, for almost all of the resulting rows in the Result Table (which should have about 90K rows, since Table A has about 90K rows), I get null values in columns E and F.
When I save the result of just Table B, I see all the columns and values. When I save the result of just Table A, I see all the columns and values. (i.e., I could do the join with paper and pencil.)
The weird thing is that even though ~89K rows have null values in columns E and F in the Result Table, there are a few rows that do randomly join.
Does anyone know what's going on or how I can diagnose this?
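For reference, here is a sketch of the kind of sanity check that can rule out a key mismatch (it assumes the key columns are literally named A and B, as in the example tables):

from pyspark.sql import functions as F

# How many distinct (A, B) key pairs from Table A actually match Table B?
keys_a = table_a.select("A", "B").distinct()
keys_b = table_b.select("A", "B").distinct()
print(keys_a.count(), keys_a.join(keys_b, ["A", "B"]).count())

# Two common silent culprits: mismatched key types and stray whitespace.
table_a.printSchema()
table_b.printSchema()
table_a.select(F.length(F.col("A")).alias("len_a")).distinct().show()

If the matched-key count is far below keys_a.count(), the keys themselves differ and the join has nothing to match.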
Table A and Table B are fully populated. Then, when I join them, the columns I expect to have values (namely E and F) are null. I want to associate with each row of Table A a row from the smaller table Table B, which is fully populated. When I do a single unit test on some mock dataframes, it works, but then when I run in production, I get null values. – confused

Are you creating E and F on the fly, or in some way differently than the other columns? – David Griffin

E and F are created on the fly. So the flow is:
table_a_df = build_table_a(...)
table_b_df = build_table_b(...)  # <- E and F are created from existing data
result_df = table_a_df.join(table_b_df, ....)
– confused
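If E and F really are derived on the fly, one detail worth knowing: Spark evaluates DataFrames lazily, so when the derivation involves anything non-deterministic (e.g. rand(), monotonically_increasing_id(), or a UDF reading mutable state), the columns can be recomputed, with different values, while the join executes. A sketch of how to rule that out by materializing Table B before joining (table_a_df and table_b_df are the names from the flow above):

# Materialize table_b_df once so E and F are computed exactly once and
# cached, rather than possibly re-derived during the join.
table_b_df = table_b_df.persist()
table_b_df.count()  # forces evaluation while the data is cached
result_df = table_a_df.join(table_b_df, ["A", "B"], "left")

If the nulls disappear after persisting, the on-the-fly derivation is the thing to investigate.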