I have a problem in PySpark when joining two dataframes. The first is a single-column dataframe, "zipcd", and the second is a dataframe with four columns.
The problem is that whenever I join the two dataframes, the zipcd column in the result contains the same value in every row: the value from the first row is repeated all the way down, which does not match the actual data.
For instance:
Zip.select("Zip").show()
+------------+
|         Zip|
+------------+
| 6.0651002E8|
| 6.0623002E8|
| 6.0077203E8|
| 6.0626528E8|
| 6.0077338E8|
|         0.0|
+------------+
and the other dataframe is zip_cd1:
zip_cd1.show()
+-----+
|zipcd|
+-----+
|60651|
|60623|
|60077|
|60626|
|60077|
|    0|
+-----+
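For reference, the two dataframes can be recreated roughly like this (the values are copied from the show() output above; the column types are my assumption, and only the relevant Zip column of the four-column dataframe is included):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "Zip" comes through as doubles, which is why it shows in scientific notation
Zip = spark.createDataFrame(
    [(6.0651002e8,), (6.0623002e8,), (6.0077203e8,),
     (6.0626528e8,), (6.0077338e8,), (0.0,)],
    ['Zip'],
)

# "zipcd" holds plain integer zip codes
zip_cd1 = spark.createDataFrame(
    [(60651,), (60623,), (60077,), (60626,), (60077,), (0,)],
    ['zipcd'],
)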
Whenever I try to join the dataframes, the following always happens:
Zip1 = zip_cd1.join(Zip).select('Zip', 'zipcd')
Zip1.show()
+------------+-----+
|         Zip|zipcd|
+------------+-----+
| 6.0651002E8|60651|
| 6.0623002E8|60651|
| 6.0077203E8|60651|
| 6.0626528E8|60651|
| 6.0077338E8|60651|
|         0.0|60651|
+------------+-----+
This happens no matter which join type I use, and I have no idea what is going on.
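For example, these variants (a sketch of what I tried; there is no common key column between the two dataframes, so there is nothing to pass as on=) all produce the same repeated value:

Zip1 = zip_cd1.join(Zip, how='inner').select('Zip', 'zipcd')
Zip1 = zip_cd1.join(Zip, how='left').select('Zip', 'zipcd')
Zip1 = zip_cd1.crossJoin(Zip).select('Zip', 'zipcd')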
Expected output:
+------------+-----+
|         Zip|zipcd|
+------------+-----+
| 6.0651002E8|60651|
| 6.0623002E8|60623|
| 6.0077203E8|60077|
| 6.0626528E8|60626|
| 6.0077338E8|60077|
|         0.0|    0|
+------------+-----+
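In other words, I want to pair the rows of the two dataframes by position. A sketch of the pairing I mean, assuming the row order of both dataframes can be relied on (the index column 'rid' is only illustrative, not something from my actual code):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a positional index to each dataframe, then join on it
w = Window.orderBy(F.monotonically_increasing_id())
zip_idx = Zip.withColumn('rid', F.row_number().over(w))
zipcd_idx = zip_cd1.withColumn('rid', F.row_number().over(w))

Zip1 = zip_idx.join(zipcd_idx, on='rid').drop('rid').select('Zip', 'zipcd')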