I am trying to create a pivot table on a PySpark SQL dataframe, which doesn't drop the null values. My input table has the following structure:
I am running everything in the IBM Data Science Experience cloud under Python 2 with spark 2.1.
When doing it on a pandas dataframe the "dropna=false" parameter gives me the result I want.
table= pd.pivot_table(ratings,columns=['movieId'],index=[ 'monthyear','userId'], values='rating', dropna=False)
As an output I get the following:
In PySpark SQL I am using at the moment the following command:
ratings_pivot = spark_df.groupBy('monthyear','userId').pivot('movieId').sum("rating").show()
As an output I get the following:
As you can see, all the entries with only null values are not shown. Is there a possibility to use something similar like dropna=false in SQL? Since this is very specific, I can´t find anything about that in the internet.
I just extracted a small dataset for reproduction:
df = spark.createDataFrame([("1", 30, 2.5,200912), ("1", 32, 3.0,200912), ("2", 40, 4.0,201002), ("3", 45, 2.5,200002)], ("userID", "movieID", "rating", "monthyear"))
df.show()
+------+-------+------+---------+
|userID|movieID|rating|monthyear|
+------+-------+------+---------+
| 1| 30| 2.5| 200912|
| 1| 32| 3.0| 200912|
| 2| 40| 4.0| 201002|
| 3| 45| 2.5| 200002|
+------+-------+------+---------+
If I now run the pivot query, I get the following result:
df.groupBy("monthyear","UserID").pivot("movieID").sum("rating").show()
+---------+------+----+----+----+----+
|monthyear|UserID| 30| 32| 40| 45|
+---------+------+----+----+----+----+
| 201002| 2|null|null| 4.0|null|
| 200912| 1| 2.5| 3.0|null|null|
| 200002| 3|null|null|null| 2.5|
+---------+------+----+----+----+----+
What I want now, is that in the results looks like the following:
+---------+------+----+----+----+----+
|monthyear|UserID| 30| 32| 40| 45|
+---------+------+----+----+----+----+
| 201002| 2|null|null| 4.0|null|
| 200912| 2|null|null|null|null|
| 200002| 2|null|null|null|null|
| 200912| 1| 2.5| 3.0|null|null|
| 200002| 1|null|null|null|null|
| 201002| 1|null|null|null|null|
| 200002| 3|null|null|null| 2.5|
| 200912| 3|null|null|null|null|
| 201002| 3|null|null|null|null|
+---------+------+----+----+----+----+