Does a data shuffle occur when using a Hive window function on data that is already on the same node?
Specifically, in the following example, before the window function is applied the data has already been repartitioned by 'City' with Spark's repartition() function, which should ensure that all data for city 'A' is co-located on the same node (assuming the data for a city fits on one node).
df = sqlContext.createDataFrame(
    [('A', '1', 2009, "data1"),
     ('A', '1', 2015, "data2"),
     ('A', '22', 2015, "data3"),
     ('A', '22', 2016, "data4"),
     ('BB', '333', 2014, "data5"),
     ('BB', '333', 2012, "data6"),
     ('BB', '333', 2016, "data7")],
    ("City", "Person", "year", "data"))
df = df.repartition(2, 'City')
df.show()
# +----+------+----+-----+
# |City|Person|year| data|
# +----+------+----+-----+
# | BB| 333|2012|data6|
# | BB| 333|2014|data5|
# | BB| 333|2016|data7|
# | A| 22|2016|data4|
# | A| 22|2015|data3|
# | A| 1|2009|data1|
# | A| 1|2015|data2|
# +----+------+----+-----+
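To verify that layout, one can tag each row with the id of the partition it lives in (spark_partition_id() is available from Spark 1.6 on; this is just a sanity check, not part of the query):

from pyspark.sql.functions import spark_partition_id
# After the repartition above, all rows of a given City should report
# the same partition id.
df.withColumn('pid', spark_partition_id()).show()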
Then I apply a window function partitioned by 'Person', which is not the partition key used in the Spark repartition() call above:
df.registerTempTable('example')
sqlStr = """\
select *,
row_number() over (partition by Person order by year desc) ranking
from example
"""
sqlContext.sql(sqlStr).show(100)
# +----+------+----+-----+-------+
# |City|Person|year| data|ranking|
# +----+------+----+-----+-------+
# | BB| 333|2016|data7| 1|
# | BB| 333|2014|data5| 2|
# | BB| 333|2012|data6| 3|
# | A| 1|2015|data2| 1|
# | A| 1|2009|data1| 2|
# | A| 22|2016|data4| 1|
# | A| 22|2015|data3| 2|
# +----+------+----+-----+-------+
Here are my questions:
Is there any relation or difference between Spark's repartition() and Hive's "partition by"? Under the hood, are they translated to the same thing in Spark?
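One way I tried to investigate this is to compare the physical plans; this is only a sketch, and the exact plan text varies by Spark version:

# The explicit repartition shows up as an Exchange (hash partitioning on City).
df.explain()
# For the window query, Spark inserts its own Exchange on Person, followed
# by a sort, to satisfy the window's "partition by" clause.
sqlContext.sql(sqlStr).explain()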
I want to check whether the following understanding of mine is correct: even if all the data is already on the same node, calling df.repartition('A_key_different_from_current_partition_key') will shuffle the data to many nodes, instead of leaving it together on the same node.
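For example, a minimal check (the Exchange operator in the plan indicates a shuffle; 'Person' here stands in for the hypothetical different key):

# Repartitioning by a key different from the current one adds a new
# Exchange hashpartitioning(Person, ...) on top of the existing one on City,
# i.e. a second shuffle, even though the rows were already co-located per City.
df.repartition('Person').explain()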
BTW, I am also curious whether it is simple to implement the example Hive query with the Spark window function API.
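For reference, here is my attempt at the same result with the DataFrame window API; a sketch which I believe is equivalent to the SQL above:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

# Same window spec as the SQL: partition by Person, newest year first.
w = Window.partitionBy('Person').orderBy(desc('year'))
df.withColumn('ranking', row_number().over(w)).show()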