0
votes

I have a set of films data/ratings, and I need to calculate the average of the ratings by film. It's like a sum on ratings groupby movieId in SQL. Thank you very much for your help

I've tried to use aggregateBYKey, but I don't know how to use seqOp and CombOp functions. I'm new to PySpark.

here is a chunk of my RDD: [movieId, userId, rating, film]

[('1', '1', 4.0, 'Toy Story (1995)'),
 ('1', '5', 4.0, 'Toy Story (1995)'),
 ('1', '7', 4.5, 'Toy Story (1995)'),
 ('1', '15', 2.5, 'Toy Story (1995)'),
 ('1', '17', 4.5, 'Toy Story (1995)'),
 ('1', '18', 3.5, 'Toy Story (1995)'),
 ('1', '19', 4.0, 'Toy Story (1995)'),
 ('1', '21', 3.5, 'Toy Story (1995)'),
 ('1', '27', 3.0, 'Toy Story (1995)'),
 ('1', '31', 5.0, 'Toy Story (1995)'),
 ('1', '32', 3.0, 'Toy Story (1995)'),
 ('1', '33', 3.0, 'Toy Story (1995)'),
 ('1', '40', 5.0, 'Toy Story (1995)'),
 ('1', '43', 5.0, 'Toy Story (1995)'),
 ('1', '44', 3.0, 'Toy Story (1995)'),
 ('1', '45', 4.0, 'Toy Story (1995)'),
 ('1', '46', 5.0, 'Toy Story (1995)'),
 ('1', '50', 3.0, 'Toy Story (1995)'),
 ('1', '54', 3.0, 'Toy Story (1995)'),
 ('1', '57', 5.0, 'Toy Story (1995)')]

I need to calculate the average rating for each film, something like:

[('1', average_ratings_of_film_1, film_name_1),
('2', average_ratings_of_film_2, film_name_2)]

thank you very much for your help

1
you can do groupby and take average df.groupBy("movie").avg("rating").show()PIG
thank you Pig, this solves it, but I need to do it using RDD's :/...twister9458

1 Answers

0
votes

You can use the following to convert your list to a DF and then use groupby().avg()

data = spark.sparkContext.parallelize(
[('1', '1', 4.0, 'Toy Story (1995)'),
 ('1', '5', 4.0, 'Toy Story (1995)'),
 ('1', '7', 4.5, 'Toy Story (1995)'),
 ('1', '15', 2.5, 'Toy Story (1995)'),
 ('1', '17', 4.5, 'Toy Story (1995)'),
 ('1', '18', 3.5, 'Toy Story (1995)'),
 ('1', '19', 4.0, 'Toy Story (1995)'),
 ('1', '21', 3.5, 'Toy Story (1995)'),
 ('1', '27', 3.0, 'Toy Story (1995)'),
 ('1', '31', 5.0, 'Toy Story (1995)'),
 ('1', '32', 3.0, 'Toy Story (1995)'),
 ('1', '33', 3.0, 'Toy Story (1995)'),
 ('1', '40', 5.0, 'Toy Story (1995)'),
 ('1', '43', 5.0, 'Toy Story (1995)'),
 ('1', '44', 3.0, 'Toy Story (1995)'),
 ('1', '45', 4.0, 'Toy Story (1995)'),
 ('1', '46', 5.0, 'Toy Story (1995)'),
 ('1', '50', 3.0, 'Toy Story (1995)'),
 ('1', '54', 3.0, 'Toy Story (1995)'),
 ('1', '57', 5.0, 'Toy Story (1995)')])

df = data.toDF(schema=["movie_id", "user_id", "rating", "movie"])

group = df.groupby("movie").avg("rating")
group.show()
#+----------------+-----------+
#|           movie|avg(rating)|
#+----------------+-----------+
#|Toy Story (1995)|      3.875|
#+----------------+-----------+