I have a dataset of films and their ratings, and I need to calculate the average rating per film, the equivalent of AVG(rating) with GROUP BY movieId in SQL.
I've tried to use aggregateByKey, but I don't know how to write the seqOp and combOp functions. I'm new to PySpark.
Here is a chunk of my RDD, with fields [movieId, userId, rating, film]:
[('1', '1', 4.0, 'Toy Story (1995)'),
('1', '5', 4.0, 'Toy Story (1995)'),
('1', '7', 4.5, 'Toy Story (1995)'),
('1', '15', 2.5, 'Toy Story (1995)'),
('1', '17', 4.5, 'Toy Story (1995)'),
('1', '18', 3.5, 'Toy Story (1995)'),
('1', '19', 4.0, 'Toy Story (1995)'),
('1', '21', 3.5, 'Toy Story (1995)'),
('1', '27', 3.0, 'Toy Story (1995)'),
('1', '31', 5.0, 'Toy Story (1995)'),
('1', '32', 3.0, 'Toy Story (1995)'),
('1', '33', 3.0, 'Toy Story (1995)'),
('1', '40', 5.0, 'Toy Story (1995)'),
('1', '43', 5.0, 'Toy Story (1995)'),
('1', '44', 3.0, 'Toy Story (1995)'),
('1', '45', 4.0, 'Toy Story (1995)'),
('1', '46', 5.0, 'Toy Story (1995)'),
('1', '50', 3.0, 'Toy Story (1995)'),
('1', '54', 3.0, 'Toy Story (1995)'),
('1', '57', 5.0, 'Toy Story (1995)')]
I need to calculate the average rating for each film, something like:
[('1', average_ratings_of_film_1, film_name_1),
('2', average_ratings_of_film_2, film_name_2)]
Thank you very much for your help.
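Based on examples I've found, I think the idea is to accumulate a (sum, count) pair per key and divide at the end. Here is my rough attempt (the inline sample is just a shortened copy of the data above) — is this how seqOp and combOp are meant to be used?

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Shortened copy of the chunk above: (movieId, userId, rating, film)
rdd = sc.parallelize([
    ('1', '1', 4.0, 'Toy Story (1995)'),
    ('1', '5', 4.0, 'Toy Story (1995)'),
    ('1', '7', 4.5, 'Toy Story (1995)'),
])

# Key by (movieId, film) so the title is kept through the aggregation,
# and use the rating as the value.
pairs = rdd.map(lambda x: ((x[0], x[3]), x[2]))

# seqOp: fold one rating into the (sum, count) accumulator within a partition.
seq_op = lambda acc, rating: (acc[0] + rating, acc[1] + 1)

# combOp: merge two (sum, count) accumulators from different partitions.
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

sums_counts = pairs.aggregateByKey((0.0, 0), seq_op, comb_op)

# Turn (sum, count) back into (movieId, average, film) tuples.
averages = sums_counts.map(lambda kv: (kv[0][0], kv[1][0] / kv[1][1], kv[0][1]))
print(averages.collect())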
df.groupBy("movie").avg("rating").show()
– PIG
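@PIG If I go the DataFrame route you suggest, I guess it would look something like the following (the column names are my assumption from the tuple layout above) — would that work?

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column names are my guess based on the tuple layout shown in the question.
df = spark.createDataFrame(
    [('1', '1', 4.0, 'Toy Story (1995)'),
     ('1', '5', 4.0, 'Toy Story (1995)'),
     ('1', '7', 4.5, 'Toy Story (1995)')],
    ["movieId", "userId", "rating", "film"],
)

# Group by both id and title so the film name shows up in the result.
df.groupBy("movieId", "film").avg("rating").show()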