I would like to count how often each user views each category. I am new to Spark and Python. Here is the demo data:
dataSource = sc.parallelize( [("user1", "film"), ("user1", "film"), ("user2", "film"), ("user2", "books"), ("user2", "books")] )
I reduced this by the user key and collected all the categories into one string, then split that string so I could count later:
dataReduced = dataSource.reduceByKey(lambda x,y : x + "," + y)
catSplitted = dataReduced.map(lambda kv: (kv[1].split(","), kv[0]))  # lambda (user, values): ... is Python 2 only
The output format for each user looks like this -> ([cat1, cat1, cat2, ...], user)
Could someone please tell me how to count the categories from here with Spark and Python, or is there a better way to solve this problem?