Count the lines in an RDD depending on the lines' content, PySpark

Question

I am currently trying to understand how RDDs work. For example, I want to count the lines in an RDD object based on their content. I have some experience with DataFrames, and my code for a DataFrame, which has, for example, columns A and B and probably some other columns, looks like this:

df = sqlContext.read.json("filepath")
df2 = df.groupBy(['A', 'B']).count()

The logic of this code is clear to me: I do a groupBy operation over column names in the DataFrame. In an RDD I don't have column names, just similar lines, which could be tuples or Row objects. How can I count identical tuples and attach the count as an integer to each unique line? For example, my first code is:

df = sqlContext.read.json("filepath") 
rddob = df.rdd.map(lambda line:(line.A, line.B))

I do the map operation and create a tuple of the values from the keys A and B. The resulting lines no longer have any keys (this is the most important difference from the DataFrame, which has column names). Now I can produce something like this, but it returns just a single number, the count of distinct lines in the RDD, rather than a count per line:

rddcalc = rddob.distinct().count()

What I want as my output is just:

((a1, b1), 2)
((a2, b2), 3)
((a2, b3), 1)
...

PS

I have found my own solution to this question. Here, rdd is the initial RDD, rddlist is a list of the unique lines with their counts, and rddmod is the final modified RDD and consequently the solution.

rddlist = rdd.map(lambda line:(line.A, line.B)).map(lambda line: (line, 1)).countByKey().items()
rddmod = sc.parallelize(rddlist)

In fact, groupBy isn't recommended because it requires shuffling the partitions, hence moving a lot of data among all nodes. - Alberto Bonsanto
@Alberto Bonsanto, thank you for your interest in this topic. I don't think that groupBy is dangerous for a DataFrame, and for an RDD it doesn't exist. - Guforu
Well, you can find some reasons explained by Databricks here: Prefer reduceByKey over groupByKey - Alberto Bonsanto
Hi @Guforu, I have read this message a couple of times but I still don't understand what you are trying to achieve. Do you want the number of times a specific combination of values appears in your RDD? - PinoSan

Katya Willard · Accepted Answer · 2016-03-28T16:00:45

I believe what you are looking for here is a reduceByKey. This will give you a count of how many times each distinct pair of (a,b) lines appears. It would look like this:

# Key each row on the tuple (A, B) so the result has the form ((a, b), count)
rddob = df.rdd.map(lambda line: ((line.A, line.B), 1))
# Sum the 1s for every identical key
counts_by_key = rddob.reduceByKey(lambda a, b: a + b)

You will now have key-value pairs of the form: ((a, b), count-of-times-pair-appears)

Please note that this only works if A and B are hashable values such as strings or numbers. If they are lists, you have to build a hashable "primary key" type of object from them (for example, a tuple) to perform the reduce on. You can't perform a reduceByKey where the key is an unhashable object.
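
As a rough illustration of the per-key counting that reduceByKey performs, here is a plain-Python sketch that does not use Spark at all. The Record namedtuple, the reduce_by_key helper, and the sample values are assumptions made up for this example. The key is the tuple (A, B), and list-valued fields are converted to tuples so they can serve as keys.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark Row objects with fields A and B.
Record = namedtuple("Record", ["A", "B"])


def reduce_by_key(pairs, func):
    """Plain-Python imitation of reduceByKey: fold the values of equal keys."""
    result = {}
    for key, value in pairs:
        # Keys live in a dict, so they must be hashable.
        result[key] = func(result[key], value) if key in result else value
    return result


# Made-up sample data matching the shape of the desired output.
rows = [
    Record("a1", "b1"), Record("a2", "b2"), Record("a1", "b1"),
    Record("a2", "b3"), Record("a2", "b2"), Record("a2", "b2"),
]

# Same shape as the map step: ((A, B), 1) for every row.
pairs = [((r.A, r.B), 1) for r in rows]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(sorted(counts.items()))
# [(('a1', 'b1'), 2), (('a2', 'b2'), 3), (('a2', 'b3'), 1)]

# A list-valued field is unhashable; converting it to a tuple gives a usable key.
list_rows = [Record(["x", "y"], "b1"), Record(["x", "y"], "b1")]
list_pairs = [((tuple(r.A), r.B), 1) for r in list_rows]
list_counts = reduce_by_key(list_pairs, lambda a, b: a + b)
print(list_counts)
# {(('x', 'y'), 'b1'): 2}
```

In Spark the same fold runs per partition before the shuffle, which is why reduceByKey moves less data than grouping first and counting afterwards.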
