
I am currently trying to understand how RDDs work. For example, I want to count the lines of an RDD based on their content. I have some experience with DataFrames; my code for a DF that has, say, columns A and B (and possibly others) looks like this:

df = sqlContext.read.json("filepath")
df2 = df.groupBy(['A', 'B']).count()

The logic of this code is clear to me: I do a groupBy over the column names of the DF. In an RDD I don't have column names, just similar lines, which could be tuples or Row objects... How can I count similar tuples and attach the count to each unique line as an integer? For example, my first code is:

df = sqlContext.read.json("filepath") 
rddob = df.rdd.map(lambda line:(line.A, line.B))

I do the map operation and create tuples of the values from the keys A and B. A unique line no longer has any keys (this is the most important difference from a DataFrame, which has column names). Now I can produce something like the following, but it only calculates the total number of distinct lines in the RDD:

rddcalc = rddob.distinct().count()
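As a plain-Python sketch (not Spark, values made up for illustration) of why this returns a single number: `distinct()` collapses duplicate pairs before `count()` runs, so the per-pair multiplicity is lost.

```python
# Plain-Python analogue of rddob.distinct().count():
# duplicates collapse into a set before counting.
pairs = [("a1", "b1"), ("a1", "b1"), ("a2", "b2")]
total_distinct = len(set(pairs))  # one total, not a count per pair
print(total_distinct)  # 2
```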

What I want for my output is just:

((a1, b1), 2)
((a2, b2), 3)
((a2, b3), 1)
...

PS

I have found my own solution to this question. Here, rdd is the initial RDD, rddlist is a list of all lines, and rddmod is the final, modified RDD, and hence the solution.

rddlist = rdd.map(lambda line:(line.A, line.B)).map(lambda line: (line, 1)).countByKey().items()
rddmod = sc.parallelize(rddlist)
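As a plain-Python sketch (not Spark, example rows made up) of what the `countByKey` step computes: each (A, B) tuple becomes a key, and its occurrences are tallied.

```python
from collections import Counter

# Plain-Python analogue of map(lambda line: (line, 1)).countByKey():
# count how often each (A, B) tuple occurs.
rows = [("a1", "b1"), ("a2", "b2"), ("a1", "b1"), ("a2", "b3")]
counts = Counter(rows)
items = list(counts.items())  # [((A, B), count), ...]
print(items)
```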
In fact, groupBy isn't recommended because it requires shuffling the partitions, hence moving a lot of data among the nodes. – Alberto Bonsanto
@Alberto Bonsanto, thank you for your interest in this topic. I don't think groupBy is dangerous for a DF, and for an RDD it doesn't exist. – Guforu
Well, you can find some reasons explained by Databricks here: Prefer reduceByKey over groupByKey. – Alberto Bonsanto
OK, thank you, interesting article. – Guforu
Hi @Guforu, I have read this message a couple of times but I still don't understand what you are trying to achieve. Do you want the number of times a specific combination (tuple) appears in your RDD? – PinoSan

1 Answer


I believe what you are looking for here is reduceByKey. This will give you a count of how many times each distinct (a, b) pair appears. It would look like this:

rddob = df.rdd.map(lambda line: ((line.A, line.B), 1))
counts_by_key = rddob.reduceByKey(lambda a, b: a + b)

You will now have key-value pairs of the form ((a, b), count-of-times-pair-appears).

Please note that the key must be hashable for reduceByKey to work. Tuples of strings or numbers are fine, but if A or B is a list, you have to convert it to a tuple (or build some other hashable "primary key" object) before performing the reduce.
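As a plain-Python sketch (not Spark, example pairs made up) of the fold that reduceByKey performs: (key, 1) pairs are grouped by key and their values summed, which is also why the key has to be hashable.

```python
# Plain-Python analogue of reduceByKey(lambda a, b: a + b):
# merge the values of identical keys with the given function.
pairs = [(("a1", "b1"), 1), (("a2", "b2"), 1), (("a1", "b1"), 1)]
result = {}
for key, value in pairs:
    # tuples are hashable, so they can serve as dict keys;
    # lists would raise a TypeError here
    result[key] = result.get(key, 0) + value
print(result)  # {('a1', 'b1'): 2, ('a2', 'b2'): 1}
```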