I try to understand currently, how RDD works. For example, I want to count the lines based on the context in some RDD object. I have some experince with DataFrames and my code for DF, which has for example columns A
, B
and probably some other columns, is looking like:
df = sqlContext.read.json("filepath")
df2 = df.groupBy(['A', 'B']).count()
The logical part of this code is clear for me - I do groupBy
operation over column name in DF. In RDD I don't have some column name, just similar lines, which could be a tuple or a Row objects... How I can count similar tuples and add it as integer to the unique line? For example my first code is:
df = sqlContext.read.json("filepath")
rddob = df.rdd.map(lambda line:(line.A, line.B))
I do the map operation and create a tuple of the values from the keys A
and B
. The unique line doesn't have any keys anymore (this is most important difference to the DataFrame, which has column name).
Now I can produce something like this, but it calculate just a total number of lines in RDD.
rddcalc = rddob.distinct().count()
What I want for my output, is just:
((a1, b1), 2)
((a2, b2), 3)
((a2, b3), 1)
...
PS
I have found my personal solution for this question. Here: rdd is initial rdd, rddlist is a list of all lines, rddmod is a final modified rdd and consequently the solution.
rddlist = rdd.map(lambda line:(line.A, line.B)).map(lambda line: (line, 1)).countByKey().items()
rddmod = sc.parallelize(rddlist)
groupBy
isn't recommended because it requires shuffling the partitions, hence moving many data among all nodes. – Alberto Bonsanto