I have created an RDD where every element is a dictionary.

rdd.take(2)

[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]

I also have a list of dictionaries of the form:

d2:

[{'good': 1.4,
  'bad': 0.4,
  'average': 0.6},
 {'good': 0.4,
  'bad': 1.7,
  'average': 1.2}]

I want to assign the values of d2 to the RDD.

The RDD and d2 have the same length and the same order. Every dictionary in the RDD has an extra key, "actor". I want each dictionary in d2 to update the RDD dictionary at the same position: the first dictionary in d2 updates the values of the first dictionary in the RDD, and so on.

The result I want is:

[{'actor': 'brad', 'good': 1.4, 'bad': 0.4, 'average': 0.6}, {'actor': 'tom', 'good': 0.4, 'bad': 1.7, 'average': 1.2}]

I tried:

for dic in d2:
   for key in rdd.filter(lambda x: x).first().keys():
       rdd.filter(lambda x: x).first()[key]=dic[key]

This is not working. How do I update the values?

1 Answer

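Does this work for you? It pairs each element of the RDD with its index using zipWithIndex() and then updates the element from the entry of d2 at that index.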

rdd = sc.parallelize([{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}])
d2 = [{'good': 1.4,
  'bad': 0.4,
  'average': 0.6},
 {'good': 0.4,
  'bad': 1.7,
  'average': 1.2}]

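def update_and_return_dict(_dict, update_dict):
    # Overwrite the matching keys in place; 'actor' is kept because d2 has no such key.
    _dict.update(update_dict)
    return _dict

# zipWithIndex() yields (element, index) pairs, so x[0] is the dict and x[1] its position.
print(rdd.zipWithIndex().map(lambda x: update_and_return_dict(x[0], d2[x[1]])).collect())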

[{'bad': 0.4, 'good': 1.4, 'average': 0.6, 'actor': 'brad'}, {'bad': 1.7, 'good': 0.4, 'average': 1.2, 'actor': 'tom'}]
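
If you would rather not mutate the original dictionaries, here is a minimal sketch of a non-mutating variant, assuming Python 3.5+ for the {**a, **b} syntax. The names merged and rows are hypothetical and only for illustration. It runs on a plain list so you can check the merge logic without a SparkContext; enumerate() stands in for zipWithIndex(), with the pair order swapped.

```python
def merged(row, update):
    """Return a new dict with row's keys, where values from update take precedence."""
    return {**row, **update}

# Plain-Python stand-in for the RDD contents (hypothetical name `rows`).
rows = [
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]
d2 = [
    {'good': 1.4, 'bad': 0.4, 'average': 0.6},
    {'good': 0.4, 'bad': 1.7, 'average': 1.2},
]

# enumerate() yields (index, element); zipWithIndex() yields (element, index).
result = [merged(row, d2[i]) for i, row in enumerate(rows)]
print(result)

# With a SparkContext `sc`, the same function could be used as:
# sc.parallelize(rows).zipWithIndex().map(lambda x: merged(x[0], d2[x[1]])).collect()
```

Note that in the Spark version d2 is captured by the lambda and shipped to the executors with each task, so for a large list a broadcast variable (sc.broadcast) may be worth considering.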