I am creating an inverted index of documents, where the output should contain a word (from the text file) followed by all the files it appeared in. Something like

[word1: file1.txt file2.txt] [word2: file2.txt file3.txt]

I have written the code but it throws me this error.

for k, v in iterator:
TypeError: <lambda>() takes exactly 2 arguments (1 given)

Code:

from pyspark import SparkContext    
sc = SparkContext("local", "app")

path = '/ebooks'
rdd = sc.wholeTextFiles(path)

output = rdd.flatMap(lambda (file,contents):contents.lower().split())\
            .map(lambda file,word: (word,file))\
            .reduceByKey(lambda a,b: a+b)
print output.take(10)

I cannot figure out a way to emit both the key and the value (the word and the filename) in the map step. How can I go about it?

In MapReduce, a (word, filename) pair can be emitted from the mapper (with the filename as the value), but how can this be done in Spark?

2 Answers

I haven't tested this on dummy data, but looking at your code, I think the following modification should work:

output = rdd.flatMap(lambda (file, contents): [(file, word) for word in contents.lower().split()])\
            .map(lambda (file, word): (word, [file]))\
            .reduceByKey(lambda a, b: a + b)
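
If you are running under Python 3, note that tuple unpacking in lambda parameters was removed, so the same pipeline has to index into the pairs instead; a rough sketch, assuming the same rdd from wholeTextFiles:

# Python 3 sketch: lambdas can no longer unpack tuple parameters,
# so index into the (file, contents) pair and the (file, word) pair instead.
output = rdd.flatMap(lambda fc: [(fc[0], word) for word in fc[1].lower().split()]) \
            .map(lambda fw: (fw[1], [fw[0]])) \
            .reduceByKey(lambda a, b: a + b)

print(output.take(10))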

"I cannot figure out a way to emit both key and value"

Use flatMapValues:

rdd = sc.wholeTextFiles("README.md")

rdd.flatMapValues(lambda content: content.lower().split()).take(3)

# [('file:/spark/README.md', '#'),
#  ('file:/spark/README.md', 'apache'),
#  ('file:/spark/README.md', 'spark')]

With flatMap you can do the same:

rdd.flatMap(
    lambda fc: ((fc[0], s) for s in fc[1].lower().split()))


# [('file:/spark/README.md', '#'),
#  ('file:/spark/README.md', 'apache'),
#  ('file:/spark/README.md', 'spark')]
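
From there, the inverted index the question asks for only needs the pairs swapped and grouped; a sketch along these lines (the file paths in the output depend on where README.md lives):

# Swap to (word, file), drop duplicate pairs, and collect the files per word.
inverted = (rdd.flatMapValues(lambda content: content.lower().split())
               .map(lambda fw: (fw[1], fw[0]))   # (word, file)
               .distinct()                       # one entry per (word, file)
               .groupByKey()                     # word -> iterable of files
               .mapValues(list))

inverted.take(2)
# e.g. [('apache', ['file:/spark/README.md']), ('spark', ['file:/spark/README.md'])]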