6
votes

I am trying the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file using the .saveAsTextFile command. Here is my code. Please help me, I am stuck. I appreciate your time.

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"\t\t"+ count

results.saveAsTextFile("/var/www/myoutput")
2
What is the problem? Can you show the error, please? - Alberto Bonsanto
Please format your question properly, highlighting the code - mgaido
Traceback (most recent call last): File "/home/cloudera/PythonTask/sorteddata.py", line 24, in <module> results.saveAsTextFile("var/www/myoutput") AttributeError: 'list' object has no attribute 'saveAsTextFile' - RACHITA PATRO
Try saving sortedwordsCount instead - WoodChopper
Thank you all for all your help. - RACHITA PATRO

2 Answers

8
votes

Since you called results=sortedwordsCount.collect(), results is no longer an RDD. It is a plain Python list (here, a list of tuples).

As you know, a list is a Python object/data structure, and append is its method for adding an element.

>>> x = []
>>> x.append(5)
>>> x
[5]

Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing to a file. The important difference is that an RDD is a distributed data structure.

So we cannot use append on an RDD or saveAsTextFile on a list. collect is the RDD method that brings the RDD's contents into driver memory.

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
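For the second option, here is a minimal sketch of writing the collected list to a file with plain Python. The helper name write_counts, the sample data, and the output location in the temp directory are assumptions made for illustration; in the real script, results would come from sortedwordsCount.collect().

```python
import io
import os
import tempfile


def write_counts(results, path):
    """Write (count, word) pairs to a tab-separated text file on the driver."""
    with io.open(path, "w", encoding="utf-8") as out:
        for count, word in results:
            if word:  # skip the empty token that splitting on \W+ can produce
                out.write(u"{0}\t{1}\n".format(word, count))


# Hypothetical stand-in for sortedwordsCount.collect()
sample_results = [(1, u"spark"), (2, u"python"), (3, u"")]

# Hypothetical output location; any writable local path would do
out_path = os.path.join(tempfile.gettempdir(), "wordcount_output.txt")
write_counts(sample_results, out_path)
```

This only makes sense when the result is small enough to fit in driver memory, since collect pulls everything to the driver first.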

1
votes

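Change results=sortedwordsCount.collect() to results=sortedwordsCount, because .collect() turns results into a list. The print loop would then need its own .collect() call, since an RDD cannot be iterated over directly.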
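A rough sketch of the pipeline with the save applied to the RDD itself might look like the following. The function names normalize_words and save_sorted_counts are made up for this example, and sc is assumed to be the SparkContext already created in the question. The tuple-unpacking lambda from the question is replaced with index access because lambda (x,y) is Python 2 only syntax.

```python
import re


def normalize_words(text):
    # Same tokenizer as in the question: lowercase, then split on non-word characters
    return re.compile(r"\W+", re.UNICODE).split(text.lower())


def save_sorted_counts(sc, input_path, output_dir):
    """Count words and save them sorted by count, without calling collect()."""
    sorted_counts = (
        sc.textFile(input_path)
        .flatMap(normalize_words)
        .filter(lambda w: w)              # drop empty tokens
        .map(lambda w: (w, 1))
        .reduceByKey(lambda a, b: a + b)
        .map(lambda kv: (kv[1], kv[0]))   # swap to (count, word) so sortByKey sorts by count
        .sortByKey()
    )
    # sorted_counts is still an RDD here, so saveAsTextFile is available
    sorted_counts.saveAsTextFile(output_dir)
    return sorted_counts
```

With the sc from the question, it could be called as save_sorted_counts(sc, "file:///home/cloudera/PythonTask/sample.txt", "/var/www/myoutput"). Note that saveAsTextFile writes a directory of part files rather than a single file, and it fails if the output directory already exists.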