
The following program throws an error:

from pyparsing import Regex, re
from pyspark import SparkContext
sc = SparkContext("local","hospital")
LOG_PATTERN ='(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'
logLine=sc.textFile("C:\TestLogs\Hospital.log").cache()
#logLine='1;35654423;30-12-2010:11.02;register request;Pete;50'
for line in logLine.readlines():
    match = re.search(LOG_PATTERN,logLine)
    Case_ID = match.group(1)
    Event_ID = match.group(2)
    Date_Time = match.group(3)
    Activity = match.group(4)
    Resource = match.group(5)
    Costs = match.group(6)
    print Case_ID
    print Event_ID  
    print Date_Time
    print Activity
    print Resource
    print Costs

Error:

Traceback (most recent call last):
  File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in <module>
    for line in logLine.readlines():
AttributeError: 'RDD' object has no attribute 'readlines'

If I add the open() function to read the file, I get the following error instead:

Traceback (most recent call last):
  File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in <module>
    f = open(logLine,"r")
TypeError: coercing to Unicode: need string or buffer, RDD found

I can't seem to figure out how to read the file line by line and extract the words that match the pattern. Also, if I pass only a single log line, logLine='1;35654423;30-12-2010:11.02;register request;Pete;50', it works. I'm new to Spark and know only the basics of Python. Please help.


2 Answers


You are mixing things up. The line

logLine=sc.textFile("C:\TestLogs\Hospital.log")

creates an RDD, and RDDs do not have a readlines() method. See the RDD API here:

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

You can use collect() to retrieve the contents of the RDD line by line. readlines() is part of the standard Python file API, but you usually do not need it when working with files in Spark: you load the file with textFile() and then process it with the RDD API (see the link above).
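A minimal sketch of that approach, reusing the path and regex from the question (the parse() helper and variable names are illustrative, not a fixed API):

    import re
    from pyspark import SparkContext

    sc = SparkContext("local", "hospital")

    LOG_PATTERN = ('(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);'
                   '(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)')

    def parse(line):
        # Return the six captured fields, or None if the line does not match
        m = re.search(LOG_PATTERN, line)
        return m.groups() if m else None

    # textFile() already yields the file line by line, so readlines() is not needed;
    # forward slashes avoid backslash-escape surprises in Windows paths
    logLines = sc.textFile("C:/TestLogs/Hospital.log")
    records = logLines.map(parse).filter(lambda r: r is not None)

    # collect() brings the parsed records back to the driver as a plain list
    for case_id, event_id, date_time, activity, resource, costs in records.collect():
        print(case_id)

Note that the parsing happens inside map() on the executors; only the already-extracted tuples are collected back to the driver.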


As Matei answered, readlines() is part of the Python file API, while sc.textFile() creates an RDD, hence the error that the RDD has no attribute readlines().

If you want to process the file using the Spark APIs, you can use the filter() transformation on the RDD to keep the lines that match a pattern, and then split each remaining line on the delimiter.

An example:

    logLine = sc.textFile("C:/TestLogs/Hospital.log")
    # Keep only the lines that contain the delimiter (a semicolon in this log)
    logLine_filtered = logLine.filter(lambda x: ";" in x)
    # Split each remaining line into its fields
    logLine_output = logLine_filtered.map(lambda a: a.split(";"))
    # Inspect the first parsed record
    print(logLine_output.first())

A DataFrame would be an even better fit for structured data like this.
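For instance, a sketch of the DataFrame route using Spark 1.6's SQLContext (the column names mirror the named groups in the question's regex; the path and the len(fields) == 6 sanity check are assumptions):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "hospital")
    sqlContext = SQLContext(sc)

    # Split each line on the semicolon delimiter and keep well-formed records
    rows = sc.textFile("C:/TestLogs/Hospital.log") \
             .map(lambda line: line.split(";")) \
             .filter(lambda fields: len(fields) == 6)

    # Name the columns after the named groups in the question's regex
    df = sqlContext.createDataFrame(
        rows, ["Case_ID", "Event_ID", "Date_Time", "Activity", "Resource", "Costs"])

    df.show(5)  # first five parsed records
    df.filter(df.Activity == "register request").show()

Once the data is in a DataFrame you get column selection, SQL-style filters, and aggregations without hand-written parsing code.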