The following program throws an error:
from pyparsing import Regex, re
from pyspark import SparkContext
sc = SparkContext("local","hospital")
LOG_PATTERN ='(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'
logLine=sc.textFile("C:\TestLogs\Hospital.log").cache()
#logLine='1;35654423;30-12-2010:11.02;register request;Pete;50'
for line in logLine.readlines():
    match = re.search(LOG_PATTERN,logLine)
    Case_ID = match.group(1)
    Event_ID = match.group(2)
    Date_Time = match.group(3)
    Activity = match.group(4)
    Resource = match.group(5)
    Costs = match.group(6)
    print Case_ID
    print Event_ID
    print Date_Time
    print Activity
    print Resource
    print Costs
Error:

Traceback (most recent call last):
  File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in <module>
    for line in logLine.readlines():
AttributeError: 'RDD' object has no attribute 'readlines'
If I add the open() function to read the file, I get the following error instead:

Traceback (most recent call last):
  File "C:/Spark/spark-1.6.1-bin-hadoop2.4/bin/hospital2.py", line 7, in <module>
    f = open(logLine,"r")
TypeError: coercing to Unicode: need string or buffer, RDD found
I can't seem to figure out how to read the file line by line and extract the fields that match the pattern. If I pass only a single log line, e.g. logLine='1;35654423;30-12-2010:11.02;register request;Pete;50', it works. I'm new to Spark and only know basic Python. Please help.
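For context on the approach being attempted, here is a minimal sketch (not from the question author) of the usual per-line pattern: an RDD is not a file object, so instead of readlines() one would factor the regex matching into a plain function and hand it to the RDD. The parse_line helper below is a hypothetical name; it uses the same pattern from the question, and the Spark calls are shown only in comments since they need a running SparkContext. The sketch uses Python 3 print() rather than the Python 2 print statements above.

```python
import re

# Same semicolon-delimited pattern as in the question above.
LOG_PATTERN = '(?P<Case_ID>[^ ;]+);(?P<Event_ID>[^ ;]+);(?P<Date_Time>[^ ;]+);(?P<Activity>[^;]+);(?P<Resource>[^ ;]+);(?P<Costs>[^ ;]+)'

def parse_line(line):
    """Parse one log line into a dict of named fields, or None if it doesn't match."""
    match = re.search(LOG_PATTERN, line)
    return match.groupdict() if match else None

# Works on a single line, just like the working single-line case in the question:
fields = parse_line('1;35654423;30-12-2010:11.02;register request;Pete;50')
print(fields['Case_ID'])   # 1
print(fields['Activity'])  # register request

# With Spark, the same function would be applied to every line of the RDD
# via map(), instead of calling readlines() on the RDD:
# parsed = sc.textFile(r"C:\TestLogs\Hospital.log") \
#            .map(parse_line) \
#            .filter(lambda r: r is not None)
# for record in parsed.collect():
#     print(record['Case_ID'])
```

The key point of the sketch is that textFile() already yields the file one line at a time, so the per-line work goes into a function passed to map() rather than into a local loop over a file handle.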