
I'm trying to split a string on a regular expression inside a lambda function, but the string is not getting split. I'm fairly sure the regular expression itself is fine; see the regex test at https://regex101.com/r/ryRio6/1

from pyspark.sql.functions import col,split
import re

r = re.compile(r"(?=\s\w+=)")
adsample = sc.textFile("hdfs://hostname/user/hdfs/sample/Log18Dec.txt")
splitted_sample = adsample.flatMap(lambda (x): ((v) for v in r.split(x)))

for m in splitted_sample.collect():
    print(m)

I'm not sure where I'm going wrong.

Sample line from the file:

|RECEIVE|Low| eventId=139569 msg=W4N Alert :: Critical : Interface Utilization for GigabitEthernet0/1 90.0 % in=2442 out=0 categorySignificance=/Normal categoryBehavior=/Communicate/Query categoryDeviceGroup=/Application

The regex should match the space before each key.

Expected output:

|RECEIVE|Low|
eventId=139569
msg=W4N Alert :: Critical : Interface Utilization for GigabitEthernet0/1 90.0 %
in=2442
out=0
categorySignificance=/Normal
categoryBehavior=/Communicate/Query
categoryDeviceGroup=/Application
Can you share the data in Log18Dec.txt and the output you are expecting? - rohitkulky
Or can you at least tell us what you expect (by this I mean "Can you describe what your regex is supposed to match?"), and what you get? - Oli
@Oli, @rohitkulky: edited with a sample line and the desired output. - Mohan M

1 Answer

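import re

# Original pattern; it is zero-width (lookahead only), so it consumes nothing:
# r = re.compile(r"(?=\s\w+=)")

# `sc` is the SparkContext that the PySpark shell provides.
adsample = sc.textFile("hdfs://hostname/user/hdfs/sample/Log18Dec.txt")

# Consume the whitespace (\s+) and only look ahead for the key (\w+=),
# so every match is non-empty and re.split really splits on it.
splitted_sample = adsample.flatMap(lambda x: re.split(r'\s+(?=\w+=)', x))

for m in splitted_sample.collect():
    print(m)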
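
Some explanation of why this helps, as far as I can tell: the pattern in the question, `(?=\s\w+=)`, is only a lookahead, so each match is an empty string. Before Python 3.7, `re.split` did not split on empty matches, and the `lambda (x):` syntax in the question suggests Python 2, so the line came back unsplit even though the pattern matches on regex101. Moving the whitespace out of the lookahead makes each match non-empty. Here is a minimal sketch without Spark that you can run locally; the variable names `line`, `pattern` and `parts` are just for illustration, and the input is the sample line from the question:

```python
import re

# Sample line copied from the question.
line = ("|RECEIVE|Low| eventId=139569 msg=W4N Alert :: Critical : "
        "Interface Utilization for GigabitEthernet0/1 90.0 % in=2442 out=0 "
        "categorySignificance=/Normal categoryBehavior=/Communicate/Query "
        "categoryDeviceGroup=/Application")

# \s+ consumes the separator; (?=\w+=) only checks that a key follows.
pattern = re.compile(r"\s+(?=\w+=)")

parts = pattern.split(line)

for part in parts:
    print(part)
```

This should print the eight pieces listed in the question, with the free-text `msg=` value kept on a single line because the words inside it are not followed by `=`.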