I have just started learning Apache Beam with Python and have been stuck on this problem for a while. I hope someone experienced with Apache Beam can help.
This is my problem statement:
I have a text file that looks like this:
BEGIN=burger
blue
lettuce
mayonise
END=burger
BEGIN=fish
green
strawberry
ketchup
END=fish
How can I use Apache Beam to split the burger and fish sections into separate PCollections, so that I can apply different operations to each of them?
Here is my code snippet in Python:
import apache_beam as beam
from apache_beam import Create, Map, ParDo, Filter
from apache_beam.io import ReadFromText


class SplitRow(beam.DoFn):
    def process(self, element):
        return element.splitlines()


def ExtractBurger(element):
    if element == "BEGIN=burger":
        return element


p = beam.Pipeline()
squares = (
    p
    # | "Read From Text" >> ReadFromText("gs://abc.txt")
    | "Create dummy text file" >> Create([
        'BEGIN=burger',
        'blue',
        'lettuce',
        'mayonise',
        'END=burger',
        'BEGIN=fish',
        'green',
        'strawberry',
        'ketchup',
        'END=fish',
    ])
    | "Decode and split lines" >> ParDo(SplitRow())
    | "Extract out Burger" >> Filter(ExtractBurger)
    | Map(print)
)
p.run()
This is my output:
BEGIN=burger
I am able to extract the row that equals "BEGIN=burger", but what I really want is to extract all the data between "BEGIN=burger" and "END=burger" into one PCollection, and all the data between "BEGIN=fish" and "END=fish" into another. I am not sure whether this is possible, because it seems to me that Apache Beam can only operate on one row at a time. How do I write logic that does something like this:
- If found BEGIN=burger
- Continue to loop through next rows until you find END=burger
- Take the whole section and write it into a PCollection
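To make the steps above concrete, here is a minimal sketch of the logic in plain Python, outside of Beam. It assumes the lines are available in their original order (which a PCollection of individual rows does not guarantee) and that the BEGIN/END marker lines themselves are not part of the section. The helper name `split_sections` is made up for illustration.

```python
def split_sections(lines):
    """Group the lines found between BEGIN=<name> and END=<name> by name.

    Assumption: `lines` is an ordered iterable of strings, and the marker
    lines themselves are excluded from the collected sections.
    """
    sections = {}
    current = None  # name of the section currently open, if any
    for line in lines:
        if line.startswith("BEGIN="):
            # Open a new section named after the text following "BEGIN=".
            current = line[len("BEGIN="):]
            sections[current] = []
        elif line.startswith("END="):
            # Close the section; lines outside any section are ignored.
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections


text = """BEGIN=burger
blue
lettuce
mayonise
END=burger
BEGIN=fish
green
strawberry
ketchup
END=fish"""

sections = split_sections(text.splitlines())
print(sections["burger"])
print(sections["fish"])
```

What I cannot work out is how to express this in Beam: the loop depends on seeing the rows in order, so I assume it would only work if the whole file reached a single `process` call as one element, with the burger and fish sections then routed to separate PCollections.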
I would appreciate any insights. Thank you!