I am new to Apache Beam and have just started working with the Python SDK. I have a high-level understanding of Pipelines, PCollections, PTransforms, ParDo, and DoFn.
In my current project, the pipeline has been implemented using pandas to read, transform, and write the files, with the syntax shown below.
I wanted to understand whether this is a correct use of Apache Beam, since we read and write the files directly with pandas and do not process them element by element.
Steps:
- create a Pipeline
- create a PCollection of the input file path
- call a DoFn and pass it the file path
- do everything inside the DoFn (read, transform, and write) using pandas
Sample high-level code:
import apache_beam as beam
import pandas as pd

class ActionClass(beam.DoFn):
    def process(self, file_path):
        # read the file into a dataframe using pandas
        df = pd.read_csv(file_path)
        # do some transformation using pandas
        # write the dataframe to the output file from inside the DoFn only
        return

def run():
    p = beam.Pipeline(options=options)
    input = p | beam.io.ReadFromText('input_file_path')  # reading only the file path
    output = input | 'PTransform' >> beam.ParDo(ActionClass())