I am new to Apache Beam and have just started working with the Python SDK. I have a high-level understanding of Pipelines, PCollections, PTransforms, ParDo, and DoFn.
In my current project, the pipeline has been implemented using pandas to read, transform, and write the files, with the syntax shown below.
I wanted to understand whether this is a correct use of Apache Beam, since we read and write the files directly with pandas and do not process them element by element.
Steps:
- create a Pipeline
- create a PCollection of the input file path
- call a DoFn and pass it the file path
- do everything inside the DoFn (read, transform, and write) using pandas
Sample high-level code:
import apache_beam as beam
import pandas as pd

class ActionClass(beam.DoFn):
    def process(self, file_path):
        # read the file into a dataframe using pandas
        df = pd.read_csv(file_path)
        # do some transformation using pandas
        # write the dataframe to the output file from inside the DoFn only
        return

def run():
    p = beam.Pipeline(options=options)
    input = p | beam.io.ReadFromText('input_file_path')  # reading only the file path
    output = input | 'PTransform' >> beam.ParDo(ActionClass())