0
votes

I have just joined a new company as a data engineer, working on building Google Cloud Platform (GCP) batch ETL pipelines. My team's data scientist recently passed me a data model (a .py file written in Python 3.6).

The data model has a main function that I can call to get a dataframe as output, and I intend to append this dataframe to a BigQuery table. Is there any way I can simply import this main function and integrate it into an Apache Beam (Dataflow) pipeline, without having to re-code the data model as a PTransform? Or would I be better off just using Cloud Scheduler and Cloud Functions to achieve what I want?

I am a complete beginner with Dataflow and Apache Beam, so any help or links to guides would be greatly appreciated!


2 Answers

2
votes

Instead of converting to a CSV file and loading it into BigQuery (which is a longer and potentially costlier process), you could take advantage of BigQuery's built-in integration with pandas:

import pandas

# Example dataframe; in your case this would be the output of the data model's main function
df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
    }
)

full_table_id = 'my_dataset.new_table'
project_id = 'my-project-id'

# Write the dataframe to BigQuery (requires the pandas-gbq package)
df.to_gbq(full_table_id, project_id=project_id)

In terms of pipeline orchestration, I personally like Apache Airflow, which GCP offers as a managed service through Cloud Composer.
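
As a rough illustration (assuming Airflow 2.x; the data_model module, table, and project names are placeholders, not something from your question), a daily DAG that calls the model's main function and appends its dataframe to BigQuery could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from data_model import main as build_dataframe  # hypothetical: the data scientist's .py file


def run_model_to_bq():
    df = build_dataframe()
    # Append to the existing table; requires the pandas-gbq package
    df.to_gbq('my_dataset.new_table', project_id='my-project-id', if_exists='append')


with DAG(
    dag_id='data_model_to_bigquery',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    PythonOperator(
        task_id='run_model_to_bq',
        python_callable=run_model_to_bq,
    )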

Edit: check out the extensive documentation for df.to_gbq for more control over how you load your dataframe.
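
For example (a minimal sketch reusing the placeholder names above), the if_exists parameter controls what happens when the destination table already exists:

# Append to the table if it already exists, instead of failing (the default)
df.to_gbq(
    full_table_id,
    project_id=project_id,
    if_exists='append',  # other accepted values are 'fail' and 'replace'
)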

0
votes

If you have a dataframe, the easiest way is to write it out as a CSV file and then load that file into BigQuery with a load job.

Don't forget to set job_config.write_disposition = 'WRITE_APPEND' so the data is appended to the existing table. Look into the other parameters as well: schema autodetect, the CSV field delimiter, and skipping leading rows can all help you achieve your load.
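
Put together, a minimal sketch with the google-cloud-bigquery client could look like this (the file path, project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project-id.my_dataset.new_table'

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                   # infer the schema from the CSV
    skip_leading_rows=1,               # skip the header row
    field_delimiter=',',               # CSV separator
    write_disposition='WRITE_APPEND',  # append to the existing table
)

# df is the dataframe returned by the data model's main function
df.to_csv('/tmp/data.csv', index=False)

with open('/tmp/data.csv', 'rb') as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load job to complete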

Do it in a Cloud Function or on Cloud Run, triggered by Cloud Scheduler. It works well!
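
As a sketch of that setup (the data_model import and the load_csv_to_bigquery helper are hypothetical stand-ins for the model's main function and the load-job code above), the entry point of an HTTP-triggered Cloud Function could be as simple as:

# main.py of an HTTP-triggered Cloud Function
from data_model import main as build_dataframe   # hypothetical: the data scientist's .py file
from loader import load_csv_to_bigquery          # hypothetical: wraps the load-job code above

def run_pipeline(request):
    df = build_dataframe()
    df.to_csv('/tmp/data.csv', index=False)   # /tmp is writable in Cloud Functions
    load_csv_to_bigquery('/tmp/data.csv')
    return 'Load complete', 200

You can then deploy it with gcloud functions deploy and create a Cloud Scheduler job that calls the function's HTTP URL on whatever schedule you need.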