0 votes

I want to load data from hundreds of CSV files on Google Cloud Storage and append them to a single table on BigQuery on a daily basis using Cloud Dataflow (preferably using the Python SDK). Can you please let me know how I can accomplish that?

Thanks

What exactly is the question? Yes, you can use the Python SDK for Dataflow to load to a BigQuery table. Or (my preference) just load into BigQuery and then run a query to transform. – Elliott Brossard
Hi Elliott! I want to run a daily batch job that loads data from CSV files in Cloud Storage to a BigQuery table. I agree with your point, but even for a direct load, the Dataflow code needs to convert each CSV line to JSON (key-value pairs) to be written to BigQuery; BigQuery only accepts key-value pairs when you use Apache Beam (the Dataflow SDK). – Parth Desai
It sounds very similar to this answer. – Elliott Brossard
Yes, but it's in Java; I am more familiar with Python. Any resources for Python on converting a PCollection to a TableRow? – Parth Desai

1 Answer

0 votes

You can do this in Python as well. Please find the code snippet below.

def format_output_json(element):
    """
    :param element: a single row of the CSV, as a comma-separated string.
    :return: a list containing one dictionary, keyed by column name, with the
             corresponding values from that row of the CSV.

    The column names are hard-coded here, but they could also be determined at
    run time (for example, from the CSV header).
    """
    row_indices = ['time_stamp', 'product_name', 'units_sold', 'retail_price']
    row_data = element.split(',')
    # Pair each column name with the value at the same position in the row.
    return [dict(zip(row_indices, row_data))]
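
For context, here is a minimal sketch of how this function could be plugged into a complete Beam pipeline that reads the CSV files from Cloud Storage and appends the parsed rows to a BigQuery table. The project, bucket, dataset, table, and schema values below are placeholders of my own, not values from the question, so adjust them to your environment.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and table names -- replace with your own.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    temp_location='gs://my-bucket/tmp',
    region='us-central1',
)

with beam.Pipeline(options=options) as p:
    (p
     # Read every CSV file matching the pattern; each element is one line.
     | 'ReadCSVFiles' >> beam.io.ReadFromText('gs://my-bucket/daily/*.csv',
                                              skip_header_lines=1)
     # Convert each line into a dictionary of column name -> value.
     | 'ParseRows' >> beam.FlatMap(format_output_json)
     # Append the dictionaries to the target table.
     | 'AppendToBigQuery' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',
            schema=('time_stamp:STRING,product_name:STRING,'
                    'units_sold:STRING,retail_price:STRING'),
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

To run this on a daily basis, you would schedule the job launch externally (for example with cron, Cloud Composer, or Cloud Scheduler), since a batch Dataflow job runs once per launch.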