9
votes

I am trying to build a quick proof of concept for a data processing pipeline in Python. To do this, I want to build a Google Cloud Function that is triggered when certain .csv files are dropped into Cloud Storage.

I followed this Google Cloud Functions Python tutorial, and while the sample code does trigger the Function and write some simple logs when a file is dropped, I am stuck on what call I have to make to actually read the contents of the file. I searched for SDK/API guidance documentation but have not been able to find any.

In case this is relevant, once I process the .csv, I want to publish some of the data I extract from it to GCP's Pub/Sub.

2
Did you manage to get this working in the end? I am having similar issues and keep running into the suggestion that it would be best to have the Cloud Function send the data to BigQuery directly and then take it from there. Thanks - Daniel Vieira
Yes, I did manage to get this to work. I was able to read the contents of the data using the top-comment and then used the SDK to place the data into Pub/Sub. I'm happy to help if you can give me your specific issue :) - rara-aaa

2 Answers

16
votes

The function does not actually receive the contents of the file, just some metadata about it.

You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.

Putting that together with the tutorial you're using, you get a function like:

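from google.cloud import storage

# Create the client once at module load so it is reused across invocations.
storage_client = storage.Client()

def hello_gcs_generic(data, context):
    # The event payload carries only metadata, including the bucket and object name.
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    # download_as_bytes() replaces download_as_string(), which is deprecated
    # in newer releases of google-cloud-storage.
    contents = blob.download_as_bytes()
    # Process the file contents, etc...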
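
As a minimal sketch of the "process the file contents" step, assuming the .csv is UTF-8 encoded and has a header row: the bytes returned by the download can be decoded and handed to the standard-library csv module. The helper name parse_csv_bytes and the sample data are hypothetical, not part of the tutorial.

```python
import csv
import io
from typing import Dict, List


def parse_csv_bytes(contents: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Decode downloaded CSV bytes and return one dict per data row."""
    # Assumption: the file is text in the given encoding with a header row.
    text = contents.decode(encoding)
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# In-memory bytes standing in for the downloaded blob contents.
sample = b"id,name\n1,alice\n2,bob\n"
rows = parse_csv_bytes(sample)
print(rows)
```

Inside the function, this would be called as parse_csv_bytes(contents). All values come back as strings, so any numeric conversion is left to the caller.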
3
votes

This is an alternative solution using pandas:

Cloud Function Code:

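import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName

    # Reading a gs:// path requires the gcsfs package to be installed
    # (list it in requirements.txt alongside pandas).
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)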
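
Since the question mentions pushing the extracted data to Pub/Sub, here is a rough sketch of that last step, under some assumptions: one JSON message per row, and project_id and topic_id as placeholders for your own values. The helper names rows_to_messages and publish_rows are hypothetical. Pub/Sub message payloads must be bytes, so each row is serialized first.

```python
import json
from typing import Dict, Iterable, List


def rows_to_messages(rows: Iterable[Dict]) -> List[bytes]:
    """Serialize each row dict to UTF-8 JSON bytes (Pub/Sub payloads are bytes)."""
    # default=str is a fallback for values json cannot encode natively, e.g. timestamps.
    return [json.dumps(row, sort_keys=True, default=str).encode("utf-8") for row in rows]


def publish_rows(rows: Iterable[Dict], project_id: str, topic_id: str) -> List[str]:
    """Publish one message per row and return the message IDs."""
    # Imported here so the serialization helper works without the library installed.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    futures = [publisher.publish(topic_path, data=m) for m in rows_to_messages(rows)]
    # Block until each publish completes.
    return [f.result() for f in futures]


# In the function above, rows could come from dataFrame.to_dict(orient="records").
records = [{"id": 1, "name": "alice"}]
messages = rows_to_messages(records)
print(messages)
```

This needs google-cloud-pubsub in requirements.txt. For large files, one message per row may be too chatty, and batching rows into fewer messages is a reasonable alternative.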