4 votes

I have a large CSV file, on the order of 1 GB, and want to create Datastore entities from it, one entity per row.

The CSV file currently resides in Google Cloud Storage. Is there a clean way to do this? All the examples I can find online seem to rely on having the CSV file locally, or don't look like they would scale well. Ideally there would be a streaming API that lets me read the file from Cloud Storage in small enough pieces to make update calls to the Datastore as I go, but I haven't been able to find anything like that.


2 Answers

2 votes

The buffer you receive when you open a GCS file is a streaming buffer, which can be pickled. However, the buffer does not support the iterator protocol, so you cannot hand it to csv.reader directly to read the CSV line by line; you have to write a small wrapper. For example:

import csv
import logging
import cloudstorage as gcs

with gcs.open('/app_default_bucket/csv/example.csv', 'r') as f:
    csv_reader = csv.reader(iter(f.readline, ''))  # wrap readline so csv can iterate over lines
    for row in csv_reader:
        logging.info(' - '.join(row))
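
Since the goal in the question is one Datastore entity per row, the same streaming read can feed batched writes. Below is a minimal sketch, assuming the ndb client and a hypothetical Record model whose properties you would replace with your own CSV columns; the batch size of 500 is only an illustrative choice.

import csv
import cloudstorage as gcs
from google.appengine.ext import ndb

class Record(ndb.Model):  # hypothetical model, adjust to your CSV columns
    name = ndb.StringProperty()
    value = ndb.StringProperty()

def load_csv(path='/app_default_bucket/csv/example.csv', batch_size=500):
    batch = []
    with gcs.open(path, 'r') as f:
        for row in csv.reader(iter(f.readline, '')):
            batch.append(Record(name=row[0], value=row[1]))
            if len(batch) >= batch_size:
                ndb.put_multi(batch)  # one RPC for the whole batch instead of one per row
                batch = []
    if batch:
        ndb.put_multi(batch)  # flush the final partial batch

For a full 1 GB file you would probably run this from a task queue task (or split it across several tasks), since a single user-facing request is unlikely to finish within the request deadline.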

If you are familiar with the blobstore, you can also use it to read large CSVs from GCS using blobstore.create_gs_key( "/gs" + <gcs_file_name_here>). Example here.
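
As a rough sketch of that alternative (the bucket and object names below are placeholders, and the line-reading wrapper is the same as above), reading the file through the blobstore API could look like this:

import csv
import logging
from google.appengine.ext import blobstore

# Placeholder GCS path; substitute your own bucket and object name.
gs_key = blobstore.create_gs_key('/gs/app_default_bucket/csv/example.csv')
reader = blobstore.BlobReader(gs_key)
for row in csv.reader(iter(reader.readline, '')):
    logging.info(' - '.join(row))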