5
votes

I want to save a pandas DataFrame directly to Google Cloud Storage. I tried the different approaches from write-a-pandas-dataframe-to-google-cloud-storage-or-bigquery, but I am not able to save it.

Note: I can use the google.cloud package only

Below is the code I tried

from google.cloud import storage
import pandas as pd
input_dict = [{'Name': 'A', 'Id': 100}, {'Name': 'B', 'Id': 110}, {'Name': 'C', 'Id': 120}]
df = pd.DataFrame(input_dict)

Try 1:

destination = f'gs://bucket_name/test.csv'
df.to_csv(destination)

Try 2:

storage_client = storage.Client(project='project')
bucket = storage_client.get_bucket('bucket_name')
gs_file = bucket.blob('test.csv')
df.to_csv(gs_file)

I am getting the errors below.

For option 1: No such file or directory: 'gs://bucket_name/test.csv'

For option 2: 'Blob' object has no attribute 'close'

Thanks,

Raghunath.

4
I attempted a similar setup and it worked for me. Is your Python code running in GCP? And is the Cloud Storage bucket already created? Your Try 1 solution should work via the Cloud Shell. – oakinlaja
Hi Raghunath, were you able to find the answer to this? I had the exact same issue: I'm writing a Python script, triggered by Airflow, which writes a df to CSV and keeps it in a GCS bucket, but I'm getting "Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files. Use pip or conda to install gcsfs." – Praneeth Kumar
Currently there is no solution for this requirement. I have developed code to create a temporary file and then upload it to GCS. – Raghunath
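
For reference, the gcsfs error quoted in the comments is the missing piece for Try 1: pandas hands gs:// paths off to the optional gcsfs library. A minimal sketch of that route, assuming gcsfs can be installed alongside google.cloud and credentials are already configured:

# Requires gcsfs in addition to pandas (pip install gcsfs)
import pandas as pd

df = pd.DataFrame([{'Name': 'A', 'Id': 100}, {'Name': 'B', 'Id': 110}, {'Name': 'C', 'Id': 120}])

# With gcsfs installed, pandas resolves the gs:// URL itself,
# which is exactly what Try 1 attempts.
df.to_csv('gs://bucket_name/test.csv', index=False)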

4 Answers

14
votes
from google.cloud import storage
import os
import pandas as pd
from io import StringIO  # only needed if you skip writing a local csv file

# point GOOGLE_APPLICATION_CREDENTIALS at your Google Cloud service account key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/your-google-cloud-private-key.json'
gcs = storage.Client()

df = pd.DataFrame([{'Name': 'A', 'Id': 100}, {'Name': 'B', 'Id': 110}])

Write it to a CSV file on your machine first, then upload it:

df.to_csv('local_file.csv')
gcs.get_bucket('BUCKET_NAME').blob('FILE_NAME.csv').upload_from_filename('local_file.csv', content_type='text/csv')

If you do not want to create a temporary CSV file, use StringIO:

f = StringIO()
df.to_csv(f)
f.seek(0)
gcs.get_bucket('BUCKET_NAME').blob('FILE_NAME.csv').upload_from_file(f, content_type='text/csv')
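
A further variation (not in the original answer, just a sketch using the same client): Blob.upload_from_string accepts the CSV text that to_csv returns when called without a path, so the StringIO buffer can be skipped as well.

csv_data = df.to_csv(index=False)  # to_csv with no path returns the CSV as a string
gcs.get_bucket('BUCKET_NAME').blob('FILE_NAME.csv').upload_from_string(csv_data, content_type='text/csv')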
0
votes

Write the file to a local directory first, then upload it to GCS.

import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('[bucket_name]')
blob = bucket.blob('panda.csv')

input_dict = [{'Name': 'A', 'Id': 100}, {'Name': 'B', 'Id': 110}, {'Name': 'C', 'Id': 120}]
df = pd.DataFrame(input_dict)
df.to_csv('/home/[path]/panda.csv')

blob.upload_from_filename('/home/[path]/panda.csv')
print('File panda.csv uploaded')
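
If you would rather not hardcode a local path, the same idea works with a temporary file (this is essentially the workaround Raghunath mentions in the comments; a sketch using the standard tempfile module, reusing the bucket and blob from the code above):

import os
import tempfile

# write the DataFrame to a temporary CSV file, upload it, then clean up
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp_path = tmp.name
df.to_csv(tmp_path, index=False)
blob.upload_from_filename(tmp_path)
os.remove(tmp_path)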
-1
votes

Maybe this post can help you:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1,2,3],[4,5,6]], columns=['a','b','c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

Source: Write a Pandas DataFrame to Google Cloud Storage or BigQuery
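
The snippet above depends on the Datalab packages, which the question rules out. A sketch of the BigQuery step using only google-cloud-bigquery (assuming pyarrow is installed and the TestDataSet dataset already exists; the table name is the same placeholder as above):

from google.cloud import bigquery

bq_client = bigquery.Client()
# load_table_from_dataframe serializes the DataFrame (pyarrow required)
# and creates TestDataSet.TestTable if it does not already exist
load_job = bq_client.load_table_from_dataframe(simple_dataframe, 'TestDataSet.TestTable')
load_job.result()  # wait for the load job to complete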

-1
votes

This worked for me:

BUCKET_NAME= "TEST-BUCKET"
storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)
    
fileout = "/folder1/consolidatedOutput.csv"

#convert data frame to string and write it

destination_blob = bucket.blob(file_out)
destination_blob.upload_from_string(df.to_string(index=False,justify='left'))
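
Note that to_string uploads a fixed-width text table rather than true CSV. If the object needs to be comma separated, a one-line variation on the snippet above (same blob, df.to_csv instead of df.to_string) does it:

# upload real CSV content instead of a fixed-width text dump
destination_blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')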