4
votes

Goal - To read csv file uploaded on google cloud storage bucket.

Environment - Run Jupyter notebook using SSH instance on Master node. Using python on Jupyter notebook trying to access a simple csv file uploaded onto google cloud storage bucket.

Approaches -

1st approach - Write a simple python program

Wrote following program

import csv
f = open('gs://python_test_hm/train.csv' , 'rb' ) 
csv_f = csv.reader(f)
for row in csv_f
     print row

Results - Error message "No such file or directory"

2nd Approach - Using gcloud Package tried to access the train.csv file. The sample code is shown below. Below code is not the actual code. The file on google Cloud storage in my version of code was referred to "gs:///Filename.csv" Results - Error message "No such file or directory"

Load data from CSV

import csv
from gcloud import bigquery
from gcloud.bigquery import SchemaField
client = bigquery.Client()
dataset = client.dataset('dataset_name')
dataset.create()  # API request

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
 ]
table = dataset.table('table_name', SCHEMA)
table.create()

with open('csv_file', 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)

3rd Approach -

import csv
import urllib

url = 'https://storage.cloud.google.com/<bucket>/train.csv'


response = urllib.urlopen(url)
cr = csv.reader(response)
print cr

for row in cr:
    print row

Results - Above code doesn't result in any error but it displays the XML content of the google page as shown below. I am interested in viewing the data of the train csv file.

['<!DOCTYPE html>']
['<html lang="en">']
['  <head>']
['  <meta charset="utf-8">']
['  <meta content="width=300', ' initial-scale=1" name="viewport">']
['  <meta name="google-site-verification" content="LrdTUW9psUAMbh4Ia074-   BPEVmcpBxF6Gwf0MSgQXZs">']
['  <title>Sign in - Google Accounts</title>']

Can someone throw some light on what could be possibly wrong here and how do I achieve my goal? Your help is highly appreciated.

Thanks so much for your help!

2
It seems that the file is stored in a place that requires authentication (i.e. it's not a public site). If that's true, you would have to authenticate before you're able to access the file. It doesn't matter if you're logged in on your browser, because Python isn't using your browser cookies. - dispesi
^^^ this is the answer - Nick

2 Answers

8
votes

I assume you are using Jupyter notebook running on a machine in Google Cloud Platform (GCP)? If that's the case, you will already have the Google Cloud SDK running on that machine (by default).

With this setup you have 2 easy options to work with Google Cloud Storage (GCS):

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = bucket.blob('train.csv')
blob.upload_from_string('this is test content!')

Reading from GCS:

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('python_test_hm')
blob = storage.Blob('train.csv', bucket)
content = blob.download_as_string()
0
votes

The sign in page your app fetches isn't actually the object - it's an auth redirect page that, if interacted-with, would proceed to serve the object. You should check out the documentation on Cloud Storage to see about how auth works, and look up the auth details for whichever library or means you use to access the bucket / object.