I am having trouble loading 4.6M rows (11 vars) from snowflake to python. I generally use R, and it handles the data with no problem ... but I am struggling with Python (which I have rarely used, but need to on this occasion).
Attempts so far:
Use new python connector - obtained error message (as documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all)
Amend my previous code to work in batches .. - this is what I am hoping for help with here.
The example code on the snowflake webpage https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those familiar with python already would know this.
This is where I am at:
import snowflake.connector
import pandas as pd
from itertools import chain
SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'
query = '''
select *
from table(%s)
;
'''
def create_snowflake_connection():
conn = snowflake.connector.connect(
user='MYUSERNAME',
account='MYACCOUNT',
authenticator = 'externalbrowser',
warehouse='<WH>',
database='<DB>',
role='<ROLE>',
schema='<SCHEMA>'
)
return conn
def fetch_pandas_concat_df(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
colstring = ','.join([col[0] for col in cur.description])
df = pd.DataFrame(dat, columns =colstring.split(","))
grow.append(df)
rows += df.shape[0]
print(rows)
return pd.concat(grow)
def fetch_pandas_concat_list(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
grow.append(dat)
colstring = ','.join([col[0] for col in cur.description])
rows += len(dat)
print(rows)
# note that grow is a list of list of tuples(?) [[(),()]]
return pd.DataFrame(list(chain(*grow)), columns = colstring.split(","))
cur = con.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE))
df1 = fetch_pandas_concat_df(cur) # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes....
df3.head()
df3.shape
cur.close()