0
votes

I am having trouble loading 4.6M rows (11 vars) from Snowflake into Python. I generally use R, which handles the data with no problem, but I am struggling with Python (which I have rarely used, but need on this occasion).

Attempts so far:

  1. Use the new Python connector - this produced the error documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all.

  2. Amend my previous code to work in batches - this is what I am hoping for help with here.

The example code on the Snowflake page https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those already familiar with Python would know this.

This is where I am at:

import snowflake.connector
import pandas as pd
from itertools import chain

SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'

query = '''
select * 
from table(%s)
;
'''
def create_snowflake_connection():
    conn = snowflake.connector.connect(
            user='MYUSERNAME',
            account='MYACCOUNT',
            authenticator = 'externalbrowser',
            warehouse='<WH>',
            database='<DB>',
            role='<ROLE>',
            schema='<SCHEMA>'
    )

    return conn


def fetch_pandas_concat_df(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        colstring = ','.join([col[0] for col in cur.description])
        df = pd.DataFrame(dat, columns=colstring.split(","))
        grow.append(df)

        rows += df.shape[0]
    print(rows)

    return pd.concat(grow)    

def fetch_pandas_concat_list(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break      
        grow.append(dat) 
        colstring = ','.join([col[0] for col in cur.description])    
        rows += len(dat)      
    print(rows)  

    # note that grow is a list of list of tuples(?) [[(),()]]

    return pd.DataFrame(list(chain(*grow)), columns = colstring.split(","))


conn = create_snowflake_connection()
cur = conn.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))  # params must be a tuple
df1 = fetch_pandas_concat_df(cur) # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes.... 
df3.head()
df3.shape
cur.close()
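For comparison, the column names only need to be read from cur.description once, and the raw row tuples can be accumulated and turned into a DataFrame in a single pass at the end. A sketch of that pattern (the function name is mine, not part of the connector):

```python
import pandas as pd

def fetch_pandas_single_pass(cur, batch_size=50000):
    """Accumulate raw row tuples, then build one DataFrame at the end."""
    cols = [col[0] for col in cur.description]  # column names, read once
    rows = []
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)  # keep plain tuples; no per-batch DataFrames
    return pd.DataFrame(rows, columns=cols)
```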
1
It takes 4 minutes to load the data into R. I stopped the above Python code at the pd.DataFrame(list(chain(*grow)), columns=colstring.split(",")) line after 30 minutes... In the meantime I will download the data from Snowflake to CSV and load it from there. - user1420372

1 Answer

0
votes

The string manipulation you're doing is extremely expensive computationally. Besides, why would you want to combine everything into a single string, just to then break it back apart?

Take a look at this section of the Snowflake documentation. Essentially, you can go straight from the cursor object to the DataFrame, which should speed things up immensely.
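For example, the connector's fetch_pandas_all() returns the whole result set as one DataFrame, and fetch_pandas_batches() yields DataFrames that can be concatenated exactly once. A minimal sketch, assuming the connector was installed with its pandas extra (pip install "snowflake-connector-python[pandas]", which pulls in pyarrow) and that the query has already been executed on the cursor (the helper name is my own):

```python
import pandas as pd

def fetch_all_batched(cur):
    """Go straight from a Snowflake cursor to a single DataFrame.

    For modest result sets, cur.fetch_pandas_all() does this in one
    call; for larger ones, fetch_pandas_batches() yields DataFrames
    that are concatenated exactly once here.
    """
    return pd.concat(cur.fetch_pandas_batches(), ignore_index=True)
```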