0
votes

I am having trouble loading 4.6M rows (11 vars) from Snowflake into Python. I generally use R, which handles the data with no problem, but I am struggling with Python (which I have rarely used, but need on this occasion).

Attempts so far:

  1. Use the new Python connector - this produced the error documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all.

  2. Amend my previous code to work in batches - this is what I am hoping for help with here.

The example code on the Snowflake page https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those already familiar with Python would know this.

This is where I am at:

import snowflake.connector
import pandas as pd
from itertools import chain

SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'

query = '''
select * 
from table(%s)
;
'''
def create_snowflake_connection():
    conn = snowflake.connector.connect(
            user='MYUSERNAME',
            account='MYACCOUNT',
            authenticator = 'externalbrowser',
            warehouse='<WH>',
            database='<DB>',
            role='<ROLE>',
            schema='<SCHEMA>'
    )

    return conn


def fetch_pandas_concat_df(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        colstring = ','.join([col[0] for col in cur.description])
        df = pd.DataFrame(dat, columns=colstring.split(","))
        grow.append(df)

        rows += df.shape[0]
    print(rows)

    return pd.concat(grow)    

def fetch_pandas_concat_list(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break      
        grow.append(dat) 
        colstring = ','.join([col[0] for col in cur.description])    
        rows += len(dat)      
    print(rows)  

    # note that grow is a list of list of tuples(?) [[(),()]]

    return pd.DataFrame(list(chain(*grow)), columns = colstring.split(","))


conn = create_snowflake_connection()
cur = conn.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))  # params must be a tuple
df1 = fetch_pandas_concat_df(cur) # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes.... 
df3.head()
df3.shape
cur.close()
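For comparison, the column names only need to be read from cur.description once, and the raw row tuples can be accumulated and turned into a DataFrame in a single pass at the end. A sketch of that pattern (the function name is mine, not part of the connector):

```python
import pandas as pd

def fetch_pandas_single_pass(cur, batch_size=50000):
    """Accumulate raw row tuples, then build one DataFrame at the end."""
    cols = [col[0] for col in cur.description]  # column names, read once
    rows = []
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)  # keep plain tuples; no per-batch DataFrames
    return pd.DataFrame(rows, columns=cols)
```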
1
It takes 4 minutes to load the data into R. I stopped the above Python code at the pd.DataFrame(list(chain(*grow)), columns=colstring.split(",")) line after 30 minutes... In the meantime I will download the data from Snowflake to CSV and load it from there. - user1420372

1 Answer

0
votes

The string manipulation you're doing is extremely expensive computationally. Besides, why would you want to combine everything into a single string, just to then break it back apart?

Take a look at this section of the Snowflake documentation. Essentially, you can go straight from the cursor object to the DataFrame, which should speed things up immensely.
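For example, the connector's fetch_pandas_all() returns the whole result set as one DataFrame, and fetch_pandas_batches() yields DataFrames that can be concatenated exactly once. A minimal sketch, assuming the connector was installed with its pandas extra (pip install "snowflake-connector-python[pandas]", which pulls in pyarrow) and that the query has already been executed on the cursor (the helper name is my own):

```python
import pandas as pd

def fetch_all_batched(cur):
    """Go straight from a Snowflake cursor to a single DataFrame.

    For modest result sets, cur.fetch_pandas_all() does this in one
    call; for larger ones, fetch_pandas_batches() yields DataFrames
    that are concatenated exactly once here.
    """
    return pd.concat(cur.fetch_pandas_batches(), ignore_index=True)
```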