0
votes

I am new to Snowflake and Python. I am trying to figure out which would be faster and more efficient:

  1. Read data from Snowflake using fetch_pandas_all() or fetch_pandas_batches(), OR
  2. Unload data from Snowflake into Parquet files and then read them into a dataframe (roughly as in the sketch after this list).
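For reference, my understanding of what option 2 would involve is roughly the following. This is only a sketch; the stage name, file paths, table name, and credentials are placeholders, and I have not benchmarked it:

    # Rough sketch of option 2 (unload to Parquet, then read into pandas).
    # Stage name, file paths, table name, and credentials are placeholders.
    import pandas as pd
    import snowflake.connector

    con = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="<database>", schema="<schema>",
    )
    cur = con.cursor()

    # Unload the table to Parquet files on a named internal stage.
    # HEADER = TRUE keeps the original column names in the Parquet files.
    cur.execute("""
        COPY INTO @my_stage/simulation_run/
        FROM SIMULATION_RESULTS
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
    """)

    # Download the staged files to the local machine
    # (the local directory must already exist).
    cur.execute("GET @my_stage/simulation_run/ file:///tmp/simulation_run/")

    # Read every Parquet file in the directory into one DataFrame (needs pyarrow).
    df = pd.read_parquet("/tmp/simulation_run/")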

CONTEXT I am working on a data-layer regression-testing tool that has to verify and validate datasets produced by different versions of the system.

Typically a simulation run produces around 40-50 million rows, each having 18 columns.

I don't know much about pandas or Python yet, but I am learning (I used to be a front-end developer).

Any help appreciated.

LATEST UPDATE (09/11/2020): I used fetch_pandas_batches() to pull the data down into manageable dataframes and then wrote them to the SQLite database. Thanks.
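For anyone searching for something similar, this is roughly what I ended up doing. It is only a sketch: the connection parameters, query, and table names are placeholders, and the column-enrichment step is elided:

    # Stream Snowflake results into SQLite in manageable chunks.
    # Connection parameters, the query, and table names are placeholders.
    import sqlite3

    import snowflake.connector

    snow_con = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="<database>", schema="<schema>",
    )
    sqlite_con = sqlite3.connect("regression_results.db")

    try:
        cur = snow_con.cursor()
        cur.execute("SELECT * FROM SIMULATION_RESULTS")  # placeholder query

        # fetch_pandas_batches() yields DataFrames of a manageable size,
        # so the full 40-50 million rows never need to fit in memory at once.
        for batch_df in cur.fetch_pandas_batches():
            # ... enrich the columns here ...
            batch_df.to_sql("simulation_results", sqlite_con,
                            if_exists="append", index=False)
    finally:
        snow_con.close()
        sqlite_con.close()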

Any reason you couldn't just query the data directly in Snowflake and leverage SQL via Python, rather than load it into pandas? Use the power of Snowflake rather than loading all that data into memory on a Python server. That doesn't answer your question, but I thought I'd throw it out there. – Mike Walton
@MikeWalton I need to enrich the data in the columns before they eventually get written into a SQLite database. pandas.to_sql() is helpful for writing it to SQLite. – Sidsy
Do you mean that you have to check whether all the data is the same as in the previous dataset? If so, HASH_AGG(*) on the new table vs. the old table checks that. – simpleuser008

1 Answer

0
votes

Based on your use case, you are likely better off just running a fetch_pandas_all() command to get the data into a dataframe. The performance is likely better since it's only one hop of the data, and it's easier to code as well. I'm also a fan of leveraging the SQLAlchemy library and using the read_sql command. That looks something like this:

resultSet = pd.read_sql(text(sqlQuery), SnowEngine)

once you've established an engine connection. Same concept, but leverages the SQLAlchemy library instead.
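For completeness, a minimal sketch of that engine setup, assuming the snowflake-sqlalchemy package is installed (the account, credentials, and query below are placeholders):

    # Minimal sketch: build a SQLAlchemy engine for Snowflake and read a query
    # straight into a DataFrame. All connection values are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine, text
    from snowflake.sqlalchemy import URL

    SnowEngine = create_engine(URL(
        account="<account>",
        user="<user>",
        password="<password>",
        database="<database>",
        schema="<schema>",
        warehouse="<warehouse>",
    ))

    sqlQuery = "SELECT * FROM SIMULATION_RESULTS"  # placeholder query
    resultSet = pd.read_sql(text(sqlQuery), SnowEngine)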