0
votes

I am trying to export a large dataset from SQL Server to my local hard disk for some data analysis. The file is around 30 GB, with over 6 million rows and about 10 columns.

This data will then be fed into Python (pandas) or Tableau for analysis. I suspect the sheer size of the file will give me poor performance during my analysis.

Any best practices to be shared for analyzing big-ish data on a local machine?

I am running an i7-4570 with 8 GB of RAM. I am hoping to be less reliant on SQL queries and be able to run heavy analysis offline.

Due to the nature of the database, a fresh extract is needed each time and this process will have to repeat itself, so there will not be much appending of data.

I have explored HDFStore and Tableau Data Extracts, but I am still curious whether I can get better performance by reading whole CSV files.

Is there a compression method of sorts that I might be missing? Again, the objective here is to run the analysis without constantly querying the server. The source itself (which I am optimizing) will refresh every morning, so when I get into the office I can just focus on getting coffee and doing some blazing-fast analytics.
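As a starting point before reaching for a database, shrinking dtypes and dumping the frame to a compressed binary format already avoids re-parsing 30 GB of CSV text every morning. Below is a minimal sketch of that idea; the DataFrame, the column names, and the file name `extract.pkl.gz` are all stand-ins for illustration (the real data would come from `pd.read_csv` on the daily extract):

```python
import numpy as np
import pandas as pd

# Stand-in for the daily extract (the real frame would come from read_csv).
df = pd.DataFrame({
    "id": np.arange(10_000, dtype=np.int64),
    "category": np.random.choice(["a", "b", "c"], size=10_000),
    "value": np.random.rand(10_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast numerics and convert low-cardinality strings to categoricals;
# this often cuts the in-memory footprint by a large factor.
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = df["value"].astype("float32")
df["category"] = df["category"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)  # the downcast/categorical frame is much smaller

# A compressed binary dump loads far faster than re-parsing CSV text.
df.to_pickle("extract.pkl.gz", compression="gzip")
reloaded = pd.read_pickle("extract.pkl.gz")
```

`to_hdf` with `complib="blosc"` (the HDFStore route mentioned above) follows the same pattern and additionally supports querying subsets from disk.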

1
Isn't this a dupe of this: stackoverflow.com/questions/14262433/… — EdChum
I am not limited to exploring only pandas; I am even open to having a small database on my machine if it is much faster. I just do not know the best practices. — BernardL
But then you make this question opinion-based, as everyone will have a different answer. The dupe link deals with how to use pandas with HDFStore effectively, which to me makes this question very similar, IMO. — EdChum

1 Answer

0
votes

With Tableau you would want to take an extract of the CSV, since an extract will be much quicker to query than the raw CSV. That should be fine because the extract sits on disk. However, as mentioned, you will need to create a new extract whenever your data changes.

With pandas I usually load everything into memory, but if it doesn't fit, you can read the CSV in chunks using the `chunksize` parameter (see this thread: How to read a 6 GB csv file with pandas).
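A minimal sketch of the chunked approach: aggregate each chunk as it arrives so the full file never has to fit in 8 GB of RAM at once. The file `big.csv`, its columns, and the chunk size are illustrative stand-ins (the real call would point `read_csv` at the 30 GB extract with a chunk size in the hundreds of thousands of rows):

```python
import pandas as pd

# Write a tiny CSV so the example is self-contained; in practice this
# would be the large daily extract already on disk.
pd.DataFrame({"x": range(100), "y": range(100)}).to_csv("big.csv", index=False)

# Stream the file in chunks, folding each one into a running aggregate.
total = 0
rows = 0
for chunk in pd.read_csv("big.csv", chunksize=25):
    total += chunk["y"].sum()
    rows += len(chunk)

print(rows, total)  # 100 4950
```

This works for anything reducible per chunk (sums, counts, group-by partials); operations that need the whole dataset at once, like a global sort, are where an on-disk store such as HDFStore or a local database earns its keep.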