0
votes

I am trying to export a large dataset from SQL Server to my local hard disk for some data analysis. The file is around 30 GB, with over 6 million rows and about 10 columns.

This data will then be fed into Python (pandas) or Tableau for analysis. I suspect the sheer size of the file will give me poor performance during my analysis.

Any best practices to be shared for analyzing big-ish data on a local machine?

I am running an i7-4570 with 8 GB of RAM. I am hoping to be less reliant on SQL queries and be able to run heavy analysis offline.

Due to the nature of the database, a fresh extract is needed each time and this process will have to repeat itself, so there will not be much appending of data.

I have explored HDFStore and Tableau Data Extracts, but I am still curious whether I can get better performance by reading whole CSV files.

Is there a compression method of sorts that I might be missing? Again, the objective here is to run the analysis without constantly querying the server. The source itself (which I am optimizing) will refresh every morning, so when I get into the office I can just focus on getting coffee and doing some blazing-fast analytics.
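As a starting point before reaching for a database, shrinking dtypes and dumping the frame to a compressed binary format already avoids re-parsing 30 GB of CSV text every morning. Below is a minimal sketch of that idea; the DataFrame, the column names, and the file name `extract.pkl.gz` are all stand-ins for illustration (the real data would come from `pd.read_csv` on the daily extract):

```python
import numpy as np
import pandas as pd

# Stand-in for the daily extract (the real frame would come from read_csv).
df = pd.DataFrame({
    "id": np.arange(10_000, dtype=np.int64),
    "category": np.random.choice(["a", "b", "c"], size=10_000),
    "value": np.random.rand(10_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast numerics and convert low-cardinality strings to categoricals;
# this often cuts the in-memory footprint by a large factor.
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = df["value"].astype("float32")
df["category"] = df["category"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)  # the downcast/categorical frame is much smaller

# A compressed binary dump loads far faster than re-parsing CSV text.
df.to_pickle("extract.pkl.gz", compression="gzip")
reloaded = pd.read_pickle("extract.pkl.gz")
```

`to_hdf` with `complib="blosc"` (the HDFStore route mentioned above) follows the same pattern and additionally supports querying subsets from disk.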

1
Isn't this a dupe of this: stackoverflow.com/questions/14262433/… — EdChum
I am not limited to exploring only pandas; I am even open to having a small database on my machine if it is much faster. I just do not know the best practices. — BernardL
But then you make this question opinion-based, as everyone will have a different answer. The dupe link deals with how to use pandas with HDFStore effectively, which to me makes this question very similar, IMO. — EdChum

1 Answer

0
votes

With Tableau you would want to take an extract of the CSV, since an extract will be much quicker to query than the raw CSV. That should be fine because the extract sits on disk. However, as mentioned, you will need to create a new extract whenever your data changes.

With pandas I usually load everything into memory, but if it doesn't fit, you can read the CSV in chunks using the `chunksize` parameter (see this thread: How to read a 6 GB csv file with pandas).
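A minimal sketch of the chunked approach: aggregate each chunk as it arrives so the full file never has to fit in 8 GB of RAM at once. The file `big.csv`, its columns, and the chunk size are illustrative stand-ins (the real call would point `read_csv` at the 30 GB extract with a chunk size in the hundreds of thousands of rows):

```python
import pandas as pd

# Write a tiny CSV so the example is self-contained; in practice this
# would be the large daily extract already on disk.
pd.DataFrame({"x": range(100), "y": range(100)}).to_csv("big.csv", index=False)

# Stream the file in chunks, folding each one into a running aggregate.
total = 0
rows = 0
for chunk in pd.read_csv("big.csv", chunksize=25):
    total += chunk["y"].sum()
    rows += len(chunk)

print(rows, total)  # 100 4950
```

This works for anything reducible per chunk (sums, counts, group-by partials); operations that need the whole dataset at once, like a global sort, are where an on-disk store such as HDFStore or a local database earns its keep.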