0 votes

I am very new to Python. In our company we use Base SAS for data analysis (ETL, EDA, basic model building), and we want to check whether it can be replaced with Python for large volumes of data. With respect to that, I have the following questions:

  1. How does Python handle large files? My PC has 8 GB of RAM and I have a 30 GB flat file (say a CSV). I would generally do operations like left joins, deletes, and group-bys on such a file. This is easily doable in SAS, i.e. I don't have to worry about low RAM. Are the same operations doable in Python? I would appreciate it if somebody could provide a list of libraries and code for the same (see the first sketch after question 2).

  2. How can I perform SAS operations like PROC SQL in Python to create a dataset on my local PC while fetching the data from a server? I.e., in SAS I would download 10 million rows (7.5 GB of data) from SQL Server with the following:


libname aa odbc dsn=sql user=pppp pwd=XXXX;
libname bb '<<local PC path>>';

proc sql outobs=10000000;
create table bb.foo as
select * from aa.bar;
quit;

What is the method to perform the same in Python? Again, just to remind you: my PC has only 8 GB of RAM (see the second sketch below).
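For question 1, one common out-of-core pattern is pandas' chunked CSV reader, which keeps only one chunk in memory at a time (dask is an alternative that expresses the whole pipeline lazily). A minimal sketch, assuming hypothetical column names customer_id and amount:

import pandas as pd

# Stream the 30 GB CSV in fixed-size chunks and aggregate as we go,
# so peak memory is roughly one chunk rather than the whole file.
totals = {}
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    partial = chunk.groupby("customer_id")["amount"].sum()
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + value

result = pd.Series(totals, name="amount_total")
print(result.head())

The same chunked pattern covers filtering (deletes); a chunk-by-chunk left join works when the right-hand table fits in memory, e.g. chunk.merge(small_df, on="customer_id", how="left").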
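For question 2, a sketch of the same download streamed into a local SQLite file, so that no more than one chunk is in RAM at once. The DSN, credentials, and table names are placeholders mirroring the SAS example above; note that pandas officially prefers SQLAlchemy engines, so a raw pyodbc connection may emit a warning but is commonly used this way:

import pandas as pd
import pyodbc
import sqlite3

# Connect to SQL Server through the same DSN as the SAS libname,
# and open a local SQLite file as the on-disk destination.
src = pyodbc.connect("DSN=sql;UID=pppp;PWD=XXXX")
dst = sqlite3.connect("local_copy.db")

# TOP 10000000 mirrors SAS's outobs=10000000; each 100,000-row chunk
# is appended to disk and then freed, so 7.5 GB never sits in RAM.
for chunk in pd.read_sql("SELECT TOP 10000000 * FROM bar", src,
                         chunksize=100_000):
    chunk.to_sql("foo", dst, if_exists="append", index=False)

src.close()
dst.close()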


1 Answer

0 votes

Python, and especially Python 3.x, provides a lot of tools for handling large files. One of them is using iterators.

Python exposes the result of opening an input (a text file, a CSV, etc.) as an iterator, so you won't have the problem of loading the whole file into memory: you can read your file line by line and handle each line as needed.
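A minimal sketch of that pattern, counting matching lines in a large log (the file name and the "ERROR" condition are illustrative):

error_count = 0
with open("huge.log") as f:   # the file object is itself an iterator
    for line in f:            # only one line is in memory at a time
        if "ERROR" in line:   # apply any per-line condition here
            error_count += 1
print(error_count)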

For example, if you want to chunk your file into blocks, you can use a deque object to preserve the lines that belong to one block (based on your condition).
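A sketch of that idea, assuming blocks are separated by a delimiter line (the delimiter and file layout are hypothetical):

from collections import deque

def read_blocks(path, delimiter="---"):
    # Collect lines into a deque until a delimiter line is seen,
    # then yield the finished block and start a new one.
    block = deque()
    with open(path) as f:
        for line in f:
            if line.rstrip("\n") == delimiter:
                if block:
                    yield list(block)
                    block.clear()
            else:
                block.append(line)
    if block:  # yield the last block if the file lacks a trailing delimiter
        yield list(block)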

Alongside collections.deque, you can use some itertools functions to handle and apply your conditions to your lines. For example, if you want to access the next line on each iteration you can use itertools.zip_longest, and to create multiple independent iterators from your file object you can use itertools.tee.
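For instance, a sketch that pairs each line with the line after it using tee and zip_longest:

from itertools import tee, zip_longest

def pairwise_lines(path):
    # tee creates two independent iterators over the same file object;
    # advancing one of them by a single step gives a lookahead view.
    with open(path) as f:
        current, ahead = tee(f)
        next(ahead, None)
        # zip_longest pads the shorter iterator, so the final line is
        # paired with None instead of being dropped.
        for line, nxt in zip_longest(current, ahead):
            yield line, nxt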

Recently I wrote some code for filtering huge log files (30 GB and larger) which performs very well. I have put the code on GitHub, where you can check it out and use it:

https://github.com/Kasramvd/log-filter