2
votes

My data is 88200 rows × 29403 columns (roughly 14 GB). The file was created in MATLAB using dlmwrite. I have tried the following methods to read it in Python; in every attempt I ran out of memory:

My system: Ubuntu 16.04, 32 GB RAM, 20 GB swap, Python 2.7.12, pandas 0.19, GCC 5.4.0

1> using csv:

import csv
import numpy

filename = 'data.txt'
raw_data = open(filename, 'rb')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)  # materialises every row as a Python list of strings
data = numpy.array(x).astype('float')  # second full copy, as float64

2a> using numpy loadtxt:

import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
data = numpy.loadtxt(raw_data, delimiter=",")

2b> using numpy genfromtxt:

import numpy as np

x = np.genfromtxt('vectorized_image_dataset.txt', skip_header=0,
                  skip_footer=0, delimiter=',', dtype='float32')

3> using pandas.read_csv:

from pandas import read_csv, concat
import numpy as np

tp = read_csv(filepath_or_buffer='data.txt', header=None, iterator=True, chunksize=1000)
df = concat(tp, ignore_index=True)

In all the above methods it ran out of memory.

The data file was created with MATLAB's dlmwrite: a list of images (list.txt) is read one by one; each image is converted to float, vectorized, and appended to the output file. The code is below:

fileID = fopen('list.txt');
N = 88200;
C = textscan(fileID, '%s');
fclose(fileID);

for i = 1:N
    A = imread(C{1}{i});
    % convert the image to a column vector
    B = A(:);
    % transpose to a row
    D = B';
    % divide by 256
    %E = double(D)/double(256);
    E = single(D)/single(256);
    dlmwrite('vectorized_image_dataset.txt', E, '-append');
    clear A; clear B; clear D; clear E;
end
Have you tried reading the file line by line? Open it with with open("data.txt", "r") as f: and then process one line at a time with a for loop: for line in f: . - GeckStar
I need the whole data in a numpy array. If I read line by line, I would have to append each new line's data to the numpy array, which means resizing the array on every iteration. In MATLAB, array resizing is very slow; I guess it will be slow in numpy as well? Anyway, I will give it a try. - user27665
Instead of appending one array row per line try reading chunks of the data in a loop (halves or quarters) and concatenate the arrays afterwards. - Nils Werner
If the target is a numpy.array and that doesn't fit into memory, all the suggestions about how to better read the file will not help. You might want to look at numpy.memmap and/or PyTables - hvwaldow
Are you using 32bit Python or 64bit python? - TheBlackCat
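The numpy.memmap suggestion in the comments can be sketched as a one-pass conversion: stream the text file into a disk-backed .npy so that peak RAM stays at roughly one row. A minimal sketch, assuming comma-separated rows and hypothetical file names:

```python
import numpy as np

def text_to_memmap(txt_path, npy_path, n_rows, n_cols):
    # Disk-backed float32 array: half the memory of float64, and each
    # parsed row is written straight to disk instead of kept in RAM.
    out = np.lib.format.open_memmap(npy_path, mode='w+',
                                    dtype=np.float32,
                                    shape=(n_rows, n_cols))
    with open(txt_path) as f:
        for i, line in enumerate(f):
            out[i] = np.array(line.split(','), dtype=np.float32)
    out.flush()

# Later, reopen without loading everything:
# data = np.load('vectorized_image_dataset.npy', mmap_mode='r')
```

For the 88200 × 29403 case, the resulting .npy is about 10 GB of float32 on disk, and np.load(..., mmap_mode='r') pages rows in on demand.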

2 Answers

1
votes
def read_line_by_line(file_path: str):
    with open(file_path) as file:
        for line in file:
            yield line

Maybe this function will help you. I am not very familiar with NumPy/Pandas, but it seems you are trying to load all the data at once and hold it in memory. With the function above you use a generator that yields only one line at a time, so there is no need to store everything in RAM.
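Since the row and column counts are known here (88200 × 29403), such a generator can feed a preallocated float32 array, so memory is bounded by the final array rather than by intermediate Python lists. A sketch, assuming comma-separated rows:

```python
import numpy as np

def read_line_by_line(file_path):
    with open(file_path) as file:
        for line in file:
            yield line

def lines_to_array(file_path, n_rows, n_cols):
    # Allocate the target array once; each parsed line fills one row,
    # avoiding the per-iteration resize the question worries about.
    data = np.empty((n_rows, n_cols), dtype=np.float32)
    for i, line in enumerate(read_line_by_line(file_path)):
        data[i] = np.array(line.split(','), dtype=np.float32)
    return data
```

At float32 the full array needs about 10 GB (versus roughly 21 GB at the float64 default), which fits in 32 GB of RAM without the per-row string lists that the all-at-once readers build.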

0
votes

I solved it using pandas.read_csv. I split data.txt into four pieces of 22050 lines each, then did:

tp1 = read_csv(filepath_or_buffer='data_first_22050.txt', header=None, iterator=True, chunksize=1000)
df1 = concat(tp1, ignore_index=True)
tp2 = read_csv(filepath_or_buffer='data_second_22050.txt', header=None, iterator=True, chunksize=1000)
df2 = concat(tp2, ignore_index=True)
frames = [df1, df2]
result=concat(frames)
del frames, df1, df2, tp1, tp2
tp3 = read_csv(filepath_or_buffer='data_third_22050.txt', header=None, iterator=True, chunksize=1000)
df3 = concat(tp3, ignore_index=True)
frames=[result,df3]
result2=concat(frames)
del frames, df3, tp3, result
tp4 = read_csv(filepath_or_buffer='data_fourth_22050.txt', header=None, iterator=True, chunksize=1000)
df4 = concat(tp4, ignore_index=True)
frames=[result2,df4]
result3=concat(frames)
del frames, tp4, df4, result2
A=result3.as_matrix()
A.shape

(88200, 29403)
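The four hand-written blocks above can be collapsed into one loop. Passing dtype=np.float32 to read_csv also halves the memory of the default float64. A sketch using the same (hypothetical) piece names; .values replaces the since-deprecated .as_matrix():

```python
import numpy as np
import pandas as pd

def load_parts(paths):
    # Read each piece as float32 and stack them into one DataFrame;
    # the generator keeps only one piece plus the growing result alive.
    return pd.concat(
        (pd.read_csv(p, header=None, dtype=np.float32) for p in paths),
        ignore_index=True)

# parts = ['data_first_22050.txt', 'data_second_22050.txt',
#          'data_third_22050.txt', 'data_fourth_22050.txt']
# A = load_parts(parts).values
```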