Problem statement
How can I chunk-read a csv file using pandas so that consecutive chunks overlap?
For example, imagine that the list indexes represents the index of some dataframe I wish to read in (a small sketch of the chunking logic I'm after follows the examples below).
indexes = [0,1,2,3,4,5,6,7,8,9]
read_csv(filename, chunksize=None):
indexes = [0,1,2,3,4,5,6,7,8,9] # read in all indexes at once
read_csv(filename, chunksize=5):
indexes = [[0,1,2,3,4], [5,6,7,8,9]] # iteratively read in mutually exclusive index sets
read_csv(filename, chunksize=5, overlap=2): # desired behaviour; overlap is not an existing read_csv parameter
indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]] # iteratively read in index sets with an overlap of size 2
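To make the target behaviour concrete, here is a minimal sketch of the chunking logic on plain lists (overlapping_chunks is my own helper name, not a pandas feature):

def overlapping_chunks(indexes, chunksize, overlap):
    step = chunksize - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(indexes), step):
        chunks.append(indexes[start:start + chunksize])
        if start + chunksize >= len(indexes):
            break  # the last chunk already reaches the end of the data
    return chunks

overlapping_chunks([*range(10)], chunksize=5, overlap=2)
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]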
Working solution
I have a hacky solution using skiprows and nrows, but it gets progressively slower the further it reads into the csv file, presumably because each call still has to scan past all of the skipped rows. An alternative direction I've been considering is sketched after the code below.
import pandas as pd

indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes)  # this I can work out rather cheaply before reading the whole file in
# start row of each chunk; the final chunk may be short, assume that's fine for now (it's more about the logic)
chunk_starts = range(0, row_count, chunksize - overlap_count)
for start in chunk_starts:
    # skip file lines 1..start so this read begins at the chunk start (line 0 is kept as the header)
    skiprows = range(1, start + 1)
    chunk = pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)
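For reference, this is the kind of single-pass approach I'm hoping exists. Below is an untested sketch (read_overlapping_chunks is my own helper, not a pandas feature; it assumes 0 < overlap < chunksize) that uses the built-in chunksize iterator and buffers rows, so each emitted DataFrame overlaps the previous one without re-scanning the start of the file on every chunk:

import pandas as pd

def read_overlapping_chunks(filename, chunksize, overlap):
    step = chunksize - overlap  # distance between consecutive chunk starts
    buffer = None
    # read the file once, in non-overlapping pieces of `step` rows
    for piece in pd.read_csv(filename, chunksize=step):
        buffer = piece if buffer is None else pd.concat([buffer, piece])
        # emit a full chunk whenever enough rows have been buffered
        while len(buffer) >= chunksize:
            yield buffer.iloc[:chunksize]
            buffer = buffer.iloc[step:]
    if buffer is not None and len(buffer) > 0:
        yield buffer  # final, possibly shorter, chunk

The idea is that each row is parsed from disk only once, and the overlap comes from reusing the last overlap rows already held in memory; I'm not sure whether this is the idiomatic way to do it, hence the question.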
Does anyone have any insights or improved solutions for this problem?