
I have a large dataframe where each row contains a large amount of text data. I am trying to partition this dataframe on one of its columns (column 11) and then write it out to multiple files:

import pandas as pd

partitioncount = 5
trainingDataFile = 'sometrainingDatFileWithHugeTextDataInEachColumn.tsv'
df = pd.read_table(trainingDataFile, sep='\t', header=None, encoding='utf-8')
# prepare output files and keep them to append the dataframe rows
outputfiles = {}
filename = "C:\Input_Partition"
for i in range(partitioncount):
    outputfiles[i] = open(filename + "_%s.tsv"%(i,), "a")

#Loop through the dataframe and write to buckets/files
for index, row in df.iterrows():
    #partition on a hash function
    partition = hash(row[11]) % partitioncount
    outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

This code fails with the following error:

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
---> 73     outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328         return self._getitem_axis(key, axis=0)
   1329
   1330     def _is_scalar_access(self, key):

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1747
   1748         # validate the location
-> 1749         self._is_valid_integer(key, axis)
   1750
   1751         return self._get_loc(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1636         l = len(ax)
   1637         if key >= l or key < -l:
-> 1638             raise IndexError("single positional indexer is out-of-bounds")
   1639         return True
   1640

IndexError: single positional indexer is out-of-bounds

What is the most efficient and scalable way to do this, i.e. iterate over the dataframe's rows, perform some operations on each row (not shown above, since they are irrelevant to the problem at hand), and finally write each row (with its large amount of text data) to a text file?

Appreciate any help!

partition = hash(row[11]) % partitioncount has me a bit baffled. What is this supposed to do? - roganjosh
It's just a hash function to randomly select a bucket/file. It takes the value from column 11, hashes it (to randomize), and then applies modulo 5, which gives at most 5 partitions. - Vivek Jain
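(For illustration, a minimal standalone sketch of that bucketing idea; the sample values and partition count here are made up.)

partitioncount = 5

# each value maps deterministically to one of the 5 buckets
# (note: in Python 3, str hashes change between interpreter runs unless
#  PYTHONHASHSEED is fixed; within a single run the mapping is stable)
for value in ["alpha", "bravo", "charlie", "delta"]:
    bucket = hash(value) % partitioncount
    print(value, "->", bucket)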

1 Answer


IIUC you can do it this way:

# hash the values in column 11 into `partitioncount` buckets and
# write each bucket to its own tab-separated file
filename = r'/path/to/output_{}.csv'

df.groupby(df.iloc[:, 11].map(hash) % partitioncount) \
  .apply(lambda g: g.to_csv(filename.format(g.name), sep='\t', index=False))
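As a side note on the original error: iterrows() already yields each row, so there is no need to go back through df.iloc[index], which treats the row label as a positional index and raises the IndexError once labels and positions diverge. A minimal sketch of the corrected loop, assuming df, partitioncount and outputfiles are set up as in the question:

# iterrows() gives the row directly, so write it out without re-indexing df
for index, row in df.iterrows():
    partition = hash(row[11]) % partitioncount
    # use the row yielded by iterrows() instead of df.iloc[index]
    outputfiles[partition].write("\t".join(str(num) for num in row.values) + "\n")

That said, the groupby approach above avoids the Python-level loop entirely and should scale better for a large dataframe.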