I have a large dataframe in which each row contains a large amount of text data. I am trying to partition this dataframe on one of its columns (column 11) and then write the partitions to multiple files:
partitioncount = 5
trainingDataFile = 'sometrainingDatFileWithHugeTextDataInEachColumn.tsv'
df = pd.read_table(trainingDataFile, sep='\t', header=None, encoding='utf-8')
# prepare output files and keep them to append the dataframe rows
outputfiles = {}
filename = "C:\Input_Partition"
for i in range(partitioncount):
    outputfiles[i] = open(filename + "_%s.tsv"%(i,), "a")
#Loop through the dataframe and write to buckets/files
for index, row in df.iterrows():
    #partition on a hash function
    partition = hash(row[11]) % partitioncount
    outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")
This code results in an error:
IndexError Traceback (most recent call last)
in ()
---> 73 outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1326 else:
   1327     key = com._apply_if_callable(key, self.obj)
-> 1328     return self._getitem_axis(key, axis=0)
   1329
   1330 def _is_scalar_access(self, key):

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1747
   1748 # validate the location
-> 1749 self._is_valid_integer(key, axis)
   1750
   1751 return self._get_loc(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1636 l = len(ax)
   1637 if key >= l or key < -l:
-> 1638     raise IndexError("single positional indexer is out-of-bounds")
   1639 return True
   1640

IndexError: single positional indexer is out-of-bounds
What is the most efficient and scalable way to do this, i.e. iterate over the dataframe's rows, do some operations on each row (not shown in the code above and irrelevant to the problem at hand), and finally write each row (with a large amount of text data) to a text file?
Appreciate any help!
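For illustration, here is a minimal sketch of one way the partition-and-write step could be done without looking rows up by position. It rests on assumptions: the IndexError may come from df.iterrows() yielding index labels while df.iloc expects positions (which would only differ if the index is no longer 0..n-1, for example after filtering rows in the steps not shown). The names stable_bucket and write_partitions are hypothetical, and zlib.crc32 is swapped in for hash() so that bucket numbers are the same on every run.

```python
import zlib

import pandas as pd


def stable_bucket(value, n_buckets):
    """Map a value to a bucket in [0, n_buckets) using CRC32 of its string form."""
    return (zlib.crc32(str(value).encode("utf-8")) & 0xFFFFFFFF) % n_buckets


def write_partitions(df, key_col, n_buckets, prefix):
    """Append the rows of df to <prefix>_<bucket>.tsv, bucketed by key_col.

    Returns a dict mapping each bucket that received rows to its file path.
    """
    # One bucket number per row, aligned with df's own index (labels, not positions).
    buckets = df[key_col].map(lambda v: stable_bucket(v, n_buckets))
    paths = {}
    for bucket, part in df.groupby(buckets):
        path = "%s_%d.tsv" % (prefix, bucket)
        # mode="a" appends, mirroring the open(..., "a") calls in the question.
        # Note: to_csv quotes fields containing tabs or quotes, unlike a raw join.
        part.to_csv(path, sep="\t", header=False, index=False,
                    mode="a", encoding="utf-8")
        paths[bucket] = path
    return paths
```

With the variables from the question, a call might look like write_partitions(df, 11, partitioncount, filename). Grouping writes each bucket in one call instead of one write per row, which is usually faster than iterrows() on large frames, though that has not been measured here.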
partition = hash(row[11]) % partitioncount
has me a bit baffled. What is this supposed to do? - roganjosh
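As far as can be read from the code, that line is a hash partitioner: it turns the value in column 11 into an integer and takes it modulo partitioncount, so every row with the same key lands in the same one of the 5 files. A small sketch of the idea follows. The helper bucket_for is hypothetical, and CRC32 replaces hash() as an assumption, because on Python 3 the hash() of a str changes between processes unless PYTHONHASHSEED is set.

```python
import zlib


def bucket_for(key, n_buckets):
    """Return a bucket number in [0, n_buckets) for the given key."""
    # Hash the key's string form to an unsigned 32-bit integer,
    # then fold it into the available number of buckets.
    digest = zlib.crc32(str(key).encode("utf-8")) & 0xFFFFFFFF
    return digest % n_buckets


keys = ["alice", "bob", "alice", "carol"]
buckets = [bucket_for(k, 5) for k in keys]
# The two "alice" entries always share a bucket; other keys may or may not collide.
```

The modulo only decides which file a row goes to; it does not balance the files, so a skewed key column would still produce uneven partitions.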