I started yesterday applying a function to a decent-sized dataset (6 million rows), but it's taking forever. I've even tried pandarallel, but that isn't working well either. In any case, here is the code I'm using:
import numpy as np

def classifyForecast(dataframe):
    # Number of periods (rows) with non-zero demand
    buckets = len(dataframe[dataframe['QUANTITY'] != 0])
    try:
        # Average demand interval and coefficient of variation for this chunk
        adi = dataframe.shape[0] / buckets
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()
        if adi < 1.32:
            if cov < .49:
                dataframe['TYPE'] = 'Smooth'
            else:
                dataframe['TYPE'] = 'Erratic'
        else:
            if cov < .49:
                dataframe['TYPE'] = 'Intermittent'
            else:
                dataframe['TYPE'] = 'Lumpy'
    except:
        dataframe['TYPE'] = 'Smooth'
    try:
        dataframe['ADI'] = adi
    except:
        dataframe['ADI'] = np.inf
    try:
        dataframe['COV'] = cov
    except:
        dataframe['COV'] = np.inf
    return dataframe
from pandarallel import pandarallel

pandarallel.initialize()

def quick_classification(df):
    return df.parallel_apply(classifyForecast(df))
Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row; instead, I want it to work on the chunks, so that I can get the .mean() and .std() of specific columns. It shouldn't take 48 hours to complete. How do I speed this up?
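For clarity, here is roughly what I mean by working on chunks (just a sketch; in my data a chunk is the set of rows sharing a cp_ref value, as mentioned in the comments below, and df stands for the full 6-million-row dataframe):

import pandas as pd

# Rough sketch of the chunked (serial) version: each group of rows sharing a
# cp_ref value is passed to classifyForecast as one chunk, so .mean() and
# .std() are computed over the whole chunk rather than a single row.
result = pd.concat(
    classifyForecast(chunk) for _, chunk in df.groupby('cp_ref')
)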
Comments:

"[…] apply. You are supposed to pass a function to parallel_apply, and that function will be called once for each row. You are not PASSING the function, you are CALLING your function. It will do its work in the normal method, then return a dataframe. You then pass that dataframe to parallel_apply. Who knows what that will do." – Tim Roberts

"adi and cov values need to apply to the entire dataframe. Right? But if that's the case, what is the rest of the code doing? Maybe you should describe the problem in words." – Tim Roberts

"[…] excepts?" – Axe319

"[…] print and a percentage of completion. I'm cutting the dataframe into parts by looking at the cp_ref column and using it to pull out unique data that is more than just one row at a time. I need it to apply the function to the dataframe in these chunks." – Ravaal
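For reference, a minimal sketch of the direction the comments point in, not a verified fix: pass classifyForecast itself to parallel_apply (rather than calling it), and let groupby('cp_ref') supply the chunks. This assumes cp_ref really is the grouping key and that the installed pandarallel version supports group-level parallel_apply.

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)  # progress_bar is optional

def quick_classification(df: pd.DataFrame) -> pd.DataFrame:
    # Pass the function itself: pandarallel calls classifyForecast once per
    # cp_ref group, so each call sees a whole chunk and groups run in parallel.
    return df.groupby('cp_ref', group_keys=False).parallel_apply(classifyForecast)

Here group_keys=False is a pandas groupby option that keeps the result indexed like the input instead of adding a cp_ref index level.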