Missing values replace by med/mean in conti var, by mode in categorical var in pandas dataframe -after grouping the data by a column)

Question

I have a pandas dataframe , where all missing values are np.nan, now I am trying to replace these missing values. The last column of my data is " class" , I need to group the data based on the class, then get mean/median/mode (based on data whether data is categorical/ continuos, normal/ not) of that group of a column and replace missing values of the group of the coulmn by respective mean/median/mode.

This is the code I have come up with , which I know is an overkill.. if I could :

group the col of dataframe
get median/mode/mean of groups of the cols
replace the missing of those groups
recombine them back to original df

it would be great.

but currently I landed up , finding replacement values (mean/median/mode) group wise and storing in dict, then seperating the nan tuples and non-nan tuples.. replacing missing values in nan tuples.. and trying to join them back to dataframe (which i donno yet how to do)

def fillMissing(df, dataType):
'''
Args:
    df ( 2d array/ Dict):
                         eg : ('attribute1': [12, 24, 25] , 'attribute2': ['good', 'bad'])
    dataTypes (dict): Dictionary of attribute names of df as keys and values 0/1 
                        indicating categorical/continuous variable eg:  ('attribute1':1, 'attribute2': 0)

Returns:
    dataframe wih missing values filled
    writes a file with missing values replaces.    


'''
dataLabels = list(df.columns.values)

# the dictionary to hold the values to put in place of nan
replaceValues = {}

for eachlabel in dataLabels:

    thisSer = df[eachlabel]
    if dataType[eachlabel] == 1:                        # if its a continuous variable 
        _,pval = stats.normaltest(thisSer)
        groupedd = thisSer.groupby(df['class'])

        innerDict ={}
        for name, group in groupedd:
            if(pval < 0.5):
                groupMiddle = group.median()            # get the median of the group
            else:
                groupMiddle = group.mean()              # get mean (if group is normal )
            innerDict[name.strip()] = groupMiddle
        replaceValues[eachlabel] = innerDict

    else:                                               # if the series is continuous
        # freqCount = collections.Counter(thisSer)
        groupedd = thisSer.groupby(df['class'])
        innerDict ={}
        for name, group in groupedd:
            freqC = collections.Counter(group)      
            mostFreq = freqC.most_common(1)             # get the most frequent value of the attribute(grouped by class)
            # newGroup = group.replace(np.nan , mostFreq)
            innerDict[name.strip()] = mostFreq[0][0].strip()
        replaceValues[eachlabel] = innerDict
print replaceValues


# replace the missing values =======================
newfile = open('missingReplaced.csv', 'w')
newdf = df   

mask=False
for col in df.columns: mask = mask | df[col].isnull()

# get the dataframe of tuples that contains nulls
dfnulls = df[mask]
dfnotNulls = df[~mask]


for _, row in dfnulls.iterrows():
    for colname in dataLabels:
        if pd.isnull(row[colname]):
            if row['class'].strip() == '>50K':
                row[colname] = replaceValues[colname]['>50K']
            else:
                row[colname] = replaceValues[colname]['<=50K']
        newfile.write(str(row[colname]) + ",")
    newdf.append(row)
    newfile.write("\n")

# here add newdf to dfnotNulls to get finaldf

return finaldf

It helps if you can provide a small sample dataset and an example of the results you're trying to achieve. — JohnE
Thanks @JohnE I am still reading your replay and trying to understand it; the dataset I am using is archive.ics.uci.edu/ml/machine-learning-databases/adult I am working with adult.data file. — swati saoji

JohnE JohnE · Accepted Answer · 2015-03-31T19:33:58

If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode at the bottom as it is slightly trickier than mean and median.

df = pd.DataFrame({ 'v':[1,2,2,np.nan,3,4,4,np.nan] }, index=[1,1,1,1,2,2,2,2],)

df['v_mean'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mean()))
df['v_med' ] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mode()[0]))

df
    v    v_mean  v_med  v_mode
1   1  1.000000      1       1
1   2  2.000000      2       2
1   2  2.000000      2       2
1 NaN  1.666667      2       2
2   3  3.000000      3       3
2   4  4.000000      4       4
2   4  4.000000      4       4
2 NaN  3.666667      4       4

Note that mode() may not be unique, unlike mean and median and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the series.

Missing values replace by med/mean in conti var, by mode in categorical var in pandas dataframe -after grouping the data by a column)

1 Answers