1
votes

Edit: I am now looking for a clean way to impute the missing value of my dataset on Python

     a       b         c
0    1       2         Falcon

1    np.nan   3        Falcon

2    np.nan   np.nan   Falcon
  
3    np.nan    4       Bird

4    np.nan   5        Bird
 
5    5        np.nan   Bird

6    6        7        Bird

I will need to impute using the following conditions based on their individual group.
(1) for missing values that has a value in its preceding and previous row, fill it by interpolating
(2) for missing values that has a value in its preceding or previous row, fill it with the preceding or previous row value
(3) for missing values that does not satisfy (1) and (2), fill it with 0


An example of the result will be

     a   b      c
0    1   2      Falcon

1    1   3      Falcon

2    0   3      Falcon
  
3    0   4      Bird

4    5   5      Bird

5    5   6      Bird

6    6   7      Bird

I have tried the code provided in the comment below, however i failed to set it via its respective group. How do I go about doing it in Python?

2
Do you have code for not so clean way? What kind of interpolation? - Roman Susi
Linear interpolation. I am not sure how to create a function/loop for python to detect only the fields that satisfied the 3 conditions. I have tried "dataset.fillna(method='ffill', inplace=True, limit=1)" to satisfy condition 2, but when i used this, it also fill in the missing value of row 5 np.nan - anali

2 Answers

0
votes

This is not complete answer yet, but a couple of hints. Assuming we deal with a single array here (eg, a).

First, concatenate [nan] at the beginning and at the end of a to get a1. This will simplify applying imputation rules.

Second, you can now take three arrays a1[:-2], a1[1:-1], a1[2:] to have all three moving window values.

Of course, you will need to write a function to impute the value based on those three values, but it will be simple:

def impute(xprev, x, xnext):
   if not isnan(x):
      return x
   # the rest of the logic here left as an exercise

Hint 2: There probably are more efficient constructs (like np.stack), but one simple way to have all three values is the normal zip:

result_list = [impute(prev, this, next) for (prev, this, next) in zip(a1[:-2], a1[1:-1], a1[2:])]
# turn list to array or you can use np.fromiter(( impute(...) for ... ), np.float64) to make array directly from iterable

I hope those hints are enough to build the solution.

In the solution above I applied a couple of general patterns, which I discovered practicing programming:

  1. Padding with convenient stop values instead of checking indexes for borders
  2. Using zip to have small-sized moving windows

Note. It's also possible to generate indexes for the extended array (a1) instead of using zip or stack. So I believe there will be other answers with more optimizations as Numpy has a lot of functions to offer. Also the impute above can probably be made into ufunc and applied to stacked shifted arrays - however, without profiling it's hard to tell whether it will be more efficient than the simple list comprehension above or corresponding solution with an index.

0
votes

I cannot imagine a single direct way, so I would use one pass for each of your requirements. The nice thing is that as each pass will fill in some values, following passes will not try to fill them:

  1. for missing values that has a value in its preceding and previous row, fill it by interpolating

     df[df.isna()&(~df.shift().isna())&(~df.shift(-1).isna())] = df.interpolate()
    
  2. for missing values that has a value in its preceding or previous row, fill it with the preceding or previous row value

     df[df.isna()&(~df.shift().isna())] = df.ffill()
     df[df.isna()&(~df.shift(-1).isna())] = df.bfill()
    
  3. for missing values that does not satisfy (1) and (2), fill it with 0

     df.fillna(0)
    

With your sample data, it gives as expected:

     a    b
0  1.0  2.0
1  1.0  3.0
2  0.0  3.5
3  0.0  4.0
4  5.0  5.0
5  5.0  6.0
6  6.0  7.0

It does require 4 operations, but all of them are vectorized, so the global processing time should be tolerable.