Imputation of Missing Value on Python for Consecutive NaN

Question

Edit: I am now looking for a clean way to impute the missing value of my dataset on Python

     a       b         c
0    1       2         Falcon

1    np.nan   3        Falcon

2    np.nan   np.nan   Falcon
  
3    np.nan    4       Bird

4    np.nan   5        Bird
 
5    5        np.nan   Bird

6    6        7        Bird

I will need to impute using the following conditions based on their individual group.
(1) for missing values that has a value in its preceding and previous row, fill it by interpolating
(2) for missing values that has a value in its preceding or previous row, fill it with the preceding or previous row value
(3) for missing values that does not satisfy (1) and (2), fill it with 0

An example of the result will be

     a   b      c
0    1   2      Falcon

1    1   3      Falcon

2    0   3      Falcon
  
3    0   4      Bird

4    5   5      Bird

5    5   6      Bird

6    6   7      Bird

I have tried the code provided in the comment below, however i failed to set it via its respective group. How do I go about doing it in Python?

Do you have code for not so clean way? What kind of interpolation? — Roman Susi
Linear interpolation. I am not sure how to create a function/loop for python to detect only the fields that satisfied the 3 conditions. I have tried "dataset.fillna(method='ffill', inplace=True, limit=1)" to satisfy condition 2, but when i used this, it also fill in the missing value of row 5 np.nan — anali

Roman Susi Roman Susi · Accepted Answer · 2021-02-18T07:08:38

This is not complete answer yet, but a couple of hints. Assuming we deal with a single array here (eg, a).

First, concatenate [nan] at the beginning and at the end of a to get a1. This will simplify applying imputation rules.

Second, you can now take three arrays a1[:-2], a1[1:-1], a1[2:] to have all three moving window values.

Of course, you will need to write a function to impute the value based on those three values, but it will be simple:

def impute(xprev, x, xnext):
   if not isnan(x):
      return x
   # the rest of the logic here left as an exercise

Hint 2: There probably are more efficient constructs (like np.stack), but one simple way to have all three values is the normal zip:

result_list = [impute(prev, this, next) for (prev, this, next) in zip(a1[:-2], a1[1:-1], a1[2:])]
# turn list to array or you can use np.fromiter(( impute(...) for ... ), np.float64) to make array directly from iterable

I hope those hints are enough to build the solution.

In the solution above I applied a couple of general patterns, which I discovered practicing programming:

Padding with convenient stop values instead of checking indexes for borders
Using zip to have small-sized moving windows

Note. It's also possible to generate indexes for the extended array (a1) instead of using zip or stack. So I believe there will be other answers with more optimizations as Numpy has a lot of functions to offer. Also the impute above can probably be made into ufunc and applied to stacked shifted arrays - however, without profiling it's hard to tell whether it will be more efficient than the simple list comprehension above or corresponding solution with an index.

Imputation of Missing Value on Python for Consecutive NaN

2 Answers