Removing numpy array columns with the same non-missing value, when missing values present

Question

I have a NumPy array from which I need to remove two kinds of columns: those whose non-missing cells all hold the same value, and those where every value is missing. The array:

>>> import numpy as np
>>> x = np.array([[ 1.,  2.,  2., np.nan,  2.,  2.,  1.],
       [ 2., np.nan,  1., np.nan,  2.,  2.,  1.],
       [np.nan,  1.,  1., np.nan,  2.,  2.,  1.],
       [ 1.,  2., np.nan, np.nan,  2., np.nan,  0.],
       [ 0.,  1.,  1., np.nan,  2., np.nan,  0.],
       [ 1.,  1.,  1., np.nan,  2.,  2.,  1.],
       [np.nan,  1.,  0., np.nan,  2., np.nan,  2.],
       [ 2.,  1.,  1., np.nan,  2.,  2.,  1.]])

>>> x
array([[ 1.,  2.,  2., nan,  2.,  2.,  1.],
       [ 2., nan,  1., nan,  2.,  2.,  1.],
       [nan,  1.,  1., nan,  2.,  2.,  1.],
       [ 1.,  2., nan, nan,  2., nan,  0.],
       [ 0.,  1.,  1., nan,  2., nan,  0.],
       [ 1.,  1.,  1., nan,  2.,  2.,  1.],
       [nan,  1.,  0., nan,  2., nan,  2.],
       [ 2.,  1.,  1., nan,  2.,  2.,  1.]])

I can remove the column in which every value is missing (column index 3):

>>> x[:, ~np.all(np.isnan(x), axis=0)]

array([[ 1.,  2.,  2.,  2.,  2.,  1.],
       [ 2., nan,  1.,  2.,  2.,  1.],
       [nan,  1.,  1.,  2.,  2.,  1.],
       [ 1.,  2., nan,  2., nan,  0.],
       [ 0.,  1.,  1.,  2., nan,  0.],
       [ 1.,  1.,  1.,  2.,  2.,  1.],
       [nan,  1.,  0.,  2., nan,  2.],
       [ 2.,  1.,  1.,  2.,  2.,  1.]])

I can also remove the columns that hold the same value in every row (column index 4):

>>> x[:, ~np.all(x[1:] == x[:-1], axis=0)]

array([[ 1.,  2.,  2., nan,  2.,  1.],
       [ 2., nan,  1., nan,  2.,  1.],
       [nan,  1.,  1., nan,  2.,  1.],
       [ 1.,  2., nan, nan, nan,  0.],
       [ 0.,  1.,  1., nan, nan,  0.],
       [ 1.,  1.,  1., nan,  2.,  1.],
       [nan,  1.,  0., nan, nan,  2.],
       [ 2.,  1.,  1., nan,  2.,  1.]])

But how do I remove column 6 (index 5), where the non-missing values are all the same but the missing values break the boolean check?

EDIT: Desired outcome

array([[ 1.,  2.,  2.,  1.],
       [ 2., nan,  1.,  1.],
       [nan,  1.,  1.,  1.],
       [ 1.,  2., nan,  0.],
       [ 0.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.],
       [nan,  1.,  0.,  2.],
       [ 2.,  1.,  1.,  1.]])

yatu yatu · Accepted Answer · 2020-10-28T13:14:08

You can combine two boolean masks with a bitwise operator:

One that marks the NaNs
One that marks the cells equal to the first-row value of their column

Combine the two masks with a bitwise OR, then drop every column in which all rows satisfy the combined condition:

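m1 = np.isnan(x)          # True where the value is missing
m2 = x[0] == x            # True where the value equals the first row of its column
x[:, ~(m1 | m2).all(0)]   # keep columns where some cell is neither NaN nor equal to row 0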

array([[ 1.,  2.,  2.,  1.],
       [ 2., nan,  1.,  1.],
       [nan,  1.,  1.,  1.],
       [ 1.,  2., nan,  0.],
       [ 0.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.],
       [nan,  1.,  0.,  2.],
       [ 2.,  1.,  1.,  1.]])
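
One caveat: comparing against x[0] relies on the first row of a column not being missing (all-NaN columns aside). If a column starts with NaN, m2 is False everywhere in that column, because NaN never compares equal to anything, so the column is kept even when its non-missing values are identical. Below is a minimal sketch of a variant that does not depend on the first row. It swaps in a column min/max comparison rather than the mask chain above, it assumes the data contains no infinities, and the name drop_uninformative_columns is made up for illustration:

```python
import numpy as np

def drop_uninformative_columns(a):
    """Drop columns that are all-NaN or whose non-NaN values are all equal.

    Hypothetical helper; assumes a 2-D float array without +/-inf values.
    """
    nan_mask = np.isnan(a)
    all_nan = nan_mask.all(axis=0)
    # Hide NaNs behind +inf / -inf so they can never be the column min / max.
    col_min = np.where(nan_mask, np.inf, a).min(axis=0)
    col_max = np.where(nan_mask, -np.inf, a).max(axis=0)
    # min == max means every non-missing value in the column is identical.
    constant = col_min == col_max
    return a[:, ~(all_nan | constant)]

# Column 0 starts with NaN and is otherwise constant; column 1 varies.
y = np.array([[np.nan, 1.],
              [3.,     2.],
              [3.,     1.]])
result = drop_uninformative_columns(y)  # only column 1 should remain
```

On the x from the question this should keep the same four columns (indices 0, 1, 2 and 6) as the mask approach.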
