1
votes

I am taking a column from a csv file and inputting the data from it into an array using pandas. However, many of the cells are empty and get saved in the array as 'nan'. I want to either identify the empty cells so I can skip them or remove them all from the array after. Something like the following pseudo-code:

if df.row(column number) == nan
    skip

or

if df.row(column number) != nan
    do stuff

Basically how do I identify if a cell from the csv file is empty.

3

3 Answers

0
votes

Best is to get rid of the NaN rows after you load it, by indexing:

df = df[df['column_to_check'].notnull()]

For example to get rid of NaN values found in column 3 in the following dataframe:

>>> df
     0    1    2    3    4
0  1.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
2  NaN  NaN  NaN  NaN  NaN
3  NaN  1.0  1.0  NaN  NaN
4  1.0  NaN  NaN  1.0  1.0

>>> df[df[3].notnull()]
     0    1    2    3    4
0  1.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
4  1.0  NaN  NaN  1.0  1.0
0
votes

pd.isnull() and pd.notnull() are standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column as you suggest in your code above. You could then use this expression to do whatever you like with that value.

Example:

import pandas as pd

import numpy as np

a = np.nan

pd.isnull(a)
Out[4]: True

pd.notnull(a)
Out[5]: False

If you want to manipulate all (or certain) NaN values from a DataFrame, handling missing data is a big topic when working with tabular data and there are many methods of doing so. I'd recommend chapter 7 from this book. Here are its contents:

enter image description here

The first section would be most pertinent to your question.

0
votes

If you just want to exclude missing values, you can use pd.DataFrame.dropna()

Below is an example based on the one describes by @sacul:

>>> import pandas as pd

>>> df

     0    1    2    3    4
0  0.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
2  NaN  NaN  NaN  NaN  NaN
3  NaN  1.0  1.0  NaN  NaN
4  1.0  NaN  NaN  1.0  1.0

>>> df.dropna(axis=0, subset=['3'])

     0    1    2    3    4
0  0.0  1.0  NaN  1.0  1.0
1  1.0  NaN  1.0  1.0  1.0
4  1.0  NaN  NaN  1.0  1.0
  • axis=0 indicates that rows containing NaN are excluded.
  • subset=['3'] indicate to only consider columns "3".

See the link above for details.