3
votes

I am having a nan value when trying to get left part of a string a pandas dataframe, where the left condition is depending on the lengh of the cell in another column of the dataframe :

Example of df :

Phrase Color
Paul like red red
Mike like green green
John like blue blue

My objectives is to obtain a series of the first part of the phrase => before "like {Color}". Here it would be :

|First Name|

| Paul | | Mike | | John |

i try to call the function below :

df["First  Name"] = df["Phrase"].str[:- df["Color"].str.len() - 6]

But i keep having Nan value results. It seems my length calculation of the colors can't transmit to my str[:-x] function.

Can someone help me understand what is happening here and find a solution ?

Thanks a lot. Have a nice day.

2
can you post the expected output?Vivek Kalyanarangan

2 Answers

1
votes

Consider below df:

In [128]: df = pd.DataFrame({'Phrase':['Paul like red', 'Mike like green', 'John like blue', 'Mark like black'], 'Color':['red', 'green', 'blue', 'brown']})

In [129]: df
Out[129]: 
            Phrase  Color
0    Paul like red    red
1  Mike like green  green
2   John like blue   blue
3  Mark like black  brown

Use numpy.where:

In [134]: import numpy as np

In [132]: df['First Name'] = np.where(df.apply(lambda x: x['Color'] in x['Phrase'], 1), df.Phrase.str.split().str[0], np.nan)

In [133]: df
Out[133]: 
            Phrase  Color First Name
0    Paul like red    red       Paul
1  Mike like green  green       Mike
2   John like blue   blue       John
3  Mark like black  brown        NaN
1
votes

Lets break this down and try to understand whats going on.. .str returns a pandas.Series.str (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) object and you want to slice it using a vector.

So basically you are trying to do pandas.Series.str[: <some_vector>] where <some_vector> is - df["Color"].str.len() - 6

Unfortunately, pandas offers no way to slice using a vector, check all methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

So we are restricted to using pandas.Series.str[: some_value]

Now since this some_value changes for every row, you can use the .apply method over each row as follows:

df = pd.DataFrame({
    'Phrase': ['Paul like red', 'Mike like green', 'John like blue'],
    'Color': ['red', 'green', 'blue']
})
>>>
Phrase  Color
0    Paul like red    red
1  Mike like green  green
2   John like blue   blue

def func(x):
    return x['Phrase'][:-len(x['Color'])-6]
   

df['First Name'] = df.apply(func, axis=1)

>>>


print (df)
            Phrase  Color First Name
0    Paul like red    red       Paul
1  Mike like green  green       Mike
2   John like blue   blue       John

Here I have used the same logic but passed the value as a scalar using .apply