1
votes

I have a pandas dataframe with 3 columns where:

  • Category dtype - string
  • Date dtype - datetime
  • Values dtype - float

    df = pd.DataFrame()
    df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
    df['date'] = ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05', '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03', '2018-01-04','2018-01-01']
    df['values'] = [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1]
    df
    

Dataframe view

I want to filter for rows that have a positive value and a negative value (with the least difference) close to that date per category.

So, essentially an output that looks like:

df = pd.DataFrame()
df['category'] = ['a', 'a','b', 'b', 'c', 'c']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-01', '2018-01-03']
df['values'] = [1, -1.1, 2, -1.5, 5, -5.2]
df

Filtered Dataframe

I have looked at similar queries on SO (Identifying closest value in a column for each filter using Pandas, How do I find the closest values in a Pandas series to an input number?)

The first one utilises idxmin, which returns first occurence, not the closest in value.

The second link is speaking about a specific value as an input - I don't think a pure np.argsort works in my case.

I can imagine using a complex web of if statements to do this, but, I'm not sure what the most efficient way of doing this is with pandas.

Any guidance would be greatly appreciated.

1
Welcome to StackOverflow. Please take the time to read this post on how to provide a great pandas example as well as how to provide a minimal, complete, and verifiable example and revise your question accordingly. These tips on how to ask a good question may also be useful.jezrael
Thanks @jezrael, I have updated my query.DaytaSigntist

1 Answers

0
votes

IIUC, sort your dataframe first then use idxmin:

df1 = df.sort_values(['category','date'])
df1[df1.groupby('category')['values']\
       .transform(lambda x: x.index.isin([x.ge(0).idxmin(), x.lt(0).idxmin()]))]

Output:

  category        date  values
0        a  2018-01-01     1.0
9        a  2018-01-01    -1.1
1        b  2018-01-01     2.0
2        b  2018-01-03    -1.5
4        c  2018-01-01     5.0
7        c  2018-01-03    -5.2