0
votes

Hi I have a datframe that look like this

        Date     Col1         Col2         Col3       col4 
0     1-4-2020    a        ed kv sc        sf ds       n
1     1-4-2020    b        hed sf ed       df hed      c
2     1-4-2020    c        df xz sf        sf ds       n
3     2-4-2020    a        ed df hed       sf ds       n
4     2-4-2020    b        gv bb sc        s ds        c
5     2-4-2020    c        ed sf sc        hed ds      n

What I want to do is to filter the rows with multiple conditions. If:

  • Tow rows have the same value in Col1. For example a == a
  • And they have the same values in Col4. n==n
  • If both conditions are true search Col2 and Col3 for the key strings "hed" and "sf". they can be in Col2 or Col3(or both)

So the result will look like this:

       Date     Col1         Col2         Col3       col4 
1     1-4-2020    b        hed sf ed      df hed       c
3     2-4-2020    a        ed df hed      sf djs       n
5     2-4-2020    c        ed sf ssc      hed djs      n

I hope it is clear Update

In case Col4

        Date     Col1         Col2         Col3       col4 
0     1-4-2020    a        ed kv sc        sf ds       n
1     1-4-2020    b        hed sf ed       df hed      c
2     1-4-2020    c        df xz sf        sf ds       n
3     1-4-2020    d        df xdg sf       sf hed      y 
4     2-4-2020    a        ed df hed       sf ds       n
5     2-4-2020    b        gv bb sc        s ds        c
6     2-4-2020    c        ed sf sc        hed ds      n
7     2-4-2020    d        df xgfgf       gf gfd       null

In case that the value in Col4 appears only once in the dataframe like in row 3 (hint no next row to compare with) the function should return the row.

        Date     Col1         Col2         Col3       col4 
1     1-4-2020    b        hed sf ed       df hed      c
3     1-4-2020    d        df xdg sf       sf hed      y 
4     2-4-2020    a        ed df hed       sf ds       n
6     2-4-2020    c        ed sf sc        hed ds      n
1

1 Answers

1
votes

First get counts of rows by same Col1 and Col4 by GroupBy.transform with GroupBy.size and join column Col2 with Col3 toegether:

count = df.groupby(['Col1','col4'])['col4'].transform('size')
s = (df['Col2'] + ' ' + df['Col3'])

You can verify output:

print (df.assign(count=count, both=s))
       Date Col1       Col2    Col3 col4  count              both
0  1-4-2020    a   ed kv sc   sf ds    n      2    ed kv sc sf ds
1  1-4-2020    b  hed sf ed  df hed    c      2  hed sf ed df hed
2  1-4-2020    c   df xz sf   sf ds    n      2    df xz sf sf ds
3  2-4-2020    a  ed df hed   sf ds    n      2   ed df hed sf ds
4  2-4-2020    b   gv bb sc    s ds    c      2     gv bb sc s ds
5  2-4-2020    c   ed gf sc  hed ds    n      2   ed gf sc hed ds

Last filter by boolean indexing if counts are 2 by Series.eq, then Series.str.contains for test substrings and chained mask together by & for bitwise AND:

df = df[count.eq(2) & s.str.contains('hed') & s.str.contains('sf')]
print (df)
       Date Col1       Col2    Col3 col4
1  1-4-2020    b  hed sf ed  df hed    c
3  2-4-2020    a  ed df hed   sf ds    n