Filter one data frame based on other data frame in pandas

Question

I have two DataFrames in pandas:

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})

print(df1)

  Name  start  stop
0    A     50    60
1    B    124   200
2    C      1    19
3    C    159   200
4    D     12    24
5    D     26    30
6    E    110   160

print(df2)

  Name  start  stop
0    B    126   129
1    C    143   220
2    D     19    27
3    E    159   200

I want to filter df1 to remove rows based on df2 using the following criteria:

Name should be present in both df1 and df2
The range from start to stop for a Name overlaps with the range from start to stop for that Name in the other DataFrame

This would give:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Where:

A has been dropped as there is no A in df2
B is kept as the start and stop of B in df2 are nested in those of B in df1
One of the C's of df1 has been dropped as its values didn't overlap with df2, whereas the other was kept as it is nested in the start and stop range of C in df2
Both D's are kept as both have an overlap with the range of D in df2
E is kept as its range overlaps with E in df2

Any help would be greatly appreciated!

Andrei Odegov Andrei Odegov · Accepted Answer · 2019-11-08T20:05:28

To solve your problem, I applied an SQL-like way that mimics the following query:

SELECT
  df.Name, df.start_x AS start, df.stop_x AS stop
FROM (
  SELECT
    df1.Name, df1.start AS start_x, df1.stop AS stop_x,
              df2.start AS start_y, df2.stop AS stop_y
    FROM df1
    INNER JOIN df2
      ON df1.Name = df2.Name
) AS df
WHERE (df.stop_y >= df.start_x) AND (df.stop_x >= df.start_y)

This query has been converted to the following code fragment that uses the pandas.merge method. Note that you must use parentheses in the expression (df.stop_y> = df.start_x) & (df.stop_x> = df.start_y). Without them, the code throws the exception

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})
df = pd.merge(df1, df2, on=['Name'])
df = df[(df.stop_y >= df.start_x) & (df.stop_x >= df.start_y)]
df.rename(columns={'start_x':'start', 'stop_x':'stop'}, inplace=True)
df.drop(['start_y', 'stop_y'], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)

Output:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Demo on Repl.it.

Filter one data frame based on other data frame in pandas

2 Answers