1
votes

I have two DataFrames in pandas:

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})

print(df1)

  Name  start  stop
0    A     50    60
1    B    124   200
2    C      1    19
3    C    159   200
4    D     12    24
5    D     26    30
6    E    110   160

print(df2)

  Name  start  stop
0    B    126   129
1    C    143   220
2    D     19    27
3    E    159   200

I want to filter df1 to remove rows based on df2 using the following criteria:

  1. Name should be present in both df1 and df2
  2. The range from start to stop for a Name overlaps with the range from start to stop for that Name in the other DataFrame

This would give:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Where:

  • A has been dropped as there is no A in df2
  • B is kept as the start and stop of B in df2 are nested in those of B in df1
  • One of the C's of df1 has been dropped as its values didn't overlap with df2, whereas the other was kept as it is nested in the start and stop range of C in df2
  • Both D's are kept as both have an overlap with the range of D in df2
  • E is kept as its range overlaps with E in df2

Any help would be greatly appreciated!

2

2 Answers

1
votes

To solve your problem, I applied an SQL-like way that mimics the following query:

SELECT
  df.Name, df.start_x AS start, df.stop_x AS stop
FROM (
  SELECT
    df1.Name, df1.start AS start_x, df1.stop AS stop_x,
              df2.start AS start_y, df2.stop AS stop_y
    FROM df1
    INNER JOIN df2
      ON df1.Name = df2.Name
) AS df
WHERE (df.stop_y >= df.start_x) AND (df.stop_x >= df.start_y)

This query has been converted to the following code fragment that uses the pandas.merge method. Note that you must use parentheses in the expression (df.stop_y> = df.start_x) & (df.stop_x> = df.start_y). Without them, the code throws the exception

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

import pandas as pd

df1 = pd.DataFrame({'Name': ["A", "B", "C", "C","D","D","E"],
                   'start': [50, 124, 1, 159, 12, 26,110],
                   'stop': [60, 200, 19, 200, 24, 30,160]})
df2 = pd.DataFrame({'Name': ["B", "C","D","E"],
                   'start': [126, 143, 19, 159],
                   'stop': [129, 220, 27, 200]})
df = pd.merge(df1, df2, on=['Name'])
df = df[(df.stop_y >= df.start_x) & (df.stop_x >= df.start_y)]
df.rename(columns={'start_x':'start', 'stop_x':'stop'}, inplace=True)
df.drop(['start_y', 'stop_y'], axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)

Output:

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Demo on Repl.it.

0
votes

For anyone who is interested, I figured out a way to do it...

df3=[]
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1["Name"] == row2["Name"]:
            x = range(row1["start"],row1["stop"])
            x = set(x)
            y = range(row2["start"],row2["stop"])
            if len(x.intersection(y)) > 0:
                df3.append(row1)
df3 = pd.DataFrame(df3).reset_index(drop=True)
print(df3)

  Name  start  stop
0    B    124   200
1    C    159   200
2    D     12    24
3    D     26    30
4    E    110   160

Gets the job done albeit a bit clumsy.

Would be interested if anyone can suggest a less messy way!