python - Find symmetric pairs quickly in numpy

Question

from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?

An example of a symmetric pair: '(0, 1)' is treated as equal to '(1, 0)'. The latter should be removed.

The algorithm must be fast, so NumPy is recommended. Converting to Python objects is not allowed.

Could you give an example of what you understand by symmetric pairs? — yatu
@JerryM. Yes, but it is trivial to remove with df.drop_duplicates() — The Unfun Cat
@molybdenum42 I use itertools product to create an example, the data themselves are not created with itertools product. — The Unfun Cat

Quang Hoang · Accepted Answer · 2019-10-28T14:28:37

You can sort the values within each row, then group by the sorted pair:

import numpy as np

a = np.sort(df.to_numpy(), axis=1)  # sort within each row: (1, 0) becomes (0, 1)
df.groupby([a[:, 0], a[:, 1]], as_index=False, sort=False).first()
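
To see why the row-wise sort makes symmetric pairs comparable, here is a minimal sketch on a hypothetical four-row array (the values are made up for illustration and are not the question's data):

```python
import numpy as np

# Hypothetical toy pairs: (0, 1) and (1, 0) are symmetric duplicates
pairs = np.array([[0, 1], [1, 0], [1, 1], [2, 0]])

# axis=1 sorts inside each row, so both orderings collapse to one key
keys = np.sort(pairs, axis=1)
print(keys)
```

After the sort, rows 0 and 1 carry the same key, so any "keep first per key" operation drops the later one.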

Option 2: If you have many distinct (c1, c2) pairs, groupby can be slow. In that case, assign the sorted values as new columns and filter with drop_duplicates:

a = np.sort(df.to_numpy(), axis=1)

(df.assign(one=a[:, 0], two=a[:, 1])   # helper column names; any unused names work
   .drop_duplicates(['one', 'two'])    # keep the first row of each sorted pair
   .reindex(df.columns, axis=1)        # drop the helper columns again
)
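
As a possible variant that stays in NumPy for the deduplication step, the sorted keys can be passed to np.unique with axis=0 and return_index=True. This is only a sketch, not part of the answer above: the helper name drop_symmetric_pairs and the toy frame are hypothetical, and it assumes both columns share one numeric dtype. No claim is made about its speed relative to the two options above.

```python
import numpy as np
import pandas as pd

def drop_symmetric_pairs(df):
    """Keep the first occurrence of each unordered pair (hypothetical helper)."""
    # Sort within each row so (1, 0) and (0, 1) produce the same key
    keys = np.sort(df.to_numpy(), axis=1)
    # axis=0 treats every row as one item; return_index gives the
    # position of the first occurrence of each unique row
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    # Restore the original row order before selecting
    return df.iloc[np.sort(first_idx)]

# Hypothetical toy frame for illustration
toy = pd.DataFrame({"c1": [0, 1, 1, 2], "c2": [1, 0, 1, 0]})
deduped = drop_symmetric_pairs(toy)
print(deduped)
```

The original index is preserved, so the removed rows can be identified with toy.index.difference(deduped.index).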

6 Answers

`frozenset`