1
votes

I have a dataframe as described below and I need to find out the duplicate groups based on the columns - value1,value2 & value3 (groups should be grouped by id). I need to fill column 'duplicated' with true if the group appears elsewhere in the table,if group is unique fill with false.

note: each group has different id.

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'duplicated' : [] 
                   })

enter image description here

expected result is:

enter image description here

I tried this, but if is comparing rows, I need to compare groups (grouped by id)

import pandas as pd
data = pd.read_excel('C:/Users/path/Desktop/example.xlsx')

# False : Mark all duplicates as True.
data['duplicates'] = data.duplicated(subset= ["value1","value2","value3"], keep=False)

data.to_excel('C:/Users/path/Desktop/example_result.xlsx',index=False)

and I got:

enter image description here

note: the order of the records in the both groups doesnt matter

2
Does the order maters? for example, if in the group C, the lines with the 1 and the 2 are exchanged position, would it still be "duplicates"? - Ben.T
@Ben.T order of rows in group doesnt matter - ncica
@ncica that makes it a bit more complicated. - ywbaek
@DontKnowMuchButGettingBetter, sorry it was my mistake , the code was correct nut I didnt notice that I write that part of code inside of a wrong function, tnx for hint :) - ncica
@ncica: you're welcome, but in the future, please notify us. We're volunteers that help. Note that if you did in fact comment back and then deleted the question, I'm not going to be able to see it given my low reputation. - DontKnowMuchBut Getting Better

2 Answers

2
votes

This may not be very efficient but it works if duplicated groups have the same "order".

import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
                   'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
                   'duplicated': [False] * 13
                   })


def check_dup(df, col1, col2):
    # Checks if two groups are duplicates.
    # First checks the sizes, if they are equal then checks actual values.

    df1 = df[df['id'] == col1][['value1', 'value2', 'value3']]
    df2 = df[df['id'] == col2][['value1', 'value2', 'value3']]
    if df1.size != df2.size:
        return False
    return (df1.values == df2.values).all()


id_unique = set(df['id'].values)  # set of unique ids
id_dic = dict.fromkeys(id_unique, False)  # dict for "duplicated" value for each id
for id1 in id_unique:
    for id2 in id_unique - {id1}:
        if check_dup(df, id1, id2):
            id_dic[id1] = True
            break

# Update 'duplicated' column on df
for id_ in id_dic:
    df.loc[df['id'] == id_, 'duplicated'] = id_dic[id_]

print(df)
   id value1 value2 value3  duplicated
0   A      1      1      1        True
1   A      2      2      2        True
2   A      3      3      3        True
3   A      4      4      4        True
4   B      1      1      1       False
5   B      2      2      2       False
6   C      1      1      1        True
7   C      2      2      2        True
8   C      3      3      3        True
9   C      4      4      4        True
10  D      1      1      1       False
11  D      2      2      2       False
12  D      3      3      3       False
1
votes

You can do it like this

First sort_values just in case, set_index the id and stack to change the shape of your data and get a single column with to_frame

df_ = (df.sort_values(by=["value1","value2","value3"])
         .set_index('id')[["value1","value2","value3"]]
         .stack()
         .to_frame()
      )

Second, you can append an set_index with a cumcount per id, drop the level of index with the name of the original column (Value1 ...), unstack to get one row per id, fillna with a random value and use duplicated.

s_dup = df_.set_index([df_.groupby('id').cumcount()], append=True)\
           .reset_index(level=1, drop=True)[0]\
           .unstack()\
           .fillna(0)\
           .duplicated(keep=False)
print (s_dup)
id
A     True
B    False
C     True
D    False
dtype: bool

Now you can just map to the original dataframe:

df['dup'] = df['id'].map(s_dup)
print (df)
   id value1 value2 value3    dup
0   A      1      1      1   True
1   A      2      2      2   True
2   A      3      3      3   True
3   A      4      4      4   True
4   B      1      1      1  False
5   B      2      2      2  False
6   C      2      2      2   True
7   C      1      1      1   True
8   C      3      3      3   True
9   C      4      4      4   True
10  D      1      1      1  False
11  D      2      2      2  False
12  D      3      3      3  False