Comparing the columns of two data frames?

Question

I have two data frames as follows:

  df1:
            id,   f1,   f2,..., f800
            0,     5,  5.6,..,  3,7
            1,   2.4,  1.6,..,  1,7
            2,     3,  2.3,..,  4,4
            ....
            n,   4.7,  9,3,..., 8,2

 df2:
            id,   v1,   v2,..., v200
            0,     5,  5.6,..,  5,7
            1,   2.4,  1.6,..,  6,7
            2,     3,  2.3,..,  4,2
            ....
            n,   4.7,  9,3,..., 3,1

The df1 consists of 800 features and df2 includes only 200 features. The second data frame (df2) is a part of the first data frame (df1). Now, I want to find the position of the columns (in df1) which includes df2 columns/variables. Here the values of the columns should be similar, not the name of the columns. Taking into account above example, my desired output should be either "f1 and f2" or columns [0, 1] from df1.
Any idea to handle the problem?

if I understand it right the columns have different name but same values? are the columns consecutive or random positions? if consecutive you only need to select first 200 columns from df1 df1[:,:200] — Ruli
The name of the columns are different (as you can see in the above example), but the values are similar in some of the columns. SO I am looking for the index of the column which have similar values. — Spedo
I have added some more detail in my question. Yes either the header of columns from df1 (f1 and f2 in df1 are similar to v1 and v2 in df2) or the columns' number from df1. — Spedo

Ruli Ruli · Accepted Answer · 2020-10-01T10:17:57

I would concat the two dataframes so I am sure only same indexes are present

result = pd.concat([df1, df2], axis=1, join='inner')

then you can use this code:

import pandas as pd 
  
def getDuplicateColumns(df): 
    duplicateColumnNames = set() 
    
    for x in range(df.shape[1]-200): 
        col = df.iloc[:, x] 
          
        for y in range(df.shape[1]-200, df.shape[1]):  
            otherCol = df.iloc[:, y] 
            #if the columns are equal mark it down  
            if col.equals(otherCol): 
                duplicateColumnNames.add(df.columns.values[y]) 
                #here you can mark down both names, so you map them
    return list(duplicateColumnNames) 

cols = getDuplicateColumns(result)

and then you can do whatever you need with the selected columns returned, i.e. drop the redundant cols. 200 is the expected number of cols in your second df, you can instead send this as param. If you are sure each col in df1 has only 1 match in df2 you can as well break the inner loop after finding a match.

Comparing the columns of two data frames?

3 Answers