I'm learning numpy and pandas, so please forgive me if I'm not using the right terminology or I ask something silly.
I've spent quite some time searching online for an answer to my question, with no luck...
I have a few .CSV files containing data in the same domain.
Each file has a column which works as a primary key for each row.
The name of the primary key column might be different in each file.
Each file may contain primary keys that the other files lack, and vice versa.
It's safe to assume that each file contains most of the primary keys, so some np.NAN will have to be introduced.
I'm trying to load and combine together all the data in a single dataset, possibly removing "duplicated" columns. This is how I do the first part (please let me know if there are better ways of doing it):
import pandas as pd

FILENAME_KEY_TABLE = {
    'file_001.csv': 'id',
    'file_002.csv': 'ident',
    'file_003.csv': 'cid',
    'file_004.csv': 'id',
}

dataset, previous = pd.DataFrame(), None
for filename, key in FILENAME_KEY_TABLE.items():
    resource = pd.read_csv(filename)
    if previous is None:
        # First file: nothing to merge with yet.
        dataset = resource
    else:
        dataset = pd.merge(dataset, resource, left_on=previous,
                           right_on=key, how='outer')
    previous = key
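For what it's worth, I also considered a variant that renames every primary key column to a single shared name first and then folds the files together, though I'm not sure it is any better (a sketch using the same FILENAME_KEY_TABLE as above, with 'key' as a made-up shared column name):

from functools import reduce

frames = [
    pd.read_csv(filename).rename(columns={key: 'key'})
    for filename, key in FILENAME_KEY_TABLE.items()
]
dataset = reduce(
    lambda left, right: pd.merge(left, right, on='key', how='outer'),
    frames,
)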
The resulting dataset certainly contains duplicated columns: for instance, the code above keeps a copy of every primary key column. Moreover, some columns are duplicated (under the same name or a different one) across files.
I'm using the following code to identify duplicated columns in the dataset (again, if there is a better way of doing this, please let me know). The output is a dictionary whose keys are the names of the duplicated columns and whose values are the sets of names of the original columns they duplicate.
from typing import Dict, Set

def get_duplicates(df: pd.DataFrame) -> Dict[str, Set[str]]:
    # Map each duplicated column name to the set of earlier columns
    # it is an exact copy of.
    result = {}
    for x in range(df.shape[1]):
        col = df.iloc[:, x]
        original = df.columns.values[x]
        for y in range(x + 1, df.shape[1]):
            other = df.iloc[:, y]
            if col.equals(other):
                duplicate = df.columns.values[y]
                result.setdefault(duplicate, set()).add(original)
    return result
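For example, on a toy frame (column names made up for illustration) it behaves as I expect:

get_duplicates(pd.DataFrame({'a': [1, 2], 'b': [1, 2], 'c': [3, 4]}))
# {'b': {'a'}}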
Unfortunately, the code above compares columns as a whole, so it never treats a np.NAN introduced by the merge as a potential match, which is what I want. I naively tried to replace:

if col.equals(other):
    ...

with:

if col.equals(other) or col.isnull() or other.isnull():
    ...

which fails, because col.isnull() returns a whole Series and a Series has no single truth value, so the or raises "The truth value of a Series is ambiguous". How should I change this check to work as intended?
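My best guess is that the comparison needs to be elementwise, something along these lines (an untested sketch; unifies is just a name I made up):

def unifies(col: pd.Series, other: pd.Series) -> bool:
    # Two values unify when they are equal or at least one is NaN.
    matches = (col == other) | col.isnull() | other.isnull()
    return bool(matches.all())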
For instance, columns 'A' and 'B' in the following example actually unify:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', np.NAN, np.NAN],
    'B': ['x', np.NAN, 'x', np.NAN],
}, index=[0, 1, 2, 3])
and the unified column would be:
pd.DataFrame({
'Unified': ['x', 'x', 'x', np.NAN],
}, index=[0, 1, 2, 3])
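(With the unifies sketch above, unifies(df['A'], df['B']) should indeed return True.)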
Ultimately, I'd like to implement something like unification in Prolog.
Two columns unify if, at each row, the two values match or at least one of them is np.NAN (np.NAN behaves like an unbound variable).
If two columns unify, I'd like to keep the column with the more descriptive title (the longer one), merge its content with the values from the other column (replacing each np.NAN with the corresponding value from the other column), and finally drop the other column. Is there a function or package that does this already?
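The closest I can get by hand looks like this (a sketch; I believe Series.combine_first fills the caller's NaN values from the other Series, which is exactly the merging step I describe):

def unify(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
    # Keep the longer (more descriptive) title, fill its NaNs from the
    # other column, then drop the other column.
    keep, drop = (a, b) if len(a) >= len(b) else (b, a)
    df[keep] = df[keep].combine_first(df[drop])
    return df.drop(columns=[drop])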
Last but not least, I've noticed that some columns in my data are duplicates, but they are not identified as such because they contain floats stored with different levels of precision (e.g. 3 vs. 6 decimal digits). Is there any way to tackle this or similar cases? Again, is there any package or function that might help?
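For the numeric case I imagine the comparison needs a tolerance, roughly like this (another untested sketch, assuming both columns are numeric and that decimals=3 matches the coarser precision in my data):

import numpy as np

def unifies_approx(col: pd.Series, other: pd.Series, decimals: int = 3) -> bool:
    # NaN still acts as a wildcard; other values must agree to within
    # half a unit of the last shared decimal digit.
    wildcard = col.isnull() | other.isnull()
    close = np.isclose(col, other, atol=0.5 * 10 ** -decimals)
    return bool((wildcard | close).all())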
Thanks in advance!