
So I have a dataset tuple like this:

data = ((tag1, tag2, correlation_value), (tag1, tag3, correlation_value), ..., (tag1, tagn, correlation_value), (tag2, tag3, correlation_value), ..., (tag2, tagn, correlation_value), ..., (tagn-1, tagn, correlation_value))

I need to make a correlation matrix out of this. I already have the correlation values, as denoted above by 'correlation_value'. However, I am not finding the right technique to do so. Most of the previous questions were about calculating the correlation (Pearson etc.) from a DataFrame or a data array. Here, however, I have already calculated the correlations with a separate algorithm, and I want to put them into correlation-matrix form using pandas so that I can then visualize them.

The correlation table should look something like this:

(a symmetric matrix with the tags as both the row and column labels and the correlation values in the cells)

How can I achieve this? Converting directly to a pandas DataFrame with pd.DataFrame() and then unpivoting does not work: I am left with a lot of NaN values, because my tuple data has no entries pairing a tag with itself. For example, it has no (Tag1, Tag1, correlation_value) entry.

It also does not contain mirrored pairs like (Tag1, Tag2, correlation_value) AND (Tag2, Tag1, correlation_value); it has only (Tag1, Tag2, correlation_value).

So in the DataFrame built with pd.DataFrame, the entry at row Tag2, column Tag1 is, again, NaN.
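To make the failure concrete, here is a minimal sketch with made-up tags and values (the column names 'row', 'col' and 'corr' are just for illustration):

```python
import pandas as pd

# hypothetical sample in the shape described above
data = (('Tag1', 'Tag2', 0.3), ('Tag1', 'Tag3', 0.4), ('Tag2', 'Tag3', 0.5))

df = pd.DataFrame(data, columns=['row', 'col', 'corr'])
pivoted = df.pivot(index='row', columns='col', values='corr')
# the (Tag2, Tag2) cell is NaN, and Tag1 never appears as a column,
# because only the (Tag1, Tag2) orientation exists in the data
```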

How do I solve this?

Thank You.


1 Answer


Here's how I would do it (this might be suboptimal, since I don't know enough about the data, especially the tags):

I assume your data input looks like this (the length isn't fixed):

(('tag1', 'tag2', 0.3), ('tag1', 'tag3', 0.4), ('tag1', 'tag4', 0.5),
 ('tag1', 'tag5', 0.6), ('tag2', 'tag3', 0.5), ('tag2', 'tag4', 0.6),
 ('tag2', 'tag5', 0.7), ('tag3', 'tag4', 0.7), ('tag3', 'tag5', 0.8),
 ('tag4', 'tag5', 0.9))

Working with Numpy and Pandas:

import numpy as np
import pandas as pd

Start by collecting the tags (and set the index/columns for the DataFrame along the way; I guess this could be optimised if there's a system behind the tags).

tags = []
for t1, t2, _ in data:
    tags += [t1, t2]
tags = index = columns = sorted(set(tags))

Then build a mapping between tags and indices:

tags = {t: i for i, t in enumerate(tags)}

After that build the correlation matrix:

correlation = np.identity(len(tags))
for t1, t2, corr in data:
    correlation[tags[t1], tags[t2]] = corr
    correlation[tags[t2], tags[t1]] = corr

And finally the DataFrame:

df = pd.DataFrame(correlation, index=index, columns=columns)
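Put together, the steps above run end-to-end; here is the whole thing on a made-up three-tag sample:

```python
import numpy as np
import pandas as pd

# hypothetical sample data
data = (('tag1', 'tag2', 0.3), ('tag1', 'tag3', 0.4), ('tag2', 'tag3', 0.5))

# collect and sort the unique tags
labels = sorted({t for t1, t2, _ in data for t in (t1, t2)})
idx = {t: i for i, t in enumerate(labels)}

# the identity matrix gives the 1.0 diagonal for free
correlation = np.identity(len(labels))
for t1, t2, corr in data:
    correlation[idx[t1], idx[t2]] = corr
    correlation[idx[t2], idx[t1]] = corr

df = pd.DataFrame(correlation, index=labels, columns=labels)
```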

It worked with my sample data.
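An alternative sketch that stays in pandas (assuming, as above, that each pair appears exactly once): pivot the triples into the upper triangle, mirror it with combine_first on the transpose, and fill the remaining diagonal NaNs with 1.0:

```python
import pandas as pd

# hypothetical sample data
data = (('tag1', 'tag2', 0.3), ('tag1', 'tag3', 0.4), ('tag2', 'tag3', 0.5))

df = pd.DataFrame(data, columns=['row', 'col', 'corr'])
tags = sorted(set(df['row']).union(df['col']))

# upper triangle only; reindex so both axes carry every tag
upper = df.pivot(index='row', columns='col', values='corr')
upper = upper.reindex(index=tags, columns=tags)

# mirror the upper triangle into the lower, then 1.0 on the diagonal
full = upper.combine_first(upper.T).fillna(1.0)
```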