3
votes

I'm trying to compute a correlation in pandas and it's giving me a bit of difficulty. Essentially I want to answer the following question: given a dataframe of sentences and scores, which word correlates best with a higher score? Which correlates worst?

Trivial example:

Sentence      | Score
"hello there" | 100
"hello kid"   | 95
"there kid"   | 5

I'm expecting to see a high correlation value here for the word "hello" and score. Hopefully this makes sense -- if this is possible natively in Pandas I'd really appreciate knowing!

If anything is unclear please let me know.

2 Answers

5
votes

I'm not sure that pandas is what you're looking for, but yes, you can:

import pandas as pd

df = pd.DataFrame([ ["hello there", 100],
                    ["hello kid",   95],
                    ["there kid",   5]
                  ], columns = ['Sentence','Score'])

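# Build one 0/1 indicator column per word, then correlate each column with the score.
# Dividing by the max only rescales Score; it does not change the correlation.
s_corr = df.Sentence.str.get_dummies(sep=' ').corrwith(df.Score/df.Score.max())
print(s_corr)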

This prints:

hello    0.998906
kid     -0.539949
there   -0.458957

For details, see the pandas documentation for:

  1. str.get_dummies()
  2. corrwith()
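
To answer the "best" and "worst" parts of the question directly, the resulting Series can be sorted or queried with idxmax() and idxmin(). A minimal sketch on the same example data; the names dummies, best_word and worst_word are only illustrative, and the division by the maximum score is dropped here because it does not affect a Pearson correlation:

```python
import pandas as pd

df = pd.DataFrame([["hello there", 100],
                   ["hello kid", 95],
                   ["there kid", 5]],
                  columns=["Sentence", "Score"])

# One 0/1 indicator column per word
dummies = df["Sentence"].str.get_dummies(sep=" ")

# Pearson correlation of each indicator column with the score
s_corr = dummies.corrwith(df["Score"])

# Words with the highest and lowest correlation
best_word = s_corr.idxmax()
worst_word = s_corr.idxmin()

print(s_corr.sort_values(ascending=False))
print("best:", best_word, "worst:", worst_word)
```

With only three rows the coefficients are very noisy, so on real data it is probably worth ignoring words that occur in just a handful of sentences.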

1
votes

Here's one way. For each word, take the average score of the sentences it appears in. For example, "hello" receives 97.5 [(100 + 95) / 2], "there" receives 52.5 [(100 + 5) / 2], etc. Note that this gives a mean score per word rather than a correlation coefficient.

from collections import defaultdict
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'Score': {0: 100, 1: 95, 2: 5},
                             'Sentence': {0: 'hello there', 1: 'hello kid', 2: 'there kid'}})

df['WordList'] = df['Sentence'].str.split(' ')

d = defaultdict(list)

for idx, row in df.iterrows():
    for word in row['WordList']:
        d[word].append(row['Score'])

# Replace each word's list of scores with its mean
d = {k: np.mean(v) for k, v in d.items()}

Result:

{'hello': 97.5, 'there': 52.5, 'kid': 50.0}
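
The same per-word averages can probably be computed without an explicit loop. A sketch assuming pandas 0.25 or later (for explode); the column name Word and the variables words and mean_scores are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Sentence": ["hello there", "hello kid", "there kid"],
                   "Score": [100, 95, 5]})

# Split each sentence into a list of words, then give every word its own row
words = df.assign(Word=df["Sentence"].str.split()).explode("Word")

# Average score of the rows (sentences) in which each word occurs
mean_scores = words.groupby("Word")["Score"].mean().sort_values(ascending=False)
print(mean_scores)
```

As with the loop version, a word that appears twice in one sentence contributes that sentence's score twice.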