I'm computing Spearman correlation coefficients for interviewers. It works for Interviewer_1, but I don't understand why SciPy interprets Interviewer_2 as having no correlation (0/NaN).

import pandas as pd
from pandas import DataFrame
import scipy.stats


df = pd.DataFrame({'Interviewer': ['Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2'],
                    'Score_1': [-1,-1,-1,1,1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,1,-1],
                    'Score_2': [1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]
                    })

df

Sample Data Yields:

    Interviewer Score_1 Score_2
0   Interviewer_1   -1  1
1   Interviewer_1   -1  -1
2   Interviewer_1   -1  -1
3   Interviewer_1   1   -1
4   Interviewer_1   1   1
5   Interviewer_1   -1  1
6   Interviewer_1   -1  -1
7   Interviewer_1   -1  -1
8   Interviewer_1   1   -1
9   Interviewer_1   1   -1
10  Interviewer_2   -1  -1
11  Interviewer_2   -1  -1
12  Interviewer_2   -1  -1
13  Interviewer_2   -1  -1
14  Interviewer_2   -1  -1
15  Interviewer_2   -1  -1
16  Interviewer_2   -1  -1
17  Interviewer_2   -1  -1
18  Interviewer_2   -1  -1
19  Interviewer_2   -1  -1
20  Interviewer_2   -1  -1
21  Interviewer_2   -1  -1
22  Interviewer_2   -1  -1
23  Interviewer_2   1   -1
24  Interviewer_2   -1  -1
25  Interviewer_2   -1  -1
26  Interviewer_2   -1  -1
27  Interviewer_2   -1  -1
28  Interviewer_2   1   -1
29  Interviewer_2   -1  -1

df.groupby('Interviewer').sum()

Yields the Sum:

           Score_1  Score_2
Interviewer     
Interviewer_1   -2  -4
Interviewer_2   -16 -20

Using Scipy:

def applyspearman(row):
    row['Cor'] = scipy.stats.spearmanr(row['Score_1'], row['Score_2'])[0]
    return row

df = df.groupby('Interviewer').apply(applyspearman)

df
    Interviewer Score_1 Score_2 Cor
0   Interviewer_1   -1  1   -0.089087081
1   Interviewer_1   -1  -1  -0.089087081
2   Interviewer_1   -1  -1  -0.089087081
3   Interviewer_1   1   -1  -0.089087081
4   Interviewer_1   1   1   -0.089087081
5   Interviewer_1   -1  1   -0.089087081
6   Interviewer_1   -1  -1  -0.089087081
7   Interviewer_1   -1  -1  -0.089087081
8   Interviewer_1   1   -1  -0.089087081
9   Interviewer_1   1   -1  -0.089087081
10  Interviewer_2   -1  -1  
11  Interviewer_2   -1  -1  
12  Interviewer_2   -1  -1  
13  Interviewer_2   -1  -1  
14  Interviewer_2   -1  -1  
15  Interviewer_2   -1  -1  
16  Interviewer_2   -1  -1  
17  Interviewer_2   -1  -1  
18  Interviewer_2   -1  -1  
19  Interviewer_2   -1  -1  
20  Interviewer_2   -1  -1  
21  Interviewer_2   -1  -1  
22  Interviewer_2   -1  -1  
23  Interviewer_2   1   -1  
24  Interviewer_2   -1  -1  
25  Interviewer_2   -1  -1  
26  Interviewer_2   -1  -1  
27  Interviewer_2   -1  -1  
28  Interviewer_2   1   -1  
29  Interviewer_2   -1  -1

I tried using this formula by hand in Excel (rank functions, absolute difference, d^2, and the sum of d^2), and got different results for both interviewers: $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$

Interviewer_1, ρ = 0.878788

Interviewer_2, ρ = 0.993985

Questions:

  1. Why is Interviewer_2 null? Is the NaN issue related to rank ties?
  2. Why do SciPy's results differ from my results by hand?
Definitely a bit odd since "Changes in scipy 0.8.0: rewrite to add tie-handling, and axis." (source) - Brad Solomon
Side note: I think you want to return the result of scipy.stats.spearmanr in applyspearman rather than assigning it to every row as an additional column in the grouped dataframe? Spearman rank correlation is meant to be a summary statistic, not a per-row measure. - Simon Bowly
I am using it as a summary statistic by interviewer. That is why I apply a groupby. Later in my analysis, I flatten my dataframe to one row per interviewer so I can then plot total interviews against the spearmanr score. - Christopher
@SimonBowly The function as he has it is correct. .apply applies the function to each grouped sub-DataFrame, so row here is really a DataFrame. See here for some detail. - Brad Solomon
Got it. I was thinking that there are only 2 groups, so you get 2 Spearman values, but you copy those values over each row in the group. If you flatten later, it makes total sense. Thanks. - Simon Bowly

1 Answer

I'm not sure exactly what's happening in the SciPy source, but you can define your own function with pandas' Series.rank(method='dense'), and this seems to clear things up:

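import numpy as np

def spearmanr(x, y):
    """Shortcut Spearman formula; `x`, `y` --> pd.Series of equal length."""
    assert x.shape == y.shape
    # Dense ranks: tied values share a rank and no rank numbers are skipped
    rx = x.rank(method='dense')
    ry = y.rank(method='dense')
    d = rx - ry
    dsq = np.sum(np.square(d))
    n = x.shape[0]
    coef = 1. - (6. * dsq) / (n * (n**2 - 1.))
    return coef

# `df` is the DataFrame from the question
grouped = df.groupby('Interviewer')
grouped.apply(lambda frame: spearmanr(frame['Score_1'], frame['Score_2']))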
Interviewer_1    0.970
Interviewer_2    0.998
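
On the NaN itself, a likely explanation: Spearman's coefficient is defined as the Pearson correlation of the ranks, with tied values given their average rank. Interviewer_2's Score_2 is constant, so every rank is identical, the rank variance is zero, and the ratio becomes 0/0. The shortcut formula with d^2 is only exact when all ranks are distinct, which is probably why hand calculations on heavily tied data like this disagree with SciPy. Below is a minimal pure-Python sketch of the rank-based definition. The helper names `average_ranks` and `spearman_via_ranks` are made up for illustration, and this is not SciPy's actual implementation.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values get the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_via_ranks(x, y):
    """Pearson correlation of the average ranks; NaN if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # 0/0: correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

# Data copied from the question
i1_score_1 = [-1, -1, -1, 1, 1, -1, -1, -1, 1, 1]
i1_score_2 = [1, -1, -1, -1, 1, 1, -1, -1, -1, -1]
i2_score_1 = [-1] * 13 + [1] + [-1] * 4 + [1] + [-1]
i2_score_2 = [-1] * 20  # constant, so all ranks are tied

r1 = spearman_via_ranks(i1_score_1, i1_score_2)
r2 = spearman_via_ranks(i2_score_1, i2_score_2)
print(r1)  # approximately -0.0891, matching the Cor column in the question
print(r2)  # nan
```

Under this reading, NaN is the honest answer for Interviewer_2: with no variation in Score_2 there is nothing to correlate against. The dense-rank version above returns a number for that group only because the shortcut formula never divides by the rank variance.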