I'm computing Spearman correlation coefficients for interviewers. It works for Interviewer_1, but I don't understand why SciPy interprets Interviewer_2 as having no correlation (0/NaN).
import pandas as pd
from pandas import DataFrame
import scipy.stats
df = pd.DataFrame({'Interviewer': ['Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2'],
'Score_1': [-1,-1,-1,1,1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,1,-1],
'Score_2': [1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]
})
df
Sample Data Yields:
Interviewer Score_1 Score_2
0 Interviewer_1 -1 1
1 Interviewer_1 -1 -1
2 Interviewer_1 -1 -1
3 Interviewer_1 1 -1
4 Interviewer_1 1 1
5 Interviewer_1 -1 1
6 Interviewer_1 -1 -1
7 Interviewer_1 -1 -1
8 Interviewer_1 1 -1
9 Interviewer_1 1 -1
10 Interviewer_2 -1 -1
11 Interviewer_2 -1 -1
12 Interviewer_2 -1 -1
13 Interviewer_2 -1 -1
14 Interviewer_2 -1 -1
15 Interviewer_2 -1 -1
16 Interviewer_2 -1 -1
17 Interviewer_2 -1 -1
18 Interviewer_2 -1 -1
19 Interviewer_2 -1 -1
20 Interviewer_2 -1 -1
21 Interviewer_2 -1 -1
22 Interviewer_2 -1 -1
23 Interviewer_2 1 -1
24 Interviewer_2 -1 -1
25 Interviewer_2 -1 -1
26 Interviewer_2 -1 -1
27 Interviewer_2 -1 -1
28 Interviewer_2 1 -1
29 Interviewer_2 -1 -1
df.groupby('Interviewer').sum()
Yields the Sum:
Score_1 Score_2
Interviewer
Interviewer_1 -2 -4
Interviewer_2 -16 -20
Using Scipy:
def applyspearman(row):
    row['Cor'] = scipy.stats.spearmanr(row['Score_1'], row['Score_2'])[0]
    return row
df = df.groupby('Interviewer').apply(applyspearman)
df
Interviewer Score_1 Score_2 Cor
0 Interviewer_1 -1 1 -0.089087081
1 Interviewer_1 -1 -1 -0.089087081
2 Interviewer_1 -1 -1 -0.089087081
3 Interviewer_1 1 -1 -0.089087081
4 Interviewer_1 1 1 -0.089087081
5 Interviewer_1 -1 1 -0.089087081
6 Interviewer_1 -1 -1 -0.089087081
7 Interviewer_1 -1 -1 -0.089087081
8 Interviewer_1 1 -1 -0.089087081
9 Interviewer_1 1 -1 -0.089087081
10 Interviewer_2 -1 -1
11 Interviewer_2 -1 -1
12 Interviewer_2 -1 -1
13 Interviewer_2 -1 -1
14 Interviewer_2 -1 -1
15 Interviewer_2 -1 -1
16 Interviewer_2 -1 -1
17 Interviewer_2 -1 -1
18 Interviewer_2 -1 -1
19 Interviewer_2 -1 -1
20 Interviewer_2 -1 -1
21 Interviewer_2 -1 -1
22 Interviewer_2 -1 -1
23 Interviewer_2 1 -1
24 Interviewer_2 -1 -1
25 Interviewer_2 -1 -1
26 Interviewer_2 -1 -1
27 Interviewer_2 -1 -1
28 Interviewer_2 1 -1
29 Interviewer_2 -1 -1
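A possible explanation for the blank Cor values, offered as a sketch rather than a confirmed diagnosis: in the sample data every Score_2 value for Interviewer_2 is -1, so that column has no variance and a rank correlation is undefined. The snippet below rebuilds Interviewer_2's two columns (the variable names are illustrative, not from the original code) and checks what scipy.stats.spearmanr returns for them.

```python
import warnings

import numpy as np
from scipy import stats

# Interviewer_2's scores, copied from the sample data (rows 10-29)
score_1 = np.array([-1] * 13 + [1] + [-1] * 4 + [1] + [-1])
score_2 = np.array([-1] * 20)

# A column with a single distinct value has zero variance
score_2_is_constant = np.unique(score_2).size == 1

with warnings.catch_warnings():
    # SciPy may warn about constant input; silence it for this check
    warnings.simplefilter("ignore")
    rho_2 = stats.spearmanr(score_1, score_2)[0]

print("Score_2 constant:", score_2_is_constant)
print("Spearman rho:", rho_2)
```

If this reading is right, the NaN comes from the constant column rather than from ties as such.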
I tried using this formula by hand in Excel (rank functions, absolute difference, d^2, and the sum of d^2) and got different results for both interviewers: ρ = 1 - (6 Σ d_i^2) / (n(n^2 - 1))
Interviewer_1, ρ = 0.878788
Interviewer_2, ρ = 0.993985
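One thing that may matter here: the d^2 shortcut formula is only exact when all ranks are distinct, and this data is heavily tied. The sketch below, using Interviewer_1's scores from the sample data, compares the shortcut applied to average ranks with the Pearson correlation of those same ranks, which is the general definition of Spearman's coefficient. It assumes ties receive average ranks (the default of scipy.stats.rankdata); it does not try to reproduce the Excel ranking, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Interviewer_1's scores, copied from the sample data (rows 0-9)
score_1 = np.array([-1, -1, -1, 1, 1, -1, -1, -1, 1, 1])
score_2 = np.array([1, -1, -1, -1, 1, 1, -1, -1, -1, -1])

# Average ranks: tied values share the mean of the positions they occupy
rank_1 = stats.rankdata(score_1)
rank_2 = stats.rankdata(score_2)

# Shortcut formula, exact only when there are no ties
n = len(score_1)
d_squared_sum = np.sum((rank_1 - rank_2) ** 2)
rho_shortcut = 1 - 6 * d_squared_sum / (n * (n ** 2 - 1))

# General definition: Pearson correlation of the ranks
rho_pearson_on_ranks = np.corrcoef(rank_1, rank_2)[0, 1]

# SciPy's result, for comparison
rho_scipy = stats.spearmanr(score_1, score_2)[0]

print(rho_shortcut, rho_pearson_on_ranks, rho_scipy)
```

On this data the shortcut value and the rank-correlation value disagree, while the rank-correlation value agrees with SciPy, so a hand calculation based on the shortcut would not be expected to match scipy.stats.spearmanr when ties are present.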
Questions:
- Why is Interviewer_2 null? Is the NaN issue related to rank ties?
- Why do SciPy's results differ from my results by hand?
Shouldn't you be returning the result of scipy.stats.spearmanr in applyspearman, not assigning it to every row as an additional column in the grouped dataframe? Spearman rank is meant to be a summary statistic, not a per-row measure. – Simon Bowly
.apply applies the function to each grouped sub-DataFrame, so row here is really a DataFrame. See here for some detail – Brad Solomon
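The two comments above could be put into practice roughly as follows. This is a minimal sketch, not the commenters' own code: the helper name spearman_for_group and the result name cor_by_interviewer are hypothetical, and the score columns are selected explicitly before calling apply so the function only sees Score_1 and Score_2.

```python
import warnings

import pandas as pd
from scipy import stats

# Same sample data as in the question, written more compactly
df = pd.DataFrame({
    'Interviewer': ['Interviewer_1'] * 10 + ['Interviewer_2'] * 20,
    'Score_1': [-1, -1, -1, 1, 1, -1, -1, -1, 1, 1]
               + [-1] * 13 + [1] + [-1] * 4 + [1] + [-1],
    'Score_2': [1, -1, -1, -1, 1, 1, -1, -1, -1, -1] + [-1] * 20,
})


def spearman_for_group(group):
    # group is the sub-DataFrame for one interviewer; return a single number
    return stats.spearmanr(group['Score_1'], group['Score_2'])[0]


with warnings.catch_warnings():
    # SciPy may warn when a group has a constant column
    warnings.simplefilter("ignore")
    cor_by_interviewer = (
        df.groupby('Interviewer')[['Score_1', 'Score_2']]
          .apply(spearman_for_group)
    )

print(cor_by_interviewer)
```

This yields one coefficient per interviewer instead of repeating the value on every row.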