1
votes

Is it possible for these two correlations to be different?

Pandas version 0.18.1

from pandas import Series
a = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Crystal Palace']
b = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
c = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
d = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'West Ham']


Series(a).corr(Series(b), method="spearman")
0.69999999999999996
Series(c).corr(Series(d), method="spearman")
0.8999999999999998
1
Python 3.5.2 and Anaconda 4.4.1. - Tales Tenorio Pimentel
pandas has to rank these strings somehow, so they are ranked alphabetically. Teams may therefore be ranked differently depending on what other teams are present. So pandas is calculating "correctly", but this just isn't the operation you wanted. - Alex Riley
I'm no statistician, but doesn't correlation need to be computed on two series of numbers? What were you expecting as output? In Pandas 0.19.2 the sample code above crashes because strings aren't floats. - nico
For Spearman's correlation you need data that is measured on an ordinal scale. What you have is nominal. I suggest you take a look at similarity measures for nominal attributes instead of calculating correlations. - ayhan

1 Answer

2
votes

This is the expected behavior. Spearman correlation is a rank correlation, meaning it is computed on the rankings of your data, not on the data itself. In your example, each pair of lists differs in only one position, but that single difference produces different rankings. As suggested in the comments, Spearman correlation probably isn't what you actually want to use.
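For comparison, here is a rough sketch of what a nominal-scale measure could look like. The comments do not name a specific measure, so the choice of a simple position-wise matching score is an assumption, and simple_matching is a hypothetical helper, not a pandas or SciPy function.

```python
def simple_matching(x, y):
    """Fraction of positions where two equal-length sequences hold the same label.

    Hypothetical helper for illustration; it treats the values as nominal
    labels and never ranks them.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(p == q for p, q in zip(x, y)) / len(x)


a = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Crystal Palace']
b = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
c = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
d = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'West Ham']

score_ab = simple_matching(a, b)  # 4 of 5 positions agree
score_cd = simple_matching(c, d)  # 4 of 5 positions agree
```

Because this score only asks whether the labels in each position are equal, both pairs get the same value, which is closer to the intuition behind the question.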

To expand on why this happens: under the hood, pandas essentially calls scipy.stats.spearmanr to compute the correlation. Looking at the source code for spearmanr, it ends up using scipy.stats.rankdata to perform the ranking, then np.corrcoef to get the correlation:

import numpy as np
import scipy.stats as ss
corr1 = np.corrcoef(ss.rankdata(a), ss.rankdata(b))[1,0]
corr2 = np.corrcoef(ss.rankdata(c), ss.rankdata(d))[1,0]

This produces the same values you're observing. Now look at the rankings used in each correlation calculation:

ss.rankdata(a)
[ 1.  3.  4.  5.  2.]

ss.rankdata(b)
[ 1.  2.  3.  5.  4.]

ss.rankdata(c) 
[ 1.  2.  3.  5.  4.]

ss.rankdata(d)
[ 1.  2.  3.  4.  5.]

Notice that the rankings for a and b differ in three positions, whereas the rankings for c and d differ in only two, so we'd expect the resulting correlations to be different.
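The same numbers can be reproduced by hand from the rank differences. The sketch below uses only the standard library and the textbook formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid only when there are no ties. The helper names alphabetical_ranks and spearman_no_ties are hypothetical, and ranking by Python's default string order is an assumption that happens to match the rankings shown above.

```python
def alphabetical_ranks(values):
    """Rank strings 1..n by sorted order (assumes no duplicate values)."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman_no_ties(x, y):
    """Spearman's rho from squared rank differences; only valid without ties."""
    rx, ry = alphabetical_ranks(x), alphabetical_ranks(y)
    n = len(rx)
    d_squared = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


a = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Crystal Palace']
b = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
c = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']
d = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'West Ham']

# a vs b: rank differences are 0, 1, 1, 0, -2, so sum(d^2) = 6
rho_ab = spearman_no_ties(a, b)
# c vs d: rank differences are 0, 0, 0, 1, -1, so sum(d^2) = 2
rho_cd = spearman_no_ties(c, d)
```

With n = 5, a sum of squared differences of 6 gives roughly 0.7, and a sum of 2 gives roughly 0.9, matching the pandas output in the question.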