9
votes

I have a DataFrame df with two string columns, profession and media. I would like to calculate the correlation between those two columns.

Is there a quick way to calculate the correlation between columns of strings? Or do I have to transform each profession and media value into a number and then calculate the correlation with .corr()?

I found a similar question (Is there a way to get correlation with string data and a numerical value in pandas?) but I would like to check the string, not each word within the string.

df

   profession     media
0  media lawyer   print
1  student        online
2  student        print
3  professor      online
4  media lawyer   online

1 Answer

19
votes

You can convert both columns to the categorical dtype, replace them with their integer category codes, and then call .corr():

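# Replace each string column with its integer category codes
df['profession'] = df['profession'].astype('category').cat.codes
df['media'] = df['media'].astype('category').cat.codes
# Pairwise (Pearson) correlation of the encoded columns
df.corr()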
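
For reference, here is a self-contained sketch of the same idea, assuming the five rows shown in the question. The name `encoded` is introduced here only for illustration. It encodes a copy so the original strings are kept. Keep in mind that pandas assigns the codes in sorted order of the labels, so the resulting number depends on that arbitrary ordering and should be read with care for categories that have no natural order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "profession": ["media lawyer", "student", "student", "professor", "media lawyer"],
    "media": ["print", "online", "print", "online", "online"],
})

# Encode every column as integer category codes on a copy,
# leaving the original string columns in df untouched.
# Codes follow the sorted label order, e.g. 'online' -> 0, 'print' -> 1.
encoded = df.apply(lambda col: col.astype("category").cat.codes)

# Pairwise Pearson correlation of the encoded columns
corr = encoded.corr()
print(encoded)
print(corr)
```

If the original columns are no longer needed, assigning the codes back to df as in the answer works just as well.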