
In one of my previous Stack Overflow questions (here), I was advised to use hierarchical clustering to group the strings in a list by Hamming distance. I found several other posters with similar problems who had questions about the Python SciPy package, specifically its linkage function (here and here).

If my understanding of SciPy's linkage function is correct, I need to pass in an array and specify the metric so that it clusters based on Hamming distance. However, when I try this on a list of several strings:

import scipy.cluster.hierarchy as sch
X = [[i] for i in ['apples','applez','appelz','apppel','orange','oranze','oranqg','orangs']]
Z = sch.linkage(X,method='complete',metric='Hamming')

I receive the following error:

Traceback (most recent call last):
File "C:/Users/nguyenthoh/Desktop/UMI_grouping/scratch_script.py", line 15, in <module>
Z = sch.linkage(X,method='complete',metric='Hamming')
File "C:\Users\User\AppData\Roaming\Python\Python27\site-packages\scipy\cluster\hierarchy.py", line 694, in linkage
y = _convert_to_double(np.asarray(y, order='c'))
File "C:\Users\User\AppData\Roaming\Python\Python27\site-packages\scipy\cluster\hierarchy.py", line 1216, in _convert_to_double
X = X.astype(np.double)
ValueError: could not convert string to float: apples

which obviously indicates that the input for linkage needs to be an array of numbers, not strings.

Is there a way to use strings with SciPy's linkage or do I need to provide a Hamming distance function with linkage so that it can cluster the strings?

Just act according to the docs and the code. The most important thing is how numpy treats u != v. So, without evaluating anything about your task or approach: (0) check the validation function (it probably creates a numpy array); (1) check what numpy does with u != v when u or v are object arrays (built from strings); (3) don't use object arrays. Naive approach: just convert your text to ASCII values; (4) keep in mind: #u = #v. - sascha
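The naive approach from the comment above can be sketched roughly as follows. This is only a minimal illustration, not a vetted solution: it assumes all strings have the same length (which is how I read "#u = #v"), uses ord() for the character-to-number conversion, and picks two clusters in fcluster purely because the sample data looks like two groups.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

words = ['apples', 'applez', 'appelz', 'apppel',
         'orange', 'oranze', 'oranqg', 'orangs']

# Hamming distance compares position by position, so every string
# must have the same length.
assert len({len(w) for w in words}) == 1

# One row per string, one column per character position; each cell holds
# the character's code point, so the array is numeric rather than object-typed.
X = np.array([[ord(c) for c in w] for w in words])

# SciPy's 'hamming' metric is the fraction of positions that differ.
Z = sch.linkage(X, method='complete', metric='hamming')

# Cut the tree into flat clusters; t=2 is an assumption for this sample data.
labels = sch.fcluster(Z, t=2, criterion='maxclust')
```

With this input, linkage no longer sees strings at all, so the float conversion that raised the ValueError has nothing to choke on.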

1 Answer


ValueError: could not convert string to float: apples

You need to label-encode categorical variables whose categories are given as strings (these could also be numbers cast to strings in Python).

In Python, we have: from sklearn.preprocessing import LabelEncoder.

Look at the documentation and implement it. It assigns a separate integer label to each distinct string category.


CODE: le = LabelEncoder().fit(["paris", "paris", "tokyo", "amsterdam"])  # fit first; illustrative labels
      le.transform(["tokyo", "tokyo", "paris"])
OUTPUT: array([2, 2, 1]...)

Then you can use the encoded values for hierarchical/agglomerative clustering. Also check that all the variables are in numeric form, such as int or float.
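As a rough sketch of how this could apply to the strings in the question: the version below encodes each character rather than each whole string. That is my assumption, not something stated above. Encoding whole strings would give a single column, where the Hamming distance between two different strings is always 1, so nothing useful would be left to cluster on. The variable names are illustrative.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.preprocessing import LabelEncoder

words = ['apples', 'applez', 'appelz', 'apppel',
         'orange', 'oranze', 'oranqg', 'orangs']

# Split each string into characters: shape (n_strings, string_length).
# This only works as a 2-D array because all strings have equal length.
chars = np.array([list(w) for w in words])

# Fit one encoder on every character that occurs, then encode each cell.
le = LabelEncoder().fit(chars.ravel())
X = le.transform(chars.ravel()).reshape(chars.shape)

# Sanity check: everything handed to linkage is numeric.
assert np.issubdtype(X.dtype, np.number)

Z = sch.linkage(X, method='complete', metric='hamming')
```

Since the Hamming metric only asks whether two positions are equal, the particular integers the encoder assigns do not matter, only that equal characters get equal labels.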

Have a nice day!