In one of my previous Stack Overflow questions (here), it was recommended that I use hierarchical clustering to group the strings in a list based on Hamming distance. I found several other posters with questions about the Python SciPy package, specifically its linkage functions (here and here), who had problems similar to mine.
If my understanding of SciPy's linkage function is correct, I need to pass in an array and specify the metric so that it clusters based on Hamming distance. However, when I try this on a list of several strings:
import scipy.cluster.hierarchy as sch
X = [[i] for i in ['apples','applez','appelz','apppel','orange','oranze','oranqg','orangs']]
Z = sch.linkage(X,method='complete',metric='Hamming')
I receive the following error:
Traceback (most recent call last):
  File "C:/Users/nguyenthoh/Desktop/UMI_grouping/scratch_script.py", line 15, in <module>
    Z = sch.linkage(X,method='complete',metric='Hamming')
  File "C:\Users\User\AppData\Roaming\Python\Python27\site-packages\scipy\cluster\hierarchy.py", line 694, in linkage
    y = _convert_to_double(np.asarray(y, order='c'))
  File "C:\Users\User\AppData\Roaming\Python\Python27\site-packages\scipy\cluster\hierarchy.py", line 1216, in _convert_to_double
    X = X.astype(np.double)
ValueError: could not convert string to float: apples
which obviously indicates that the input for linkage needs to be an array of numbers, not strings.
Is there a way to use strings with SciPy's linkage or do I need to provide a Hamming distance function with linkage so that it can cluster the strings?
u != v. So, without evaluating anything about your task or approach: (0) check the validate function (it probably creates a NumPy array); (1) check what NumPy does with u != v when u or v are object arrays (from strings); (3) don't use object arrays. Naive approach: just convert your text to ASCII values; (4) keep in mind: #u = #v (the strings must have equal length). – sascha
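The naive approach from the comment might look roughly like the sketch below. It assumes every string has the same length, encodes each character as its code point with ord, and passes the resulting numeric array to linkage. The fcluster call and its 0.6 threshold are illustrative assumptions for inspecting the result; they are not part of the original question.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

words = ['apples', 'applez', 'appelz', 'apppel',
         'orange', 'oranze', 'oranqg', 'orangs']

# Hamming distance is only defined for equal-length inputs.
assert len(set(len(w) for w in words)) == 1

# Encode each string as a row of character code points, so linkage
# receives a numeric 2-D array instead of strings.
X = np.array([[ord(ch) for ch in w] for w in words], dtype=float)

# SciPy's 'hamming' metric is the fraction of positions that differ.
Z = sch.linkage(X, method='complete', metric='hamming')

# Cut the tree into flat clusters. The 0.6 threshold is an assumption
# chosen for this toy data, not a general recommendation.
labels = sch.fcluster(Z, t=0.6, criterion='distance')

for word, label in zip(words, labels):
    print(word, label)
```

Because the metric only tests whether positions are equal, the numeric values themselves carry no meaning here; any consistent character-to-number mapping should behave the same way.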