
I need to write a k-means algorithm for the zoo.csv data from https://archive.ics.uci.edu/ml/datasets/Zoo. Some parts of the code find a suitable number of clusters (using the elbow method), and others test a given number of clusters (n_clusters). The problem is that the values in the anim_name column of the CSV file are strings (aardvark, antelope, etc.), and when I run this code I get the error "ValueError: could not convert string to float: 'aardvark'". How can I convert the values of the anim_name column to float (or int) so that the algorithm works? I have tried different methods, but nothing has worked so far.

Here is my code so far (I am doing this in Google Colab):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing

from google.colab import drive
drive.mount('/content/drive')

data=pd.read_csv('/content/drive/MyDrive/MyFiles/zoo[1].csv', delimiter=',')
data.head()

kmeans=KMeans(n_clusters=2,max_iter=300)
kmeans.fit(data)

y_km=kmeans.predict(data)
clusters=kmeans.labels_
data['clusters']=clusters
data

After the previous part I get this error message:

/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

ValueError: could not convert string to float: 'aardvark'

res1=np.round(data.groupby('clusters').mean(),2)
pd.DataFrame(res1)

scores = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    scores.append(kmeans.inertia_)
plt.plot(range(1, 11), scores)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Scores')
plt.show()

1 Answer


You can convert a string column into zeros and ones (one-hot encoding) or into integer category codes. Here is a sample of both options.

I don't see a data sample in the question, and I couldn't find anything useful when I searched for one, but you should be able to adapt this generic example to your needs.

import pandas as pd
dummies = pd.get_dummies(df['your_column'])

vs.

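# make sure the column holds strings
df['zipcode'] = df['zipcode'].astype(str)
# convert to categorical, then take the integer code of each category
# (astype('category') alone still holds the original strings)
df['zipcode'] = df['zipcode'].astype('category').cat.codes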
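To make the two options concrete for the question's case, here is a minimal sketch. The small DataFrame is made up as a stand-in for zoo.csv: the anim_name column comes from the question, but the other column names and all the values are assumptions for illustration only, and the KMeans settings are arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for zoo.csv; the real file has more rows and columns.
df = pd.DataFrame({
    "anim_name": ["aardvark", "antelope", "bass", "chicken"],
    "hair": [1, 1, 0, 0],
    "feathers": [0, 0, 0, 1],
    "legs": [4, 4, 0, 2],
})

# Option 1: one-hot encoding, one 0/1 column per distinct name.
one_hot = pd.get_dummies(df, columns=["anim_name"], dtype=int)

# Option 2: replace each distinct name with an integer category code.
coded = df.copy()
coded["anim_name"] = coded["anim_name"].astype("category").cat.codes

# Every column is numeric now, so KMeans no longer raises the ValueError.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(coded)
print(coded)
print(labels)
```

One caveat: integer codes put the names on an arbitrary numeric scale, and k-means treats that scale as a real distance. If the name column is only an identifier for each row, dropping it before clustering with df.drop(columns=["anim_name"]) may be the more sensible choice.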

Also, take a look at the links below.

https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

https://developer.ibm.com/tutorials/ba-cleanse-process-visualize-data-set-3/

I think that should do it for you. Post back if you have more questions about these topics.