I have been looking at clustering infrared spectroscopy data with the sklearn clustering methods. I am having trouble getting the clustering to work with the data, since I'm new to this I don't know if the way I'm coding it is wrong or my approach is wrong.
My data, in Pandas DataFrame format, looks like this:
Index Wavenumbers (cm-1) %Transmission_i ...
0 650 100 ...
. . . ...
. . . ...
. . . ...
n 4000 95 ...
where, the x-axis for all spectra is the Wavenumbers (cm-1) column and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns (in terms of which spectra are most similar to each other), as such I am trying this code:
X = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)
where df is my DataFrame, and np is numpy (hopefully obvious). The problem is when I print out the cluster labels it just spits out nothing but -1 which means all my data is noise. This isn't the case, when I plot my data I can clearly see a some spectra look very similar (as they should).
How can I get the similar spectra to be clustered properly?
EDIT: Here is a minimum working example.
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
x = 'x-vals'
def cluster_data(df):
avg_list = []
dif_list = []
for col in df:
if x == col:
continue
avg_list.append(np.mean(df[col].values))
dif_list.append(np.mean(np.diff(df[col].values)))
a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([dif_list], norm='max')[0]
X = []
for i,j in zip(a,b):
X.append([i,j])
X = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)
return clusters.labels_
def plot_clusters(df, clusters):
colors = ['red', 'green', 'blue', 'black', 'pink']
i = 0
for col in df:
if col == x:
continue
color = colors[clusters[i]]
plt.plot(df[x], df[col], color=color)
i +=1
plt.show()
x1 = np.linspace(-np.pi, np.pi, 201)
y1 = np.sin(x1) + 1
y2 = np.cos(x1) + 1
y3 = np.zeros_like(x1) + 2
y4 = np.zeros_like(x1) + 1.9
y5 = np.zeros_like(x1) + 1.8
y6 = np.zeros_like(x1) + 1.7
y7 = np.zeros_like(x1) + 1
y8 = np.zeros_like(x1) + 0.9
y9 = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7
df = pd.DataFrame({'x-vals':x1, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4,
'y5':y5, 'y6':y6, 'y7':y7, 'y8':y8, 'y9':y9,
'y10':y10})
clusters = cluster_data(df)
plot_clusters(df, clusters)
This produces the following plot, where red is a cluster and pink is noise.

Transmission_icolumns? - felice_ito denote that it is one of many columns of transmission data. The data is a line represented by the Wavenumber column (x-axis) and the transmission column (y-axis) where each row is one point but the column is the data I want to cluster. Is this helpful, or is there more confusion? - Cavenfish