2
votes

I have been looking at clustering infrared spectroscopy data with the sklearn clustering methods. I am having trouble getting the clustering to work with the data. Since I'm new to this, I don't know whether my code is wrong or my overall approach is wrong.

My data, in Pandas DataFrame format, looks like this:

Index     Wavenumbers (cm-1)     %Transmission_i   ...
0         650                    100               ... 
.          .                      .                ...
.          .                      .                ...
.          .                      .                ...
n         4000                   95                ...

where the x-axis for all spectra is the Wavenumbers (cm-1) column and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns by which spectra are most similar to each other, so I am trying this code:

X        = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)

where df is my DataFrame and np is numpy (hopefully obvious). The problem is that when I print out the cluster labels it spits out nothing but -1, which means all my data is treated as noise. This isn't the case: when I plot my data I can clearly see that some spectra look very similar (as they should).
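
For what it's worth, a rough way to check whether the default eps=0.5 is simply too small for this kind of data (just a sketch, assuming the x column is named 'Wavenumbers (cm-1)' as above) is to look at the nearest-neighbour distances between the spectra:

from sklearn.neighbors import NearestNeighbors

# Build one row per spectrum (all columns except the wavenumber axis)
x_col = 'Wavenumbers (cm-1)'           # assumed column name
X = df.drop(columns=[x_col]).T.values  # shape: (n_spectra, n_wavenumbers)

# Distance from each spectrum to its nearest neighbour
nn = NearestNeighbors(n_neighbors=2).fit(X)
dists, _ = nn.kneighbors(X)
print(dists[:, 1])  # compare these with DBSCAN's default eps=0.5

If those distances are much larger than 0.5, DBSCAN with default settings will label everything -1.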

How can I get the similar spectra to be clustered properly?

EDIT: Here is a minimum working example.

import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.preprocessing  # make sure sk.preprocessing is available
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

x = 'x-vals'

def cluster_data(df):

    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))

    a = sk.preprocessing.normalize([avg_list], norm='max')[0]
    b = sk.preprocessing.normalize([dif_list], norm='max')[0]

    X = []
    for i,j in zip(a,b):
        X.append([i,j])

    X = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)

    return clusters.labels_

def plot_clusters(df, clusters):
    colors = ['red', 'green', 'blue', 'black', 'pink']
    i      = 0
    for col in df:
        if col == x:
            continue
        color = colors[clusters[i]]
        plt.plot(df[x], df[col], color=color)
        i +=1
    plt.show()


x1  = np.linspace(-np.pi, np.pi, 201)
y1  = np.sin(x1) + 1
y2  = np.cos(x1) + 1
y3  = np.zeros_like(x1) + 2
y4  = np.zeros_like(x1) + 1.9
y5  = np.zeros_like(x1) + 1.8
y6  = np.zeros_like(x1) + 1.7
y7  = np.zeros_like(x1) + 1
y8  = np.zeros_like(x1) + 0.9
y9  = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7

df  = pd.DataFrame({'x-vals':x1, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4,
                    'y5':y5, 'y6':y6, 'y7':y7, 'y8':y8, 'y9':y9,
                    'y10':y10})

clusters = cluster_data(df)

plot_clusters(df, clusters)

This produces the following plot, where red is a cluster and pink is noise:

[plot made by the minimum working example]

Please clarify: What are all the columns? Is a datapoint one row or one column? Many Transmission_i columns? - felice
You should either use a method accepted in the industry [for infrared scans] or try different methods and see which one suits you: DBSCAN, t-SNE, k-means, hierarchical clustering. Different distance measures may be helpful as well. - Sergey Bushmanov
Hey @felice, all the columns are similar to the one shown there; I put _i to denote that it is one of many columns of transmission data. The data is a line with the Wavenumber column as the x-axis and a transmission column as the y-axis, where each row is one point; the columns are the data I want to cluster. Is this helpful, or is there more confusion? - Cavenfish
Hey @SergeyBushmanov, I would try different methods, but I'm fairly certain my issue is that the code is not working properly. Many of the transmission column arrays are very similar (each item is only off by a slightly different amount), but DBSCAN still considers them noise rather than clustering them. - Cavenfish
Can you please provide us with a minimal reproducible example, e.g. with the dataframe in code with two datapoints? - felice

2 Answers

1
votes

I was able to get a method working, but I'm not fully convinced this is the best method for clustering IR spectra.

First I run through all the spectra and compile a list of the mean and the mean of the first derivative of each spectrum. The mean is meant to represent the vertical location of a spectrum, while the mean of the first derivative is meant to represent its shape.

avg_list = []
dif_list = []
for col in df:
    if x == col:
        continue
    avg_list.append(np.mean(df[col].values))
    dif_list.append(np.mean(np.diff(df[col].values)))

Then I normalize each list so that I can pick an eps value based on percent changes.

a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([dif_list], norm='max')[0]

After that I make a 2D array so I can run DBSCAN on these two features.

X = []
for i,j in zip(a,b):
    X.append([i,j])

Then I run the DBSCAN clustering method with an arbitrary percent difference value for the eps parameter.

X        = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)

Then clusters.labels_ returns an array whose length is the number of spectra in my DataFrame. It works fairly well, but it is rather exclusive and the clusters could be better; some more fine-tuning would be helpful.
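
If anyone wants to fine-tune this further, one option (just a sketch of the kind of sweep I mean; the eps range and min_samples values here are arbitrary, not values I have validated) is to grid-search eps and min_samples and look at how many clusters and noise points each combination produces:

import numpy as np
from sklearn.cluster import DBSCAN

def sweep_dbscan(X, eps_values=np.arange(0.05, 0.55, 0.05),
                 min_samples_values=(2, 3, 5)):
    """Print cluster/noise counts for a grid of DBSCAN parameters."""
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = np.sum(labels == -1)
            print(f"eps={eps:.2f}, min_samples={min_samples}: "
                  f"{n_clusters} clusters, {n_noise} noise points")

Calling sweep_dbscan(X) with the X built above prints one line per parameter combination, which makes it easier to see where the all-noise regime ends.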

0
votes

First, transpose your dataframe so that you have the datapoints as rows, as is the standard (one way to do this is sketched after the table). It should look like this:

Index    650    660    ...    4000
0        100    98     ...    95
1        .      .      ...    .
.        .      .      ...    .
n        .      .      ...    .
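
A minimal sketch of that transpose step (assuming the wavenumber column is literally named 'Wavenumbers (cm-1)' as in the question; adjust the name if yours differs) could be:

# Use the wavenumbers as the index, then transpose so each spectrum is a row
df = df.set_index('Wavenumbers (cm-1)').T
print(df.shape)  # (n_spectra, n_wavenumbers)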

Then you get your X for the clustering like this:

X = df.values

Next, you cluster:

from sklearn.cluster import DBSCAN
cluster = DBSCAN().fit(X)
print(cluster.labels_)

As a recommendation for spectral data, k-means (disadvantage: you need to set the number of clusters beforehand) and self-organizing maps (disadvantage: soft clusters instead of hard clusters) work quite well. For example, you can find an example here for clustering of hyperspectral data.
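
For completeness, a minimal k-means sketch on the transposed data (the choice of n_clusters=3 is arbitrary here, not something derived from your data) might look like:

from sklearn.cluster import KMeans

# X has one row per spectrum, as prepared above
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print(kmeans.labels_)  # cluster index for each spectrum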