
I'm trying to use k-NN on a tricky simulated dataset. The NumPy array has shape (1000, 100), so there are a lot of dimensions. Before I run k-NN for training/classification I need to pre-process/transform the dataset. PCA doesn't work, as the variance of all the features is almost the same. The data is available as a CSV in this gist: https://gist.github.com/modqhx/0ab61da16eae8f371a1d6a787f018a64

On plotting the data, it looks like a 3d spherical structure (here's a screenshot using 'hypertools'):

[screenshot: 3d spherical point cloud rendered with hypertools]

Any thoughts on how to proceed?

EDIT (in response to the comment): Yes, I understand that if there were no "apparent" clusters there would be no point in using k-NN; I should have phrased that better. The raw data shows no clusters, but some form of dimensionality reduction may reveal them. There are 100 dimensions, and PCA does not help because the variance of all 100 features is almost the same. So the question becomes: how can we do dimensionality reduction when the variances of all features are almost identical? Again, this is an exercise and the point is to make k-NN work (if that makes any sense). I've been told that up to the first and second moments you won't find any clusters, but from the third moment onward you may.
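For what it's worth, here is a quick way to probe those higher moments per feature (a minimal sketch, assuming the gist CSV is saved locally as data.csv with a header row, and that any trailing label column is dropped):

import numpy as np
from scipy import stats

dat = np.loadtxt('data.csv', skiprows=1, delimiter=',')  # hypothetical local copy of the gist
X = dat[:, :100]                                         # the 100 feature columns

# first and second moments: nearly identical across features
print(X.mean(axis=0).round(2), X.var(axis=0).round(2))

# third and fourth standardized moments: where differences may hide
skews = stats.skew(X, axis=0)
kurts = stats.kurtosis(X, axis=0)       # excess kurtosis, 0 for a Gaussian
print(np.argsort(-np.abs(kurts))[:5])   # features deviating most from Gaussian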

If there were no apparent clustering of the data then what are you hoping to accomplish with the k-NN? — WestCoastProjects

1 Answer


I mostly concur with the comment of @javadba: if your data set shows no obvious clustering when you look at it, applying k-NN or any other clustering algorithm will only give you artifacts and dubious signals. The reason I'm writing this answer anyway is that I did find some structure in your data.

What I did first was load your data. As I understand it, your (1000, 101)-shaped array corresponds to 1000 points in 100-d space (plus a trailing column of zeros/ones that probably doesn't matter here). Note that this is a very sparse object if you think about it. Consider a line with 2 points, a square with 4 points, a cube with 8 points... a 100-dimensional regular grid with the same sparsity (2 points along each dimension) would contain 2^100 points. That's a lot more than 1000. Unfortunately, I have difficulty visualizing sparse point clouds in 100-d space.
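To make the sparsity concrete, here's a small experiment (my own illustration, not part of the original analysis) showing how pairwise distances concentrate as the dimension grows, which is exactly what makes nearest-neighbour reasoning fragile in 100-d:

import numpy as np

rng = np.random.default_rng(0)
for d in (3, 100):
    pts = rng.random((1000, d))            # 1000 uniform points in a d-cube
    sub = pts[:200]                        # subsample to keep pairwise cost low
    diffs = sub[:, None, :] - sub[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    dists = dists[np.triu_indices(200, k=1)]
    print(d, dists.min() / dists.max())    # ratio approaches 1 as d grows

In 3-d the nearest and farthest pairs differ enormously; in 100-d every pair of points sits at nearly the same distance.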

So what I did was choose three axes at random and plot the corresponding 3d scatter plot to see if any patterns emerge. I did this several times. Code:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

# load the data, skipping the header row; ignore the trailing label column
dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
nums = np.arange(dat.shape[1] - 1)
for _ in range(5):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    np.random.shuffle(nums)        # pick three feature indices at random
    inds = nums[:3]
    plotdat = dat[:, inds]
    ax.scatter(*plotdat.T)
    ax.set_title(str(inds))        # record which dimensions were plotted
plt.show()

Here's how a typical plot looks from every angle:

[figure: a typical 3d scatter plot of three randomly chosen dimensions]

As you can see, it's a mess! Or, more scientifically speaking: it's hard to visually distinguish these scatter plots from instances of a uniform distribution in a cube. If all of the plots look like this, it's possible that there's no clustering to begin with, and you should stop before you even start. Whatever labels you could assign to your data would be meaningless.

But there's good news: in the interactive window I noticed that the above plot looks much more interesting from a certain angle:

[figure: the same scatter plot rotated; the data split into two groups along dimension 42]

The data clearly show a separation along dimension 42 (of all numbers!). Now this is promising. Note that a data set can be truly clustered without this being obvious in any axis-aligned projection. Imagine the following scenario in 2d:

[figure: two non-overlapping elongated blobs lying along the diagonal of a square]

While the data is clearly clustered, this is far from obvious if we're only looking at axis-aligned projections.
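To see this concretely, here's a small synthetic construction (my sketch, not the data from the post): two elongated blobs stretched along the diagonal, offset along the perpendicular. Each axis-aligned marginal looks unimodal, even though a 45° projection separates the clusters perfectly.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 500
long_axis = np.array([1.0, 1.0]) / np.sqrt(2)    # blobs are stretched this way
short_axis = np.array([1.0, -1.0]) / np.sqrt(2)  # ...and offset this way

def blob(offset):
    return (rng.normal(0, 3.0, n)[:, None] * long_axis
            + rng.normal(0, 0.3, n)[:, None] * short_axis
            + offset * short_axis)

pts = np.vstack([blob(+1.0), blob(-1.0)])

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
axes[0].scatter(*pts.T, s=4)
axes[0].set_title('2d view: two clusters')
axes[1].hist(pts[:, 0], bins=40)
axes[1].set_title('x marginal: unimodal')
axes[2].hist(pts[:, 1], bins=40)
axes[2].set_title('y marginal: unimodal')
plt.show()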

My point is thus that finding evidence for the existence of clusters in your 100-dimensional dataset is genuinely difficult. It's already hard to find evidence of clustering in lower-dimensional subspaces, and even if you find none there, that wouldn't mean your data isn't clustered along some oblique direction in 100-d space.

I would start along this path of looking at lower-dimensional cuts. The fact that your points separate so nicely along dimension 42 suggests that this is not hopeless. You could systematically check every dimension to see whether any other produces such a separation, but keep in mind that, as the 2d example above shows, a separation might only be visible along certain combinations of dimensions rather than along any single axis.
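One crude way to run that systematic check (a sketch, under the assumption that a separated dimension shows up as a large interior gap in its sorted values; a lone outlier can trigger the same signal, so inspect any hits by eye):

import numpy as np

dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
X = dat[:, :-1]

for j in range(X.shape[1]):
    v = np.sort(X[:, j])
    gaps = np.diff(v)
    i = np.argmax(gaps)
    score = gaps[i] / (v[-1] - v[0])   # largest gap relative to the range
    if score > 0.1:                    # arbitrary threshold, tune by eye
        print(f'dim {j}: relative gap {score:.2f} near value {v[i]:.3f}')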

If dimension 42 is special in that it cleanly separates most of your points, you could split your data along this axis and work with the two halves in a reduced, 99-dimensional space.
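A minimal sketch of that split (the threshold 0.0 is a placeholder; read the actual split point off the interactive plot):

import numpy as np

dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
X = dat[:, :-1]

thr = 0.0                                 # hypothetical split point along dimension 42
mask = X[:, 42] > thr
upper = np.delete(X[mask], 42, axis=1)    # one half, now 99-dimensional
lower = np.delete(X[~mask], 42, axis=1)   # the other half
print(upper.shape, lower.shape)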