I'm totally new to machine learning (and full disclosure: this is for school) and am trying to wrap my head around KMeans Clustering and its implementation. I understand the gist of the algorithm and have implemented it in Java but I'm a little confused as to how to use it on a complex dataset.
For example, I have 3 folders, A, B and C, each one containing 8 text files (so 24 text files altogether). I want to verify that I've implemented KMeans correctly by having the algorithm cluster these 24 documents into 3 clusters based on their word usage.
To this end, I've built a word frequency matrix and applied TF-IDF to it, which gives a sparse 24 x 2367 matrix (24 documents and 2367 words/n-grams in total). I then run my KMeans implementation on this TF-IDF matrix, but I'm not getting good results.
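For reference, here is a minimal Python sketch of one common way to build such a matrix. The weighting (raw count times log(N / df)), the toy documents, and the function name `tfidf_matrix` are illustrative assumptions, not the exact Java code in question.

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of vocabulary[j] in document i.

    Assumed variant: tf = raw count, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, rows

# Toy corpus standing in for the 24 text files.
docs = ["the apple banana apple", "the banana cherry", "the cherry cherry apple"]
vocab, matrix = tfidf_matrix(docs)
```

Each row is one document and each column one vocabulary term, so the real data would come out as 24 rows by 2367 columns. A word that occurs in every document (here "the") gets weight 0 under this variant.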
To debug this, I'm trying to visualize my TF-IDF matrix and the centroids I get as output, but I don't understand how one would visualize a 24 x 2367 matrix. I've also saved the matrix to a .csv file and want to run a Python library on it, but every example I've seen uses an n x 2 matrix. How would one go about doing this?
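My rough understanding is that the matrix would first have to be projected down to two columns before it can be plotted. Below is a sketch of that idea using PCA via NumPy's SVD. The random stand-in data, the function names, and the file name `tfidf.csv` in the comment are all hypothetical, and a 2-D projection of 2367 dimensions will necessarily lose information.

```python
import numpy as np

def fit_projection_2d(X):
    """Learn a 2-D PCA projection from the rows of X (documents x terms)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:2]

def project(points, mean, components):
    """Map high-dimensional rows to 2-D using a previously fitted projection."""
    return (points - mean) @ components.T

def plot_clusters(doc_xy, centroid_xy, labels):
    """Scatter the documents coloured by cluster, with centroids marked as crosses."""
    # Imported here so the projection code above runs without matplotlib installed.
    import matplotlib.pyplot as plt
    plt.scatter(doc_xy[:, 0], doc_xy[:, 1], c=labels)
    plt.scatter(centroid_xy[:, 0], centroid_xy[:, 1], marker="x", s=120, c="black")
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.show()

# Stand-in data. With the real file this might instead be, assuming no header row:
#   X = np.loadtxt("tfidf.csv", delimiter=",")
rng = np.random.default_rng(0)
X = rng.random((24, 50))
labels = np.repeat([0, 1, 2], 8)  # stand-in for the KMeans assignments
centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

mean, components = fit_projection_2d(X)
doc_xy = project(X, mean, components)                # shape (24, 2)
centroid_xy = project(centroids, mean, components)   # shape (3, 2)
# plot_clusters(doc_xy, centroid_xy, labels)
```

The centroids are projected with the same mean and components as the documents, so both end up in the same 2-D coordinate system and can be drawn on one plot.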
Thanks in advance,