
I am currently analysing the NTSB aviation accident database. Most of the aviation incidents in this dataset have cause statements that describe the factors that led to the event.

One of my objectives is to group the causes, and clustering seems to be a feasible way to approach this kind of problem. I performed the following steps before running k-means clustering:

  1. Stop-word removal, that is, removing common function words from the text
  2. Text stemming, that is, removing a word's suffix and, if necessary, reducing the term to its simplest form
  3. Vectorising the documents into TF-IDF vectors, to scale up less common but more informative words and scale down very common but less informative words
  4. Applying SVD to reduce the dimensionality of the vectors
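
For reference, steps 1, 3 and 4 followed by k-means can be sketched with sklearn roughly as below. This is a minimal sketch, not the exact script used here: the `docs` list is a hypothetical stand-in for the cause statements, the component and cluster counts are arbitrary, the `Normalizer` step is an assumption (it is commonly paired with SVD output before k-means), and stemming (step 2) is left out because it needs an external stemmer.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-ins for NTSB cause statements (not real records).
docs = [
    "loss of engine power due to fuel exhaustion",
    "fuel starvation and loss of engine power for undetermined reasons",
    "pilot failed to maintain directional control during landing",
    "pilot failed to maintain airspeed which resulted in a stall",
    "inadequate preflight planning by the pilot",
    "improper maintenance was a contributing factor",
]

# Steps 1 and 3: drop English stop words and build TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Step 4: reduce dimensionality with truncated SVD, then L2-normalise
# the rows (assumption: normalisation is not mentioned in the question).
lsa = make_pipeline(
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)
reduced = lsa.fit_transform(tfidf)

# Cluster the reduced vectors; n_init=10 restarts k-means several times
# and keeps the run with the lowest inertia.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(reduced)
```

Note that the log below was produced with `n_init=1`, so it reflects a single random initialisation; several restarts may give a different (lower-inertia) result.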

After these steps, k-means clustering is applied to the vectors. Using the events that occurred from Jan 1985 to Dec 1990, I get the following result with number of clusters k = 3:

(Note: I am using Python and sklearn to work on my analysis)

... some output omitted ... 
Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=3, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=True)
Initialization complete
Iteration  0, inertia 8449.657
Iteration  1, inertia 4640.331
Iteration  2, inertia 4590.204
Iteration  3, inertia 4562.378
Iteration  4, inertia 4554.392
Iteration  5, inertia 4548.837
Iteration  6, inertia 4541.422
Iteration  7, inertia 4538.966
Iteration  8, inertia 4538.545
Iteration  9, inertia 4538.392
Iteration 10, inertia 4538.328
Iteration 11, inertia 4538.310
Iteration 12, inertia 4538.290
Iteration 13, inertia 4538.280
Iteration 14, inertia 4538.275
Iteration 15, inertia 4538.271
Converged at iteration 15

Silhouette Coefficient: 0.037
Top terms per cluster:
**Cluster 0: fuel engin power loss undetermin exhaust reason failur pilot land**
**Cluster 1: pilot failur factor land condit improp accid flight contribute inadequ**
**Cluster 2: control maintain pilot failur direct aircraft airspe stall land adequ**

I also generated a plot of the data:

[Figure: plot result of the k-means clustering]

The result doesn't seem to make sense to me. I wonder why all of the clusters contain some common terms like "pilot" and "failure".

One possibility I can think of (though I am not sure it is valid in this case) is that the documents with these common terms are located at the very centre of the plot, and therefore cannot be cleanly assigned to the right cluster. I believe this problem cannot be addressed by increasing the number of clusters, as I have just tried that and the problem persists.

I just want to know whether there are any other factors that could cause the scenario I am facing. Or more broadly, am I using the right clustering algorithm?

Thanks SO.

"Or more broadly, am I using the right clustering algorithm" - Counter question: if someone asked you to write down the assumptions that k-means clustering makes about the data, would you know what to answer? – cel
Please don't double-post questions: datascience.stackexchange.com/q/11076/924 – Has QUIT--Anony-Mousse

1 Answer


I do not want to be the bearer of bad news, but ...

  1. Clustering is a very bad exploration technique, mostly because, without a clear, task-oriented aim, clustering techniques simply optimize some mathematical criterion that rarely has anything to do with what you want to achieve. K-means in particular minimizes the squared Euclidean distances from each cluster center to the points inside that cluster. Is this in any way related to the task you want to achieve? Usually the answer is "no", or at best "I have no idea".
  2. Representing documents as bags of words gives a very coarse view of your data, so it is not a good way to distinguish between similar objects. Such an approach can distinguish texts about guns from texts about hockey, but not specialized texts from the very same domain (which seems to be the case here).
  3. In the end, you cannot really evaluate a clustering, and this is the biggest issue. Consequently, there are no well-established techniques for fitting the best clustering.
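
To be fair, internal indices such as the silhouette coefficient quoted in the question do exist, but they only measure geometric compactness and separation, not whether the groups mean anything for your task. A minimal sketch on synthetic data (not the NTSB data; the blob centers and the range of k are arbitrary assumptions) shows what such a score looks like when the clusters really are well separated:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated 2-D blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for several k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette is the geometrically "best" one.
best_k = max(scores, key=scores.get)
```

Here the score peaks clearly at the true number of blobs. A value close to zero, like the 0.037 in the question, usually indicates heavily overlapping clusters, and trying more values of k will not fix that.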

So, to answer your final questions:

I just want to know if there is any other factors that could cause the scenario that I am facing?

There are thousands of such factors. Finding clusters that are actually reasonable from a human perspective is extremely hard. Finding some clusters is extremely simple, because every clustering technique will find something. But in order to find what matters here, one would have to go through a full exploration of the data.

Or more broadly, am I using the right clustering algorithm?

Probably not: k-means simply minimizes the within-cluster sum of squared Euclidean distances, so it will not work well in most real-world scenarios.
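
Written out, the objective looks as follows. The notation is introduced here for illustration: C_1, ..., C_k are the clusters, x_i the document vectors and mu_j the cluster centers. This is the same quantity that sklearn reports as "inertia" in the log in the question.

```latex
J(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Nothing in this criterion refers to topics or causes. It only rewards compact, roughly spherical groups in whatever vector space you feed it.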

Unfortunately, this is not the kind of problem where you can just ask "which algorithm should I use?" and someone will offer you an exact solution.

You have to dig into your data and figure out:

  • way of representation - is TF-IDF really good? Have you preprocessed the vocabulary? Removed meaningless words? Maybe it is worth considering a modern learned word/document representation?
  • structure in your data - in order to find the best model you should visualize your data, investigate it, run statistical analyses, and try to figure out what the underlying metric is. Is there any reasonable distribution of points? Are they Gaussian? Gaussian mixtures? Is your data sparse?
  • can you provide some expert knowledge? Maybe you can divide part of the dataset yourself? Semi-supervised techniques are much better defined than any unsupervised ones, so you might easily get much better results.