I am currently analysing the NTSB aviation accident database. Most of the aviation incidents in this dataset have a cause statement that describes the factors that led to the event.
One of my objectives is to group the causes, and clustering seems to be a feasible way to approach this kind of problem. I performed the following steps before running k-means clustering:
- Stop-word removal, that is, to remove some common functional words in the text
- Text stemming, that is, to remove a word's suffix, and if necessary, transform the term into its simplest form
- TF-IDF vectorisation, that is, to weight up the less common but more informative words and weight down the very common but less informative words
- SVD, that is, to reduce the dimensionality of the TF-IDF vectors
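For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the `docs` list is a hypothetical stand-in for the real cause statements, stemming is left out, and the `Normalizer` step after SVD, the component count and `n_init=10` are assumptions made only for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-ins for the NTSB cause statements.
docs = [
    "loss of engine power due to fuel exhaustion",
    "fuel starvation and loss of engine power for undetermined reasons",
    "pilot failed to maintain directional control during landing",
    "pilot failed to maintain airspeed which resulted in a stall",
    "improper preflight planning by the pilot",
    "inadequate weather evaluation by the pilot was a factor",
]

# Stop-word removal and TF-IDF weighting in one step (stemming omitted here).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Reduce dimensionality with truncated SVD, then rescale each row to unit
# length (the rescaling is an assumption, not a step listed above).
lsa = make_pipeline(
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)
reduced = lsa.fit_transform(tfidf)

# Cluster the reduced vectors with k = 3.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(reduced)
print(labels)
```

The cluster numbering is arbitrary, and with only six toy documents the assignment says nothing about the real data.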
After these steps, k-means clustering is applied to the reduced vectors. Using the events that occurred from Jan 1985 to Dec 1990, I get the following result with number of clusters k = 3:
(Note: I am using Python and sklearn to work on my analysis)
... some output omitted ...
Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=3, n_init=1,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=True)
Initialization complete
Iteration 0, inertia 8449.657
Iteration 1, inertia 4640.331
Iteration 2, inertia 4590.204
Iteration 3, inertia 4562.378
Iteration 4, inertia 4554.392
Iteration 5, inertia 4548.837
Iteration 6, inertia 4541.422
Iteration 7, inertia 4538.966
Iteration 8, inertia 4538.545
Iteration 9, inertia 4538.392
Iteration 10, inertia 4538.328
Iteration 11, inertia 4538.310
Iteration 12, inertia 4538.290
Iteration 13, inertia 4538.280
Iteration 14, inertia 4538.275
Iteration 15, inertia 4538.271
Converged at iteration 15
Silhouette Coefficient: 0.037
Top terms per cluster:
**Cluster 0: fuel engin power loss undetermin exhaust reason failur pilot land**
**Cluster 1: pilot failur factor land condit improp accid flight contribute inadequ**
**Cluster 2: control maintain pilot failur direct aircraft airspe stall land adequ**
I also generated a plot of the data, as follows:
The result doesn't make sense to me. I wonder why all of the clusters contain some common terms such as "pilot" and "failure".
One possibility I can think of (though I am not sure it is valid in this case) is that the documents containing these common terms are located at the very centre of the plot, so they cannot be cleanly assigned to any one cluster. I believe this problem cannot be addressed by increasing the number of clusters, as I have just tried that and the problem persists.
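One way to check how widespread the shared terms are is to measure the fraction of documents each term appears in. The sketch below does this on a hypothetical `docs` list (not the real data) and then shows the `max_df` option of `TfidfVectorizer`, which drops terms whose document frequency exceeds a threshold. The 0.5 threshold is an arbitrary value chosen for the illustration, and whether filtering this way would help on the real dataset is not something this sketch establishes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-ins for the NTSB cause statements.
docs = [
    "loss of engine power due to fuel exhaustion",
    "fuel starvation and loss of engine power for undetermined reasons",
    "pilot failed to maintain directional control during landing",
    "pilot failed to maintain airspeed which resulted in a stall",
    "improper preflight planning by the pilot",
    "inadequate weather evaluation by the pilot was a factor",
]

# binary=True records presence/absence, so column sums are document counts.
counter = CountVectorizer(stop_words="english", binary=True)
presence = counter.fit_transform(docs)
fractions = np.asarray(presence.sum(axis=0)).ravel() / len(docs)

# Map each term to the fraction of documents that contain it.
doc_freq = {term: fractions[idx] for term, idx in counter.vocabulary_.items()}

# List the most widespread terms first.
for term, frac in sorted(doc_freq.items(), key=lambda item: -item[1])[:5]:
    print(f"{term}: {frac:.2f}")

# Drop terms that occur in more than half of the documents (arbitrary cut-off).
filtered = TfidfVectorizer(stop_words="english", max_df=0.5).fit(docs)
print("pilot" in filtered.vocabulary_)
```

If a term such as "pilot" occurs in a large share of the documents, it will tend to carry weight in every centroid regardless of how the documents are partitioned.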
I just want to know whether there are any other factors that could cause the situation I am facing. Or more broadly, am I using the right clustering algorithm?
Thanks SO.
- "Or more broadly, am I using the right clustering algorithm?" Counter question: if someone asked you to write down the assumptions that k-means clustering makes about the data, would you know what to answer? – cel