
I am doing single-document clustering with k-means. I am now preparing the data to be clustered, representing N sentences by their vector representations.

However, if I understand correctly, the k-means algorithm creates k clusters based on the Euclidean distance to k center points, regardless of the order of the sentences.

My problem is that I want to keep the order of the sentences and take it into account in the clustering task.

Let S = {S_1, ..., S_n} be a set of n vectors representing sentences, where S_1 = sentence 1, S_2 = sentence 2, etc.

I want the clusters to be contiguous spans: K_1 = S[1..i], K_2 = S[i+1..j], etc.

I thought about transforming the vectors into 1D and adding each sentence's index to the transformed value, but I'm not sure that would help. Maybe there's a smarter way.

It sounds like you want to do some kind of document segmentation, not clustering. Maybe this strand of research would be relevant: aclweb.org/anthology/W08-1803 – aab
It's not k-means, or clustering, anymore. You want to split your document into k segments; that's where the similarity ends, IMHO. (P.S. there is a straightforward solution if you consider this a plain old optimization problem) – Has QUIT--Anony-Mousse
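The comment's hint about treating this as a plain optimization problem can be made concrete: splitting n ordered sentence vectors into k contiguous segments that minimize the total within-segment squared distance to each segment's mean is solvable exactly with dynamic programming. A minimal sketch (the function names and the O(n²·d) cost precomputation are illustrative choices, not from the thread):

```python
import numpy as np

def segment_cost(X):
    """Precompute cost[i][j] = sum of squared distances of X[i:j] to its mean."""
    n = len(X)
    cost = np.zeros((n + 1, n + 1))
    for i in range(n):
        for j in range(i + 1, n + 1):
            seg = X[i:j]
            mu = seg.mean(axis=0)
            cost[i][j] = ((seg - mu) ** 2).sum()
    return cost

def contiguous_kmeans(X, k):
    """Return k contiguous segments (as (start, end) index pairs)
    minimizing total within-segment SSE, via dynamic programming."""
    n = len(X)
    cost = segment_cost(X)
    # dp[c][j] = best cost of splitting the first j points into c segments
    dp = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):
                cand = dp[c - 1][i] + cost[i][j]
                if cand < dp[c][j]:
                    dp[c][j] = cand
                    back[c][j] = i
    # Recover the segment boundaries by walking the back pointers.
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]

# Two well-separated groups of 1-D "sentence vectors":
X = np.array([[0.0], [0.1], [5.0], [5.1]])
print(contiguous_kmeans(X, 2))  # → [(0, 2), (2, 4)]
```

Unlike k-means, this is exact (no random initialization) and respects sentence order by construction, at the price of cubic time in n.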

1 Answer


A quick and dirty way to do this would be to prefix each lexical item with the number of the sentence it occurs in. First split into sentences; then, for this document:

This document's really great. It's got all kinds of words in it. All the words are here.

You would get something like:

{"0_this": 1, "0_document": 1, "0_be": 1, "0_really": 1,...}

Whatever k-means implementation you're using should readily accept this.
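The prefixing trick above can be sketched in a few lines. This is a simplified illustration (the naive period-based sentence splitting and whitespace tokenization are placeholders; a real pipeline would use a proper sentence tokenizer and lemmatizer, which is why the answer's example shows lemmas like `0_be`):

```python
from collections import Counter

def sentence_prefixed_bow(text):
    """Bag-of-words features whose keys carry the index of the
    sentence each token came from, e.g. '0_this', '1_all'."""
    # Naive sentence segmentation: split on periods, drop empties.
    sentences = [s for s in text.split(".") if s.strip()]
    features = Counter()
    for idx, sent in enumerate(sentences):
        for token in sent.lower().split():
            features[f"{idx}_{token}"] += 1
    return dict(features)

feats = sentence_prefixed_bow(
    "This document's really great. All the words are here."
)
print(feats)  # {'0_this': 1, "0_document's": 1, ..., '1_all': 1, ...}
```

Because the sentence index is baked into the feature name, identical words in different sentences become distinct dimensions, which is exactly what makes the representation position-aware and what causes the sparsity warned about below.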

I'd warn against doing this in general, though. You're introducing a lot of data sparsity, and your results will suffer from the curse of dimensionality. You should only do it if the genre you're looking at is (1) very predictable in lexical choice and (2) very predictable in structure. I can't think of a good linguistic reason that sentences should align precisely across texts.