2
votes

Let's assume that we have a 10x5 dataset containing 5 chemical measurements (e.g., var1, var2, var3, var4, var5) on 10 wine samples (rows). We'd like to cluster the wine samples based on the chemical measurements using k-means clustering, which is easy enough to do once. However, I'd like to perform consecutive clusterings, starting with each single chemical measurement and then continuing with all combinations of var1, var2, var3, var4 and var5 (all unary, binary, ternary, quaternary and quinary combinations).

To put it differently, I'm interested in clustering the wine samples on every possible combination of the measurement columns, which gives a total of 31 clustering results, e.g., based on (1) var1, (2) var2, (3) var3, (4) var4, (5) var5, (6) var1 and var2, (7) var1 and var3, ..., (31) var1, var2, var3, var4 and var5.

How can I create such a loop in R?

2
Let's not assume! Instead, show us a reproducible example. - Thomas

2 Answers

1
votes

Let's say you had a dataset:

set.seed(144)
dat <- matrix(rnorm(100), ncol=5)

Now you can get all subsets of columns, each encoded as a logical vector saying whether to keep each column. Drop the first row, which is all FALSE and would select no columns.

(cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1, ])
#     Var1  Var2  Var3  Var4  Var5
# 2   TRUE FALSE FALSE FALSE FALSE
# 3  FALSE  TRUE FALSE FALSE FALSE
# 4   TRUE  TRUE FALSE FALSE FALSE
# ...
# 31 FALSE  TRUE  TRUE  TRUE  TRUE
# 32  TRUE  TRUE  TRUE  TRUE  TRUE
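
As a quick sanity check, the number of non-empty subsets of 5 columns is 2^5 - 1 = 31, which matches the count in the question. Here is a minimal sketch that verifies this and builds a readable label for each subset. The name `subset_labels` is my own and purely illustrative.

```r
set.seed(144)
dat <- matrix(rnorm(100), ncol = 5)

# All TRUE/FALSE combinations over the columns, minus the all-FALSE row
cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1, ]

# Should be TRUE: 2^5 - 1 = 31 non-empty subsets
nrow(cols) == 2^ncol(dat) - 1

# Hypothetical helper output: label each subset by the column indices it keeps
subset_labels <- apply(cols, 1, function(x) paste(which(x), collapse = "+"))
head(subset_labels, 3)
```

Labels like these can be handy later for naming the fitted models.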

The last step is to run k-means clustering for each subset of columns, which is a simple application of apply (I'll assume you want 3 clusters in each of your models):

mods <- apply(cols, 1, function(x) kmeans(dat[, x, drop = FALSE], 3))

You can access each of your 31 k-means models using list indexing. For instance:

mods[[1]]
# K-means clustering with 3 clusters of sizes 7, 5, 8
# 
# Cluster means:
#         [,1]
# 1 -1.4039782
# 2 -0.4215221
# 3  0.3227336
# 
# Clustering vector:
#  [1] 1 3 2 1 1 3 3 1 3 3 2 3 2 1 3 3 2 1 1 2
# 
# Within cluster sum of squares by cluster:
# [1] 0.4061644 0.1438443 0.7054191
#  (between_SS / total_SS =  89.9 %)
# 
# Available components:
# 
# [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
# [7] "size"         "iter"         "ifault"   
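
If you want to compare the models side by side, one option is to pull the same component out of every fit. The sketch below rebuilds the models so it runs on its own; `nstart = 10` is my own addition (several random starts to make the fits less sensitive to initialization), and the names `assignments` and `wss` are illustrative.

```r
set.seed(144)
dat <- matrix(rnorm(100), ncol = 5)
cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1, ]

# nstart = 10 is an assumption, not part of the answer above
mods <- apply(cols, 1, function(x) {
  kmeans(dat[, x, drop = FALSE], 3, nstart = 10)
})

# One column of cluster labels per model: samples in rows, models in columns
assignments <- sapply(mods, function(m) m$cluster)
dim(assignments)

# Total within-cluster sum of squares for each of the 31 models
wss <- sapply(mods, function(m) m$tot.withinss)
```

Keep in mind that cluster numbers are arbitrary within each model, so label 1 in one column of `assignments` need not correspond to label 1 in another.
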
1
votes
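# create a dummy matrix (10 samples x 5 variables)
dummy <- matrix(rnorm(10 * 5), 10, 5)

# create all the combinations of variables: one matrix per subset size,
# with one combination of column indices per row
combos <- lapply(seq_len(ncol(dummy)), function(k) t(combn(ncol(dummy), k)))

# loop over the combination sets and fit a k-means with 2 clusters to each;
# drop = FALSE keeps single-column selections as matrices
kms <- lapply(combos, function(x) {
  lapply(seq_len(nrow(x)), function(y) {
    kmeans(dummy[, x[y, ], drop = FALSE], 2)
  })
})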

> sapply(kms, length)
[1]  5 10 10  5  1

# access the results like so:
> kms[[1]][[1]]
K-means clustering with 2 clusters of sizes 3, 7
...
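
If the two-level indexing is awkward, a flat list named by variable combination may be easier to work with. This is only a sketch of one alternative: the seed, the `"1+3"`-style names and the variable `idx` are my own choices, not part of the code above.

```r
set.seed(1)  # assumed seed so the sketch is repeatable
dummy <- matrix(rnorm(10 * 5), 10, 5)

# All column-index combinations as one flat list of integer vectors
combos <- unlist(
  lapply(seq_len(ncol(dummy)), function(k) combn(ncol(dummy), k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one 2-cluster k-means per combination
kms <- lapply(combos, function(idx) kmeans(dummy[, idx, drop = FALSE], 2))

# Name each model after the columns it used, e.g. "1+3"
names(kms) <- sapply(combos, paste, collapse = "+")

length(kms)
kms[["1+3"]]$cluster
```

With this layout a model can be looked up directly by the variables it was fitted on.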