2 votes

In the mlr package, I can perform clustering. Let's say I don't want to know how the model performs on unseen data; I just want to know what the best number of clusters is for a given performance measure.

In this example, I use the moons data set of the dbscan package.

library(mlr)
library(dbscan)
data("moons")

db_task = makeClusterTask(data = moons)

db = makeLearner("cluster.dbscan")

ps = makeParamSet(makeDiscreteParam("eps", values = seq(0.1, 1, by = 0.1)),
  makeIntegerParam("MinPts", lower = 1, upper = 5))

ctrl = makeTuneControlGrid()

rdesc = makeResampleDesc("CV", iters = 3) # I don't want to use it, but I have to

res = tuneParams(db, 
  task = db_task, 
  control = ctrl,
  measures = silhouette, 
  resampling = rdesc, 
  par.set = ps)
#> [Tune] Started tuning learner cluster.dbscan for parameter set:
#>            Type len Def                                Constr Req Tunable
#> eps    discrete   -   - 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1   -    TRUE
#> MinPts  integer   -   -                                1 to 5   -    TRUE
#>        Trafo
#> eps        -
#> MinPts     -
#> With control class: TuneControlGrid
#> Imputation value: Inf
#> [Tune-x] 1: eps=0.1; MinPts=1
#> Error in matrix(nrow = k, ncol = ncol(x)): invalid 'nrow' value (too large or NA)

Created on 2019-06-06 by the reprex package (v0.3.0)

However, mlr forces me to use a resampling strategy. Any idea how to use mlr for cluster tasks without resampling?
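One thing I wondered about (an untested guess on my part, not something I found in the docs): could a fixed holdout instance whose train and test indices are both the full data set emulate "no resampling", so that every tuning step fits and scores on all rows?

# untested guess: a fixed "holdout" whose train and test sets are both the full data set
n = nrow(moons)
rin = makeFixedHoldoutInstance(train.inds = seq_len(n),
                               test.inds = seq_len(n),
                               size = n)

res = tuneParams(db, task = db_task, control = ctrl,
                 measures = silhouette, resampling = rin, par.set = ps)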

Your code does not run for me (see the inserted reprex above). Why don't you take a look at the number of clusters calculated by the best performing model during tuning? – pat-s
I don't understand why it doesn't work (I leave it in there until I understand the reason). I had some data sets where the results from the CV and the silhouette plot were different. – Banjo

1 Answer

1 vote

mlr is pretty poor when it comes to clustering. Its dbscan function is a wrapper around the very slow fpc package; others wrap Weka, which is also very slow.

Use the dbscan package instead.
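For example, a plain grid search over the same parameter values as in the question needs no resampling machinery at all. This is only a sketch of mine; how you summarise or score the candidate clusterings is up to you:

library(dbscan)
data("moons")

# candidate parameter values (same grid as in the question)
grid = expand.grid(eps = seq(0.1, 1, by = 0.1), minPts = 1:5)

# fit DBSCAN once per parameter combination on the full data set
summaries = lapply(seq_len(nrow(grid)), function(i) {
  fit = dbscan(moons, eps = grid$eps[i], minPts = grid$minPts[i])
  data.frame(eps = grid$eps[i],
             minPts = grid$minPts[i],
             n.clusters = length(setdiff(unique(fit$cluster), 0)),  # 0 = noise
             n.noise = sum(fit$cluster == 0))
})
do.call(rbind, summaries)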

However, parameter tuning doesn't simply work in unsupervised settings: you don't have labels, so all you have are unreliable "internal" heuristics. Most of these are not reliable for DBSCAN in particular, because they assume noise is a cluster, which it isn't. Few tools support noise in evaluation (I've seen options for this in ELKI), and I'm not convinced that either way of handling noise is good; IMHO you can construct undesirable cases for each variant. You probably need at least two measures when evaluating a clustering that contains noise.
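To illustrate the noise problem with a small sketch of my own (eps and minPts below are just illustrative values): the same DBSCAN labelling gets a different silhouette score depending on whether you drop the noise points before scoring or treat them as one extra "cluster", and neither convention is obviously right.

library(dbscan)
library(cluster)
data("moons")

# average silhouette width; NA if there are fewer than two clusters to compare
avg_sil = function(labels, data) {
  if (length(unique(labels)) < 2) return(NA_real_)
  mean(silhouette(labels, dist(data))[, "sil_width"])
}

cl = dbscan(moons, eps = 0.3, minPts = 5)$cluster  # 0 marks noise points
keep = cl != 0

avg_sil(cl[keep], moons[keep, ])  # variant 1: noise dropped before scoring
avg_sil(cl, moons)                # variant 2: noise kept as a "cluster" of its own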