Just switched to mlr for my machine learning workflow. I'm wondering if it's possible to tune hyperparameters using a separate validation set. From my limited understanding, makeResampleDesc and makeResampleInstance only resample from within the training data.
My goal is to tune parameters on a validation set and evaluate the final model on the test set, to prevent overfitting and data leakage.
Here is what I did code-wise:
## Create training, validation and test tasks
train_task <- makeClassifTask(data = train_data, target = "y", positive = 1)
validation_task <- makeClassifTask(data = validation_data, target = "y")
test_task <- makeClassifTask(data = test_data, target = "y")
## Attempt to tune parameters with separate validation data
tuned_params <- tuneParams(
task = train_task,
resampling = makeResampleInstance("Holdout", task = validation_task),
...
)
From the error message, it looks like the evaluation is still trying to resample from the training set (the resampling instance size doesn't match the training data size):
00001: Error in resample.fun(learner2, task, resampling, measures = measures, : Size of data set: 19454 and resampling instance: 1666333 differ!
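One workaround I've been considering, in case it helps frame the question: combine the training and validation rows into a single task and pin down which rows play each role with mlr's makeFixedHoldoutInstance(), so tuneParams never splits randomly. This is just a sketch, not tested against my data; my_learner, my_par_set, and my_control are placeholders for whatever learner, parameter set, and tuning control are in use.

```r
library(mlr)

## Stack train and validation rows into one data frame / task.
## Row order matters: train rows first, then validation rows.
combined_data <- rbind(train_data, validation_data)
combined_task <- makeClassifTask(data = combined_data, target = "y", positive = "1")

n_train <- nrow(train_data)
n_total <- nrow(combined_data)

## Fixed holdout: fit on the first n_train rows, evaluate on the rest.
fixed_holdout <- makeFixedHoldoutInstance(
  train.inds = seq_len(n_train),           # indices of the train_data rows
  test.inds  = seq(n_train + 1, n_total),  # indices of the validation_data rows
  size       = n_total
)

tuned_params <- tuneParams(
  learner    = my_learner,    # placeholder learner
  task       = combined_task,
  resampling = fixed_holdout, # every tuning iteration uses the same split
  par.set    = my_par_set,    # placeholder parameter set
  control    = my_control     # placeholder tuning control
)
```

If this is the intended approach, I'd expect the final model to then be trained on train + validation (or train only) and assessed once on test_task.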
Does anyone know what I should do? Am I setting this up the right way?