5
votes

This is a newbie question. How do I define a (classification) task that uses data from a (SQLite) database? The mlr3db example seems to write data from memory to the database first. In my case, the data is already in the database. A possibly bigger problem is that the target data and the features are in different tables.

What I tried:

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "my_data.db")
my_features <- dplyr::tbl(con, "my_features")
my_target <- dplyr::tbl(con, "my_targets")
task <- mlr3::TaskClassif$new("my_task", backend=my_features, target="???")

and then I don't know how to specify the target argument.

Maybe a solution would be to create a VIEW in the database that joins features and targets?

In case no solution is provided here, feel free to open an issue at mlr-org/mlr3db. - pat-s

1 Answer

3
votes

Having the data split into (1) multiple tables or (2) multiple databases is possible. In your case, it looks like the data is just split into multiple tables, so you can use the same DBI connection to access them.

All you need is a key column to join the two tables. In the following example I'm using a simple integer key column and an inner_join() to merge the two tables into a single new (lazy) table, but the details depend on your database schema.

library(mlr3)
library(mlr3db)

# base data set
data = iris
data$row_id = 1:nrow(data)

# create a database with two tables: split the data into features and target,
# and keep the key column `row_id` in both tables
path = tempfile()
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
DBI::dbWriteTable(con, "features", subset(data, select = - Species))
DBI::dbWriteTable(con, "target", subset(data, select = c(row_id, Species)))
DBI::dbDisconnect(con)

# re-open the connection to the database
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)

# access tables with dplyr
tbl_features = dplyr::tbl(con, "features")
tbl_target = dplyr::tbl(con, "target")

# join tables with an inner_join
tbl_joined = dplyr::inner_join(tbl_features, tbl_target, by = "row_id")

# convert to a backend and create the task
backend = as_data_backend(tbl_joined, primary_key = "row_id")
mlr3::TaskClassif$new("my_task", backend, target = "Species")
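
The VIEW idea from the question should work too, as far as I can tell: let the database do the join and point dplyr at the view instead of a table. The following is only a minimal sketch under the same assumptions as above (a shared integer key `row_id`, iris as toy data). The view name `joined` and the task id `my_task_view` are made up for illustration, and the SQL has to be adapted to your own schema.

```r
library(mlr3)
library(mlr3db)

# toy database with the same layout as above: two tables sharing `row_id`
data = iris
data$row_id = 1:nrow(data)
path = tempfile()
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
DBI::dbWriteTable(con, "features", subset(data, select = -Species))
DBI::dbWriteTable(con, "target", subset(data, select = c(row_id, Species)))

# let the database do the join; `joined` is a hypothetical view name
invisible(DBI::dbExecute(con, "
  CREATE VIEW joined AS
  SELECT f.*, t.Species
  FROM features AS f
  INNER JOIN target AS t ON f.row_id = t.row_id
"))

# a view can be accessed with dplyr just like a table
tbl_view = dplyr::tbl(con, "joined")

# convert to a backend and create the task, exactly as before
backend = as_data_backend(tbl_view, primary_key = "row_id")
task = TaskClassif$new("my_task_view", backend, target = "Species")

# the backend fetches data lazily, so the connection must stay open
# for as long as the task is used (e.g. while training)
learner = lrn("classif.rpart")
learner$train(task)

DBI::dbDisconnect(con)
```

Whether you join in dplyr or in a view is mostly a matter of taste. The view keeps the join logic inside the database, the dplyr version keeps it in your R script. In both cases, do not disconnect before you are done with the task.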