Get data into a format for xgboost in R?

Question

Does someone have a very well-explained example of getting data into a format usable by xgboost in R?

The get started doc doesn't help me. The data (agaricus.train and agaricus.test) are already in a specialized format (dgCMatrix):

> str(agaricus.train)
List of 2
 $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  .. ..@ Dim     : int [1:2] 6513 126
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()
 $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...

I saw this example code use sparse.model.matrix, but I'm still having a hard time putting together fairly plain data into the format xgboost needs.

For example, suppose I have two data frames: words and data_label.

The words data frame has sentence_id and word_id, with one or more words per sentence.

The data_label data frame has a sentence_id and label (say, 0 or 1 for a binary classification task).

How do I get that data into a format to predict the label for a sentence?

I can split train and test.

Edit: The simplest version of words and data_label:

words <- data.frame(sentence_id=c(1, 1, 2, 2, 2),
                    word_id=c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id=c(1, 2), label=c(0, 1))

I am guessing that you know how to convert your data frame into the matrix format that is essential for xgboost and that all your data is in numeric format. Can you post a small sample of your data and the code you are using otherwise it is not clear what the problem may be. — cousin_pete
@cousin_pete Let's not assume I know how to convert to matrix format. There are several different conversion functions, and I don't know which xgboost wants. — dfrankow

Vadim Khotilovich · Accepted Answer · 2017-04-15T00:27:16

Input to xgb.DMatrix can be a dense matrix, a sparse dgCMatrix, or sparse data stored in a file in LibSVM format. Since you are dealing with text data, a sparse representation is the most appropriate. Below is an example of how to convert your example data to a dgCMatrix. It assumes the ideal case: the sentence_id values are contiguous integers starting from 1 and are the same in both tables. If that does not hold in practice, you will need to do some more data munging first.

library(Matrix)   # sparseMatrix
library(xgboost)  # xgb.DMatrix

words <- data.frame(sentence_id=c(1, 1, 2, 2, 2),
                    word_id=c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id=c(1, 2), label=c(0, 1))

# quick check of assumptions about sentence_id
stopifnot(min(words$sentence_id) == 1 &&
          max(words$sentence_id) == length(unique(words$sentence_id)))

# sparse matrix construction from "triplet" data
# (rows are sentences, columns are words, and the value is always 1)
smat <- sparseMatrix(i = words$sentence_id, j = words$word_id, x = 1)

# make sure the rows of data_label are sorted by sentence_id
# (note the trailing comma: we are reordering rows, not selecting columns)
data_label <- data_label[order(data_label$sentence_id), ]
stopifnot(isTRUE(all.equal(data_label$sentence_id, 1:nrow(smat))))

xmat <- xgb.DMatrix(smat, label = data_label$label)
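If the ids are not contiguous integers starting from 1, one possible form of the extra munging mentioned above is to map each id to its position in a sorted list of unique ids with match(). This is only a sketch under assumptions: the id values below are made up for illustration, and the names sentence_levels, word_levels, row_idx, col_idx and labels are hypothetical helpers, not part of xgboost or Matrix. It also assumes every sentence has exactly one label.

```r
library(Matrix)

# Hypothetical data with arbitrary, non-contiguous ids (illustration only)
words <- data.frame(sentence_id = c(10, 10, 42, 42, 42),
                    word_id     = c(7, 9, 7, 15, 20))
data_label <- data.frame(sentence_id = c(42, 10), label = c(1, 0))

# Map the original ids to dense 1-based row and column indices
sentence_levels <- sort(unique(words$sentence_id))
word_levels     <- sort(unique(words$word_id))
row_idx <- match(words$sentence_id, sentence_levels)
col_idx <- match(words$word_id, word_levels)

# Rows are sentences, columns are words; keep the original ids as dimnames
smat <- sparseMatrix(i = row_idx, j = col_idx, x = 1,
                     dims = c(length(sentence_levels), length(word_levels)),
                     dimnames = list(as.character(sentence_levels),
                                     paste0("word_", word_levels)))

# Put the labels in the same order as the matrix rows
labels <- data_label$label[match(sentence_levels, data_label$sentence_id)]
stopifnot(!anyNA(labels))  # fails if some sentence has no label

# Build the DMatrix only if xgboost is installed
if (requireNamespace("xgboost", quietly = TRUE)) {
  xmat <- xgboost::xgb.DMatrix(smat, label = labels)
}
```

Keeping sentence_levels and word_levels around lets you encode test data with the same column mapping; words that never appeared in training would need to be dropped or handled separately.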
