Split sample of R tm corpus objects

Question

I am using the R tm package, trying to split my corpus into a training set and a testing set, and to encode this into metadata for selection. What's the easiest way to do this (suppose I'm trying to split my sample in half)?

Here are some things I've tried:

I know that when I type...

> meta(d)
    MetaID Y
1        0 1
2        0 1

I can see the IDs, but cannot seem to access them (in order to say that the first half belongs in one set and the second half in another). rownames(attributes(d)$DMetaData) gives me the indexes, but this looks ugly, and they're factors.

After converting to a data frame (say d is my dataset), I can just write:

half <- floor(nrow(d) / 2)
train <- d[1:half, ]               # first half of the rows
test <- d[(half + 1):nrow(d), ]    # remaining rows, including the last one when nrow(d) is odd

But how can I easily do something like...

meta(d, tag="split") = ifelse((meta(d,"ID")<=floor(length(d)/2)),"train","test")

...to get a result like:

> meta(d)
    MetaID Y split
1        0 1 train
2        0 1 train
...      . . ...
100      0 1 test

Unfortunately, meta(d,"ID") doesn't work. meta(d[[1]],"ID") == 1 does work, but calling it once per document is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a generally smarter way of subsetting and assigning to the "split" meta variable.

agstudy · Accepted Answer · 2013-02-12T01:59:12

A corpus is just a list, so you can split it like a normal list. Here is an example:

First I create some data, using the sample texts that ship with the tm package:

library(tm)
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt)))
A corpus with 5 text documents

Now I split my data into Train and Test sets:

nn <- length(ovid)
ff <- as.factor(c(rep('Train',ceiling(nn/2)),   ## build the split factor however you like
                rep('Test',nn-ceiling(nn/2))))  ## e.g. you could also add a validation set
ll <- split(as.matrix(ovid),ff)
ll
$Test
A corpus with 2 text documents

$Train
A corpus with 3 text documents
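
The factor above puts the first documents in Train and the rest in Test. If a random split is preferred, the labels can be shuffled before splitting. Here is a minimal sketch in base R, using a plain list (`docs`, a hypothetical stand-in for the corpus) so that it runs without tm installed:

```r
# Hypothetical stand-in for a corpus: a plain list of five documents.
docs <- list("doc a", "doc b", "doc c", "doc d", "doc e")

set.seed(42)  # make the shuffle reproducible
nn <- length(docs)
n_train <- ceiling(nn / 2)

# Build the labels in order, then shuffle them so membership is random.
labels <- c(rep("Train", n_train), rep("Test", nn - n_train))
ff <- factor(sample(labels))

# split() pairs each list element with its label and groups by label.
parts <- split(docs, ff)
lengths(parts)  # 2 documents in Test, 3 in Train
```

Only the order of the labels changes, so the group sizes stay the same as in the ordered split.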

Then I assign the new tag to each part:

ll <- sapply( names(ll),
              function(x) {
                meta(ll[[x]],tag = 'split') <- ff[ff==x]
                ll[x]
              })

You can check the result:

lapply(ll,meta)
$Test.Test
  MetaID split
4      0  Test
5      0  Test

$Train.Train
  MetaID split
1      0 Train
2      0 Train
3      0 Train
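
If the corpus does not need to be physically split, the labels asked for in the question can also be built in one vectorized step from the document positions. This is only a sketch: `n` is a hypothetical corpus size, and the final assignment is left as a comment because it assumes that `meta<-` accepts a whole vector for a corpus `d`, in the same form used for each part above.

```r
# Hypothetical corpus size; with a real corpus this would be length(d).
n <- 100

# Label the first half "train" and the remainder "test", by position.
split_labels <- ifelse(seq_len(n) <= floor(n / 2), "train", "test")

table(split_labels)

# Assumption: with a tm corpus `d` of length n, the vector could then be
# attached as a metadata tag, mirroring the per-part assignment above:
# meta(d, tag = "split") <- split_labels
```

Using seq_len(n) avoids needing the meta ID at all, since the position of each document already serves as its index.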
