1
votes

I am using the R tm package, trying to split my corpus into a training set and a testing set, and to encode this into metadata for selection. What's the easiest way to do this (suppose I'm trying to split my sample in half)?

Here are some things I've tried:

  1. I know that when I type...
> meta(d)
    MetaID Y
1        0 1
2        0 1

I see IDs, but cannot seem to access them (in order to say the first half belong in one set, and the second in another set). rownames(attributes(d)$DMetaData) gives me the indexes, but this looks ugly, and they're factors.

  2. Now, after converting to a dataframe, say d is my dataset, I just say:
half <- floor(nrow(d)/2)
train <- d[1:half,]
test <- d[(half+1):nrow(d),]

But how can I easily do something like...

meta(d, tag="split") = ifelse((meta(d,"ID")<=floor(length(d)/2)),"train","test")

...to get a result like:

> meta(d)
    MetaID Y split
1        0 1 train
2        0 1 train
...      . . ...
100      0 1 test

Unfortunately, meta(d,"ID") doesn't work. meta(d[[1]],"ID") == 1 does work, but it is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a generally smarter way of subsetting and assigning to the "split" meta variable.

2 Answers

4
votes

A corpus is just a list, so you can split it like a normal list. Here is an example:

First I create some data, using the sample texts that ship with the tm package:

library(tm)
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt)))
A corpus with 5 text documents

Now I split the data into Train and Test sets:

nn <- length(ovid)
ff <- as.factor(c(rep('Train',ceiling(nn/2)),   ## you create the split factor as you want
                rep('Test',nn-ceiling(nn/2))))  ## you can add validation set for example...
ll <- split(as.matrix(ovid),ff)
ll
$Test
A corpus with 2 text documents

$Train
A corpus with 3 text documents

Then I assign the new tag to each part:

ll <- sapply( names(ll),
              function(x) {
                meta(ll[[x]],tag = 'split') <- ff[ff==x]
                ll[x]
              })

You can check the result:

lapply(ll,meta)
$Test.Test
  MetaID split
4      0  Test
5      0  Test

$Train.Train
  MetaID split
1      0 Train
2      0 Train
3      0 Train
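
The split step itself can be tried without tm at all. The following is a minimal sketch in base R, assuming a plain list stands in for the corpus (the `docs` contents and the names `n_train`, `split_labels` and `parts` are made up for illustration). It builds the same kind of Train/Test factor by position and groups the elements with `split()`:

```r
# Stand-in for a corpus: a plain list of five documents (hypothetical data).
docs <- list("doc one", "doc two", "doc three", "doc four", "doc five")

n <- length(docs)
n_train <- ceiling(n / 2)

# Label the first half 'Train' and the rest 'Test', purely by position.
split_labels <- factor(ifelse(seq_len(n) <= n_train, "Train", "Test"))

# split() groups the list elements by the factor levels.
parts <- split(docs, split_labels)

# Number of documents in each part.
lengths(parts)
```

Because factor levels are sorted alphabetically, `Test` comes before `Train` in the result, which matches the order of the output shown above.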

2
votes

## use test corpus crude in tm
library(tm)
data(crude)

#random training sample
half<-floor(length(crude)/2)
train<-sample(1:length(crude), half)

# meta doesn't handle lists or vectors very well, so loop:
for (i in seq_along(crude)) meta(crude[[i]], tag="Tset") <- "test"
for (i in train) meta(crude[[i]], tag="Tset") <- "train"

# check result
for (i in 1:10) print(meta(crude[[i]], tag="Tset"))

This seems to work.
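
A possible loop-free variant is sketched below. It is not tested against the tm release used in this thread: it assumes tm 0.6 or later, where `meta<-` on a whole corpus takes one value per document and stores it as document-level metadata. The helper `make_split_labels` and its `seed` argument are hypothetical names introduced here, not part of tm.

```r
# Hypothetical helper: returns one "train"/"test" label per document,
# with a random half (rounded down) marked as "train".
make_split_labels <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  labels <- rep("test", n)
  labels[sample(seq_len(n), floor(n / 2))] <- "train"
  labels
}

# Assumption: tm >= 0.6, where the corpus-level `meta<-` accepts a vector
# with one entry per document. Skipped if tm is not installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  data(crude)

  meta(crude, tag = "Tset") <- make_split_labels(length(crude))

  # Select the training documents with a logical index.
  train_docs <- crude[meta(crude)$Tset == "train"]
  print(length(train_docs))
}
```

Building the label vector first keeps the random draw separate from the tm call, so the split can be checked or reused before it is written into the metadata.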