I am using the R tm package, trying to split my corpus into a training set and a testing set, and to encode this into metadata for selection. What's the easiest way to do this (suppose I'm trying to split my sample in half)?
Here are some things I've tried:
- I know that when I type...
    > meta(d)
      MetaID Y
    1      0 1
    2      0 1
I see the IDs but cannot seem to access them, which I would need in order to put the first half in one set and the second half in the other. rownames(attributes(d)$DMetaData) gives me the indexes, but this looks ugly, and they're factors.
- Now, after converting to a dataframe, say d is my dataset, I just say:
    half  <- floor(nrow(d) / 2)
    train <- d[1:half, ]               # first half of the rows
    test  <- d[(half + 1):nrow(d), ]   # remaining rows
But how can I easily do something like...
meta(d, tag="split") = ifelse((meta(d,"ID")<=floor(length(d)/2)),"train","test")
...to get a result like:
    > meta(d)
        MetaID Y split
    1        0 1 train
    2        0 1 train
    ...      . .   ...
    100      0 1  test
Unfortunately, meta(d,"ID") doesn't work. meta(d[[1]],"ID") == 1 does work, but it is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a generally smarter way of subsetting and assigning to the "split" meta variable.
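To make the goal concrete, here is a minimal sketch of the kind of whole-vector labelling I'm after. It uses each document's position (via seq_along()) instead of the meta ID. The list docs and the data frame md are hypothetical stand-ins for the corpus and for the table that meta(d) prints; nothing below calls tm itself, and whether the same vector can be assigned through meta(d, tag="split") is an assumption that may depend on the tm version.

```r
# Hypothetical stand-in for a corpus: one list element per document.
docs <- as.list(paste("document", 1:100))

# Hypothetical stand-in for the table printed by meta(d).
md <- data.frame(MetaID = rep(0, length(docs)),
                 Y      = rep(1, length(docs)))

# Label by position: the first half is "train", the rest is "test".
half <- floor(length(docs) / 2)
md$split <- ifelse(seq_along(docs) <= half, "train", "test")

# Select each subset with a logical index on the label.
train <- docs[md$split == "train"]
test  <- docs[md$split == "test"]

# Assumption, not verified here: with a tm corpus d, the same vector
# might be assigned with
#   meta(d, tag = "split") <- ifelse(seq_along(d) <= half, "train", "test")
```

Because the labels come from positions, no per-document ID lookup is needed, and an odd number of documents simply puts the extra one in the test set.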