I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is
subsetrows <- which(is.na(mydata$reference))
but after that I'm stuck. I want something like
levels(mydata[subsetrows,mydata$factor])
but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
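To make the problem concrete, here is a small reproducible sketch. The toy data frame and its values are made up for illustration; only the names mydata, reference, factor and subsetrows come from the question. It shows that subsetting a factor keeps the full set of levels, which is why levels() reports all of them.

```r
# Toy data (hypothetical values): a factor with levels A..Z and a reference column
mydata <- data.frame(
  factor    = factor(c("P", "R", "Y", "A", "P"), levels = LETTERS),
  reference = c(NA, NA, NA, 1, NA)
)

# Rows where the reference column is NA
subsetrows <- which(is.na(mydata$reference))

# Subsetting a factor does not drop unused levels,
# so this still contains every level from A to Z
all_levels <- levels(mydata$factor[subsetrows])
length(all_levels)
```

In this toy case only P, R and Y occur in the selected rows, yet all_levels still holds the whole alphabet.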
levels(mydata$factor)[mydata$factor[subsetrows]] ? – Marat Talipov
With unique, this turns out to give me the right levels: unique(as.character(mydata$factor[subsetrows])) – Alium Britt
as.character.factor is a wrapper for levels(x)[x] – Marat Talipov
mydata$factor[subsetrows] returns a subset of the factor, i.e. integer codes that serve as shorthand for the strings stored in levels; each code is guaranteed to be within the length of levels(x) and to match the proper level. In fact, as I noted before, as.character(mydata$factor[subsetrows]) calls as.character.factor, whose definition is function(x) levels(x)[x]. Thus, my solution and the one proposed by Alium (which I actually like more because of its compactness) are essentially the same thing – Marat Talipov
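The equivalence discussed in these comments can be checked with a short sketch. The toy data is hypothetical, and the exact definition of as.character.factor may differ between R versions, so this only compares results and does not inspect the function body. The last line uses droplevels, the "drop unused levels" route mentioned in the question, applied to the subset without creating a separate copy of the data frame.

```r
# Toy data (hypothetical values): a factor with levels A..Z and a reference column
mydata <- data.frame(
  factor    = factor(c("P", "R", "Y", "A", "P"), levels = LETTERS),
  reference = c(NA, NA, NA, 1, NA)
)
subsetrows <- which(is.na(mydata$reference))
x <- mydata$factor[subsetrows]

# Indexing the levels by the factor uses its integer codes
by_index  <- levels(x)[x]
by_as_chr <- as.character(x)
identical(by_index, by_as_chr)

# Levels that actually occur in the subset, in order of first appearance
present <- unique(as.character(x))

# Alternative: drop unused levels from the subset; result is in level order
present_sorted <- levels(droplevels(x))
```

The two results can differ in ordering: unique() follows the order of appearance in the rows, while levels(droplevels(x)) follows the original level order.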