3
votes

I am facing a problem with a dataset which has overlapping factor levels.

I would like to produce timelines, barplots and statistics by factor level. However, I want the factor levels to be non-exclusive: observations belonging to more than one level should appear several times in a plot.

Here is an example of what my data structure looks like:

# Avoid naming the vector `head`, which would mask base::head()
col_names <- c("ID","YEAR","BRAZIL","GERMANY","US","FRANCE")
data <- data.frame(matrix(c(1,2000,1,0,0,0,
                            2,2010,0,1,1,0,
                            3,2011,0,1,0,0,
                            4,2012,1,0,0,1,
                            5,2012,0,1,0,0,
                            6,2013,0,0,0,1),
                         nrow=6, ncol=6, byrow=TRUE))
names(data) <- col_names

Obviously, a factor variable "COUNTRY" cannot be created the usual way, because that would force the factor levels to be mutually exclusive (in our case there would be 4 levels: Brazil, Germany, US and France):

data$COUNTRY[data$BRAZIL==1 & 
             data$GERMANY==0 & 
             data$US==0 & 
             data$FRANCE==0]  <- "Brazil"
data$COUNTRY[data$BRAZIL==0 & 
             data$GERMANY==1 & 
             data$US==0 & 
             data$FRANCE==0]  <- "Germany"

etc...

factor(data$COUNTRY)

But this is not what I want...


My problem is that plotting by factor only works if the factor levels are mutually exclusive. I would like to produce something like this:

require(ggplot2)
MYPLOT <- qplot(data$YEAR, data$COUNTRY)
MYPLOT + geom_point(aes(size=..count..), stat="bin") + scale_size(range=c(0, 15)) 

with observations belonging to i factor levels appearing i times in the plot.

  • How should I transform my data.frame in order to get what I desire?
  • Should I simply duplicate those observations belonging to i factor levels i times? If yes, how should I do that?
  • Is there a workaround that does not require duplicating cases?

Ideas anyone?

1
Use stat='identity' in your plot call. If that isn't it, please clarify as to how you're determining the COUNTRY column as it is completely unclear to me. - Justin
I edited my question. It should be clear now what I meant with a possible factor variable COUNTRY. stat='identity' is not what I want, I want stat='bin'. - gosz
for what it's worth, the lattice package has a concept/object type called shingles, which are explicitly designed for implementing overlapping classes (this is one of the cool features of lattice that didn't make it into ggplot): stackoverflow.com/questions/5199128/… - Ben Bolker
Hm, shingle looks interesting, but it seems it would only accept numeric variables (or at least ordinal scale). However, my `COUNTRY` variable can only be measured on a nominal scale and is therefore not apt for shingle. - gosz

1 Answer

1
votes

I think you have to duplicate those rows so that each observation appears once for every country it belongs to, and then remove the rows where the indicator is 0.

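library(reshape2)
library(ggplot2)

# Long format: one row per ID/YEAR/country combination
d2 <- melt(data, id.vars = c("ID", "YEAR"))
# Keep only the countries each observation actually belongs to
d3 <- d2[d2$value != 0, ]
qplot(d3$YEAR, d3$variable)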
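
In case it helps, here is a rough base-R sketch of the same duplication idea that does not need reshape2. It also checks that an observation belonging to i countries really ends up in i rows. The names `country_cols`, `long` and `per_country` are made up for illustration, and the plot at the end assumes a ggplot2 version that provides `geom_count()` (2.0.0 or later, as far as I know), so treat it as one possible approach rather than the definitive one.

```r
# Example data from the question, written out column by column.
data <- data.frame(
  ID      = 1:6,
  YEAR    = c(2000, 2010, 2011, 2012, 2012, 2013),
  BRAZIL  = c(1, 0, 0, 1, 0, 0),
  GERMANY = c(0, 1, 1, 0, 1, 0),
  US      = c(0, 1, 0, 0, 0, 0),
  FRANCE  = c(0, 0, 0, 1, 0, 1)
)

# Hypothetical helper vector: the 0/1 indicator columns.
country_cols <- c("BRAZIL", "GERMANY", "US", "FRANCE")

# For each country, keep the rows flagged with 1 and stack the pieces.
# An observation flagged for i countries therefore appears i times.
long <- do.call(rbind, lapply(country_cols, function(cc) {
  hit <- data[[cc]] == 1
  data.frame(ID      = data$ID[hit],
             YEAR    = data$YEAR[hit],
             COUNTRY = rep(cc, sum(hit)))
}))
long$COUNTRY <- factor(long$COUNTRY, levels = country_cols)

# Simple per-country counts do not need the duplication at all:
# summing the indicator columns gives the same numbers.
per_country <- colSums(data[country_cols])

# Optional plot: point size is the number of rows per YEAR/COUNTRY cell.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = YEAR, y = COUNTRY)) +
    ggplot2::geom_count() +
    ggplot2::scale_size_area(max_size = 15)
  print(p)
}
```

The `colSums()` line partly addresses the question about avoiding duplication: plain counts per country can be read straight off the indicator columns, whereas anything that plots or groups by a single COUNTRY variable needs the long format.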