
I am trying to visualize the distribution of a dataset and its split into training and test data to check whether the split is stratified.

The minimal example uses the iris dataset, whose Species column is a factor with 3 levels. The following snippet produces a nice plot of the count for each label. However, I would like to see the percentage (probability) of each label within its respective set, so that I can compare the distributions of the training and test sets.

library("tidyverse")
data(iris)
n <- nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = FALSE)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]

iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"

ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")

Iris dataset train and test sample counts

I tried calculating the percentage as shown below, but this does not work: it shows each bar as a percentage of the whole data frame, so the distribution looks just like the counts.

geom_bar(aes(y = (..count..)/sum(..count..)))

How can I plot the percentage of each label within each set efficiently?

Bonus: Including the whole dataset, train and test.

1 Answer

library("tidyverse")
data(iris)
n <- nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = FALSE)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]

iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"

You need a separate data frame for the labels. Here pct is the share of each species' rows that fall in the test set:

df_labs <- 
 iris %>% 
 group_by(Species) %>% 
 count(Set) %>% 
 mutate(pct = n / sum(n)) %>% 
 filter(Set == "Test")

Use that data frame as the data for the label geom (or a text geom):

ggplot(iris, aes(x = Species, fill = Set)) + 
  geom_bar(position = "dodge") +
  geom_label(data = df_labs, aes(label = scales::percent(pct), y = n / 2))
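Note that the labels above show the share of each species that landed in the test set. If what you want is the share of each species within each set (which is what a stratification check needs), one possible approach is to summarise first and plot the proportions directly. The following is only a sketch under a few assumptions: the names iris_sets and df_pct are made up for illustration, the seed is arbitrary, and the "All" group is added to cover the bonus part of the question.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # assumption: arbitrary seed, only to make the split reproducible
data(iris)    # reload iris so it has no leftover Set column
n <- nrow(iris)
idxTrain <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)

# Stack the full data and both splits; the argument names become the Set column
iris_sets <- bind_rows(
  All   = iris,
  Train = iris[idxTrain, ],
  Test  = iris[-idxTrain, ],
  .id = "Set"
)

# Share of each species within its own set (shares sum to 1 per set)
df_pct <- iris_sets %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# Plot the precomputed shares, one dodged bar per set
ggplot(df_pct, aes(x = Species, y = pct, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share within set")
```

With a well-stratified split, the three bars for each species should be of similar height. Summarising before plotting also keeps the percentages available in df_pct if you want to print or test them.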
