I am trying to visualize the distribution of a dataset and its split into training and test data to check whether the split is stratified.
The minimal example uses the iris dataset, which has a Species column that is a factor with 3 levels. The following code snippet shows a nice plot with the count for each label. However, I would like to see the percentage/probability of each label within its respective set, so that I can compare the distributions of the training and test sets.
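library("tidyverse")
data(iris)
n <- nrow(iris)
# Randomly pick 70% of the rows for the training set
idxTrain <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train <- iris[idxTrain, ]
test <- iris[-idxTrain, ]
# Label every row with the set it belongs to
iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"
# Dodged bars: one bar per set for each species, showing raw counts
ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")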
I tried calculating the percentage as shown below. However, this does not work, because it computes each bar's share of the whole data frame, so the resulting plot looks just like the counts.
geom_bar(aes(y = (..count..)/sum(..count..)))
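To make the difference concrete, here is a small base-R sketch of the numbers I am after. The seed and the names `counts`, `overall` and `withinSet` are only for illustration. Dividing by the grand total gives each cell's share of the whole data frame, whereas dividing by the row totals gives the share within each set:

```r
data(iris)
set.seed(42)  # arbitrary seed, only so the illustration is reproducible
n <- nrow(iris)
idxTrain <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)
iris$Set <- "Train"
iris$Set[-idxTrain] <- "Test"

# Contingency table: rows are sets, columns are species
counts <- table(iris$Set, iris$Species)

# Share of the whole data frame (all cells together sum to 1);
# this corresponds to count / sum(count)
overall <- prop.table(counts)

# Share within each set (each row sums to 1); this is what I want to plot
withinSet <- prop.table(counts, margin = 1)

print(round(overall, 3))
print(round(withinSet, 3))
```

In `overall` the Train row is always larger simply because the training set has more rows, while in `withinSet` the two rows are directly comparable.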
How can I plot the percentage of each label within each set efficiently?
Bonus: include the whole dataset alongside the training and test sets.
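For what it is worth, one possible direction is to compute the proportions before plotting instead of inside `geom_bar`. The following is only a rough sketch under assumptions, not necessarily the most efficient way: it assumes dplyr and ggplot2 (both part of the tidyverse) plus the scales package, and the names `combined`, `props` and `prop`, the "All" label and the seed are made up for illustration.

```r
library(dplyr)
library(ggplot2)

data(iris)
set.seed(42)  # arbitrary seed, only so the illustration is reproducible
n <- nrow(iris)
idxTrain <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)
iris$Set <- "Train"
iris$Set[-idxTrain] <- "Test"

# Bonus: stack a copy of the full data labelled "All" under the split data
combined <- bind_rows(iris, mutate(iris, Set = "All"))

# Count rows per Set/Species, then normalise within each Set
props <- combined %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# geom_col plots the precomputed proportions as they are
p <- ggplot(props, aes(x = Species, y = prop, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
print(p)
```

Because the normalisation happens per `Set`, the bars of each set sum to 100%, so a stratified split should show bars of roughly equal height within each species.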